Is there a simple way in R to extract only the text elements of an HTML page?

I think this is known as 'screen scraping', but I have no experience of it. I just need a simple way of extracting the text you'd normally see in a browser when visiting a URL.

Duplicate: stackoverflow.com/questions/1844829/…
– Shane
Jul 7 '10 at 14:09

@Shane -- The answer given on that page doesn't seem to work (at least not anymore, though I'm sure it did at the time).
– JoshuaCrove
Jul 7 '10 at 15:28

Then we should fix it, not start a new one. Or else ask a question directly related to how that old answer no longer works.
– Shane
Jul 7 '10 at 15:45

@Shane: I didn't see that original question when I posted mine. I notice you are the person who answered that question; please know I meant no disrespect. All help is appreciated, of course. I think the answer below by Tony is better for what I would like to do. I am new to Stack Overflow and still getting the hang of it. :)
– JoshuaCrove
Jul 7 '10 at 15:54

No worries. Tony's answer is great. I just want to be sure that, as you learn SO, searching before posting becomes part of the routine. And in retrospect, these questions are a little different... :)
– Shane
Jul 7 '10 at 15:57

4 Answers

I had to do this once upon a time myself.

One way of doing it is to make use of XPath expressions. You will need these packages installed from the repository at http://www.omegahat.org/

library(RCurl)
library(RTidyHTML)
library(XML)

We use RCurl to connect to the website of interest. It is an R interface to the libcurl library, and I think it's fair to say its many options let you access websites that the default functions in base R would have difficulty with.

We use RTidyHTML to clean up malformed HTML web pages so that they are easier to parse. It is an R-interface to the libtidy library.

We use XML to parse the HTML code with our XPath expressions. It is an R-interface to the libxml2 library.

Anyways, here's what you do (minimal code, but options are available, see help pages of corresponding functions):

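u <- "http://stackoverflow.com/questions/tagged?tagnames=r"
doc.raw <- getURL(u)        # download the raw HTML
doc <- tidyHTML(doc.raw)    # repair malformed markup
# parse into an internal document that XPath can query
html <- htmlTreeParse(doc, useInternalNodes = TRUE)
# keep text nodes under <body>, skipping script, style and noscript content
txt <- xpathApply(html, "//body//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)]", xmlValue)
cat(unlist(txt))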

There may be some problems with this approach, but I can't remember what they are off the top of my head. I don't think my XPath expression works with all web pages: sometimes it might not filter out script code, and with some pages it may just not work at all. Best to experiment!
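To experiment with the XPath filter without network access or RTidyHTML, something like the following may help. It is only a sketch: the inline page is a made-up example, and it uses htmlParse() with asText = TRUE from the XML package in place of the download and tidy steps, so it assumes the HTML is already reasonably well formed.

```r
library(XML)

# Hypothetical inline page, used so the sketch runs offline.
page <- paste0(
  "<html><head><style>p { color: red; }</style></head><body>",
  "<h1>Title</h1><p>Visible paragraph.</p>",
  "<script>var hidden = 1;</script>",
  "<noscript>Enable JavaScript</noscript>",
  "</body></html>"
)

# asText = TRUE tells htmlParse the argument is HTML content, not a file name.
doc <- htmlParse(page, asText = TRUE)

# Same XPath filter as in the answer: text nodes under <body> that have
# no script, style or noscript ancestor.
xp <- "//body//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)]"
nodes <- xpathSApply(doc, xp, xmlValue)

# Drop surrounding whitespace and empty strings before printing.
visible <- trimws(unlist(nodes))
visible <- visible[nzchar(visible)]
print(visible)
```

Adding elements to the inline page is a quick way to check which kinds of content the filter lets through.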

P.S. Another way, which I think works almost perfectly for scraping all the text from HTML, is the following (basically getting Internet Explorer to do the conversion for you):

library(RDCOMClient)
u <- "http://stackoverflow.com/questions/tagged?tagnames=r"
ie <- COMCreate("InternetExplorer.Application")
ie$Navigate(u)
txt <- list()
txt[[u]] <- ie[["document"]][["body"]][["innerText"]]
ie$Quit()
print(txt)

HOWEVER, I've never liked doing this. Not only is it slow, but if you vectorise it over a vector of URLs and Internet Explorer crashes on a bad page, R might hang or crash too (I don't think ?try helps much in this case). It is also prone to allowing pop-ups. It's been a while since I've done this, but I thought I should point it out.
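For the case where a bad page raises an ordinary R error, a wrapper along these lines can keep a loop over URLs going. This is a sketch only: safe_extract() and toy_extract() are hypothetical names, not part of any package, and tryCatch() catches R-level errors only, so it would not rescue a session that hangs or crashes because the external browser did.

```r
# Apply an extractor to each URL, turning R-level errors into NA.
# `extract_fun` is a hypothetical stand-in for whichever extractor you use.
safe_extract <- function(urls, extract_fun) {
  out <- lapply(urls, function(u) {
    tryCatch(extract_fun(u), error = function(e) NA_character_)
  })
  names(out) <- urls
  out
}

# Toy extractor that fails on one input, standing in for a bad page.
toy_extract <- function(u) {
  if (grepl("bad", u)) stop("could not read page")
  paste("text of", u)
}

res <- safe_extract(
  c("http://example.com/good", "http://example.com/bad"),
  toy_extract
)
print(res)
```

Returning NA for failures keeps the result aligned with the input vector, so the failed URLs can be picked out and retried afterwards.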

stackoverflow.com/questions/31423931/…
– Arun Raja
Jul 15 '15 at 8:21

Great answer, though I'm having problems installing RTidyHTML; I've tried install.packages('http://www.omegahat.net/RTidyHTML/RTidyHTML_0.2-1.tar.gz', repos=NULL) and install_github('omegahat/RTidyHTML') but compilation fails on Windows 10.
– ms609
May 16 at 11:33

Well, it's not exactly an R way of doing it, but it's as simple as they come: the OutWit plugin for Firefox. The basic version is free and helps to extract tables and the like.

Ah, and if you really want to do it the hard way in R, this link is for you:

I've had good luck with the readHTMLTable() function of the XML package. It returns a list of all tables on the page.

> library(XML)
> url <- 'http://en.wikipedia.org/wiki/World_population'
> allTables <- readHTMLTable(url)

There can be many tables on each page.

> length(allTables)
[1] 17

So just select the one you want.

> tbl <- allTables[[3]]

The biggest hassle can be installing the XML package. It's big, and it needs the libxml2 library (and, under Linux, it also needs the xml2-config script, which on Debian ships in the libxml2-dev package). The second biggest hassle is that HTML tables often contain junk you don't want alongside the data you do want.
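As a rough illustration of the clean-up step, here is a sketch that reads a table from an inline HTML string and strips two kinds of junk. The table contents are invented, and the assumption that the junk consists of thousands separators and bracketed footnote markers is mine; real pages will need their own rules.

```r
library(XML)

# Hypothetical table with made-up values, thousands separators and
# bracketed footnote markers.
page <- paste0(
  "<html><body><table>",
  "<tr><th>Item</th><th>Count</th></tr>",
  "<tr><td>A</td><td>1,234[1]</td></tr>",
  "<tr><td>B</td><td>5,678[note 2]</td></tr>",
  "</table></body></html>"
)

doc <- htmlParse(page, asText = TRUE)

# readHTMLTable returns a list of tables; take the first one and
# treat its first row as the header.
tbl <- readHTMLTable(doc, header = TRUE)[[1]]

# Work on a character copy so this behaves the same whether the
# column arrived as character or factor.
raw <- as.character(tbl[[2]])
clean <- gsub("\\[[^]]*\\]", "", raw)        # drop [1], [note 2], ...
clean <- gsub(",", "", clean, fixed = TRUE)  # drop thousands separators
tbl[[2]] <- as.numeric(clean)
print(tbl)
```

The column is addressed by position rather than by name, so the cleaning step does not depend on how the header row was read.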

The best solution is the htm2txt package.

library(htm2txt)
url <- 'https://en.wikipedia.org/wiki/Alan_Turing'
text <- gettxt(url)

For details, see https://CRAN.R-project.org/package=htm2txt.
