Tuesday, 11 June 2013

ScraperWiki, easy Web data scraping

There's gold in them thar' Webs! Yes indeed, out there amongst the LOL Cats and dancing babies are thousands of Web sites with useful data just waiting to be exploited; but how to actually get it?

Well, if you were around in the 1970s and '80s, one of the key tools for integrating legacy systems with those new-fangled PCs was "screen scraping."

The idea was fairly simple: a legacy system such as an IBM mainframe could be accessed through a terminal emulator, and the displayed data could be captured and repurposed. This worked like a charm, although the robustness of the scraping code determined whether display problems, display errors, or minor changes to the mainframe's screen layout would break the system.

Be that as it may, screen scraping is still used extensively today on the Web because HTML content is a fairly easily decoded interface. If you want to do something like extract a list of all the companies currently listed on the Nasdaq, what you'll usually do is write a script. But you're highly unlikely to be the first person to try this, and even if your script is, in fact, the best ever, why not share it with others?
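To make the idea concrete, here is a minimal sketch of what such a scraping script does, using only Python's standard library. The HTML snippet, the table structure, and the `class="company"` attribute are hypothetical stand-ins for a real listings page, which would first be fetched over HTTP:

```python
from html.parser import HTMLParser

# Hypothetical sample of the kind of HTML a listings page might serve.
SAMPLE_HTML = """
<table id="listings">
  <tr><td class="company">Acme Corp</td><td>ACME</td></tr>
  <tr><td class="company">Globex Inc</td><td>GLBX</td></tr>
</table>
"""

class CompanyScraper(HTMLParser):
    """Collect the text of every <td class="company"> cell."""
    def __init__(self):
        super().__init__()
        self.in_company_cell = False
        self.companies = []

    def handle_starttag(self, tag, attrs):
        # Flag when we enter a company-name cell.
        if tag == "td" and ("class", "company") in attrs:
            self.in_company_cell = True

    def handle_data(self, data):
        if self.in_company_cell:
            self.companies.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_company_cell = False

scraper = CompanyScraper()
scraper.feed(SAMPLE_HTML)
print(scraper.companies)  # ['Acme Corp', 'Globex Inc']
```

Real-world scrapers usually reach for a tolerant parser library instead of hand-rolling this, precisely because, as with mainframe screen scraping, small markup changes can break a brittle script.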

The answer is, of course, that there's a Web app for this: ScraperWiki is a free Web service that makes the task of Web scraping really easy.

ScraperWiki is a scripting platform that supports Ruby, PHP and Python, includes an online editor, and provides a public repository of scripts. In fact, on ScraperWiki all scripts are public, and the service's license requires that all scripts be licensed under the GNU General Public License.

All scripts run in a security "sandbox," and if a script breaks in any of the usual ways (throws a fatal error, gets into an infinite loop, etc.), the failure is detected, the script is stopped, and its creator is automatically notified.

ScraperWiki is intended only for scraping publicly available data sources, so extracting data from, say, The New York Times would violate both ScraperWiki's and the NYT's terms and conditions.

Once you've developed and tested your script, you can set its run frequency (the default is once per day); after each run, the results are stored in ScraperWiki's database. The output can be downloaded as CSV files or SQLite3 databases (depending on the script design), or retrieved as JSON or JSONDICT via the ScraperWiki API.
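The "scrape once, export many ways" pattern above can be sketched with the standard library alone: rows land in a SQLite table, then come back out as CSV or JSON. The table and column names here are made up for illustration; they are not ScraperWiki's actual schema:

```python
import csv
import io
import json
import sqlite3

# Illustrative scraped data stored in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE companies (name TEXT, symbol TEXT)")
conn.executemany("INSERT INTO companies VALUES (?, ?)",
                 [("Acme Corp", "ACME"), ("Globex Inc", "GLBX")])

rows = conn.execute("SELECT name, symbol FROM companies").fetchall()

# CSV export of the same rows.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["name", "symbol"])
writer.writerows(rows)
csv_text = buf.getvalue()

# JSON export of the same rows.
json_text = json.dumps([{"name": n, "symbol": s} for n, s in rows])

print(csv_text)
print(json_text)
```

The point is that once the data is in a real database rather than raw HTML, every output format is just a different serialization of the same query.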

The ScraperWiki API is very sophisticated, and there are extensive tutorials and documentation for developing scrapers in each of the supported languages, along with specialized built-in ScraperWiki functions (you can, for example, make SQL calls to the service's SQLite database to retrieve data and to save and extract metadata).
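The save-with-unique-keys idea behind that SQLite layer can be mimicked with the standard library: an `INSERT OR REPLACE` keyed on a unique column acts as an upsert, so re-running a scheduled scraper updates rows instead of duplicating them. The table and column names below are illustrative only, not ScraperWiki's real functions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A primary key on the symbol column makes it the "unique key" for upserts.
conn.execute("CREATE TABLE swdata (symbol TEXT PRIMARY KEY, name TEXT)")

def save(row):
    """Insert the row, or replace an existing row with the same symbol."""
    conn.execute(
        "INSERT OR REPLACE INTO swdata (symbol, name) VALUES (?, ?)",
        (row["symbol"], row["name"]),
    )

save({"symbol": "ACME", "name": "Acme Corp"})
save({"symbol": "ACME", "name": "Acme Corporation"})  # re-run: updates, no duplicate

rows = conn.execute("SELECT symbol, name FROM swdata").fetchall()
print(rows)  # [('ACME', 'Acme Corporation')]
```

This is why a scraper that runs once per day doesn't balloon its dataset: each run overwrites stale rows keyed on the unique column.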

The service was designed and is run by people in the UK, so the geocoding functions don't work in the US. If you feel compelled to fix this, the project is free, open source software licensed under the GNU Affero General Public License, so go for it! Give us a U.S. version.

ScraperWiki is a great idea and its implementation with the built-in editor and script scheduling is fantastic. Definitely worth adding to your data gathering toolset.



Source: http://www.networkworld.com/newsletters/web/2011/032811web1.html
