Randoom a Michael Friis production

Posts Tagged Scraping

Roskilde Festival 2010 Schedule as XML

@mortenjust and @claus have created the excellent Roskilde Festival Pocket Schedule Generator. They gave me access to their schedule data, and I’ve used that to scrape more tidbits from the Roskilde Festival website. Fields are:

Name (all caps)
Stage where band plays
Time of performance (in UNIX and regular datetime)
Roskilde Festival website URL
Countrycode
Myspace URL
Band website URL
Picture URL
Video-embed-html
Tag-line

Get it [...]


Screen scraping flight data from Amadeus checkmytrip.com

checkmytrip.com let’s you input an airplane flight booking reference and your surname in return for a flight itinerary. This is useful for building all sorts of services to travellers. Unfortunately Amadeus doesn’t have an API, nor are their url’s restful. Using Python, mechanize, htm5lib and BeautifulSoup, you can get at the data pretty easy though. [...]


Exchange Rate data

As part of our ongoing efforts at making sense of the Tenders Electronic Daily procurement contracts, I had to get hold of historical exchange rates to convert the values of all the contracts into a comparable form. Professor Werner Antweiler at The University of British Columbia maintains a very impressive, free database of exactly this [...]


Downloading the EU

… or parts of it anyway.
The European Union is generally up to lots of weird and wonderful things, one of the more esoteric being the “Tenders Electronic Daily” (TED) database. Basically, all public procurement above a certain value has to go through this database so that companies from all over the world have a fair [...]


.Net/Firefox Screen Scraping

Need to scrape a website? I have two links for you:
Solvent from the MIT SIMILI project. In combination with Piggy Bank it’s a scraper on it’s own, but I only use for its superb XPath generator. Just activate the sprayer and click on an element you want and Solvent will generate an intelligent XPath expression [...]