Randoom a Michael Friis production

Posts Tagged Scraping

Danish state budget data

A couple of weeks ago, Peter Brodersen asked me whether I had made a tree-map visualization of the 2013 Danish state budget. Here it is. It’s on Many Eyes and requires Java (sorry). You can zoom in on individual spending areas by right-clicking on them. About the data: I started scraping and analyzing budget data at [...]


Tax records for Danish companies

This week, the Danish tax authorities published an interface that lets you browse information on how much tax companies registered in Denmark are paying. I’ve written a scraper that has fetched all the records. I’ve published all 243,711 records as a Google Fusion Table that will let you explore and download the data. If you use this data [...]


Posted
8 May 2012 @ 9pm

Tagged
Scraping

Screen scraping with WatiN

This post describes how to use WatiN to screen scrape web sites that don’t want to be scraped. WatiN is generally used to instrument browsers to perform integration testing of web applications, but it works great for scraping too. Screen scraping websites can range in difficulty from very easy to extremely hard. When encountering hard-to-scrape sites, the typical cause [...]


Raw updated data on Danish business leader groups

Last summer, I published data on the members of Danish business leader groups, obtained with code written while I was still at Ekstra Bladet. I’ve cleaned up the code and removed the parts that fetched celebrities from various other obscure sources. You can fork the project on Github. The code is fairly straightforward. The scraper [...]


Members of Danish VL Groups

Denmark has a semi-formalised system of VL-groups. “VL” is short for “Virksomhedsleder” which translates to “business leader”. The groups select their own members, and the whole thing is organised by the Danish Society for Business Leadership. The groups are not composed only of business people — top civil servants and politicians are also members. The [...]


Roskilde Festival 2010 Schedule as XML

@mortenjust and @claus have created the excellent Roskilde Festival Pocket Schedule Generator. They gave me access to their schedule data, and I’ve used that to scrape more tidbits from the Roskilde Festival website. Fields are: Name (all caps); Stage where band plays; Time of performance (in UNIX and regular datetime); Roskilde Festival website URL; Countrycode [...]


Screen scraping flight data from Amadeus checkmytrip.com

checkmytrip.com lets you input an airplane flight booking reference and your surname in return for a flight itinerary. This is useful for building all sorts of services for travellers. Unfortunately Amadeus doesn’t have an API, nor are their URLs RESTful. Using Python, mechanize, html5lib and BeautifulSoup, you can get at the data pretty easily though. [...]


Exchange Rate data

As part of our ongoing efforts at making sense of the Tenders Electronic Daily procurement contracts, I had to get hold of historical exchange rates to convert the values of all the contracts into a comparable form. Professor Werner Antweiler at The University of British Columbia maintains a very impressive, free database of exactly this [...]


Downloading the EU

… or parts of it anyway. The European Union is generally up to lots of weird and wonderful things, one of the more esoteric being the “Tenders Electronic Daily” (TED) database. Basically, all public procurement above a certain value has to go through this database so that companies from all over the world have a [...]


.Net/Firefox Screen Scraping

Need to scrape a website? I have two links for you: Solvent from the MIT SIMILE project. In combination with Piggy Bank it’s a scraper on its own, but I only use it for its superb XPath generator. Just activate the sprayer and click on an element you want and Solvent will generate an intelligent XPath [...]