Randoom a Michael Friis production

Posts Tagged Scraping

Full 2012 Danish company taxes

Two weeks ago I put out a preliminary release of the 2012 taxes. The full dataset with 245,836 companies is now available in this Google Fusion Table. I haven’t done any of the analysis I did last year. Other than a list of top payers, I’ll leave the rest up to you guys. One area of […]


Preliminary 2012 Danish Company Tax Records

The Danish tax authority released the 2012 company tax records yesterday. I’ve scraped a preliminary data set by just getting all the CVR-ids from last year’s scrape. This leaves out any new companies that cropped up since then, but there are still 228,976 in the set, including subsidiaries. The rest are being scraped as I write […]


Danish state budget data

A couple of weeks ago, Peter Brodersen asked me whether I had made a tree-map visualization of the 2013 Danish state budget. Here it is. It’s on Many Eyes and requires Java (sorry). You can zoom in on individual spending areas by right-clicking on them. About the data: I started scraping and analyzing budget data at […]


Tax records for Danish companies

This week, the Danish tax-authorities published an interface that lets you browse information on how much tax companies registered in Denmark are paying. I’ve written a scraper that has fetched all the records. I’ve published all 243,711 records as a Google Fusion Table that will let you explore and download the data. If you use this data […]


Posted
8 May 2012 @ 9pm

Tagged
Scraping

Screen scraping with WatiN

This post describes how to use WatiN to screen scrape web sites that don’t want to be scraped. WatiN is generally used to instrument browsers to perform integration testing of web applications, but it works great for scraping too. Screen scraping websites can range in difficulty from very easy to extremely hard. When encountering hard-to-scrape sites, the typical cause […]


Raw updated data on Danish business leader groups

Last summer, I published data on the members of Danish business leader groups, obtained with code written while I was still at Ekstra Bladet. I’ve cleaned up the code and removed the parts that fetched celebrities from various other obscure sources. You can fork the project on Github. The code is fairly straightforward. The scraper […]


Members of Danish VL Groups

Denmark has a semi-formalised system of VL-groups. “VL” is short for “Virksomhedsleder” which translates to “business leader”. The groups select their own members, and the whole thing is organised by the Danish Society for Business Leadership. The groups are not composed only of business people — top civil servants and politicians are also members. The […]


Roskilde Festival 2010 Schedule as XML

@mortenjust and @claus have created the excellent Roskilde Festival Pocket Schedule Generator. They gave me access to their schedule data, and I’ve used that to scrape more tidbits from the Roskilde Festival website. Fields are: Name (all caps); Stage where band plays; Time of performance (in UNIX and regular datetime); Roskilde Festival website URL; Countrycode […]


Screen scraping flight data from Amadeus checkmytrip.com

checkmytrip.com lets you input an airplane flight booking reference and your surname in return for a flight itinerary. This is useful for building all sorts of services to travellers. Unfortunately Amadeus doesn’t have an API, nor are their URLs RESTful. Using Python, mechanize, html5lib and BeautifulSoup, you can get at the data pretty easily though. […]


Exchange Rate data

As part of our ongoing efforts at making sense of the Tenders Electronic Daily procurement contracts, I had to get hold of historical exchange rates to convert the values of all the contracts into a comparable form. Professor Werner Antweiler at The University of British Columbia maintains a very impressive, free database of exactly this […]

