2012 – Michael Friis' Blog

Tax records for Danish companies

December 26, 2012

This week, the Danish tax authorities published an interface that lets you browse information on how much tax companies registered in Denmark are paying. I’ve written a scraper that has fetched all the records. I’ve published all 243,711 records as a Google Fusion Table that will let you explore and download the data. If you use this data for analysis or reporting, please credit Michael Friis, http://friism.com/. The scraper source code is also available if you’re interested.

UPDATE 1/9-12: Niels Teglsbo has exported the data from Google Fusion Tables and created a convenient Excel spreadsheet for download.

The bigger picture

Tax records for individuals (and companies presumably) used to be public in Denmark and still are in Norway and Sweden. If you’re in Denmark, you can probably head down to your local municipality, demand the old tax book and look up how much tax your grandpa paid in 1920. The municipality of Esbjerg publishes old records online in searchable form. Here’s a record of Carpenter N. Møller paying kr. 6.00 in taxes in 1892.

The Danish business lobby complained loudly when the move to publish current tax records was announced. I agree that the release of this information by a center-left government is an example of political demagoguery and that’s yucky, but apart from that, I don’t think there are any good reasons why this information should not be public. It’s also worth noting that publicly listed companies are already required to publish financial statements and non-public ones are required to submit yearly financials to the government which then helpfully resells them to anyone interested.

It’s good that this information is now completely public: Limited liability companies and the privileges and protections offered by these are an awesome invention. In return for those privileges, it’s fair for society to demand information about how a company is being run to see how those privileges are being put to use.

The authorities announced their intention to publish tax records in the summer of 2012 and it has apparently taken them 6 months to build a very limited interface on top of their database. The interface lets you look up individual companies by id (“CVR nummer”) or name and inspect their records. You have to know the name or id of any company that you’re interested in because there’s no way to browse or explore the data. Answering a simple question such as “Which company paid the most taxes in 2011?” is impossible using the interface.

Having said that, I think it’s great whenever governments release data and I commend the Danish tax authorities for making this data available. And even with very limited interfaces like this, it’s generally possible to scrape all data and analyze it in greater detail and that is what I’ve done.

So what’s in there

The tax dataset contains information on 243,711 companies. Note that this data does not contain the names and ids of all companies operating in Denmark in 2011. Some types of corporations (I/S corporations and sole proprietorships, for example) have their profits taxed as personal income for the individuals that own them. That means they won’t show up in the data.

UPDATE 12/30-12: Magnus Bjerg pointed out that some companies are duplicated in the data. This seems to be the case at least for all (roughly 48) companies that pay tariffs for extraction of oil and gas. Here are some examples: Shell 1 and Shell 2 and Maersk 1 and Maersk 2. The numbers for these companies look very similar but are not exactly the same. The duplicated companies with different identifiers are likely due to Skat mixing up CVR ids and SE ids. Additional details on SE ids can be found here. My guess is that Skat pulled standard taxes and fossil fuel taxes from two different registries and forgot to merge and check for duplicates.

Here are the Danish companies that reported the greatest profits in 2011. These companies also paid the most taxes:

Here are the companies that booked the greatest losses:

FLSMIDTH & CO. A/S – lost kr. 1,537,929,000.00
Sund og Bælt Holding A/S – lost kr. 1,443,935,000.00
DONG ENERGY A/S – lost kr. 1,354,480,560.00
TAKEDA A/S – lost kr. 786,286,000.00
PFA HOLDING A/S – lost kr. 703,882,104.00

Here are companies that are reporting a lot of profit but paying few or no taxes:

DONG ENERGY A/S – kr. 3,148,994,114.00 profit, kr. 0 tax
TAKEDA A/S – kr. 745,424,000.00 profit, kr. 0 tax
Rockwool International A/S – kr. 284,696,514.00 profit, kr. 0 tax
COWI HOLDING A/S – kr. 177,272,657.00 profit, kr. 2,399,803.00 tax
DANAHER TAX ADMINISTRATION ApS. – kr. 155,222,377.00 profit, kr. 0 tax
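
A list like the one above comes from a simple filter over the scraped records. Here is a minimal sketch in Python (not the C# used for the scraper). The `TaxRecord` fields, the function name and the thresholds are my own assumptions for illustration, not the column names of the published table. The first two rows are taken from the list above; the third is made up to show a company that gets filtered out.

```python
from dataclasses import dataclass


@dataclass
class TaxRecord:
    name: str
    profit: float  # reported profit in kr.
    tax: float     # tax paid in kr.


def low_tax_companies(records, min_profit, max_rate):
    """Return records with at least min_profit in profit whose tax/profit
    ratio is at most max_rate, largest profit first."""
    if min_profit <= 0:
        raise ValueError("min_profit must be positive")
    hits = [
        r for r in records
        if r.profit >= min_profit and r.tax / r.profit <= max_rate
    ]
    return sorted(hits, key=lambda r: r.profit, reverse=True)


records = [
    TaxRecord("COWI HOLDING A/S", 177_272_657.00, 2_399_803.00),
    TaxRecord("DONG ENERGY A/S", 3_148_994_114.00, 0.0),
    # Hypothetical row, not from the data set: pays tax at a 25% rate.
    TaxRecord("EXAMPLE A/S", 100_000_000.00, 25_000_000.00),
]

# Thresholds are arbitrary: at least kr. 100 million profit, at most 5% tax.
for record in low_tax_companies(records, min_profit=100_000_000, max_rate=0.05):
    print(f"{record.name}: kr. {record.profit:,.2f} profit, kr. {record.tax:,.2f} tax")
```

The cut-offs are a judgment call; loosening `max_rate` pulls in more companies that pay some tax, just not much.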

Benford’s law

Benford’s law states that numbers in many real-world sources of data are much more likely to start with the digit 1 (30% of numbers) than with the digit 9 (less than 5% of numbers). Here’s the frequency distribution of first-digits of the numbers for profits, losses and taxes as reported by Danish companies plotted against the frequencies predicted by Benford:

The digit distributions perfectly match those predicted by Benford’s law. That’s great news: If Danish companies were systematically doctoring their tax returns and coming up with fake profit numbers, then those numbers would likely be more uniformly distributed and wouldn’t match Benford’s predictions. This is because crooked accountants trying to come up with random-looking numbers will tend to choose numbers starting with digits like 9 too often and numbers starting with the digit 1 too rarely.
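
For readers who want to repeat the comparison on their own copy of the data, here is a minimal Python sketch of the first-digit count and the frequencies Benford’s law predicts, log10(1 + 1/d) for leading digit d. The function names are my own, and the sample input is a synthetic stand-in (powers of two, which are known to follow Benford’s law closely), not the tax data.

```python
import math
from collections import Counter


def benford_expected(digit):
    """Share of numbers Benford's law predicts will start with this digit."""
    return math.log10(1 + 1 / digit)


def first_digit(amount):
    """Leading digit of a whole-kroner amount, ignoring the sign."""
    return int(str(abs(int(amount)))[0])


def first_digit_frequencies(amounts):
    """Observed share of each leading digit 1-9; zero amounts are skipped."""
    counts = Counter(first_digit(a) for a in amounts if int(a) != 0)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no non-zero amounts to analyse")
    return {d: counts.get(d, 0) / total for d in range(1, 10)}


# Synthetic stand-in for the profit, loss and tax columns.
sample = [2 ** n for n in range(1, 1001)]
observed = first_digit_frequencies(sample)

for d in range(1, 10):
    print(f"{d}: observed {observed[d]:.3f}, Benford {benford_expected(d):.3f}")
```

To run it on the real numbers, replace `sample` with the amounts from one column of the downloaded table.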

UPDATE 12/30-12: It’s important to stress that the fact that the tax numbers conform to Benford’s law does not imply that companies always pay the taxes they are due. It does suggest, however, that Danish companies, as a rule, do not put made-up numbers on their tax returns.

Technical details

While building the scraper, I found two ways to access tax information for a company on the tax website:

Access an individual company using the x query parameter for the CVR identifier: http://skat.dk/SKAT.aspx?oId=skattelister&x=29604274
Spoof the POST request generated by the UpdatePanel that gets updated when you hit the “søg” button

The former is the simplest approach, but the latter is preferable for a scraper because much less HTML is transferred from the server when updating the panel compared to requesting the page anew for each company.
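
The first, simpler approach can be sketched in a few lines of Python (again, not the C# of the actual scraper). This is only a sketch under assumptions: the URL is the one quoted above as it looked in 2012 and may no longer respond, the page encoding is a guess, and the helper names and the delay between requests are my own. The UpdatePanel approach is not shown because it depends on form fields that are not listed in this post.

```python
import time
import urllib.parse
import urllib.request

BASE_URL = "http://skat.dk/SKAT.aspx"  # as quoted in the post


def company_url(cvr):
    """Build the lookup URL for one CVR identifier via the x query parameter."""
    query = urllib.parse.urlencode({"oId": "skattelister", "x": cvr})
    return f"{BASE_URL}?{query}"


def fetch_company_page(cvr, timeout=30):
    """Download the full HTML page for one company (assumes UTF-8 content)."""
    with urllib.request.urlopen(company_url(cvr), timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def fetch_many(cvrs, delay_seconds=1.0, fetch=fetch_company_page):
    """Fetch pages one at a time, pausing between requests to go easy on the server."""
    pages = {}
    for cvr in cvrs:
        pages[cvr] = fetch(cvr)
        time.sleep(delay_seconds)
    return pages


print(company_url(29604274))
```

Passing the `fetch` function in as a parameter makes it easy to swap in a different transport later, or a fake one when testing the parsing code.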

To get details on a company, one has to know its identifier. Unfortunately there’s no authoritative list of CVR identifiers, although the government has promised to publish such a list in 2013. The contents of the entire Danish CVR register were leaked in 2011, so one could presumably harvest identifiers from that data. The most fool-proof method, though, is to just brute-force through all possible identifiers. CVR identifiers consist of 7 digits with an 8th checksum digit. The process of computing the checksum is documented publicly. Here’s my implementation of the checksum computation. Please let me know if you think it’s wrong:

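	// Needs "using System.Linq;" at the top of the file.
	// Weights for the seven serial digits, most significant digit first.
	private static readonly int[] digitWeights = { 2, 7, 6, 5, 4, 3, 2 };

	// Appends the modulus-11 check digit to a 7-digit serial.
	// Returns -1 if no valid identifier exists for the serial.
	public static int ToCvr(int serial)
	{
		var digits = serial.ToString().Select(x => int.Parse(x.ToString()));
		var sum = digits.Select((digit, index) => digit * digitWeights[index]).Sum();
		var modulo = sum % 11;
		if (modulo == 1)
		{
			// The check digit would be 10, which is not a single digit.
			return -1;
		}
		if (modulo == 0)
		{
			// Makes the check digit come out as 11 - 11 = 0.
			modulo = 11;
		}
		var checkDigit = 11 - modulo;
		return serial * 10 + checkDigit;
	}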

My guess is that the lowest serial (without the checksum) is 1,000,000 because that’s the lowest serial that will yield an 8-digit identifier. The largest serial is likely 9,999,999. I could be wrong though, so if you have any insights please let me know. Roughly one in eleven serials is discarded because the checksum is 10, which is invalid. That leaves about 8 million identifiers to be tried. It’s wasteful to have to submit 8 million requests to get records for a couple of hundred thousand companies, but one can hope that 8 million requests will get the government’s attention and that they’ll start publishing data more efficiently.
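
As a cross-check of the arithmetic above, here is a minimal Python sketch of the same modulus-11 rule, together with the reverse check (validating a full 8-digit identifier) and a count of discarded serials. It assumes the serial range guessed above; the function names are mine. The full range holds 9,000,000 serials, so losing one in eleven leaves roughly 8.18 million candidates. The sketch only counts a small slice of the range to keep it quick.

```python
WEIGHTS = (2, 7, 6, 5, 4, 3, 2)


def to_cvr(serial):
    """Append the check digit to a 7-digit serial, or return None if the
    check digit would be 10 (no valid identifier for that serial)."""
    if not 1_000_000 <= serial <= 9_999_999:
        raise ValueError("serial must have exactly 7 digits")
    total = sum(int(d) * w for d, w in zip(str(serial), WEIGHTS))
    remainder = total % 11
    if remainder == 1:
        return None
    check_digit = 0 if remainder == 0 else 11 - remainder
    return serial * 10 + check_digit


def is_valid_cvr(cvr):
    """Validate a full identifier: the weighted digit sum, with weight 1
    on the check digit, must be divisible by 11."""
    digits = [int(d) for d in str(cvr)]
    if len(digits) != 8:
        return False
    return sum(d * w for d, w in zip(digits, WEIGHTS + (1,))) % 11 == 0


def candidate_cvrs(first_serial, last_serial):
    """Yield every valid identifier for serials in the inclusive range."""
    for serial in range(first_serial, last_serial + 1):
        cvr = to_cvr(serial)
        if cvr is not None:
            yield cvr


# Count discards in a 10,000-serial slice of the assumed range.
discarded = sum(1 for s in range(1_000_000, 1_010_000) if to_cvr(s) is None)
print(f"discarded {discarded} of 10000 serials")
print(to_cvr(2960427), is_valid_cvr(29604274))
```

Serial 2960427 corresponds to the identifier 29604274 used in the example URL earlier in this post, which makes it a handy test case.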

Screen scraping with WatiN

May 8, 2012

This post describes how to use WatiN to screen scrape web sites that don’t want to be scraped. WatiN is generally used to instrument browsers to perform integration testing of web applications, but it works great for scraping too.

Screen scraping websites can range in difficulty from very easy to extremely hard. When encountering hard-to-scrape sites, the typical cause of difficulty is fumbling incompetence on the part of the people that built the site to be scraped. Every once in a while however, you’ll encounter a site openly displaying data to the casual browser, but with measures in place to prevent automatic scraping of that data.

The Danish Patent and Trademark Office runs one such site. The people there maintain a searchable database that lets you search and peruse Danish and international patents. Unfortunately, computers are not allowed. If one tries to issue an HTTP POST to the resource that generally performs searches and shows patents, an error is returned. If one emulates visiting the site with a real browser by providing a browser-looking User-Agent header, collecting cookies etc. (for example by using a tool like SimpleBrowser), the site sends a made-up 999 HTTP response code and the message “No Hacking”.

Faced with such an obstruction, there are two avenues of attack:

Break out Wireshark or Fiddler and spend a lot of time figuring out what it takes to fabricate requests that fool the site into thinking they originate from a normal browser and not from your bot
Instrument an actual browser so that the site will have no way (other than timing analysis and IP address request rate limiting) of knowing whether requests are from a bot or from a normal client

The second option turns out to be really easy because people have spent lots of time building tools for automatically testing web applications using full browsers, tools like WatiN. For example, successfully scraping the Danish Patent Authorities site using WatiN is as simple as this:

private static void GetPatentsInYear(int year)
{
	using (var browser = new IE("http://onlineweb.dkpto.dk/pvsonline/Patent"))
	{
		// go to the search form
		browser.Button(Find.ByName("menu")).ClickNoWait();

		// fill out search form and submit
		browser.CheckBox(Find.ByName("brugsmodel")).Click();
		browser.SelectList(Find.ByName("datotype")).Select("Patent/reg. dato");
		browser.TextField(Find.ByName("dato")).Value = string.Format("{0}*", year);
		browser.Button(Find.By("type", "submit")).ClickNoWait();
		browser.WaitForComplete();

		// go to first patent found in search result and save it
		browser.Buttons.Filter(Find.ByValue("Vis")).First().Click();
		GetPatentFromPage(browser, year);

		// hit the 'next' button until it's no longer there
		while (GetNextPatentButton(browser).Exists)
		{
			GetNextPatentButton(browser).Click();
			GetPatentFromPage(browser, year);
		}
	}
}

private static Button GetNextPatentButton(IE browser)
{
	return browser.Button(button =>
		button.Value == "Næste" && button.ClassName == "knapanden");
}

Note that in this example, we’re using Internet Explorer because it’s the easiest to set up and use (WatiN also works with Firefox, but only older versions). There’s definitely room for improvement; in particular, it’d be interesting to explore parallelizing the scraper to download patents faster. The project source code (still incomplete) is available on GitHub. I’ll do a post shortly on what interesting data can be extracted from Danish patents.

Raw updated data on Danish business leader groups

March 18, 2012

Last summer, I published data on the members of Danish business leader groups, obtained with code written while I was still at Ekstra Bladet. I’ve cleaned up the code and removed the parts that fetched celebrities from various other obscure sources. You can fork the project on Github.

The code is fairly straightforward. The scraper itself is fewer than 150 lines of code. The scraper is configured to run in a background worker on AppHarbor and will conduct a scrape once a month (I don’t know how often the VL-people update their website, but monthly updates seem sufficient to keep track of comings and goings). The resulting data can be fetched using a simple JSON API. You can find a list of scraped member-batches here (there’s just one at the time of writing). Hitting http://vlgroups.apphb.com/Member will always net you the latest batch.

I was motivated to revisit the code after this week’s dethroning of Anders Eldrup from his position as CEO of Dong Energy. Anders Eldrup sits in VL-gruppe 1, the most prestigious one. Let’s see if he’s still there next time the scraper looks. 14 other Dong Energy executives are members of other groups, although interestingly, Jakob Baruël Poulsen (Eldrup’s handsomely rewarded sidekick) is nowhere to be found. I think data like this is an important piece of the puzzle in figuring out what relations exist between business leaders in Denmark, and the Anders Eldrup debacle demonstrates why keeping track is important.

Nordic Newshacker

February 26, 2012

The excellent people at the Danish newspaper Information are hosting a competition to promote data journalism. It’s called “Nordisk Nyhedshacker 2012”. Data journalism was what I spent some of my time at Ekstra Bladet doing, and the organizers have been kind enough to put me on the jury. The winner will get a scholarship to go work at The Guardian for a month, sponsored by Google. Frankly, I’d prefer working at Information, but I guess The Guardian will do. If you’re a journalist who can hack or if you’re a hacker interested in using your craft to make people more informed about the world we live in, you should use this opportunity to come up with something interesting and be recognized for it.

Hopefully, you already have awesome ideas for what to build. Should you need some inspiration, here are a few interesting pieces of data you might want to consider (projects using this data will not be judged differently than others).

Examine the US Embassy Cables released by Wikileaks. I’ve tried to filter out the ones related to Denmark.
Examine the power relationships of members of Danish business leader groups. I have extracted the membership info from their web site. It’d be extra interesting if you combine this information with data about who sits on the boards of big Danish companies, perhaps to make the beginnings of something like LittleSis so that we can keep track of what favours those in power are doing each other.
Do something interesting with the CVR database of Danish companies that was leaked on The Pirate Bay last year.
Ekstra Bladet has been kind enough to let me open source the code for the award-winning Krimikort (Crime Map) I built while working there. It’s not quite ready to be released yet, but we’re making the current data available now. There are 62,753 nuggets of geo-located and categorised crime ready for you to look at. You can download a rar file (50 MB) here. To use the data, you have to get a free copy of SQL Server Express and mount the database (Google will tell you how).

I’m afraid I won’t be able to participate in many of the activities preceding the actual competition, but I can’t wait to see what people come up with!