Michael Friis' Blog

Getting ready for 2012 Danish company taxes

This is a follow-up to last year’s “Tax records for Danish companies” post which covered how I screen-scraped and analyzed 2011 tax records for all Danish companies.

I revisited the scraper source code today because the Tax Authority has made it known[dk] that they will be releasing the 2012 data set next week. As I did last year, I want to preface the critique that is going to follow and say that it’s awesome that this information is made public, and that I hope the government will continue to publish it and work on making it more useful.

First some notes regarding the article:

  • It states that, for the first time, it’s possible to determine what part of a company’s taxes is due to profits on oil and gas extraction and what part is due to normal profits. That’s strange, since this was also evident from last year’s data. Maybe they’re trying to say (but the journalist was too dim to understand) that they have solved the problem in the 2011 data that caused oil and gas corporations to be duplicated, as evidenced by the two entries for Maersk: A.P.Møller – Mærsk A/S/ Oil & Gas Activity and A.P. MØLLER – MÆRSK A/S. Note that the two entries have the same CVR identifier.
  • It’s frustrating that announcements like this (that new data is coming next week) are not communicated publicly on Twitter or the web sites of either the Tax Authority or the Ministry of Taxation. Instead, one has to stumble across the news on some random newspaper web site. Maybe it was mentioned in a newsletter I’m not subscribed to – who knows.

Anyway, these are nuisances, now on to the actual problems.

2011 data is going away

The webpage says it beautifully:

De offentliggjorte skatteoplysninger er for indkomståret 2011 og kan ses, indtil oplysningerne for 2012 bliver offentliggjort i slutningen af 2013.

Translated:

The published tax information is for the year 2011 and is available until the 2012 information is published at the end of 2013.

Removing all the 2011 data to make room for the 2012 stuff is very wrong. First off, it’s a defect that historical information is not available. Of course, I scraped the information and put it in a Fusion Table for posterity (or at least for as long as Google maintains that product). Even then, it’s wrong of the tax authority to not also publish and maintain historical records.

Second, I suspect that the new 2012 data will be published using the same URI scheme as the 2011 data, i.e.: http://skat.dk/SKAT.aspx?oId=skattelister&x={cvr-id}. So when the new data goes live some time next week, a URI that pointed to the 2011 tax records of the company FORLAGET SOHN ApS will all of a sudden point to the 2012 tax records of that company. That means that all the links I included in last year’s blog post and thought would point to 2011 data in perpetuity now point to 2012 data. This is likely to confuse readers, both of my post and of the many other pages around the Internet that link to those records. The semantics of these URIs are arguably coherent if they’re defined to be “the latest tax records for company X”. This is not a very satisfying paradigm though, and it would be much better if /company-tax-records/{year}/{cvr-id} URIs were made available, or if records from all years were published at /SKAT.aspx?oId=skattelister&x={cvr-id} as they became available.

The 2011 data was changed

I discovered this randomly when dusting off last year’s code. It has a set of integration tests, and the one for Saxo Bank refused to pass. That turns out to be because the numbers reported have changed. When I first scraped the data, Saxo Bank paid kr. 25.426.135 in taxes on profits of kr. 257.969.357. The current numbers are kr. 25.142.333 in taxes on kr. 260.131.946 of profits. So it looks like the bank made a cool extra couple of million in 2011 and managed to get its tax bill bumped down a bit.
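
For the curious, the failing check looks roughly like this minimal sketch (Scraper.Fetch and the TaxRecord shape are hypothetical stand-ins for the actual scraper code, and the CVR id is a placeholder):

	using NUnit.Framework;

	[TestFixture]
	public class ScraperTests
	{
		[Test]
		public void SaxoBankNumbersHaveNotChanged()
		{
			// Placeholder id; the real test would use Saxo Bank's actual CVR number
			var cvrId = "00000000";
			// Scraper.Fetch is a hypothetical stand-in for the scraper's fetch-and-parse code
			var record = Scraper.Fetch(
				"http://skat.dk/SKAT.aspx?oId=skattelister&x=" + cvrId);
			// The values scraped when the 2011 data was first published
			Assert.AreEqual(25426135m, record.TaxesPaid);
			Assert.AreEqual(257969357m, record.Profit);
		}
	}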

Some takeaways:

  1. Even though this information is posted almost a full year after the end of 2011, the numbers are not accurate and have to be corrected. This is obviously not only the tax authority’s fault: Companies are given ample time to gather and submit records and even then, they may provide erroneous data.
  2. It’d be interesting to know what caused the correction. Did Saxo Bank not submit everything? Did the tax people miss something? Was Saxo Bank audited?
  3. It’d be nice if these revisions themselves were published in an organised fashion by the tax authorities. Given the ramshackle way they go about publishing the other data, I’m not holding my breath for this to happen.
  4. I have no idea if there are adjustments to other companies and if so, how many. I could try and re-run the scraper on all 243,711 companies to find changes before the 2012 release obliterates the 2011 data but I frankly can’t be bothered. Maybe some journalist can go ask.

That’s it! Provided the tax people don’t change the web interface, the scraper is ready for next week’s 2012 data release. I’ll set it running as soon as the 2012 numbers show up and publish the raw data when I have it.

Heroku .NET buildpack now with nginx

Another weekend, and another bunch of updates to the .NET buildpack. Most significantly, ASP.NET apps now run with fastcgi-mono-server fronted by an nginx instance in each dyno. This replaces the previous setup which used the XSP development web server. Using nginx is supposedly more production-ready.

Other changes include setting the LD_LIBRARY_PATH environment variable and priming the Mono certificate store. This will make apps that use RestSharp, or that otherwise expose Mono’s reliance on native zlib, work out of the box. It also keeps calls to HTTPS web services from Heroku from failing.

These improvements came about because people tested their .NET apps on Heroku. Please do the same so that we can weed out as many bugs as possible. If you want to contribute, feel free to check out the TODO-list in the README on GitHub.

Heroku .NET buildpack update to Mono 3.2 and more

Hot on the heels of posts on Running .NET on Heroku and Running OWIN/Katana apps on Heroku, I’m happy to announce a couple of updates to the buildpack:

  1. Now uses the newly released Mono 3.2 runtime. I need to figure out some way for users to select what version they want, the same way the Python buildpack uses runtime.txt.
  2. Adds symbolic links from NuGet.targets files to nuget.targets files to account for inconsistent casing practices in NuGet.
  3. Fetches a NuGet.targets file that is not broken when doing package restore on Linux. I’m still trying to get Microsoft to accept a pull request to NuGet that fixes this.

I’ve also spent some time trying to switch from xsp4 to nginx and fastcgi-mono-server4, but am currently stuck.

Running OWIN/Katana apps on Heroku

This post demonstrates how to get an OWIN/Katana sample app running on Heroku. It uses the Mono buildpack that I’ve already covered on the blog. As far as I can determine, Heroku is the only public platform where you can conveniently prototype and share OWIN apps running on Katana.

OWIN and Katana

OWIN is the Open Web Interface for .NET, a specification for how apps and frameworks should interact with servers that host them. It’s very much like Ruby Rack, only for .NET. By decoupling apps from web servers, it should become much easier to write portable middleware that can be reused between apps and frameworks. There’s already a ton of good OWIN stuff on NuGet.

Katana is a set of components (including a web server) built by Microsoft and others that conforms to the OWIN spec. Here’s a great overview write-up about Katana. There are other OWIN compliant servers (like Kayak) and hopefully more will show up.

OWIN, and Microsoft’s adoption of it, is pretty exciting: It will make writing good .NET web apps easier and it’s going to free .NET web developers from their historical over-reliance on the IIS web server. With IIS out of the picture, hosting apps on Mono and on open platforms like Heroku will be much easier.

Getting started

The sample demoed here is an expanded version of K. Scott Allen’s sample. Here’s how to get started:

  1. Open Visual Studio and Create a C# Console Application and enable NuGet package restore. I used Visual Studio 2012 and named the app KatanaTest.
  2. Install OWIN NuGet packages
    install-package Microsoft.Owin.Hosting -IncludePrerelease
    install-package Microsoft.Owin.Host.HttpListener -IncludePrerelease
    install-package Microsoft.Owin.Diagnostics -IncludePrerelease
    install-package Owin.Extensions -IncludePrerelease
  3. Add a class that configures the app to just display the Katana sample page.
  4. Add a Main entry point that listens for HTTP on a port specified as a command line argument and waits for ctrl+c to exit (steps 3 through 5 are sketched in code after this list).
  5. Add a Procfile that tells Heroku how to start the app, passing in the $PORT as an argument.
  6. Fix casing bug and restore command to make project build on Heroku (these will hopefully soon be fixed in NuGet). (Fixed in latest buildpack version)
  7. Deploy to Heroku:
        $ heroku create
        $ heroku config:add BUILDPACK_URL=https://github.com/friism/heroku-buildpack-mono/
        $ git push heroku master
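
To make steps 3 through 5 concrete, here’s roughly what they amount to (a sketch: the class layout and the exact Procfile line are my assumptions, see the sample repository for the real thing):

	using System;
	using System.Threading;
	using Microsoft.Owin.Hosting;
	using Owin;

	public class Startup
	{
		public void Configuration(IAppBuilder app)
		{
			// Render the Katana sample page from Microsoft.Owin.Diagnostics
			app.UseWelcomePage();
		}
	}

	public class Program
	{
		public static void Main(string[] args)
		{
			// Heroku passes the port to bind as an argument (see the Procfile)
			var url = string.Format("http://*:{0}", args[0]);
			using (WebApp.Start<Startup>(url))
			{
				Console.WriteLine("Listening on " + url);
				// Block until ctrl+c so the dyno keeps running
				var exit = new ManualResetEvent(false);
				Console.CancelKeyPress += (sender, e) => { e.Cancel = true; exit.Set(); };
				exit.WaitOne();
			}
		}
	}

The Procfile would then contain a single line along the lines of web: mono KatanaTest.exe $PORT.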

Check out the sample running live on Heroku. The sample project is available on GitHub.

Running .NET on Heroku

Since joining Heroku I’ve been wanting to get .NET running on the platform and I’m happy to report that we now have a reasonably workable Mono buildpack. My goal was to be able to take an ASP.NET MVC solution created with Visual Studio on Windows, create a Heroku app, run git push heroku master and have Heroku build, deploy and run the app using Mono and the XSP web server.

The result is based heavily on previous work by my colleague Brandur.

Getting started

To use the .NET buildpack, create an ASP.NET MVC 4 web site and enable NuGet package restore. There are then a few tweaks required to make the solution palatable to Mono and xbuild (struck-out issues have been fixed in the buildpack and are no longer necessary):

Hopefully, we can get these obstacles eliminated through improvements to either Mono, NuGet or the buildpack.

Now, deploy to Heroku:

    $ heroku create
    $ heroku config:add BUILDPACK_URL=https://github.com/friism/heroku-buildpack-mono/
    $ git push heroku master

I’ve created a few samples that are known to work. TestingMono is an extremely simple MVC site with a background worker that logs a message once every second. To run the background worker, add a Procfile that defines the command worker: mono WorkerTest.exe and scale it to 1 with heroku scale worker=1. The other sample is forked from an AppHarbor sample and demonstrates simple use of a Heroku PostgreSQL database. Note that the connection string has to be read from the environment, not Web.config as is usual in .NET. You can find the sample running here.
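
Reading the connection string looks something like this sketch (assuming the Npgsql driver and Heroku’s standard DATABASE_URL format; the names are mine, not necessarily what the sample does):

	using System;
	using Npgsql;

	public static class Database
	{
		public static NpgsqlConnection Connect()
		{
			// Heroku supplies a URL like postgres://user:password@host:5432/dbname
			var uri = new Uri(Environment.GetEnvironmentVariable("DATABASE_URL"));
			var userInfo = uri.UserInfo.Split(':');
			var connectionString = string.Format(
				"Server={0};Port={1};User Id={2};Password={3};Database={4};SSL=true",
				uri.Host, uri.Port, userInfo[0], userInfo[1], uri.AbsolutePath.Trim('/'));
			var connection = new NpgsqlConnection(connectionString);
			connection.Open();
			return connection;
		}
	}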

Overview

Here’s what works:

  • Running ASP.NET MVC 4 on top of Mono 3.0.11 and XSP 3.0.11
  • NuGet package restore so you don’t have to include library dependencies in your repo
  • Caching of build output and incremental builds, including caching of already-downloaded NuGet packages
  • Running executables in background worker dynos

Here’s what needs love:

  • Insertion of config into appSettings in Web.config
  • Make more of the default Visual Studio templates work out of the box
  • Look for XSP replacement (likely nginx)

Also see the TODO in the README. Contributions are very welcome. I hope to cover how buildpack dependencies (Mono and XSP in this case) are generated in a future blog post.

And yes, I’m working on getting Visual Basic running.

Compressed string storage with NHibernate

This blog post demonstrates how to use an IUserType to make NHibernate compress strings before storing them. It also shows how to use an AttributeConvention to configure the relevant type mapping.

By compressing strings before storing them you can save storage space and potentially speed up your app because fewer bits are moved on and off physical storage. In this example, compression is done using the extremely fast LZ4 algorithm so as to not slow data storage and retrieval.

The downside to compressing strings stored in the database is that running ad-hoc SQL queries (such as mystring like '%foo%') is not possible.

Background

I was building an app that was downloading and storing lots of HTML and for convenience I was storing the HTML in a SQL Server database. SQL Server has no good way to compress stored data so the database files grew very quickly. This bugged me because most of the content would compress well. I was using Entity Framework and started casting around for ways to hook into how EF serializes data or for a way to create a custom string type that could handle the compression. Even with the EF6 pre-releases, I couldn’t find any such hooks.

NHibernate IUserType

So I migrated to NHibernate which lets you define custom datatypes and control how they’re stored in the database by implementing the IUserType interface. The best tutorial I’ve found for implementing IUserType is this one by Jacob Andersen. You can check out my full implementation of a compressed string IUserType on GitHub. The two most interesting methods are NullSafeGet() and NullSafeSet():

	public object NullSafeGet(IDataReader rs, string[] names, object owner)
	{
		var value = rs[names[0]] as byte[];
		if (value != null)
		{
			var deCompressor = LZ4DecompressorFactory.CreateNew();
			return Encoding.UTF8.GetString(deCompressor.Decompress(value));
		}

		return null;
	}

	public void NullSafeSet(IDbCommand cmd, object value, int index)
	{
		var parameter = (DbParameter)cmd.Parameters[index];

		if (value == null)
		{
			parameter.Value = DBNull.Value;
			return;
		}

		var compressor = LZ4CompressorFactory.CreateNew();
		parameter.Value = compressor.Compress(Encoding.UTF8.GetBytes(value as string));
	}

The actual compression is done by LZ4Sharp which is a .NET implementation of the LZ4 compression algorithm. LZ4 is notable, not for compressing data a lot, but for compressing and uncompressing data extremely quickly. A single modern CPU core can LZ4-compress at up to 300 MB/s and uncompress much faster. This should minimize the overhead of compressing and uncompressing data as it enters and leaves the database.

For SqlTypes we use BinarySqlType(int.MaxValue):

	public SqlType[] SqlTypes
	{
		get { return new[] { new BinarySqlType(int.MaxValue) }; }
	}

This causes the type to be mapped to a varbinary(max) column in the database.

Mapping

To facilitate mapping, we’ll use an Attribute:

	[AttributeUsage(AttributeTargets.Property)]
	public class CompressedAttribute : Attribute
	{
	}

And an AttributeConvention for FluentNHibernate to use:

	public class CompressedAttributeConvention : AttributePropertyConvention
	{
		protected override void Apply(CompressedAttribute attribute, IPropertyInstance instance)
		{
			if (instance.Property.PropertyType != typeof(string))
			{
				throw new ArgumentException();
			}

			instance.CustomType(typeof(CompressedString));
		}
	}

Here’s how to use the convention with AutoMap:

	var autoMap = AutoMap.AssemblyOf<Document>()
		.Where(x => typeof(Entity).IsAssignableFrom(x))
		.Conventions.Add(new CompressedAttributeConvention());

The full SessionFactory is on GitHub.

With this, we get nice, clean entity classes with strings that are automatically compressed when stored:

	public class Document : Entity
	{
		[Compressed]
		public virtual string Text { get; set; }
	}

Limitations

As mentioned in the introduction you can’t do ad-hoc SQL queries because compressed strings are stored in the database as binary blobs. Querying with NHibernate is also somewhat limited. Doing document.Text == "foo" actually works because NHibernate runs “foo” through the compression. Queries that involve Contains() will (silently) not work, unfortunately. This is because NHibernate translates this to a like query, which won’t work with the compressed binary blob. I haven’t looked into hooking into the query engine to fix this.
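
To illustrate, here’s a sketch using NHibernate’s LINQ provider, assuming an open ISession called session and the Document entity from above:

	using System.Linq;
	using NHibernate.Linq;

	// Equality works: NHibernate runs the literal through the IUserType,
	// so the comparison happens between compressed byte arrays
	var exact = session.Query<Document>()
		.Where(d => d.Text == "foo")
		.ToList();

	// Contains() translates to a SQL 'like' against the compressed blob
	// and silently returns no rows
	var substring = session.Query<Document>()
		.Where(d => d.Text.Contains("foo"))
		.ToList();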

Danish state budget data

A couple of weeks ago, Peter Brodersen asked me whether I had made a tree-map visualization of the 2013 Danish state budget. Here it is. It’s on Many Eyes and requires Java (sorry). You can zoom in on individual spending areas by right-clicking on them:

[Embedded Many Eyes tree-map; the Many Eyes service no longer exists]

About the data

I started scraping and analyzing budget data at Ekstra Bladet in 2010. The goal was to find ways to help people understand how the Danish state uses its money and to let everyone rearrange and balance out the 15 billion DKK long-term deficit that was frequently cited in the run-up to the 2011 parliamentary election. We didn’t get around to this, unfortunately.

The Danish state burns through a lot of money, which is inherently interesting. The budget published online is also very detailed, which is great. Showing off the magnitude and detail in an interesting way turns out to be difficult though, and the best I’ve come up with is the Many Eyes tree-map.

To see if anyone can do a better job, I’m making all the underlying data available in a Google Fusion Table. The data is hierarchical with six levels of detail (this is also why the zoomable tree-map works fairly well). Here’s an example hierarchy, starting from the ministry using the money (the Ministry of Employment), down to what the money was used for (salaries and benefits):

Beskæftigelsesministeriet
    Arbejdsmiljø
        Arbejdsmarkedets parters arbejdsmiljøindsats
            Videncenter
                Indtægtsdækket virksomhed
                    Lønninger / personaleomkostninger.

In the Fusion table data there’s a line with an amount for each level. That means that the same money shows up six times, once for each level in the hierarchy. To generate the tree-map, one would start with lines at line-level 5 (the most detailed) and use the ParentBudgetLine to find the parent lines in the hierarchy. The C# code that accomplishes this is found here.
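
The gist of that code is something like the following sketch (class and property names are my guesses at the shape of the data; the linked code is authoritative):

	using System.Collections.Generic;
	using System.Linq;

	public class BudgetLine
	{
		public string Linecode { get; set; }
		public string ParentBudgetLine { get; set; }
		public int LineLevel { get; set; }
		public decimal CurrentYearBudget { get; set; }
	}

	public static IEnumerable<List<BudgetLine>> LeafPaths(IEnumerable<BudgetLine> lines)
	{
		var byCode = lines.ToDictionary(x => x.Linecode);
		// Start from the most detailed lines and walk up the hierarchy
		foreach (var leaf in byCode.Values.Where(x => x.LineLevel == 5))
		{
			var path = new List<BudgetLine> { leaf };
			var current = leaf;
			BudgetLine parent;
			while (current.ParentBudgetLine != null
				&& byCode.TryGetValue(current.ParentBudgetLine, out parent))
			{
				path.Add(parent);
				current = parent;
			}
			path.Reverse(); // ministry first, most detailed line last
			yield return path;
		}
	}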

The Fusion table contains data for budgets from 2003 to 2013. The “Year” column is the budget year that this line belongs to. “Linecode” is the code used in the budget. “CurrentYearBudget” is the budgeted amount for the year that this particular budget was published (i.e. the projected spend in 2013 for the 2013 state budget). Year[1-3]Budget are the projected spends for the coming three years (i.e. 2014-2016 for the 2013 budget). PreviousYear[1-2]Budget are the spends actually incurred for the previous two years (i.e. 2011 and 2012 for the 2013 budget).

We have data for multiple years and comparing projected numbers in previous years with actual numbers in later years might yield interesting examples of departments going over budget and other irregularities.

Since we have data for multiple years, we can also visualize changes in spending for individual ministries over time. This turns out to be slightly less interesting than one might suspect because changing governments have a tendency to rename, create or close down ministries fairly often. Here’s a time-graph example:

[Embedded Many Eyes time-graph; the Many Eyes service no longer exists]

The source code that parses the budget and outputs it in various ways can be found on GitHub. The code was written on Ekstra Bladet’s dime.

Dedication: This blog post is dedicated to Aaron Swartz. Aaron committed suicide sometime around January 11th, 2013. He had many cares and labors, and one of them was making data more generally available to the public.

Tax records for Danish companies

This week, the Danish tax-authorities published an interface that lets you browse information on how much tax companies registered in Denmark are paying. I’ve written a scraper that has fetched all the records. I’ve published all 243,711 records as a Google Fusion Table that will let you explore and download the data. If you use this data for analysis or reporting, please credit Michael Friis, http://friism.com/. The scraper source code is also available if you’re interested.

UPDATE 1/9-12: Niels Teglsbo has exported the data from Google Fusion tables and created a convenient Excel Spreadsheet for download.

The bigger picture

Tax records for individuals (and companies presumably) used to be public in Denmark and still are in Norway and Sweden. If you’re in Denmark, you can probably head down to your local municipality, demand the old tax book and look up how much tax your grandpa paid in 1920. The municipality of Esbjerg publishes old records online in searchable form. Here’s a record of Carpenter N. Møller paying kr. 6.00 in taxes in 1892.

The Danish business lobby complained loudly when the move to publish current tax records was announced. I agree that the release of this information by a center-left government is an example of political demagoguery and that’s yucky, but apart from that, I don’t think there are any good reasons why this information should not be public. It’s also worth noting that publicly listed companies are already required to publish financial statements and non-public ones are required to submit yearly financials to the government which then helpfully resells them to anyone interested.

It’s good that this information is now completely public: Limited liability companies and the privileges and protections offered by these are an awesome invention. In return for those privileges, it’s fair for society to demand information about how a company is being run to see how those privileges are being put to use.

The authorities announced their intention to publish tax records in the summer of 2012 and it has apparently taken them 6 months to build a very limited interface on top of their database. The interface lets you look up individual companies by id (“CVR nummer”) or name and inspect their records. You have to know the name or id of any company that you’re interested in because there’s no way to browse or explore the data. Answering a simple question such as “Which company paid the most taxes in 2011?” is impossible using the interface.

Having said that, I think it’s great whenever governments release data and I commend the Danish tax authorities for making this data available. And even with very limited interfaces like this, it’s generally possible to scrape all data and analyze it in greater detail and that is what I’ve done.

So what’s in there

The tax data-set contains information on 243,711 companies. Note that this data does not contain the names and ids of all companies operating in Denmark in 2011. Some types of corporations (I/S corporations and sole proprietorships for example) have their profits taxed as personal income for the individuals that own them. That means they won’t show up in the data.

UPDATE 12/30-12: Magnus Bjerg pointed out that some companies are duplicated in the data. This seems to be the case at least for all (roughly 48) companies that pay tariffs for extraction of oil and gas. Here are some examples: Shell 1 and Shell 2 and Maersk 1 and Maersk 2. The numbers for these companies look very similar but are not exactly the same. The duplicated companies with different identifiers are likely due to Skat messing up CVR ids and SE ids. Additional details on SE ids can be found here. My guess is that Skat pulled standard taxes and fossil fuel taxes from two different registries and forgot to merge and check for duplicates.

Here are the Danish companies that reported the greatest profits in 2011. These companies also paid the most taxes:
  1. SHELL OLIE- OG GASUDVINDING DANMARK B.V. (HOLLAND), DANSK  FILIAL
  2. A/S Dansk Shell/Eksportvirksomhed
  3. A.P. MØLLER – MÆRSK A/S
  4. A.P.Møller – Mærsk A/S/ Oil & Gas Activity
  5. Novo A/S
Here are the companies that booked the greatest losses:
  1. FLSMIDTH & CO. A/S – lost kr. 1,537,929,000.00
  2. Sund og Bælt Holding A/S – lost kr. 1,443,935,000.00
  3. DONG ENERGY A/S – lost kr. 1,354,480,560.00
  4. TAKEDA A/S – lost kr. 786,286,000.00
  5. PFA HOLDING A/S – lost kr. 703,882,104.00
Here are companies that are reporting a lot of profit but paying few or no taxes:
  1. DONG ENERGY A/S – kr. 3,148,994,114.00 profit, kr. 0 tax
  2. TAKEDA A/S – kr. 745,424,000.00 profit, kr. 0 tax
  3. Rockwool International A/S – kr. 284,696,514.00 profit, kr. 0 tax
  4. COWI HOLDING A/S – kr. 177,272,657.00 profit, kr. 2,399,803.00 tax
  5. DANAHER TAX ADMINISTRATION ApS. – kr. 155,222,377.00 profit, kr. 0 tax

Benford’s law

Benford’s law states that numbers in many real-world sources of data are much more likely to start with the digit 1 (30% of numbers) than with the digit 9 (less than 5% of numbers). Here’s the frequency distribution of first-digits of the numbers for profits, losses and taxes as reported by Danish companies plotted against the frequencies predicted by Benford:

[Chart: first-digit frequencies of reported profits, losses and taxes vs. the frequencies predicted by Benford’s law]

The digit distributions perfectly match those predicted by Benford’s law. That’s great news: If Danish companies were systematically doctoring their tax returns and coming up with fake profit numbers, then those numbers would likely be more uniformly distributed and wouldn’t match Benford’s predictions. This is because crooked accountants trying to come up with random-looking numbers will tend to choose numbers starting with digits like 9 too often and numbers starting with the digit 1 too rarely.
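
Computing the two distributions is straightforward. Here’s a minimal sketch, where amounts holds the (whole-kroner) figures from the data set:

	using System;
	using System.Linq;

	public static void PrintBenfordComparison(decimal[] amounts)
	{
		var nonZero = amounts.Where(a => a != 0).ToArray();
		for (var digit = 1; digit <= 9; digit++)
		{
			var observed = (double)nonZero.Count(a => FirstDigit(a) == digit)
				/ nonZero.Length;
			// Benford's law predicts P(d) = log10(1 + 1/d)
			var predicted = Math.Log10(1.0 + 1.0 / digit);
			Console.WriteLine("{0}: observed {1:P1}, Benford {2:P1}",
				digit, observed, predicted);
		}
	}

	private static int FirstDigit(decimal amount)
	{
		// Amounts are whole kroner, so repeatedly divide until one digit remains
		var value = Math.Abs(amount);
		while (value >= 10)
		{
			value /= 10;
		}
		return (int)value;
	}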

UPDATE 12/30-12: It’s important to stress that the fact that the tax numbers conform to Benfords law does not imply that companies always pay the taxes they are due. It does suggest, however, that Danish companies–as a rule–do not put made-up numbers on their tax returns.

Technical details

To scrape the tax website I found two ways to access tax information for a company:
  1. Access an individual company using the x query parameter for the CVR identifier: http://skat.dk/SKAT.aspx?oId=skattelister&x=29604274
  2. Spoof the POST request generated by the UpdatePanel that gets updated when you hit the “søg” (search) button

The former is the simplest approach, but the latter is preferable for a scraper because much less HTML is transferred from the server when updating the panel compared to requesting the page anew for each company.

To get details on a company, one has to know its identifier. Unfortunately there’s no authoritative list of CVR identifiers, although the government has promised to publish such a list in 2013. The contents of the entire Danish CVR register was leaked in 2011, so one could presumably harvest identifiers from that data. The most fool-proof method, though, is to just brute-force through all possible identifiers. CVR identifiers consist of 7 digits with an 8th checksum-digit. The process of computing the checksum is documented publicly. Here’s my implementation of the checksum computation. Please let me know if you think it’s wrong:

	// Weights for the modulus-11 checksum over the 7 serial digits
	private static int[] digitWeights = { 2, 7, 6, 5, 4, 3, 2 };

	public static int ToCvr(int serial)
	{
		// Split the serial into its individual digits
		var digits = serial.ToString().Select(x => int.Parse(x.ToString()));
		// Weighted sum of the digits, reduced modulo 11
		var sum = digits.Select((x, y) => x * digitWeights[y]).Sum();
		var modulo = sum % 11;
		if (modulo == 1)
		{
			// A remainder of 1 would require check digit 10, so the serial is invalid
			return -1;
		}
		if (modulo == 0)
		{
			// A remainder of 0 yields check digit 0
			modulo = 11;
		}
		var checkDigit = 11 - modulo;
		// The check digit becomes the 8th and final digit
		return serial * 10 + checkDigit;
	}

My guess is that the lowest serial (without the checksum) is 1,000,000 because that’s the lowest serial that will yield an 8-digit identifier. The largest serial is likely 9,999,999. I could be wrong though, so if you have any insights please let me know. Roughly one in eleven serials is discarded because the checksum would be 10, which is invalid. That leaves about 8 million identifiers to be tried. It’s wasteful to have to submit 8 million requests to get records for a couple of hundred thousand companies, but one can hope that 8 million requests will get the government’s attention and that they’ll start publishing data more efficiently.
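
With ToCvr in hand, enumerating the candidate identifiers is a one-liner (a sketch; ToCvr returns -1 for serials whose checksum would be 10):

	using System.Linq;

	// All valid CVR identifiers from serial 1,000,000 through 9,999,999
	var candidates = Enumerable.Range(1000000, 9000000)
		.Select(ToCvr)
		.Where(cvr => cvr != -1);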

Screen scraping with WatiN

This post describes how to use WatiN to screen scrape web sites that don’t want to be scraped. WatiN is generally used to instrument browsers to perform integration testing of web applications, but it works great for scraping too.

Screen scraping websites can range in difficulty from very easy to extremely hard. When encountering hard-to-scrape sites, the typical cause of difficulty is fumbling incompetence on the part of the people that built the site to be scraped. Every once in a while however, you’ll encounter a site openly displaying data to the casual browser, but with measures in place to prevent automatic scraping of that data.

The Danish Patent and Trademark Office runs one such site. The people there maintain a searchable database that lets you search and peruse Danish and international patents. Unfortunately, computers are not allowed. If one tries to issue an HTTP POST to the resource that generally performs searches and shows patents, an error is returned. If one emulates visiting the site with a real browser by providing a browser-looking User Agent setting, collecting cookies etc. (for example by using a tool like SimpleBrowser), the site sends a made-up 999 HTTP response code and the message “No Hacking”.

Faced with such an obstruction, there are two avenues of attack:

  1. Break out Wireshark or Fiddler and spend a lot of time figuring out what it takes to fabricate requests that fool the site into thinking they originate from a normal browser and not from your bot
  2. Instrument an actual browser so that the site will have no way (other than timing analysis and IP address request rate limiting) of knowing whether requests are from a bot or from a normal client

The second option turns out to be really easy because people have spent lots of time building tools for automatically testing web applications using full browsers, tools like WatiN. For example, successfully scraping the Danish Patent Authorities site using WatiN is as simple as this:

private static void GetPatentsInYear(int year)
{
	using (var browser = new IE("http://onlineweb.dkpto.dk/pvsonline/Patent"))
	{
		// go to the search form
		browser.Button(Find.ByName("menu")).ClickNoWait();

		// fill out search form and submit
		browser.CheckBox(Find.ByName("brugsmodel")).Click();
		browser.SelectList(Find.ByName("datotype")).Select("Patent/reg. dato");
		browser.TextField(Find.ByName("dato")).Value = string.Format("{0}*", year);
		browser.Button(Find.By("type", "submit")).ClickNoWait();
		browser.WaitForComplete();

		// go to first patent found in search result and save it
		browser.Buttons.Filter(Find.ByValue("Vis")).First().Click();
		GetPatentFromPage(browser, year);

		// hit the 'next' button until it's no longer there
		while (GetNextPatentButton(browser).Exists)
		{
			GetNextPatentButton(browser).Click();
			GetPatentFromPage(browser, year);
		}
	}
}

private static Button GetNextPatentButton(IE browser)
{
	return browser.Button(button =>
		button.Value == "Næste" && button.ClassName == "knapanden");
}

Note that in this example, we’re using Internet Explorer because it’s the easiest to set up and use (WatiN also works with Firefox, but only older versions). There’s definitely room for improvement, in particular it’d be interesting to explore parallelizing the scraper to download patents faster. The – still incomplete – project source code is available on GitHub. I’ll do a post shortly on what interesting data can be extracted from Danish patents.

Raw updated data on Danish business leader groups

Last summer, I published data on the members of Danish business leader groups, obtained with code written while I was still at Ekstra Bladet. I’ve cleaned up the code and removed the parts that fetched celebrities from various other obscure sources. You can fork the project on Github.

The code is fairly straightforward. The scraper itself is less than 150 loc. The scraper is configured to be run in a background worker on AppHarbor and will conduct a scrape once a month (I don’t know how often the VL-people update their website, but a monthly scrape seems sufficient to keep track of comings and goings). The resulting data can be fetched using a simple JSON API. You can find a list of scraped member-batches here (there’s just one at the time of writing). Hitting http://vlgroups.apphb.com/Member will always net you the latest batch.
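
Fetching the latest batch is as simple as this sketch (parsing the JSON with your favorite library is left out):

	using System.Net;

	using (var client = new WebClient())
	{
		// Always returns the most recent batch of scraped members as JSON
		var json = client.DownloadString("http://vlgroups.apphb.com/Member");
	}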

I was motivated to revisit the code after this week’s dethroning of Anders Eldrup from his position as CEO of Dong Energy. Anders Eldrup sits in VL-gruppe 1, the most prestigious one. Let’s see if he’s still there next time the scraper looks. 14 other Dong Energy executives are members of other groups, although interestingly, Jakob Baruël Poulsen (Eldrup’s handsomely rewarded sidekick) is nowhere to be found. I think data like this is an important piece of the puzzle in figuring out what relations exist between business leaders in Denmark, and the Anders Eldrup debacle demonstrates why keeping track is important.
