Randoom a Michael Friis production

Mono Docker language stack

I couple weeks ago, Docker announced official pre-built Docker images for a bunch of popular programming languages. Each stack generally consists of two Dockerfiles: a base Dockerfile that installs system dependencies required for that language to run, and an onbuild Dockerfile that uses ONBUILD instructions to transform app source code into a runnable Docker image. As an example of the latter, the Ruby onbuild Dockerfile runs bundle install to install libraries specified in an app’s Gemfile.

Managing system dependencies and composing apps from source code is very similar to what we do with Stacks and Buildpacks at Heroku. To better understand the Docker approach, I created a language stack for Mono, the open source implementation of Microsoft’s .NET Framework.

How to use

A working Docker installation is required for this section.

To turn a .NET app into a runnable Docker image, first add a Dockerfile to your app source root. The sample below assumes a simple console app with an output executable name of TestingConsoleApp.exe:

FROM friism/mono:3.10.0-onbuild
CMD [ "mono", "./TestingConsoleApp.exe" ]

Now build the image:

docker build -t my-app .

The friism/mono images are available in the public Docker Registry and your Docker client will fetch them from there. Docker will then execute the onbuild instructions to restore NuGet packages required by the app and use xbuild (the Mono equivalent of msbuild) to compile source code into executables and libraries.

The Docker image with your app is now ready to run:

docker run my-app

If you don’t have an app to test with, you can experiment with this console test app.

Notes

The way Docker languages stacks are split into a base image (that declares system dependencies) and an onbuild Dockerfile (that composes the actual app to be run) is perfect. It allows each language to get just the system libraries and dependencies needed. In contrast, Heroku has only one stack image (in several versions, reflecting underlying Linux distribution versions) that all language buildpacks share. That stack is at once both too thick and too thin: It includes a broad assortment of libraries to make supported languages work, but most buildpack maintainers still have to hand-build dependencies and vendor in the binaries when apps are built.

Docker has no notion of a cache for ONBUILD commands whereas the Heroku buildpack API has a cache interface. No caching makes the Docker stack maintainer’s life easier, but it also makes builds much slower than what’s possible on Heroku. For example, Heroku buildpacks can cache the result of running bundle install (in the case of Ruby) or nuget restore (for Mono), greatly speeding up builds after the first one.

Versioning is another interesting difference. Heroku buildpacks bake support for all supported language versions into a single monolithic release. What language version to use is generally specified by the app being built in a requirements.txt (or similar) file and the buildpack uses that to install the correct packages.

Docker language stacks, on the other hand, support versioning with version tags. The app chooses what stack version to use with the FROM instruction in the Dockerfile that’s added to the app. Stack versions map to versions of the underlying language or framework (eg. FROM python:3-onbuild gets you Python 3). This approach lets the Python stack, for example, compile Python 2 and 3 apps in different ways without having a bunch of branching logic in the onbuild Dockerfile. On the other hand, pushing an update to all Python stack versions becomes more work because the tags have to be updated individually. There are tradeoffs in both the Docker and Heroku buildpack approaches, I don’t know which is best.

Docker maintains a free, automated build service that churns out hosted Docker images for everyone to use. For my Mono stack, Docker Hub pulls updates from the GitHub repo with the Dockerfiles and builds the relevant tags into images. This is very convenient for stack maintainers. Heroku has no hosted service for building buildpack binaries, although I have documented a (Docker-based) approach to scripting this work.

(Note that, while Heroku buildpacks are wildly successful, it’s an older standard that predates Docker by many years. If it seems like Docker has gotten more things right, it’s probably because that project was informed by Heroku’s experience and by the passage of time).

Finally, and unrelated to Docker and Heroku, the Mono Project now has an APT package repository. This is pretty awesome, and I sincerely hope that the days of having to compile Mono from source are behind us. I don’t know if the repo is quite stable yet (I had to download a key without using SSL, the mono-devel package is versioned 3.10.0-0xamarin1 and the package fails to declare a dependency on udev), but it made the Mono Docker stack image a lot simpler. Check out the diff going from 3.8.0 (compiled from source) to 3.10.0 (installed from APT repo).

 


No Comments Yet

WebP, content negotiation and CloudFront

AWS recently launched improvements to the CloudFront CDN service. The most important change is the option to specify more HTTP headers be part of the cache key for cached content. This lets you run CloudFront in front of apps that do sophisticated Content Negotiation. In this post, we demonstrate how to use this new power to selectively serve WebP or JPEG encoded images through CloudFront depending on client support. It is a follow-up to my previous post on Serving WebP images with ASP.NET. Combining WebP and CloudFront CDN makes your app faster because users with supported browsers get highly compressed WebP images from AWS’ fast and geo-distributed CDN.

The objective of WebP/JPEG content negotiation is to send WebP encoded images to newer browsers that support it, while falling back to JPEG for older browser without WebP support. Browsers communicate WebP support through the Accept HTTP header. If the Accept header value for a request includes image/webp, we can send WebP encoded images in responses.

Prior to the CloudFront improvements, doing WebP/JPEG content negotiation for images fronted with CloudFront CDN was not possible. That’s because CloudFront would only vary response content based on the Accept-Encoding header (and some other unrelated properties). With the new improvements, we can configure arbitrary headers for CloudFront to cache on. For WebP/JPEG content negotiation, we specify Accept in list list of headers to whitelist in the AWS CloudFront console.

Specifying Accept header in AWS console

Once the rest of the origin and behavior configuration is complete, we can use cURL to test that CloudFront correctly caches and serves responses based on the Accept request header:

curl -v http://abcd1234.cloudfront.net/myimageurl > /dev/null
...
< HTTP/1.1 200 OK
< Content-Type: image/jpeg
...

And simulating a client that supports WebP:

curl -v -H "Accept: image/webp" http://abcd1234.cloudfront.net/myimageurl > /dev/null
...
< HTTP/1.1 200 OK
< Content-Type: image/webp
...

Your origin server should set the Vary header to let any caches in the response-path know how they can cache, although I don’t believe it’s required to make CloudFront work. In our case, the correct Vary header value is Accept,Accept-Encoding.

That’s it! Site visitors will now be receiving compact WebP images (if their browsers support them), served by AWS CloudFront CDN.


No Comments Yet

Building Heroku buildpack binaries with Docker

This post covers how binaries are created for the Heroku Mono buildpack, in particular the Mono runtime and XSP/fastcgi-mono-server components that are vendored into app slugs. Binaries were previously built using a custom buildpack with the build running in a Heroku dyno. That took a while though and was error-prone and hard to debug, so I have migrated the build scripts to run inside containers set up with Docker. Some advantages of this approach are:

  • It’s fast, builds can run on my powerful laptop
  • It’s easy to get clean Ubuntu Lucid (10.04) image for reproducible clean-slate builds
  • Debugging is easy too, using an interactive Docker session

The two projects used to build Mono and XSP are on GitHub. Building is a two-step process: First, we do a one-time Docker build to create an image with the bare-minimum (eg. curl, gcc, make) required to perform Mono/XSP builds. s3gof3r is also added – it’s used a the end of the second step to upload the finished binaries to S3 where buildpacks can access them.

The finished Docker images have build scripts and can now be used to run builds at will. Since changes are not persisted to the Docker images each new build happens from a clean slate. This greatly improves consistency and reproducibility. The general phases in the second build step are:

  1. Get source code to build
  2. Run configure or autogen.sh
  3. Run make and make install
  4. Package up result and upload to S3

Notice that the build scripts specify the /app filesystem location as the installation location. This is the filesystem location where slugs are mounted in Heroku dynos (you can check yourself by running heroku run pwd). This may or not be a problem for your buildpack, but Mono framework builds are not path-relative, and expect to eventually be run out of where they were configured to be installed. So during the build, we have to use a configure setting consistent with how Heroku runs slugs.

The build scripts are parametrized to take AWS credentials (for the S3 upload) and the version to build. Building Mono 3.2.8 is done like this:

$ docker run -v ${PWD}/cache:/var/cache -e AWS_ACCESS_KEY_ID=key -e AWS_SECRET_ACCESS_KEY=secret -e VERSION=3.2.8 friism/mono-builder

It’s now trivial to quickly create and upload to S3 binaries for all the versions of Mono I want to support. It’s very easy to experiment with different settings and get binaries that are small (so slugs end up being smaller), fast and free of bugs.

There’s one trick in the docker run command: -v ${PWD}/cache:/var/cache. This mounts the cache folder from the current host directory in the container. The build scripts use that location to store and cache the different versions of downloaded source code. With the source code cache, turnaround times when tweaking build settings are even shorter because I don’t have to wait for the source to re-download on each run.


No Comments Yet

Running .NET apps on Docker

This blog post covers running simple .NET apps in Docker lightweight containers using Mono. I run Docker in a Vagrant/VirtualBox VM on Windows. This works great and is fast. Installation instructions are available on the Docker site.

Building the base image

First order of business is to create a Docker image that has Mono installed. We will use this as the base image for containers that actually run apps. To get the most recent Mono version (3.2.6 at the time of writing) I use packages created by Timotheus Pokorra installed on a Ubuntu 12.04 LTS Docker image. Here’s the Dockerfile for that:

FROM ubuntu:12.04
MAINTAINER friism

RUN apt-get -y -q install wget
RUN wget -q http://download.opensuse.org/repositories/home:tpokorra:mono/xUbuntu_12.04/Release.key -O- | apt-key add -
RUN apt-get remove -y --auto-remove wget
RUN sh -c "echo 'deb http://download.opensuse.org/repositories/home:/tpokorra:/mono/xUbuntu_12.04/ /' >> /etc/apt/sources.list.d/mono-opt.list"
RUN apt-get -q update
RUN apt-get -y -q install mono-opt

Here’s what’s going on:

  1. Install wget
  2. Add repository key to apt-get
  3. remove wget
  4. Add openSUSE repository to sources list
  5. Install Mono from there

At first I did all of the above in one command since these steps represent the single logical step of installing Mono and it seems like they should be just one commit. Nobody will be interested in the commit after wget was installed, for example. I ended up splitting it up into separate RUN commands since that’s what other people seem to do.
With that Dockerfile, we can build an image:

$ docker build -t friism/mono .

We can then run a container using the generated image and check our Mono installation:

vagrant@precise64:~/mono$ docker run -i -t friism/mono bash
root@0bdca65e6e8e:/# /opt/mono/bin/mono --version
Mono JIT compiler version 3.2.6 (tarball Sat Jan 18 16:48:05 UTC 2014)

Note that Mono is installed in /opt and that it works!

Running a console app

First, we’ll deploy a very simple console app:

using System;

namespace HelloWorld
{
	public class Program
	{
		static void Main(string[] args)
		{
			Console.WriteLine("Hello World");
		}
	}
}

For the purpose of this example, we’ll pre-build apps using the VS command prompt and msbuild and then add the output to the container:

msbuild /property:OutDir=C:\tmp\helloworld HelloWorld.sln

Alternatively, we could have invoked xbuild or gmcs from within the container.

The Dockerfile for the container to run the app is extremely simple:

FROM friism/mono
MAINTAINER friism

ADD app/ .
CMD /opt/mono/bin/mono `ls *.exe | head -1`

Note that it relies on the friism/mono image created above. It also expects the compiled app to be in the /app folder, so:

$ ls app
HelloWorld.exe

The CMD will simply use mono to run the first executable found in the build output. Let’s build and run it:

$ docker build -t friism/helloworld-container .
...
$ docker run friism/helloworld-container
Hello World

It worked!

Web app

Running a self-hosting OWIN web app is only slightly more work. For this example I used the sample code from the OWIN/Katana-on-Heroku post.

$ ls app/
HelloWorldWeb.exe  Microsoft.Owin.Diagnostics.dll  Microsoft.Owin.dll  Microsoft.Owin.Host.HttpListener.dll  Microsoft.Owin.Hosting.dll  Owin.dll

The Dockerfile for this exposes port 5000 from the container, and CMD is used to start the web app and specify port 5000 to listen on:

FROM friism/mono
MAINTAINER friism

ADD app/ .
EXPOSE 5000
CMD ["/opt/mono/bin/mono", "HelloWorldWeb.exe", "5000"]

We start the container and map port 5000 to port 80 on the machine running Docker:

$ docker run -p 80:5000 -t friism/mono-hello-world-web

And with that we can visit the OWIN sample site on http://localhost/.

If you’re a .NET developer this post will hopefully have helped you place Docker in the context of your everyday work. If you’re interested in more ideas on how to use Docker to deploy apps, check out this Automated deployment with Docker – lessons learnt post.


3 Comments

Serving WebP images with ASP.NET MVC

Speed is a feature and one thing that can slow down web apps is clients waiting to download images. To speed up image downloads and conserve bandwidth, the good folks of Google have come up with a new image format called WebP (“weppy”). WebP images are around 25% smaller in size than equivalent images encoded with JPEG and PNG (WebP supports both lossy and lossless compression) with no worse perceived quality.

This blog post shows how to dynamically serve WebP-encoded images from ASP.NET to clients that support the new format.

One not-so-great way of doing this is to serve different HTML depending on whether clients supports WebP or not, as described in this blog post. As an example, clients supporting WebP would get HTML with <img src="image.webp"/> while other clients would get <img src="image.jpeg"/>. The reason this sucks is that the same HTML cannot be served to all clients, making caching harder. It will also tend to pollute your view code with concerns about what image formats are supported by browser we’re rendering for right now.

Instead, images in our solution will only ever have one url and the content-type of responses depend on the capabilities of the client sending the request: Browsers that support WebP get image/webp and the rest get image/jpeg.

I’ll first go through creating WebP-encoded images in C#, then tackle the challenge of detecting browser image support and round out the post by discussing implications for CDN use.

Serving WebP with ASP.NET MVC

For the purposes of this article, we’ll assume that we want to serve images from an URI like /images/:id where :id is some unique id of the image requested. The id can be used to fetch a canonical encoding of the image, either from a file system, a database or some other backing store. In the code I wrote to use this, the images are stored in a database. Once fetched from the backing store, the image is re-sized as desired, re-encoded and served to the client.

At this point, some readers are probably in uproar: “Doing on-the-fly image fetching and manipulation is wasteful and slow” they scream. That’s not really the case though, and even if it were, the results can be cached on first request and then served quickly.

Assume we have an Image class and method GetImage(int id) to retrieve images:

private class Image
{
	public int Id { get; set; }
	public DateTime UpdateAt { get; set; }
	public byte[] ImageBytes { get; set; }
}

We’ll now use the managed API from ImageResizer to resize the image to the desired size and then re-encode the result to WebP using Noesis.Drawing.Imaging.WebP.dll (no NuGet package, unfortunately).

public ActionResult Show(int imageId)
{
	var image = GetImage(imageId);

	var resizedImageStream = new MemoryStream();
	ImageBuilder.Current.Build(image.ImageBytes, resizedImageStream, new ResizeSettings
	{
		Width = 500,
		Height = 500,
		Mode = FitMode.Crop,
		Anchor = System.Drawing.ContentAlignment.MiddleCenter,
		Scale = ScaleMode.Both,
	});

	var resultStream = new MemoryStream();
	WebPFormat.SaveToStream(resultStream, new SD.Bitmap(resizedImageStream));
	resultStream.Seek(0, SeekOrigin.Begin);

	return new FileContentResult(resultStream.ToArray(), "image/webp");
}

System.Drawing is referenced using using SD = System.Drawing;. The controller action above is fully functional and can serve up sparkling new WebP-formatted images.

Browser support

Next up is figuring out whether the browser requesting an image actually supports WebP, and if it doesn’t, respond with JPEG. Luckily, this doesn’t involve going back to the bad old days of user-agent sniffing. Modern browsers that support WebP (such as Chrome and Opera) send image/webp in the accept header to indicate support. Ironically given that Google came up with WebP, the Chrome developers took a lot of convincing to set that header in requests, fearing request size bloat. Even now, Chrome only advertises webp support for requests that it thinks is for images. In fact, this is another reason the “different-HTML” approach mentioned in the intro won’t work: Chrome doesn’t advertise WebP support for requests for HTML.

To determine what content encoding to use, we inspect Request.AcceptTypes. The resizing code is unchanged, while the response is generated like this:

	var resultStream = new MemoryStream();
	var webPSupported = Request.AcceptTypes.Contains("image/webp");
	if (webPSupported)
	{
		WebPFormat.SaveToStream(resultStream, new SD.Bitmap(resizedImageStream));
	}
	else
	{
		new SD.Bitmap(resizedImageStream).Save(resultStream, ImageFormat.Jpeg);
	}

	resultStream.Seek(0, SeekOrigin.Begin);
	return new FileContentResult(resultStream.ToArray(), webPSupported ? "image/webp" : "image/jpeg");

That’s it! We now have a functional controller action that responds correctly depending on request accept headers. You gotta love HTTP. You can read more about content negotiation and WebP on lya Grigorik’s blog.

Client Caching and CDNs

Since it does take a little while to perform the resizing and encoding, I recommend storing the output of the transformation in HttpRuntime.Cache and fetching from there in subsequent requests. The details are trivial and omitted from this post.

There is also a bunch of ASP.NET cache configuration we should do to let clients cache images locally:

	Response.Cache.SetExpires(DateTime.Now.AddDays(365));
	Response.Cache.SetCacheability(HttpCacheability.Public);
	Response.Cache.SetMaxAge(TimeSpan.FromDays(365));
	Response.Cache.SetSlidingExpiration(true);
	Response.Cache.SetOmitVaryStar(true);
	Response.Headers.Set("Vary",
		string.Join(",", new string[] { "Accept", "Accept-Encoding" } ));
	Response.Cache.SetLastModified(image.UpdatedAt.ToLocalTime());

Notice that we set the Vary header value to include “Accept” (as well as “Accept-Encoding”). This tells CDNs and other intermediary caching proxies that if they try to cache this response, they must vary the cached value based on value of the “Accept” header of the request. This works for “Accept-Encoding”, so that different values can cached based on whether the response is compressed with gzip, deflate or not at all, and all major CDNs support it. Unfortunately, the mainstream CDNs I experimented with (CloudFront and Azure CDN) don’t support other Vary values than “Accept-Encoding”. This is really frustrating, but also somewhat understandable from the standpoint of the CDN folks: If all Vary values are honored, the number of artifacts they have to cache would increase at a combinatorial rate as browsers and servers make use of cleverer caching. Unless you find a CDN that specifically support non-Accept-Encoding Vary values, don’t use a CDN when doing this kind of content negotiation.

That’s it! I hope this post will help you build ASP.NET web apps that serve up WebP images really quickly.


2 Comments

Full 2012 Danish company taxes

Two weeks ago I put out a preliminary release of the 2012 taxes. The full dataset with 245,836 companies is now available in this Google Fusion Table. I haven’t done any of the analysis I did last year. Other than a list of top payers, I’ll leave the rest up to you guys. One area of interest might be to compare the new data with the data from last year: What new companies have showed up that have lots of profit or pay lots of tax? What high-paying companies went away or stopped paying anything? Who’s taxes and profits changed the most? You can find last year’s data via the blog post on the 2011 data.

Here are the top 2012 corporate tax payers:

  1. Novo A/S: kr. 3,747,186,913.00
  2. KIRKBI A/S: kr. 2,044,585,056.00
  3. NORDEA BANK DANMARK A/S: kr. 1,669,087,041.00
  4. TDC A/S: kr. 1,587,748,550.00
  5. Coloplast A/S: kr. 607,238,513.00

Ping me on Twitter if you build something cool with this data. The tax guys were kind enough to say that they’re looking into my finding that the 2011 data had been changed and address my critique of the fact that the 2011 data is no longer available. I’m still hoping for a response.

How to read the 2012 data

Each company record exposes up to 9 values. I’ve given them English column names in the Fusion Table. As an example of what goes where, check out A/S Dansk Shell/Eksportvirksomhed:

  • Profit (“Skattepligtig indkomst for 2012″): kr. 528.440.304
  • Losses (“Underskud, der er trukket fra indkomsten”): kr. 225.655.303
  • Tax (“Selskabsskatten for 2012″): kr. 132.110.075
  • FossilProfit (“Skattepligtig kulbrinteindkomst for 2012″): kr. 9.757.362.931
  • FossilLosses (“Underskud, der er trukket fra kulbrinteindkomsten”): kr. 0
  • FossilTax (“Kulbrinteskatten for 2012″): kr. 5.073.828.724
  • FossilCorporateProfit (“Skattepligtig selskabsindkomst, jf. kulbrinteskatteloven for 2012″): kr. 14.438.589.854
  • FossilCorporateLosses (“Underskud, der er trukket fra kulbrinteindkomsten”): no value
  • FossilCorporateTax (“Selskabsskat 2012 af kulbrinteindkomst”): kr. 3.609.647.450

A better way to release tax data

Computerworld has a interesting article with details from the process of making the 2011 data available. The article is mostly about the fact that various Danish business lobbies managed to convince the tax authority to bundle fossil extraction taxes and taxes on profit into one number. This had the effect of cleverly obfuscating the fact that Maersk doesn’t seem to pay very much tax on profits earned in the parts of their business that aren’t about extracting oil.

More interesting for us, the article also mentions that the tax authority spent about kr. 1 mio to develop the site that makes the data available. That seems generous for a single web page that can look up values in a database, and does so slowly. But it’s probably par for the course in public procurement.

The frustrating part is that the data could be released in a much more simple and useful way: Simply dump it out in an Excel spreadsheet (and a .csv for people that don’t want to use Excel) and make it available for download. I have created these files from the data I’ve scraped, and they’re about 27MB (5MB compressed). It would be trivial and cheap to make the data set available in this format.

Right now, there’s probably a good number of programmers and journalists (me included) that have build scrapers and are hammering the tax website to get all the data. With downloadable spreadsheets, getting all the data would be fast and easy, and it would save a lot of wasted effort. I’m sure employees at the Tax Authority create spreadsheets like this all the time anyway, to do internal analysis and reporting.

Depending on whether data-changes (as seen with last years data) are a fluke or normal, it might be necessary to release new spreadsheets when the data is updated. This would be great too though, because it would make the tracking changes easier and more transparent.


2 Comments

Preliminary 2012 Danish Company Tax Records

The Danish tax authority released the 2012 company tax records yesterday. I’ve scraped a preliminary data set by just getting all the CVR-ids from last year’s scrape. This leaves out any new companies that cropped up since then, but there’d still 228,976 in the set, including subsidiaries.

The rest are being scraped as I write this and should be ready in at most a few days. I’ll publish the full data as soon as I have it, including some commentary on the fact that the tax people spent kr. 1mio [dk] to release the data in the this half assed way.

In the meantime, get the preliminary data in this Google Fusion Table.


3 Comments

Getting ready for 2012 Danish company taxes

This is a follow-up to last year’s “Tax records for Danish companies” post which covered how I screen-scraped and analyzed 2011 tax records for all Danish companies.

I revisited the scraper source code today because the Tax Authority has made it known[dk] that they will be releasing the 2012 data set next week. As I did last year, I want to preface the critique that is going to follow and say that it’s awesome that this information is made public, and that I hope the government will continue to publish it and work on making it more useful.

First some notes regarding the article:

  • It states that, for the first time, it’s possible to determine what part of a company’s taxes are due to profits on oil and gas extraction and what are due to normal profits. That’s strange, since this was also evident from last years data. Maybe they’re trying to say (but the journalist was too dim to understand) that they have solved the problem in the 2011 data that caused oil and gas corporations to be duplicated, as evidenced by the two entries for Maersk: A.P.Møller – Mærsk A/S/ Oil & Gas Activity and A.P. MØLLER – MÆRSK A/S. Note that the two entries have the same CVR identifier.
  • It’s frustrating that announcements like this (that new data is coming next week) are not communicated publicly on Twitter or the web sites of either the Tax Authority or the Ministry of Taxation. Instead, one has to randomly find the news on some random newspaper web site. Maybe it was mentioned in a newsletter I’m not subscribed to – who knows.

Anyway, these are nuisances, now on to the actual problems.

2011 data is going away

The webpage says it beautifully:

De offentliggjorte skatteoplysninger er for indkomståret 2011 og kan ses, indtil oplysningerne for 2012 bliver offentliggjort i slutningen af 2013.

Translated:

The published tax information is for the year 2011 and is available until the 2012 information is published at the end of 2013.

Removing all the 2011 data to make room for the 2012 stuff is very wrong. First off, it’s defective that historical information is not available. Of course, I scraped the information and put it in a Fusion Table for posterity (or at least for as long as Google maintains that product). Even then, it’s wrong of the tax authority to not also publish and maintain historical records.

Second, I suspect that the new 2012 data will be published using the same URI scheme as the 2011 data, i.e.: http://skat.dk/SKAT.aspx?oId=skattelister&x={cvr-id}. So when the new data goes live some time next week, a URI that pointed to the 2011 tax records of the company FORLAGET SOHN ApS will all of a sudden point to the 2012 tax records of that company. That means that all the links I included in last year’s blog post and thought would point to 2011 data in perpetuity now point to 2012 data. This is likely going to be confusing to readers, both of my post, but also for other people following those links from all over the Internet. The semantics of these URIs are arguably coherent if they’re defined to be “the latest tax records for company X”. This is not a very satisfying paradigm though, and it would be much better if /company-tax-records/{year}/{cvr-id} URIs were made available, or if records from all years were available at /SKAT.aspx?oId=skattelister&x={cvr-id} as they became available.

The 2011 data was changed

I discovered this randomly when dusting off last years code. It has a set of integration tests, and the one for Saxo Bank refused to pass. That turns out to be because the numbers reported have changed. When I first scraped the data, Saxo Bank paid kr. 25.426.135 in taxes on profits of kr. 257.969.357. The current numbers are kr. 25.142.333 taxes on kr. 260.131.946 of profits. So it looks like the bank made a cool extra couple millions in 2011 and managed to get their tax bill bumped down a bit.

Some takeaways:

  1. Even though this information is posted almost a full year after the end of 2011, the numbers are not accurate and have to be corrected. This is obviously not only the tax authority’s fault: Companies are given ample time to gather and submit records and even then, they may provide erroneous data.
  2. It’d be interesting to know what caused the correction. Did Saxo Bank not submit everything? Did the tax people miss something? Was Saxo Bank audited?
  3. It’d be nice if these revisions themselves were published in an organised fashion by the tax authorities. Given the ramshackle way they go about publishing the other data, I’m not holding my breath for this to happen.
  4. I have no idea if there are adjustments to other companies and if so, how many. I could try and re-run the scraper on all 243,711 companies to find changes before the 2012 release obliterates the 2011 data but I frankly can’t be bothered. Maybe some journalist can go ask.

That’s it! Provided the tax people don’t change the web interface, the scraper is ready for next week’s 2012 data release. I’ll start running as soon as 2012 numbers show up and publish raw data when I have it.


No Comments Yet

Heroku .NET buildpack now with nginx

Another weekend, and another bunch of updates to the .NET buildpack. Most significantly, ASP.NET apps now run with  fastcgi-mono-server fronted by an nginx instance in each dyno. This replaces the previous setup which used the XSP development web server. Using nginx is supposedly more production ready.

Other changes include setting the LD_LIBRARY_PATH environment variable and priming the Mono certificate store. This will make apps that use RestSharp, or otherwise exposes Mono’s reliance on native zlib, work out of the box. It also makes calling HTTPS web services form Heroku not fail.

These improvements came about because people tested their .NET apps on Heroku. Please do the same so that we can weed out as many bugs as possible. If you want to contribute, feel free to check out the TODO-list in the README on the GitHub.


3 Comments

Posted
28 July 2013 @ 9pm

Categories
C#, Heroku

1 Comment

Heroku .NET buildpack update to Mono 3.2 and more

Hot on the heels of posts on Running .NET on Heroku and Running OWIN/Katana apps on Heroku, I’m happy to announce a couple of updates to the buildpack:

  1. Now uses the newly released Mono 3.2 runtime. I need to figure out some way for users to select what version they want the same way the Python buildpack uses runtime.txt.
  2. Adds symbolic links from NuGet.targets files to nuget.targets files to account for inconsistent casing practices in NuGet.
  3. Fetches Nuget.targets file that is not broken when doing package restore on Linux. I’m still trying to get Microsoft to accept a pull request to NuGet that fixes this.

I’ve also spent some time trying to switch from xsp4 to nginx and fastcgi-mono-server4, but am currently stuck.


1 Comment

← Before