Michael Friis' Blog


Getting ready for 2012 Danish company taxes

This is a follow-up to last year’s “Tax records for Danish companies” post which covered how I screen-scraped and analyzed 2011 tax records for all Danish companies.

I revisited the scraper source code today because the Tax Authority has made it known[dk] that they will be releasing the 2012 data set next week. As I did last year, I want to preface the critique that is going to follow and say that it’s awesome that this information is made public, and that I hope the government will continue to publish it and work on making it more useful.

First some notes regarding the article:

  • It states that, for the first time, it’s possible to determine what part of a company’s taxes is due to profits on oil and gas extraction and what part is due to normal profits. That’s strange, since this was also evident from last year’s data. Maybe they’re trying to say (but the journalist was too dim to understand) that they have solved the problem in the 2011 data that caused oil and gas corporations to be duplicated, as evidenced by the two entries for Maersk: A.P.Møller – Mærsk A/S/ Oil & Gas Activity and A.P. MØLLER – MÆRSK A/S. Note that the two entries have the same CVR identifier.
  • It’s frustrating that announcements like this (that new data is coming next week) are not communicated publicly on Twitter or the web sites of either the Tax Authority or the Ministry of Taxation. Instead, one has to stumble upon the news on some newspaper web site. Maybe it was mentioned in a newsletter I’m not subscribed to – who knows.

Anyway, these are nuisances, now on to the actual problems.

2011 data is going away

The webpage says it beautifully:

De offentliggjorte skatteoplysninger er for indkomståret 2011 og kan ses, indtil oplysningerne for 2012 bliver offentliggjort i slutningen af 2013.

Translated:

The published tax information is for the year 2011 and is available until the 2012 information is published at the end of 2013.

Removing all the 2011 data to make room for the 2012 stuff is very wrong. First off, it’s a defect that historical information is no longer available. Of course, I scraped the information and put it in a Fusion Table for posterity (or at least for as long as Google maintains that product). Even so, it’s wrong of the tax authority to not also publish and maintain historical records.

Second, I suspect that the new 2012 data will be published using the same URI scheme as the 2011 data, i.e.: http://skat.dk/SKAT.aspx?oId=skattelister&x={cvr-id}. So when the new data goes live some time next week, a URI that pointed to the 2011 tax records of the company FORLAGET SOHN ApS will all of a sudden point to the 2012 tax records of that company. That means that all the links I included in last year’s blog post, and thought would point to 2011 data in perpetuity, will now point to 2012 data. This is likely going to be confusing, both to readers of my post and to other people following those links from all over the Internet. The semantics of these URIs are arguably coherent if they’re defined to be “the latest tax records for company X”. This is not a very satisfying paradigm though, and it would be much better if /company-tax-records/{year}/{cvr-id} URIs were made available, or if records from all years were listed at /SKAT.aspx?oId=skattelister&x={cvr-id} as they become available.
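To make the difference between the two URI styles concrete, here is a minimal Python sketch. The first function mirrors the existing scheme quoted above. The second builds the year-scoped path I am wishing for, which is hypothetical and not something skat.dk offers. The CVR number is a made-up placeholder.

```python
from urllib.parse import urlencode

BASE = "http://skat.dk"

def latest_uri(cvr_id: str) -> str:
    """Existing scheme: resolves to whichever year is currently published."""
    return f"{BASE}/SKAT.aspx?" + urlencode({"oId": "skattelister", "x": cvr_id})

def yearly_uri(year: int, cvr_id: str) -> str:
    """Hypothetical year-scoped scheme: a stable link to one year's records."""
    return f"{BASE}/company-tax-records/{year}/{cvr_id}"

# "12345678" is a placeholder, not a real company's CVR number.
print(latest_uri("12345678"))
print(yearly_uri(2011, "12345678"))
```

With the second style, a link written in 2012 would keep meaning "the 2011 records" no matter how many later years are published.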

The 2011 data was changed

I discovered this by chance when dusting off last year’s code. It has a set of integration tests, and the one for Saxo Bank refused to pass. That turns out to be because the reported numbers have changed. When I first scraped the data, Saxo Bank paid kr. 25.426.135 in taxes on profits of kr. 257.969.357. The current numbers are kr. 25.142.333 in taxes on kr. 260.131.946 of profits. So it looks like the bank made a cool extra couple of million in 2011 and managed to get their tax bill bumped down a bit.

Some takeaways:

  1. Even though this information is posted almost a full year after the end of 2011, the numbers are not accurate and have to be corrected. This is obviously not only the tax authority’s fault: Companies are given ample time to gather and submit records and even then, they may provide erroneous data.
  2. It’d be interesting to know what caused the correction. Did Saxo Bank not submit everything? Did the tax people miss something? Was Saxo Bank audited?
  3. It’d be nice if these revisions themselves were published in an organised fashion by the tax authorities. Given the ramshackle way they go about publishing the other data, I’m not holding my breath for this to happen.
  4. I have no idea if there are adjustments to other companies and if so, how many. I could try to re-run the scraper on all 243,711 companies to find changes before the 2012 release obliterates the 2011 data, but I frankly can’t be bothered. Maybe some journalist can go ask.
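For anyone who does want to look for revisions, the comparison itself is simple. The following Python sketch is not the scraper’s actual code; the function names and the record layout (`tax`, `profit`) are assumptions made for illustration. It parses the Danish-formatted amounts quoted above for Saxo Bank and reports which fields changed between two snapshots.

```python
def parse_kr(text: str) -> int:
    """Parse a Danish-formatted amount such as 'kr. 25.426.135' into an int."""
    digits = text.replace("kr.", "").replace(".", "").strip()
    return int(digits)

def diff_record(old: dict, new: dict) -> dict:
    """Return the changed fields as (old value, new value, delta) tuples."""
    return {
        key: (old[key], new[key], new[key] - old[key])
        for key in old
        if old[key] != new[key]
    }

# The two sets of Saxo Bank figures quoted in this post.
first_scrape = {"tax": parse_kr("kr. 25.426.135"), "profit": parse_kr("kr. 257.969.357")}
current = {"tax": parse_kr("kr. 25.142.333"), "profit": parse_kr("kr. 260.131.946")}

for field, (before, after, delta) in diff_record(first_scrape, current).items():
    print(f"{field}: {before} -> {after} ({delta:+d})")
```

Run over two full snapshots keyed by CVR identifier, the same comparison would list every company whose 2011 figures were revised.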

That’s it! Provided the tax people don’t change the web interface, the scraper is ready for next week’s 2012 data release. I’ll start running as soon as 2012 numbers show up and publish raw data when I have it.

Nordic Newshacker

The excellent people at the Danish newspaper Information are hosting a competition to promote data journalism. It’s called “Nordisk Nyhedshacker 2012”. Data journalism was what I spent some of my time at Ekstra Bladet doing, and the organizers have been kind enough to put me on the jury. The winner will get a scholarship to go work at The Guardian for a month, sponsored by Google. Frankly, I’d prefer working at Information, but I guess The Guardian will do. If you’re a journalist who can hack, or if you’re a hacker interested in using your craft to make people more informed about the world we live in, you should use this opportunity to come up with something interesting and be recognized for it.

Hopefully, you already have awesome ideas for what to build. Should you need some inspiration, here are a few interesting pieces of data you might want to consider (projects using this data will not be judged differently than others).

  • Examine the US Embassy Cables released by Wikileaks. I’ve tried to filter out the ones related to Denmark.
  • Examine the power relationships of members of Danish business leader groups. I have extracted the membership info from their web site. It’d be extra interesting if you combine this information with data about who sits on the boards of big Danish companies, perhaps to make the beginnings of something like LittleSis so that we can keep track of what favours those in power are doing each other.
  • Do something interesting with the CVR database of Danish companies that was leaked on The Pirate Bay last year.
  • Ekstra Bladet has been kind enough to let me open source the code for the award-winning Krimikort (Crime Map) I built while working there. It’s not quite ready to be released yet, but we’re making the current data available now. There are 62,753 nuggets of geo-located and categorised crime ready for you to look at. You can download a rar file (50 MB) here. To use the data, you have to get a free copy of SQL Server Express and mount the database (Google will tell you how).

I’m afraid I won’t be able to participate in many of the activities preceding the actual competition, but I can’t wait to see what people come up with!

US Embassy Cables Related to Denmark

As you may know, Wikileaks has released the full, un-redacted database of US Embassy cables. A torrent file useful for downloading all the data is available from Wikileaks, at the bottom of this page. It’s a PostgreSQL data dump. Danish journalists seem to be completely occupied producing vacuous election coverage, so to help out, I’ve filtered out the Denmark-related cables and am making them available as Google Spreadsheets/Fusiontables.

The first set (link) are cables (146 in all) from the US Embassy in Copenhagen, with all the “UNCLASSIFIED” ones filtered out (since they are typically trivial, if entertaining in their triviality). Here’s the query:

-- Export Copenhagen embassy cables that are not marked UNCLASSIFIED, newest first.
copy (
	select * 
	from cable 
	where origin = 'Embassy Copenhagen' 
		and classification not like '%UNCLASSIFIED%'
	order by date desc)
to 'C:/data/cph_embassy_confidential.csv' with csv header;

The second set (link), at 1438 rows, contains cables that mention either “Denmark” or “Danish”, come from embassies other than the one in Copenhagen and are not “UNCLASSIFIED”. Query:


-- Export cables from other embassies that mention Denmark and are not marked UNCLASSIFIED.
copy (
	select * 
	from cable 
	where origin != 'Embassy Copenhagen' 
		and classification not like '%UNCLASSIFIED%'
 		and (
 			content like '%Danish%' or
 			content like '%Denmark%'
 		)
	order by date desc
)
to 'C:/data/not_cph_embassy_confidential.csv' 
	with csv header 
	force quote content
	escape '"';
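If you would rather work on rows you have already loaded than in PostgreSQL, the second filter can be approximated in Python. This is a sketch: the column names (`origin`, `classification`, `content`) come from the queries above, while the sample rows are invented for illustration. Like `like` in PostgreSQL, the substring tests are case-sensitive.

```python
def is_denmark_related(row: dict) -> bool:
    """Mirror the second query: not Copenhagen, not unclassified, mentions Denmark."""
    return (
        row["origin"] != "Embassy Copenhagen"
        and "UNCLASSIFIED" not in row["classification"]
        and ("Danish" in row["content"] or "Denmark" in row["content"])
    )

# Invented sample rows, only to exercise the filter.
rows = [
    {"origin": "Embassy Stockholm", "classification": "CONFIDENTIAL",
     "content": "Meeting with the Danish delegation."},
    {"origin": "Embassy Copenhagen", "classification": "SECRET",
     "content": "Denmark and the Arctic."},
    {"origin": "Embassy Oslo", "classification": "UNCLASSIFIED//FOR OFFICIAL USE ONLY",
     "content": "Trade with Denmark."},
    {"origin": "Embassy Berlin", "classification": "CONFIDENTIAL",
     "content": "No relevant mention here."},
]

matches = [row for row in rows if is_denmark_related(row)]
print(len(matches), "of", len(rows), "sample rows match")
```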

Facebook Open Graph at ekstrabladet.dk

(This post is a straight-up translation from Danish of a post on the Ekstra Bladet development blog)

Right before the 2010 World Cup started, ekstrabladet.dk (the Danish tabloid where I work) managed to get an interesting implementation of the new Facebook Open Graph protocol up and running. This blog post describes what this feature does for our users and what possibilities we think Open Graph holds. I will write a post detailing the technical side of the implementation shortly.

The Open Graph protocol involves adding mark-up to pages on your site so that Facebook users can ‘like’ them in the same way that you can like fan-pages on Facebook. A simple Open Graph implementation for a news-website might involve markup-additions that let users like individual articles, sections and the frontpage. We went a bit further and our readers can now ‘like’ the 700-800 soccer players competing in the World Cup. The actual liking works by hovering over linkified player-names in articles. You can try it out in this article (which tells our readers about the new feature, in Danish) or check out the action-shot below.
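To give a rough idea of what the mark-up looks like, here is a minimal Python sketch that renders the four basic Open Graph properties (`og:title`, `og:type`, `og:url`, `og:image`) as meta tags for a player page. It is not our production code: the function name, the example.com URLs and the `athlete` type value are assumptions chosen for illustration.

```python
from html import escape

def open_graph_tags(title: str, og_type: str, url: str, image: str) -> str:
    """Render the basic Open Graph properties as HTML meta tags."""
    props = {"og:title": title, "og:type": og_type, "og:url": url, "og:image": image}
    return "\n".join(
        f'<meta property="{name}" content="{escape(value, quote=True)}" />'
        for name, value in props.items()
    )

# Placeholder URLs; a real page would use its own canonical address and image.
print(open_graph_tags(
    title="Nicklas Bendtner",
    og_type="athlete",
    url="http://example.com/players/nicklas-bendtner",
    image="http://example.com/images/nicklas-bendtner.jpg",
))
```

The tags go in the head of the page that represents the player, which is what lets Facebook treat that page as a likeable object.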

When a reader likes a player, Facebook sticks a notice in that user’s feed, similar to the ones you get when you like normal Facebook pages. The clever bit is that we at Ekstra Bladet can now, using the Facebook publishing API, automatically post updates to Facebook users that like soccer players on ekstrabladet.dk. For example “Nicklas Bendtner on ekstrabladet.dk” (a Danish striker) will post an update to his fans every time we write a new article about him, and so will all the other players. Below, you can see what this looks like in people’s Facebook feeds (in this case it is Lionel Messi posting to his fans).

Behind the scenes the players are stored using a semantic-web/linked-data datastore so that we know that the “Lionel Messi” currently playing for the Argentinian National Team is the same “Lionel Messi” that will be playing for FC Barcelona in the fall.

Our hope is that we can use the Open Graph implementation to give our readers prompt news about stuff they like, directly in their Facebook news feeds. We are looking at what other use we can make of this privileged access to users’ feeds. One option would be to give our users selected betting suggestions for matches that involve teams they are fans of (this would be similar to a service we currently provide on ekstrabladet.dk).

We have already expanded what our readers can like to bands playing at this year’s Roskilde Festival (see this article) and we would like to expand further to stuff like consumer products, brands and vacation destinations. We could then use access to liking users’ feeds to advertise for those products or do affiliate marketing (although I have to check Danish law and Facebook terms before embarking on this). In general, Facebook Open Graph is a great way for us to learn about our readers’ wants and desires and it is a great channel for delivering personalized content in their Facebook feeds.

Are there no drawbacks? Certainly. We haven’t quite figured out how to best let our readers like terms in article texts. Our first try involved linkifying the terms and sticking a small Facebook thumb-icon after each one. Some of our users found that this ruined the reading experience, however (even if you don’t know Danish, you might be able to catch the meaning of the comments below this article). Now the thumb is gone, but the blue link remains. As a replacement for the thumb, we are contemplating adding a box at the bottom of articles, listing the terms used in that article for easy liking.

Another drawback is the volume of updates we are pushing to our users. During the World Cup we might write 5 articles with any one player appearing in them over the course of a day and our readers may be subscribed to updates from several players. Facebook does a pretty good job of aggregating posts, but it is not perfect. We are contemplating doing daily digests to avoid swamping people’s news feeds.
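A daily digest could be as simple as grouping the day’s articles per player and posting once per group. The Python sketch below shows only that grouping step; it is not something we have built, and the article structure and headlines are invented for illustration.

```python
from collections import defaultdict
from datetime import date

def daily_digests(articles):
    """Group article titles by (player, publication date): one digest per group."""
    digests = defaultdict(list)
    for article in articles:
        for player in article["players"]:
            digests[(player, article["published"])].append(article["title"])
    return dict(digests)

# Invented headlines, only to exercise the grouping.
articles = [
    {"title": "Messi scores twice", "players": ["Lionel Messi"],
     "published": date(2010, 6, 17)},
    {"title": "Messi and Bendtner compared",
     "players": ["Lionel Messi", "Nicklas Bendtner"],
     "published": date(2010, 6, 17)},
    {"title": "Bendtner injured", "players": ["Nicklas Bendtner"],
     "published": date(2010, 6, 18)},
]

for (player, day), titles in sorted(daily_digests(articles).items()):
    print(f"{day} {player}: {len(titles)} article(s)")
```

Each group would then become a single post to that player’s fans instead of one post per article.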

A third drawback is that it is not Ekstra Bladet that is accumulating information about our readers, but Facebook. Even though we are pretty good at reader identity via our “The Nation!” initiative, we have to recognize that the audience is much larger when using Facebook. Using Facebook also gives us superb access to readers’ social graphs and news feeds, something we could likely not build ourselves. A mitigating factor is that Facebook gives us pretty good APIs for pulling information about how readers interact with content on our site.

Stay tuned if you want to know more about our Facebook Open Graph efforts.

News Essay

(This summer I applied for the “Coders Wanted” Knight Foundation Scholarship at the Medill School of Journalism. In case anyone’s interested, I’m uploading the essays I wrote for my application.)

Question: In the new media landscape, it’s possible for anyone to do the things that professional journalists do: for instance, dig up information other people are interested in, shoot photos or video of newsworthy events, and publish their work for others to see. What is the role of the professional journalist in this world where anyone can publish? What should be the relationship between the professionals and “citizen journalists”?

The relationship between citizen and professional journalist was put to the test by media coverage of the protests and disturbances that followed in the wake of the recent Iranian election. Most professional journalists and photographers from western media had been ejected from Iran or were in other ways prevented from filing stories from the country. A lot of media coverage ended up being built on information from Iranian bloggers and Twitter-users and from videos posted to YouTube and similar online services.

While Iran is certainly an extreme case, events there underscore the trend that breaking news coverage is increasingly handled by so-called “citizen journalists”. There are several reasons why professional journalists are less often present when stuff happens. First, repressive governments, aware of the explosive role of media, may simply ban journalists. This was what happened in Iran and also, arguably, in Gaza in 2008-2009. Second, news organizations today do not have the resources to support a large and dispersed network of journalists deployed around the world. Last, the probability that a citizen will be on the spot with an Internet-connected videophone when something newsworthy happens is just much greater than that of a news-team being nearby.

Because “breaking news” and setting the agenda with “exclusive” stories has traditionally been a point of competition among professional journalists, this development seems to be causing cases of Twitter-envy at some news-organizations. The result can be thinly sourced news based on random tweets and un-dated YouTube videos. Indeed, the news-hounds on Twitter and other places demand that journalists pick up these stories, as coverage on, say, CNN is considered a validation of the seriousness of what is going on. Journalists and editors at CNN caught an earful from Twitter users and bloggers for a perceived lack of Iran-coverage immediately after the election, in spite of there being precious little verifiable information to report at the time.

In my opinion, it would behoove journalists and editors to refrain from propagating largely unsubstantiated news found on social media platforms, even when (as is typically done now) they are presented with large disclaimers. The trouble with these stories is that they add value for no one: The news-junkie with an interest in the topic at hand will invariably already be well informed, while the casual observer will only understand that, apparently, someone on Twitter is writing about an event alleged to have happened just now. Worse, even with disclaimers (and partly because of them), the credibility of professional news organisations suffers when some stories turn out to be false or outright scams.

This is not to say that professionals cannot draw on citizen journalists when piecing together stories. A foreign correspondent analyzing the situation in Iran could very well discuss third-hand reports from citizens, but in a critical manner and not, on its own, as the primary source. Reports could also be carefully augmented with video and pictures shot by citizens. This critical and cautious approach may sometimes be construed by opinionated citizens as professional journalists’ arrogance and aloofness. To avoid this, professionals should reach out and educate about their need for credibility and their commitment to fair and balanced reporting.

The optimal relationship would have professional journalists who are continually being held to account by engaged citizens who, on the other hand, are encouraged by the same journalists to file credible and (if possible) verifiable photos, videos and eyewitness accounts. This will give media users access to a range of news-sources, from reasoned analysis by journalists, corroborated and augmented by citizen reports, to drinking straight from the pipe of raw and opinionated coverage flowing out of Twitter, YouTube or whatever other platform is in vogue. Some professional journalists already embrace this development and I think Rick Sanchez of CNN says it particularly well when defending that channel’s Iran-coverage in the last 30 seconds of this clip.

An excellent example of citizen and professional journalists working together is found in a recent unravelling of a string of cases of medical malpractice in Denmark. Two journalists were contacted by a couple whose infant child had died some time after swallowing a battery. The parents had pleaded with doctors to examine their child, to no avail. Their complaint about the lack of treatment had also been turned down (Denmark has a single-provider health care system which is sometimes not very receptive to criticism). The journalists saw they had a powerful and emotional story, but wanted to find out if it was part of a trend or just a lone case. To that end, they created a Facebook group where people could volunteer similar stories. This unearthed a string of malpractice cases where complaints had also fallen on deaf ears. These were duly investigated and yielded a series of articles on medical negligence and ignored complaints. The journalists continued to use the Facebook group as a sounding board for new article angles and ideas and for soliciting feedback. Investigating and building this sort of story, while not impossible, would certainly have been very time consuming without the active participation of involved citizens.

While labeling people volunteering stories on a Facebook group “citizen journalists” may be a bit thick, they do form part of a continuum that extends over Twitter-users and YouTubers to bloggers. In the end, the professional journalists could write a string of explosive articles, citizens got their previously ignored stories told and all Danes will hopefully get better health care as a result.

What is the role then of the professional journalist confronted with wired, media-savvy and outspoken citizens? Journalists should insist on their commitment to provide fair and balanced reporting with integrity, even in the face of demands for speedy coverage of events that may or may not be breaking right now. They should also reach out and tap into the wealth of information and opinion provided by citizen journalists and use it to augment and improve the stories they create.

Knight Foundation Scholarship Essay

(This summer I applied for the “Coders Wanted” Knight Foundation Scholarship at the Medill School of Journalism. In case anyone’s interested, I’m uploading the essays I wrote for my application.)

Question: How do journalism and technology relate to one another in the digital age?

Technology relates to journalism in two different ways: It is a topic of coverage (“science journalism”) and a driver of change. The subject of science and tech journalism is an interesting one, but this essay will focus on technology as an enabler and driver of change in the practice of journalism.

Ever since the invention of movable type, technological progress has gradually decreased the amount of money and time required to distribute information. The advent of digital technology has lowered the cost to (almost) zero and made distribution instantaneous. As Chris Anderson argues in his recent book “FREE”, this final drop to zero marks a discontinuity and it has some profound implications.

The speed and ease of digital publishing now makes it possible for everyone to write news reports, shoot photos and record video of news events — endeavours that used to be the exclusive privilege of journalists and photographers. The Internet has also greatly increased the scope for reader feedback and debate on stories created by traditional journalists. Taken together, this has led to an interesting integration of newsgathering where professional and so-called “citizen” journalists collaborate and compete to dig up, investigate and publish news.

An extreme example of this is the recent attempt by The Guardian (a British newspaper) to make sense of UK parliament members’ expense claims. The expense records were released under a freedom of information request as more than 2 million scanned documents. To investigate these, the newspaper enlisted its readers (and the Internet at large) to wade through the documents, sift out the interesting claims, and determine the amounts and exactly what items were claimed.

The Internet has led to the development of a range of interesting platforms, similar to the one mentioned, where journalism-related activities are taking place even outside of the confines of traditional media organizations. The author, for example, has created a web site called Folkets Ting (“People’s Parliament”) which, in the tradition of sites like OpenCongress (US) and The Public Whip (UK), makes legislation, votes and debates from the Danish parliament available for public scrutiny and debate. It used to be the responsibility of journalists to hold elected politicians to account, but tools like these enable interested citizens to join in. It is the author’s hope that such sites will increase the scope of debate beyond the often narrow attention span of traditional media and lead to a greater breadth of opinion being voiced (even if the result is also likely to be a lot messier).

Unfortunately, digital technology and the Internet have also seriously undermined the business model of many traditional media companies. The decline of newspapers is a particular worry, partly because theirs has been such a rapid fall (several renowned American newspapers have already shut down and more are teetering on the brink of bankruptcy), partly because they seem to play an outsize role in digging up and investigating agenda-setting stories that other types of media then pick up.

The traditional newspaper business model was based on the fact that printing technology was expensive and building a subscriber-base required time and large investments. After these had been secured, however, the newspaper could make a mint on classifieds and other ads, and the revenue then subsidized newsroom activities. The Internet rudely killed off this model because there is now nothing stopping sites like Craigslist and eBay from just publishing classifieds (and auctions) to large audiences without donating the proceeds to deserving journalists.

Publishers have variously called on readers, governments and Google to do something, “do something” usually meaning “give us more money” in some shape or form. News has become a commodity that readers in most cases are unwilling to pay for. A large decline in journalism may represent a failure of the market warranting government intervention, but it is a path fraught with danger. Demanding that money be redistributed from a successful part of the value chain looks like zero-sum thinking and reveals an unwillingness to reconsider one’s own business. It is the opinion of this aspiring journalist (and of Chris Anderson) that the old business model, or something like it, is unlikely to return.

What, then, of journalism? Some forms (business coverage most prominently) are prospering in spite of the Internet. Other forms may shrink somewhat or find themselves augmented or supplanted by enthusiastic citizen journalists using technology and global connectivity to their advantage. An area such as public oversight of politicians and institutions could expand greatly if good tools for improving transparency and reporting are developed.

The author believes that journalism in the digital age is more exciting than ever. To be sure, there are challenges to overcome, but the advantages are many: Journalists can reach wider audiences, both faster and cheaper, and they can involve, solicit feedback from and collaborate with more people than at any time before. The author can’t wait to develop the platforms and systems that will form the foundations of new kinds of digital journalism, and hopes, with the help of the Knight Foundation, to get a chance to do so at Medill.