Michael Friis' Blog – Page 5 – A Michael Friis Production

Screen scraping with WatiN

May 8, 2012

This post describes how to use WatiN to screen scrape web sites that don’t want to be scraped. WatiN is generally used to instrument browsers to perform integration testing of web applications, but it works great for scraping too.

Screen scraping websites can range in difficulty from very easy to extremely hard. When encountering hard-to-scrape sites, the typical cause of difficulty is fumbling incompetence on the part of the people that built the site to be scraped. Every once in a while however, you’ll encounter a site openly displaying data to the casual browser, but with measures in place to prevent automatic scraping of that data.

The Danish Patent and Trademark Office is one such site. The people there maintain a searchable database that lets you search and peruse Danish and international patents. Unfortunately, computers are not allowed. If one tries to issue HTTP POST to the resource that generally performs searches and shows patents, an error is returned. If one emulates visiting the site with a real browser by providing a browser-looking User Agent setting, collecting cookies etc. (for example by using a tool like SimpleBrowser), the site sends a made-up 999 HTTP response code and the message “No Hacking”.

Faced with such an obstruction, there are two avenues of attack:

Break out Wireshark or Fiddler and spend a lot of time figuring out what it takes to fabricate requests that fools the site into thinking they originate from a normal browser and not from your bot
Instrument an actual browser so that the site will have no way (other than timing analysis and IP address request rate limiting) of knowing whether requests are from a bot or from a normal client

The second option turns out to be really easy because people have spent lots of time building tools for automatically testing web applications using full browsers, tools like WatiN. For example, successfully scraping the Danish Patent Authorities site using WatiN is as simple as this:

private static void GetPatentsInYear(int year)
{
	using (var browser = new IE("http://onlineweb.dkpto.dk/pvsonline/Patent"))
	{
		// go to the search form
		browser.Button(Find.ByName("menu")).ClickNoWait();

		// fill out search form and submit
		browser.CheckBox(Find.ByName("brugsmodel")).Click();
		browser.SelectList(Find.ByName("datotype")).Select("Patent/reg. dato");
		browser.TextField(Find.ByName("dato")).Value = string.Format("{0}*", year);
		browser.Button(Find.By("type", "submit")).ClickNoWait();
		browser.WaitForComplete();

		// go to first patent found in search result and save it
		browser.Buttons.Filter(Find.ByValue("Vis")).First().Click();
		GetPatentFromPage(browser, year);

		// hit the 'next' button until it's no longer there
		while (GetNextPatentButton(browser).Exists)
		{
			GetNextPatentButton(browser).Click();
			GetPatentFromPage(browser, year);
		}
	}
}

private static Button GetNextPatentButton(IE browser)
{
	return browser.Button(button =>
		button.Value == "Næste" && button.ClassName == "knapanden");
}

Note that in this example, we’re using Internet Explorer because it’s the easiest to setup and use (WatiN also works with Firefox, but only older versions). There’s definitely room for improvement, in particular it’d be interesting to explore parallelizing the scraper to download patents faster. The – still incomplete – project source code is available on Github. I’ll do a post shortly on what interesting data can be extracted from Danish patents.

Raw updated data on Danish business leader groups

March 18, 2012

Last summer, I published data on the members of Danish business leader groups, obtained with code written while I was still at Ekstra Bladet. I’ve cleaned up the code and removed the parts that fetched celebrities from various other obscure sources. You can fork the project on Github.

The code is fairly straightforward. The scraper itself is less than 150 loc. The scraper is configured to be run in a background worker on AppHarbor and will conduct a scrape once a month (I don’t know how often the VL-people update their website, but monthly updates seems sufficient to keep track of coming and goings). The resulting data can be fetched using a simple JSON API. You can find a list of scraped member-batches here (there’s just one at the time of writing). Hitting http://vlgroups.apphb.com/Member will always net you the latest batch.

I was motivated to revisit the code after this week’s dethroning of Anders Eldrup from his position as CEO of Dong Energy. Anders Eldrup sits in VL-gruppe 1, the most prestigious one. Let’s see if he’s still there next time the scraper looks. 14 other Dong Energy executives are members of other groups, although interestingly, Jakob Baruël Poulsen (Eldrup’s handsomely rewarded sidekick) is nowhere to be found. I think data like this in an important piece of the puzzle to figure out what relations exist between business leaders in Denmark and the Anders Eldrup debacle demonstrates why keeping track is important.

Nordic Newshacker

February 26, 2012

The excellent people at the Danish newspaper Information are hosting a competition to promote data journalism. It’s called “Nordisk Nyhedshacker 2012“. Data journalism was what I spent some of my time at Ekstra Bladet doing, and the organizers have been kind enough to put me on the jury. The winner will get a scholarship to go work at The Guardian for a month, sponsored by Google. Frankly, I’d prefer working at Information, but I guess The Guardian will do. If you’re a journalist that can hack or if you’re hacker interested in using your craft to make people more informed about the world we live in, you should use this opportunity to come up with something interesting and be recognized for it.

Hopefully, you already have awesome ideas for what to build. Should you need some inspiration, here a few interesting pieces of data you might want to consider (projects using this data will not be judged differently than others).

Examine the US Embassy Cables released by Wikileaks. I’ve tried to filter out the ones related to Denmark.
Examine the power relationships of members of Danish business leader groups. I have extracted the membership info from their web site. It’d be extra interesting if you combine this information with data about who sits on the boards of big Danish companies, perhaps to make the beginnings of something like LittleSis so that we can keep track of what favours those in power are doing each other.
Do something interesting with the CVR database of Danish companies that was leaked on The Pirate Bay last year.
Ekstra Bladet has been kind enough to let me open source the code for the award-winning Krimikort (Crime Map) I built while working there. It’s not quite ready to be released yet, but we’re making the current data available now. There’s 62,753 nuggets of geo-located and categorised crime ready for you to look at. You can download a rar file (50 MB) here. To use the data, you have to get a free copy of SQL Server Express and mount the database (Google will tell you how).

I’m afraid I won’t be able be participate in many of the activities preceding the actual competition but I can’t wait to see what people come up with!

US Embassy Cables Related to Denmark

September 3, 2011

As you may know, Wikileaks has released the full, un-redacted database of US Embassy cables. A torrent file useful for downloading all the data is available from Wikileaks, at the bottom of this page. It’s a PostgreSQL data dump. Danish journalists seem to be completely occupied producing vacuous election coverage, so to help out, I’ve filtered out the Denmark-related cables and are making them available as Google Spreadsheets/Fusiontables.

The first set (link) are cables (146 in all) from the US Embassy in Copenhagen, with all the “UNCLASSIFIED” ones filtered out (since they are typically trivial, if entertaining in their triviality). Here’s the query:

copy (
	select * 
	from cable 
	where origin = 'Embassy Copenhagen' 
		and classification not like '%UNCLASSIFIED%'
	order by date desc)
to 'C:/data/cph_embassy_confidential.csv' with csv header

The second set, at 1438 rows, (link) mention either “Denmark” or “Danish”, are from embassies other than the one in Copenhagen and are not “UNCLASSIFIED”. Query:


copy (
	select * 
	from cable 
	where origin != 'Embassy Copenhagen' 
		and classification not like '%UNCLASSIFIED%'
 		and (
 			content like '%Danish%' or
 			content like '%Denmark%'
 		)
	order by date desc
)
to 'C:/data/not_cph_embassy_confidential.csv' 
	with csv header 
	force quote content
	escape '"'

Members of Danish VL Groups

August 28, 2011

Denmark has a semi-formalised system of VL-groups. “VL” is short for “Virksomhedsleder” which translates to “business leader”. The groups select their own members, and the whole thing is organised by the Danish Society for Business Leadership. The groups are not composed only of business people — top civil servants and politicians are also members. The groups meet up regularly to talk about whatever people from those walks of life talk about when they get together.

Before doing what I currently do, I worked for Ekstra Bladet, a Danish tabloid. Other than giving Danes their daily dose of sordidness, Ekstra Bladet spends a lot of time holding Denmarks high’n-mighty to account. To that end, I worked on building a database of influential people and celebrities so that we could automatically track when their names crop in court documents and other official filings (scared yet, are we?). The VL-group members obviously belong in this database. Fortuitously, group membership is published online and is easily scraped.

In case you are interested, I’ve created a Google Docs Spreadsheet with the composition of the groups as of August 2011. I’ve included only groups in Denmark proper — there are also overseas groups for Danish expatriates and groups operating in the Danish North Atlantic colonies. The spreadsheet (3320 members in all) is embedded at the bottom of this post.

Now, with this list in hand, any well-trained Ekstra Bladet employee will be brainstorming what sort of other outrage can be manufactured from the group membership data. How about looking at the gender distribution of the members? (At this point I’d like to add a disclaimer: I personally don’t care whether the VL-groups are composed primarily of men, women or transgendered garden gnomes so I dedicate the following to Trine Maria Kristensen. Also, an Ekstra Bladet journalist wrote this story up some months after I left, but I wanted to make the underlying data available).

To determine the gender of each group member, I used the Department of Family Affairs lists of boys and girls given names (yes, the Socialist People’s Kingdom of Denmark gives parents lists of pre-approved names to choose from when naming their children). Some of the names are ambigious (eg. Kim and Bo are permitted for both boys and girls). For these names, the gender-determinitation chooses what I deem to be the most common gender for that name in Denmark.

Overall, there are 505 females out of 3320 group members (15.2%). 8 groups of 95 have no women at all (groups 25, 28, 52, 61, 63, 69, 104 and 115). 12 groups include a single woman, while 6 have two. There is also a single all-female troupe, VL Group 107.

Please take advantage of the data below to come up with other interesting analysis of the group compositions.

Non-trivial Facebook FQL example

November 11, 2010

This post will demonstrate a few non-trivial FQL calls from Javascript, including batching interdependent queries in one request. The example queries all participants of a public Facebook event and gets their names and any public status updates they’ve posted recently. It then goes on to find all friend-relations between the event participants and graphs those with an InfoVis Hypertree. I haven’t spent time on browser-compatibility in result-rendering (sorry!), but the actual queries work fine across browsers. You can try out the example here. The network-graph more-or-less only works in Google Chrome.

The demo was created for a session I did with Filip Wahlberg at the New Media Days conference. The session was called “Hack it, Mash it” and involved us showing off some of the stuff we do at ekstrabladet.dk and then demonstrating what sort of info can be pulled from Facebook. Amanda Cox was on the next morning and pretty much obliterated us with all the great interactive visualizations the New York Times builds, but that was all right.

Anyway, on to the code. Here are the three queries

var eventquery = FB.Data.query(
	'select uid from event_member ' +
	'where rsvp_status in ("attending", "unsure") ' +
		'and eid = 113151515408991 '
);

var userquery = FB.Data.query(
'select uid, name from user ' +
'where uid in  ' +
	' (select uid from {0})', eventquery
);

var streamquery = FB.Data.query(
	'select source_id, message from stream ' +
	'where ' +
	'updated_time > "2010-11-04" and ' +
	'source_id in ' +
		'(select uid from {0}) ' +
	'limit 1000 '
	, eventquery
);

FB.Data.waitOn([eventquery, userquery, streamquery],
	function () {
		// do something interesting with the data
	}
);

Once the function passed to waitOn executes, all the queries have executed and results are available. The neat thing is that FB.Data bundles the queries so that, even though the last two queries depend on the result of the first one to execute, the browser only does one request. Facebook limits the number of results returned from queries on the stream table (which stores status updates and similar). Passing a clause on ‘updated_time’ seems to arbitrarily increase this number.

So now that we have the uid’s of all the attendees, how do we get the friend-relations between those Facebook users? Generally, Facebook won’t give you the friends of a random user without your app first getting permission from said user. Facebook will tell you whether any two users are friends and this is done by querying the friend table. So I wrote this little query which handily gets all the relations in a set of uids. Assume you’ve stored all the uids in an array:

var uidstring = uids.join(",");
var friendquery = FB.Data.query(
	'select uid1, uid2 ' +
	'from friend ' +
	'where uid1 in ({0}) and uid2 in ({0})'
	, uidstring
);

FB.Data.waitOn([friendquery], function () {
	// do something with the relations, like draw a graph
});

Neat huh? The full script can be found here: http://friism.com/wp-content/uploads/2010/11/nmdscript.js

Wikileaks Iraq wardiaries data quality

October 23, 2010

til;dr: The Wikileaks Iraq data is heavily redacted (by Wikeleaks presumably) compared to the Afghanistan data: Names — of persons, bases, units and more — have been purged from the “Title” and “Summary” column-texts and the precision of geograpical coordinates have been truncated. This makes both researching and visualizing the Iraq data somewhat difficult.

(this is a cross-post from the Ekstra Bladet Bits blog)

Ekstra Bladet received the Iraq data from Wikileaks some time before the Friday 22. 23:00 (DK-time) embargo. We knew the dump was going to be in the exact same format as the Afghanistan one, so loading the data was a snap. When we started running some of the same research-scripts used on the Afghanistan data, it quickly became clear that something was amiss however. For example, we could only find a single report mentioning Danish involvement (namely the “Danish Demining Group”) in the Iraq War. We had drawn up a list persons, companies and places of interest, but searches for these also turned up nothing. A quick perusal of a few sample reports revealed that almost all identifying names have been purged from report texts.

Update: It turns out that Ekstra Bladet got the redacted version of the from Wikileaks. Apparently some 6 international news organisations (and the Danish newspaper Infomation) got the full, unredacted data. They won’t be limited in the ways mentioned below.

This caused us to temporarily abandon the search for interesting individual events and instead try to visualize the events in aggregate using maps. I had readied a heatmap tile-renderer which — when fed the Afghanistan data — produces really nice zoomable heatmaps overlayed on Google Maps. When loaded with the Iraq data however, the heatmap tiles had strange artifacts. This turns out to be because the report geo-coordinate-precision has been truncated. We chose not to publish the heatmap, but the effect is also evident on this Google Fusion-tables based map of IED-attacks (article text in Danish). The geo-precision truncation makes it impossible to produce something like the Guardian IED heatmap, demonstrating IED-attacks hugging roads and major cities.

We did manage to produce some body count-based articled before the embargo. Creating simple infographics showing report- and attack-frequency over time is also possible. Looking at the reports, it is also fairly easy to establish that Iraqi police mistreated prisoners. Danish soldiers are known to have handed over prisoners to Iraqi police (via British troops), making this significant in a Danish context. We have — however — not been able to use the reports to scrutinize the Danish involvement in the Iraq war in the same depth that we could with the Afghanistan data.

We initially thought the redactions were only for the pre-embargo data dump and that an unredacted dataset might become available post-embargo. That seems not to be the case though, since the reports Wikileaks published online after the embargo are also redacted.

I’m not qualified to say whether the redactions in the Iraq reports are necessary to protect the individuals mentioned in them. It is worth noting that the Pentagon itself found that no sources were revealed by the Afghanistan leak. The Iraq-leak is great ressource for documenting the brutality of the war there, but the redactions do make it difficult to make sense of individual events.

Facebook Open Graph at ekstrabladet.dk

June 29, 2010

(This post is a straight-up translation from Danish of a post on the Ekstra Bladet development blog)

Right before the 2010 World Cup started, ekstrabladet.dk (the Danish tabloid where I work) managed to get an interesting implementation of the new Facebook Open Graph protocol up and running. This blog post describes what this feature does for our users and what possibilities we think Open Graph holds. I will write a post detailing the technical side of the implementation shortly.

The Open Graph protocol involves adding mark-up to pages on your site so that Facebook users can ‘like’ them in the same way that you can like fan-pages on Facebook. A simple Open Graph implementation for a news-website might involve markup-additions that let users like individual articles, sections and the frontpage. We went a bit further and our readers can now ‘like’ the 700-800 soccer players competing in the World Cup. The actual liking works by hovering over linkified player-names in articles. You can try it out in this article (which tells our readers about the new feature, in Danish) or check out the action-shot below.

When a reader likes a player, Facebook sticks a notice in that users feed, similar to the ones you get when you like normal Facebook pages. The clever bit is that we at Ekstra Bladet can now — using the Facebook publishing API — automatically post updates to Facebook users that like soccer players on ekstrabladet.dk. For example “Nicklas Bendtner on ekstrabladet.dk” (a Danish striker) will post an update to his fans every time we write a new article about him, and so will all the other players. Below, you can see what this looks like in peoples Facebook feeds (in this case it is Lionel Messi posting to his fans).

Behind the scenes the players are stored using a semantic-web/linked-data datastore so that we know that the “Lionel Messi” currently playing for the Argentinian National Team is the same “Lionel Messi” that will be playing for FC Barcelona in the fall.

Our hope is that we can use the Open Graph implementation to give our readers prompt news about stuff they like, directly in their Facebook news feeds. We are looking at what other use we can make of this privileged access to users feeds. One option would be to give our users selected betting suggestions for matches that involve teams they are fans of (this would be similar to a service we currently provide on ekstrabladet.dk).

We have already expanded what our readers can like to bands playing at this years Roskilde Festival (see this article) and we would like to expand further to stuff like consumer products, brands and vacation destinations. We could then use access to liking users feeds to advertise for those product or do affiliate marketing (although I have to check Danish law and Facebook terms before embarking on this). In general, Facebook Open Graph is a great way for us to learn about our readers’ wants and desires and it is a great channel for delivering personalized content in their Facebook feeds.

Are there no drawbacks? Certainly. We haven’t quite figured out how to best let our readers like terms in article texts. Our first try involved linkifying the terms and sticking a small Facebook thumb-icon after it. Some of our users found that to ruin the reading experience however (even if you don’t know Danish, you might be able to catch the meaning of the comments below this article). Now the thumb is gone, but the blue link remains. As a replacement for the thumb, we are contemplating adding a box at the bottom of articles, listing the terms used in that article for easy liking.

Another drawback is the volume of updates we are pushing to our users. During the World Cup we might write 5 articles with any one player appearing in them over the course of a day and our readers may be subscribed to updates from several players. Facebook does a pretty good job of aggregating posts, but it is not perfect. We are contemplating doing daily digests to avoid swamping peoples news feeds.

A third drawback is that it is not Ekstra Bladet that is accumulating information about our readers, but Facebook. Even though we are pretty good at reader identity via our “The Nation!” initiative, we have to recognize that the audience is much larger when using Facebook. Using Facebook also gives us superb access to reader social graph and news feeds, something we could likely not built ourselves. A mitigating factor is that Facebook gives us pretty good APIs for pulling information about how readers interact with content on our site.

Stay tuned if you want to know more about our Facebook Open Graph efforts.

Roskilde Festival 2010 Schedule as XML

June 12, 2010

@mortenjust and @claus have created the excellent Roskilde Festival Pocket Schedule Generator. They gave me access to their schedule data, and I’ve used that to scrape more tidbits from the Roskilde Festival website. Fields are:

Name (all caps)
Stage where band plays
Time of performance (in UNIX and regular datetime)
Roskilde Festival website URL
Countrycode
Myspace URL
Band website URL
Picture URL
Video-embed-html
Tag-line

Get it here: roskilde2010.xml.zip

Google sampled my voice and all I got was this lousy T-shirt!

June 2, 2010

I’ve just submitted a voice-sample to help Google in their efforts to build Danish-language voice search. See what voice search is about in this video. In case anyone is interested, here’s how Google goes about collecting these samples.
The sampling was carried out by a Danish-speaker hired by Google for the specific task. The sampling was done in a crowded Copenhagen coffee-shop (Baresso at Strøget, near Googles Copenhagen sales office) with people talking, coffee-machines hissing and music in the background. This is likely to ensure that samples are collected in an environment similar to the one where voice search will be used.

The samples were recorded on a stock Google Nexus One using an Android application called “dataHound”. The sampling basically involved me reading 500 random terms, presumably search terms harvested from Google searches. Most were one-word phrases but there some multi-word ones too (this likely reflects the fact that most users only search using single words). The Googler said that it was due to the sensitive nature of these terms (and the risk of harvesting presumably) that the sampling had to be carried out in-person. Google apparently requires 500 of these 500-word samples to form a language-corpus (I was number 50).

The dataHound app displayed the term to be spoken at the top with a bunch of buttons at the bottom. One button advanced the app to the next term, one could be pressed if the term was completely unintelligible and one could be used if the term was offensive to you and you did not want to say it out loud (I had no such qualms). The interface was pretty rough but the app was fast.

The terms were all over the place. I work for Ekstra Bladet (a Danish tabloid) and noted our name cropped up twice. “Nationen” (our debate sub-site) showed up once. Other Danish media sites were well represented and there were many locality-relevant searches. There were also a lot of domain-names, presumably Google expect people to use Google Voice Search over typing in a url themselves (indeed, people already do this on google.com).

Among the terms were also “Fisse” (the Danish word for “cunt”), “tisse kone” (a more polite synonym for female genitals), “ak-47” and “aluha ackbar”. If Google prompts you to say “cunt” in a public place, how can you refuse?

The googler told me that she’s looking for more volunteers, so drop her a line of you speak Danish and live in Copenhagen: [email protected]. Plus, you get a Google T-shirt for your efforts!

Older Posts Newer Posts