Michael Friis' Blog

About


Exchange Rate data

As part of our ongoing efforts at making sense of the Tenders Electronic Daily procurement contracts, I had to get hold of historical exchange rates to convert the values of all the contracts into a comparable form. Professor Werner Antweiler at The University of British Columbia maintains a very impressive, free database of exactly this data. He doesn’t let you export it in (great) bulk unfortunately. I wrote a small script to get the monthly data for the currencies I wanted, the important parts (in C#) are included below. Note that the site may throttle you. Also, please don’t use this to try to scrape all the data and republish it, or in other ways make a fool of yourself.

string url = "http://fx.sauder.ubc.ca/cgi/fxdata";
// this uses Euros as the base currency
string requeststring =
	string.Format(
	"b=EUR&c={0}&rd=&fd=1&fm=1&fy=2003&ld=31&lm=12&ly=2008&y=monthly&q=volume&f=csv&o=",
	"YOURCURRENCY");

HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);

req.ContentType = "application/x-www-form-urlencoded";
req.Expect = null;
req.Method = "Post";

byte[] reqData = Encoding.UTF8.GetBytes(requeststring);
req.ContentLength = reqData.Length;
Stream reqStream = req.GetRequestStream();
reqStream.Write(reqData, 0, reqData.Length);
reqStream.Close();

HttpWebResponse WebResp = (HttpWebResponse)req.GetResponse();
var resp = WebResp.GetResponseStream();
StreamReader answer = new StreamReader(resp);
string res = answer.ReadToEnd();

if (res.Contains("Error"))
{
	throw new Exception(string.Format("Bad currency: {0}", curr));
}

if (res.Contains("Access"))
{
	// You're being throttled
}

var lines = res.Split(new char[] { '\n' });

// ignore the first two lines and the last two ones
for (int i = 2; i < lines.Length - 2 ; i++)
{
	var line = lines[i];
	var vals = line.Split(new char[] { ',' });

	// parse the vals
	var month = GetMonth(vals[0]);
	var year = GetYear(vals[0]);

	var rate = decimal.Parse(vals[1], CultureInfo.InvariantCulture);
}

// Util Methods
private static int GetMonth(string s)
{
	var month = s.Substring(1, 3);
	switch (month)
	{
		case "Jan": return 1;
		case "Feb": return 2;
		case "Mar": return 3;
		case "Apr": return 4;
		case "May": return 5;
		case "Jun": return 6;
		case "Jul": return 7;
		case "Aug": return 8;
		case "Sep": return 9;
		case "Oct": return 10;
		case "Nov": return 11;
		case "Dec": return 12;
		default: throw new Exception("crap");
	}
}

private static int GetYear(string s)
{
	var year = s.Substring(5, 4);
	return int.Parse(year);
}

Techniques for unique, correct and fast geo-queries II

You may remember my last post on this topic ended with a question. Just to recap, we have a table with a lot of rows that have geographical coordinates and we want to find a random subset that lies in a given map window. The problem with the query demonstrated last was that there are rows with the exact same coordinates and for those coordinates the query would always return the same row (namely the one with the highest intid).

I posted a question on Stackoverflow and Tom H. came up with a solution that was 90% there. The full query looks like this:

with cte as
(
    select
        intid,
        row_number() over
            (
                    partition by geoLat, geoLng
                    order by newid()
            ) as row_num,
        count(intid) over (partition by geoLat, geoLng) as TotalCount
    from
        documents d
	where
		d.geoLat < @maxLat
			and d.geoLat > @minLat
			and
			(
				((@maxLng > @minLng) and
					(d.geoLng < @maxLng and d.geoLng > @minLng))
				or
				((@maxLng < @minLng) and
					((d.geoLng > @minLng and d.geoLng < 180)
						or (d.geoLng > -180 and d.geoLng < @maxLng))
					)
			)
)
select top (@maxcount)
    dd.*, cte.intid, rand(cte.intid)
from
    cte,documents dd
where
    row_num = 1 + floor(rand() * TotalCount) and cte.intid = dd.intid
order by newid()

The query uses Common Table Expressions, which I’ve dabbled in before. Looking at the execution plan makes my head hurt, but it’s at least as fast as the original version. See the new version in action at the TEDBot site.

Downloading the EU

… or parts of it anyway.

The European Union is generally up to lots of weird and wonderful things, one of the more esoteric being the “Tenders Electronic Daily” (TED) database. Basically, all public procurement above a certain value has to go through this database so that companies from all over the world have a fair chance of winning the contracts. Once a tender is won, the names and addresses of the winning companies are also posted. This, in the hope that Europeans will get more roads for their tax-euros and that less money disappear into the mayors cousins pocket.

The database is hosted at http://ted.europa.eu/. It’s publicly available, but you can’t obtain the data in bulk unless you pay a lot of money. This post describes how data can be scraped en masse through the web interface. I’ve written it partly to brag, partly to help you out on the off chance that you someday need to analyse these contracts (in which case I sincerely hope you stumble on this post).

A typical won tender looks like this (you need to be logged in, registration is free):
http://ted.europa.eu/Exec?DataFlow=N_list_results.dfl&Template=TED/N_result_details_data.htm&Page=1&docnumber=2005238380&StatLang=EN

http://ted.europa.eu/Exec?DataFlow=N_list_results.dfl&Template=TED/N_result_details_curr.htm&Page=1&docnumber=2005238380&StatLang=EN

Notice that there are two interesting pages per contract, one has structured data, the other has unstructured text (the “Document family” stuff is also pretty interesting, but we won’t go into that here). Also note the “docnumber” url parameter. It actually turns out to be composed of two parts, the year (2005 for this particular contract) and an integer (238380). The integer part starts in January from 000001 and is then incremented as contracts accrue through the year. There are very few holes in this sequence, although it seems a few documents are redacted each year. The problem now looks as simple as determining the maximum docnumber for each year and wade through all of them, downloading both data and text parts, right?

The login-requirement will foil you unfortunately — to get at a contract more than a few weeks old, you need to be logged in. Being considered as “logged in” by TED turns out to entail posting a valid JSESSIONID cookie. The browser obviously accomplishes this automatically when you interact with the web site, but replicating it in a scraper proved to be extraordinarily difficult — in fact, I spent the better part of a week looking at WireShark traces before I managed to reverse engineer the process. There are three steps:

  1. Post username and password to http://ted.europa.eu/eLogin
  2. Simulate a search by posting to http://ted.europa.eu/ExecSessionVariables
  3. Simulate browsing the search result by getting http://ted.europa.eu/Exec?DataFlow=ShowPage.dfl&TableName=TED_EN&Template=TED/N_result_list.htm&Page=1&toconf=yes

Any sane login process would obviously stop after the first step, but for some reason you have to built state for your session variable on the server before you are allowed to retrieve random documents. The exact nature of the search being simulated is completely irrelevant, I just use a giant string mined from a random WireShark trace. The web site is built on some arcane Java stack — the convolutedness of which must be pretty amazing (posting incomplete requests will net you a stack trace that also reveals the supplier, who shall go unnamed here). Below, I’ve posted a C# class that will do the right thing. Note the gotcha that HttpWebRequest will not absorb cookies received during posts, so you have to record these yourself.

Once you can get valid session ids, getting all the data is pretty straight forward, although it will probably take a while. I highly recommend requesting gzipped data from the EU server (by including “Accept-Encoding” – “gzip” in the header) as it will drastically cut down on bandwidth usage. You can safely have multiple scrapers hammering the website concurrently, I found 10 to be a good number. Note that the servers are apparently rebooted around midnight CET each day during which time requests tend to fail.

Remember that you can browse the contract information on Google Maps here: http://tedbot.itu.dk/

And the code:

using System;
using System.Text;
using System.Net;
using System.IO;
using System.Net.Security;
using System.Security.Principal;

namespace TED.Scraper
{
    class Authenticator
    {
        static string userid = "username";
        static string password = "password";

        public enum MethodType { Get, Post };

        static string extSearchq = @"Name=statisticMode%40fulltext_textfield"+
            "%40OJ_textfield%40country_textfield"+
            "%40place_textfield%40contract_textfield"+
            "%40procedure_textfield%40document_textfield"+
            "%40regulation_textfield%40CPV_textfield%40NUTS_textfield"+
            "%40publication_textfield%40docnumber_textfield%40datedoc_textfield"+
            "%40deadline_textfield%40type_author_textfield%40name_author_textfield"+
            "%40heading_textfield%40activity_textfield%40fulltext_textfield_hid"+
            "%40OJ_textfield_hid%40country_textfield_hid%40place_textfield_hid"+
            "%40contract_textfield_hid%40procedure_textfield_hid%40document_textfield_hid"+
            "%40regulation_textfield_hid%40CPV_textfield_hid%40NUTS_textfield_hid"+
            "%40publication_textfield_hid%40docnumber_textfield_hid%40datedoc_textfield_hid"+
            "%40deadline_textfield_hid%40type_author_textfield_hid%40name_author_textfield_hid"+
            "%40heading_textfield_hid%40activity_textfield_hid%40docLang%40maxRow%40SelRetrieval" +
            "%40FTIndex%40SearchFrom%40ExpertQry%40op1%40op2%40Query%40ErrorMes%40AdviceMes&Value=No" +
            "%40null%40null%40null%40null%40null%40null%40null%40null%40null%40null%40null%40null%40null" +
            "%40null%40null%40null%40null%40null%40null%40null%40null%40null%40null%40null%40null%40null" +
            "%40null%40null%40null%40null%40null%40null%40null%40null%40null%40null%40EN%4025%40OJ%2CND%2CTI" +
            "%40TEDINDEX_ARCHIVE2%40extended%40null%40AND%40AND%40cs_type%3Adb+%40null" +
            "%40null&Language=EN&Redirect=Exec%3FDataFlow%3DShowPage.dfl%26TableName%3DTED_EN" +
            "%26Template%3DTED%2FN_result_list.htm%26Page%3D1%26toconf" +
            "%3Dyes&ErrorMes=null&AdviceMes=null&RedirectError=Exec" +
            "%3FDataFlow%3DShowPage.dfl%26Template%3DTED%2Fextended_search%26BANNER%3DEXT";

        public static string GetAuthenticatedSessionId()
        {
            CookieContainer cc = new CookieContainer();

            DoHttpRequest("http://ted.europa.eu/eLogin",
                MethodType.Post,
                true,
                String.Format("USERID={0}&PASSWORD={1}", userid, password),
                cc);

            DoHttpRequest("http://ted.europa.eu/ExecSessionVariables",
                            MethodType.Post,
                            true,
                            extSearchq,
                            cc);

            DoHttpRequest("http://ted.europa.eu/Exec?" +
                "DataFlow=ShowPage.dfl&TableName=TED_EN&Template=TED/N_result_list.htm&Page=1&toconf=yes",
                MethodType.Get,
                false,
                null,
                cc);

            string sessionid = cc.GetCookies(new Uri("http://ted.europa.eu"))["JSESSIONID"].Value;
            return sessionid;
        }

        private static void DoHttpRequest(string url, MethodType mtype,
            bool isFormEncoded, string requestString, CookieContainer cc)
        {
            HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
            req.Method = mtype.ToString();
            req.CookieContainer = cc;

            req.AllowAutoRedirect = false;
            req.AllowWriteStreamBuffering = true;
            req.AuthenticationLevel = AuthenticationLevel.None;
            req.ImpersonationLevel = TokenImpersonationLevel.None;
            req.UserAgent =
                "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9b4) Gecko/2008030714 Firefox/3.0b4";

            req.Proxy = null;

            if (isFormEncoded)
            {
                req.ContentType = "application/x-www-form-urlencoded";
                req.Expect = null;

                byte[] reqData = Encoding.UTF8.GetBytes(requestString);
                req.ContentLength = reqData.Length;
                Stream reqStream = req.GetRequestStream();
                reqStream.Write(reqData, 0, reqData.Length);
                reqStream.Close();
            }

            try
            {
                HttpWebResponse response = (HttpWebResponse)req.GetResponse();
                if (isFormEncoded)
                {
                    // seems that cookies received in post are not saved to collection
                    req.CookieContainer.Add(new Uri("http://ted.europa.eu"), response.Cookies);
                }

                response.Close();
            }
            catch (Exception e)
            {
                throw;
            }
        }

        private static void AddRequestString(HttpWebRequest request, string requeststring)
        {
            byte[] reqData = Encoding.UTF8.GetBytes(requeststring);
            request.ContentLength = reqData.Length;
            Stream reqStream = request.GetRequestStream();
            reqStream.Write(reqData, 0, reqData.Length);
            reqStream.Close();
        }
    }
}

Showing maps and borders in Processing

A few days ago, I posted a video showing public procurement expanding geopraphically with the EU enlargements in the ’00s. There weren’t any borders, but you could sort of see the outline of Europe and how the dots spread east with time.

Adding actual borders to the map proved very frustrating. The Geographical Information System (GIS) space seems plagued by obtuse binary formats, stodgy desktop applications and a profusion of coordinate systems. I tried many avenues, one of the more promising being a smoothed map from MapShaper passed through TatukGIS and exported to KML. The coordinates from MapShaper turned out to be incompatible with the latitude/longitude ones I had from the Google Maps web services however.

After much searching, I found a KML-file in the Google Maps group (worldcountriesKML.zip) with the borders of all the worlds countries. It’s not smoothed in any way, so the haggard coastline of a country like Norway looks too thick when zoomed out. The result is better than the original though:

I implemented a simple Processing helper-class to parse and draw the KML. You instantiate it like this: helper = new XMLHelper(new XMLElement(this, "world.kml"));. It has a Init() method that takes a String[] of the countries you need and String denoting the XML path to the line coordinates, e.g. "Polygon/outerBoundaryIs/LinearRing/coordinates". After initialization you can ask it for its maximum and minimum coordinates using min_x, max_x, min_y, max_y. The Draw() method draws the map on the sceen, scaled to fit the size.

class XMLHelper
{
  float coordscale = 1;
  XMLElement borders;
  Line[] lines = new Line[0];
  public float max_y =0, min_y = 70, max_x = 0, min_x = 0;
  
  public XMLHelper(XMLElement xe)
  {
    borders = xe;
  }
  
  public void Init(String[] wantedcountries, String coordpath)
  {
    XMLElement[] filecountries = borders.getChildren("Folder/Placemark");
    for(int i = 0; i < filecountries.length; i++)
    {
      XMLElement c = filecountries[i];

      if(Util.Contains(wantedcountries, c.getChild(0).getContent()))
      {
        // this one should be included on the map
        String points = c.getChild(coordpath).getContent();
      
        String[] point_a = split(points," ");
        Point[] pointarray = new Point[point_a.length];
        for(int j = 0; j < point_a.length; j++)
        {
          String[] xyz = split(point_a[j],',');
          
          float x = float(xyz[0])/coordscale;
          float y = float(xyz[1])/coordscale;
  
          pointarray[j] = new Point(x, y);
  
          max_x = max(x, max_x);
          min_x = min(x, min_x);
          max_y = max(y, max_y);
          min_y = min(y, min_y);
        }
        // this looks slow
        lines = (Line[]) append(lines,new Line(pointarray));
      }
    }
  }
  
  public void Draw()
  {
    for(int i = 0; i < lines.length; i++)
    {
      Line l = lines[i];
      beginShape();
      for(int j = 0; j < l.points.length; j++)
      {
        vertex(
          map(l.points[j].x, min_x, max_x, 0, width),
          height - map(l.points[j].y, min_y, max_y, 0, height)
        );
      }
      endShape();
    }
  }
}

class Point
{
  public float x,y;
  public Point(float _x, float _y)
  {
    x = _x;
    y = _y;
  }
}

class Line
{
  Point[] points;
  public Line(Point[] _points)
  {
    points = _points;
  }
}

static class Util
{
  static boolean Contains(String[] a, String s)
  {
    for(int i = 0; i < a.length; i++)
    {
      if(a[i].toLowerCase().equals(s.toLowerCase()))
      {
        return true;
      }
    }
    return false;
  }
}

Video of procuring authorities in the EU

I did a video showing the geographical spread over time of authorities buying stuff through the EU public procurement system. We managed to hijack all the contracts some time ago and the addresses of the authorities and the winning contractors have now all been geocoded. You can also explore the data on the TEDBot website.

The animation starts in january 2003 and ends at the end of 2007. You can actually see the EU expansion happening in May 2004 with dots spreading east. 10 14 day intervals are shown each second. The animation was generated with Processing.

Techniques for unique, correct and fast geo-queries

UPDATE: Better solution here.

Here’s the scenario: You have a lot (a million+) of geotagged addresses in a database and you want to show them as markers on Google Maps. Some further requirements are:

  • You only want to show some constant (10) amount at any given time, since too many markers clutters the map and kills the browser
  • You want the markers shown for a given frame to be selected randomly so that a user looking at some particular area with many geocoded addresses is not always shown the same 10 markers
  • You want the markers not to be on top of each other, so that even though many addresses in the database have the same coordinates (i.e. “Copenhagen, Denmark”), you only return one result per coordinate (this is to avoid placing two or more markers in the same spot)
  • You want it to run at interactive speed on a crummy box

Imagine you have a Documents table with columns including geoLng, geoLat representing geographical coordinates and intid, a unique integer identifier. @maxcount is the maximum number of rows desired while @maxLat, @minLat, @maxLng and @minLng define the corners of the map viewport. This query will do the job:

select *
from Documents dd
where dd.intid in
(
    select top (@maxcount) max(intid)
    from Documents d
    where d.geoLat < @maxLat
        and d.geoLat > @minLat
        and
        (
            ((@maxLng > @minLng) and (d.geoLng < @maxLng and d.geoLng > @minLng))
            or
            ((@maxLng < @minLng) and
            (
                (d.geoLng > @minLng and d.geoLng < 180) or
                (d.geoLng > -180 and d.geoLng < @maxLng))
            )
        )
    group by d.geoLng, d.geoLat
    order by newid()
)

“What’s this monkeying around with the longitudes?” I hear your cry? Well, if the map viewport straddles the International Dateline (which is somewhere in the Pacific), Google Maps will feed you viewport corners that “wrap around” and that has to be handled. If-elses in SQL is a mess, so it’s better to cook up some pure boolean voodoo like above. “Who the hell looks at Google Maps of the International Dateline?” you ask. Good point, but it’s actually not that uncommon. Looking at Japan at low zoom-levels will often provoke this behaviour, as will looking at French Polynesia. Note that this trick is not needed for latitudes because the maps don’t wrap vertically

The slowest bit will usually be the order by newid() part, which gives each row in the prospective result a new (random) guid, and then sorts all the rows based on this column. It can be replaced with tablesample calls which are much faster, but also a lot more erratic.

There’s a very slight problem with this query in that the max(intid) will always cause the same row to be returned for any given Lat/Lng coordinate. The use of max(intid) is completely arbitrary and min(intid) would work too. Had there been a rand(intid) aggregate, the problem would have been solved. I haven’t figured out an elegant solution to this problem yet. Note that max()doesn’t work on guids produced by newid().

To get this to perform, the tables in question are organised around a clustered (geoLat, geoLng, intid) index. You can see somewhat more elaborate versions of this query doing their thing at the TEDBot website.