I’ve just submitted a voice-sample to help Google in their efforts to build Danish-language voice search. See what voice search is about in this video. In case anyone is interested, here’s how Google goes about collecting these samples.
The sampling was carried out by a Danish speaker hired by Google for the specific task. It took place in a crowded Copenhagen coffee shop (Baresso at Strøget, near Google’s Copenhagen sales office) with people talking, coffee machines hissing and music in the background. This is likely to ensure that samples are collected in an environment similar to the one where voice search will be used.
The samples were recorded on a stock Google Nexus One using an Android application called “dataHound”. The sampling basically involved me reading 500 random terms, presumably search terms harvested from Google searches. Most were one-word phrases but there were some multi-word ones too (this likely reflects the fact that most users only search using single words). The Googler said that it was due to the sensitive nature of these terms (and presumably the risk of harvesting) that the sampling had to be carried out in person. Google apparently requires 500 of these 500-term samples to form a language corpus (I was number 50).
The dataHound app displayed the term to be spoken at the top with a bunch of buttons at the bottom. One button advanced the app to the next term, one could be pressed if the term was completely unintelligible and one could be used if the term was offensive to you and you did not want to say it out loud (I had no such qualms). The interface was pretty rough but the app was fast.
The terms were all over the place. I work for Ekstra Bladet (a Danish tabloid) and noted our name cropped up twice. “Nationen” (our debate sub-site) showed up once. Other Danish media sites were well represented and there were many locality-relevant searches. There were also a lot of domain names; presumably Google expects people to use Google Voice Search rather than typing in a URL themselves (indeed, people already do this on google.com).
Among the terms were also “Fisse” (the Danish word for “cunt”), “tisse kone” (a more polite synonym for female genitals), “ak-47” and “aluha ackbar”. If Google prompts you to say “cunt” in a public place, how can you refuse?
The Googler told me that she’s looking for more volunteers, so drop her a line if you speak Danish and live in Copenhagen: [email protected]. Plus, you get a Google T-shirt for your efforts!
The whole Folkets Ting business has turned out rather well (even though the site is not currently updated; we’re working on it!) and I’ve been invited to speak on a few occasions. Some of the talks were recorded, and in the interest of self-aggrandizement they are included below in chronological order (except for the last one).
Short blurb (in Danish) on what I think about the usefulness of public data at the ODIS conference:
Speech on “Political Data API” after a project of mine won a competition promoting reuse of public data (winners were announced at the conference mentioned above):
Denmark somehow seems to have hatched more programmers and language designers of note than one would expect of a country of 6 million. Since almost none of them live in Denmark, it is kind of easy to forget. Here’s a partial list (alphabetical, inclusion determined by my completely whimsical notions of famousness, reasons for inclusion may be somewhat exaggerated):
Anders Hejlsberg (came up with Turbo Pascal, lead architect for C#)
I’ve created a new web site on Danish politics in the tradition of The Public Whip and OpenCongress (although it’s not yet nearly as good as those guys). It’s called Folkets Ting and comes with a complimentary blog (both in Danish). Go check it out.
Right – after a few years on ITU servers, I’ve moved my blog to a separate domain hosted by Netplads. This was mostly for SEO reasons, so that I could build Google Juice on my own and not have my page rank muddled with whatever ITU does. The new host also allows .htaccess modifications so that I can get nice URLs. Netplads is a cheap and cheerful Danish hoster – the only fault I’ve found so far is a lack of mod_gzip support.
On my old blog, the Redirection plugin does 301 redirects to the one you’re currently reading (doing rewrites in .htaccess would have been easier but was unsupported). In fact, it’s so good at it that I can no longer access my old blog in any way. Good riddance.
After several false starts, I think I will now have enough material to post regularly. The posts will probably concern mainly LINQ (the subject of my master’s thesis), Dynamics CRM (which I work with daily) and C#/.Net/Web-tech in general, for the near future at least.
While Hemingway’s prose will consistently make the hairs on the back of my neck stand on end, that is, in fact, not the reason I chose the hemingway reloaded wp-theme. I just happen to think it’s aesthetically pleasing. I’ve made a few minor mods, including removing the credits in the lower left corner. Instead I’ll credit the creators here: Thank you startup365 and Kyle Neath for a beautiful theme. If I find the time, I may mod it some more. I’m thinking …CGA!
The blog is hosted at ITU: it’s free, has an agreeable LAMP stack and plenty of bandwidth (not that I’ll need it).
If you want to know more about me, check the about page.
UPDATE, 04-08-2007: Google Code Prettify is now syntax highlighting code in posts.