Techniques for unique, correct and fast geo-queries
UPDATE: Better solution here.
Here’s the scenario: You have a lot (a million+) of geotagged addresses in a database and you want to show them as markers on Google Maps. Some further requirements are:
- You only want to show a fixed number (say 10) at any given time, since too many markers clutter the map and kill the browser
- You want the markers shown for a given frame to be selected randomly so that a user looking at some particular area with many geocoded addresses is not always shown the same 10 markers
- You want the markers not to sit on top of each other: even though many addresses in the database share the same coordinates (e.g. “Copenhagen, Denmark”), you only return one result per coordinate
- You want it to run at interactive speed on a crummy box
Imagine you have a Documents table with columns including geoLat and geoLng, representing geographical coordinates, and intid, a unique integer identifier. @maxcount is the maximum number of rows desired, while @minLat, @maxLat, @minLng and @maxLng define the corners of the map viewport. This query will do the job:
select *
from Documents dd
where dd.intid in (
    select top (@maxcount) max(intid)
    from Documents d
    where d.geoLat < @maxLat and d.geoLat > @minLat
      and (
            ((@maxLng > @minLng) and (d.geoLng < @maxLng and d.geoLng > @minLng))
         or ((@maxLng < @minLng) and (
                (d.geoLng > @minLng and d.geoLng < 180)
             or (d.geoLng > -180 and d.geoLng < @maxLng)))
      )
    group by d.geoLng, d.geoLat
    order by newid()
)
"What's this monkeying around with the longitudes?" I hear you cry. Well, if the map viewport straddles the International Date Line (which is somewhere in the Pacific), Google Maps will feed you viewport corners that "wrap around", and that has to be handled. If-elses in SQL are a mess, so it's better to cook up some pure boolean voodoo like the above. "Who the hell looks at Google Maps of the International Date Line?" you ask. Good point, but it's actually not that uncommon. Looking at Japan at low zoom levels will often provoke this behaviour, as will looking at French Polynesia. Note that this trick is not needed for latitudes because the maps don't wrap vertically.
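The longitude logic is easier to check outside SQL. Here is a minimal Python sketch of the same boolean expression; the function name in_viewport and the sample coordinates are made up for illustration and are not part of the original query.

```python
def in_viewport(lat, lng, min_lat, max_lat, min_lng, max_lng):
    """Return True if (lat, lng) lies strictly inside the viewport."""
    lat_ok = min_lat < lat < max_lat
    # Ordinary viewport: the west edge is numerically smaller than the east edge.
    normal = max_lng > min_lng and min_lng < lng < max_lng
    # Wrapped viewport: it crosses the 180 degree line, so accept either side of it.
    wrapped = max_lng < min_lng and (min_lng < lng < 180 or -180 < lng < max_lng)
    return lat_ok and (normal or wrapped)


# A viewport over the Pacific running from 170 E across the date line to 170 W.
print(in_viewport(0.0, 175.0, -10, 10, 170, -170))   # just east of the west edge
print(in_viewport(0.0, -175.0, -10, 10, 170, -170))  # just west of the east edge
print(in_viewport(0.0, 0.0, -10, 10, 170, -170))     # Greenwich, far outside
```

Like the SQL, the sketch uses strict comparisons, so a point lying exactly on longitude 180 or -180 is not matched by a wrapped viewport.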
The slowest bit will usually be the order by newid() part, which gives each row in the prospective result a new (random) guid, and then sorts all the rows based on this column. It can be replaced with tablesample calls, which are much faster but also a lot more erratic.
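To see why the random ordering is the expensive part, here is a rough Python analogue. It assumes uuid.uuid4() as a stand-in for newid(), and the helper name random_top is invented for this sketch: every candidate row gets a random key and the whole list is sorted just to keep a handful of rows.

```python
import uuid


def random_top(rows, max_count):
    """Tag each row with a fresh random UUID, sort by it, keep the first max_count."""
    # The sort touches every candidate row, even though only a few are kept.
    shuffled = sorted(rows, key=lambda _row: uuid.uuid4())
    return shuffled[:max_count]


candidate_ids = list(range(1000))
picked = random_top(candidate_ids, 10)
print(len(picked))
```

The work grows with the number of candidate rows rather than with the ten that are returned, which matches the behaviour described above.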
There's a very slight problem with this query in that the max(intid) will always cause the same row to be returned for any given Lat/Lng coordinate. The use of max(intid) is completely arbitrary and min(intid) would work too. Had there been a rand(intid) aggregate, the problem would have been solved. I haven't figured out an elegant solution to this problem yet. Note that max() doesn't work on guids produced by newid().
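For comparison, this is what the wished-for rand(intid) aggregate would do, sketched in Python rather than SQL. It is only an illustration of the intended behaviour, not a fix for the query; the function pick_markers and the sample rows are hypothetical, and the rows are assumed to be already filtered to the viewport.

```python
import random
from collections import defaultdict


def pick_markers(rows, max_count, rng=random):
    """rows: iterable of (intid, lat, lng) tuples already inside the viewport."""
    by_coord = defaultdict(list)
    for intid, lat, lng in rows:
        by_coord[(lat, lng)].append(intid)
    # The hypothetical rand(intid) aggregate: one random id per coordinate.
    one_per_coord = [rng.choice(ids) for ids in by_coord.values()]
    # Counterpart of top (@maxcount) ... order by newid().
    return rng.sample(one_per_coord, min(max_count, len(one_per_coord)))


# Two documents share the same coordinates; only one of them may be returned.
rows = [(1, 55.68, 12.57), (2, 55.68, 12.57), (3, 48.85, 2.35)]
print(pick_markers(rows, 10))
```

With max(intid), the shared coordinate would always yield id 2; here ids 1 and 2 each come up about half the time.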
To get this to perform, the tables in question are organised around a clustered (geoLat, geoLng, intid) index. You can see somewhat more elaborate versions of this query doing their thing at the TEDBot website.