Geohashes and the problem of Greenwich
Sunday, June 8th, 2008
A geohash is a way to encode latitude/longitude into a string such as gcpvjk3c that includes only number and letters, making it easy to pass around in URLs, etc. The Wikipedia entry makes the algorithm seem more complicated than it actually is. The steps are:
- normalise latitude and longitude to the range [0, 1)
- convert normalised latitude and longitude to binary (in the conventional way)
- interlace the bits of the two binary representations to create one binary number
- convert this number to base-32 (i.e. every 5 bits becomes a character) using a alphabet that includes all the numbers and all the alphabet, with the exception of a, i, l, o
A geohash also supposedly has the property that similar locations have a similar prefix, making it easier for computer programs to figure out similar locations, without resorting to complicated and slow trigonometry that won’t scale to hundreds of thousands of points.
Geohashes do have this property for many geospatial neighbours, but there’s still a most significant bit, and locations either side of it are going to have very different geohashes. Unfortunately, given the location of Greenwich—about 10km to the east of central London—a whole bunch of London locations have very different geohashes. u10hb7951 and gcpuzewfz, for example, are very close to each other.
Moving the 0, 0 point to into the middle of the Atlantic (or anywhere not as inhabited as London) helps with this, but it’s obviously not a robust fix. I wonder how the geospatial extensions to MySQL and PostgreSQL actually work? It’s fairly efficient to convert hashes based on one origin to another, so perhaps they store multiple hashes per location? There must be some clever algorithms for this.
Tags: geohash, geolocation
June 8th, 2008 at 10:12 pm
“A geohash also supposedly has the property that similar locations have a similar prefix, making it easier for computer programs to figure out similar locations,”
If you read the Wikipedia page carefully and go to geohash.org, you’ll see that this isn’t entirely accurate. There are many ways to explore the algorithm to find nearby places indeed, but computing how close two different Geohashes really are by merely comparing the string prefixes “blindly” (IOW, without taking the algorithm used in consideration) wasn’t one of the goals, even because it’s an impossible one. I’ll try to make that more clear in the page to prevent misunderstandings.
Suppose, for instance, that an algorithm established that places which are N kilometers apart must necessarily have the same first character. As a corollary, all places on earth would have to start with the same character. Picking lexically closeby characters to represent the boundary doesn’t work either, because the earth is round.
The often similar prefixes come, in fact, as a bonus of one of the goals of the algorithm, which was allowing for gradual degradation of the position by reducing the geocode size.
June 12th, 2008 at 10:35 am
The Wikipedia entry is clearer now, and I totally agree it’s impossible for a single number to capture a pair of numbers and maintain equivalent “locality.” There must still be some nice techniques for efficiently computing neighbours though.