Geohashes and the problem of Greenwich

8 June 2008

A geohash is a way to encode latitude/longitude into a string such as gcpvjk3c that in­cludes only number and letters, making it easy to pass around in URLs, etc. The Wikipedia entry makes the al­go­rithm seem more com­pli­cated than it ac­tu­ally is. The steps are:

  1. nor­malise lat­i­tude and lon­gi­tude to the range [0, 1)
  2. convert nor­malised lat­i­tude and lon­gi­tude to binary (in the con­ven­tional way)
  3. in­ter­lace the bits of the two binary rep­re­sen­ta­tions to create one binary number
  4. convert this number to base-32 (i.e. every 5 bits becomes a character) using a al­pha­bet that in­cludes all the numbers and all the alphabet, with the ex­cep­tion of a, i, l, o

A geohash also sup­pos­edly has the prop­erty that similar lo­ca­tions have a similar prefix, making it easier for com­puter pro­grams to figure out similar locations, without re­sort­ing to complicated and slow trigonometry that won’t scale to hun­dreds of thou­sands of points.

Geo­hashes do have this prop­erty for many geospa­tial neighbours, but there’s still a most sig­nif­i­cant bit, and lo­ca­tions either side of it are going to have very dif­fer­ent geohashes. Unfortunately, given the lo­ca­tion of Greenwich—about 10km to the east of central London—a whole bunch of London lo­ca­tions have very dif­fer­ent geohashes. u10hb7951 and gcpuzewfz, for example, are very close to each other.

Moving the 0, 0 point to into the middle of the At­lantic (or any­where not as in­hab­ited as London) helps with this, but it’s ob­vi­ously not a robust fix. I wonder how the geospa­tial ex­ten­sions to MySQL and Post­greSQL ac­tu­ally work? It’s fairly ef­fi­cient to convert hashes based on one origin to another, so perhaps they store mul­ti­ple hashes per location? There must be some clever al­go­rithms for this.