HAYCORN — 11 July 2010

On MongoDB durability and commit models

I’ve been reading and thinking about MongoDB’s consistency model over the last few weeks. I think what follows is an accurate model of how it behaves, and what the implications are, but I would be grateful for any corrections:

  • Whilst mongod is running, you may as well imagine that all your data is stored in memory, and that the files it creates on disk don’t actually exist. This is because the files themselves are memory mapped, and MongoDB makes no guarantees about whether they’re in a consistent state. (They’re theoretically consistent immediately after an fsync, and this can be forced in several ways, but I don’t think there’s any easy way to snapshot the database files immediately after the fsync.) [Update: Kristina pointed out that it is possible to fsync and then prevent the memory mapped file from being modified via the fsync and lock command. Once the lock (which blocks all write operations, though not reads) is in place, you can safely copy/snapshot the database file.]
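A quick sketch of that lock/snapshot sequence in the mongo shell (the exact helper names have varied between versions, so treat this as illustrative rather than definitive):

```javascript
// Flush pending writes to disk and block further writes.
// The database files are then in a consistent state on disk.
db.adminCommand({ fsync: 1, lock: 1 });

// ...copy or snapshot the database files here; reads are still served...

// Release the lock once the copy is finished (newer shells provide
// a helper for this).
db.fsyncUnlock();
```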
  • Since the data is stored in memory (including virtual memory) you need replication to ensure durability. However, the MongoDB developers argue that you need replication for durability anyway (notwithstanding their plans to add single-server durability in 1.8). In a blog post (“What about Durability?”), they point out:

    • even if your database theoretically supports single-server durability (such as CouchDB with its crash-only design), this will only protect you if you’ve turned off hardware buffering or have a battery-backed RAID controller to ensure your writes really hit disk
    • having a well-formed database file on disk doesn’t help you if the disk itself fails, and this mode of failure is about as likely as any other
    • for some applications, you need 100% uptime, and so the delay required to recover from a failure (by reloading the database and replaying the transaction log, for example) is unacceptable
  • By default, writes are not persisted immediately to the database, not even to the in-memory data structure. Writes instead go into a per-server queue (with FIFO ordering?), and there’s no guarantee about when the write will be visible to other clients. (Although it does seem that clients are guaranteed to read their own writes.)
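To illustrate the default behaviour, a shell session might look like this (the collection name is just an example):

```javascript
// Fire-and-forget: the insert returns as soon as the message is
// queued; there is no acknowledgement that it has been applied.
db.things.insert({ _id: 1, name: "example" });

// This connection will see its own write...
db.things.findOne({ _id: 1 });
// ...but other clients have no guarantee about when it becomes visible.
```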
  • To change the default write behaviour, you: (a) issue the write as normal and then (b) issue the confusingly-named getLastError() command. getLastError() blocks until the write is either committed to the in-memory data structure, fsynced to disk, or committed to n replicas, depending on the arguments passed. (Note that many clients abstract away the write plus getLastError() call, so that the arguments to getLastError() become arguments to the write.)

    • To block until the write is committed to the in-memory data structure, issue getLastError() with no options.
    • To block until the data structure is fsynced to disk, issue getLastError() with fsync = true.
    • To block until the write has been added to the write queue of n replicas, issue getLastError() with w = n. (Note that there is currently no way to block until the write has been fsynced on the replicas.)
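The three variants above might look like this in the mongo shell (the collection name and option values are illustrative; most drivers wrap this up for you):

```javascript
// Issue the write as normal; it goes into the write queue.
db.things.insert({ _id: 2, name: "example" });

// (1) Block until the write is committed to the in-memory structure:
db.runCommand({ getlasterror: 1 });

// (2) Block until the data has been fsynced to disk:
db.runCommand({ getlasterror: 1, fsync: true });

// (3) Block until the write has reached the write queues of 2 replicas:
db.runCommand({ getlasterror: 1, w: 2 });
```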