On MongoDB durability and commit models

I’ve been reading and thinking a bit about MongoDB’s consistency model a bit over the last few weeks. I think what follows is an accurate model of how it behaves, and what the implications are, but I would be grateful for any corrections:

Whilst mongod is running, you may as well imagine that all your data is stored in memory, and that the files it creates on disk don’t actually exist. This is because the files themselves are memory mapped, and MongoDB makes no guarantees about whether they’re in a consistent state. (They’re theoretically consistent immediately after an fsync, and this can be forced in several ways, but I don’t think there’s any easy way to snapshot the database files immediately after the fsync.) [Update: Kristina pointed out that it is possible to fsync and then prevent the memory mapped file from being modified via the fsync and lock command. Once the lock (which blocks all write operations, though not reads) is in place, you can safely copy/snapshot the database file.]
Since the data is stored in memory (including virtual memory) you need replication to ensure durability. However, MongoDB argue you need replication for durability anyway (notwithstanding their plans to add single-server durability in 1.8). In a blog post (“What about Durability?”), they point out:
- even if your database theoretically supports single-server durability (such as CouchDB with its crash only design), this will only protect you if you’ve turned off hardware buffering or have a battery-backed RAID controller to ensure your writes really hit disk
- having a well-formed database file on disk doesn’t help you if the disk itself fails, and this mode of failure is about as likely as any other
- for some applications, you need 100% uptime, and so the delay required to recover from a failure (by reloading the database and replaying the transaction log, for example) is unacceptable
By default writes are not persisted immediately to the database—not even the in-memory data structure. Writes instead go into a per-server queue (with FIFO ordering?), and there’s no guarantee about when the write will be visible to other clients. (Although it does seem that clients are guaranteed to read their own writes.)
To change the default write behaviour, you: (a) issue the write as normal and then (b) issue the confusingly-named getLastError() command. getLastError() blocks until the write is either committed to the in-memory data structure, fsynced to disk, or committed to n replicas, depending on the arguments passed. (Note that many clients abstract away the write plus getLastError() call, so that the arguments to getLastError() become arguments to the write.
- To block until the write is committed, issue getLastError() with no options.
- To block until the data structure is fsynced to disk, issue getLastError() with fsync = true.
- To block until the write has been added to the write queue of n replicas, issue w = n. (Note that there is currently no way to block until the write has been fsynced on the replicas.)