The right answer is that they shouldn't. A database has much more information than the operating system about what, how and when to cache information. Therefore the database should handle its own I/O caching using O_DIRECT on Linux or the equivalent on Windows or other Unixes.
> Therefore the database should handle its own I/O caching using O_DIRECT on Linux or the equivalent on Windows or other Unixes.
That's not wrong, but at the same time it adds complexity and requires effort that can't be spent elsewhere, unless you've got someone who really only wants to work on DIO and wouldn't work on anything else anyway.
Postgres has never used DIO, and while there have been rumblings about moving to DIO (especially following the fsync mess), as Andres Freund noted:
> efficient DIO usage is a metric ton of work, and you need a large amount of differing logic for different platforms. It's just not realistic to do so for every platform. Postgres is developed by a small number of people, isn't VC backed etc. The amount of resources we can throw at something is fairly limited. I'm hoping to work on adding linux DIO support to pg, but I'm sure as hell not going to be able to do the same on windows (solaris, hpux, aix, ...) etc.
PostgreSQL has two main challenges with direct I/O. The basic one is that it adversely impacts portability, as mentioned, and is complicated to implement because file system behavior under direct I/O is not always consistent.
The bigger challenge is that PostgreSQL is not architected like a database engine designed to use direct I/O effectively. Adding even the most rudimentary support will be a massive code change and implementation effort, and the end result won't be comparable to what you would expect from a modern database kernel designed to use direct I/O. This raises questions about return on investment.
I have found that planning for DIO from the start makes for a better, simpler storage system design, because it keeps the focus on logical/physical sector alignment, latent sector error handling, and caching from the beginning. Even better is designing data layouts to work with block devices directly.
Retrofitting DIO onto a non-DIO design and doing this cross-platform is going to be more work, but I don't think that's the fault of DIO (when you're already building a database, that is).
Although if it's a new project and you're used to C, I would recommend also taking a good look at Zig (https://ziglang.org/), because it's much more explicit about alignment than C and makes alignment a first-class part of the type system; see this other comment of mine that goes into more detail: https://news.ycombinator.com/item?id=25801542
Something that will also help is setting your minimum I/O unit to 4096 bytes, the Advanced Format sector size, because then your Direct I/O system will just work, regardless of whether sysadmins swap in disks with different sector sizes underneath you. For example, a minimum sector size of 4096 bytes will work not only for newer AF disks but also for any 512-byte-sector disks.
Lastly, Direct I/O is really a property of the file system, not just the OS (e.g. Linux), so you will find some file systems on Linux that return EINVAL when you try to open a file descriptor with O_DIRECT, simply because they don't support it (e.g. a macOS volume accessed from within a Linux VM). Probing the file system, not only the OS, should therefore be your way of testing for support.
Yes, and it's not only about performance, but also safety because O_DIRECT is the only safe way to recover from the journal after fsync failure (when the page cache can no longer be trusted by the database to be coherent with the disk): https://www.usenix.org/system/files/atc20-rebello.pdf
From a safety perspective, O_DIRECT is now table stakes. There's simply no control over the granularity of read/write EIO errors when your syscalls only touch memory and where you have no visibility into background flush errors.
Around four years ago I was working on a transactional data store and ran into these issues; virtually no one tells you how durable I/O is supposed to work. There were very few articles on the internet that went beyond the basics (e.g. create file => fsync directory) and perhaps one article explaining what needs to be considered when using sync_file_range. Docs and POSIX were useless. I noticed there seemed to be inherent problems with I/O error handling when using the page cache, i.e. whenever something other than the app itself caused write I/O, you really didn't know anymore whether all the data got there.
Some two years later fsyncgate happened, and since then I/O error handling on Linux has finally gotten at least some attention; people seem to have woken up to the fact that this is a genuinely hard thing to do.
What was the data store you were working on? Is it open source?
My experience was the same as yours.
What helped me was discovering all the fantastic storage and file system papers coming out of the University of Wisconsin Madison, supervised by Remzi and Andrea Arpaci-Dusseau.
Their teams have studied and documented almost all aspects of what is required to write reliable storage systems, even diving into interactions between local storage failures and global consensus protocols, such as how a single disk block failure can destroy Raft and ZooKeeper. Most safety testing of these systems tends to focus on the network fault model. I think in a few years' time we'll all look back and see how today we had almost no concept of a storage fault model. It's kind of exciting to think that there's going to be a new breed of replicated databases that are far more reliable than today's systems. On the other hand, perhaps the future is already here, just not very evenly distributed.
I was trying to address this aspect of the parent comment:
> O_DIRECT is the only safe way to recover from the journal after fsync failure (when the page cache can no longer be trusted by the database to be coherent with the disk)
O_DIRECT is not a safe way to recover from the journal if you have decided you cannot trust fsync to do its job, because you need fsync to make O_DIRECT write-cache durable.
(By the way, O_SYNC/O_DSYNC are equivalent to calling fsync/fdatasync after each write, and are therefore subject to some of the same issues.)
But even in normal situations with fsync working fine, it is not clear whether you can rely on fsync to issue a drive write-cache flush when there isn't any metadata or page-cache data for the file, because you've only been using O_DIRECT.
Neither the open(2) nor the fsync(2) man pages address this durability issue. You can't use O_DSYNC or O_SYNC for good throughput with O_DIRECT, because your database does not want the overhead of a write-cache flush on every write; you only want it for barriers. And you can't rely on fdatasync, because there's no data to flush in the page cache and no block I/O to do, so fdatasync could meet expectations by doing nothing.
My solution in the past has been to toggle the LSB in st_mtime before fsync, just to force a drive write-cache flush when I'm not sure anything else will force one. It's not pretty.
> O_DIRECT is not a safe way to recover from the journal if you have decided you cannot trust fsync to do its job, because you need fsync to make O_DIRECT write-cache durable.
I was specifically referring not to fsync failure in your sense (where the disk or fs does not respect fsync at all, so that fsync is a no-op, or where the fs has a bug and O_DIRECT doesn't flush when it sees nothing dirty in the page cache; by the way, I think this is no longer an issue, and otherwise it's a kernel bug you can report)...
...but to handling an fsync error in the context of the paper from WISC that I linked to in that parent comment, where the kernel's page cache has gone out of sync with the disk after an fsync EIO error ("Fsyncgate"):
The details are all in the paper. Sure, some disks may not respect fsync, but O_DIRECT is still the only way to safely read and recover from the journal when the kernel's page cache is out of sync with the disk (again, details in the paper). It's another fantastic paper out of WISC.
> A database has much more information than the operating system about what, how and when to cache information
Yes, on a dedicated server. However, many DB engine instances run on non-dedicated servers, for example alongside a web server flanked with various processes that sometimes read the local filesystem or use RAM (Varnish, memcached...), plus frequently-run tasks (tempfile purges, log aggregation, monitoring probes, MTA...). In such a case, letting the DB engine use too much RAM limits the OS buffer cache size and thereby reduces global efficiency: all other things being equal, it may imply more 'read' operations, reducing overall performance.
Great point. Selecting the RDBMS page cache size is a key performance parameter that is near impossible to get right on a mixed-use host, whether a non-dedicated server or a client desktop/laptop. SQL Anywhere, which emphasizes zero-admin, has long supported Dynamic Cache Sizing [1] specifically for this mixed-use case, which is/was its bread-and-butter. I don't know if any other RDBMSes do the same (MS SQL?).
As a side note, Apache Arrow's main use case is similar, a column oriented data store shared by one-or-more client processes (Python, R, Julia, Matlab, etc.) on the same general purpose host. This is also now a key distinction between the Apple M1 and its big.LITTLE ARM SoC vs. Amazon Graviton built for server-side virtualized/containerized instances. We should not conflate the two use-cases and understand that the best solution for one use case may not be the best for the other.
Enabling some coopetition among caches by periodically adjusting their sizes, giving more RAM to the most efficient one (higher hit/miss ratio) while shrinking the least efficient one, may let the whole system 'walk' toward the sweet spot where all caches stay near their peak efficiency, or at least alleviate avoidable thrashing.
A first model might rely on memory ballooning and a PID loop. I could not find any pertinent software.
> We should not conflate the two use-cases
Exactly the point! Most hints/principles aren't generic but are stated with insufficient description of their pertinent contexts, especially on online boards and chat channels. This impacts most 'best practices' and code snippets.
O_DIRECT prevents double buffering of file data by the OS and the DBMS page cache. mmap removes the need for the DBMS page cache and relies on the OS's paging algorithm; the gains are zero memory copies and the ability for multiple processes to access the same data efficiently.
Apache Arrow takes advantage of mmap to share data across different language processes and enables fast startup for short lived processes that re-access the same OS cached data.
Yes, but the claim is that the buffer you should remove is the OS's one, not the DBMS's one, because for the DBMS use case (one very large file with deep internal structure, generally accessed by one long-running process), the DBMS has information the OS doesn't.
Arrow is a different use case, for which mmap makes sense. For something like a short-lived process that stores config or caches in SQLite, it probably is actually closer to Arrow than to (e.g.) Postgres, so mmap likely also makes sense for that. (Conversely, if you're not relying on Arrow's sharing properties and you have a big Python notebook that's doing some math on an extremely large data file on disk in a single process, you might actually get better results from O_DIRECT than mmap.)
In particular, "zero memory copy" only applies if you are accessing the same data from multiple processes (either at once or sequentially). If you have a single long-running database server, you have to copy the data from disk to RAM anyway. O_DIRECT means there's one copy, from disk to a userspace buffer; mmap means there's one copy, from disk to a kernel buffer. If you can arrange for a long-lived userspace buffer, there's no performance advantage to using the kernel buffer.
> but the claim is that the buffer you should remove is the OS's one
I was not trying to minimize O_DIRECT, I was trying to emphasize the key advantage succinctly and also explain the Apache Arrow use case of mmap which the article does not discuss.
In theory that's true. In practice, utilizing the highly-optimized already-in-kernel-mode page cache can produce tremendous performance. LMDB, for example, is screaming fast, and doesn't use DIO.
What is your metric for "screaming fast"? A user-mode cache with direct I/O can outperform any kernel-mode design several-fold. That's empirical fact, and the reason all high-performance database engines do it. I've designed systems both ways and it isn't particularly close; the technical reasons why are well-understood. Typical direct I/O designs enable macro-optimizations that are either not practical or not possible with mmap().
There was a patch set (introducing the RWF_UNCACHED flag) to get buffered IO with most of the benefits of O_DIRECT and without its drawbacks, but it looks like it hasn't landed.
There also are new options to give the kernel better page cache hints via the new MADV_COLD or MADV_PAGEOUT flags. These ones did land.
The article at https://www.scylladb.com/2017/10/05/io-access-methods-scylla... is a bit old (2017), but it explains the trade-offs.