Hi,

On 2018-04-19 02:12:33 +0000, Trond Myklebust wrote:
> On Wed, 2018-04-18 at 18:57 -0700, Matthew Wilcox wrote:
> > On Thu, Apr 19, 2018 at 01:47:49AM +0000, Trond Myklebust wrote:
> > > If the main use case is something like Postgresql, where you care
> > > about just one or two critical files, rather than monitoring the
> > > entire filesystem, could we perhaps use a dedicated mmap() mode?
> > > It should be possible to throw up a bitmap that displays the exact
> > > blocks or pages that are affected, once the file has been damaged.
> >
> > Perhaps we need to have a quick summary of the postgres problem ...
> > they're not concerned with "one or two files", otherwise they could
> > just keep those files open and the wb_err mechanism would work fine.
> > The problem is that they have too many files to keep open in their
> > checkpointer process, and when they come along and open the files,
> > they don't see the error.
>
> I thought I understood that there were at least two issues here:
> 1) Monitoring lots of files to figure out which ones may have an error.
> 2) Drilling down to see what might be wrong with an individual file.
>
> Unless you are in a situation where you can have millions of files all
> go wrong at the same time, it would seem that the former is the
> operation that needs to scale. Once you're talking about large numbers
> of files all getting errors, it would appear that an fsck-like
> recovery would be necessary. Am I wrong?

Well, the correctness issue really only centers around 1). Currently
there are scenarios (some made less, some made more likely by the
errseq_t changes) where we don't notice IO errors. The result can
either be that we wrongly report back to the client that a "COMMIT;"
was successful even though it wasn't persisted, or that we throw away
journal data because we think a checkpoint was successful even though
it wasn't.

To fix the correctness issue we really only need 1). That said, it'd
obviously be nice to be able to report a decent error pointing to the
individual files affected, and a more descriptive error message than
"PANIC: An IO error occurred somewhere. Perhaps look in the kernel
logs?" wouldn't hurt either.

To give a short overview of how PostgreSQL issues fsyncs and does the
surrounding buffer management:

1) There's a traditional journal (WAL), addressed by LSN. Every
   modification needs to be in the WAL first, before buffers (and thus
   on-disk data) can be modified.

2) There's a postgres-internal buffer cache. Pages are tagged with the
   WAL LSN that needs to be flushed to disk before the page itself can
   be written back.

3) Reads and writes between the OS and the buffer cache are done using
   buffered IO. There are valid reasons to change that, but it'll
   require new infrastructure. Each process has a limited-size
   path -> fd cache.

4) Buffers are written out by:
   - the checkpointing process during checkpoints
   - the background writer, which attempts to keep some "victim"
     buffers clean
   - backends (associated with client connections) when they have to
     reuse dirty victim buffers

   Whenever such a writeout happens, information about the file
   containing that dirty buffer is forwarded to the checkpointer. The
   checkpointer keeps track, in a hashmap, of each file that'll need
   to be fsynced.

   It's worth noting that each table / index / whatnot is a separate
   file, and that large relations are segmented into 1GB segments. So
   it's pretty common to have tens of thousands of files in a larger
   database.
5) During checkpointing, which is paced in most cases and will often be
   configured to take 30-60min, all buffers dirtied before the start of
   the checkpoint are written out. We'll issue
   sync_file_range(SYNC_FILE_RANGE_WRITE) requests occasionally to keep
   the amount of dirty kernel buffers under control. After that we'll
   fsync each of the dirty files. Once that and some other boring
   internal stuff has succeeded, we'll issue a checkpoint record and
   allow discarding WAL from before the checkpoint.

Because we cannot realistically keep each of the files open between 4)
and the end of 5), and because the fds used in 4) are not the same as
the ones used in 5) (different processes), we currently aren't
guaranteed notification of writeback failures. (There's a rough sketch
of this pattern in the PS below.)

Realistically we're not going to do much file-specific handling when
there are errors. Either a retry is going to fix the issue (oops, right
now it won't, because the error has been "eaten"), or we're doing a
crash-recovery cycle from the WAL (oops, because we don't necessarily
even know an error occurred).

It's worth noting that for us syncfs() is better than nothing, but it's
not perfect. It's pretty common to have temporary files (sort spool
files, temporary tables, ...) on the same filesystem as the persistent
database, so syncfs() has the potential to flush out a lot of
unnecessary dirty data. Note that it'd be very unlikely for the temp
data files to be moved to DIO - it's *good* that the kernel manages the
amount of dirty / cached data. It has a heck of a lot more knowledge
about how much memory pressure the system is under than postgres ever
will have.

One reason we've been concerned about DIO, besides some architectural
issues inside PG, is along similar lines. A lot of people use databases
as part of their stack without focusing on them, which usually means
the database will be largely untuned. With buffered IO that's not so
bad; the kernel will dynamically adapt to some extent. With DIO the
consequences of a mistuned buffer cache size or the like are far worse.
DIO is good for critical databases maintained by dedicated people, not
so good outside of that.

Matthew, I'm not sure what kind of summary you had in mind. Please let
me know if you want more detail in any of the areas - happy to expand.

Greetings,

Andres Freund
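
PS: To make the split between steps 4) and 5) more concrete, here's a
rough sketch in C of the pattern described above. This is *not*
PostgreSQL source; the function names (write_out_buffer,
checkpoint_fsync), paths, buffering and error handling are made up /
heavily simplified for illustration. The point it tries to show is that
the fd doing the writes and the fd doing the fsync live in different
processes at different times, so a writeback error that is only
reported to fds open at the time of the failure falls into the gap
between the writer's close() and the checkpointer's open().

/* Rough illustrative sketch, not PostgreSQL code. */
#define _GNU_SOURCE             /* for sync_file_range() */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Backend / bgwriter side: write a dirty buffer, nudge writeback to
 * keep the amount of dirty kernel memory bounded, close the fd, and
 * remember that this file will need an fsync at the next checkpoint. */
static void write_out_buffer(const char *path, const void *buf,
                             size_t len, off_t off)
{
    int fd = open(path, O_WRONLY);

    if (fd < 0 || pwrite(fd, buf, len, off) != (ssize_t) len)
    {
        perror(path);
        exit(EXIT_FAILURE);
    }

    /* Asynchronous: starts writeback of the range but doesn't wait for
     * it, and doesn't report IO errors resulting from it. */
    (void) sync_file_range(fd, off, len, SYNC_FILE_RANGE_WRITE);

    close(fd);
    /* ... forward "path needs fsync" to the checkpointer's hashmap ... */
}

/* Checkpointer side: a different process, with a freshly opened fd.
 * For postgres to be correct, this fsync must not silently succeed if
 * writeback of earlier writes to this file failed, even though that
 * failure predates this open(). */
static int checkpoint_fsync(const char *path)
{
    int fd = open(path, O_WRONLY);
    int rc, save_errno;

    if (fd < 0)
        return -1;

    rc = fsync(fd);
    save_errno = errno;
    close(fd);
    errno = save_errno;

    return rc;
}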