[-stable] On Tue, Sep 5, 2017 at 5:40 PM, Brian Foster <bfoster@xxxxxxxxxx> wrote: > On Sat, Sep 02, 2017 at 06:47:03PM +0300, Amir Goldstein wrote: >> On Sat, Sep 2, 2017 at 4:19 PM, Brian Foster <bfoster@xxxxxxxxxx> wrote: >> > On Fri, Sep 01, 2017 at 06:39:25PM +0300, Amir Goldstein wrote: >> >> When calling into _xfs_log_force{,_lsn}() with a pointer >> >> to log_flushed variable, log_flushed will be set to 1 if: >> >> 1. xlog_sync() is called to flush the active log buffer >> >> AND/OR >> >> 2. xlog_wait() is called to wait on a syncing log buffers >> >> >> >> xfs_file_fsync() checks the value of log_flushed after >> >> _xfs_log_force_lsn() call to optimize away an explicit >> >> PREFLUSH request to the data block device after writing >> >> out all the file's pages to disk. >> >> >> >> This optimization is incorrect in the following sequence of events: >> >> >> >> Task A Task B >> >> ------------------------------------------------------- >> >> xfs_file_fsync() >> >> _xfs_log_force_lsn() >> >> xlog_sync() >> >> [submit PREFLUSH] >> >> xfs_file_fsync() >> >> file_write_and_wait_range() >> >> [submit WRITE X] >> >> [endio WRITE X] >> >> _xfs_log_force_lsn() >> >> xlog_wait() >> >> [endio PREFLUSH] >> >> >> >> The write X is not guarantied to be on persistent storage >> >> when PREFLUSH request in completed, because write A was submitted >> >> after the PREFLUSH request, but xfs_file_fsync() of task A will >> >> be notified of log_flushed=1 and will skip explicit flush. >> >> >> >> If the system crashes after fsync of task A, write X may not be >> >> present on disk after reboot. >> >> >> >> This bug was discovered and demonstrated using Josef Bacik's >> >> dm-log-writes target, which can be used to record block io operations >> >> and then replay a subset of these operations onto the target device. >> >> The test goes something like this: >> >> - Use fsx to execute ops of a file and record ops on log device >> >> - Every now and then fsync the file, store md5 of file and mark >> > >> >> md5 of file to stored value >> >> >> >> Cc: Christoph Hellwig <hch@xxxxxx> >> >> Cc: Josef Bacik <jbacik@xxxxxx> >> >> Cc: <stable@xxxxxxxxxxxxxxx> >> >> Signed-off-by: Amir Goldstein <amir73il@xxxxxxxxx> >> >> --- >> >> >> >> Christoph, >> >> >> >> Here is another, more subtle, version of the fix. >> >> It keeps a lot more use cases optimized (e.g. many threads doing fsync >> >> and setting WANT_SYNC may still be optimized). >> >> It also addresses your concern that xlog_state_release_iclog() >> >> may not actually start the buffer sync. >> >> >> >> I tried to keep the logic as simple as possible: >> >> If we see a buffer who is not yet SYNCING and we later >> >> see that l_last_sync_lsn is >= to the lsn of that buffer, >> >> we can safely say log_flushed. >> >> >> >> I only tested this patch through a few crash test runs, not even full >> >> xfstests cycle, so I am not sure it is correct, just posting to get >> >> your feedback. >> >> >> >> Is it worth something over the simpler v1 patch? >> >> I can't really say. >> >> >> > >> > This looks like it has a couple potential problems on a very quick look >> > (so I could definitely be mistaken). E.g., the lsn could be zero at the >> > time we set log_flushed in _xfs_log_force(). It also looks like waiting >> > on an iclog that is waiting to run callbacks due to out of order >> > completion could be interpreted as a log flush having occurred, but I >> > haven't stared at this long enough to say whether that is actually >> > possible. >> > >> > Stepping back from the details.. this seems like it could be done >> > correctly in general. IIUC what you want to know is whether any iclog >> > went from a pre-SYNCING state to a post-SYNCING state during the log >> > force, right? The drawbacks to this are that the log states do not by >> > design tell us whether devices have been flushed (landmine alert). >> > AFAICS, the last tail lsn isn't necessarily updated on every I/O >> > completion either. >> > >> > I'm really confused by the preoccupation with finding a way to keep this >> > fix localized to xfs_log_force(), as if that provides some inherent >> > advantage over fundamentally more simple code. If anything, the fact >> > that this has been broken for so long suggests that is not the case. >> > >> >> Brian, >> >> Don't let my motives confuse you, the localized approach has two reasons: >> 1. I though there may be a low hanging fix, because of already existing >> lsn counters >> 2. I lack the experience and time to make the 'correct' fix you suggested >> > > Fair enough, but note that the "correct" fix was your idea (based on > hch's patch). :) I just suggested refactoring it out of the logging code > because there's no reason it needs to be buried there. > You could also say that flush sequence counting code doesn't belong to xfs code at all. There is nothing xfs specific about it. If we had an API: flush_seq = blkdev_get_flush_seq(bdev, flush_seq); blkdev_issue_flush_after(bdev, flush_seq); I am sure it would have been useful to more fs. In fact, block drivers that use blk_insert_flush(), already have serialized flush requests, so implementing the above functionality would be quite trivial for those. I am not fluent enough in block layer to say if this makes sense. Christoph? Should we add some block people to this discussion? Amir.