Re: [PATCH v2] xfs: fix incorrect log_flushed on fsync

Amir Goldstein <amir73il@xxxxxxxxx> · Sat, 2 Sep 2017 18:47:03 +0300

On Sat, Sep 2, 2017 at 4:19 PM, Brian Foster <bfoster@xxxxxxxxxx> wrote:
> On Fri, Sep 01, 2017 at 06:39:25PM +0300, Amir Goldstein wrote:
>> When calling into _xfs_log_force{,_lsn}() with a pointer
>> to log_flushed variable, log_flushed will be set to 1 if:
>> 1. xlog_sync() is called to flush the active log buffer
>> AND/OR
>> 2. xlog_wait() is called to wait on a syncing log buffers
>>
>> xfs_file_fsync() checks the value of log_flushed after
>> _xfs_log_force_lsn() call to optimize away an explicit
>> PREFLUSH request to the data block device after writing
>> out all the file's pages to disk.
>>
>> This optimization is incorrect in the following sequence of events:
>>
>>  Task A                    Task B
>>  -------------------------------------------------------
>>  xfs_file_fsync()
>>    _xfs_log_force_lsn()
>>      xlog_sync()
>>         [submit PREFLUSH]
>>                            xfs_file_fsync()
>>                              file_write_and_wait_range()
>>                                [submit WRITE X]
>>                                [endio  WRITE X]
>>                              _xfs_log_force_lsn()
>>                                xlog_wait()
>>         [endio  PREFLUSH]
>>
>> The write X is not guarantied to be on persistent storage
>> when PREFLUSH request in completed, because write A was submitted
>> after the PREFLUSH request, but xfs_file_fsync() of task A will
>> be notified of log_flushed=1 and will skip explicit flush.
>>
>> If the system crashes after fsync of task A, write X may not be
>> present on disk after reboot.
>>
>> This bug was discovered and demonstrated using Josef Bacik's
>> dm-log-writes target, which can be used to record block io operations
>> and then replay a subset of these operations onto the target device.
>> The test goes something like this:
>> - Use fsx to execute ops of a file and record ops on log device
>> - Every now and then fsync the file, store md5 of file and mark
>
>>   md5 of file to stored value
>>
>> Cc: Christoph Hellwig <hch@xxxxxx>
>> Cc: Josef Bacik <jbacik@xxxxxx>
>> Cc: <stable@xxxxxxxxxxxxxxx>
>> Signed-off-by: Amir Goldstein <amir73il@xxxxxxxxx>
>> ---
>>
>> Christoph,
>>
>> Here is another, more subtle, version of the fix.
>> It keeps a lot more use cases optimized (e.g. many threads doing fsync
>> and setting WANT_SYNC may still be optimized).
>> It also addresses your concern that xlog_state_release_iclog()
>> may not actually start the buffer sync.
>>
>> I tried to keep the logic as simple as possible:
>> If we see a buffer who is not yet SYNCING and we later
>> see that l_last_sync_lsn is >= to the lsn of that buffer,
>> we can safely say log_flushed.
>>
>> I only tested this patch through a few crash test runs, not even full
>> xfstests cycle, so I am not sure it is correct, just posting to get
>> your feedback.
>>
>> Is it worth something over the simpler v1 patch?
>> I can't really say.
>>
>
> This looks like it has a couple potential problems on a very quick look
> (so I could definitely be mistaken). E.g., the lsn could be zero at the
> time we set log_flushed in _xfs_log_force(). It also looks like waiting
> on an iclog that is waiting to run callbacks due to out of order
> completion could be interpreted as a log flush having occurred, but I
> haven't stared at this long enough to say whether that is actually
> possible.
>
> Stepping back from the details.. this seems like it could be done
> correctly in general. IIUC what you want to know is whether any iclog
> went from a pre-SYNCING state to a post-SYNCING state during the log
> force, right? The drawbacks to this are that the log states do not by
> design tell us whether devices have been flushed (landmine alert).
> AFAICS, the last tail lsn isn't necessarily updated on every I/O
> completion either.
>
> I'm really confused by the preoccupation with finding a way to keep this
> fix localized to xfs_log_force(), as if that provides some inherent
> advantage over fundamentally more simple code. If anything, the fact
> that this has been broken for so long suggests that is not the case.
>

Brian,

Don't let my motives confuse you, the localized approach has two reasons:
1. I though there may be a low hanging fix, because of already existing
    lsn counters
2. I lack the experience and time to make the 'correct' fix you suggested

> I'll reiterate my previous comment.. if we want to track device flush
> submits+completes, please just track them directly in the buftarg using
> the existing buffer flush flag and the common buffer
> submission/completion paths that we already use for this kind of generic
> mechanism (e.g., in-flight async I/O accounting, read/write verifiers).
> I don't really see any benefit to this, at least until/unless we find
> some reason to rule out the other approach.
>

If I wasn't clear, my patch was not meant to object to your comment,
just to display the alternative. If someone else posts a 'proper' patch
I will test it with crash simulator. I recon that's not going to be
for rc1 anyway.

Cheers,
Amir.