On Thu, Mar 21, 2019 at 08:10:19AM +1100, Dave Chinner wrote:
> On Wed, Mar 20, 2019 at 08:02:05AM -0400, Brian Foster wrote:
> > On Wed, Mar 20, 2019 at 07:25:37AM +1100, Dave Chinner wrote:
> > > > > The broader point is that by controlling writeback ordering, we can
> > > > > explicitly demonstrate stale data exposure. If fixed properly, we should
> > > > > never expect stale data exposure even in the event of out of order
> > > > > writeback completion, regardless of the cause.
> > > >
> > > > Agree. Even if s_f_r is a giant trash fire, if we're exposing stale
> > > > disk contents it's game over anyway and we ought to fix it.
> > >
> > > Well, that's exactly the problem with SFR, isn't it? The dirty data
> > > is outside the range the user asks to sync and the implementation
> > > doesn't go anywhere near the filesystem so we can't actually do the
> > > right thing here. So we are left to either hack fixes around a
> > > shitty, broken interface and everyone pays the penalty, or we ignore
> > > it. We've simply ignored it for the past 10+ years because it really
> > > can't be used safely for its intended purposes (as everyone who
> > > tries to use it finds out eventually)...
> > >
> >
> > Just to make this abundantly clear, there is nothing unique, magic or
> > special required of sync_file_range() to reproduce this stale data
> > exposure. See the following variation of a command sequence from the
> > fstest I posted the other day:
> >
> > ...
> > # xfs_io -fc "pwrite 0 68k" -c fsync -c "pwrite 96k 4k" \
> >     -c "fpunch 96k 4k" <file>
> > ...
> > # hexdump <file>
> > 0000000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd
> > *
> > 0011000 abab abab abab abab abab abab abab abab
> > *
> > 0018000 0000 0000 0000 0000 0000 0000 0000 0000
> > *
> > 0019000
> >
> > Same problem, and it can be reproduced just as easily with a reflink or
> > even an overlapping direct I/O.
>
> How does overlapping direct IO cause stale data exposure given that
> it uses unwritten extent allocation? That, of course, is ignoring
> the fact that the result of overlapping concurrent direct IO is, by
> definition, undefined and an application bug if it occurs....
>

It's overlapping a buffered write and so starts with a flush instead of
an unwritten allocation (so even a dio read is sufficient). Also, the
target of the dio isn't where stale data is exposed, so even if there
was an existing underlying unwritten block before the initial buffered
write it wouldn't change anything.

Yes, this is technically usage with undefined behavior, but that still
misses the forest for the trees. I'm not claiming this is good or
supported practice, just as I'm not claiming sync_file_range() should be
used for anything in practice. I'm just pointing out that it's one of
several possible, obvious data exposure vectors. I suspect there are
others; these were just the easy variants to reproduce.

If you prefer to reason about reproducer patterns that don't use
operations with quirks or undefined behaviors, that's fine; just use the
reflink or hole punch example posted above, which produces the same
behavior.
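
For illustration, the dio read variant I'm referring to would look
something like the following. This is only an untested sketch that
reuses the offsets and <file> placeholder from the hole punch example
above, not a sequence I've run verbatim:

# xfs_io -fc "pwrite 0 68k" -c fsync -c "pwrite 96k 4k" \
    -c "open -d <file>" -c "pread 96k 4k" <file>

The idea is that the O_DIRECT pread of the 96k-100k range should force
writeback of just that sub-range before the read is issued, which is
the same partial flush the fpunch performs above, and so opens the same
out of order writeback window.
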

> > I haven't confirmed, but I suspect with
> > enough effort background writeback is susceptible to this problem as
> > well.
>
> Well, the similarity here is with the NULL files problems and using
> XFS_ITRUNCATE case to avoid NULL files after truncating them down.
> i.e. if that flag is set, we do extra flushing where appropriate to
> ensure the data at risk of being lost hits the disk before the loss
> mechanism can be triggered.
>
> i.e. this is the mirror equivalent of XFS_ITRUNCATE - it is the
> truncate-up flushing case in that we need to wait for the zeroing to
> hit the disk before we change the file size.
>
> Essentially, what this says to me is we need to track when we
> extended the file via a write beyond EOF via an inode flag. Then if we need to
> flush a data range from the file for integrity reasons (such as a
> fpunch) and this flag is set, we ignore the "start" parameter of the
> range and flush from the start of the file instead. i.e. we
> guarantee that any dirty zeroed pages between the start of the file
> and the end of the range we are operating on get written back. This
> will capture all these "didn't flush regions zeroed on file
> extension" problems in one go, without needing to change the way we
> do allocation or EOF zeroing.
>

I don't think that would prevent the problem outside of the common
write extension case. If the file is sparse (i.e. truncated up ahead of
time) such that the file size is on disk, then any flush sequence and
block allocation within EOF is susceptible to stale data exposure in
the window between when the extent is allocated and when writeback
completes over the entire extent. This variant is reproducible even
with a plain old fsync() call; it's just hard enough to reproduce that
it requires instrumentation (as opposed to the test sequences I've
posted in this thread so far).

It may be possible to selectively (i.e. as an optimization) use an
XFS_ITRUNCATED-like algorithm as you propose here to always force
in-order writeback in the cases where writeback extends the on-disk
isize (and thus isize protects from stale exposure), but I think we'd
still need something tied to extent allocation one way or another in
addition (like Darrick's original unwritten extent patch or your
proposed intent/done idea in the previous reply) to address the
fundamental problem. The question is what the ideal solution is...

Brian

> I suspect we could just use the XFS_ITRUNCATE flag for this - set it
> in the EOF zeroing code, check it in xfs_flush_unmap_range() and
> other places that do range based flushing....
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx