On Fri, Apr 05, 2019 at 07:24:44AM -0400, Brian Foster wrote:
> On Fri, Apr 05, 2019 at 09:08:45AM +1100, Dave Chinner wrote:
> > On Thu, Apr 04, 2019 at 08:50:54AM -0400, Brian Foster wrote:
> > > On Wed, Apr 03, 2019 at 03:38:09PM -0700, Darrick J. Wong wrote:
> > > > On Fri, Mar 22, 2019 at 08:46:15AM -0400, Brian Foster wrote:
> > > > > On Fri, Mar 22, 2019 at 12:52:40PM +1100, Dave Chinner wrote:
> > > > > > On Thu, Mar 21, 2019 at 08:27:40AM -0400, Brian Foster wrote:
> > > > > > > On Thu, Mar 21, 2019 at 08:10:19AM +1100, Dave Chinner wrote:
> > > > > > > > On Wed, Mar 20, 2019 at 08:02:05AM -0400, Brian Foster wrote:
> > > > > > > > > On Wed, Mar 20, 2019 at 07:25:37AM +1100, Dave Chinner wrote:
> > > > > > > > > > > > The broader point is that by controlling writeback ordering, we can
> > > > > > > > > > > > explicitly demonstrate stale data exposure. If fixed properly, we should
> > > > > > > > > > > > never expect stale data exposure even in the event of out of order
> > > > > > > > > > > > writeback completion, regardless of the cause.
> > > > > > > > > > >
> > > > > > > > > > > Agree. Even if s_f_r is a giant trash fire, if we're exposing stale
> > > > > > > > > > > disk contents it's game over anyway and we ought to fix it.
> > > > > > > > > >
> > > > > > > > > > Well, that's exactly the problem with SFR, isn't it? The dirty data
> > > > > > > > > > is outside the range the user asks to sync and the implementation
> > > > > > > > > > doesn't go anywhere near the filesystem so we can't actually do the
> > > > > > > > > > right thing here. So we are left to either hack fixes around a
> > > > > > > > > > shitty, broken interface and everyone pays the penalty, or we ignore
> > > > > > > > > > it. We've simply ignored it for the past 10+ years because it really
> > > > > > > > > > can't be used safely for its intended purposes (as everyone who
> > > > > > > > > > tries to use it finds out eventually)...
> > > > > > > > >
> > > > > > > > > Just to make this abundantly clear, there is nothing unique, magic or
> > > > > > > > > special required of sync_file_range() to reproduce this stale data
> > > > > > > > > exposure. See the following variation of a command sequence from the
> > > > > > > > > fstest I posted the other day:
> > > > > > > > >
> > > > > > > > > ...
> > > > > > > > > # xfs_io -fc "pwrite 0 68k" -c fsync -c "pwrite 96k 4k" \
> > > > > > > > >     -c "fpunch 96k 4k" <file>
> > > > > > > > > ...
> > > > > > > > > # hexdump <file>
> > > > > > > > > 0000000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd
> > > > > > > > > *
> > > > > > > > > 0011000 abab abab abab abab abab abab abab abab
> > > > > > > > > *
> > > > > > > > > 0018000 0000 0000 0000 0000 0000 0000 0000 0000
> > > > > > > > > *
> > > > > > > > > 0019000
> > > > > > > > >
> > > > > > > > > Same problem, and it can be reproduced just as easily with a reflink or
> > > > > > > > > even an overlapping direct I/O.
> > > > > > > >
> > > > > > > > How does overlapping direct IO cause stale data exposure given that
> > > > > > > > it uses unwritten extent allocation? That, of course, is ignoring
> > > > > > > > the fact that the result of overlapping concurrent direct IO is, by
> > > > > > > > definition, undefined and an application bug if it occurs....
> > > > > > >
> > > > > > > It's overlapping a buffered write and so starts with a flush instead of
> > > > > >
> > > > > > Ok, but that's not the normal usage of the phrase "overlapping
> > > > > > direct IO". Overlapping direct IO means two concurrent direct IOs
> > > > > > to the same LBA range in the device underlying the filesystem. i.e.
> > > > > > they overlap in both the temporal and physical (device LBA) domains.
> > > > > >
> > > > > > Subtle difference, yes, but a very important one.
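
(As an aside, the xfs_io reproducer quoted above is nothing exotic, just a
handful of ordinary syscalls. A rough, untested sketch of the same sequence;
the path is a placeholder and error handling is omitted. The 0xcd fill just
matches the data pattern visible in the hexdump above.)

#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	char buf[68 * 1024];
	int fd = open("/mnt/scratch/file", O_RDWR | O_CREAT | O_TRUNC, 0644);

	memset(buf, 0xcd, sizeof(buf));
	pwrite(fd, buf, 68 * 1024, 0);			/* pwrite 0 68k */
	fsync(fd);					/* fsync */
	pwrite(fd, buf, 4 * 1024, 96 * 1024);		/* pwrite 96k 4k */
	fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
		  96 * 1024, 4 * 1024);			/* fpunch 96k 4k */
	close(fd);
	return 0;
}

Whether stale data actually becomes visible then depends on writeback and
completion ordering, as discussed above; the point is only that nothing
beyond plain buffered writes, fsync and a hole punch is involved.
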
> > > > >
> > > > > Yeah, I just referenced it in passing and was unclear. The point is just
> > > > > that there are multiple vectors to the problem, some of which are sane
> > > > > sequences and some apparently not, and I'd rather not get caught up in
> > > > > the minutiae of why some of those sequences are not sane. I'm trying to
> > > > > see if we can find a solution to the fundamental data exposure problem.
> > > >
> > > > Is there a way to increase the amount of writeback? Say for example
> > > > that we have a delalloc reservation for blocks 0-9, and writeback asks
> > > > us to write back block 7. If we succeed at converting the entire
> > > > 10-block reservation into a real extent, why can't we come back and say
> > > > "Hey, we also need to flush the pages for blocks 0-6,8-9"? Are we going
> > > > to need something like that for blocksize > pagesize filesystems too?
> > >
> > > That might be possible in various places. In fact, I think we used to do
> > > something like this in XFS (perhaps not for this purpose). I just jumped
> > > back to v2.6.32 and see that the xfs_cluster_write() mechanism still
> > > exists at that point down in the ->writepage() path.
> >
> > We /really/ don't want to go back to that. This caused us to have to
> > essentially implement write_cache_pages() inside ->writepage.
>
> Agreed.
>
> > If we want to expand the writeback range (i.e. move the offset we
> > start at), then we need to do that in xfs_vm_writepages(), before we
> > call write_cache_pages(). This means all the tagged writepages stuff
> > will continue to work, and we don't end up doing redundant page
> > cache lookups because we're trying to write back the same range from
> > two different places.
>
> IMO, messing around with writeback like this is a mitigation and/or a
> contributor to a broader solution, not a fundamental fix by itself. For
> example, I could see doing something like that if we already put proper
> writeback vs. append transaction ordering in place and we identify
> certain corner cases where we want to prevent delaying an on-disk i_size
> update for too long because a user indirectly triggered writeback of a
> high index page.
>
> > > I'm not sure if doing that alone is worth the effort if you consider
> > > it's more of a mitigation than a solution to the underlying problem.
> > > Once we convert the associated delalloc extent, we're open to stale data
> > > exposure (assuming we're not protected by on-disk EOF and whatnot) until
> > > all of the remaining pages over the extent are written back, whether
> > > that occurs in the current writeback cycle or a subsequent one.
> >
> > If we were to do a read iomap lookup in xfs_vm_writepages() we could
> > determine if the delalloc conversion would allocate a range outside
> > the current write that is being flushed, and modify the
> > writeback control range/offset to match the delalloc extent.
> >
> > However, this runs the risk of breaking all the fairness algorithms
> > in high level writeback, so we'd need to be very careful about how
> > we extend the write range for writeback inside EOF....
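
As a strawman, the range-expansion idea boils down to something like the
toy model below. This is purely illustrative userspace code, not proposed
xfs_vm_writepages() changes; the extent and block numbers are invented to
match the blocks 0-9 / block 7 example above, and the fairness caveat just
noted is not addressed here.

#include <stdio.h>

struct ext {			/* a delalloc extent, in file block numbers */
	long long start;
	long long end;		/* inclusive */
};

/*
 * Widen the requested flush range [*wb_start, *wb_end] to cover the whole
 * delalloc extent backing it, so that delalloc conversion and page
 * writeback cover the same range.
 */
static void widen_writeback_range(const struct ext *del,
				  long long *wb_start, long long *wb_end)
{
	if (del->start < *wb_start)
		*wb_start = del->start;
	if (del->end > *wb_end)
		*wb_end = del->end;
}

int main(void)
{
	/* delalloc reservation over blocks 0-9, writeback asks for block 7 */
	struct ext del = { .start = 0, .end = 9 };
	long long wb_start = 7, wb_end = 7;

	widen_writeback_range(&del, &wb_start, &wb_end);
	printf("flush blocks %lld-%lld\n", wb_start, wb_end);	/* 0-9 */
	return 0;
}
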
> Indeed. The more I think about this, the more I think we'd have an
> easier time trying to schedule the on-disk i_size update vs. trying to
> control the size and range of writeback in all possible scenarios. The
> former would still require non-trivial tracking and logic, but that's
> something the writeback control discussed above could actually help
> with.
>
> For example, we don't currently have any way to track "stale" state of
> just allocated but not yet fully written extents. If we established a
> rule that we always wrote back such extents in their entirety and via a
> single ioend, then the writeback completion code would simply need to
> check for delalloc extents within in-core eof to determine when it's
> safe to update on-disk eof. If there are cases where we determine we
> can't write back an entire delalloc extent because of some fairness
> reasons or whatever, we could decide to only physically allocate the
> subset of that delalloc extent that we can write back rather than the
> whole thing (trading off stale data exposure for some increased
> fragmentation). That way writeback completion still sees the existence
> of delalloc blocks and knows to hold off on the on-disk size update.
>
> One challenge with an approach like that is that if we skip
> completion-time on-disk i_size updates, we'd have to make sure those
> updates still happen eventually if the address space happens to change
> outside of writeback. For example, if those remaining dirty delalloc
> pages somehow end up invalidated and never become real blocks, we may
> never see another writeback cycle on the inode and we'd need to update
> the on-disk size from some other context (in a timely fashion).
>
> Of course this only deals with the append case. We'd still need to
> select a mechanism for dealing with inside EOF allocations (such as the
> whole write-intent thing or unwritten extents). To me, the intent thing
> still seems like the most simple/elegant option we've discussed so far
> because of its potential for low impact. Part of me thinks that if we
> determine we'll fall back to that approach for certain cases anyway
> (like inside EOF allocs), it might be worth just doing that first and
> evaluating whether to use it unconditionally.

/me wonders if we could solve this by using a bitmap to track all the
in-flight post-ondisk-eof writeback, such that when a particular
writeback completes it can check whether any writeback at a lower offset
is still in progress, and only update the ondisk eof when there isn't?

(Where "a bitmap or something" could just use another instance of Dave's
range lock structure...)

--D

> Brian
>
> > Cheers,
> >
> > Dave.
> > --
> > Dave Chinner
> > david@xxxxxxxxxxxxx
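
P.S. To make the completion-ordering idea above a bit more concrete, here
is a toy userspace model of it. All names and the fixed-size table are
invented for illustration, not actual kernel code; it also ignores dirty
ranges that have not been submitted yet and all locking.

#include <stdio.h>

#define MAX_WB	16

struct wb { long long start, end; int inflight; };	/* [start, end) */

static struct wb wbs[MAX_WB];
static int nr_wbs;
static long long ondisk_eof;
static long long completed_end;		/* highest completed end offset */

static int wb_submit(long long start, long long end)
{
	wbs[nr_wbs] = (struct wb){ start, end, 1 };
	return nr_wbs++;
}

/*
 * A post-ondisk-eof writeback completed: only move the on-disk EOF
 * forward past offsets that no in-flight writeback still covers.
 */
static void wb_complete(int idx)
{
	long long new_eof;
	int i;

	wbs[idx].inflight = 0;
	if (wbs[idx].end > completed_end)
		completed_end = wbs[idx].end;

	new_eof = completed_end;
	for (i = 0; i < nr_wbs; i++)
		if (wbs[i].inflight && wbs[i].start < new_eof)
			new_eof = wbs[i].start;

	if (new_eof > ondisk_eof)
		ondisk_eof = new_eof;
}

int main(void)
{
	int a = wb_submit(0, 68 * 1024);		/* lower extension */
	int b = wb_submit(96 * 1024, 100 * 1024);	/* higher extension */

	wb_complete(b);		/* completes out of order... */
	printf("after b: ondisk_eof = %lld\n", ondisk_eof);	/* still 0 */

	wb_complete(a);		/* ...lower range done, safe to expose both */
	printf("after a: ondisk_eof = %lld\n", ondisk_eof);	/* 102400 */
	return 0;
}
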