On Thu, Apr 04, 2019 at 08:50:54AM -0400, Brian Foster wrote:
> On Wed, Apr 03, 2019 at 03:38:09PM -0700, Darrick J. Wong wrote:
> > On Fri, Mar 22, 2019 at 08:46:15AM -0400, Brian Foster wrote:
> > > On Fri, Mar 22, 2019 at 12:52:40PM +1100, Dave Chinner wrote:
> > > > On Thu, Mar 21, 2019 at 08:27:40AM -0400, Brian Foster wrote:
> > > > > On Thu, Mar 21, 2019 at 08:10:19AM +1100, Dave Chinner wrote:
> > > > > > On Wed, Mar 20, 2019 at 08:02:05AM -0400, Brian Foster wrote:
> > > > > > > On Wed, Mar 20, 2019 at 07:25:37AM +1100, Dave Chinner wrote:
> > > > > > > > > > The broader point is that by controlling writeback
> > > > > > > > > > ordering, we can explicitly demonstrate stale data
> > > > > > > > > > exposure. If fixed properly, we should never expect
> > > > > > > > > > stale data exposure even in the event of out-of-order
> > > > > > > > > > writeback completion, regardless of the cause.
> > > > > > > > >
> > > > > > > > > Agree. Even if s_f_r is a giant trash fire, if we're
> > > > > > > > > exposing stale disk contents it's game over anyway and
> > > > > > > > > we ought to fix it.
> > > > > > > >
> > > > > > > > Well, that's exactly the problem with SFR, isn't it? The
> > > > > > > > dirty data is outside the range the user asks to sync and
> > > > > > > > the implementation doesn't go anywhere near the filesystem,
> > > > > > > > so we can't actually do the right thing here. So we are
> > > > > > > > left to either hack fixes around a shitty, broken interface
> > > > > > > > and everyone pays the penalty, or we ignore it. We've
> > > > > > > > simply ignored it for the past 10+ years because it really
> > > > > > > > can't be used safely for its intended purposes (as everyone
> > > > > > > > who tries to use it finds out eventually)...
> > > > > > >
> > > > > > > Just to make this abundantly clear: there is nothing unique,
> > > > > > > magic or special required of sync_file_range() to reproduce
> > > > > > > this stale data exposure. See the following variation of a
> > > > > > > command sequence from the fstest I posted the other day:
> > > > > > >
> > > > > > > ...
> > > > > > > # xfs_io -fc "pwrite 0 68k" -c fsync -c "pwrite 96k 4k" \
> > > > > > >     -c "fpunch 96k 4k" <file>
> > > > > > > ...
> > > > > > > # hexdump <file>
> > > > > > > 0000000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd
> > > > > > > *
> > > > > > > 0011000 abab abab abab abab abab abab abab abab
> > > > > > > *
> > > > > > > 0018000 0000 0000 0000 0000 0000 0000 0000 0000
> > > > > > > *
> > > > > > > 0019000
> > > > > > >
> > > > > > > Same problem, and it can be reproduced just as easily with a
> > > > > > > reflink or even an overlapping direct I/O.
> > > > > >
> > > > > > How does overlapping direct IO cause stale data exposure given
> > > > > > that it uses unwritten extent allocation? That, of course, is
> > > > > > ignoring the fact that the result of overlapping concurrent
> > > > > > direct IO is, by definition, undefined and an application bug
> > > > > > if it occurs....
> > > > >
> > > > > It's overlapping a buffered write and so starts with a flush
> > > > > instead of
> > > >
> > > > Ok, but that's not the normal usage of the phrase "overlapping
> > > > direct IO". Overlapping direct IO means two concurrent direct IOs
> > > > to the same LBA range in the device underlying the filesystem,
> > > > i.e. they overlap in both the temporal and physical (device LBA)
> > > > domains.
> > > >
> > > > Subtle difference, yes, but a very important one.
> > >
> > > Yeah, I just referenced it in passing and was unclear. The point is
> > > just that there are multiple vectors to the problem, some of which
> > > are sane sequences and some apparently not, and I'd rather not get
> > > caught up in the minutiae of why some of those sequences are not
> > > sane. I'm trying to see if we can find a solution to the
> > > fundamental data exposure problem.
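
For the record, nothing in that sequence is specific to xfs_io either;
it's a plain syscall sequence, with fpunch mapping to
fallocate(FALLOC_FL_PUNCH_HOLE) and the cdcd pattern being xfs_io's
default pwrite fill byte. A minimal C sketch of just the commands
shown -- the "..." steps of the fstest (which presumably seed the disk
with the abab bytes and later reveal them) stay elided here, and error
handling is trimmed:

	/*
	 * Sketch of the xfs_io sequence above as raw syscalls.  Only the
	 * visible commands are reproduced; the elided "..." steps of the
	 * fstest are what actually expose the stale data afterwards.
	 */
	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <string.h>
	#include <unistd.h>
	#include <linux/falloc.h>

	int main(int argc, char **argv)
	{
		char buf[4096];
		off_t off;
		int fd;

		if (argc < 2)
			return 1;
		fd = open(argv[1], O_RDWR | O_CREAT, 0644);

		memset(buf, 0xcd, sizeof(buf));
		for (off = 0; off < 68 * 1024; off += sizeof(buf))
			pwrite(fd, buf, sizeof(buf), off);	/* pwrite 0 68k */
		fsync(fd);					/* fsync */
		pwrite(fd, buf, sizeof(buf), 96 * 1024);	/* pwrite 96k 4k */
		fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
			  96 * 1024, 4 * 1024);			/* fpunch 96k 4k */
		close(fd);
		return 0;
	}
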
> > Is there a way to increase the amount of writeback? Say for example
> > that we have a delalloc reservation for blocks 0-9, and writeback
> > asks us to write back block 7. If we succeed at converting the
> > entire 10-block reservation into a real extent, why can't we come
> > back and say "Hey, we also need to flush the pages for blocks
> > 0-6,8-9"? Are we going to need something like that for
> > blocksize > pagesize filesystems too?
>
> That might be possible in various places. In fact, I think we used to
> do something like this in XFS (perhaps not for this purpose). I just
> jumped back to v2.6.32 and see that the xfs_cluster_write() mechanism
> still exists at that point down in the ->writepage() path.
>
> I'm not sure if doing that alone is worth the effort if you consider
> it's more of a mitigation than a solution to the underlying problem.
> Once we convert the associated delalloc extent, we're open to stale
> data exposure (assuming we're not protected by on-disk EOF and
> whatnot) until all of the remaining pages over the extent are written
> back, whether that occurs in the current writeback cycle or a
> subsequent one.
>
> I suppose it might make sense in service to another solution, such as
> writing back everything preceding a page that is going to update
> on-disk EOF (which is something I think Dave suggested earlier as an
> incremental improvement, I've kind of lost track). We'd probably just
> have to make sure that the on-disk eof can never get ahead of the
> lower index writeback pages by I/O reordering or whatever..

...which would be easi(er) to handle if we could chain all the
file-extending writeback bios to a single ioend like you suggested in
the per-inode ioend completion thread, right?

Though such a beast would still be racy, I think, because background
writeback could start writing back a /single/ page from the middle of
the post-eof range before the integrity writeback flushes everything
else, and if the integrity writeback completes before the background
writeback does, we're still exposed. I think? Dunno, maybe the
per-inode completion can help here, since at least if the completions
all come in at the same time we'll reorder and combine the metadata
work, though that's not a full solution either.

> > > > > an unwritten allocation (so even a dio read is sufficient).
> > > > > Also, the target of the dio isn't where stale data is exposed,
> > > > > so even if there was an existing underlying unwritten block
> > > > > before the initial buffered write it wouldn't change anything.
> > > > >
> > > > > Yes, this is technically usage with undefined behavior, but
> > > > > that still misses the forest for the trees.
> > > >
> > > > No, I'm not missing the forest for the trees.
> > > >
> > > > > I'm not claiming this is good or supported practice, just as
> > > > > I'm not claiming sync_file_range() should be used for anything
> > > > > in practice. I'm just pointing out it's one of several
> > > > > possible, obvious data exposure vectors. I suspect there are
> > > > > others; these were just the easy variants to reproduce.
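
Coming back to the "flush the rest of what we just converted" idea for
a second, the shape I have in mind is roughly the sketch below. Purely
illustrative: the helper name is made up, it's not the real call
chain, and a real version would have to avoid recursing into writeback
from the writeback path itself (usual fs/xfs includes assumed):

	/*
	 * Hypothetical sketch: after converting an entire delalloc
	 * reservation to a real extent on behalf of a single page, kick
	 * asynchronous writeback for the rest of the pages backed by that
	 * reservation so the window where the extent maps not-yet-written
	 * blocks is as short as we can make it.
	 */
	static void
	xfs_writeback_converted_range(
		struct xfs_inode	*ip,
		xfs_fileoff_t		offset_fsb,	/* start of converted extent */
		xfs_filblks_t		count_fsb)	/* length of converted extent */
	{
		struct address_space	*mapping = VFS_I(ip)->i_mapping;
		loff_t			start = XFS_FSB_TO_B(ip->i_mount, offset_fsb);
		loff_t			end = XFS_FSB_TO_B(ip->i_mount,
							   offset_fsb + count_fsb) - 1;

		/* async: we only need the pages in flight, not completed */
		filemap_fdatawrite_range(mapping, start, end);
	}
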
> > > > I'm aware that we have these problems - we've known about them
> > > > for 20+ years and the many ways they can be exposed. Indeed,
> > > > we've been talking about using unwritten extents for delalloc
> > > > for as long as I can remember, but never done it because the
> > > > actual occurrence of these problems in production systems is
> > > > /extremely/ rare and the cost of using unwritten extents
> > > > everywhere will affect /everyone/.
> > > >
> > > > i.e. It's always been easy to create test conditions to expose
> > > > stale data, but reported cases of stale data exposure after a
> > > > production system crash are pretty much non-existent. This just
> > > > isn't a common problem that users are experiencing, but solving
> > > > it with "unwritten extents for everyone" comes at a substantial
> > > > cost for /everyone/, /all the time/.
> > >
> > > I know you're aware. I'm pretty sure we've discussed this directly
> > > in the past (with regard to how buffer heads were an impediment to
> > > a solution) and that you made me aware of the issue in the first
> > > place. ;)
> > >
> > > > That's why a) it hasn't been done yet; and b) any discussion
> > > > about "use unwritten extents everywhere" always ends up in
> > > > trying to find a solution that isn't just applying the "big
> > > > hammer" and ignoring the problems that causes.
> > >
> > > Right..
> > >
> > > > > If you prefer to reason about reproducer patterns that don't
> > > > > use operations with quirks or undefined behaviors, that's
> > > > > fine, just use the reflink or hole punch example posted above
> > > > > that produces the same behavior.
> > > >
> > > > Well, yes. But it does go both ways - if you don't want people
> > > > to say "that's undefined" or "that makes no sense", then don't
> > > > use those quirky examples if you can use better, unambiguous
> > > > examples.
> > >
> > > The sync_file_range() reproducer was a poor example in that
> > > regard. I'm
> >
> > (I don't want to stir this up again but just so we're all clear I
> > had figured the s_f_r call was really just a way to simulate random
> > drive-by scattershot background writeback for testing purposes.)
>
> Ack.
>
> > > not concerned over that being pointed out, but I've explained
> > > several times subsequently that we can replace the
> > > sync_file_range() calls with a sane command (hole punch) to
> > > reproduce the same general behavior and resulting data exposure
> > > problem, and thus the quirks of the former are not a primary
> > > factor. Despite that, this discussion has somehow turned into an
> > > appraisal of sync_file_range(). Moving on..
> > >
> > > I posted some comments/questions around your log item suggestion
> > > for extent zeroing during recovery in my previous reply and that
> > > part apparently got completely snipped out and ignored. :/ Any
> > > feedback on that?
> >
> > A(nother) big downside of adding a new "write intent": now we need
> > to add a log incompat flag, which will occasionally generate "I
> > wrote something, crashed, and this old knoppix cd won't recover the
> > fs for some reason" reports.
>
> Yeah, though I think it's not really recommended to process an fs
> with a dirty log across kernels like that.

That may be, but it seems to work often enough that our service techs
think they can do it 100% of the time. And I have the bug reports from
when it doesn't work to prove it. :(

> We already have some recovery restrictions that prevent this, such as
> on-disk ordering, for example. Of course, that's different from just
> going back to an older kernel version on the same hardware and
> getting stuck that way because of a feature bit or something.

<nod>
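
The "old knoppix cd" failure mode is exactly what the log incompat
bits exist to catch, for what it's worth: an old kernel that finds a
dirty log carrying an unknown incompat bit refuses to replay it up
front instead of misreplaying it. Roughly like the sketch below -- the
bit name is made up for a hypothetical zeroing-intent item, but
xfs_sb_has_incompat_log_feature() and the _UNKNOWN mask are the
existing mechanism:

	/*
	 * Sketch of how a new log-only intent item would be fenced off
	 * from old kernels.  XFS_SB_FEAT_INCOMPAT_LOG_ZEROINTENT is a
	 * made-up bit; the helper and the _UNKNOWN mask already exist.
	 */
	#define XFS_SB_FEAT_INCOMPAT_LOG_ZEROINTENT	(1U << 1)	/* hypothetical */

	static int
	xfs_check_unknown_log_features(
		struct xfs_mount	*mp)
	{
		/* Dirty log + a bit we don't recognise == refuse to replay. */
		if (xfs_sb_has_incompat_log_feature(&mp->m_sb,
				XFS_SB_FEAT_INCOMPAT_LOG_UNKNOWN)) {
			xfs_warn(mp,
	"log has unrecognised incompat features, cannot replay dirty log");
			return -EINVAL;
		}
		return 0;
	}
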
> > > > For example, that's exactly how I phrased this extension
> > > > flushing mechanism trigger:
> > > >
> > > > > > Essentially, what this says to me is we need to track, via
> > > > > > an inode flag, when we extended the file with a write beyond
> > > > > > EOF. Then if we need to flush a data range from the file for
> > > > > > integrity reasons (such as an fpunch) and this flag is set,
> > > > > > we ignore the "start" parameter of the
> > > >
> > > > See? Entirely generic trigger description, uses hole punch as an
> > > > example, but also covers all the quirky cases such as direct IO
> > > > range flushing without having to mention them explicitly.
> > > >
> > > > > > range and flush from the start of the file instead. i.e. we
> > > > > > guarantee that any dirty zeroed pages between the start of
> > > > > > the file and the end of the range we are operating on get
> > > > > > written back. This will capture all these "didn't flush
> > > > > > regions zeroed on file extension" problems in one go,
> > > > > > without needing to change the way we do allocation or EOF
> > > > > > zeroing.
> > > > >
> > > > > I don't think that would prevent the problem outside of the
> > > > > common write extension case.
> > > >
> > > > I feel like we're going around in circles now, Brian. :/
> > >
> > > Yes, let's try to reset here...
> > >
> > > > I didn't propose this as the "only solution we need". I've
> > > > proposed it in the context that we use "only use unwritten
> > > > extents for allocations within EOF" to solve the common "write
> > > > within EOF" case. This, however, doesn't solve the common write
> > > > extension case where we don't want to use unwritten extents
> > > > because of the performance cost.
> > > >
> > > > And now you're saying that "well, this extending write thing
> > > > doesn't solve the inside EOF problem"......
> > >
> > > Ok, I lost that context. As I said above, I can see how this could
> > > be useful in combination with another approach to handle
> > > inside-eof allocations. There are still some loose ends (in my
> > > mind) to consider with this technique, however...
> > >
> > > This requires that we always guarantee in-order writeback, right?
> > > I can see how we can do that in the cases we control based on the
> > > flag-based approach you suggested. But do we really have enough
> > > control within the fs to always prevent out-of-order writeback?
> > > What about non-integrity background writeback? AFAICT it's
> > > technically possible to write back out of expected order (the
> > > cyclic range logic looks particularly suspicious to me if you
> > > consider a mapping can be truncated and repopulated many times
> > > over without resetting the cycle index). What about memory
> > > reclaim? Is it not possible to hit the ->writepage() path via
> > > reclaim for an arbitrary page in a mapping? It's not clear to me
> > > if page migration is a factor for us as well, but it also looks
> > > like it has some specific page writeback hooks.
> > >
> > > I suppose there are technically ways around these things. We could
> > > skip writeback of certain pages in some cases (but that may have
> > > unacceptable consequences) or make the decision to use one
> > > exposure prevention approach or another dynamically at writepage
> > > time based on the state of the inode, mapping and associated page.
> > > I'd think background writeback is the common case as opposed to
> > > things like hole punch and whatnot, however, so a dynamic solution
> > > leaves the effectiveness of this approach somewhat open ended
> > > unless we characterize how all possible writeback scenarios
> > > behave.
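
Mechanically, I read Dave's trigger as something like the sketch
below. The XFS_IEOFEXTENDED flag name is hypothetical, and clearing it
on the flush is optimistic -- a real patch would have to worry about
racing with a concurrent extending write re-dirtying zeroed pages:

	/*
	 * Sketch of the flag-based trigger.  XFS_IEOFEXTENDED (made-up
	 * name) would be set by any write that extends the file beyond
	 * EOF; an integrity flush then ignores its start offset and
	 * writes back from the start of the file, so dirty zeroed pages
	 * reach disk along with the range being flushed.
	 */
	static int
	xfs_flush_unmap_range_sketch(
		struct xfs_inode	*ip,
		xfs_off_t		start,
		xfs_off_t		end)
	{
		/* optimistic clear; racy vs. a concurrent extending write */
		if (xfs_iflags_test_and_clear(ip, XFS_IEOFEXTENDED))
			start = 0;

		return filemap_write_and_wait_range(VFS_I(ip)->i_mapping,
				start, end);
	}
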
> > > Given that, it seems more useful to me to work through the
> > > inside-eof extent allocation scenario first and work back from
> > > there. Then we can at least try to establish potential performance
> > > impact and determine if we should try to work around it in certain
> > > cases, etc.
> >
> > <nod> I'll get back to working on two parts of fixing this:
> >
> > a) adapting the "convert delalloc to unwritten" patch that started
> > this thread so that it only does that for conversions that start
> > within eof,
> >
> > b) fixing your punch case by extending the range that
> > xfs_flush_unmap_range asks to flush if the caller is trying to flush
> > a range past the ondisk eof.
> >
> > Though I think (b) is unnecessary if we had the ability to flush all
> > the pages fronting a delalloc reservation that we just converted to
> > a real extent, as I muttered above.
>
> I think it's more that we have to order all such page writebacks
> against the on-disk size update so the latter protects us from
> exposure in the event of a crash, but otherwise this sounds like a
> reasonable next step to me..

<nod> Figuring out ordered completion is going to take some thought.
(A quick sketch of the gating check for (a) is in the P.S. below.)

--D

> Brian
>
> > > On that note, please see the part that was snipped out of my
> > > earlier reply[1], in particular the part that ended with "Am I
> > > following the high level idea correctly?" :P Thanks.
> >
> > I thought your take at least matched my interpretation of Dave's
> > comments.
> >
> > Thoughts?
> >
> > --D
> > >
> > > Brian
> > >
> > > [1] https://marc.info/?l=linux-xfs&m=155308332901975&w=2
> > >
> > > > > It may be possible to selectively (i.e. as an optimization)
> > > > > use an XFS_ITRUNCATED-like algorithm as you propose here to
> > > > > always force in-order writeback in the cases where writeback
> > > > > extends the on-disk isize (and thus isize protects from stale
> > > > > exposure), but I think we'd still need something tied to
> > > > > extent allocation one way or another in addition (like
> > > > > Darrick's original unwritten extent patch or your proposed
> > > > > intent/done idea in the previous reply) to address the
> > > > > fundamental problem. The question is what is the ideal
> > > > > solution..
> > > >
> > > > The solution will be a combination of things that work together,
> > > > not one big hammer. The more we can avoid exposure vectors by
> > > > operational ordering rather than transactional state changes,
> > > > the better the solution will perform.
> > > >
> > > > Cheers,
> > > >
> > > > Dave.
> > > > --
> > > > Dave Chinner
> > > > david@xxxxxxxxxxxxx
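
P.S. The gating check for (a) is basically one comparison. Helper name
made up, and "on-disk size" here means di_size rather than the in-core
i_size:

	/*
	 * Sketch of the gate for (a): only pay the unwritten-extent cost
	 * for delalloc conversions that start inside the on-disk size,
	 * i.e. the region where a crash after the extent hits the log
	 * but before the data hits the disk would expose stale blocks.
	 * Post-eof allocations stay as written extents, protected by the
	 * on-disk inode size as they are today.
	 */
	static bool
	xfs_alloc_wants_unwritten(
		struct xfs_inode	*ip,
		xfs_fileoff_t		offset_fsb)
	{
		return offset_fsb < XFS_B_TO_FSB(ip->i_mount, ip->i_d.di_size);
	}
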