On Thu, Apr 04, 2019 at 08:50:54AM -0400, Brian Foster wrote:
> On Wed, Apr 03, 2019 at 03:38:09PM -0700, Darrick J. Wong wrote:
> > On Fri, Mar 22, 2019 at 08:46:15AM -0400, Brian Foster wrote:
> > > On Fri, Mar 22, 2019 at 12:52:40PM +1100, Dave Chinner wrote:
> > > > On Thu, Mar 21, 2019 at 08:27:40AM -0400, Brian Foster wrote:
> > > > > On Thu, Mar 21, 2019 at 08:10:19AM +1100, Dave Chinner wrote:
> > > > > > On Wed, Mar 20, 2019 at 08:02:05AM -0400, Brian Foster wrote:
> > > > > > > On Wed, Mar 20, 2019 at 07:25:37AM +1100, Dave Chinner wrote:
> > > > > > > > > > The broader point is that by controlling writeback
> > > > > > > > > > ordering, we can explicitly demonstrate stale data
> > > > > > > > > > exposure. If fixed properly, we should never expect
> > > > > > > > > > stale data exposure even in the event of out-of-order
> > > > > > > > > > writeback completion, regardless of the cause.
> > > > > > > > >
> > > > > > > > > Agree. Even if s_f_r is a giant trash fire, if we're
> > > > > > > > > exposing stale disk contents it's game over anyway and
> > > > > > > > > we ought to fix it.
> > > > > > > >
> > > > > > > > Well, that's exactly the problem with SFR, isn't it? The
> > > > > > > > dirty data is outside the range the user asks to sync and
> > > > > > > > the implementation doesn't go anywhere near the filesystem,
> > > > > > > > so we can't actually do the right thing here. So we are
> > > > > > > > left to either hack fixes around a shitty, broken interface
> > > > > > > > and everyone pays the penalty, or we ignore it. We've
> > > > > > > > simply ignored it for the past 10+ years because it really
> > > > > > > > can't be used safely for its intended purposes (as everyone
> > > > > > > > who tries to use it finds out eventually)...
> > > > > > >
> > > > > > > Just to make this abundantly clear: there is nothing unique,
> > > > > > > magic or special required of sync_file_range() to reproduce
> > > > > > > this stale data exposure. See the following variation of a
> > > > > > > command sequence from the fstest I posted the other day:
> > > > > > >
> > > > > > > ...
> > > > > > > # xfs_io -fc "pwrite 0 68k" -c fsync -c "pwrite 96k 4k" \
> > > > > > >     -c "fpunch 96k 4k" <file>
> > > > > > > ...
> > > > > > > # hexdump <file>
> > > > > > > 0000000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd
> > > > > > > *
> > > > > > > 0011000 abab abab abab abab abab abab abab abab
> > > > > > > *
> > > > > > > 0018000 0000 0000 0000 0000 0000 0000 0000 0000
> > > > > > > *
> > > > > > > 0019000
> > > > > > >
> > > > > > > Same problem, and it can be reproduced just as easily with a
> > > > > > > reflink or even an overlapping direct I/O.
> > > > > >
> > > > > > How does overlapping direct IO cause stale data exposure given
> > > > > > that it uses unwritten extent allocation? That, of course, is
> > > > > > ignoring the fact that the result of overlapping concurrent
> > > > > > direct IO is, by definition, undefined and an application bug
> > > > > > if it occurs....
> > > > >
> > > > > It's overlapping a buffered write and so starts with a flush
> > > > > instead of
> > > >
> > > > Ok, but that's not the normal usage of the phrase "overlapping
> > > > direct IO". Overlapping direct IO means two concurrent direct IOs
> > > > to the same LBA range in the device underlying the filesystem,
> > > > i.e. they overlap in both the temporal and physical (device LBA)
> > > > domains.
> > > >
> > > > Subtle difference, yes, but a very important one.
> > >
> > > Yeah, I just referenced it in passing and was unclear. The point is
> > > just that there are multiple vectors to the problem, some of which
> > > are sane sequences and some apparently not, and I'd rather not get
> > > caught up in the minutiae of why some of those sequences are not
> > > sane. I'm trying to see if we can find a solution to the
> > > fundamental data exposure problem.
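
For the record, nothing in that sequence is specific to xfs_io either;
it's a plain syscall sequence, with fpunch mapping to
fallocate(FALLOC_FL_PUNCH_HOLE) and the cdcd pattern being xfs_io's
default pwrite fill byte. A minimal C sketch of just the commands
shown -- the "..." steps of the fstest (which presumably seed the disk
with the abab bytes and later reveal them) stay elided here, and error
handling is trimmed:

	/*
	 * Sketch of the xfs_io sequence above as raw syscalls.  Only the
	 * visible commands are reproduced; the elided "..." steps of the
	 * fstest are what actually expose the stale data afterwards.
	 */
	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <string.h>
	#include <unistd.h>
	#include <linux/falloc.h>

	int main(int argc, char **argv)
	{
		char buf[4096];
		off_t off;
		int fd;

		if (argc < 2)
			return 1;
		fd = open(argv[1], O_RDWR | O_CREAT, 0644);

		memset(buf, 0xcd, sizeof(buf));
		for (off = 0; off < 68 * 1024; off += sizeof(buf))
			pwrite(fd, buf, sizeof(buf), off);	/* pwrite 0 68k */
		fsync(fd);					/* fsync */
		pwrite(fd, buf, sizeof(buf), 96 * 1024);	/* pwrite 96k 4k */
		fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
			  96 * 1024, 4 * 1024);			/* fpunch 96k 4k */
		close(fd);
		return 0;
	}
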
> > Is there a way to increase the amount of writeback? Say for example
> > that we have a delalloc reservation for blocks 0-9, and writeback
> > asks us to write back block 7. If we succeed at converting the
> > entire 10-block reservation into a real extent, why can't we come
> > back and say "Hey, we also need to flush the pages for blocks
> > 0-6,8-9"? Are we going to need something like that for
> > blocksize > pagesize filesystems too?
>
> That might be possible in various places. In fact, I think we used to
> do something like this in XFS (perhaps not for this purpose). I just
> jumped back to v2.6.32 and see that the xfs_cluster_write() mechanism
> still exists at that point down in the ->writepage() path.
>
> I'm not sure if doing that alone is worth the effort if you consider
> it's more of a mitigation than a solution to the underlying problem.
> Once we convert the associated delalloc extent, we're open to stale
> data exposure (assuming we're not protected by on-disk EOF and
> whatnot) until all of the remaining pages over the extent are written
> back, whether that occurs in the current writeback cycle or a
> subsequent one.
>
> I suppose it might make sense in service to another solution, such as
> writing back everything preceding a page that is going to update
> on-disk EOF (which is something I think Dave suggested earlier as an
> incremental improvement, I've kind of lost track). We'd probably just
> have to make sure that the on-disk eof can never get ahead of the
> lower index writeback pages by I/O reordering or whatever..

...which would be easi(er) to handle if we could chain all the
file-extending writeback bios to a single ioend like you suggested in
the per-inode ioend completion thread, right?

Though such a beast would still be racy, I think, because background
writeback could start writing back a /single/ page from the middle of
the post-eof range before the integrity writeback flushes everything
else, and if the integrity writeback completes before the background
writeback does, we're still exposed. I think? Dunno, maybe the
per-inode completion can help here, since at least if the completions
all come in at the same time we'll reorder and combine the metadata
work, though that's not a full solution either.

> > > > > an unwritten allocation (so even a dio read is sufficient).
> > > > > Also, the target of the dio isn't where stale data is exposed,
> > > > > so even if there was an existing underlying unwritten block
> > > > > before the initial buffered write it wouldn't change anything.
> > > > >
> > > > > Yes, this is technically usage with undefined behavior, but
> > > > > that still misses the forest for the trees.
> > > >
> > > > No, I'm not missing the forest for the trees.
> > > >
> > > > > I'm not claiming this is good or supported practice, just as
> > > > > I'm not claiming sync_file_range() should be used for anything
> > > > > in practice. I'm just pointing out it's one of several
> > > > > possible, obvious data exposure vectors. I suspect there are
> > > > > others; these were just the easy variants to reproduce.
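
Coming back to the "flush the rest of what we just converted" idea for
a second, the shape I have in mind is roughly the sketch below. Purely
illustrative: the helper name is made up, it's not the real call
chain, and a real version would have to avoid recursing into writeback
from the writeback path itself (usual fs/xfs includes assumed):

	/*
	 * Hypothetical sketch: after converting an entire delalloc
	 * reservation to a real extent on behalf of a single page, kick
	 * asynchronous writeback for the rest of the pages backed by that
	 * reservation so the window where the extent maps not-yet-written
	 * blocks is as short as we can make it.
	 */
	static void
	xfs_writeback_converted_range(
		struct xfs_inode	*ip,
		xfs_fileoff_t		offset_fsb,	/* start of converted extent */
		xfs_filblks_t		count_fsb)	/* length of converted extent */
	{
		struct address_space	*mapping = VFS_I(ip)->i_mapping;
		loff_t			start = XFS_FSB_TO_B(ip->i_mount, offset_fsb);
		loff_t			end = XFS_FSB_TO_B(ip->i_mount,
							   offset_fsb + count_fsb) - 1;

		/* async: we only need the pages in flight, not completed */
		filemap_fdatawrite_range(mapping, start, end);
	}
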
> > > > I'm aware that we have these problems - we've known about them
> > > > for 20+ years and the many ways they can be exposed. Indeed,
> > > > we've been talking about using unwritten extents for delalloc
> > > > for as long as I can remember, but never done it because the
> > > > actual occurrence of these problems in production systems is
> > > > /extremely/ rare and the cost of using unwritten extents
> > > > everywhere will affect /everyone/.
> > > >
> > > > i.e. It's always been easy to create test conditions to expose
> > > > stale data, but reported cases of stale data exposure after a
> > > > production system crash are pretty much non-existent. This just
> > > > isn't a common problem that users are experiencing, but solving
> > > > it with "unwritten extents for everyone" comes at a substantial
> > > > cost for /everyone/, /all the time/.
> > >
> > > I know you're aware. I'm pretty sure we've discussed this directly
> > > in the past (with regard to how buffer heads were an impediment to
> > > a solution) and that you made me aware of the issue in the first
> > > place. ;)
> > >
> > > > That's why a) it hasn't been done yet; and b) any discussion
> > > > about "use unwritten extents everywhere" always ends up in
> > > > trying to find a solution that isn't just applying the "big
> > > > hammer" and ignoring the problems that causes.
> > >
> > > Right..
> > >
> > > > > If you prefer to reason about reproducer patterns that don't
> > > > > use operations with quirks or undefined behaviors, that's
> > > > > fine, just use the reflink or hole punch example posted above
> > > > > that produces the same behavior.
> > > >
> > > > Well, yes. But it does go both ways - if you don't want people
> > > > to say "that's undefined" or "that makes no sense", then don't
> > > > use those quirky examples if you can use better, unambiguous
> > > > examples.
> > >
> > > The sync_file_range() reproducer was a poor example in that
> > > regard. I'm
> >
> > (I don't want to stir this up again but just so we're all clear I
> > had figured the s_f_r call was really just a way to simulate random
> > drive-by scattershot background writeback for testing purposes.)
>
> Ack.
>
> > > not concerned over that being pointed out, but I've explained
> > > several times subsequently that we can replace the
> > > sync_file_range() calls with a sane command (hole punch) to
> > > reproduce the same general behavior and resulting data exposure
> > > problem, and thus the quirks of the former are not a primary
> > > factor. Despite that, this discussion has somehow turned into an
> > > appraisal of sync_file_range(). Moving on..
> > >
> > > I posted some comments/questions around your log item suggestion
> > > for extent zeroing during recovery in my previous reply and that
> > > part apparently got completely snipped out and ignored. :/ Any
> > > feedback on that?
> >
> > A(nother) big downside of adding a new "write intent": now we need
> > to add a log incompat flag, which will occasionally generate "I
> > wrote something, crashed, and this old knoppix cd won't recover the
> > fs for some reason" reports.
>
> Yeah, though I think it's not really recommended to process an fs
> with a dirty log across kernels like that.

That may be, but it seems to work often enough that our service techs
think they can do it 100% of the time. And I have the bug reports from
when it doesn't work to prove it. :(

> We already have some recovery restrictions that prevent this, such as
> on-disk ordering, for example. Of course, that's different from just
> going back to an older kernel version on the same hardware and
> getting stuck that way because of a feature bit or something.

<nod>
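
The "old knoppix cd" failure mode is exactly what the log incompat
bits exist to catch, for what it's worth: an old kernel that finds a
dirty log carrying an unknown incompat bit refuses to replay it up
front instead of misreplaying it. Roughly like the sketch below -- the
bit name is made up for a hypothetical zeroing-intent item, but
xfs_sb_has_incompat_log_feature() and the _UNKNOWN mask are the
existing mechanism:

	/*
	 * Sketch of how a new log-only intent item would be fenced off
	 * from old kernels.  XFS_SB_FEAT_INCOMPAT_LOG_ZEROINTENT is a
	 * made-up bit; the helper and the _UNKNOWN mask already exist.
	 */
	#define XFS_SB_FEAT_INCOMPAT_LOG_ZEROINTENT	(1U << 1)	/* hypothetical */

	static int
	xfs_check_unknown_log_features(
		struct xfs_mount	*mp)
	{
		/* Dirty log + a bit we don't recognise == refuse to replay. */
		if (xfs_sb_has_incompat_log_feature(&mp->m_sb,
				XFS_SB_FEAT_INCOMPAT_LOG_UNKNOWN)) {
			xfs_warn(mp,
	"log has unrecognised incompat features, cannot replay dirty log");
			return -EINVAL;
		}
		return 0;
	}
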
> > > > For example, that's exactly how I phrased this extension
> > > > flushing mechanism trigger:
> > > >
> > > > > > Essentially, what this says to me is we need to track, via
> > > > > > an inode flag, when we extended the file with a write beyond
> > > > > > EOF. Then if we need to flush a data range from the file for
> > > > > > integrity reasons (such as an fpunch) and this flag is set,
> > > > > > we ignore the "start" parameter of the
> > > >
> > > > See? Entirely generic trigger description, uses hole punch as an
> > > > example, but also covers all the quirky cases such as direct IO
> > > > range flushing without having to mention them explicitly.
> > > >
> > > > > > range and flush from the start of the file instead. i.e. we
> > > > > > guarantee that any dirty zeroed pages between the start of
> > > > > > the file and the end of the range we are operating on get
> > > > > > written back. This will capture all these "didn't flush
> > > > > > regions zeroed on file extension" problems in one go,
> > > > > > without needing to change the way we do allocation or EOF
> > > > > > zeroing.
> > > > >
> > > > > I don't think that would prevent the problem outside of the
> > > > > common write extension case.
> > > >
> > > > I feel like we're going around in circles now, Brian. :/
> > >
> > > Yes, let's try to reset here...
> > >
> > > > I didn't propose this as the "only solution we need". I've
> > > > proposed it in the context that we use "only use unwritten
> > > > extents for allocations within EOF" to solve the common "write
> > > > within EOF" case. This, however, doesn't solve the common write
> > > > extension case where we don't want to use unwritten extents
> > > > because of the performance cost.
> > > >
> > > > And now you're saying that "well, this extending write thing
> > > > doesn't solve the inside EOF problem"......
> > >
> > > Ok, I lost that context. As I said above, I can see how this could
> > > be useful in combination with another approach to handle
> > > inside-eof allocations. There are still some loose ends (in my
> > > mind) to consider with this technique, however...
> > >
> > > This requires that we always guarantee in-order writeback, right?
> > > I can see how we can do that in the cases we control based on the
> > > flag-based approach you suggested. But do we really have enough
> > > control within the fs to always prevent out-of-order writeback?
> > > What about non-integrity background writeback? AFAICT it's
> > > technically possible to write back out of expected order (the
> > > cyclic range logic looks particularly suspicious to me if you
> > > consider a mapping can be truncated and repopulated many times
> > > over without resetting the cycle index). What about memory
> > > reclaim? Is it not possible to hit the ->writepage() path via
> > > reclaim for an arbitrary page in a mapping? It's not clear to me
> > > if page migration is a factor for us as well, but it also looks
> > > like it has some specific page writeback hooks.
> > >
> > > I suppose there are technically ways around these things. We could
> > > skip writeback of certain pages in some cases (but that may have
> > > unacceptable consequences) or make the decision to use one
> > > exposure prevention approach or another dynamically at writepage
> > > time based on the state of the inode, mapping and associated page.
> > > I'd think background writeback is the common case as opposed to
> > > things like hole punch and whatnot, however, so a dynamic solution
> > > leaves the effectiveness of this approach somewhat open ended
> > > unless we characterize how all possible writeback scenarios
> > > behave.
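
Mechanically, I read Dave's trigger as something like the sketch
below. The XFS_IEOFEXTENDED flag name is hypothetical, and clearing it
on the flush is optimistic -- a real patch would have to worry about
racing with a concurrent extending write re-dirtying zeroed pages:

	/*
	 * Sketch of the flag-based trigger.  XFS_IEOFEXTENDED (made-up
	 * name) would be set by any write that extends the file beyond
	 * EOF; an integrity flush then ignores its start offset and
	 * writes back from the start of the file, so dirty zeroed pages
	 * reach disk along with the range being flushed.
	 */
	static int
	xfs_flush_unmap_range_sketch(
		struct xfs_inode	*ip,
		xfs_off_t		start,
		xfs_off_t		end)
	{
		/* optimistic clear; racy vs. a concurrent extending write */
		if (xfs_iflags_test_and_clear(ip, XFS_IEOFEXTENDED))
			start = 0;

		return filemap_write_and_wait_range(VFS_I(ip)->i_mapping,
				start, end);
	}
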
> > > Given that, it seems more useful to me to work through the
> > > inside-eof extent allocation scenario first and work back from
> > > there. Then we can at least try to establish potential performance
> > > impact and determine if we should try to work around it in certain
> > > cases, etc.
> >
> > <nod> I'll get back to working on two parts of fixing this:
> >
> > a) adapting the "convert delalloc to unwritten" patch that started
> > this thread so that it only does that for conversions that start
> > within eof,
> >
> > b) fixing your punch case by extending the range that
> > xfs_flush_unmap_range asks to flush if the caller is trying to flush
> > a range past the ondisk eof.
> >
> > Though I think (b) is unnecessary if we had the ability to flush all
> > the pages fronting a delalloc reservation that we just converted to
> > a real extent, as I muttered above.
>
> I think it's more that we have to order all such page writebacks
> against the on-disk size update so the latter protects us from
> exposure in the event of a crash, but otherwise this sounds like a
> reasonable next step to me..

<nod> Figuring out ordered completion is going to take some thought.
(A quick sketch of the gating check for (a) is in the P.S. below.)

--D

> Brian
>
> > > On that note, please see the part that was snipped out of my
> > > earlier reply[1], in particular the part that ended with "Am I
> > > following the high level idea correctly?" :P Thanks.
> >
> > I thought your take at least matched my interpretation of Dave's
> > comments.
> >
> > Thoughts?
> >
> > --D
> > >
> > > Brian
> > >
> > > [1] https://marc.info/?l=linux-xfs&m=155308332901975&w=2
> > >
> > > > > It may be possible to selectively (i.e. as an optimization)
> > > > > use an XFS_ITRUNCATED-like algorithm as you propose here to
> > > > > always force in-order writeback in the cases where writeback
> > > > > extends the on-disk isize (and thus isize protects from stale
> > > > > exposure), but I think we'd still need something tied to
> > > > > extent allocation one way or another in addition (like
> > > > > Darrick's original unwritten extent patch or your proposed
> > > > > intent/done idea in the previous reply) to address the
> > > > > fundamental problem. The question is what is the ideal
> > > > > solution..
> > > >
> > > > The solution will be a combination of things that work together,
> > > > not one big hammer. The more we can avoid exposure vectors by
> > > > operational ordering rather than transactional state changes,
> > > > the better the solution will perform.
> > > >
> > > > Cheers,
> > > >
> > > > Dave.
> > > > --
> > > > Dave Chinner
> > > > david@xxxxxxxxxxxxx
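
P.S. The gating check for (a) is basically one comparison. Helper name
made up, and "on-disk size" here means di_size rather than the in-core
i_size:

	/*
	 * Sketch of the gate for (a): only pay the unwritten-extent cost
	 * for delalloc conversions that start inside the on-disk size,
	 * i.e. the region where a crash after the extent hits the log
	 * but before the data hits the disk would expose stale blocks.
	 * Post-eof allocations stay as written extents, protected by the
	 * on-disk inode size as they are today.
	 */
	static bool
	xfs_alloc_wants_unwritten(
		struct xfs_inode	*ip,
		xfs_fileoff_t		offset_fsb)
	{
		return offset_fsb < XFS_B_TO_FSB(ip->i_mount, ip->i_d.di_size);
	}
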