On Mon, Mar 18, 2019 at 09:40:11AM +1100, Dave Chinner wrote:
> On Fri, Mar 15, 2019 at 08:29:40AM -0400, Brian Foster wrote:
> > On Fri, Mar 15, 2019 at 02:40:38PM +1100, Dave Chinner wrote:
> > > > If the delalloc extent crosses EOF then I think it makes sense to map it
> > > > in as unwritten, issue the write, and convert the extent to real during
> > > > io completion, and not split it up just to avoid having unwritten
> > > > extents past EOF.
> > > 
> > > Yup, makes sense. For a sequential write, they should always be
> > > beyond EOF. For anything else, we use unwritten.
> > > 
> > 
> > I'm not quite sure I follow the suggested change in behavior here tbh, so
> > maybe this is not an issue, but another thing to consider is whether
> > selective delalloc -> real conversion for post-eof extents translates to
> > conditional stale data exposure vectors in certain file extension
> > scenarios.
> 
> No, it doesn't, because the transaction that moves EOF is done after
> the data write into the region it exposes is complete. Hence the
> device cache flush that occurs before the log write containing the
> transaction that moves the EOF will always result in the data being
> on stable storage *before* the extending size transaction hits the
> journal and exposes it.
> 

Ok, but this ordering alone doesn't guarantee the data we've just
written is on disk before the transaction hits the log.

> > I think even the post-eof zeroing code may be susceptible to
> > this problem if the log happens to push after a truncate transaction
> > commits but before the associated dirty pages are written back...
> 
> That's what this code in xfs_setattr_size() handles:
> 
> 	/*
> 	 * We are going to log the inode size change in this transaction so
> 	 * any previous writes that are beyond the on disk EOF and the new
> 	 * EOF that have not been written out need to be written here. If we
> 	 * do not write the data out, we expose ourselves to the null files
> 	 * problem. Note that this includes any block zeroing we did above;
> 	 * otherwise those blocks may not be zeroed after a crash.
> 	 */
> 	if (did_zeroing ||
> 	    (newsize > ip->i_d.di_size && oldsize != ip->i_d.di_size)) {
> 		error = filemap_write_and_wait_range(VFS_I(ip)->i_mapping,
> 						ip->i_d.di_size, newsize - 1);
> 		if (error)
> 			return error;
> 	}
> 

Ah, I forgot about this code. So this, combined with the above, provides
correctness for this particular case with respect to stale data exposure
after a crash. AFAICT it is still a performance tradeoff, as the zeroing
may not be necessary if the post-eof extents happen to be unwritten. This
optimization is also achieved "for free" in this path because did_zeroing
is only set if iomap had to physically zero some pages.

This also only covers one extension scenario. It looks like the typical
write extend scenario is intended to be handled similarly, by submitting
the di_size update transaction from data write completion. That looks to
me like it should be fairly robust in the most common cases (i.e.,
background writeback and/or fsync()), but I'm not totally convinced it is
always robust, because write time extension doesn't perform the same kind
of writeback ordering as the truncate example above.

Consider sync_file_range() for example. Suppose we have a large, physical
post-eof prealloc extent and perform a sparse extended write to the last
block of that extent, followed by a sync_file_range() of the just written
data.
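In syscall terms, that sequence boils down to something like the
following sketch (just an illustration: the path and offsets are made
up, the flags are what xfs_io's sync_range -w/-b map to IIRC, and the
real post-eof prealloc is assumed to already exist, e.g. left behind by
an earlier extending write done with a large allocsize):

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <string.h>
	#include <unistd.h>

	int main(void)
	{
		char buf[4096];
		/* file on an XFS mount, already carrying a large real
		 * post-eof prealloc extent beyond its current EOF */
		int fd = open("/mnt/file", O_WRONLY);

		memset(buf, 0xcd, sizeof(buf));

		/* sparse extending write into the tail of the prealloc */
		pwrite(fd, buf, sizeof(buf), 32 * 1024);

		/* start writeback of just that range ("sync_range -w") */
		sync_file_range(fd, 32 * 1024, 4096, SYNC_FILE_RANGE_WRITE);
		/* ... and wait for it to complete ("sync_range -b") */
		sync_file_range(fd, 32 * 1024, 4096,
				SYNC_FILE_RANGE_WAIT_BEFORE);

		close(fd);
		return 0;
	}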
The file write should physically zero the pagecache for the preallocated
region before the write, then the write itself completes and returns.
Nothing has changed on disk so far. A sync_file_range() call sets the
range start/end in the writeback_control and works its way through
__filemap_fdatawrite_range() down to write_cache_pages() and our
writeback mechanism, for the requested pages only. From there, the same
logic applies as for background writeback: we issue writeback, and I/O
completion commits a file extension transaction.

AFAICT, nothing has guaranteed that writeback has even been initiated on
any of the zeroed pages over the non-unwritten extent that has just been
pulled within EOF, which means we are susceptible to stale data exposure
until that happens.

While reasoning about this I realized it should be pretty easy to test:

- pre-seed a small fs with a data pattern (0xab)
- mount with a 1MB alloc size (for ease of testing)
- run something like the following (uses the default 0xcd data pattern):

  # xfs_io -f -x -c "pwrite 0 4k" -c fsync -c "pwrite 32k 4k" \
	-c "sync_range -w 32k 4k" -c "sync_range -b 32k 4k" \
	-c "shutdown -f" /mnt/file

- remount and check the file:

  # hexdump /mnt/file
  0000000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd
  *
  0001000 abab abab abab abab abab abab abab abab
  *
  0008000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd
  *
  0009000

... which clearly shows stale data exposure. Hmm, this might actually be
a good fstest if we don't have one already.

Brian

> 	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0, &tp);
> 
> i.e. it ensures the data and post-eof zeroing is on disk before the
> truncate transaction is started. We already hold the IOLOCK at this
> point and have drained in-flight direct IO, so no IO that can affect
> the truncate region will begin or be in progress after the flush and
> wait above is complete.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@xxxxxxxxxxxxx
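Following up on the fstest thought above: the post-remount verification
in such a test could use a trivial helper along these lines. This is
only a sketch, not an existing fstests helper; the file path and the
range that should read back as zeroes are assumed from the reproducer
above:

	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>

	/*
	 * Read the part of the file that was pulled within EOF but never
	 * explicitly written (4k-32k in the reproducer above) and complain
	 * if any of it is non-zero, i.e. stale data has been exposed.
	 */
	int main(void)
	{
		unsigned char buf[28 * 1024];
		ssize_t ret, i;
		int fd = open("/mnt/file", O_RDONLY);

		if (fd < 0) {
			perror("open");
			return 1;
		}

		ret = pread(fd, buf, sizeof(buf), 4096);
		if (ret < 0) {
			perror("pread");
			return 1;
		}

		for (i = 0; i < ret; i++) {
			if (buf[i]) {
				printf("stale byte 0x%x at offset %lld\n",
				       (unsigned int)buf[i], 4096LL + i);
				return 1;
			}
		}

		printf("range reads back as zeroes, no exposure\n");
		close(fd);
		return 0;
	}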