Re: Question on slow fallocate


On 6/23/23 6:49 AM, Ritesh Harjani (IBM) wrote:
Sorry, but I still haven't understood the real problem here for which
XFS does filemap_write_and_wait_range(). Is it a stale data exposure
problem?

(Hopefully I get this right while trying to be helpful here. It's been a while.)

Not really. IIRC the original problem was that the file size could get updated (transactionally) before the delayed allocation and IO happened at writeback time, leaving a hole before EOF where buffered writes had failed to land before a crash. This is what people originally called the "NULL files problem" because reading the hole post-crash returned zeros. It wasn't stale data, it was no data.

Some commits that dealt with this explain it fairly well I think:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c32676eea19ce29cb74dba0f97b085e83f6b8915

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ba87ea699ebd9dd577bf055ebc4a98200e337542

Now, in this code here in fs/xfs/xfs_iops.c we refer to the problem as
"expose ourselves to the null files problem".
What is the "expose ourselves to the null files problem here"
for which we do filemap_write_and_wait_range()?


	/*
	 * We are going to log the inode size change in this transaction so
	 * any previous writes that are beyond the on disk EOF and the new
	 * EOF that have not been written out need to be written here.  If we

i.e. force the writeback of any pending buffered IO into the hole created up to the new EOF

	 * do not write the data out, we expose ourselves to the null files
	 * problem. Note that this includes any block zeroing we did above;
	 * otherwise those blocks may not be zeroed after a crash.

And I suppose this part relates a little to stale data; IIRC it refers to zeroing partial blocks past the old EOF.

	 */
	if (did_zeroing ||
	    (newsize > ip->i_disk_size && oldsize != ip->i_disk_size)) {
		error = filemap_write_and_wait_range(VFS_I(ip)->i_mapping,
						ip->i_disk_size, newsize - 1);
		if (error)
			return error;
	}
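
To spell out when that flush fires, here is a small userspace model of the condition. It is only a sketch and not kernel code: the struct and function names are made up for illustration, and it assumes oldsize is the in-core file size and i_disk_size is the size last logged to disk, so the two differ only while extending buffered writes are still pending.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the values the kernel excerpt tests. */
struct size_state {
	int64_t disk_size;	/* models ip->i_disk_size */
	int64_t old_size;	/* models oldsize (in-core size before the change) */
	int64_t new_size;	/* models newsize (requested size) */
	bool did_zeroing;	/* partial-block zeroing was done earlier */
};

/*
 * Returns true if the size change has to flush first, and fills in the
 * inclusive byte range that would be handed to
 * filemap_write_and_wait_range().
 */
static bool must_flush_before_size_change(const struct size_state *s,
					  int64_t *start, int64_t *end)
{
	bool growing_past_disk_eof = s->new_size > s->disk_size;
	bool pending_size_update = s->old_size != s->disk_size;

	if (!s->did_zeroing && !(growing_past_disk_eof && pending_size_update))
		return false;

	*start = s->disk_size;
	*end = s->new_size - 1;
	return true;
}
```

So a file whose in-core size already matches the on-disk size can be extended without a flush, while one with pending extending writes cannot.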


Talking about ext4, it handles truncates to a file using orphan
handling, yes. In case the truncate operation spans multiple txns and
if the crash happens say in the middle of a txn, then the subsequent crash
recovery will truncate the blocks spanning i_disksize.

But we aren't discussing shrinking here right. We are doing pwrite
followed by fallocate to grow the file size. With pwrite we use delalloc
so the blocks only get allocated during writeback time and with
fallocate we will allocate unwritten extents, so there should be no
stale data expose problem in this case right?

Yeah, it's not a stale data problem. I think that the extended EOF created by fallocate is being treated exactly the same as if we had extended it with ftruncate(). Indeed, replacing the posix_fallocate with ftruncate to the same size in the test program results in a similarly slow run. It is slightly faster, probably because unwritten conversion doesn't have to happen in that case.
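
For reference, the access pattern under discussion looks roughly like the sketch below. This is not the original test program: the function name, chunk size, and path are assumptions for illustration. Each iteration does a small buffered pwrite and then pushes EOF past it, either with posix_fallocate() or with ftruncate(), so the two cases can be compared.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical reproducer sketch: write a chunk with buffered IO, then
 * extend EOF by one more chunk. Returns the final file size, or -1 on
 * error.
 */
static off_t write_then_extend(const char *path, int iterations,
			       size_t chunk, int use_ftruncate)
{
	char buf[4096];
	struct stat st;
	off_t off = 0;
	int fd, i, err;

	assert(chunk <= sizeof(buf));
	memset(buf, 'x', sizeof(buf));

	fd = open(path, O_CREAT | O_TRUNC | O_RDWR, 0644);
	if (fd < 0)
		return -1;

	for (i = 0; i < iterations; i++) {
		/* Buffered write; with delalloc, blocks are allocated at writeback. */
		if (pwrite(fd, buf, chunk, off) != (ssize_t)chunk)
			goto fail;
		off += chunk;

		/* Move EOF one chunk beyond the data that was just written. */
		if (use_ftruncate)
			err = ftruncate(fd, off + chunk);
		else
			err = posix_fallocate(fd, off, chunk);
		if (err)
			goto fail;
	}

	if (fstat(fd, &st) < 0)
		goto fail;
	close(fd);
	return st.st_size;

fail:
	close(fd);
	return -1;
}
```

Timing write_then_extend(path, N, 512, 0) against write_then_extend(path, N, 512, 1) on an XFS mount is one way to compare the two variants; the sketch itself does no timing.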

Hence my question was to mainly understand what does "expose ourselves to
the null files problem" means in XFS?

Hopefully the above explains it; that said, I'm not sure this is anything more than academically interesting. As Dave mentioned, fallocating tiny space and then writing into it is not at all the recommended or efficient use of fallocate.

The one thing I'm not remembering exactly here is why we have the heuristic that a truncate up requires flushing all pending data behind it.

I *think* it's because most users knew enough to expect buffered writes could be lost on a crash, but they expected to see valid data up to the on-disk EOF post-crash. Without this heuristic, they'd get some valid data that made it out followed by a hole ("NULLS") up to the new EOF, and they Did Not Like It.

-Eric



