On Wed, Nov 21, 2018 at 07:27:36PM +1100, Dave Chinner wrote: > > From: Dave Chinner <dchinner@xxxxxxxxxx> > > When we read the EOF page of the file via readpages, we need > to zero the region beyond EOF that we either do not read or > should not contain data so that mmap does not expose stale data to > user applications. > > However, iomap_adjust_read_range() fails to detect EOF correctly, > and so fsx on 1k block size filesystems fails very quickly with > mapreads exposing data beyond EOF. There are two problems here. > > Firstly, when calculating the end block of the EOF byte, we have > to round the size by one to avoid a block aligned EOF from reporting > a block too large. i.e. a size of 1024 bytes is 1 block, which in > index terms is block 0. Therefore we have to calculate the end block > from (isize - 1), not isize. > > The second bug is determining if the current page spans EOF, and so > whether we need split it into two half, one for the IO, and the > other for zeroing. Unfortunately, the code that checks whether > we should split the block doesn't actually check if we span EOF, it > just checks if the read spans the /offset in the page/ that EOF > sits on. So it splits every read into two if EOF is not page > aligned, regardless of whether we are reading the EOF block or not. > > Hence we need to restrict the "does the read span EOF" check to > just the page that spans EOF, not every page we read. > > This patch results in correct EOF detection through readpages: > > xfs_vm_readpages: dev 259:0 ino 0x43 nr_pages 24 > xfs_iomap_found: dev 259:0 ino 0x43 size 0x66c00 offset 0x4f000 count 98304 type hole startoff 0x13c startblock 1368 blockcount 0x4 > iomap_readpage_actor: orig pos 323584 pos 323584, length 4096, poff 0 plen 4096, isize 420864 > xfs_iomap_found: dev 259:0 ino 0x43 size 0x66c00 offset 0x50000 count 94208 type hole startoff 0x140 startblock 1497 blockcount 0x5c > iomap_readpage_actor: orig pos 327680 pos 327680, length 94208, poff 0 plen 4096, isize 420864 > iomap_readpage_actor: orig pos 331776 pos 331776, length 90112, poff 0 plen 4096, isize 420864 > iomap_readpage_actor: orig pos 335872 pos 335872, length 86016, poff 0 plen 4096, isize 420864 > iomap_readpage_actor: orig pos 339968 pos 339968, length 81920, poff 0 plen 4096, isize 420864 > iomap_readpage_actor: orig pos 344064 pos 344064, length 77824, poff 0 plen 4096, isize 420864 > iomap_readpage_actor: orig pos 348160 pos 348160, length 73728, poff 0 plen 4096, isize 420864 > iomap_readpage_actor: orig pos 352256 pos 352256, length 69632, poff 0 plen 4096, isize 420864 > iomap_readpage_actor: orig pos 356352 pos 356352, length 65536, poff 0 plen 4096, isize 420864 > iomap_readpage_actor: orig pos 360448 pos 360448, length 61440, poff 0 plen 4096, isize 420864 > iomap_readpage_actor: orig pos 364544 pos 364544, length 57344, poff 0 plen 4096, isize 420864 > iomap_readpage_actor: orig pos 368640 pos 368640, length 53248, poff 0 plen 4096, isize 420864 > iomap_readpage_actor: orig pos 372736 pos 372736, length 49152, poff 0 plen 4096, isize 420864 > iomap_readpage_actor: orig pos 376832 pos 376832, length 45056, poff 0 plen 4096, isize 420864 > iomap_readpage_actor: orig pos 380928 pos 380928, length 40960, poff 0 plen 4096, isize 420864 > iomap_readpage_actor: orig pos 385024 pos 385024, length 36864, poff 0 plen 4096, isize 420864 > iomap_readpage_actor: orig pos 389120 pos 389120, length 32768, poff 0 plen 4096, isize 420864 > iomap_readpage_actor: orig pos 393216 pos 393216, length 28672, poff 0 plen 4096, isize 420864 > iomap_readpage_actor: orig pos 397312 pos 397312, length 24576, poff 0 plen 4096, isize 420864 > iomap_readpage_actor: orig pos 401408 pos 401408, length 20480, poff 0 plen 4096, isize 420864 > iomap_readpage_actor: orig pos 405504 pos 405504, length 16384, poff 0 plen 4096, isize 420864 > iomap_readpage_actor: orig pos 409600 pos 409600, length 12288, poff 0 plen 4096, isize 420864 > iomap_readpage_actor: orig pos 413696 pos 413696, length 8192, poff 0 plen 4096, isize 420864 > iomap_readpage_actor: orig pos 417792 pos 417792, length 4096, poff 0 plen 3072, isize 420864 > iomap_readpage_actor: orig pos 420864 pos 420864, length 1024, poff 3072 plen 1024, isize 420864 > > As you can see, it now does full page reads until the last one which > is split correctly at the block aligned EOF, reading 3072 bytes and > zeroing the last 1024 bytes. The original version of the patch got > this right, but it got another case wrong. > > The EOF detection crossing really needs to the the original length > as plen, while it starts at the end of the block, will be shortened > as up-to-date blocks are found on the page. This means "orig_pos + > plen" no longer points to the end of the page, and so will not > correctly detect EOF crossing. Hence we have to use the length > passed in to detect this partial page case: > > xfs_filemap_fault: dev 259:1 ino 0x43 write_fault 0 > xfs_vm_readpage: dev 259:1 ino 0x43 nr_pages 1 > xfs_iomap_found: dev 259:1 ino 0x43 size 0x2cc00 offset 0x2c000 count 4096 type hole startoff 0xb0 startblock 282 blockcount 0x4 > iomap_readpage_actor: orig pos 180224 pos 181248, length 4096, poff 1024 plen 2048, isize 183296 > xfs_iomap_found: dev 259:1 ino 0x43 size 0x2cc00 offset 0x2cc00 count 1024 type hole startoff 0xb3 startblock 285 blockcount 0x1 > iomap_readpage_actor: orig pos 183296 pos 183296, length 1024, poff 3072 plen 1024, isize 183296 > > Heere we see a trace where the first block on the EOF page is up to > date, hence poff = 1024 bytes. The offset into the page of EOF is > 3072, so the range we want to read is 1024 - 3071, and the range we > want to zero is 3072 - 4095. You can see this is split correctly > now. > > This fixes the stale data beyond EOF problem that fsx quickly > uncovers on 1k block size filesystems. > > Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx> Looks ok, will test... Reviewed-by: Darrick J. Wong <darrick.wong@xxxxxxxxxx> --D > --- > > Testing the first version of this patch failed the direct IO fsx > test case I've been using at ~16.5 million ops when it tripped over > a partially up to date EOF page. I used the wrong length to detect > EOF being crossed, this update fixes that and it no longer fails at > 16.5m ops. We'll see how long it lasts this time.... > > fs/iomap.c | 10 +++++++--- > 1 file changed, 7 insertions(+), 3 deletions(-) > > diff --git a/fs/iomap.c b/fs/iomap.c > index d51e7a2ae641..b95110867eee 100644 > --- a/fs/iomap.c > +++ b/fs/iomap.c > @@ -142,13 +142,14 @@ static void > iomap_adjust_read_range(struct inode *inode, struct iomap_page *iop, > loff_t *pos, loff_t length, unsigned *offp, unsigned *lenp) > { > + loff_t orig_pos = *pos; > + loff_t isize = i_size_read(inode); > unsigned block_bits = inode->i_blkbits; > unsigned block_size = (1 << block_bits); > unsigned poff = offset_in_page(*pos); > unsigned plen = min_t(loff_t, PAGE_SIZE - poff, length); > unsigned first = poff >> block_bits; > unsigned last = (poff + plen - 1) >> block_bits; > - unsigned end = offset_in_page(i_size_read(inode)) >> block_bits; > > /* > * If the block size is smaller than the page size we need to check the > @@ -183,8 +184,11 @@ iomap_adjust_read_range(struct inode *inode, struct iomap_page *iop, > * handle both halves separately so that we properly zero data in the > * page cache for blocks that are entirely outside of i_size. > */ > - if (first <= end && last > end) > - plen -= (last - end) * block_size; > + if (orig_pos <= isize && orig_pos + length > isize) { > + unsigned end = offset_in_page(isize - 1) >> block_bits; > + if (first <= end && last > end) > + plen -= (last - end) * block_size; > + } > > *offp = poff; > *lenp = plen;