* Eric Whitney <enwlinux@xxxxxxxxx>: > When converting files with inline data to extents, delayed allocations > made on a file system created with both the bigalloc and inline options > can result in invalid extent status cache content, incorrect reserved > cluster counts, kernel memory leaks, and potential kernel panics. > > With bigalloc, the code that determines whether a block must be > delayed allocated searches the extent tree to see if that block maps > to a previously allocated cluster. If not, the block is delayed > allocated, and otherwise, it isn't. However, if the inline option is > also used, and if the file containing the block is marked as able to > store data inline, there isn't a valid extent tree associated with > the file. The current code in ext4_clu_mapped() calls > ext4_find_extent() to search the non-existent tree for a previously > allocated cluster anyway, which typically finds nothing, as desired. > However, a side effect of the search can be to cache invalid content > from the non-existent tree (garbage) in the extent status tree, > including bogus entries in the pending reservation tree. > > To fix this, avoid searching the extent tree when allocating blocks > for bigalloc + inline files that are being converted from inline to > extent mapped. > > Signed-off-by: Eric Whitney <enwlinux@xxxxxxxxx> > --- > fs/ext4/extents.c | 8 ++++++++ > 1 file changed, 8 insertions(+) > > diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c > index f1956288307f..a8928d6d47ac 100644 > --- a/fs/ext4/extents.c > +++ b/fs/ext4/extents.c > @@ -5791,6 +5791,14 @@ int ext4_clu_mapped(struct inode *inode, ext4_lblk_t lclu) > struct ext4_extent *extent; > ext4_lblk_t first_lblk, first_lclu, last_lclu; > > + /* > + * if data can be stored inline, the logical cluster isn't > + * mapped - no physical clusters have been allocated, and the > + * file has no extents > + */ > + if (ext4_test_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA)) > + return 0; > + > /* search for the extent closest to the first block in the cluster */ > path = ext4_find_extent(inode, EXT4_C2B(sbi, lclu), NULL, 0); > if (IS_ERR(path)) { > -- > 2.30.2 > The bug this patch addresses was a topic in last week's concall, and on request here's some more detailed information describing the circumstances of the test failure I saw. generic/478 produced a kernel panic during a -g auto run of kvm-xfstests' bigalloc_inline test configuration. This was caused by a BUG_ON in ext4_es_cache_extent, where the value of lblk passed in from ext4_find_extent -> ext4_cache_extents was 0xffffffff and len was a much smaller non-zero value. Running generic/478 alone isn't sufficient to reproduce the problem - generic/192 must be run first. That sequence of tests can be reduced to the following reproducer (in fstest terms): echo test >$TEST_DIR/testfile _test_cycle_mount $XFS_IO_PROG -f -c "pwrite -S 0xFF 0 4096" \ $TEST_DIR/testfile >> $seqres.full 2>&1 What happens as a result of the xfs_io pwrite is that the inline data area is read from the test file and treated as extents which ext4_find_extent attempts to cache before proceeding to search for an allocated cluster, as noted above, at the point in time where the file is being converted from inline to extents. (All this descends from ext4_da_convert_inline_data_to_extent). The BUG_ON is triggered because there happens to be an 0xffffffff value in the inline data area that is interpreted as an lblk. Before that happens, a few bogus extents are inserted in the extent status cache. It seems clear that if the BUG_ON hadn't been triggered with an appropriate value, execution would have continued with the possibility of the consequences noted above. A look at the extent header passed to ext4_cache_extents by ext4_find_extent reveals a bad magic number which reflects a portion of the string "test" rather than EXT4_EXT_MAGIC (0xf30a). Removing the _test_cycle_mount leads to a different result. The reproducer executes without triggering a panic. The difference in this case is that there is an empty extent tree in place of the inline data; the magic number is valid. ext4_find_extent proceeds without caching anything, and fails to find an allocated cluster for that empty tree. All at least okay. So, the difference appears to be that the new extent tree hasn't been set up before the delayed allocation activity resulting from conversion in the umount/mount case, and it has when that doesn't take place in the other case. I've been looking for the reason. In either case, there's no point in going through the motions of searching for a potential cluster mapping at the time of conversion when there can be no extent tree yet - hence the patch above. There's also the possibility of shortcutting this higher in the call chain in ext4_da_map_blocks (where there's an odd twist - extent status cache gets searched but extent tree does not before delayed allocation). For the moment, doing this in ext4_clu_mapped seems lowest risk. An aside: wouldn't it make sense to harden ext4_find_extent a little more by checking eh_magic? Eric