On Mon 03-04-23 22:02:56, Baokun Li wrote: > On 2023/3/30 0:22, Jan Kara wrote: > > On Wed 29-03-23 15:23:19, Baokun Li wrote: > > > On 2023/3/28 18:00, Jan Kara wrote: > > > > On Mon 27-03-23 21:09:42, Baokun Li wrote: > > > > > On 2023/3/27 20:47, Jan Kara wrote: > > > > > > On Sat 25-03-23 14:34:43, Baokun Li wrote: > > > > > > > In our fault injection test, we create an ext4 file, migrate it to > > > > > > > non-extent based file, then punch a hole and finally trigger a WARN_ON > > > > > > > in the ext4_da_update_reserve_space(): > > > > > > > > > > > > > > EXT4-fs warning (device sda): ext4_da_update_reserve_space:369: > > > > > > > ino 14, used 11 with only 10 reserved data blocks > > > > > > > > > > > > > > When writing back a non-extent based file, if we enable delalloc, the > > > > > > > number of reserved blocks will be subtracted from the number of blocks > > > > > > > mapped by ext4_ind_map_blocks(), and the extent status tree will be > > > > > > > updated. We update the extent status tree by first removing the old > > > > > > > extent_status and then inserting the new extent_status. If the block range > > > > > > > we remove happens to be in an extent, then we need to allocate another > > > > > > > extent_status with ext4_es_alloc_extent(). > > > > > > > > > > > > > > use old to remove to add new > > > > > > > |----------|------------|------------| > > > > > > > old extent_status > > > > > > > > > > > > > > The problem is that the allocation of a new extent_status failed due to a > > > > > > > fault injection, and __es_shrink() did not get free memory, resulting in > > > > > > > a return of -ENOMEM. Then do_writepages() retries after receiving -ENOMEM, > > > > > > > we map to the same extent again, and the number of reserved blocks is again > > > > > > > subtracted from the number of blocks in that extent. Since the blocks in > > > > > > > the same extent are subtracted twice, we end up triggering WARN_ON at > > > > > > > ext4_da_update_reserve_space() because used > ei->i_reserved_data_blocks. > > > > > > Hum, but this second call to ext4_map_blocks() should find already allocated > > > > > > blocks in the indirect block and thus should not be subtracting > > > > > > ei->i_reserved_data_blocks for the second time. What am I missing? > > > > > > > > > > > > Honza > > > > > > > > > > > ext4_map_blocks > > > > > 1. Lookup extent status tree firstly > > > > > goto found; > > > > > 2. get the block without requesting a new file system block. > > > > > found: > > > > > 3. ceate and map the block > > > > > > > > > > When we call ext4_map_blocks() for the second time, we directly find the > > > > > corresponding blocks in the extent status tree, and then go directly to step > > > > > 3, > > > > > because our flag is brand new and therefore does not contain EXT4_MAP_MAPPED > > > > > but contains EXT4_GET_BLOCKS_CREATE, thus subtracting > > > > > ei->i_reserved_data_blocks > > > > > for the second time. > > > > Ah, I see. Thanks for explanation. But then the problem is deeper than just > > > > a mismatch in number of reserved delalloc block. The problem really is that > > > > if extent status tree update fails, we have inconsistency between what is > > > > stored in the extent status tree and what is stored on disk. And that can > > > > cause even data corruption issues in some cases. > > > The scenario we encountered was this: > > > ``` > > > write: > > > ext4_es_insert_delayed_block > > > [0/16) 576460752303423487 (U,D) > > > writepages: > > > alloc lblk 11 pblk 35328 > > > [0/16) 576460752303423487 (U,D) > > > -- remove block 11 from extent > > > [0/11) 576460752303423487 (U,D,R) + (Newly allocated)[12/4) > > > 549196775151 (U,D,R) > > > --Failure to allocate memory for a new extent will undo as: > > > [0/16) 576460752303423487 (U,D,R) > > Yes, this is what I was expecting. So now extent status tree is > > inconsistent with the on-disk allocation info because the block 11 is > > already allocated on disk but recorded as unallocated in the extent status > > tree. > > Yes! There is an inconsistency here, but do_writepages finds that the > writeback returns -ENOMEM and keeps retrying until it succeeds, at which > point the above inconsistency does not exist. Well, do_writepages() will not retry if wbc->sync_mode == WB_SYNC_NONE. So the inconsistency can stay for a long time. > > If the similar problem happened say when we punch a hole into a middle of a > > written extent and so block on disk got freed but extent status tree would > > still record it as allocated, user would be able to access freed block thus > > potentially exposing sensitive data. > > ext4_punch_hole > // remove extents in extents status tree > ext4_es_remove_extent > // remove extents tree on disk > ext4_ext_remove_space > > In this scenario, we always try to delete the extents in the in-memory > extents status tree first, and then delete the extents tree on disk. So > even if we fail in deleting extents in memory, there is no inconsistency, > am I missing something? No, you are right, this case is safe. Still I suspect inconsistencies with extent status tree can cause more problems and possibly stale data exposure. > > > -- if success insert block 11 to extent status tree > > > [0/11) 576460752303423487 (U,D,R) + (Newly allocated)[11/1) 35328 (W) > > > + [12/4) 549196775151 (U,D,R) > > > > > > U: UNWRITTEN > > > D: DELAYED > > > W: WRITTEN > > > R: REFERENCED > > > ``` > > > > > > When we fail to allocate a new extent, we don't map buffer and we don't do > > > io_submit, so why is the extent tree in memory inconsistent with the one > > > stored on disk? Am I missing something? > > > > > > I would appreciate it if you could explain under what cases and what kind of > > > data corruption issues can be caused. > > See above. > > > > > > And this should also fix the problem you've hit because in case of > > > > allocation failure we may just end up with removed extent from the extent > > > > status tree and thus we refetch info from the disk and find out blocks are > > > > already allocated. > > > Reloading extent tree from disk I don't quite understand here, how do we > > > handle reserved blocks? could you explain it in more detail? > > > > > > Logically, I think it is still necessary to update i_reserved_data_blocks > > > only after a successful allocation. This is also done in > > > ext4_ext_map_blocks(). > > I guess there is some misunderstanding here. Both with > > ext4_ext_map_blocks() and ext4_ind_map_blocks() we end up updating > > i_reserved_data_blocks only after the blocks are successfully allocated and > > inserted in the respective data structure but *before* updating extent > > status tree. If extent status tree operation fails, we currently get > > inconsistency between extent status tree and on-disk info in both cases > > AFAICS. Am I missing something? > > Yes, our code is indeed designed to only update the number of reserved > blocks after the block allocation is complete. We have different > treatment for extent based file and non-extent based file in commit > 5f634d064c70 ("ext4: Fix quota accounting error with fallocate"). > > For extent based file, we update the number of reserved blocks before the > "got_allocated_blocks" tag after the blocks are successfully allocated in > ext4_ext_map_blocks(). > > For the non-extent based file we update the number of reserved blocks > after ext4_ind_map_blocks() is executed, which leads to the problem that > when we call ext4_ind_map_blocks() to create a block, it does not always > create a block. For example, if the extents status tree we encountered > earlier does not match the extents tree on disk, this is of course a > problem in itself, but in terms of code logic, updating the number of > reserved blocks as ext4_ext_map_blocks() does can prevent us from trying > to create a block and not creating it, resulting in an incorrect number > of reserved blocks. I see, thanks for explanation! Indeed it may be good to cleanup this code so that indirect block and extent based inodes are handled in the same way. Honza -- Jan Kara <jack@xxxxxxxx> SUSE Labs, CR