Baokun Li <libaokun1@xxxxxxxxxx> writes: > On 2023/7/20 20:44, Ritesh Harjani (IBM) wrote: >> Baokun Li <libaokun1@xxxxxxxxxx> writes: >> >>> When we calculate the end position of ext4_free_extent, this position may >>> be exactly where ext4_lblk_t (i.e. uint) overflows. For example, if >>> ac_g_ex.fe_logical is 4294965248 and ac_orig_goal_len is 2048, then the >>> computed end is 0x100000000, which is 0. If ac->ac_o_ex.fe_logical is not >>> the first case of adjusting the best extent, that is, new_bex_end > 0, the >>> following BUG_ON will be triggered: >> Nice spotting. >> >>> ========================================================= >>> kernel BUG at fs/ext4/mballoc.c:5116! >>> invalid opcode: 0000 [#1] PREEMPT SMP PTI >>> CPU: 3 PID: 673 Comm: xfs_io Tainted: G E 6.5.0-rc1+ #279 >>> RIP: 0010:ext4_mb_new_inode_pa+0xc5/0x430 >>> Call Trace: >>> <TASK> >>> ext4_mb_use_best_found+0x203/0x2f0 >>> ext4_mb_try_best_found+0x163/0x240 >>> ext4_mb_regular_allocator+0x158/0x1550 >>> ext4_mb_new_blocks+0x86a/0xe10 >>> ext4_ext_map_blocks+0xb0c/0x13a0 >>> ext4_map_blocks+0x2cd/0x8f0 >>> ext4_iomap_begin+0x27b/0x400 >>> iomap_iter+0x222/0x3d0 >>> __iomap_dio_rw+0x243/0xcb0 >>> iomap_dio_rw+0x16/0x80 >>> ========================================================= >>> >>> A simple reproducer demonstrating the problem: >>> >>> mkfs.ext4 -F /dev/sda -b 4096 100M >>> mount /dev/sda /tmp/test >>> fallocate -l1M /tmp/test/tmp >>> fallocate -l10M /tmp/test/file >>> fallocate -i -o 1M -l16777203M /tmp/test/file >>> fsstress -d /tmp/test -l 0 -n 100000 -p 8 & >>> sleep 10 && killall -9 fsstress >>> rm -f /tmp/test/tmp >>> xfs_io -c "open -ad /tmp/test/file" -c "pwrite -S 0xff 0 8192" >> >> Could you please also add it into xfstests? > Sure!I'll try to push this test case to xfstests. Thanks that would be great! >> I think it's a nice test which can check the boundary conditions for >> start and end of data types used in mballoc. I think it should even work >> if you don't do fsstress but instead just fallocate some remaining space >> in filesystem, so that when you open and try to write it to 0th offset, >> if automatically hits this error in ext4_mb_new_inode_pa(). > Yes, the fsstress here is just to fill up the remaining space on the disk. >> >>> We declare new_bex_start and new_bex_end as correct types and use fex_end() >>> to avoid the problems caused by the ext4_lblk_t overflow above. >>> >>> Fixes: 93cdf49f6eca ("ext4: Fix best extent lstart adjustment logic in ext4_mb_new_inode_pa()") >>> Signed-off-by: Baokun Li <libaokun1@xxxxxxxxxx> >>> --- >>> fs/ext4/mballoc.c | 11 +++++------ >>> 1 file changed, 5 insertions(+), 6 deletions(-) >>> >>> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c >>> index eb7f5d35ef96..2090e5e7ba58 100644 >>> --- a/fs/ext4/mballoc.c >>> +++ b/fs/ext4/mballoc.c >>> @@ -5076,8 +5076,8 @@ ext4_mb_new_inode_pa(struct ext4_allocation_context *ac) >>> pa = ac->ac_pa; >>> >>> if (ac->ac_b_ex.fe_len < ac->ac_orig_goal_len) { >>> - int new_bex_start; >>> - int new_bex_end; >>> + ext4_lblk_t new_bex_start; >>> + loff_t new_bex_end; >>> >>> /* we can't allocate as much as normalizer wants. >>> * so, found space must get proper lstart >>> @@ -5096,8 +5096,7 @@ ext4_mb_new_inode_pa(struct ext4_allocation_context *ac) >>> * still cover original start >>> * 3. Else, keep the best ex at start of original request. >>> */ >>> - new_bex_end = ac->ac_g_ex.fe_logical + >>> - EXT4_C2B(sbi, ac->ac_orig_goal_len); >>> + new_bex_end = fex_end(sbi, &ac->ac_g_ex, &ac->ac_orig_goal_len); >>> new_bex_start = new_bex_end - EXT4_C2B(sbi, ac->ac_b_ex.fe_len); >>> if (ac->ac_o_ex.fe_logical >= new_bex_start) >>> goto adjust_bex; >>> @@ -5117,8 +5116,8 @@ ext4_mb_new_inode_pa(struct ext4_allocation_context *ac) >>> >>> BUG_ON(ac->ac_o_ex.fe_logical < ac->ac_b_ex.fe_logical); >>> BUG_ON(ac->ac_o_ex.fe_len > ac->ac_b_ex.fe_len); >>> - BUG_ON(new_bex_end > (ac->ac_g_ex.fe_logical + >>> - EXT4_C2B(sbi, ac->ac_orig_goal_len))); >> Ok so the right hand becomes 0 (because then end can go upto 1<<32 which >> will be 0 for unsigned int). And the left (new_bex_end) might be >> negative due to some operations above as I see it. >> And comparing an int with unsigned int, it will promote new_bex_end to >> unsigned int which will make it's value too large and will hit the >> bug_on. > Exactly! >> >> I would like to carefully review all such paths. I will soon review and >> get back. > Okay, thank you very much for your careful review. > The 2nd and 3rd cases of adjusting the best extent are impossible to > overflow, > so only the first case is converted here. I noticed them too during review. I think it would be safe to make the changes in other two places as well such that in future we never trip over such overlooked overflow bugs. >> >> >>> + BUG_ON(new_bex_end > >>> + fex_end(sbi, &ac->ac_g_ex, &ac->ac_orig_goal_len)); >> I am not sure whether using fex_end or pa_end is any helpful. >> I think we can just typecast if needed and keep it simple rather >> than adding helpers functions for addition operation. >> (because of the fact that fex_end() can take a third parameter which >> sometimes you pass as NULL. Hence it doesn't look clean, IMO) > I added the helper functions here for two reasons: > 1. restricting the type of the return value. > 2. This avoids the ugly line breaks in most cases. > > The fex_end() indeed doesn't look as clean as the pa_end(), because we > might use > the start of the free extent plus some other length to get a new end, > like right in > ext4_mb_new_inode_pa(), which makes me have to add another extra length > argument, but I think it's worth it, and even with the addition of a > parameter > that will probably be unused, it still looks a lot shorter than the > original code. IMO, we don't need pa_end() and fex_end() at all. In several places in ext4 we always have taken care by directly typecasting to avoid overflows. Also it reads much simpler rather to typecast in place than having a helper function which is also not very elegant due to a third parameter. Hence I think we should drop those helpers. Thanks once again for catching the overflows and coming up with a easy reproducer. I am surprised that this bug was never caught with LTP, fstests, smatch static checker. How did you find it? :) -ritesh