On 2020/4/17 11:26, Ritesh Harjani wrote: > Hello Yi, > > On 4/17/20 7:36 AM, zhangyi (F) wrote: >> Hi, Ritesh >> >> On 2020/4/17 2:33, Ritesh Harjani wrote: >>> Hello Kun, >>> >>> On 4/16/20 7:49 PM, yangerkun wrote: >>>> Nowadays, we trigger the a bug that has been reported before[1](trigger the bug with read block bitmap error before). After search the patch, >>>> I found some related patch which has not been included in our kernel. >>>> >>>> eb5760863fc2 ext4: mark block bitmap corrupted when found instead of BUGON >>>> 736dedbb1a7d ext4: mark block bitmap corrupted when found >>>> 206f6d552d0c ext4: mark inode bitmap corrupted when found >>>> db79e6d1fb1f ext4: add new ext4_mark_group_bitmap_corrupted() helper >>>> 0db9fdeb347c ext4: fix wrong return value in ext4_read_inode_bitmap() >>> >>> I see that you anyways have figured all these patches out. >>> >>>> >>>> Maybe this patch can fix the problem, but I am a little confused with >>>> the explain from Ted described in the mail: >>>> >>>> > What probably happened is that the page containing actual allocation >>>> > bitmap was pushed out of memory due to memory pressure. However, the >>>> > buddy bitmap was still cached in memory. That's actually quite >>>> > possible since the buddy bitmap will often be referenced more >>>> > frequently than the allocation bitmap (for example, while searching >>>> > for free space of a specific size, and then having that block group >>>> > skipped when it's not available). >>>> >>>> > Since there was an I/O error reading the allocation bitmap, the buffer >>>> > is not valid. So it's not surprising that the BUG_ON(k >= max) is >>>> > getting triggered. >>> >>> @Others, please correct me if I am wrong here. >>> >>> So just as a small summary. Ext4 maintains an inode (we call it as >>> buddy cache inode which is sbi->s_buddy_cache) which stores the block >>> bitmap and buddy information for every block group. So we require 2 >>> blocks for every block group to store both of this info in it. >>> >>> So what generally happens is whenever there is a request to block >>> allocation, this(buddy and block bitmap information is loaded from the >>> disk into the page cache. >>> >>> When someone does the block allocation these pages get loaded into the >>> page cache. And it will be there until these pages are getting heavily >>> used (that's coz of page eviction algo in mm). >>> But in case when the memory pressure is high, these pages may get >>> written out and eventually getting evicted from the page cache. >>> Now if any of this page is not present in the page cache we go and try >>> to read it from the disk. (I think that's the job of >>> ext4_mb_load_buddy_gfp()). >>> >>> So let's say while reading this page from disk we get an I/O error, >>> so this means, as Ted explained, that the buffer which was not properly >>> read and hence it is not uptodate (and so we also don't set buffer >>> verified bit). >>> And in that case we should mark that block group corrupted. So that next >>> time, ext4_mb_good_group() does not allow us to do allocation from that >>> block group. I think some of the patches which you pointed add the logic >>> into the mballoc. So that we don't hit that bug_on(). >>> >>> {... >>> [Addition info - not related to your issue though] >>> So this could also be an e.g. where the grp->bb_free may not be uptodate >>> for a block group of which bitmap was not properly loaded. >>> ...} >>> >>> >>>> >>>> (Our machine: x86, 4K page size, 4K block size) >>>> >>>> After check the related code, we found that once we get a IO error from ext4_wait_block_bitmap, ext4_mb_init_cache will return directly with a error number, so the latter ext4_mb_simple_scan_group may never been called! So any other scene will trigger this BUG_ON? >>> >>> Sorry that's not what I see in latest upstream kernel. >>> I am not sure which kernel version you are checking this on. >>> Check the latest upstream kernel and compare with it. >>> >>> >> >> Thanks for your reply. >> >> We check the upstream kernel 5.7-rc1, on our machine which has 4K page size >> and 4K block size, if the ext4_wait_block_bitmap() invoked from >> ext4_mb_init_cache() return -EIO, the 'err' variable will be set and the >> subsequent loop will be jumped out due to '!buffer_verified[group - first_group] >> && blocks_per_page == 1', so the -EIO error number will return by >> ext4_mb_load_buddy() and there is no chance to invoke ext4_mb_simple_scan_group() >> and trigger BUG_ON(). >> >> static int ext4_mb_init_cache(struct page *page, char *incore, gfp_t gfp) >> { >> ... >> /* wait for I/O completion */ >> for (i = 0, group = first_group; i < groups_per_page; i++, group++) { >> ... >> err2 = ext4_wait_block_bitmap(sb, group, bh[i]); >> if (!err) >> err = err2; <------ set -EIO here >> } >> >> first_block = page->index * blocks_per_page; >> for (i = 0; i < blocks_per_page; i++) { >> group = (first_block + i) >> 1; >> ... >> if (!buffer_verified(bh[group - first_group])) >> /* Skip faulty bitmaps */ >> continue;<----- blocks_per_page == 1, we jump out here >> err = 0; <---- never excute >> ... >> out: >> ... >> return err; >> } >> >> static noinline_for_stack int >> ext4_mb_regular_allocator(struct ext4_allocation_context *ac) >> { >> ... >> err = ext4_mb_load_buddy(sb, group, &e4b); >> if (err) >> goto out; <--- return here >> ... >> if (cr == 0) >> ext4_mb_simple_scan_group(ac, &e4b); <--- never invoke >> ... >> } > > Yup, I guess what you mentioned is correct. But I noted one other thing. > Check if below could lead to this. > > static int ext4_mb_init_cache(struct page *page, char *incore, gfp_t gfp) > { > <...> > first_group = page->index * blocks_per_page / 2; > > /* read all groups the page covers into the cache */ > for (i = 0, group = first_group; i < groups_per_page; i++, group++) { > if (group >= ngroups) > break; > > grinfo = ext4_get_group_info(sb, group); > /* > * If page is uptodate then we came here after online resize > * which added some new uninitialized group info structs, so > * we must skip all initialized uptodate buddies on the page, > * which may be currently in use by an allocating task. > */ > if (PageUptodate(page) && !EXT4_MB_GRP_NEED_INIT(grinfo)) { > bh[i] = NULL; > continue; > } > bh[i] = ext4_read_block_bitmap_nowait(sb, group); > if (IS_ERR(bh[i])) { > err = PTR_ERR(bh[i]); > bh[i] = NULL; > goto out; > } > mb_debug(1, "read bitmap for group %u\n", group); > } > > /* wait for I/O completion */ > for (i = 0, group = first_group; i < groups_per_page; i++, group++) { > int err2; > > if (!bh[i]) > continue; > err2 = ext4_wait_block_bitmap(sb, group, bh[i]); > if (!err) > err = err2; > } > > first_block = page->index * blocks_per_page; > for (i = 0; i < blocks_per_page; i++) { > <...> > if (!buffer_verified(bh[group - first_group])) > /* Skip faulty bitmaps */ > continue; > err = 0; ====> yes this was not set I think. > > <...> > } > SetPageUptodate(page); ========> But it seems we are still setting uptodate bit on page. > > out: > if (bh) { > for (i = 0; i < groups_per_page; i++) > brelse(bh[i]); > if (bh != &bhs) > kfree(bh); > } > return err; > } > > > ext4_mb_load_buddy_gfp() { > <...> > > > /* we could use find_or_create_page(), but it locks page > * what we'd like to avoid in fast path ... */ > page = find_get_page_flags(inode->i_mapping, pnum, FGP_ACCESSED); > if (page == NULL || !PageUptodate(page)) { ====> next time we won't go in this if condition. (since PageUptodate is already set) > <...> > page = find_or_create_page(inode->i_mapping, pnum, gfp); > if (page) { > BUG_ON(page->mapping != inode->i_mapping); > if (!PageUptodate(page)) { > ret = ext4_mb_init_cache(page, NULL, gfp); > <...> > } > unlock_page(page); > } > } > if (page == NULL) { > ret = -ENOMEM; > goto err; > } > if (!PageUptodate(page)) { > ret = -EIO; > goto err; > } > <...> > } > > > So maybe since the PageUptodate bit was set on page previously as > highlighted above. Next time when there will be any allocation request, > we won't read those pages again from disk (thinking that it's uptodate. > And hence may encounter that BUG. Thoughts? > Indeed,it looks correct, that's match our error log: [409589.665086] EXT4-fs error (device sda5): ext4_wait_block_bitmap:524: comm Kbaselogd: Cannot read block bitmap - block_group = 529, block_bitmap = 17334272 [409590.664990] EXT4-fs error (device sda5): ext4_wait_block_bitmap:524: comm Kbaselogd: Cannot read block bitmap - block_group = 529, block_bitmap = 17334272 The first one is triggered while building block bitmap page when first invoking ext4_mb_load_buddy(), the second one is triggered while building buddy bitmap bage at second time. After these two failure, both bitmap pages are inconsistency with ext4_group_info:bb_counters, and finally encounter the BUG_ON at third time. > If above is true, then may be we should not call > "SetPageUptodate(page)", in case of an error reading block bitmap? > Thoughts? > Yeah, it's better to set page uptodate only if all block bitmap buffers are uptodate represent by this page. Thanks, Yi. > One other thing from [1]. I guess it doesn't have a full kernel log. > So maybe there were some previous I/O errors as well and it failed > to update the superblock buffer_head even before. So something > may have messed up long before and it slipped through all the cracks > until it crashed in mballoc. > > > -ritesh > > >> >> We also find that ext4_group_info:bb_counters and the corresponding buddy bit map >> are updated or initialized at the same time, so even if we encounter page miss and >> forget to mark that block group corrupted due to IO failure, it seems that it also >> could not trigger this inconsistency. Am I missing something ? >> >> Thanks, >> Yi. >> >>>> ----- >>>> [1] https://www.spinics.net/lists/linux-ext4/msg60329.html >>>> >>> >>> >>> . >> > > > .