> On Jul 31, 2020, at 1:08 PM, Theodore Ts'o <tytso@xxxxxxx> wrote: > > From: Alex Zhuravlev <azhuravlev@xxxxxxxxxxxxx> > > cr=0 is supposed to be an optimization to save CPU cycles, but if > buddy data (in memory) is not initialized then all this makes no sense > as we have to do sync IO taking a lot of cycles. Also, at cr=0 > mballoc doesn't choose any avaibale chunk. cr=1 also skips groups (typo) "available" > using heuristic based on avg. fragment size. It's more useful to skip > such groups and switch to cr=2 where groups will be scanned for > available chunks. However, we always read the first block group in a > flex_bg so metadata blocks will get read into the first flex_bg if > possible. > > Using sparse image and dm-slow virtual device of 120TB was > simulated, then the image was formatted and filled using debugfs to > mark ~85% of available space as busy. mount process w/o the patch > couldn't complete in half an hour (according to vmstat it would take > ~10-11 hours). With the patch applied mount took ~20 seconds. > > Lustre-bug-id: https://jira.whamcloud.com/browse/LU-12988 > Signed-off-by: Alex Zhuravlev <azhuravlev@xxxxxxxxxxxxx> > Reviewed-by: Andreas Dilger <adilger@xxxxxxxxxxxxx> > Reviewed-by: Artem Blagodarenko <artem.blagodarenko@xxxxxxxxx> Looks good, one trivial cleanup below could also be done before landing. > --- > fs/ext4/mballoc.c | 22 +++++++++++++++++++++- > 1 file changed, 21 insertions(+), 1 deletion(-) > > diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c > index fcc702f1ff15..b1ef35a9e9f1 100644 > --- a/fs/ext4/mballoc.c > +++ b/fs/ext4/mballoc.c > @@ -2195,7 +2196,26 @@ static int ext4_mb_good_group_nolock(struct ext4_allocation_context *ac, > > /* We only do this if the grp has never been initialized */ > if (unlikely(EXT4_MB_GRP_NEED_INIT(grp))) { > - ret = ext4_mb_init_group(ac->ac_sb, group, GFP_NOFS); > + struct ext4_group_desc *gdp = > + ext4_get_group_desc(sb, group, NULL); > + int ret; > + > + /* cr=0/1 is a very optimistic search to find large > + * good chunks almost for free. If buddy data is not > + * ready, then this optimization makes no sense. But > + * we never skip the first block group in a flex_bg, > + * since this gets used for metadata block allocation, > + * and we want to make sure we locate metadata blocks > + * in the first block group in the flex_bg if > + * possible. (style) "possible" could fit on the previous line? > + */ > + if (cr < 2 && > + (!sbi->s_log_groups_per_flex || > + ((group & ((1 << sbi->s_log_groups_per_flex) - 1)) != 0)) && > + !(ext4_has_group_desc_csum(sb) && > + (gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)))) > + return 0; > + ret = ext4_mb_init_group(sb, group, GFP_NOFS); > if (ret) > return ret; > } > -- > 2.24.1 > Cheers, Andreas
Attachment:
signature.asc
Description: Message signed with OpenPGP