Theodore Y. Ts'o wrote on 2020/12/10 3:39:
> On Wed, Dec 09, 2020 at 07:48:09PM +0800, brookxu wrote:
>>
>> Maybe I missed something. If i % meta_bg_size is used instead, then
>> if flex_size < 64 we will miss some flex_bgs. There seems to be a
>> contradiction here. In the scenario where only flex_bg is enabled,
>> it may not be appropriate to use meta_bg_size; in the scenario where
>> only meta_bg is enabled, it may not be appropriate to use flex_size.
>>
>> As you said before, it may be better to remove
>>
>>     if ((i < 5) || ((i % flex_size) == 0))
>>
>> and do it for all groups.
>
> I don't think the original (i % flex_size) made any sense in the first
> place.
>
> What flex_bg does is that it collects the allocation bitmaps and inode
> tables for each block group and locates them within the first block
> group in a flex_bg. It doesn't have anything to do with whether or
> not a particular block group has a backup copy of the superblock and
> block group descriptor table --- in non-meta_bg file systems and the
> meta_bg file systems where the block group is less than
> s_first_meta_bg * EXT4_DESC_PER_BLOCK(sb). And the condition in
> question is only about whether or not to add the backup superblock and
> backup block group descriptors. So checking for i % flex_size made no
> sense, and I'm not sure why that check was there in the first place.

I think we should add the backup superblocks and gdt blocks to the
system zone, because these blocks should not be used by applications.
In fact, I think the existing code already does part of this work.

>> In this way we won't miss any flex_bg, meta_bg or sparse_bg.
>> I tested it on an 80T disk and found that the performance loss
>> was small:
>>
>> unpatched kernel:
>> ext4_setup_system_zone() takes 524ms
>>
>> patched kernel:
>> ext4_setup_system_zone() takes 552ms
>
> I don't really care that much about the time it takes to execute
> ext4_setup_system_zone().
>
> The really interesting question is how large is the rb_tree
> constructed by that function, and what is the percentage increase of
> time that the ext4_inode_block_valid() function takes. (e.g., how
> much additional memory is the system_blks tree taking, and how deep is
> that tree, since ext4_inode_block_valid() gets called every time we
> allocate or free a block, and every time we need to validate an extent
> tree node.)

During detailed analysis, I found that when the current logic calls
ext4_setup_system_zone(), s_log_groups_per_flex has not been
initialized yet, so flex_size is always 1, which seems to be a mistake.
Therefore

    if (ext4_bg_has_super(sb, i) &&
        ((i < 5) || ((i % flex_size) == 0)))

degenerates to

    if (ext4_bg_has_super(sb, i))

So the existing implementation only adds the backup superblocks of the
sparse groups to the system zone. Due to this mistake, the behavior of
the system in the flex_bg scenario happens to be correct.

I tested three scenarios: only meta_bg, only flex_bg, and both flex_bg
and meta_bg enabled. The test results are as follows:

meta_bg only:
    unpatched kernel: ext4_setup_system_zone time 866ms, node count 1309087
    patched kernel:   ext4_setup_system_zone time 841ms, node count 1309087

Since the backup gdt blocks of a meta_bg are adjacent to the block
bitmap (BB), they can be merged into one node, so no additional nodes
are added.

flex_bg only:
    unpatched kernel: ext4_setup_system_zone time 529ms, node count 41016
    patched kernel:   ext4_setup_system_zone time 553ms, node count 41016

The system behavior has not changed: the backup sb and gdt blocks of
all sparse groups are still added, so no additional nodes are added.
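As an aside, for anyone following along, here is a small userspace
sketch (my own illustration, not the kernel code) of the sparse_super
placement rule that ext4_bg_has_super() implements: superblock/gdt
copies live in groups 0 and 1 and in groups that are powers of 3, 5
or 7. This is exactly the set of groups the degenerate check above
ends up adding:

    /*
     * Userspace sketch of the sparse_super placement rule, for
     * illustration only (not kernel code). Build with: cc sparse.c
     */
    #include <stdio.h>

    /* Return 1 iff group is a positive power of base. */
    static int is_power_of(unsigned long group, unsigned long base)
    {
            while (group > 1) {
                    if (group % base)
                            return 0;
                    group /= base;
            }
            return group == 1;
    }

    /* Groups 0, 1 and powers of 3, 5, 7 carry a sb/gdt copy. */
    static int bg_has_super(unsigned long group)
    {
            if (group <= 1)
                    return 1;
            return is_power_of(group, 3) || is_power_of(group, 5) ||
                   is_power_of(group, 7);
    }

    int main(void)
    {
            unsigned long ngroups = 1024;   /* example group count */

            for (unsigned long i = 0; i < ngroups; i++)
                    if (bg_has_super(i))
                            printf("group %lu has a sb/gdt copy\n", i);
            return 0;
    }

For 1024 groups this prints groups 0, 1, 3, 5, 7, 9, 25, 27, 49, 81,
125, 243, 343, 625 and 729 --- a fairly sparse set, which is why the
per-group overhead of adding these blocks to the system zone stays
small.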
meta_bg & flex_bg:
    unpatched kernel: ext4_setup_system_zone time 535ms, node count 41016
    patched kernel:   ext4_setup_system_zone time 571ms, node count 61508

In addition to the sparse groups, we now need to add the backup gdt
blocks of the meta_bgs. Let N = max(flex_bg_size / meta_bg_size, 1);
then every N meta_bgs have one gdt block that can be merged into the
node corresponding to a flex_bg. If flex_bg_size < meta_bg_size, the
number of new nodes is about 2 * nr_meta_bg (each meta_bg keeps backup
descriptor blocks in its second and last group). The maximum depth of
an rbtree with n nodes is 2 * log2(n + 1); by this calculation, the
extra nodes in this test case do not meaningfully increase the depth
of the rbtree, so there is no major performance overhead.

Maybe we can deal with it in the same way as discussed before?

> Cheers,
>
>     - Ted
>
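P.S. A quick back-of-the-envelope check of the depth bound above (a
userspace sketch I wrote for illustration; 2 * log2(n + 1) is the
standard red-black tree worst-case depth bound, not something measured
from the kernel):

    /*
     * Evaluate the rbtree worst-case depth bound 2 * log2(n + 1)
     * for the node counts measured above. Build with: cc depth.c -lm
     */
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
            unsigned long counts[] = { 41016, 61508 };

            for (int i = 0; i < 2; i++)
                    printf("n = %lu -> max depth <= %.1f\n",
                           counts[i], 2.0 * log2(counts[i] + 1.0));
            return 0;
    }

This prints roughly 30.6 for 41016 nodes and 31.8 for 61508 nodes, so
the extra ~20k nodes cost at most one additional level on a worst-case
lookup path in ext4_inode_block_valid().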