On Wed, Sep 27, 2023 at 08:18:28AM +0530, Ritesh Harjani wrote: > Andreas Dilger <adilger@xxxxxxxxx> writes: > > > On Sep 25, 2023, at 9:35 PM, Ritesh Harjani (IBM) <ritesh.list@xxxxxxxxx> wrote: > >> > >> Andreas Dilger <adilger@xxxxxxxxx> writes: > >> > >>> On Sep 19, 2023, at 11:39 PM, Ritesh Harjani (IBM) <ritesh.list@xxxxxxxxx> wrote: > >>>> > >>>> Bobi Jam <bobijam@xxxxxxxxxxx> writes: > >>>> > >>>>> Non-metadata block allocation does not allocate from the IOPS groups > >>>>> if non-IOPS groups are not used up. > >>>> > >>>>> Add for mke2fs an option to mark which blocks are in the IOPS region > >>>>> of storage at format time: > >>>>> > >>>>> -E iops=0-1024G,4096-8192G > >>>>> > >>>>> so the ext4 mballoc code can then use the EXT4_BG_IOPS flag in the > >>>>> group descriptors to decide which groups to allocate dynamic > >>>>> filesystem metadata. > >>>>> > >>>>> Signed-off-by: Bobi Jam <bobijam@xxxxxxxxxxx > >>>>> > >>>>> -- > >>>>> v2->v3: add sysfs mb_enable_iops_data to enable data block allocation > >>>>> from IOPS groups. > >>>>> v1->v2: for metadata block allocation, search in IOPS list then normal > >>>>> list. > >>>>> --- > >>>>> > >> > >>>>> @@ -1009,11 +1108,37 @@ static void ext4_mb_choose_next_group(struct ext4_allocation_context *ac, > >>>>> return; > >>>>> } > >>>>> > >>>>> + if (alloc_metadata && sbi->s_es->s_flags & EXT2_FLAGS_HAS_IOPS) { > >>>>> + if (*new_cr == 0) > >>>>> + ret = ext4_mb_choose_next_iops_group_cr0(ac, group); > >>>>> + if (!ret && *new_cr < 2) > >>>>> + ret = ext4_mb_choose_next_iops_group_cr1(ac, group); > >>> > >>> It looks like this patch is a bit stale since Ojaswin's renaming of the > >>> cr0/cr1 phases to "p2_aligned" and "goal_fast" and "best_avail" names. > >>> > >> > >> Yup. We should rebase our development effort to latest tree. > >> > >>>> This is a bit confusing here. Say if *new_cr = 0 fails, then we return > >>>> ret = false and fallback to choosing xx_iops_group_cr1(). And say if we > >>>> were able to find a group which satisfies this allocation request we > >>>> return. But the caller never knows that we allocated using cr1 and not > >>>> cr0. Because we never updated *new_cr inside xx_iops_group_crX() > >>> > >>> I guess it is a bit messy, since updating new_cr might interfere with the > >>> use of new_cr in the fallthrough to the non-IOPS groups below? In the > >>> "1% IOPS groups" case, doing this extra scan for metadata blocks should > >>> be very fast, since the metadata allocations are almost always one block > >>> (excluding large xattrs), so the only time this would fail is if no IOPS > >>> blocks are free, in which case it is fast since the group lists are empty. > >>> > >> > >> What I was suggesting was we never update *new_cr value when we were > >> able to find a suitable group. That will confuse the two for loops we > >> have in the caller. We might as well just update the *new_cr value once > >> we have identified a suitable group in cr0 or cr1 before returning. > >> > >>> We _could_ have a separate (in effect) cr0_3 and cr0_6 phases before the > >>> non-IOPS group allocation starts to be able to distinguish these cases > >>> (i.e. skip IOPS group scans if they are full) and the fallthrough search > >>> is also having trouble to find a single free block for the metadata, but > >>> I think that is pretty unlikely. > > > > I'm not clear which option you prefer here? > > - update *new_cr based on the scan in the IOPS groups (in which case the > > fallback to non-IOPS groups would start at a higher CR than necessary) > > This obviously won't help, since it will even further slowdown the metdata > allocations when the iops group becomes full. > > > > - add new phases *before* CR_POWER2_ALIGNED, e.g. "CR_IOPS_POWER2_ALIGNED" > > "CR_IOPS_CR_GOAL_LEN_FAST" and "CR_IOPS_CR_GOAL_LEN_SLOW" to do either > > a fast scan or a slow scan on the IOPS groups, and then fall back to > > non-IOPS groups? They would be skipped if no IOPS groups exist. > > > > The second option allows preserving the CR value across the loops, in case > > the group returned is not suitable for some reason, without confusing whether > > the lookup is being done for IOPS groups or not. Also, it makes sense to > > have a "SLOW" pass on the IOPS groups, instead of just "FAST", to ensure > > that all of the IOPS groups have been tried. This should be very rare > > since most allocations (excluding xattrs) are only one block long. > > So how about we add a scan_state to allocation criteria for iops v/s > non-iops scanning. This way we don't add any new crs, but mballoc now > does a stateful scan rather than stateless scan being done today. > (because we now have two different type of groups to select, iops v/s > non-iops). > > We can add a function like.... > ac->scan_state = ext4_mb_get_scan_state_policy() > > ...before starting the scan. > > For metadata allocations it will return IOPS_SCAN and for data > allocations it can return NON_IOPS_SCAN (more policy decisions can be > added later). This way we don't have to add extra CRs to the enum > criteria. The allocation still happens using the same existing > allocation criteria. If say the allocation using IOPS_SCAN for metadata > fails, we can switch to NON_IOPS_SCAN & restart the scan from cr0 again. > > We might also need to maintain some extra state or variable which tells us > which previous scan_state failed to find a suitable group. But I > will leave that as implementation details. It might be useful for data > allocations where we might have a policy to start with either IOPS_SCAN > or NON_IOPS_SCAN first. > > I believe this will be a simpler change rather than adding more crs. > Problem in adding more crs at the beginning could be when we have to > roll over from NON_IOPS criteria to IOPS allocation criteria. > > Thoughts? > Hi Ritesh, Andreas. So I'm not sure if I'm convinced with the way we are handling CRs in current implementation. This changes the behavior of how other ext4_mb_choose_*_cr0/1() functions work that is by updating the cr if we fail to find a suitable group. The new functions instead return true or false without updating the CR which, as Ritesh pointed out, leaves some unhandled edge cases that could cause silent behavior changes. I like the original way of handling CRs where the logic to update the CR is mostly in the ext4_mb_choose_*_cr0/1() functions and I'd rather not move it in the ext4_mb_choose_next_group() like in this patch. That being said, @Andreas, my thoughts on the 2 ways you've proposed: > > - update *new_cr based on the scan in the IOPS groups (in which case the > > fallback to non-IOPS groups would start at a higher CR than necessary) > 1 So the way I see it, each CR represents the algorithm to find a good group rather than the data structures we use for them. Hence, I feel that since the algo of CR0/1 remains the same whether its IOPS or not, we should not add a new CR for IOPS. > > - update *new_cr based on the scan in the IOPS groups (in which case the > > fallback to non-IOPS groups would start at a higher CR than necessary) 2. I would actually prefer this way of doing it. I think this is also somewhat similar to how we were doing it in PATCH v1. We should keep using the existing ext4_mb_choose_*_cr0/1() functions but update the CR based on if its a metadata allocation or not. I actually like Ritesh's proposal of using a scan state which will then lead to our psuedo-code looking something like: ext4_mb_choose_next_group_cr0() new_cr = 1 ext4_mb_choose_next_group_cr1() new_cr = 2 /* cr 2 search in outer loop */ if (not found in cr 2) { if (ac->scan_state == IOPS_SCAN) { ac->scan_state == NON_IOPS_SCAN; new_cr = 0; } else { ac->scan_state == IOPS_SCAN; new_cr = 0; } goto repeat; } With the patch I was working on to shift CR2 to order list [1] and then have a ext4_mb_choose_next_group_cr2() functions instead of using the linear loop, we can then maintain an IOPS lists like we do for cr0/1 here and further simplify the above psuedo code for CR2. [1] https://lore.kernel.org/linux-ext4/cover.1693911548.git.ojaswin@xxxxxxxxxxxxx/ (In the above discussion i used the older cr0/1/2 notation for simplicity) Also, I didn't completely understand this particular statement: > in which case the fallback to non-IOPS groups would start at a higher > CR than necessary I think we can always reset the cr to 0 when we reach the end in IOPS allocation right like in the psuedo code above, or am I missing something? Regards, ojaswin > > > > Ojaswin, do you have any input here? You've been doing somework on the > > mballoc code recently, and it would be good to get this aligned with what > > you are doing/planning. > > > >>>>> if (*new_cr == 0) { > >>>>> ext4_mb_choose_next_group_cr0(ac, new_cr, group, ngroups); > >>>>> } else if (*new_cr == 1) { > >>>>> ext4_mb_choose_next_group_cr1(ac, new_cr, group, ngroups); > >>>>> } else { > >>>>> + /* > >>>>> + * Cannot get data group from slow storage, try IOPS storage > >>>>> + */ > >>>>> + if (sbi->s_es->s_flags & EXT2_FLAGS_HAS_IOPS && > >>>>> + !alloc_metadata && sbi->s_mb_enable_iops_data && > >>>>> + *new_cr == 3) { > >>>>> + if (ac->ac_2order) > >>>>> + ret = ext4_mb_choose_next_iops_group_cr0(ac, > >>>>> + group); > >>>>> + if (!ret) > >>>>> + ext4_mb_choose_next_iops_group_cr1(ac, group); > >>>>> + } > >>>> > >>>> We might never come here in this else case because > >>>> should_optimize_scan() which we check in the beginning of this function > >>>> will return 0 and we will chose a next linear group for CR >= 2. > >>> > >>> Hmm, you are right. Just off the bottom of this hunk is a "WARN_ON(1)" > >>> that this code block should never be entered. > >> > >> right. > >> > >>> > >>> Really, the fallback to IOPS groups for regular files should only happen > >>> in case of if *new_cr >= CR_GOAL_ANY_FREE. We don't want "normal" block > >>> allocation to fill up the IOPS groups just because the filesystem is > >>> fragmented and low on space, but only if out of non-IOPS space. > >>> > >> > >> Sure, I have added some comments later on this policy part... > >> > >>>>> > >>>>> @@ -2498,6 +2629,10 @@ static int ext4_mb_good_group_nolock(struct ext4_allocation_context *ac, > >>>>> goto out; > >>>>> if (unlikely(EXT4_MB_GRP_BBITMAP_CORRUPT(grp))) > >>>>> goto out; > >>>>> + if (sbi->s_es->s_flags & EXT2_FLAGS_HAS_IOPS && > >>>>> + (ac->ac_flags & EXT4_MB_HINT_DATA) && EXT4_MB_GRP_TEST_IOPS(grp) && > >>>>> + !sbi->s_mb_enable_iops_data) > >>>>> + goto out; > >>>> > >>>> since we already have a grp information here. Then checking for s_flags > >>>> and is redundant here right? > >>> > >>> This is intended to stop regular data allocations in IOPS groups that are > >>> found by next_linear_group(). > >> > >> What I meant is EXT4_MB_GRP_TEST_IOPS(grp), will only be true when we > >> have sbi->s_es->s_flags & EXT2_FLAGS_HAS_IOPS as true right? > >> So do we still need to keep both conditions here? > > > > Well, EXT2_FLAGS_HAS_IOPS determines whether this functionality is enabled > > or not, while the GRP_TEST_IOPS check is for the individual group. So if > > the feature is totally disabled (no EXT2_FLAGS_HAS_IOPS) then the per-group > > bit should be ignored. > > > > Ok, thanks for clarification. > > >>> With the change to allow regular data to be allocated in IOPS groups, > >>> there might need to be an extra check added here to see what allocation > >>> phase this is. Should we add *higher* CR_ phases above CR_ANY_FREE to > >>> allow distinguishing between IOPS->regular fallback and regular->IOPS > >>> fallback? > >>> > >>> > >>> It seems like most of the complexity/issues here have crept in since the > >>> addition of the fallback for regular data allocations in IOPS groups... > >>> I'm not sure if we want to defer that functionality to a separate patch, > >>> or if you have any suggestions on how to clarify this without adding a > >>> lot of complexity? > >> > >> I agree that the separation is not clear. I think it would have been > >> better if we would have split that functionality in 2 separate patches. > >> The 1st patch just adds the functionality that you intended i.e. > >> > >> 1. metadata allocations should happen via IOPS group and only if there > >> is no space left in IOPS group it will fallback to non-IOPS group. > >> This 1st patch also have data allocations coming only from non-iops group. > >> > >> and the second patch can have details of... > >> 2. adding a knob which can allow users to fallback data allocations to IOPS group too. > > > > Sure, I would be happy with that. My main goal is to reserve these flags > > and get this feature working in a basic fashion, and then more elaborate > > policy decisions can be added once there is a demand for it. > > > > Sure make sense. > > >> If you think you would like to defer the second patch to later to avoid > >> the complexity, I am fine with that too. The reason is we should still > >> think upon what should be the fallback critera for that. Should we do it > >> when we absolutely have no space in non-IOPS group (cr >= CR_ANY_FREE) > >> or is it ok to fallback even earlier. I guess it will also depend upon > >> the information of how many groups are IOPS v/s non-IOPS. > >> > >> I don't think we are keeping that information anywhere on disk right? > >> (no. of IOPS v/s non-IOPS groups). That means we might have to do that > >> at runtime. Once we have that information, the filesystem can better > >> decide on when should the fallback happen. > > > > Mount already scans all of the groups at mount to set the IOPS flags in > > the in-memory group_info, so the count of IOPS groups vs. non-IOPS could > > be easily determined, if there is a use for this. > > > > Yup. > > > >> So I agree, we need to more discussion and think it through. I guess Ted > >> was also suggesting the same on the call. Feel free to defer the > >> fallback of data allocations to non-IOPS group for a later time (If > >> we don't have a strong objection from others on this). > > > > Great, thanks for your review and feedback. > > Thanks for helping with the queries. > > -ritesh