Andreas Dilger <adilger@xxxxxxxxx> writes:

> On Aug 3, 2023, at 6:10 AM, Ritesh Harjani (IBM) <ritesh.list@xxxxxxxxx> wrote:
>>
>> Bobi Jam <bobijam@xxxxxxxxxxx> writes:
>>
>>> With LVM it is possible to create an LV with SSD storage at the
>>> beginning of the LV and HDD storage at the end of the LV, and use that
>>> to separate ext4 metadata allocations (that need small random IOs)
>>> from data allocations (that are better suited for large sequential
>>> IOs) depending on the type of underlying storage.  Between 0.5-1.0% of
>>> the filesystem capacity would need to be high-IOPS storage in order to
>>> hold all of the internal metadata.
>>>
>>> This would improve performance for inode and other metadata access,
>>> such as ls, find, e2fsck, and in general improve file access latency,
>>> modification, truncate, unlink, transaction commit, etc.
>>>
>>> This patch splits the largest free order group lists and the average
>>> fragment size lists into two additional lists for IOPS/fast storage
>>> groups, and cr 0 / cr 1 group scanning for metadata block allocation
>>> proceeds in the following order:
>>>
>>> cr 0 on largest free order IOPS group list
>>> cr 1 on average fragment size IOPS group list
>>> cr 0 on largest free order non-IOPS group list
>>> cr 1 on average fragment size non-IOPS group list
>>> cr >= 2 perform the linear search as before
>
> Hi Ritesh,
> thanks for the review and the discussion about the patch.
>
>> Yes. The implementation looks straightforward to me.
>>
>
>>> Non-metadata block allocation does not allocate from the IOPS groups.
>>>
>>> Add for mke2fs an option to mark which blocks are in the IOPS region
>>> of storage at format time:
>>>
>>> -E iops=0-1024G,4096-8192G
>>
>
>> However, a few things to discuss here are -
>
> As Ted requested on the call, this should be done as two separate calls
> to the allocator, rather than embedding the policy in mballoc group
> selection itself.  Presumably this would be in ext4_mb_new_blocks()
> calling ext4_mb_regular_allocator() twice with different allocation
> flags (first with EXT4_MB_HINT_METADATA, then without, though I don't
> actually see this used anywhere in the code before this patch?)
>
> Metadata allocations should try only IOPS groups on the first call,
> but would go through all allocation phases.  If IOPS allocation fails,
> then the allocator should do a full second pass to allocate from the
> non-IOPS groups.  Non-metadata allocations would only allocate from
> non-IOPS groups.
>
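Ok, that makes sense. Just to confirm I understand the flow you are
describing, it would roughly look like the below inside ext4_mb_new_blocks()
(only a sketch of my understanding, not the actual patch; ac and errp are as
in ext4_mb_new_blocks(), and the EXT4_MB_IOPS_ONLY / EXT4_MB_NO_IOPS flag
names are made up here just for illustration):

	if (ac->ac_flags & EXT4_MB_HINT_METADATA) {
		/* pass 1: restrict the scan to IOPS groups only */
		ac->ac_flags |= EXT4_MB_IOPS_ONLY;	/* made-up flag */
		*errp = ext4_mb_regular_allocator(ac);
		ac->ac_flags &= ~EXT4_MB_IOPS_ONLY;
	}
	if (ac->ac_status != AC_STATUS_FOUND) {
		/* pass 2: metadata fallback, and the only pass for data */
		ac->ac_flags |= EXT4_MB_NO_IOPS;	/* made-up flag */
		*errp = ext4_mb_regular_allocator(ac);
	}

That would also keep the policy decision out of the group selection
heuristics themselves, which does look cleaner.
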
>> 1. What happens when the hdd space for data gets fully exhausted? AFAICS,
>> the allocation for data blocks will still succeed, however we won't be
>> able to make use of optimized scanning anymore, because we search within
>> the iops lists only when EXT4_MB_HINT_METADATA is set in ac->ac_flags.
>
> The intention for our usage is that data allocations should *only* come
> from the HDD region of the device, and *not* from the IOPS (flash) region
> of the device.  The IOPS region will be comparatively small (0.5-1.0% of
> the total device size) so using or not using this space will be mostly
> meaningless to the overall filesystem usage, especially with a 1-5%
> reserved blocks percentage that is the default for new filesystems.
>

Yes, but when we give this functionality to non-enterprise users, everyone
would like to take advantage of a faster performing ext4 using 1 ssd and a
few hdds, or a smaller spare ssd and larger hdds. Then it could be that the
space of the iops region might not strictly be less than 1-2% and could be
anywhere between 10-50% ;)

Shouldn't we still support this class of use case as well?

^^^ So if the HDD gets full, then the allocation should fall back to the ssd
for data blocks, no? Or we can have a policy knob, i.e.
fallback_data_to_iops_region_thresh. So if the free space %age in the iops
region is above 1% (can be changed by the user) then the data allocations
can fall back to the iops region if we are unable to allocate data blocks
from the hdd region.

    echo %age_threshold > fallback_data_to_iops_region_thresh    (default 1%)

Fall back data allocations to the iops region as long as we have the free
space %age of the iops region above %age_threshold.
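To make the knob idea a bit more concrete, roughly something like the below
is what I am imagining (again only a sketch to illustrate the policy;
s_mb_data_iops_thresh, ext4_iops_free_percent() and EXT4_MB_NO_IOPS are
made-up names, none of this is in the patch):

/*
 * s_mb_data_iops_thresh would be the sysfs knob (default 1%), and
 * ext4_iops_free_percent() a helper returning the aggregate free-space
 * percentage of all IOPS groups.
 */
static bool ext4_mb_data_may_use_iops(struct ext4_sb_info *sbi)
{
	return ext4_iops_free_percent(sbi) > sbi->s_mb_data_iops_thresh;
}

and then in the data allocation path, before returning ENOSPC:

	if (!(ac->ac_flags & EXT4_MB_HINT_METADATA) &&
	    ext4_mb_data_may_use_iops(EXT4_SB(ac->ac_sb))) {
		/* retry the scan, this time including the IOPS groups */
		ac->ac_flags &= ~EXT4_MB_NO_IOPS;	/* made-up flag from above */
		*errp = ext4_mb_regular_allocator(ac);
	}

That way the default behaviour stays exactly what you describe (data never
touches the IOPS region), while users with a larger iops share can opt in
to using it as overflow space instead of hitting ENOSPC while the ssd still
has plenty of free space.
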
> As you mentioned on the call, it seems this is a defect in the current
> patch, that non-metadata allocations may eventually fall back to scan
> all block groups for free space including IOPS groups.  They need to
> explicitly skip groups that have the IOPS flags set.
>
>> 2. Similarly, what happens when the ssd space for metadata gets full?
>> In this case we keep falling back to cr2 for allocation and we don't
>> utilize optimize_scanning to find the block groups from hdd space to
>> allocate from.
>
> In the case when the IOPS groups are full then the metadata allocations
> should fall back to using non-IOPS groups.  That avoids ENOSPC when the
> metadata space is accidentally formatted too small, or unexpected usage
> such as large xattrs or many directories are consuming more IOPS space.
>
>> 3. So it seems after a period of time, these iops lists can have block
>> groups belonging to different ssds. Could this cause the metadata
>> allocation of related inodes to come from different ssds?
>> Will this be optimal? Checking on this...
>> ...On checking further on this, we start with a goal group and we
>> at least scan s_mb_max_linear_groups (4) linearly. So it's unlikely that
>> we frequently allocate metadata blocks from different SSDs.
>
> In our usage there will typically be only a single IOPS region at the
> start of the device, but the ability to allow multiple IOPS regions was
> added for completeness and flexibility in the future (e.g. resize of
> filesystem).

I am interested in knowing what you think the challenges will be in
supporting resize with hybrid devices. Like, if someone would like to add an
additional ssd and do a resize, do you think all later metadata allocations
can be fulfilled from this iops region? And what happens when someone adds
hdds to existing ssds? I guess adding an hdd followed by a resize operation
can still allocate the bgdt, block/inode bitmaps, inode tables, etc. for
these block groups to sit on the resized hdd, right? Are there any other
challenges as well for such a use case?

> In our case, the IOPS region would itself be RAIDed, so "different SSDs"
> is not really a concern.
>
>> 4. Ok, looking into this, do we even require the iops lists for metadata
>> allocations? Do we allocate more than 1 block for metadata? If not, then
>> maintaining these iops lists for metadata allocation isn't really
>> helpful. On the other hand, it does make sense to maintain it when we
>> allow data allocations from these ssds when the hdds get full.
>
> I don't think we *need* to use the same mballoc code for IOPS allocation
> in most cases, though large xattr inode allocations should also be using
> the IOPS groups for allocating blocks, and these might be up to 64KB.
> I don't think that is actually implemented properly in this patch yet.
>
> Also, the mballoc list/array makes it easy to find groups with free space
> in a full filesystem instead of having to scan for them, even if we
> don't need the full "allocate order-N" functionality.  Having one list
> of free groups or order-N lists doesn't make it more expensive (and it
> actually improves scalability to have multiple list heads).
>
> One of the future enhancements might be to allow small files (of some
> configurable size) to also be allocated from the IOPS groups, so it is
> probably easier IMHO to just stick with the same allocator for both.
>
>> 5. Did we run any benchmarks with this yet? What kind of gains are we
>> looking for? Do we have any numbers for this?
>
> We're working on that.  I just wanted to get the initial patches out for
> review sooner rather than later, both to get feedback on implementation
> (like this, thanks), and also to reserve the EXT4_BG_IOPS field so it
> doesn't get used in a conflicting manner.
>
>> 6. I couldn't help but start to think of...
>> Should there also be a provision for the user to pass hot/cold data
>> types which we can use as a hint within the filesystem to allocate from
>> ssd v/s hdd? Does it even make sense to think in this direction?
>
> Yes, I also had the same idea, but then left it out of my email to avoid
> getting distracted from the initial goal.  There are a number of possible
> improvements that could be done with a mechanism like this:
> - have fast/slow regions within a single HDD (i.e. the last 20% of the
>   spindle is in the "slow" region due to reduced linear velocity/bandwidth
>   on inner tracks) to avoid using the slow region unless the fast region
>   is (mostly) full
> - have several regions across an HDD to *intentionally* allocate some
>   extents in the "slow" groups to reduce *peak* bandwidth but keep
>   *average* bandwidth higher as the disk becomes more full, since there
>   would still be free space in the faster groups.

Interesting!

> Cheers, Andreas
>

Thanks
-ritesh