Re: [PATCH v3] ext4: optimize metadata allocation for hybrid LUNs

Ritesh Harjani (IBM) <ritesh.list@xxxxxxxxx> · Wed, 20 Sep 2023 10:53:14 +0530

Andreas Dilger <adilger@xxxxxxxxx> writes:

> On Sep 12, 2023, at 12:59 AM, Bobi Jam <bobijam@xxxxxxxxxxx> wrote:
>> 
>> With LVM it is possible to create an LV with SSD storage at the
>> beginning of the LV and HDD storage at the end of the LV, and use that
>> to separate ext4 metadata allocations (that need small random IOs)
>> from data allocations (that are better suited for large sequential
>> IOs) depending on the type of underlying storage.  Between 0.5-1.0% of
>> the filesystem capacity would need to be high-IOPS storage in order to
>> hold all of the internal metadata.
>> 
>> This would improve performance for inode and other metadata access,
>> such as ls, find, e2fsck, and in general improve file access latency,
>> modification, truncate, unlink, transaction commit, etc.
>> 
>> This patch splits the largest free order group lists and the average
>> fragment size lists into two additional lists for IOPS/fast storage
>> groups, and performs cr 0 / cr 1 group scanning for metadata block
>> allocation in the following order:
>> 
>> if (allocate metadata blocks)
>>      if (cr == 0)
>>              try to find group in largest free order IOPS group list
>>      if (cr == 1)
>>              try to find group in fragment size IOPS group list
>>      if (above two find failed)
>>              fall through normal group lists as before
>> if (allocate data blocks)
>>      try to find group in normal group lists as before
>>      if (failed to find group in normal group && mb_enable_iops_data)
>>              try to find group in IOPS groups
>> 
>> Non-metadata block allocation does not allocate from the IOPS groups
>> if non-IOPS groups are not used up.
>
> Hi Ritesh,
> I believe this updated version of the patch addresses your original
> request that it is possible to allocate blocks from the IOPS block
> groups if the non-IOPS groups are full.  This is currently disabled
> by default, because in cases where the IOPS groups make up only a
> small fraction of the space (e.g. < 1% of total capacity) having data
> blocks allocated from this space would not make a big improvement
> to the end-user usage of the filesystem, but would semi-permanently
> hurt the ability to allocate metadata into the IOPS groups.
>
> We discussed on the ext4 concall various options to make this more
> useful (e.g. allowing the root user to allocate from the IOPS groups
> if the filesystem is out of space, having a heuristic to balance IOPS
> vs. non-IOPS allocations for small files, having a BPF rule that can
> specify which UID/GID/procname/filename/etc. can access this space),
> but everyone was reluctant to put any complex policy into the kernel
> to make any decision, since any such policy is eventually wrong for
> some usage.
>
> For now, there is just a simple on/off switch, and if this is enabled
> the IOPS groups are only used when all of the non-IOPS groups are full.
> Any more complex policy can be deferred to a separate patch, I think.

I think having an on/off switch that lets a user enable/disable allocation
of data from iops groups is good enough for now. At least users with
larger iops disk space won't run into ENOSPC if they enable this feature.
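
To check my reading of the fallback order, here is a toy model in Python.
It is not kernel code: every name in it is made up for illustration, it
ignores the cr 0 / cr 1 distinction, and it treats each group as a plain
free-block counter.

```python
# Toy model only: names below are illustrative, not ext4 identifiers.

def choose_group(groups, needed, is_metadata, enable_iops_data=False):
    """Return the id of the first group that can satisfy the request, or None."""
    iops = [g for g in groups if g["iops"]]
    normal = [g for g in groups if not g["iops"]]

    if is_metadata:
        passes = [iops, normal]      # IOPS lists first, then fall through
    elif enable_iops_data:
        passes = [normal, iops]      # IOPS groups only as a last resort
    else:
        passes = [normal]            # default: data never touches IOPS groups

    for candidates in passes:
        for g in candidates:
            if g["free"] >= needed:
                return g["id"]
    return None                      # caller would see ENOSPC


# Non-IOPS group 1 is full, IOPS group 0 still has space.
groups = [
    {"id": 0, "iops": True, "free": 100},
    {"id": 1, "iops": False, "free": 0},
]
print(choose_group(groups, 10, is_metadata=True))    # 0
print(choose_group(groups, 10, is_metadata=False))   # None
print(choose_group(groups, 10, is_metadata=False, enable_iops_data=True))  # 0
```

The second call is the ENOSPC case I was worried about earlier; the third
shows the switch avoiding it.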

So, thanks for addressing it. I am going through the series. I will provide
my review comments shortly. 

Meanwhile, here is my understanding of your usecase. Can you please
correct it or add your inputs to this -

1. You would like to create a FS with a combination of a high iops storage
disk and a non-high iops disk, with the high iops disk space being around
1% of the total disk capacity (well, this is obvious as it is stated in
the patch description itself).

2. Since ext4 currently does not support multiple drives, we will use
LVM and place the high iops disk in front.

3. Then at the creation of the FS we will use a command like this:
   mkfs.ext4 -O sparse_super2 -E packed_meta_blocks,iops=0-1024G /path/to/lvm

Is this understanding right? 
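
For a sense of scale, here is a rough calculation of how many block groups
such a range would cover. It rests on assumptions of mine, not on anything
stated in the patch: the default ext4 geometry (4 KiB blocks, 32768 blocks
per group, so 128 MiB per group), "G" meaning GiB, and the iops= range
being a byte range from the start of the device with an exclusive end.

```python
# Assumed default geometry; a filesystem made with other options will differ.
BLOCK_SIZE = 4096
BLOCKS_PER_GROUP = 8 * BLOCK_SIZE            # one bitmap block covers 8 * block_size blocks
GROUP_BYTES = BLOCK_SIZE * BLOCKS_PER_GROUP  # 128 MiB

GiB = 1 << 30


def groups_in_range(start_bytes, end_bytes):
    """Return (first, last) block group numbers touched by [start, end)."""
    first = start_bytes // GROUP_BYTES
    last = (end_bytes - 1) // GROUP_BYTES
    return first, last


first, last = groups_in_range(0, 1024 * GiB)
print(first, last)           # 0 8191
print(last - first + 1)      # 8192
```

So under those assumptions the example range marks groups 0 to 8191 as
IOPS groups.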

I have a few followup queries as well -

1. What about thin-provisioned LVM? IIUC, the space there is provisioned
up front, but the physical allocation happens at the time of write, so it
might happen that both data and metadata allocations start to land in
iops/non-iops groups randomly?

2. Even in the case of traditional LVM, the mapping of the physical blocks
can change during an overwrite or discard sort of usecase, right? So
do we have a guarantee that the metadata always sits on high iops groups
after the filesystem ages?

3. With these mkfs options to utilize this feature, we do lose the
ability to resize, right? I am guessing resize will be disabled with
sparse_super2 and/or packed_meta_blocks itself?

Thanks!
-ritesh