On Thu, 2023-09-21 at 21:27 -0600, Andreas Dilger wrote:
> On Sep 19, 2023, at 11:23 PM, Ritesh Harjani (IBM) <ritesh.list@xxxxxxxxx> wrote:
> > 
> > Andreas Dilger <adilger@xxxxxxxxx> writes:
> > 
> > > On Sep 12, 2023, at 12:59 AM, Bobi Jam <bobijam@xxxxxxxxxxx> wrote:
> > > > 
> > > > With LVM it is possible to create an LV with SSD storage at the
> > > > beginning of the LV and HDD storage at the end of the LV, and use
> > > > that to separate ext4 metadata allocations (that need small random
> > > > IOs) from data allocations (that are better suited for large
> > > > sequential IOs) depending on the type of underlying storage.
> > > > Between 0.5-1.0% of the filesystem capacity would need to be
> > > > high-IOPS storage in order to hold all of the internal metadata.
> > > > 
> > > > This would improve performance for inode and other metadata access,
> > > > such as ls, find, e2fsck, and in general improve file access
> > > > latency, modification, truncate, unlink, transaction commit, etc.
> > > > 
> > > > This patch splits the largest free order group lists and the
> > > > average fragment size lists into two additional lists for
> > > > IOPS/fast storage groups, and does cr 0 / cr 1 group scanning for
> > > > metadata block allocation in the following order:
> > > > 
> > > > if (allocate metadata blocks)
> > > >     if (cr == 0)
> > > >         try to find group in largest free order IOPS group list
> > > >     if (cr == 1)
> > > >         try to find group in fragment size IOPS group list
> > > >     if (above two finds failed)
> > > >         fall through to normal group lists as before
> > > > if (allocate data blocks)
> > > >     try to find group in normal group lists as before
> > > >     if (failed to find group in normal groups && mb_enable_iops_data)
> > > >         try to find group in IOPS groups
> > > > 
> > > > Non-metadata block allocation does not allocate from the IOPS
> > > > groups if the non-IOPS groups are not used up.
> > > 
> > > Hi Ritesh,
> > > I believe this updated version of the patch addresses your original
> > > request that it is possible to allocate blocks from the IOPS block
> > > groups if the non-IOPS groups are full.  This is currently disabled
> > > by default, because in cases where the IOPS groups make up only a
> > > small fraction of the space (e.g. < 1% of total capacity) having
> > > data blocks allocated from this space would not make a big
> > > improvement to the end-user usage of the filesystem, but would
> > > semi-permanently hurt the ability to allocate metadata into the
> > > IOPS groups.
> > > 
> > > We discussed on the ext4 concall various options to make this more
> > > useful (e.g. allowing the root user to allocate from the IOPS
> > > groups if the filesystem is out of space, having a heuristic to
> > > balance IOPS vs. non-IOPS allocations for small files, having a BPF
> > > rule that can specify which UID/GID/procname/filename/etc. can
> > > access this space), but everyone was reluctant to put any complex
> > > policy into the kernel to make any decision, since this eventually
> > > is wrong for some usage.
> > > 
> > > For now, there is just a simple on/off switch, and if this is
> > > enabled the IOPS groups are only used when all of the non-IOPS
> > > groups are full.  Any more complex policy can be deferred to a
> > > separate patch, I think.
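To make the quoted scanning order easier to follow, here is a rough,
standalone C sketch of the policy described in the commit message above.
It is only an illustration: the list and helper names are hypothetical
and do not match the actual fs/ext4/mballoc.c code, and the real
allocator retries with increasing cr values rather than making a single
choice per call.

/*
 * Illustration only: the group-list selection policy from the commit
 * message above.  All names here are made up for the example.
 */
#include <stdbool.h>
#include <stdio.h>

enum group_list {
	IOPS_LARGEST_FREE_ORDER,  /* cr 0 list for IOPS (fast) groups */
	IOPS_AVG_FRAGMENT_SIZE,   /* cr 1 list for IOPS (fast) groups */
	NORMAL_LISTS,             /* the pre-existing non-IOPS lists */
	IOPS_FALLBACK,            /* IOPS groups as last resort for data */
	NO_GROUP,
};

/* Choose which set of group lists to scan for one allocation attempt. */
static enum group_list pick_list(bool metadata, int cr,
				 bool normal_groups_full,
				 bool mb_enable_iops_data)
{
	if (metadata) {
		/* Metadata prefers the IOPS group lists at cr 0 / cr 1;
		 * if those scans fail, the allocator falls through to
		 * the normal lists as before. */
		if (cr == 0)
			return IOPS_LARGEST_FREE_ORDER;
		if (cr == 1)
			return IOPS_AVG_FRAGMENT_SIZE;
		return NORMAL_LISTS;
	}
	/* Data allocations scan the normal lists first ... */
	if (!normal_groups_full)
		return NORMAL_LISTS;
	/* ... and only spill into the IOPS groups when the
	 * mb_enable_iops_data switch is set; otherwise ENOSPC. */
	return mb_enable_iops_data ? IOPS_FALLBACK : NO_GROUP;
}

int main(void)
{
	printf("metadata, cr=0           -> list %d\n",
	       pick_list(true, 0, false, false));
	printf("data, non-IOPS full, on  -> list %d\n",
	       pick_list(false, 0, true, true));
	return 0;
}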
> > 
> > I think having an on/off switch for any user to enable/disable
> > allocation of data from the IOPS groups is good enough for now.
> > At least users with larger IOPS disk space won't run into ENOSPC
> > if they enable this feature.
> > 
> > So, thanks for addressing it. I am going through the series. I will
> > provide my review comments shortly.
> > 
> > Meanwhile, here is my understanding of your use case. Can you please
> > correct/add your inputs to this -
> > 
> > 1. You would like to create a FS with a combination of a high-IOPS
> > storage disk and a non-high-IOPS disk, with the high-IOPS disk space
> > being around 1% of the total disk capacity. (Well, this is obvious
> > as it is stated in the patch description itself.)
> > 
> > 2. Since, of course, ext4 currently does not support multiple
> > drives, we will do this using LVM and place the high-IOPS disk in
> > front.
> > 
> > 3. Then at the creation of the FS we will use a cmd like this:
> > mkfs.ext4 -O sparse_super2 -E packed_meta_blocks,iops=0-1024G /path/to/lvm
> > 
> > Is this understanding right?
> 
> Correct.  Note that for filesystems larger than 256 TiB, when the
> group descriptor table grows larger than the size of group 0, a few
> extra patches that Dongyang developed are needed to fix the
> sparse_super2 option in mke2fs to allow it to pack all metadata at
> the start of the device and move the GDT backup further out.  For
> example, a 2PiB filesystem would use group 9 as the start of the
> first GDT backup.
> 
> I don't expect this will be a problem for most users, and it is
> somewhat independent of the IOPS groups, so it has been kept separate.
> 
> I couldn't find a version of that patch series pushed to the list,
> but it is in our Gerrit (the first one is already pushed):
> 
> https://review.whamcloud.com/52219 ("e2fsck: check all sparse_super backups")
> https://review.whamcloud.com/52273 ("mke2fs: set free blocks accurately ...")
> https://review.whamcloud.com/52274 ("mke2fs: do not set BLOCK_UNINIT ...")
> https://review.whamcloud.com/51295 ("mke2fs: try to pack GDT blocks together")
> 
> (Dongyang, could you please submit the last three patches in this series).

Will post the series when I finish making offline resize work with the
last patch. It needs more work than I expected, e.g. when growing the
filesystem, since we want the GDT blocks packed together, the GDT could
grow beyond group 0 into backup_bgs[0], which means backup_bgs[0] needs
to be moved.

Cheers,
Dongyang

> 
> > I have a few follow-up queries as well -
> > 
> > 1. What about thin-provisioned LVM? IIUC, the LV space is
> > pre-allocated there, but the physical allocation happens at the
> > time of write, and it might so happen that both data/metadata
> > allocations will start to sit in IOPS/non-IOPS groups randomly?
> 
> I think the underlying storage type would be controlled by LVM in
> that case.  I don't know what kind of policy options are available
> with thin-provisioned LVs, but my first thought is "don't do that
> with IOPS groups" since there is no way to know/control what the
> underlying storage is.
> 
> > 2. Even in the case of traditional LVM, the mapping of the physical
> > blocks can be changed during an overwrite or discard sort of use
> > case, right? So do we have a guarantee of the metadata always
> > sitting in high-IOPS groups after the filesystem ages?
> 
> No, I don't think that would happen under normal usage.
> The PV/LV maps are static after the LV is created, so overwriting a
> block at runtime with ext4 would give the same type of storage as at
> mke2fs time.
> 
> The exception would be with LVM snapshots, in which case I'd suggest
> using flash PV space for the snapshot (assuming there is enough) to
> avoid overhead when blocks are COW'd.  Even so, AFAIK the chunks
> written to the snapshot LV are the *old* blocks and the current
> blocks are kept on the main PV, so the IOPS groups would still work
> properly in this case.
> 
> > 3. With these mkfs options to utilize this feature, we do lose the
> > ability to resize, right? I am guessing resize will be disabled by
> > sparse_super2 and/or packed_meta_blocks itself?
> 
> Online resize was disabled in commit v5.13-rc5-20-gb1489186cc83
> "ext4: add check to prevent attempting to resize an fs with
> sparse_super2".  However, I think that was a misunderstanding.  It
> looks like online resize was getting confused by sparse_super2
> together with resize_inode, because there are only 2 backup group
> descriptor tables, and resize_inode expects there to be a bunch more
> backups.  I suspect resize would "work" if resize_inode was disabled
> completely.
> 
> The drawback is that online resize would almost immediately fall back
> to meta_bg (as it does for > 16TiB filesystems anyway), and spew the
> GDT blocks and other metadata across the non-IOPS storage device.
> This would "work" (give you a larger filesystem), but is not ideal.
> 
> I think the long-term solution for this would be to fix the
> interaction with sparse_super2, so that the resize_inode could
> reserve GDT blocks on the flash storage for the primary GDT and
> backup_bgs[0], and also backup_bgs[1] would be kept in a group < 2M
> so that it does not need to store 64-bit block numbers.  That would
> actually allow resize_inode to work with > 16TiB filesystems and
> continue to avoid using meta_bg.
> 
> For the rest of the static metadata (bitmaps, inode tables) it would
> be possible to add more IOPS groups at the end of the current
> filesystem and add a "resize2fs -E iops=x-yG" option to have it
> allocate the static metadata from any of the IOPS groups.  That said,
> it has been a while since I looked at the online resize code in the
> kernel, so I'm not sure whether it is resize2fs or ext4 that is
> making these decisions anymore.
> 
> Cheers, Andreas
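As a side note on the "mkfs.ext4 -O sparse_super2 -E
packed_meta_blocks,iops=0-1024G" example quoted above, the iops byte
range simply covers a run of block groups at the start of the device.
Here is a standalone sketch of that arithmetic, assuming 4 KiB blocks
and 32768 blocks per group (the common ext4 defaults); the helper name
is made up for the example and is not from the actual patches.

/*
 * Illustration only: mapping an "iops=0-1024G" style byte range onto
 * block group numbers, assuming 4 KiB blocks and 128 MiB per group.
 */
#include <stdint.h>
#include <stdio.h>

#define BLOCK_SIZE        4096ULL
#define BLOCKS_PER_GROUP  32768ULL

/* Convert a byte range [start, end) into the block groups it covers. */
static void iops_range_to_groups(uint64_t start, uint64_t end,
				 uint64_t *first, uint64_t *last)
{
	uint64_t group_bytes = BLOCK_SIZE * BLOCKS_PER_GROUP;

	*first = start / group_bytes;
	*last = (end - 1) / group_bytes;
}

int main(void)
{
	uint64_t first, last;

	/* "iops=0-1024G": the first 1 TiB of the LV is flash. */
	iops_range_to_groups(0, 1024ULL << 30, &first, &last);
	/* Prints "IOPS groups: 0-8191" with the defaults above. */
	printf("IOPS groups: %llu-%llu\n",
	       (unsigned long long)first, (unsigned long long)last);
	return 0;
}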
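On the meta_bg fallback for > 16TiB filesystems mentioned above: as I
understand it, the resize inode reserves GDT blocks through the old
indirect-block scheme, whose block pointers are 32-bit, so it cannot
reference reserved GDT blocks located beyond 2^32 filesystem blocks.
A trivial standalone check of that boundary, assuming 4 KiB blocks:

/*
 * Illustration only: why resize_inode tops out around 16 TiB with
 * 4 KiB blocks, which is where online resize falls back to meta_bg.
 */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint64_t block_size = 4096;           /* common ext4 default */
	uint64_t max_blocks = 1ULL << 32;     /* 32-bit block number limit */
	uint64_t limit = max_blocks * block_size;

	/* Prints "resize_inode limit: 16 TiB". */
	printf("resize_inode limit: %llu TiB\n",
	       (unsigned long long)(limit >> 40));
	return 0;
}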