On Thu, 2023-09-21 at 21:27 -0600, Andreas Dilger wrote:
> On Sep 19, 2023, at 11:23 PM, Ritesh Harjani (IBM) <ritesh.list@xxxxxxxxx> wrote:
> > 
> > Andreas Dilger <adilger@xxxxxxxxx> writes:
> > 
> > > On Sep 12, 2023, at 12:59 AM, Bobi Jam <bobijam@xxxxxxxxxxx> wrote:
> > > > 
> > > > With LVM it is possible to create an LV with SSD storage at the
> > > > beginning of the LV and HDD storage at the end of the LV, and use
> > > > that to separate ext4 metadata allocations (that need small random
> > > > IOs) from data allocations (that are better suited for large
> > > > sequential IOs) depending on the type of underlying storage.
> > > > Between 0.5-1.0% of the filesystem capacity would need to be
> > > > high-IOPS storage in order to hold all of the internal metadata.
> > > > 
> > > > This would improve performance for inode and other metadata access,
> > > > such as ls, find, e2fsck, and in general improve file access
> > > > latency, modification, truncate, unlink, transaction commit, etc.
> > > > 
> > > > This patch splits the largest free order group lists and the
> > > > average fragment size lists into two additional lists for
> > > > IOPS/fast storage groups, and does cr 0 / cr 1 group scanning for
> > > > metadata block allocation in the following order:
> > > > 
> > > > if (allocate metadata blocks)
> > > >     if (cr == 0)
> > > >         try to find group in largest free order IOPS group list
> > > >     if (cr == 1)
> > > >         try to find group in fragment size IOPS group list
> > > >     if (above two finds failed)
> > > >         fall through to normal group lists as before
> > > > if (allocate data blocks)
> > > >     try to find group in normal group lists as before
> > > >     if (failed to find group in normal groups && mb_enable_iops_data)
> > > >         try to find group in IOPS groups
> > > > 
> > > > Non-metadata block allocation does not allocate from the IOPS
> > > > groups if the non-IOPS groups are not used up.
> > > 
> > > Hi Ritesh,
> > > I believe this updated version of the patch addresses your original
> > > request that it is possible to allocate blocks from the IOPS block
> > > groups if the non-IOPS groups are full.  This is currently disabled
> > > by default, because in cases where the IOPS groups make up only a
> > > small fraction of the space (e.g. < 1% of total capacity) having
> > > data blocks allocated from this space would not make a big
> > > improvement to the end-user usage of the filesystem, but would
> > > semi-permanently hurt the ability to allocate metadata into the
> > > IOPS groups.
> > > 
> > > We discussed on the ext4 concall various options to make this more
> > > useful (e.g. allowing the root user to allocate from the IOPS
> > > groups if the filesystem is out of space, having a heuristic to
> > > balance IOPS vs. non-IOPS allocations for small files, having a BPF
> > > rule that can specify which UID/GID/procname/filename/etc. can
> > > access this space), but everyone was reluctant to put any complex
> > > policy into the kernel to make any decision, since this eventually
> > > is wrong for some usage.
> > > 
> > > For now, there is just a simple on/off switch, and if this is
> > > enabled the IOPS groups are only used when all of the non-IOPS
> > > groups are full.  Any more complex policy can be deferred to a
> > > separate patch, I think.
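To make the quoted scanning order easier to follow, here is a rough,
standalone C sketch of the policy described in the commit message above.
It is only an illustration: the list and helper names are hypothetical
and do not match the actual fs/ext4/mballoc.c code, and the real
allocator retries with increasing cr values rather than making a single
choice per call.

/*
 * Illustration only: the group-list selection policy from the commit
 * message above.  All names here are made up for the example.
 */
#include <stdbool.h>
#include <stdio.h>

enum group_list {
	IOPS_LARGEST_FREE_ORDER,  /* cr 0 list for IOPS (fast) groups */
	IOPS_AVG_FRAGMENT_SIZE,   /* cr 1 list for IOPS (fast) groups */
	NORMAL_LISTS,             /* the pre-existing non-IOPS lists */
	IOPS_FALLBACK,            /* IOPS groups as last resort for data */
	NO_GROUP,
};

/* Choose which set of group lists to scan for one allocation attempt. */
static enum group_list pick_list(bool metadata, int cr,
				 bool normal_groups_full,
				 bool mb_enable_iops_data)
{
	if (metadata) {
		/* Metadata prefers the IOPS group lists at cr 0 / cr 1;
		 * if those scans fail, the allocator falls through to
		 * the normal lists as before. */
		if (cr == 0)
			return IOPS_LARGEST_FREE_ORDER;
		if (cr == 1)
			return IOPS_AVG_FRAGMENT_SIZE;
		return NORMAL_LISTS;
	}
	/* Data allocations scan the normal lists first ... */
	if (!normal_groups_full)
		return NORMAL_LISTS;
	/* ... and only spill into the IOPS groups when the
	 * mb_enable_iops_data switch is set; otherwise ENOSPC. */
	return mb_enable_iops_data ? IOPS_FALLBACK : NO_GROUP;
}

int main(void)
{
	printf("metadata, cr=0           -> list %d\n",
	       pick_list(true, 0, false, false));
	printf("data, non-IOPS full, on  -> list %d\n",
	       pick_list(false, 0, true, true));
	return 0;
}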
> > 
> > I think having an on/off switch for any user to enable/disable
> > allocation of data from the IOPS groups is good enough for now.
> > At least users with larger IOPS disk space won't run into ENOSPC
> > if they enable this feature.
> > 
> > So, thanks for addressing it. I am going through the series. I will
> > provide my review comments shortly.
> > 
> > Meanwhile, here is my understanding of your use case. Can you please
> > correct/add your inputs to this -
> > 
> > 1. You would like to create a FS with a combination of a high-IOPS
> > storage disk and a non-high-IOPS disk, with the high-IOPS disk space
> > being around 1% of the total disk capacity. (Well, this is obvious
> > as it is stated in the patch description itself.)
> > 
> > 2. Since, of course, ext4 currently does not support multiple
> > drives, we will do this using LVM and place the high-IOPS disk in
> > front.
> > 
> > 3. Then at the creation of the FS we will use a cmd like this:
> > mkfs.ext4 -O sparse_super2 -E packed_meta_blocks,iops=0-1024G /path/to/lvm
> > 
> > Is this understanding right?
> 
> Correct.  Note that for filesystems larger than 256 TiB, when the
> group descriptor table grows larger than the size of group 0, a few
> extra patches that Dongyang developed are needed to fix the
> sparse_super2 option in mke2fs to allow it to pack all metadata at
> the start of the device and move the GDT backup further out.  For
> example, a 2PiB filesystem would use group 9 as the start of the
> first GDT backup.
> 
> I don't expect this will be a problem for most users, and it is
> somewhat independent of the IOPS groups, so it has been kept separate.
> 
> I couldn't find a version of that patch series pushed to the list,
> but it is in our Gerrit (the first one is already pushed):
> 
> https://review.whamcloud.com/52219 ("e2fsck: check all sparse_super backups")
> https://review.whamcloud.com/52273 ("mke2fs: set free blocks accurately ...")
> https://review.whamcloud.com/52274 ("mke2fs: do not set BLOCK_UNINIT ...")
> https://review.whamcloud.com/51295 ("mke2fs: try to pack GDT blocks together")
> 
> (Dongyang, could you please submit the last three patches in this series).

Will post the series when I finish making offline resize work with the
last patch. It needs more work than I expected, e.g. when growing the
filesystem, since we want the GDT blocks packed together, the GDT could
grow beyond group 0 into backup_bgs[0], which means backup_bgs[0] needs
to be moved.

Cheers,
Dongyang

> 
> > I have a few follow-up queries as well -
> > 
> > 1. What about thin-provisioned LVM? IIUC, the LV space is
> > pre-allocated there, but the physical allocation happens at the
> > time of write, and it might so happen that both data/metadata
> > allocations will start to sit in IOPS/non-IOPS groups randomly?
> 
> I think the underlying storage type would be controlled by LVM in
> that case.  I don't know what kind of policy options are available
> with thin-provisioned LVs, but my first thought is "don't do that
> with IOPS groups" since there is no way to know/control what the
> underlying storage is.
> 
> > 2. Even in the case of traditional LVM, the mapping of the physical
> > blocks can be changed during an overwrite or discard sort of use
> > case, right? So do we have a guarantee of the metadata always
> > sitting in high-IOPS groups after the filesystem ages?
> 
> No, I don't think that would happen under normal usage.
> The PV/LV maps are static after the LV is created, so overwriting a
> block at runtime with ext4 would give the same type of storage as at
> mke2fs time.
> 
> The exception would be with LVM snapshots, in which case I'd suggest
> using flash PV space for the snapshot (assuming there is enough) to
> avoid overhead when blocks are COW'd.  Even so, AFAIK the chunks
> written to the snapshot LV are the *old* blocks and the current
> blocks are kept on the main PV, so the IOPS groups would still work
> properly in this case.
> 
> > 3. With these mkfs options to utilize this feature, we do lose the
> > ability to resize, right? I am guessing resize will be disabled by
> > sparse_super2 and/or packed_meta_blocks itself?
> 
> Online resize was disabled in commit v5.13-rc5-20-gb1489186cc83
> "ext4: add check to prevent attempting to resize an fs with
> sparse_super2".  However, I think that was a misunderstanding.  It
> looks like online resize was getting confused by sparse_super2
> together with resize_inode, because there are only 2 backup group
> descriptor tables, and resize_inode expects there to be a bunch more
> backups.  I suspect resize would "work" if resize_inode was disabled
> completely.
> 
> The drawback is that online resize would almost immediately fall back
> to meta_bg (as it does for > 16TiB filesystems anyway), and spew the
> GDT blocks and other metadata across the non-IOPS storage device.
> This would "work" (give you a larger filesystem), but is not ideal.
> 
> I think the long-term solution for this would be to fix the
> interaction with sparse_super2, so that the resize_inode could
> reserve GDT blocks on the flash storage for the primary GDT and
> backup_bgs[0], and also backup_bgs[1] would be kept in a group < 2M
> so that it does not need to store 64-bit block numbers.  That would
> actually allow resize_inode to work with > 16TiB filesystems and
> continue to avoid using meta_bg.
> 
> For the rest of the static metadata (bitmaps, inode tables) it would
> be possible to add more IOPS groups at the end of the current
> filesystem and add a "resize2fs -E iops=x-yG" option to have it
> allocate the static metadata from any of the IOPS groups.  That said,
> it has been a while since I looked at the online resize code in the
> kernel, so I'm not sure whether it is resize2fs or ext4 that is
> making these decisions anymore.
> 
> Cheers, Andreas
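As a side note on the "mkfs.ext4 -O sparse_super2 -E
packed_meta_blocks,iops=0-1024G" example quoted above, the iops byte
range simply covers a run of block groups at the start of the device.
Here is a standalone sketch of that arithmetic, assuming 4 KiB blocks
and 32768 blocks per group (the common ext4 defaults); the helper name
is made up for the example and is not from the actual patches.

/*
 * Illustration only: mapping an "iops=0-1024G" style byte range onto
 * block group numbers, assuming 4 KiB blocks and 128 MiB per group.
 */
#include <stdint.h>
#include <stdio.h>

#define BLOCK_SIZE        4096ULL
#define BLOCKS_PER_GROUP  32768ULL

/* Convert a byte range [start, end) into the block groups it covers. */
static void iops_range_to_groups(uint64_t start, uint64_t end,
				 uint64_t *first, uint64_t *last)
{
	uint64_t group_bytes = BLOCK_SIZE * BLOCKS_PER_GROUP;

	*first = start / group_bytes;
	*last = (end - 1) / group_bytes;
}

int main(void)
{
	uint64_t first, last;

	/* "iops=0-1024G": the first 1 TiB of the LV is flash. */
	iops_range_to_groups(0, 1024ULL << 30, &first, &last);
	/* Prints "IOPS groups: 0-8191" with the defaults above. */
	printf("IOPS groups: %llu-%llu\n",
	       (unsigned long long)first, (unsigned long long)last);
	return 0;
}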
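On the meta_bg fallback for > 16TiB filesystems mentioned above: as I
understand it, the resize inode reserves GDT blocks through the old
indirect-block scheme, whose block pointers are 32-bit, so it cannot
reference reserved GDT blocks located beyond 2^32 filesystem blocks.
A trivial standalone check of that boundary, assuming 4 KiB blocks:

/*
 * Illustration only: why resize_inode tops out around 16 TiB with
 * 4 KiB blocks, which is where online resize falls back to meta_bg.
 */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint64_t block_size = 4096;           /* common ext4 default */
	uint64_t max_blocks = 1ULL << 32;     /* 32-bit block number limit */
	uint64_t limit = max_blocks * block_size;

	/* Prints "resize_inode limit: 16 TiB". */
	printf("resize_inode limit: %llu TiB\n",
	       (unsigned long long)(limit >> 40));
	return 0;
}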