On Oct 24, 2019, at 7:23 AM, Благодаренко Артём <artem.blagodarenko@xxxxxxxxx> wrote:
>
> Lustre FS successfully uses LDISKFS (ext4) partitions with sizes near 512TB.
> This 512TB is the current "verified limit", meaning that we do not expect
> any trouble in production with such large partitions.
>
> Our new challenge is 1024TB, because hardware now allows assembling such
> partitions. The question is: do you know of any possible issues with ext4
> (and e2fsprogs) for such large partitions?

Hi Artem, thanks for bringing this up on the list.

We've also seen that large declustered parity RAID arrays need 40+ disks in
the array to get good rebuild performance. With 12TB disks this pushes the
LUN size to 430TB, and beyond 512TB next year with 16TB disks, so this is an
upcoming issue.

> I know about these possible problems:
> 1. E2fsck is too slow, but a parallel e2fsck project is being developed
>    by Li Xi.

Right. So far this only adds parallelism to the pass1 inode table scan. In
our testing the large majority of the time is spent in pass1 (3879s of 3959s
for a 25% full 1PB fs, see https://jira.whamcloud.com/browse/LU-8465 for
more details), so this phase can most easily be optimized and will also give
the largest overall improvement.

> 2. Reading the block groups takes a lot of time. We have fixes for special
>    cases like e2label. Bigalloc also allows decreasing the metadata size,
>    but sometimes meta_bg is preferable.

For large filesystems (over 256TB) I think meta_bg is always required, as
the GDT is larger than a single block group. However, I guess with bigalloc
it is possible to avoid meta_bg, since the block group size increases by a
factor of the chunk size as well. That means a 1PiB filesystem could avoid
meta_bg if it is using a bigalloc chunk size of 16KB or larger.

> 3. An aged filesystem and an allocator that processes all groups to find
>    a good group. There is a solution, but with some issues.

It would be nice to see this allocator work included in upstream ext4.

> 4. The 32-bit inode counter. Not a problem for Lustre users, who prefer to
>    use DNE (distributed namespace) for inode scaling, but somebody may want
>    to store a lot of inodes on the same partition. The project was not
>    finished; it looks like nobody requires it now.
>
> Could you please point me to other possible problems?

While it is not specifically a problem with ext4, as Lustre OSTs get larger
they will create a large number of objects in each directory, which will
hurt performance as each htree level is added (at about 100k, 1M, and 10M
entries). To avoid the need for such large directories, it would be useful
to reduce the number of objects created per directory by the MDS, which can
be done in Lustre (https://jira.whamcloud.com/browse/LU-11912) by creating
a series of directories over time. Splitting up object creation by age also
has the benefit of allowing a whole directory to drop out of cache once its
objects go cold, rather than just distributing all objects over a larger
number of directories. Using a larger number of directories would just
increase the cache footprint and result in more random IO (i.e. one
create/unlink per directory leaf block).

This would also benefit from the directory shrinking code that was posted
a couple of times to the list, but has not yet landed. As old directories
have their objects deleted, they eventually shrink from millions of entries
down to a few hundred, and with this feature the directory blocks would
also be released.

Cheers, Andreas
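
P.S. As a rough sanity check on the meta_bg/bigalloc point above, the sketch
below (my own back-of-the-envelope code, not anything from e2fsprogs)
compares the GDT size against the block group size for a 1PiB filesystem at
different bigalloc cluster sizes. It assumes 4KiB blocks and 64-byte group
descriptors (the 64bit feature), and ignores the superblock and reserved GDT
blocks, so the real limits are slightly lower than it prints:

/* Rough check: without meta_bg the group descriptor table (GDT) must fit
 * inside a single block group.  Assumes 4KiB blocks and 64-byte group
 * descriptors (64bit feature); with bigalloc each group bitmap covers
 * 8 * block_size clusters, so larger clusters mean larger (and fewer)
 * groups.  Superblock and reserved GDT blocks are ignored here. */
#include <stdio.h>

int main(void)
{
	const unsigned long long fs_bytes   = 1ULL << 50;	/* 1 PiB */
	const unsigned long long block_size = 4096;
	const unsigned long long desc_size  = 64;
	const unsigned long long descs_per_block = block_size / desc_size;

	for (unsigned long long cluster = 4096; cluster <= 65536; cluster *= 2) {
		unsigned long long clusters_per_group = 8 * block_size;
		unsigned long long group_bytes = clusters_per_group * cluster;
		unsigned long long groups = fs_bytes / group_bytes;
		unsigned long long gdt_blocks = (groups + descs_per_block - 1) /
						descs_per_block;
		unsigned long long blocks_per_group = group_bytes / block_size;

		printf("cluster %2lluKiB: %7llu groups, GDT %6llu blocks, "
		       "group %6llu blocks -> %s\n",
		       cluster >> 10, groups, gdt_blocks, blocks_per_group,
		       gdt_blocks < blocks_per_group ?
			       "fits, meta_bg avoidable" : "needs meta_bg");
	}
	return 0;
}

For 1PiB this reports that 4KB and 8KB clusters still need meta_bg, while
16KB and larger clusters leave the GDT comfortably inside one group, which
matches the 16KB figure above.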