On Wed, Nov 18, 2020 at 07:38:46AM -0800, Saranya Muruganandam wrote:
> It has become common for a single disk to be larger than a TiB (e.g.,
> 16 TiB on one spindle).  With this trend, a single filesystem can grow
> larger and larger and easily reach a PiB when built on a LUN.
>
> A journaling filesystem like ext4 needs to be taken offline for regular
> check and repair from time to time.  The problem is that e2fsck still
> does this using a single thread, which is challenging at scale for two
> reasons:
>
> 1) Even with readahead, I/O speed is still limited to a few tens of
>    MiB per second.
> 2) It cannot utilize multiple CPU cores.
>
> It would be challenging to use multiple threads for every phase of
> e2fsck, but as a first step we can try it for the most time-consuming
> phase, pass 1, which according to our benchmarking accounts for about
> 80% of the total e2fsck run time.
>
> Pass 1 scans all valid inodes of the filesystem and checks them one by
> one.  The idea of this patch set is to split those inodes across
> different threads, check them concurrently, and then merge the inodes
> and their corresponding extent information after the threads finish.
>
> To keep the complexity down and make the code less error-prone, fixes
> are still applied serially.  Most of the time a filesystem will have
> only minor errors, so what matters to us is parallel reading and
> checking.
>
> Here is a benchmark on our Lustre filesystem with a 1.2 PiB ext4-based
> OSD filesystem:
>
> DDN SFA18KE storage server
>   DCR (DeClustered RAID) with 162 x HGST 10TB NL-SAS drives
> Test server
>   A virtual machine running on the SFA18KE
>   8 x CPU cores (Xeon(R) Gold 6140)
>   150GB memory
>   CentOS 7.7 (Lustre-patched kernel)

This introductory text presumably came from the original patch series; hence "our Lustre filesystem".  To make this clearer, it would be better to spell out who ran which benchmarks.  And Saranya, you might want to include your own benchmark results, since that will make it easier for people to replicate them.
> I've tested the whole patch series using 'make test' of e2fsck itself,
> and I manually set the default number of threads to 4, which still
> passes almost all of the test suite.  The failing cases are:
>
> f_h_badroot f_multithread f_multithread_logfile f_multithread_no
> f_multithread_ok
>
> f_h_badroot fails because the checking output is out of order, and the
> others fail because of the extra log output from the multiple threads.

And this "I" is Saranya, yes?

> Andreas Dilger (2):
>   e2fsck: fix f_multithread_ok test
>   e2fsck: misc cleanups for pfsck
>
> Li Xi (18):
>   e2fsck: add -m option for multithread
>   e2fsck: copy context when using multi-thread fsck
>   e2fsck: copy fs when using multi-thread fsck
>   e2fsck: add assert when copying context
>   e2fsck: copy bitmaps when copying context
>   e2fsck: open io-channel when copying fs
>   e2fsck: create logs for mult-threads
>   e2fsck: optionally configure one pfsck thread
>   e2fsck: add start/end group for thread
>   e2fsck: split groups to different threads
>   e2fsck: print thread log properly
>   e2fsck: do not change global variables
>   e2fsck: optimize the inserting of dir_info_db
>   e2fsck: merge dir_info after thread finishes
>   e2fsck: merge icounts after thread finishes
>   e2fsck: merge dblist after thread finishes
>   e2fsck: add debug codes for multiple threads
>   e2fsck: merge fs flags when threads finish

The fact that all of these patches are prefixed with "e2fsck:" hides the fact that some of them include changes to libext2fs.  It would be better to separate out the libext2fs changes so we can pay special attention to preserving the ABI.  I'll talk more about this in the individual patches.

					- Ted