Hello,

I am working to help merge ext4's parallel fsck into upstream e2fsprogs. Ted has provided some details here [1] on some of the work needed to get it accepted/merged upstream. In this email, however, I mostly wanted to discuss some performance (perf) observations and to check whether we have already covered such test cases in our multi-threaded fsck testing.

I have been testing with different FS layouts and different disk types to measure the performance benefits. Here are some of the observations; I wanted to know whether they are in line with your observations too, and mainly to discuss Case-4, to see if it is an already known limitation.

Case-1: Huge number of 0-byte files (22M inodes)
We do see performance benefits with pfsck in this use case (around 3x improvement with ramfs). This holds for all disk/device setups, i.e. a ramfs-based ext4 FS using a loop device, on HDD, and on NVMe (perf improvements can vary with the disk type).

Case-2: Huge number of 4KB-32KB files/directories (22M inodes)
We see performance benefits with pfsck in this use case as well (again around 3x improvement with ramfs). This also holds for all disk/device setups, i.e. a ramfs-based ext4 FS using a loop device, on HDD, and on NVMe (perf improvements can vary with the disk type).

Case-3: Large directories (with many 0-byte files inside them)
In this case pass-2 takes most of the time, but we again see performance improvements in pass-1 across all disk/device setups.

Case-4: Files with heavy fragmentation, i.e. lots of extents
(This FS layout is created roughly by running script1.sh followed by script2.sh, mentioned at the end of this email; a rough illustrative sketch is also included below.)
In this case we start seeing performance degradation if the I/O device is fast enough:

1. On a single HDD, we see a significant perf reduction of > ~30% (pfsck compared to non-pfsck).
2. On a single NVMe, a similar or larger perf reduction.
3. ramfs-based single loop device setup: ~100% perf reduction.
4. ramfs-based setup with 4 loop devices, dm-delay on top, and SW raid0 (md0), i.e. 4 dm-delay devices of 50G each in raid0:
   a. With a delay of 0ms we see a performance degradation of around ~100% (10s vs 20s).
      Below is the perf profile where the performance degradation is seen (with pfsck -m 4):

        26.37%  e2fsck  e2fsck              [.] rb_insert_extent
        13.54%  e2fsck  e2fsck              [.] ext2fs_rb_next
         9.72%  e2fsck  libc-2.31.so        [.] _int_free
         7.83%  e2fsck  libc-2.31.so        [.] malloc
         7.45%  e2fsck  e2fsck              [.] rb_test_clear_bmap_extent
         6.46%  e2fsck  e2fsck              [.] rb_test_bmap
         4.60%  e2fsck  libpthread-2.31.so  [.] __pthread_rwlock_rdlock
         4.39%  e2fsck  libpthread-2.31.so  [.] __pthread_rwlock_unlock

   b. But with the above disk setup (4 dm-delay devices in raid0), a ~36% to 3x performance improvement is observed when the delay is within the range of 1ms - 500ms (for every read/write).

Now, I understand one might say that parallel fsck's benefits are mostly seen when there is parallel I/O, because otherwise pfsck adds some extra overhead from thread spawning, allocating per-thread structures, and the merge logic. But should that account for such a significant perf degradation in the fragmented-files use case?
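
For anyone who wants to reproduce something like Case-4 without the scripts, here is a minimal, hypothetical C sketch (this is not the actual script1.sh/script2.sh; the file name and block counts are made up) of one way to generate a file with a large number of extents: writing every other 4K block leaves a hole between each written block, so each written block is logically discontiguous from the next and ends up in its own extent. The result can be checked with filefrag.

#define _GNU_SOURCE
/*
 * Hypothetical sketch only, not the actual script1.sh / script2.sh.
 * Writes every other 4K block of a file, leaving holes in between, so
 * that each written block becomes its own extent.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	const char *path = argc > 1 ? argv[1] : "fragfile";   /* hypothetical name */
	long nblocks = argc > 2 ? atol(argv[2]) : 100000;     /* blocks to write */
	char buf[4096];
	int fd;
	long i;

	memset(buf, 0xaa, sizeof(buf));
	fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	for (i = 0; i < nblocks; i++) {
		/* write logical block 2*i, leave block 2*i+1 as a hole */
		off_t off = (off_t)2 * i * 4096;
		if (pwrite(fd, buf, sizeof(buf), off) != (ssize_t)sizeof(buf)) {
			perror("pwrite");
			close(fd);
			return 1;
		}
	}
	fsync(fd);	/* force the extent tree to actually be allocated/written */
	close(fd);
	return 0;
}

Running something like this for a few thousand files and then fsck'ing the resulting filesystem should, I think, stress the same rbtree-based block bitmap paths (rb_insert_extent and friends) that show up in the profile above, since every file contributes many small, non-contiguous block ranges.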