On 13/11/2023 20.20, Johannes Truschnigg wrote:
Interesting data; thanks for providing it. Unfortunately, I am not familiar with that part of the kernel code at all, but there are two observations I can contribute. According to the kernel source, `ext4_mb_scan_aligned` is a "special case for storages like raid5", where "we try to find stripe-aligned chunks for stripe-size-multiple requests" - and it seems that on your system, it might be trying a tad too hard. I don't have a kernel source tree handy right now to check what might have changed in that function or any of its callers recently, but it's the first place I'd go take a closer look at.
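One way to check whether the slow writer really is stuck in that function is to sample its kernel-side call stack while the stall is happening (a rough sketch; <pid-of-writer> is a placeholder for whatever process is doing the slow writes):

    # dump the kernel call stack of the blocked writer (needs root)
    cat /proc/<pid-of-writer>/stack

    # or profile system-wide with call graphs and look for
    # ext4_mb_scan_aligned near the top
    perf top -g

If `ext4_mb_scan_aligned` keeps showing up there, that would support the theory.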
Maybe someone else is able to look into this part? One hopes.
Also, there's a recent kernel Bugzilla entry[0] that observes similarly pathological behavior from ext4 on a single disk of spinning rust, with that same function appearing in the call stack. It revolves around an mkfs-time feature which will, afaik, also get set if mke2fs(8) detects md RAID in the storage stack beneath the device it is formatting (and which SHOULD get set, especially for parity-based RAID). Chances are you may be able to disable this particular optimization by running `tune2fs -E stride=0` against the filesystem's backing array (be warned that I did NOT verify whether that might screw your data, which it very well could!!) and remounting it afterwards, to check whether that is indeed (part of) the underlying cause of the poor performance you see. If you choose to try that, make sure to record the current stride size first, so you can re-apply it later (`tune2fs -l` should do); a sketch of the commands follows below.
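Something along these lines, assuming the filesystem lives on /dev/md0 (the device name is a placeholder; substitute your own, and again, this is untested):

    # record the current settings so they can be restored later
    tune2fs -l /dev/md0 | grep -i 'stride\|stripe'

    # take the filesystem offline first; tune2fs may refuse
    # some changes on a mounted filesystem
    umount /data1

    # disable the stride-based allocator alignment (UNTESTED - at your own risk)
    tune2fs -E stride=0 /dev/md0

    mount /data1

    # to restore later, re-apply the value recorded above, e.g.:
    tune2fs -E stride=128 /dev/md0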
No, I am not ready to take this chance, but here is the relevant data anyway (see below). However, maybe I could boot into an older kernel; the oldest I have is 6.5.7, which is not that far behind. The machine recently crashed and the array was reassembled, which may have given some setting an opportunity to go out of whack. This is above my pay grade...

tune2fs 1.46.5 (30-Dec-2021)
Filesystem volume name:   7x12TB
Last mounted on:          /data1
Filesystem UUID:          378e74a6-e379-4bd5-ade5-f3cd85952099
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr dir_index filetype needs_recovery extent 64bit flex_bg sparse_super large_file huge_file dir_nlink extra_isize metadata_csum
Filesystem flags:         signed_directory_hash
Default mount options:    user_xattr acl
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              57220480
Block count:              14648440320
Reserved block count:     0
Overhead clusters:        4921116
Free blocks:              2615571465
Free inodes:              55168125
First block:              0
Block size:               4096
Fragment size:            4096
Group descriptor size:    64
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         128
Inode blocks per group:   8
RAID stride:              128
RAID stripe width:        640
Flex block group size:    16
Filesystem created:       Fri Oct 26 17:58:35 2018
Last mount time:          Mon Nov 13 16:28:16 2023
Last write time:          Mon Nov 13 16:28:16 2023
Mount count:              7
Maximum mount count:      -1
Last checked:             Tue Oct 31 18:15:25 2023
Check interval:           0 (<none>)
Lifetime writes:          495 TB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               256
Required extra isize:     32
Desired extra isize:      32
Journal inode:            8
Default directory hash:   half_md4
Directory Hash Seed:      7eb08e20-5ee6-46af-9ef9-2d1280dfae98
Journal backup:           inode blocks
Checksum type:            crc32c
Checksum:                 0x3590ae50
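For what it's worth, those two RAID values look internally consistent, assuming this is a 7-disk parity array (which the volume name "7x12TB" suggests):

    stride       = 128 blocks * 4 KiB/block = 512 KiB chunk per disk
    stripe width = 640 blocks = 128 * 5     -> 5 data disks
    7 disks - 5 data disks = 2 parity disks -> consistent with RAID6

So the settings themselves don't look corrupted; the question is why the allocator struggles with them.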
[0]: https://bugzilla.kernel.org/show_bug.cgi?id=217965
-- Eyal at Home (eyal@xxxxxxxxxxxxxx)