Re: extremely slow writes to array [now not degraded]

The tune2fs changes are setting changes only and are designed to be
reversible and non-destructive.

You could note the stride value, umount, make the change, and remount
to see if that fixes the issue, and that would be expected to be safe.

If something does go wrong, recovery is simple: umount, put the stride
setting back with tune2fs, and remount. About the only failures I have
ever seen are settings that need a matching fstab change, or that
cannot be turned off for every fs type (ie they can be turned off for
ext2/3 but not ext4), so they are invalid and the mount fails until
you revert the setting with tune2fs.
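
Something like this, as a rough sketch only -- I am assuming the array
device is /dev/md127 (substitute your actual md device) and that /data1
has an fstab entry so a plain mount works:

    # record the current values so they can be restored later
    tune2fs -l /dev/md127 | grep -iE 'stride|stripe'
    umount /data1
    tune2fs -E stride=0 /dev/md127
    mount /data1
    # to revert later (if the recorded stride was 128):
    #   umount /data1; tune2fs -E stride=128 /dev/md127; mount /data1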

On Mon, Nov 13, 2023 at 3:36 AM <eyal@xxxxxxxxxxxxxx> wrote:
>
> On 13/11/2023 20.20, Johannes Truschnigg wrote:
> > Interesting data; thanks for providing it. Unfortunately, I am not familiar
> > with that part of kernel code at all, but there are two observations that I can
> > contribute:
> >
> > According to kernel source, `ext4_mb_scan_aligned` is a "special case for
> > storages like raid5", where "we try to find stripe-aligned chunks for
> > stripe-size-multiple requests" - and it seems that on your system, it might be
> > trying a tad too hard. I don't have a kernel source tree handy right now to
> > take a look at what might have changed in the function and any of its
> > callers during recent times, but it's the first place I'd go take a closer
> > look at.
>
> Maybe someone else is able to look into this part? One hopes.
>
> > Also, there's a recent Kernel bugzilla entry[0] that observes a similarly
> > pathological behavior from ext4 on a single disk of spinning rust where that
> > particular function appears in the call stack, and which revolves around an
> > mkfs-time-enabled feature which will, afaik, happen to also be set if
> > mke2fs(8) detects md RAID in the storage stack beneath the device it is
> > supposed to format (and which SHOULD get set, esp. for parity-based RAID).
> >
> > Chances are you may be able to disable this particular optimization by running
> > `tune2fs -E stride=0` against the filesystem's backing array (be warned that I
> > did NOT verify if that might screw your data, which it very well could!!) and
> > remounting it afterwards, to check if that is indeed (part of) the underlying
> > cause to the poor performance you see. If you choose to try that, make sure to
> > record the current stride-size, so you may re-apply it at a later time
> > (`tune2fs -l` should do).
>
> No, I am not ready to take this chance, but here is the relevant data anyway (see below).
> However, maybe I could boot into an older kernel, but the oldest I have is 6.5.7, not that far behind.
>
> The fact that the machine recently crashed and the array was reassembled may have given
> some setting a chance to go out of whack. This is above my pay grade...
>
> tune2fs 1.46.5 (30-Dec-2021)
> Filesystem volume name:   7x12TB
> Last mounted on:          /data1
> Filesystem UUID:          378e74a6-e379-4bd5-ade5-f3cd85952099
> Filesystem magic number:  0xEF53
> Filesystem revision #:    1 (dynamic)
> Filesystem features:      has_journal ext_attr dir_index filetype needs_recovery extent 64bit flex_bg sparse_super large_file huge_file dir_nlink extra_isize metadata_csum
> Filesystem flags:         signed_directory_hash
> Default mount options:    user_xattr acl
> Filesystem state:         clean
> Errors behavior:          Continue
> Filesystem OS type:       Linux
> Inode count:              57220480
> Block count:              14648440320
> Reserved block count:     0
> Overhead clusters:        4921116
> Free blocks:              2615571465
> Free inodes:              55168125
> First block:              0
> Block size:               4096
> Fragment size:            4096
> Group descriptor size:    64
> Blocks per group:         32768
> Fragments per group:      32768
> Inodes per group:         128
> Inode blocks per group:   8
> RAID stride:              128
> RAID stripe width:        640
> Flex block group size:    16
> Filesystem created:       Fri Oct 26 17:58:35 2018
> Last mount time:          Mon Nov 13 16:28:16 2023
> Last write time:          Mon Nov 13 16:28:16 2023
> Mount count:              7
> Maximum mount count:      -1
> Last checked:             Tue Oct 31 18:15:25 2023
> Check interval:           0 (<none>)
> Lifetime writes:          495 TB
> Reserved blocks uid:      0 (user root)
> Reserved blocks gid:      0 (group root)
> First inode:              11
> Inode size:               256
> Required extra isize:     32
> Desired extra isize:      32
> Journal inode:            8
> Default directory hash:   half_md4
> Directory Hash Seed:      7eb08e20-5ee6-46af-9ef9-2d1280dfae98
> Journal backup:           inode blocks
> Checksum type:            crc32c
> Checksum:                 0x3590ae50
>
> > [0]: https://bugzilla.kernel.org/show_bug.cgi?id=217965
> >
>
> --
> Eyal at Home (eyal@xxxxxxxxxxxxxx)
>



