Re: problem with recovered array

Johannes Truschnigg <johannes@xxxxxxxxxxxxxxx> · Thu, 2 Nov 2023 09:34:51 +0100

Hi list,

For the record, I do not think that any of the observations the OP made can be
explained by non-pathological behavior. Something is very clearly wrong with
how this system behaves: the reported figures do not come anywhere near the
performance I would expect from even a degraded RAID6 array, and data written
to the filesystem apparently fails to reach the backing devices in acceptable
time.

The whole affair reeks of either a "subtle kernel bug" or maybe a "subtle
hardware failure", I think.

Still, it'd be interesting to know what happens when writes to the array through
the filesystem are performed with O_DIRECT in effect, i.e., using `dd
oflag=direct status=progress ...`. Does this yield any throughput to the disks
beneath that is observable via `iostat` et al.? What transfer speeds does `dd`
report this way with varying block sizes? Are there any concerning messages in
the kernel ring buffer (`dmesg`) at that time?
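One way to run that comparison in a single pass, sketched in Python under a few
assumptions: GNU coreutils `dd` is on the PATH, `/mnt/array/ddtest.bin` is a
hypothetical scratch file on the array's filesystem (never point this at the md
device or its member disks), and the member names below are placeholders. The
sketch writes with `oflag=direct conv=fsync`, parses `dd`'s own summary line,
and diffs the sectors-written counter in `/proc/diskstats` (the data `iostat`
reads) for each member disk over the same window.

```python
import os
import re
import subprocess
import time

# GNU dd's summary line on stderr, e.g.
# "268435456 bytes (268 MB, 256 MiB) copied, 2.01 s, 134 MB/s"
_DD_SUMMARY = re.compile(r"(\d+) bytes .*?copied, ([\d.]+) s")


def parse_dd_summary(stderr: str) -> float:
    """Return dd's throughput in MB/s (10**6 bytes/s) from its last summary line."""
    matches = _DD_SUMMARY.findall(stderr)
    if not matches:
        raise ValueError("no dd summary line found")
    nbytes, secs = matches[-1]
    return int(nbytes) / max(float(secs), 1e-9) / 1e6


def sectors_written(diskstats: str) -> dict[str, int]:
    """Map device name -> sectors written, from /proc/diskstats text.

    Field 3 is the device name, field 10 the sectors written; diskstats
    always counts in 512-byte units, whatever the disk's real sector size.
    """
    out = {}
    for line in diskstats.splitlines():
        fields = line.split()
        if len(fields) >= 10:
            out[fields[2]] = int(fields[9])
    return out


def read_sectors_written() -> dict[str, int]:
    with open("/proc/diskstats") as fh:
        return sectors_written(fh.read())


def direct_write(target: str, bs: str, count: int,
                 members: list[str]) -> tuple[float, dict[str, float]]:
    """Write bs*count bytes of zeros to `target` with O_DIRECT via dd.

    Returns dd's own MB/s figure plus the average MB/s each member disk
    absorbed over the same wall-clock window (other I/O included).
    """
    before, t0 = read_sectors_written(), time.monotonic()
    result = subprocess.run(
        ["dd", "if=/dev/zero", f"of={target}", f"bs={bs}", f"count={count}",
         "oflag=direct", "conv=fsync"],
        capture_output=True, text=True, check=True,
        env={**os.environ, "LC_ALL": "C"},  # force "." as decimal separator
    )
    after, elapsed = read_sectors_written(), time.monotonic() - t0
    per_disk = {
        disk: (after.get(disk, 0) - before.get(disk, 0)) * 512 / elapsed / 1e6
        for disk in members
    }
    return parse_dd_summary(result.stderr), per_disk
```

A block-size sweep of 256 MiB each would then be, e.g., calling
`direct_write("/mnt/array/ddtest.bin", bs, count, ["sda", "sdb", "sdc", "sdd"])`
for `(bs, count)` in `("4K", 65536)`, `("64K", 4096)`, `("1M", 256)` and
`("16M", 16)`. The per-disk figure is a wall-clock average that also counts any
other writes hitting those disks (a resync, say), and on RAID6 the members
together should see more bytes than `dd` wrote because of parity. If `dd`
reports a sane rate while the members show next to nothing, that is the
interesting case.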

I'm not sure how we'd best find out what the busy kernel threads involved are
spending their time on in particular (Fedora 38 might have more convenient
tools up its sleeve than ftrace?), but I think that's what we need to know to
pin down the underlying cause of the problem.
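As a crude starting point that needs no tracing setup, one could poll
`/proc/<pid>/stack` (root only) for a busy thread, such as the array's md
kernel thread (named like `md127_raid6`, depending on the array), and tally the
innermost frames. This is a hypothetical helper, not an established tool. On
some kernels the file may come back empty while the thread is actually running
on a CPU, in which case `perf top -p <pid>` or `perf record -g -p <pid>` are the
better instruments.

```python
import collections
import time
from typing import Optional


def top_frame(stack_text: str) -> Optional[str]:
    """Return the innermost function name in /proc/<pid>/stack text.

    Lines look like "[<0>] md_thread+0x12a/0x160 [md_mod]" (offsets vary);
    the first line is the innermost frame.
    """
    for line in stack_text.splitlines():
        _, sep, rest = line.partition("] ")
        name = rest.split("+", 1)[0].strip() if sep else ""
        if name:
            return name
    return None


def sample_kernel_thread(pid: int, samples: int = 500,
                         interval: float = 0.01) -> collections.Counter:
    """Poll /proc/<pid>/stack (needs root) and tally the innermost frames."""
    counts: collections.Counter = collections.Counter()
    for _ in range(samples):
        try:
            with open(f"/proc/{pid}/stack") as fh:
                frame = top_frame(fh.read())
        except OSError:  # thread exited, or insufficient privileges
            break
        counts[frame or "<empty stack>"] += 1
        time.sleep(interval)
    return counts
```

With the PID taken from `ps` or `top`, `print(sample_kernel_thread(pid).most_common(10))`
would show where that thread is most often found waiting.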

-- 
with best regards:
- Johannes Truschnigg ( johannes@xxxxxxxxxxxxxxx )

www:   https://johannes.truschnigg.info/
phone: +436502133337
xmpp:  johannes@xxxxxxxxxxxxxxx