Re: PROBLEM: repeatable lockup on RAID-6 with LUKS dm-crypt on NVMe devices when rsyncing many files

Hi,

On 2024/08/15 18:03, Christian Theune wrote:
Hi,

small insight: even given my dataset that can reliably trigger this (after around 1.5 hours of rsyncing) it does not trigger on a specific set of files. I’ve deleted the data and started the rsync on a fresh directory (not a fresh filesystem, I can’t delete that as it carries important data) but it doesn’t always get stuck on the same files, even though rsync processes them in a repeatable order.

I’m wondering how to generate more insights from that. Maybe keeping a blktrace log might help?

It sounds like the specific pattern relies on XFS doing a specific thing there …

Wild idea: maybe running the xfstest suite on an in-memory raid 6 setup could reproduce this?

I’m guessing that the xfs people do not regularly run their test suite on a layered setup like mine with encryption and software raid?

That sounds great.
Christian

On 15. Aug 2024, at 08:19, Christian Theune <ct@xxxxxxxxxxxxxxx> wrote:

Hi,

On 14. Aug 2024, at 10:53, Christian Theune <ct@xxxxxxxxxxxxxxx> wrote:

Hi,

On 12. Aug 2024, at 20:37, John Stoffel <john@xxxxxxxxxxx> wrote:

I'd probably just do the RAID6 tests first, get them out of the way.

Alright, those are running right now - I’ll let you know what happens.

I’m not making progress here. I can’t reproduce those on in-memory loopback raid 6. However: I can’t fully reproduce the rsync. For me this only triggered after around 1.5 hours of progress on the NVMe, which resulted in the hangup. I can only create around 20 GiB worth of raid 6 volume on this machine. I’ve tried running rsync until it exhausts the space, deleting the content and running rsync again, but I feel like this isn’t sufficient to trigger the issue. :(

I’m trying to find whether any specific pattern in the files around the time it locks up might be relevant here and try to run the rsync over that portion.

On the plus side, I have a script now that can create the various loopback settings quickly, so I can try out things as needed. Not that valuable without a reproducer, yet, though.
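For readers following along, here is a minimal sketch of what a loopback RAID 6 setup script of this kind could look like. It is not the script mentioned above, and everything specific in it is an assumption made for illustration: the function name plan_loopback_raid6, the backing directory under /dev/shm, the loop device numbers, the md device name and the image size. It only builds and prints the commands (a dry run) and does not execute anything. The dm-crypt layer is left out because the thread does not say how it is stacked relative to the RAID.

```python
import shlex


def plan_loopback_raid6(n_devices=4, size_mib=5120,
                        backing_dir="/dev/shm/raid6-test",
                        first_loop=10, md_device="/dev/md100"):
    """Return the commands (as argv lists) for a loop-backed RAID 6 array.

    All defaults are hypothetical placeholders, not values from the thread.
    Nothing is executed here; the caller decides whether to run the plan.
    """
    if n_devices < 4:
        # mdadm refuses to create a RAID 6 array with fewer than 4 members.
        raise ValueError("RAID 6 needs at least 4 member devices")

    commands = [["mkdir", "-p", backing_dir]]
    loops = []
    for i in range(n_devices):
        image = f"{backing_dir}/disk{i}.img"
        loop = f"/dev/loop{first_loop + i}"
        # Sparse backing file on tmpfs, then attach it to a loop device.
        commands.append(["truncate", "-s", f"{size_mib}M", image])
        commands.append(["losetup", loop, image])
        loops.append(loop)

    # Assemble the loop devices into one RAID 6 array and put XFS on top.
    commands.append(["mdadm", "--create", md_device, "--level=6",
                     f"--raid-devices={n_devices}", *loops])
    commands.append(["mkfs.xfs", md_device])
    return commands


if __name__ == "__main__":
    # Dry run: print the plan so it can be reviewed before running as root.
    for argv in plan_loopback_raid6():
        print(shlex.join(argv))
```

Printing the plan instead of running it keeps the sketch safe to try on any machine; the printed lines would have to be run as root, and the loop and md device names checked against what is already in use.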

@Yu: you mentioned that you might be able to provide me a kernel that produces more error logging to diagnose this? Any chance we could try that route?

Yes, however, I still need some time to sort out the internal process of
raid5. I'm quite busy with some other work stuff, and I'm familiar with
raid1/10 but not so much with raid5. :(

The main idea is to figure out why IO is not dispatched to the
underlying disks.

Thanks,
Kuai


Christian

--
Christian Theune · ct@xxxxxxxxxxxxxxx · +49 345 219401 0
Flying Circus Internet Operations GmbH · https://flyingcircus.io
Leipziger Str. 70/71 · 06108 Halle (Saale) · Deutschland
HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick


Kind regards,
Christian Theune





