Hi,

I had to put this issue aside, and as Yu indicated he was busy, I haven't followed up yet.

@Yu: I don't have new insights, but I have a basically identical machine to which I will soon start adding new data with a similar structure. I couldn't directly reproduce the issue there - likely because the network is a bit slower, as it's connected from a remote site and has only 1G instead of 10G, due to the long distances.

Let me know if you're interested in following up here and I'll try to make room on my side to get you more input as needed.

Christian

> On 15. Aug 2024, at 13:14, Yu Kuai <yukuai1@xxxxxxxxxxxxxxx> wrote:
>
> Hi,
>
> On 2024/08/15 18:03, Christian Theune wrote:
>> Hi,
>> small insight: even given my dataset that can reliably trigger this (after around 1.5 hours of rsyncing), it does not trigger on a specific set of files. I've deleted the data and started the rsync on a fresh directory (not a fresh filesystem, I can't delete that as it carries important data), but it doesn't always get stuck on the same files, even though rsync processes them in a repeatable order.
>> I'm wondering how to generate more insights from that. Maybe keeping a blktrace log might help?
>> It sounds like the specific pattern relies on XFS doing a specific thing there …
>> Wild idea: maybe running the xfstests suite on an in-memory RAID 6 setup could reproduce this?
>> I'm guessing that the XFS people do not regularly run their test suite on a layered setup like mine with encryption and software RAID?
>
> That sounds great.
>> Christian
>>> On 15. Aug 2024, at 08:19, Christian Theune <ct@xxxxxxxxxxxxxxx> wrote:
>>>
>>> Hi,
>>>
>>>> On 14. Aug 2024, at 10:53, Christian Theune <ct@xxxxxxxxxxxxxxx> wrote:
>>>>
>>>> Hi,
>>>>
>>>>> On 12. Aug 2024, at 20:37, John Stoffel <john@xxxxxxxxxxx> wrote:
>>>>>
>>>>> I'd probably just do the RAID6 tests first, get them out of the way.
>>>>
>>>> Alright, those are running right now - I'll let you know what happens.
>>>
>>> I'm not making progress here. I can't reproduce the hang on in-memory loopback RAID 6. However: I can't fully reproduce the rsync run. For me this only triggered after around 1.5 hours of progress on the NVMe, which resulted in the hangup. I can only create around 20 GiB worth of RAID 6 volume on this machine. I've tried running rsync until it exhausts the space, deleting the content and running rsync again, but I feel like this isn't sufficient to trigger the issue. :(
>>>
>>> I'm trying to find out whether any specific pattern in the files around the time it locks up might be relevant here, and will try to run the rsync over that portion.
>>>
>>> On the plus side, I have a script now that can create the various loopback settings quickly, so I can try out things as needed. Not that valuable without a reproducer yet, though.
>>>
>>> @Yu: you mentioned that you might be able to provide me a kernel that produces more error logging to diagnose this? Any chance we could try that route?
>
> Yes, however, I still need some time to sort out the internal process of
> raid5. I'm quite busy with some other work stuff and I'm familiar with
> raid1/10, but not too much with raid5. :(
>
> The main idea is to figure out why IO is not dispatched to the underlying
> disks.
>
> Thanks,
> Kuai
>
>>>
>>> Christian
>>>
>>> --
>>> Christian Theune · ct@xxxxxxxxxxxxxxx · +49 345 219401 0
>>> Flying Circus Internet Operations GmbH · https://flyingcircus.io
>>> Leipziger Str. 70/71 · 06108 Halle (Saale) · Deutschland
>>> HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick
>> Kind regards,
>> Christian Theune

Kind regards,
Christian Theune