On Wed, Jul 19, 2023 at 01:50:10PM +0200, Christoph Hellwig wrote:
> On Wed, Jul 19, 2023 at 07:39:01AM +0200, Christoph Hellwig wrote:
> > My day was already over by the time you sent this, but I looked into
> > it the first thing this morning.
> >
> > I can't reproduce the hang, but my first thought was "why the heck do
> > we even end up in the fixup worker" given that there is no GUP-based
> > dirtying in the thread.
> >
> > I can reproduce the test case hitting the fixup worker now, while
> > I can't on misc-next.  Looking into it now, but the rework of the
> > fixup logic is a hot candidate.
>
> So unfortunately even the BUG seems to trigger in a very sporadic
> manner, making a bisect impossible.  This is made worse by me actually
> hitting another hang (dmesg output below) way more frequently, but that
> one actually reproduces on misc-next as well.  I'm also still confused
> on how we hit the fixup worker, as that means we'll need to see a page
> that:
>
>  a) was dirty so that the writeback code picks it up
>  b) had the delalloc bit already cleaned in the I/O tree
>  c) does not have the ordered bit set
>
> "btrfs: move the cow_fixup earlier in writepages handling" would
> be the obvious candidate touching this area, even if I can't see
> how it makes a difference.  Any chance you could check if it is
> indeed the culprit?
>
> And here is the more frequent hang I see with generic/475 loops:
>

I backed your patches out, re-ran, and still hit hangs with generic/475,
so I think you're clear.

There's something awkward going on here. The below hang just looks like
we're waiting for IO.  The caching thread is blocking the transaction
commit because it's trying to read some old blocks, and it's been waiting
for them to come back for 2 minutes.  That's holding everybody else up.

I'll dig into all of this, misc-next is definitely fucked somehow, your
stuff may just be a victim.  Thanks,

Josef