On Sun, Feb 6, 2022 at 1:48 PM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
>
> On Sun, Feb 06, 2022 at 12:01:02PM +0100, FMDF wrote:
> > On Wed, Feb 2, 2022 at 10:50 PM Dāvis Mosāns <davispuh@xxxxxxxxx> wrote:
> > >
> > > On Wed, Feb 2, 2022 at 21:13, Matthew Wilcox
> > > (<willy@xxxxxxxxxxxxx>) wrote:
> > > >
> > > > On Wed, Feb 02, 2022 at 07:15:14PM +0200, Dāvis Mosāns wrote:
> > > > > I have a corrupted file on BTRFS which has CoW disabled thus no
> > > > > checksum. Trying to read this file causes the process to get stuck
> > > > > forever. It doesn't return EIO.
> > > > >
> > > > > How can I find out why it gets stuck?
> > > >
> > > > > $ cat /proc/3449/stack | ./scripts/decode_stacktrace.sh vmlinux
> > > > > folio_wait_bit_common (mm/filemap.c:1314)
> > > > > filemap_get_pages (mm/filemap.c:2622)
> > > > > filemap_read (mm/filemap.c:2676)
> > > > > new_sync_read (fs/read_write.c:401 (discriminator 1))
> > > >
> > > > folio_wait_bit_common() is where it waits for the page to be unlocked.
> > > > Probably the problem is that btrfs isn't unlocking the page on
> > > > seeing the error, so you don't get the -EIO returned?
> > >
> > >
> > > Yeah, but how to find where that happens.
> > > Anyway by pure luck I found memcpy that wrote outside of allocated
> > > memory and fixing that solved this issue but I still don't know how to
> > > debug this properly.
> > >
> > There is no special recipe for debugging "this properly" :)
> >
> > You wrote that "by pure luck" you found a memcpy() that wrote beyond the
> > limit of allocated memory. I suppose that you found that faulty memcpy()
> > somewhere in one of the function listed in the stack trace.
>
> I very much doubt that. The code flow here is:
>
> userspace calls read() -> VFS -> btrfs -> block layer -> return to btrfs
> -> return to VFS, wait for read to complete. So by the time anyone's
> looking at the stack trace, all you can see is the part of the call
> chain in the VFS.
> There's no way to see where we went in btrfs, nor
> in the block layer. We also can't see from the stack trace what
> happened with the interrupt which _should have_ cleared the lock bit
> and didn't.
>
OK, I agree. This appears to be one of those special cases where merely
reading a stack trace cannot help much... :(

My argument was about a general approach to debugging unfamiliar code by
just reading the call chain. Many times I've been able to find out what
was wrong with code I had never seen before by just following the chain
of calls in subsystems that I know nothing about (e.g., a bug in "tty"
that was reported by Syzbot).

In this special case, if the developer doesn't know that "the interrupt
[which] _should have_ cleared the lock bit and didn't", there is nothing
that one can deduce from a stack trace. Here one needs to know how things
work, well beyond the functions that are listed in the trace.

So, probably, if one needs a "recipe" for those cases, the recipe is just
to know the subsystem(s) at hand and to know how the kernel manages
interrupts.

Actually, I haven't looked into this issue in depth but, from what
Matthew writes, I doubt that a faulty memcpy() can be the culprit...

Davis, are you really sure that you've fixed that bug?

Regards,

Fabio
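
P.S. To make the point about the lock bit concrete, here is a minimal
user-space sketch of the pattern Matthew describes: a reader submits I/O
and then waits for the page to be unlocked, while a separate completion
path is supposed to unlock it. This is a toy model in Python, not kernel
or btrfs code. The names Page, complete_io and read_page are made up for
illustration, and the "buggy" error path is only an assumption about how
such a hang could arise, not a claim about the actual bug.

```python
import threading


class Page:
    """Toy stand-in for a page-cache page (hypothetical, not a kernel API)."""

    def __init__(self):
        # set() on this event plays the role of clearing the lock bit.
        self.unlocked = threading.Event()
        self.error = False


def complete_io(page, failed, buggy):
    """Runs in its own thread, like the I/O completion path."""
    if failed:
        page.error = True
        if buggy:
            # Assumed bug: the error path returns without unlocking the page.
            return
    page.unlocked.set()


def read_page(failed, buggy, timeout=0.5):
    """Submit the 'I/O', then wait for the unlock, as the reader does."""
    page = Page()
    worker = threading.Thread(target=complete_io, args=(page, failed, buggy))
    worker.start()
    worker.join()
    # A real reader would sleep here forever; the timeout only keeps the
    # sketch from hanging.
    if not page.unlocked.wait(timeout):
        return "stuck"
    return "EIO" if page.error else "ok"


if __name__ == "__main__":
    print(read_page(failed=False, buggy=False))
    print(read_page(failed=True, buggy=False))
    print(read_page(failed=True, buggy=True))
```

In the "stuck" case the only thing the waiting thread can show is that it
is waiting for the unlock. Nothing in its own stack says which completion
path skipped the unlock, which is why the trace alone does not help here.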