If you run the kernel under gdb and set a breakpoint on panic you should be able to see the exact line where it failedand you will also be able to check the variables and what they are set to. On Thu, 5 Dec 2024 at 02:31, Shyam Prasad N <nspmangalore@xxxxxxxxx> wrote: > > On Wed, Dec 4, 2024 at 10:00 PM Shyam Prasad N <nspmangalore@xxxxxxxxx> wrote: > > > > On Tue, Dec 3, 2024 at 4:44 PM David Howells <dhowells@xxxxxxxxxx> wrote: > > > > > > Okay, I think I see the problem. > > > > > > Looking at the following extraction from the trace: > > > > > > > netfs_rreq_ref: R=0000290e NEW r=1 > > > > netfs_read: R=0000290e READAHEAD c=00000000 ni=0 s=85e00000 l=800000 sz=280000000 > > > > netfs_folio: i=f1c2900000000000 ix=85e00-85fff read > > > > netfs_folio: i=f1c2900000000000 ix=86000-861ff read > > > > netfs_folio: i=f1c2900000000000 ix=86200-863ff read > > > > netfs_folio: i=f1c2900000000000 ix=86400-865ff read > > > > > > We're requesting reads of four folios, each consisting of 512 pages for a > > > total of 8MiB. > > > > > > > netfs_sreq: R=0000290e[1] DOWN SUBMT f=02 s=85e00000 0/100000 e=0 > > > > netfs_sreq: R=0000290e[2] DOWN SUBMT f=02 s=85f00000 0/100000 e=0 > > > > netfs_sreq: R=0000290e[3] DOWN SUBMT f=02 s=86000000 0/100000 e=0 > > > > netfs_sreq: R=0000290e[4] DOWN SUBMT f=02 s=86100000 0/100000 e=0 > > > > netfs_sreq: R=0000290e[5] DOWN SUBMT f=02 s=86200000 0/100000 e=0 > > > > netfs_sreq: R=0000290e[6] DOWN SUBMT f=02 s=86300000 0/100000 e=0 > > > > netfs_sreq: R=0000290e[7] DOWN SUBMT f=02 s=86400000 0/100000 e=0 > > > > netfs_sreq: R=0000290e[8] DOWN SUBMT f=02 s=86500000 0/100000 e=0 > > > > > > That got broken down into 8 submissions, each for a 1MiB slice. > > > > > > > netfs_sreq: R=0000290e[1] DOWN IO f=02 s=85e00000 100000/100000 e=0 > > > > netfs_progress: R=0000290e[01] s=85e00000 ct=0/100000 pa=100000/100000 sl=0 > > > > netfs_donate: R=0000290e[01] -> [02] to-next am=100000 > > > > > > Subrequest 1 completed, but wasn't large enough to cover a whole folio, so it > > > donated its coverage forwards to subreq 2. > > > > > > > netfs_sreq: R=0000290e[6] DOWN IO f=02 s=86300000 100000/100000 e=0 > > > > netfs_progress: R=0000290e[06] s=86300000 ct=0/100000 pa=100000/100000 sl=2 > > > > netfs_donate: R=0000290e[06] -> [05] tail-to-prev am=100000 > > > > > > Subrequest 6 completed, but wasn't large enough to cover a whole folio, so it > > > donated its coverage backwards to subreq 5. > > > > > > > netfs_sreq: R=0000290e[2] DOWN IO f=02 s=85f00000 100000/100000 e=0 > > > > netfs_progress: R=0000290e[02] s=85e00000 ct=0/200000 pa=200000/200000 sl=0 > > > > netfs_folio: i=f1c2900000000000 ix=85e00-85fff read-done > > > > netfs_folio: i=f1c2900000000000 ix=85e00-85fff read-unlock > > > > > > Subreq 2 completed, and with the donation from subreq 1, had sufficient to > > > unlock the first folio. > > > > > > > netfs_sreq: R=0000290e[5] DOWN IO f=02 s=86200000 100000/100000 e=0 > > > > netfs_progress: R=0000290e[05] s=86200000 ct=0/200000 pa=200000/200000 sl=2 > > > > netfs_folio: i=f1c2900000000000 ix=86200-863ff read-done > > > > netfs_folio: i=f1c2900000000000 ix=86200-863ff read-unlock > > > > > > Subreq 5 completed, and with the donation from subreq 6, had sufficient to > > > unlock the third folio. > > > > > > > netfs_sreq: R=0000290e[3] DOWN IO f=02 s=86000000 100000/100000 e=0 > > > > netfs_progress: R=0000290e[03] s=86000000 ct=0/100000 pa=100000/100000 sl=1 > > > > netfs_donate: R=0000290e[03] -> [04] to-next am=100000 > > > > > > Subrequest 3 completed, but wasn't large enough to cover a whole folio, so it > > > donated its coverage forwards to subreq 4. So far, so good. > > > > > > > netfs_sreq: R=0000290e[7] DOWN IO f=02 s=86400000 100000/100000 e=0 > > > > netfs_progress: R=0000290e[07] s=86400000 ct=0/100000 pa=100000/100000 sl=3 > > > > netfs_donate: R=0000290e[07] -> [04] to-prev am=0 > > > > > > Subreq 7 completed, but wasn't large enough to cover a whole folio, so it > > > donated its coverage backwards to subreq 4. This is a bug as subreq 7 is not > > > contiguous with subreq 4. It should instead have donated forwards to subreq > > > 8. > > > > > > > netfs_sreq: R=0000290e[4] DOWN IO f=02 s=86100000 100000/100000 e=0 > > > > netfs_sreq: R=0000290e[4] DOWN +DON f=02 s=86000000 300000/300000 e=0 > > > > netfs_progress: R=0000290e[04] s=86000000 ct=0/300000 pa=200000/300000 sl=1 > > > > netfs_folio: i=f1c2900000000000 ix=86000-861ff read-done > > > > netfs_folio: i=f1c2900000000000 ix=86000-861ff read-unlock > > > > > > Subreq 4 completed, and with the donation from subreq 2, had sufficient to > > > unlock the second folio. However, it also has some excess from subreq 7 that > > > it can't do anything with - and this gets lost. > > > > > > > netfs_sreq: R=0000290e[8] DOWN IO f=02 s=86500000 100000/100000 e=0 > > > > netfs_progress: R=0000290e[08] s=86500000 ct=0/100000 pa=100000/100000 sl=3 > > > > netfs_donate: R=0000290e[08] -> [04] tail-to-prev am=100000 > > > > > > Here's a repeat of the bug: subreq 8 donates to subreq 4, but, again, is not > > > contiguous. As these are happening concurrently, the other thread hasn't > > > delisted subreq 4 yet. > > > > > > > netfs_sreq: R=0000290e[4] DOWN +DON f=02 s=86000000 400000/400000 e=0 > > > > netfs_progress: R=0000290e[04] s=86200000 ct=200000/400000 pa=200000/200000 sl=2 > > > > > > The request screwed at this point: subreq 4 shows the extra stuff it has been > > > donated, but it is unable to do anything with it. There is no folio to > > > wrangle as the third slot (sl=2) in the queue has already been cleared. > > > > > > (Note that this bug shouldn't happen with the patches currently on my > > > netfs-writeback branch as I got rid of the donation mechanism in preference > > > for something simpler with single-threaded collection.) > > > > > > David > > > > > > > Hi David, > > > > Tried your submitted patch "netfs: Fix non-contiguous donation between > > completed reads" with the same workload. > > It seems to be crashing elsewhere. I'm trying to get the OOPS message > > and ftrace, but it seems to be crashing the kernel. > > > > -- > > Regards, > > Shyam > > Here's what I managed to get before the VM crashed: > > [Wed Dec 4 16:27:10 2024] BUG: kernel NULL pointer dereference, > address: 0000000000000068 > [Wed Dec 4 16:27:10 2024] #PF: supervisor read access in kernel mode > [Wed Dec 4 16:27:10 2024] #PF: error_code(0x0000) - not-present page > [Wed Dec 4 16:27:10 2024] PGD 0 P4D 0 > [Wed Dec 4 16:27:10 2024] Oops: Oops: 0000 [#1] SMP PTI > [Wed Dec 4 16:27:10 2024] CPU: 6 UID: 0 PID: 1263 Comm: kworker/6:3 > Tainted: G OE 6.12.0-mainline #10 > [Wed Dec 4 16:27:10 2024] Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE > [Wed Dec 4 16:27:10 2024] Hardware name: Microsoft Corporation > Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.1 > 08/23/2024 > [Wed Dec 4 16:27:10 2024] Workqueue: cifsiod smb2_readv_worker [cifs] > [Wed Dec 4 16:27:10 2024] RIP: > 0010:netfs_consume_read_data.isra.0+0x2d6/0xbe0 [netfs] > [Wed Dec 4 16:27:10 2024] Code: 48 8b 43 78 4c 8b 7d c0 4c 8b 43 68 > 48 85 c0 0f 85 76 05 00 00 48 8b 55 90 48 8b 73 30 48 83 c2 70 48 39 > d6 74 16 48 8b 7d 88 <48> 8b 4f 68 48 03 4f 60 48 39 4b 60 0f 84 58 06 > 00 00 49 29 c0 48 > [Wed Dec 4 16:27:10 2024] RSP: 0018:ffffad4a4582bd98 EFLAGS: 00010283 > [Wed Dec 4 16:27:10 2024] RAX: 0000000000000000 RBX: ffff9810d6ff6280 > RCX: 0000000000100000 > [Wed Dec 4 16:27:10 2024] RDX: ffff981154071eb0 RSI: ffff9810d6ff71a8 > RDI: 0000000000000000 > [Wed Dec 4 16:27:10 2024] workqueue: smb2_readv_worker [cifs] hogged > CPU for >10000us 4 times, consider switching to WQ_UNBOUND > [Wed Dec 4 16:27:10 2024] RBP: ffffad4a4582be10 R08: 0000000000100000 > R09: 0000000000000000 > [Wed Dec 4 16:27:10 2024] R10: 0000000000000008 R11: fefefefefefefeff > R12: 0000000000000000 > [Wed Dec 4 16:27:10 2024] workqueue: smb2_readv_worker [cifs] hogged > CPU for >10000us 5 times, consider switching to WQ_UNBOUND > [Wed Dec 4 16:27:10 2024] workqueue: smb2_readv_worker [cifs] hogged > CPU for >10000us 7 times, consider switching to WQ_UNBOUND > [Wed Dec 4 16:27:10 2024] R13: ffff981154072028 R14: 0000000000200000 > R15: ffff981154072028 > [Wed Dec 4 16:27:10 2024] workqueue: smb2_readv_worker [cifs] hogged > CPU for >10000us 11 times, consider switching to WQ_UNBOUND > > > -- > Regards, > Shyam >