On Thu, Dec 5, 2024 at 2:20 AM ronnie sahlberg <ronniesahlberg@xxxxxxxxx> wrote:
>
> If you run the kernel under gdb and set a breakpoint on panic you
> should be able to see the exact line where it failed and you will also
> be able to check the variables and what they are set to.
>
> On Thu, 5 Dec 2024 at 02:31, Shyam Prasad N <nspmangalore@xxxxxxxxx> wrote:
> >
> > On Wed, Dec 4, 2024 at 10:00 PM Shyam Prasad N <nspmangalore@xxxxxxxxx> wrote:
> > >
> > > On Tue, Dec 3, 2024 at 4:44 PM David Howells <dhowells@xxxxxxxxxx> wrote:
> > > >
> > > > Okay, I think I see the problem.
> > > >
> > > > Looking at the following extraction from the trace:
> > > >
> > > > > netfs_rreq_ref: R=0000290e NEW r=1
> > > > > netfs_read: R=0000290e READAHEAD c=00000000 ni=0 s=85e00000 l=800000 sz=280000000
> > > > > netfs_folio: i=f1c2900000000000 ix=85e00-85fff read
> > > > > netfs_folio: i=f1c2900000000000 ix=86000-861ff read
> > > > > netfs_folio: i=f1c2900000000000 ix=86200-863ff read
> > > > > netfs_folio: i=f1c2900000000000 ix=86400-865ff read
> > > >
> > > > We're requesting reads of four folios, each consisting of 512 pages, for a
> > > > total of 8MiB.
> > > >
> > > > > netfs_sreq: R=0000290e[1] DOWN SUBMT f=02 s=85e00000 0/100000 e=0
> > > > > netfs_sreq: R=0000290e[2] DOWN SUBMT f=02 s=85f00000 0/100000 e=0
> > > > > netfs_sreq: R=0000290e[3] DOWN SUBMT f=02 s=86000000 0/100000 e=0
> > > > > netfs_sreq: R=0000290e[4] DOWN SUBMT f=02 s=86100000 0/100000 e=0
> > > > > netfs_sreq: R=0000290e[5] DOWN SUBMT f=02 s=86200000 0/100000 e=0
> > > > > netfs_sreq: R=0000290e[6] DOWN SUBMT f=02 s=86300000 0/100000 e=0
> > > > > netfs_sreq: R=0000290e[7] DOWN SUBMT f=02 s=86400000 0/100000 e=0
> > > > > netfs_sreq: R=0000290e[8] DOWN SUBMT f=02 s=86500000 0/100000 e=0
> > > >
> > > > That got broken down into 8 submissions, each for a 1MiB slice.
> > > >
> > > > > netfs_sreq: R=0000290e[1] DOWN IO f=02 s=85e00000 100000/100000 e=0
> > > > > netfs_progress: R=0000290e[01] s=85e00000 ct=0/100000 pa=100000/100000 sl=0
> > > > > netfs_donate: R=0000290e[01] -> [02] to-next am=100000
> > > >
> > > > Subrequest 1 completed, but wasn't large enough to cover a whole folio, so it
> > > > donated its coverage forwards to subreq 2.
> > > >
> > > > > netfs_sreq: R=0000290e[6] DOWN IO f=02 s=86300000 100000/100000 e=0
> > > > > netfs_progress: R=0000290e[06] s=86300000 ct=0/100000 pa=100000/100000 sl=2
> > > > > netfs_donate: R=0000290e[06] -> [05] tail-to-prev am=100000
> > > >
> > > > Subrequest 6 completed, but wasn't large enough to cover a whole folio, so it
> > > > donated its coverage backwards to subreq 5.
> > > >
> > > > > netfs_sreq: R=0000290e[2] DOWN IO f=02 s=85f00000 100000/100000 e=0
> > > > > netfs_progress: R=0000290e[02] s=85e00000 ct=0/200000 pa=200000/200000 sl=0
> > > > > netfs_folio: i=f1c2900000000000 ix=85e00-85fff read-done
> > > > > netfs_folio: i=f1c2900000000000 ix=85e00-85fff read-unlock
> > > >
> > > > Subreq 2 completed, and with the donation from subreq 1, had sufficient to
> > > > unlock the first folio.
> > > >
> > > > > netfs_sreq: R=0000290e[5] DOWN IO f=02 s=86200000 100000/100000 e=0
> > > > > netfs_progress: R=0000290e[05] s=86200000 ct=0/200000 pa=200000/200000 sl=2
> > > > > netfs_folio: i=f1c2900000000000 ix=86200-863ff read-done
> > > > > netfs_folio: i=f1c2900000000000 ix=86200-363ff read-unlock
> > > >
> > > > Subreq 5 completed, and with the donation from subreq 6, had sufficient to
> > > > unlock the third folio.
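(As an aside, here is a minimal, self-contained sketch of the geometry being
described above: four 2MiB folios read through eight 1MiB subrequests, so each
subrequest covers exactly half a folio and can only unlock it once its
neighbour within the same folio has contributed the other half. This is not
the fs/netfs code; the names are made up, and only the sizes and offsets are
taken from the trace.)

/*
 * Toy model only -- not the fs/netfs code.
 */
#include <stdio.h>

#define FOLIO_SIZE  0x200000ULL  /* 2MiB folio: 512 pages of 4KiB */
#define SUBREQ_SIZE 0x100000ULL  /* 1MiB slice per subrequest */
#define READ_START  0x85e00000ULL
#define READ_LEN    0x800000ULL  /* 8MiB readahead */

int main(void)
{
	unsigned long long s;
	int i = 1;

	for (s = READ_START; s < READ_START + READ_LEN; s += SUBREQ_SIZE, i++) {
		unsigned long long folio = s / FOLIO_SIZE * FOLIO_SIZE;

		/*
		 * A 1MiB subrequest can never span a whole 2MiB folio, so on
		 * completion it must either donate its coverage to the other
		 * subrequest sharing the folio ("to-next"/"tail-to-prev" in
		 * the trace) or absorb that subrequest's donation and unlock
		 * the folio itself.
		 */
		printf("subreq %d: s=%llx covers %s half of folio at %llx\n",
		       i, s, s == folio ? "first" : "second", folio);
	}
	return 0;
}

Running it just prints which half of which folio each of the eight
subrequests covers.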
> > > >
> > > > > netfs_sreq: R=0000290e[3] DOWN IO f=02 s=86000000 100000/100000 e=0
> > > > > netfs_progress: R=0000290e[03] s=86000000 ct=0/100000 pa=100000/100000 sl=1
> > > > > netfs_donate: R=0000290e[03] -> [04] to-next am=100000
> > > >
> > > > Subrequest 3 completed, but wasn't large enough to cover a whole folio, so it
> > > > donated its coverage forwards to subreq 4. So far, so good.
> > > >
> > > > > netfs_sreq: R=0000290e[7] DOWN IO f=02 s=86400000 100000/100000 e=0
> > > > > netfs_progress: R=0000290e[07] s=86400000 ct=0/100000 pa=100000/100000 sl=3
> > > > > netfs_donate: R=0000290e[07] -> [04] to-prev am=0
> > > >
> > > > Subreq 7 completed, but wasn't large enough to cover a whole folio, so it
> > > > donated its coverage backwards to subreq 4. This is a bug, as subreq 7 is not
> > > > contiguous with subreq 4. It should instead have donated forwards to subreq 8.
> > > >
> > > > > netfs_sreq: R=0000290e[4] DOWN IO f=02 s=86100000 100000/100000 e=0
> > > > > netfs_sreq: R=0000290e[4] DOWN +DON f=02 s=86000000 300000/300000 e=0
> > > > > netfs_progress: R=0000290e[04] s=86000000 ct=0/300000 pa=200000/300000 sl=1
> > > > > netfs_folio: i=f1c2900000000000 ix=86000-861ff read-done
> > > > > netfs_folio: i=f1c2900000000000 ix=86000-861ff read-unlock
> > > >
> > > > Subreq 4 completed, and with the donation from subreq 3, had sufficient to
> > > > unlock the second folio. However, it also has some excess from subreq 7 that
> > > > it can't do anything with - and this gets lost.
> > > >
> > > > > netfs_sreq: R=0000290e[8] DOWN IO f=02 s=86500000 100000/100000 e=0
> > > > > netfs_progress: R=0000290e[08] s=86500000 ct=0/100000 pa=100000/100000 sl=3
> > > > > netfs_donate: R=0000290e[08] -> [04] tail-to-prev am=100000
> > > >
> > > > Here's a repeat of the bug: subreq 8 donates to subreq 4, but, again, is not
> > > > contiguous with it. As these are happening concurrently, the other thread
> > > > hasn't delisted subreq 4 yet.
> > > >
> > > > > netfs_sreq: R=0000290e[4] DOWN +DON f=02 s=86000000 400000/400000 e=0
> > > > > netfs_progress: R=0000290e[04] s=86200000 ct=200000/400000 pa=200000/200000 sl=2
> > > >
> > > > The request is screwed at this point: subreq 4 shows the extra coverage it has
> > > > been donated, but it is unable to do anything with it. There is no folio to
> > > > wrangle, as the third slot (sl=2) in the queue has already been cleared.
> > > >
> > > > (Note that this bug shouldn't happen with the patches currently on my
> > > > netfs-writeback branch, as I got rid of the donation mechanism in preference
> > > > for something simpler with single-threaded collection.)
> > > >
> > > > David
> > >
> > > Hi David,
> > >
> > > Tried your submitted patch "netfs: Fix non-contiguous donation between
> > > completed reads" with the same workload.
> > > It seems to be crashing elsewhere. I'm trying to get the OOPS message
> > > and ftrace, but it seems to be crashing the kernel.
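(For clarity, the rule David's analysis implies is just a contiguity check
before any backward donation: a completed subrequest may only donate its tail
to the previous subrequest if that subrequest ends exactly where this one
starts; otherwise the donation has to go forwards. The sketch below is
illustrative only and is not the actual "netfs: Fix non-contiguous donation
between completed reads" patch; the toy_* names are invented.)

struct toy_subreq {
	unsigned long long start;  /* file offset of this slice */
	unsigned long long len;    /* bytes this slice now accounts for */
};

/* True only if @prev ends exactly where @cur begins. */
static int toy_contiguous_with_prev(const struct toy_subreq *prev,
				    const struct toy_subreq *cur)
{
	return prev && prev->start + prev->len == cur->start;
}

/*
 * In the trace above, subreq 7 (s=86400000) donated backwards to subreq 4,
 * whose coverage ends at 86200000, so a check like this would have failed;
 * the donation should instead have gone forwards to subreq 8 (s=86500000),
 * which is contiguous with subreq 7's tail.
 */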
> > >
> > > --
> > > Regards,
> > > Shyam
> >
> > Here's what I managed to get before the VM crashed:
> >
> > [Wed Dec 4 16:27:10 2024] BUG: kernel NULL pointer dereference, address: 0000000000000068
> > [Wed Dec 4 16:27:10 2024] #PF: supervisor read access in kernel mode
> > [Wed Dec 4 16:27:10 2024] #PF: error_code(0x0000) - not-present page
> > [Wed Dec 4 16:27:10 2024] PGD 0 P4D 0
> > [Wed Dec 4 16:27:10 2024] Oops: Oops: 0000 [#1] SMP PTI
> > [Wed Dec 4 16:27:10 2024] CPU: 6 UID: 0 PID: 1263 Comm: kworker/6:3 Tainted: G OE 6.12.0-mainline #10
> > [Wed Dec 4 16:27:10 2024] Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
> > [Wed Dec 4 16:27:10 2024] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.1 08/23/2024
> > [Wed Dec 4 16:27:10 2024] Workqueue: cifsiod smb2_readv_worker [cifs]
> > [Wed Dec 4 16:27:10 2024] RIP: 0010:netfs_consume_read_data.isra.0+0x2d6/0xbe0 [netfs]
> > [Wed Dec 4 16:27:10 2024] Code: 48 8b 43 78 4c 8b 7d c0 4c 8b 43 68 48 85 c0 0f 85 76 05 00 00 48 8b 55 90 48 8b 73 30 48 83 c2 70 48 39 d6 74 16 48 8b 7d 88 <48> 8b 4f 68 48 03 4f 60 48 39 4b 60 0f 84 58 06 00 00 49 29 c0 48
> > [Wed Dec 4 16:27:10 2024] RSP: 0018:ffffad4a4582bd98 EFLAGS: 00010283
> > [Wed Dec 4 16:27:10 2024] RAX: 0000000000000000 RBX: ffff9810d6ff6280 RCX: 0000000000100000
> > [Wed Dec 4 16:27:10 2024] RDX: ffff981154071eb0 RSI: ffff9810d6ff71a8 RDI: 0000000000000000
> > [Wed Dec 4 16:27:10 2024] workqueue: smb2_readv_worker [cifs] hogged CPU for >10000us 4 times, consider switching to WQ_UNBOUND
> > [Wed Dec 4 16:27:10 2024] RBP: ffffad4a4582be10 R08: 0000000000100000 R09: 0000000000000000
> > [Wed Dec 4 16:27:10 2024] R10: 0000000000000008 R11: fefefefefefefeff R12: 0000000000000000
> > [Wed Dec 4 16:27:10 2024] workqueue: smb2_readv_worker [cifs] hogged CPU for >10000us 5 times, consider switching to WQ_UNBOUND
> > [Wed Dec 4 16:27:10 2024] workqueue: smb2_readv_worker [cifs] hogged CPU for >10000us 7 times, consider switching to WQ_UNBOUND
> > [Wed Dec 4 16:27:10 2024] R13: ffff981154072028 R14: 0000000000200000 R15: ffff981154072028
> > [Wed Dec 4 16:27:10 2024] workqueue: smb2_readv_worker [cifs] hogged CPU for >10000us 11 times, consider switching to WQ_UNBOUND
> >
> > --
> > Regards,
> > Shyam
>

Ack. Will check further on this today.

--
Regards,
Shyam