Re: general protection fault, probably for non-canonical address in nfsd

Chuck Lever <chuck.lever@xxxxxxxxxx> · Mon, 8 Jun 2020 15:27:43 -0400

> On Jun 7, 2020, at 11:32 AM, Hans-Peter Jansen <hpj@xxxxxxxxx> wrote:
> 
> Hi,
> 
> after upgrading the kernel from 5.6.11 to 5.6.14, we suffer from regular 
> crashes of nfsd here:
> 
> 2020-06-07T01:32:43.600306+02:00 server rpc.mountd[2664]: authenticated mount request from 192.168.3.16:303 for /work (/work)
> 2020-06-07T01:32:43.602594+02:00 server rpc.mountd[2664]: authenticated mount request from 192.168.3.16:304 for /work/vmware (/work)
> 2020-06-07T01:32:43.602971+02:00 server rpc.mountd[2664]: authenticated mount request from 192.168.3.16:305 for /work/vSphere (/work)
> 2020-06-07T01:32:43.606276+02:00 server kernel: [51901.089211] general protection fault, probably for non-canonical address 0xb9159d506ba40000: 0000 [#1] SMP PTI
> 2020-06-07T01:32:43.606284+02:00 server kernel: [51901.089226] CPU: 1 PID: 3190 Comm: nfsd Tainted: G           O      5.6.14-lp151.2-default #1 openSUSE Tumbleweed (unreleased)
> 2020-06-07T01:32:43.606286+02:00 server kernel: [51901.089234] Hardware name: System manufacturer System Product Name/P7F-E, BIOS 0906    09/20/2010
> 2020-06-07T01:32:43.606287+02:00 server kernel: [51901.089247] RIP: 0010:cgroup_sk_free+0x26/0x80
> 2020-06-07T01:32:43.606288+02:00 server kernel: [51901.089257] Code: 00 00 00 00 66 66 66 66 90 53 48 8b 07 48 c7 c3 30 72 07 b6 a8 01 75 07 48 85 c0 48 0f 45 d8 48 8b 83 18 09 00 00 a8 03 
> 75 1a <65> 48 ff 08 f6 43 7c 01 74 02 5b c3 48 8b 43 18 a8 03 75 26 65 48
> 2020-06-07T01:32:43.606290+02:00 server kernel: [51901.089276] RSP: 0018:ffffb248c21e7e10 EFLAGS: 00010246
> 2020-06-07T01:32:43.606291+02:00 server kernel: [51901.089280] RAX: b91603a504000000 RBX: ffff99ab141a0000 RCX: 0000000000000021
> 2020-06-07T01:32:43.606292+02:00 server kernel: [51901.089284] RDX: ffffffffb6135ec4 RSI: 0000000000010080 RDI: ffff99a7159c1490
> 2020-06-07T01:32:43.606293+02:00 server kernel: [51901.089287] RBP: ffff99a7159c1200 R08: ffff99ab67a60c60 R09: 000000000002eb00
> 2020-06-07T01:32:43.606294+02:00 server kernel: [51901.089291] R10: ffffb248c0087dc0 R11: 00000000000000c6 R12: 0000000000000000
> 2020-06-07T01:32:43.606295+02:00 server kernel: [51901.089294] R13: 0000000000000103 R14: ffff99aae4934238 R15: ffff99ab31902000
> 2020-06-07T01:32:43.606296+02:00 server kernel: [51901.089299] FS:  0000000000000000(0000) GS:ffff99ab67a40000(0000) knlGS:0000000000000000
> 2020-06-07T01:32:43.606297+02:00 server kernel: [51901.089303] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> 2020-06-07T01:32:43.606303+02:00 server kernel: [51901.089305] CR2: 00000000008e0000 CR3: 00000004df60a000 CR4: 00000000000026e0
> 2020-06-07T01:32:43.606304+02:00 server kernel: [51901.089307] Call Trace:
> 2020-06-07T01:32:43.606305+02:00 server kernel: [51901.089315]  __sk_destruct+0x10d/0x1d0
> 2020-06-07T01:32:43.606306+02:00 server kernel: [51901.089319]  inet_release+0x34/0x60
> 2020-06-07T01:32:43.606307+02:00 server kernel: [51901.089325]  __sock_release+0x81/0xb0
> 2020-06-07T01:32:43.606308+02:00 server kernel: [51901.089358]  svc_sock_free+0x38/0x60 [sunrpc]
> 2020-06-07T01:32:43.606308+02:00 server kernel: [51901.089374]  svc_xprt_put+0x99/0xe0 [sunrpc]
> 2020-06-07T01:32:43.606310+02:00 server kernel: [51901.089389]  svc_recv+0x9c0/0xa40 [sunrpc]
> 2020-06-07T01:32:43.606310+02:00 server kernel: [51901.089410]  ? nfsd_destroy+0x60/0x60 [nfsd]
> 2020-06-07T01:32:43.606311+02:00 server kernel: [51901.089417]  nfsd+0xd1/0x150 [nfsd]
> 2020-06-07T01:32:43.606312+02:00 server kernel: [51901.089420]  kthread+0x10d/0x130
> 2020-06-07T01:32:43.606313+02:00 server kernel: [51901.089423]  ? kthread_park+0x90/0x90
> 2020-06-07T01:32:43.606314+02:00 server kernel: [51901.089426]  ret_from_fork+0x35/0x40
> 
> A vSphere 5.5 host accesses this linux server with nfs v3 for backup
> purposes (a Veeam backup server want to store a new backup here). 
> 
> The kernel is tainted due to vboxdrv. The OS is openSUSE Leap 15.1,
> with the kernel and Virtualbox replaced with uptodate versions from 
> proper rpm packages (built on that very vSphere host in a OBS server
> VM..).
> 
> I used to be subscribed to this ML, but that subscription has been 
> lost 04/09, thus I cannot reply properly to the general prot. fault
> thread, started 05/12 from syzbot with Bruce looking into it.
> 
> It seems somewhat related.

Your backtrace doesn't look anything like the syzbot crashes Bruce
is looking at, and there are no fs/nfsd/ changes between v5.6.11 and
v5.6.14. His crashes appear to be related entirely to the order of
destruction of net namespaces and NFS server data structures --
nothing at the socket layer.

The net/sunrpc/ changes in that commit range have nothing to do with
socket allocation. However, this:

   [51901.089247] RIP: 0010:cgroup_sk_free+0x26/0x80

suggests something else. There is a cgroup/sk related change in that
commit range:

e2d928d5ee43 ("netprio_cgroup: Fix unlimited memory leak of v2 cgroups")

I'm not sure how to help you further, since you are not available to
test this theory for a few weeks. The best I can suggest for others is
to stick with v5.6.11-based kernels until someone with a reproducer
can bisect between .11 and .14 to confirm the theory.

> Interestingly, we're using a couple of NFS v4 mounts for subsets of
> home here, and mount /work and other shares from various 
> Tumbleweed systems with NFS v4 here without any undesired effects. 
> 
> Since the kernel upgrade, every time, this Veeam thing triggers these 
> v3 mounts, the crash happens. I've disabled this backup target for now 
> until the problem is resolved, because it effectively prevents further 
> nfs accesses to this server, and blocks our desktops until the server 
> is rebooted.
> 
> A cursory look into 5.6.{15,16} changelogs seems to imply, that this
> issue is still pending. 
> 
> Let me know, if I can provide any further info's.

--
Chuck Lever