On Thu, Jul 02, 2015 at 06:40:17AM -0400, Jeff Layton wrote:
> From: Jeff Layton <jlayton@xxxxxxxxxxxxxxx>
>
> We had a report of a crash while stress testing the NFS client:
>
> BUG: unable to handle kernel NULL pointer dereference at 0000000000000150
> IP: [<ffffffff8127b698>] locks_get_lock_context+0x8/0x90
> PGD 0
> Oops: 0000 [#1] SMP
> Modules linked in: rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ebtable_nat ebtable_filter ebtable_broute bridge stp llc ebtables ip6table_security ip6table_mangle ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_raw ip6table_filter ip6_tables iptable_security iptable_mangle iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_raw coretemp crct10dif_pclmul ppdev crc32_pclmul crc32c_intel ghash_clmulni_intel vmw_balloon serio_raw vmw_vmci i2c_piix4 shpchp parport_pc acpi_cpufreq parport nfsd auth_rpcgss nfs_acl lockd grace sunrpc vmwgfx drm_kms_helper ttm drm mptspi scsi_transport_spi mptscsih mptbase e1000 ata_generic pata_acpi
> CPU: 1 PID: 399 Comm: kworker/1:1H Not tainted 4.1.0-0.rc1.git0.1.fc23.x86_64 #1
> Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/30/2013
> Workqueue: rpciod rpc_async_schedule [sunrpc]
> task: ffff880036aea7c0 ti: ffff8800791f4000 task.ti: ffff8800791f4000
> RIP: 0010:[<ffffffff8127b698>] [<ffffffff8127b698>] locks_get_lock_context+0x8/0x90
> RSP: 0018:ffff8800791f7c00 EFLAGS: 00010293
> RAX: ffff8800791f7c40 RBX: ffff88001f2ad8c0 RCX: ffffe8ffffc80305
> RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
> RBP: ffff8800791f7c88 R08: ffff88007fc971d8 R09: 279656d600000000
> R10: 0000034a01000000 R11: 279656d600000000 R12: ffff88001f2ad918
> R13: ffff88001f2ad8c0 R14: 0000000000000000 R15: 0000000100e73040
> FS: 0000000000000000(0000) GS:ffff88007fc80000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000000000150 CR3: 0000000001c0b000 CR4: 00000000000407e0
> Stack:
>  ffffffff8127c5b0 ffff8800791f7c18 ffffffffa0171e29 ffff8800791f7c58
>  ffffffffa0171ef8 ffff8800791f7c78 0000000000000246 ffff88001ea0ba00
>  ffff8800791f7c40 ffff8800791f7c40 00000000ff5d86a3 ffff8800791f7ca8
> Call Trace:
>  [<ffffffff8127c5b0>] ? __posix_lock_file+0x40/0x760
>  [<ffffffffa0171e29>] ? rpc_make_runnable+0x99/0xa0 [sunrpc]
>  [<ffffffffa0171ef8>] ? rpc_wake_up_task_queue_locked.part.35+0xc8/0x250 [sunrpc]
>  [<ffffffff8127cd3a>] posix_lock_file_wait+0x4a/0x120
>  [<ffffffffa03e4f12>] ? nfs41_wake_and_assign_slot+0x32/0x40 [nfsv4]
>  [<ffffffffa03bf108>] ? nfs41_sequence_done+0xd8/0x2d0 [nfsv4]
>  [<ffffffffa03c116d>] do_vfs_lock+0x2d/0x30 [nfsv4]
>  [<ffffffffa03c251d>] nfs4_lock_done+0x1ad/0x210 [nfsv4]
>  [<ffffffffa0171a30>] ? __rpc_sleep_on_priority+0x390/0x390 [sunrpc]
>  [<ffffffffa0171a30>] ? __rpc_sleep_on_priority+0x390/0x390 [sunrpc]
>  [<ffffffffa0171a5c>] rpc_exit_task+0x2c/0xa0 [sunrpc]
>  [<ffffffffa0167450>] ? call_refreshresult+0x150/0x150 [sunrpc]
>  [<ffffffffa0172640>] __rpc_execute+0x90/0x460 [sunrpc]
>  [<ffffffffa0172a25>] rpc_async_schedule+0x15/0x20 [sunrpc]
>  [<ffffffff810baa1b>] process_one_work+0x1bb/0x410
>  [<ffffffff810bacc3>] worker_thread+0x53/0x480
>  [<ffffffff810bac70>] ? process_one_work+0x410/0x410
>  [<ffffffff810bac70>] ? process_one_work+0x410/0x410
>  [<ffffffff810c0b38>] kthread+0xd8/0xf0
>  [<ffffffff810c0a60>] ? kthread_worker_fn+0x180/0x180
>  [<ffffffff817a1aa2>] ret_from_fork+0x42/0x70
>  [<ffffffff810c0a60>] ? kthread_worker_fn+0x180/0x180
>
> Jean says:
>
> "Running locktests with a large number of iterations resulted in a
> client crash. The test run took a while and hasn't finished after close
> to 2 hours. The crash happened right after I gave up and killed the test
> (after 107m) with Ctrl+C."
>
> The crash happened because a NULL inode pointer got passed into
> locks_get_lock_context. The call chain indicates that file_inode(filp)
> returned NULL, which means that f_inode was NULL. Since that's zeroed
> out in __fput, that suggests that this filp pointer outlived the last
> reference.
>
> Looking at the code, that seems possible. We copy the struct file_lock
> that's passed in, but if the task is signalled at an inopportune time we
> can end up trying to use that file_lock in rpciod context after the process
> that requested it has already returned (and possibly put its filp
> reference).
>
> Fix this by taking an extra reference to the filp when we allocate the
> lock info, and put it in nfs4_lock_release.
>
> Reported-by: Jean Spector <jean@xxxxxxxxxxxxxxx>
> Signed-off-by: Jeff Layton <jeff.layton@xxxxxxxxxxxxxxx>
> Signed-off-by: Trond Myklebust <trond.myklebust@xxxxxxxxxxxxxxx>
> Cc: <stable@xxxxxxxxxxxxxxx> # all stable-series kernels
> Upstream commit: feaff8e5b2cfc3eae02cf65db7a400b0b9ffc596
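
Side note for anyone carrying this into other stable trees: the fix boils down to pinning the filp for the lifetime of the copied lock. The sketch below shows that pattern as I read it; only nfs4_lock_release() is named in the changelog above, so the allocation helper's name and the struct layout here are illustrative rather than the literal upstream diff.

/*
 * Minimal sketch of the refcounting pattern described in the changelog
 * above -- illustrative, not the literal upstream patch.
 */
#include <linux/fs.h>
#include <linux/file.h>
#include <linux/slab.h>

struct nfs4_lockdata {
	struct file_lock fl;	/* private copy of the caller's lock request */
	/* ... sequence/state fields elided ... */
};

static struct nfs4_lockdata *nfs4_alloc_lockdata(struct file_lock *fl)
{
	struct nfs4_lockdata *p = kzalloc(sizeof(*p), GFP_NOFS);

	if (p == NULL)
		return NULL;
	/* copy *fl into p->fl here (details elided) */

	/*
	 * Pin the struct file: the copied lock may still be used from
	 * rpciod after the requesting task has been signalled and has
	 * dropped its own filp reference, so keep f_inode valid until
	 * the release callback runs.
	 */
	get_file(fl->fl_file);
	return p;
}

static void nfs4_lock_release(void *calldata)
{
	struct nfs4_lockdata *data = calldata;

	/* ... other teardown elided ... */
	fput(data->fl.fl_file);	/* drop the reference taken at allocation */
	kfree(data);
}

The important pairing is the get_file() at allocation time with the fput() in the release path, so file_inode() on the copied lock's file never sees a zeroed f_inode.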

Thanks, I'm queuing it for the 3.16 kernel.

Cheers,
--
Luís