oops in nfs_flush_incompatible (and possible fix?)

Jeff Layton <jlayton@xxxxxxxxxx> · Mon, 6 Dec 2010 14:57:43 -0500

Hi Trond,

I've got a RH bug opened by one of our partners against the RHEL5.6
development kernel. That kernel got a backported version of the
lock_context patches that you merged not too long ago. I'm trying to
get permission to add you to the BZ so you can have a look for yourself
too.

Long story short, they seem to get a reproducible oops like this when
testing under heavy load:

Unable to handle kernel NULL pointer dereference at 0000000000000020 RIP:
[<ffffffff88856222>] :nfs:nfs_flush_incompatible+0x6d/0xd1
PGD 7d9bc2067 PUD 7d9bbf067 PMD 0
Oops: 0000 [1] SMP
last sysfs file: /devices/system/cpu/cpu125/cpufreq/scaling_setspeed
CPU 126
Modules linked in: ipmi_devintf ipmi_si ipmi_msghandler tun nfs fscache
nfs_acl ebtable_nat ebtables ipt_MASQUERADE iptable_nat ip_nat bridge
autofs4 hidp rfcomm l2cap bluetooth lockd sunrpc acpi_cpufreq freq_table
mperf ip_conntrack_netbios_ns ipt_REJECT xt_state ip_conntrack nfnetlink
iptable_filter ip_tables ip6t_REJECT xt_tcpudp ip6table_filter ip6_tables
x_tables be2iscsi ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr
iscsi_tcp bnx2i cnic ipv6 xfrm_nalgo crypto_api uio cxgb3i cxgb3 8021q
libiscsi_tcp libiscsi2 scsi_transport_iscsi2 scsi_transport_iscsi loop
dm_multipath scsi_dh video backlight sbs power_meter hwmon i2c_ec dell_wmi
wmi button battery asus_acpi acpi_memhotplug ac parport_pc lp parport
ksm(U) kvm_intel(U) kvm(U) joydev sg tpm_tis i2c_i801 tpm cdc_ether pcspkr
i7core_edac ide_cd i2c_core edac_mc usbnet tpm_bios cdrom bnx2 dm_raid45
dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror
dm_log dm_mod ata_piix libata shpchp mptsas mptscsih mptbase
scsi_transport_sas sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 21273, comm: snake.exe Tainted: G      2.6.18-229.el5 #1
RIP: 0010:[<ffffffff88856222>]  [<ffffffff88856222>]
:nfs:nfs_flush_incompatible+0x6d/0xd1
RSP: 0018:ffff810a3daddb48  EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff8101a4c05c40 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8101d27f3ac8
RBP: 0000000000000000 R08: 0000000000000000 R09: ffff810a3daddc80
R10: ffff8106896a7c00 R11: 0000000000000048 R12: ffff810a833ebb78
R13: ffff811213f3f6f0 R14: ffff8110c0a1c9c0 R15: 0000000000000001
FS:  00002b6188e086e0(0000) GS:ffff811e7f78c940(0000)
knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000020 CR3: 00000007d9bc7000 CR4: 00000000000026e0
Process snake.exe (pid: 21273, threadinfo ffff810a3dadc000, task
ffff8107d9b24100)
Stack:  ffff810a833ebb78 000000000000007e ffff81102e769e80
0000000000000000
00000000000014ac ffffffff8884b101 ffff810a3daddc80 0000007e00000000
ffff811213f3f800 ffffffff8886b8e0 000000000000007e 00000000000014ac
Call Trace:
[<ffffffff8884b101>] :nfs:nfs_write_begin+0x65/0xf8
[<ffffffff8000fda3>] generic_file_buffered_write+0x14b/0x675
[<ffffffff80062ff0>] thread_return+0x62/0xfe
[<ffffffff800166e8>] __generic_file_aio_write_nolock+0x369/0x3b6
[<ffffffff80063c57>] __mutex_lock_slowpath+0x68/0x9b
[<ffffffff8002187e>] generic_file_aio_write+0x65/0xc1
[<ffffffff8884b805>] :nfs:nfs_file_write+0xd8/0x14f
[<ffffffff80018338>] do_sync_write+0xc7/0x104
[<ffffffff800a2896>] autoremove_wake_function+0x0/0x2e
[<ffffffff8005a4c5>] hrtimer_cancel+0xc/0x16
[<ffffffff80016af0>] vfs_write+0xce/0x174
[<ffffffff800173a8>] sys_write+0x45/0x6e
[<ffffffff8005d28d>] tracesys+0xd5/0xe0

Code: 48 8b 50 20 65 48 8b 04 25 00 00 00 00 48 3b 90 98 05 00 00
RIP  [<ffffffff88856222>] :nfs:nfs_flush_incompatible+0x6d/0xd1
RSP <ffff810a3daddb48>
CR2: 0000000000000020
<0>Kernel panic - not syncing: Fatal exception


The problem seems to be that the req->wb_lock_context is NULL so it
dies dereferencing that pointer in nfs_flush_incompatible. The original
kernel they tested didn't have the check for a NULL return from
nfs_get_lock_context. I fixed that however and they still saw the bug.

At this point, I'm thinking this is a race that might be fixed by the
following patch. These bare calls to nfs_clear_request look suspicious
to me, given that we're not checking the refcount on the request before
freeing fields in it. I haven't tested the patch yet, but I think it
looks correct.

Thoughts?

-----------------------[snip]---------------------------

nfs: remove extraneous and problematic calls to nfs_clear_request

When a nfs_page is freed, nfs_free_request is called which also calls
nfs_clear_request to clean out the lock and open contexts and free the
pagecache page.

However, a couple of places in the nfs code call nfs_clear_request
themselves. What happens here if the refcount on the request is still high?
We'll be releasing contexts and freeing pointers while the request is
possibly still in use.

Remove those bare calls to nfs_clear_context. That should only be done when
the request is being freed.

Signed-off-by: Jeff Layton <jlayton@xxxxxxxxxx>
---
 fs/nfs/read.c  |    1 -
 fs/nfs/write.c |    1 -
 2 files changed, 0 insertions(+), 2 deletions(-)

diff --git a/fs/nfs/read.c b/fs/nfs/read.c
index e4b62c6..aedcaa7 100644
--- a/fs/nfs/read.c
+++ b/fs/nfs/read.c
@@ -152,7 +152,6 @@ static void nfs_readpage_release(struct nfs_page *req)
 			(long long)NFS_FILEID(req->wb_context->path.dentry->d_inode),
 			req->wb_bytes,
 			(long long)req_offset(req));
-	nfs_clear_request(req);
 	nfs_release_request(req);
 }
 
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 4c14c17..c41a435 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -422,7 +422,6 @@ static void nfs_inode_remove_request(struct nfs_page *req)
 		iput(inode);
 	} else
 		spin_unlock(&inode->i_lock);
-	nfs_clear_request(req);
 	nfs_release_request(req);
 }
 
-- 
1.7.3.2
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html