On 06/06/2011 06:45 AM, Jeff Layton wrote:
On Sat, 4 Jun 2011 07:19:23 -0400
Jeff Layton<jlayton@xxxxxxxxxx> wrote:
On Fri, 03 Jun 2011 22:03:43 -0700
Ben Greear<greearb@xxxxxxxxxxxxxxx> wrote:
On 06/03/2011 06:42 PM, Jeff Layton wrote:
On Fri, 03 Jun 2011 14:01:11 -0700
Ben Greear<greearb@xxxxxxxxxxxxxxx> wrote:
Ok, we had some luck. Here's the backtrace and attending dmesg
output. The filer has been doing failover, but it has not gone
into a failed state...so, the system *should* be able to reconnect.
We have the system in the failed state now and will leave it that way
for a bit in case you have some commands you'd like me to run.
Aside from the hung cifs processes (anything accessing those mounts
gets into the D state), the system seems fine.
CIFS VFS: Unexpected lookup error -112
CIFS VFS: Reconnecting tcp session
CIFS VFS: Reconnecting tcp session
CIFS VFS: Unexpected lookup error -11
CIFS VFS: Unexpected lookup error -112
CIFS VFS: Reconnecting tcp session
CIFS VFS: Reconnecting tcp session
CIFS VFS: Reconnecting tcp session
CIFS VFS: Reconnecting tcp session
CIFS VFS: Reconnecting tcp session
CIFS VFS: Reconnecting tcp session
CIFS VFS: Reconnecting tcp session
CIFS VFS: Reconnecting tcp session
CIFS VFS: Reconnecting tcp session
CIFS VFS: Reconnecting tcp session
CIFS VFS: Reconnecting tcp session
CIFS VFS: Reconnecting tcp session
CIFS VFS: Reconnecting tcp session
CIFS VFS: Reconnecting tcp session
CIFS VFS: Reconnecting tcp session
CIFS VFS: Reconnecting tcp session
CIFS VFS: Reconnecting tcp session
CIFS VFS: Reconnecting tcp session
CIFS VFS: Unexpected lookup error -112
CIFS VFS: Reconnecting tcp session
CIFS VFS: Reconnecting tcp session
CIFS VFS: Unexpected lookup error -11
CIFS VFS: Reconnecting tcp session
CIFS VFS: Reconnecting tcp session
CIFS VFS: Reconnecting tcp session
CIFS VFS: need to reconnect in sendv here
------------[ cut here ]------------
WARNING: at /home/greearb/git/linux-2.6.dev.38.y/fs/cifs/transport.c:137 smb_sendv+0x7a/0x2cf [cifs]()
BUG: unable to handle kernel
Hardware name: X8ST3
NULL pointer dereference
Modules linked in: at 0000000000000020
be2iscsi
IP: iscsi_boot_sysfs [<ffffffff81356230>] __sock_recvmsg_nosec+0x12/0x6e
bnx2iPGD 0 cnic
uio
Oops: 0000 [#1] cxgb3iPREEMPT libcxgbiSMP cxgb3
mdio
last sysfs file: /sys/devices/platform/host10/session7/target10:0:0/10:0:0:0/block/sde/sde1/stat
ib_iserCPU 2 rdma_cm
Modules linked in: ib_cm be2iscsi iw_cm iscsi_boot_sysfs ib_sa bnx2i ib_mad cnic ib_core uio ib_addr cxgb3i md4 libcxgbi nls_utf8 cxgb3 cifs mdio xt_TPROXY
ib_iser rdma_cm nf_tproxy_core ib_cm xt_socket iw_cm ib_sa ip6_tables ib_mad ib_core nf_defrag_ipv6 ib_addr md4 nls_utf8 xt_connlimit cifs xt_TPROXY
nf_tproxy_core xt_socket ip6_tables nf_defrag_ipv6 xt_connlimit 8021q garp bridge stp llc fuse macvlan wanlink(P) pktgen iscsi_tcp libiscsi_tcp libiscsi
scsi_transport_iscsi nfs lockd fscache nfs_acl auth_rpcgss sunrpc ipv6 uinput i2c_i801 e1000e 8021q i2c_core garp igb bridge ioatdma stp iTCO_wdt llc
i7core_edac fuse iTCO_vendor_support macvlan pcspkr wanlink(P) dca pktgen edac_core iscsi_tcp microcode [last unloaded: ipt_addrtype] libiscsi_tcp
libiscsi
scsi_transport_iscsi
Pid: 5047, comm: cifsd Tainted: P 2.6.38.8+ #12 nfs lockdSupermicro X8ST3 fscache/X8ST3 nfs_acl
auth_rpcgss
RIP: 0010:[<ffffffff81356230>] [<ffffffff81356230>] __sock_recvmsg_nosec+0x12/0x6e
sunrpc
RSP: 0018:ffff8802e64e5bc0 EFLAGS: 00010286
ipv6
RAX: 0000000000000000 RBX: ffff8802e64e5c40 RCX: 0000000000000004
uinput
RDX: ffff8802e64e5e40 RSI: 0000000000000000 RDI: ffff8802e64e5c40
i2c_i801
RBP: ffff8802e64e5bf0 R08: 0000000000000000 R09: 0000000000000000
R10: ffff8802e64e5d80 R11: ffff8802e64e5e40 R12: ffff8802e64e5d10
e1000e
R13: 0000000000000000 R14: ffff8802e64e5c40 R15: ffff8802e6429f80
i2c_core
FS: 0000000000000000(0000) GS:ffff8800df440000(0000) knlGS:0000000000000000
igb
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
ioatdma
CR2: 0000000000000020 CR3: 0000000001803000 CR4: 00000000000006e0
iTCO_wdt
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
i7core_edac
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
iTCO_vendor_supportProcess cifsd (pid: 5047, threadinfo ffff8802e64e4000, task ffff88030482d880)
pcspkr
Stack:
dca ffff8802e64e5c10 edac_core ffffffff81039b72 microcode ffff880200000001 [last unloaded: ipt_addrtype] ffff880305a40fc8
ffff8802e64e5e40Pid: 4754, comm: btserver Tainted: P 2.6.38.8+ #12
0000000000000004Call Trace:
ffff8802e64e5c30 ffffffff8135792c
0000000000000000 0000000000000000 [<ffffffff8104556a>] ? warn_slowpath_common+0x80/0x98
ffff8802e64e5c40 ffffffffffffffff [<ffffffff81045597>] ? warn_slowpath_null+0x15/0x17
Call Trace:
[<ffffffffa0330a2c>] ? smb_sendv+0x7a/0x2cf [cifs]
[<ffffffff81039b72>] ? select_idle_sibling+0xec/0x127
[<ffffffff8135792c>] __sock_recvmsg+0x49/0x54
[<ffffffff81357e96>] sock_recvmsg+0xa6/0xbf
[<ffffffff8103fde4>] ? get_parent_ip+0x11/0x42
[<ffffffff81041ee4>] ? try_to_wake_up+0x1ad/0x1c8
[<ffffffff8103fde4>] ? get_parent_ip+0x11/0x42
[<ffffffff81041f0c>] ? default_wake_function+0xd/0xf
[<ffffffff8105c7e0>] ? autoremove_wake_function+0x11/0x34
[<ffffffffa0330ca2>] ? smb_send+0x21/0x23 [cifs]
[<ffffffff8103fde4>] ? get_parent_ip+0x11/0x42
[<ffffffff814164ec>] ? sub_preempt_count+0x92/0xa5
[<ffffffffa03311d1>] ? SendReceive+0x13f/0x317 [cifs]
[<ffffffff814133f8>] ? _raw_spin_unlock_irqrestore+0x3a/0x47
[<ffffffff810382c6>] ? __wake_up+0x3f/0x48
[<ffffffffa031d839>] ? CIFSSMBNegotiate+0x191/0x766 [cifs]
[<ffffffff81357ee4>] kernel_recvmsg+0x35/0x41
[<ffffffff81412526>] ? __mutex_lock_common+0x358/0x3bc
[<ffffffffa0321d20>] cifs_demultiplex_thread+0x21e/0xcd9 [cifs]
[<ffffffffa031fd7b>] ? cifs_negotiate_protocol+0x37/0x87 [cifs]
[<ffffffff8103fde4>] ? get_parent_ip+0x11/0x42
[<ffffffff8141259e>] ? __mutex_lock_slowpath+0x14/0x16
[<ffffffff8103838e>] ? need_resched+0x1e/0x28
[<ffffffffa0315fed>] ? cifs_reconnect_tcon+0x19a/0x2c9 [cifs]
[<ffffffffa0321b02>] ? cifs_demultiplex_thread+0x0/0xcd9 [cifs]
[<ffffffff8105c7cf>] ? autoremove_wake_function+0x0/0x34
[<ffffffffa0321b02>] ? cifs_demultiplex_thread+0x0/0xcd9 [cifs]
[<ffffffff8105c3bf>] kthread+0x7d/0x85
[<ffffffffa031af96>] ? small_smb_init+0x27/0x70 [cifs]
[<ffffffff8100b8e4>] kernel_thread_helper+0x4/0x10
[<ffffffffa031c0ad>] ? CIFSSMBWrite2+0xa3/0x242 [cifs]
[<ffffffff8105c342>] ? kthread+0x0/0x85
[<ffffffff8100b8e0>] ? kernel_thread_helper+0x0/0x10
[<ffffffffa032a117>] ? cifs_writepages+0x461/0x714 [cifs]
Code: 48 8b 4d d8 [<ffffffff810ac1e8>] ? do_writepages+0x1f/0x28
48 8b [<ffffffff810a4735>] ? __filemap_fdatawrite_range+0x4e/0x50
55 e0 48 [<ffffffff810a4c3c>] ? filemap_fdatawrite+0x1a/0x1c
8b 75 [<ffffffff810a4c56>] ? filemap_write_and_wait+0x18/0x33
e8 ff 90 [<ffffffffa0326648>] ? cifs_flush+0x2d/0x60 [cifs]
a8 00 [<ffffffff810e901f>] ? filp_close+0x3e/0x6d
00 00 [<ffffffff810e90f6>] ? sys_close+0xa8/0xe2
48 83 c4 [<ffffffff8100aad2>] ? system_call_fastpath+0x16/0x1b
28 5b
---[ end trace 3387e7bab0a9c645 ]---
Kaboom. So you're seeing oopses too. Could you get a listing of the
place where it oopsed by following the instructions here?
http://wiki.samba.org/index.php/LinuxCIFS_troubleshooting#Oopses
I suspect that "sock" is NULL in this case too and it blew up in
kernel_recvmsg.
I added code to WARN_ON when ssocket was null. This isn't a real panic,
just a WARN_ON:
static int
smb_sendv(struct TCP_Server_Info *server, struct kvec *iov, int n_vec)
{
int rc = 0;
int i = 0;
struct msghdr smb_msg;
struct smb_hdr *smb_buffer = iov[0].iov_base;
unsigned int len = iov[0].iov_len;
unsigned int total_len;
int first_vec = 0;
unsigned int smb_buf_length = smb_buffer->smb_buf_length;
struct socket *ssocket = server->ssocket;
if (ssocket == NULL) {
cERROR(1, "need to reconnect in sendv here");
*** HERE *** WARN_ON_ONCE(1);
return -ENOTSOCK; /* BB eventually add reconnect code here */
}
A second warn-on when ENOTSOCK is perculated up to the calling stack
a bit causes the other stack dumpage. I think the one above is root
cause...need to figure out how to have it gracefully bail out and re-connect
when it hits this state, as current code just calls this general loop over
and over again.
No, your warning is there, but it's Oopsing too:
------------[ cut here ]------------
WARNING: at /home/greearb/git/linux-2.6.dev.38.y/fs/cifs/transport.c:137 smb_sendv+0x7a/0x2cf [cifs]()
BUG: unable to handle kernel
Hardware name: X8ST3
NULL pointer dereference
Modules linked in: at 0000000000000020
be2iscsi
IP: iscsi_boot_sysfs [<ffffffff81356230>] __sock_recvmsg_nosec+0x12/0x6e
...smb_sendv is called by the "send" side which is generally a
userspace process. The oops happened on the receive side. cifsd called
kernel_recvmsg, and it looks like it passed in a NULL sock pointer.
I suspect that the following (untested) patch will fix this. I think
the symptoms that you've seen are consistent with the patch
description. Ben, would you be able to test this in your setup? This
should at least prevent the oopses.
------------------[snip]--------------------
[PATCH] cifs: don't allow cifs_reconnect to exit with NULL socket pointer
It's possible for the following set of events to happen:
cifsd calls cifs_reconnect which reconnects the socket. A userspace
process then calls cifs_negotiate_protocol to handle the NEGOTIATE and
gets a reply. But, while processing the reply, cifsd calls
cifs_reconnect again. Eventually the GlobalMid_Lock is dropped and the
reply from the earlier NEGOTIATE completes and the tcpStatus is set to
CifsGood. cifs_reconnect then goes through and closes the socket and sets the
pointer to zero, but because the status is now CifsGood, the new socket
is not created and cifs_reconnect exits with the socket pointer set to
NULL.
Fix this by only setting the tcpStatus to CifsGood if the tcpStatus is
CifsNeedNegotiate, and by making sure that generic_ip_connect is always
called at least once in cifs_reconnect.
Note that this is not a perfect fix for this issue. It's still possible
that the NEGOTIATE reply is handled after the socket has been closed and
reconnected. In that case, the socket state will look correct but it no
NEGOTIATE was performed on it. In that situation though the server
should just shut down the socket on the next attempted send, rather
than causing the oops that occurs today.
Reported-by: Ben Greear<greearb@xxxxxxxxxxxxxxx>
Signed-off-by: Jeff Layton<jlayton@xxxxxxxxxx>
---
fs/cifs/connect.c | 6 +++---
1 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/fs/cifs/connect.c b/fs/cifs/connect.c
index 84c7307..8bb55bc 100644
--- a/fs/cifs/connect.c
+++ b/fs/cifs/connect.c
@@ -152,7 +152,7 @@ cifs_reconnect(struct TCP_Server_Info *server)
mid_entry->callback(mid_entry);
}
- while (server->tcpStatus == CifsNeedReconnect) {
+ do {
try_to_freeze();
/* we should try only the port we connected to before */
@@ -167,7 +167,7 @@ cifs_reconnect(struct TCP_Server_Info *server)
server->tcpStatus = CifsNeedNegotiate;
spin_unlock(&GlobalMid_Lock);
}
- }
+ } while (server->tcpStatus == CifsNeedReconnect);
return rc;
}
@@ -3371,7 +3371,7 @@ int cifs_negotiate_protocol(unsigned int xid, struct cifs_ses *ses)
}
if (rc == 0) {
spin_lock(&GlobalMid_Lock);
- if (server->tcpStatus != CifsExiting)
+ if (server->tcpStatus == CifsNeedNegotiate)
server->tcpStatus = CifsGood;
else
rc = -EHOSTDOWN;
This has some merge issues on 3.6.38.8:
<<<<<<<
while ((server->tcpStatus != CifsExiting) &&
(server->tcpStatus != CifsGood)) {
=======
do {
>>>>>>>
Should I keep your comparison for tcpStatus == CifsNeedReconnect
instead of these != comparisons above?
Thanks,
Ben
--
Ben Greear <greearb@xxxxxxxxxxxxxxx>
Candela Technologies Inc http://www.candelatech.com
--
To unsubscribe from this list: send the line "unsubscribe linux-cifs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html