Re: NFS silent death

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



I will try to debug idmapd more, but so far I don't see idmapd very informative, even with maximal verbose level and according to what I see in idmapd sources, there is almost no hope to expect a lot of information.

By the way, if problem as you suggest, then it should appear for all clients, isn't it?
Because now it happens randomly after some random period of time.

--
Anton Starikov.

On Apr 27, 2009, at 11:20 PM, Trond Myklebust wrote:

On Apr 27, 2009, at 5:12 PM, Anton Starikov <ant.starikov@xxxxxxxxx> wrote:

Thanks for your comment.


In fact, we have LDAP setup.
If I limit nss for idmapper with LDAP-only, this should remove problem, at least in theory, isn't it?

Possibly. Won't it still have to read /etc/idmapd.conf and/or files like /etc/nsswitch.conf, though?

Running idmapd in debugging mode might help figure it out...

Trond



On Apr 27, 2009, at 11:04 PM, Trond Myklebust wrote:



On Mon, 2009-04-27 at 22:13 +0200, Anton Starikov wrote:
Hello,

I got stuck around NFS problem.

There is a server which serve /home via NFSv4 and root via NFSv3.
There are number of diskless clients.

At some points some of clients hang.
Nothing in the logs and in console most of time.

I tried 2.6.27-2.6.30 kernels on both sides. It looks like generally NFSv4 stable enough, if I just mount /home to diskNess host - no hangs
during couple of months. But if I start mix both of them - then
problems come. And, there is a problem to have nfs-root on NFSv4, at
least all my attempts failed. After some time it always endup with
broken id-mapping.
And with 2.6.27-2.6.29 nfs3 used to be absolutely non-usable, if you
need writing. At least all my games with different mount|export
options used end up with the same problem with writing:
echo "test" > test_file
when file doesn't preexist, fail with "Invalid argument"
So, I had to mix them.

So far I come to 2.6.30 kernel and finally got something in the logs,
but only for one of clients, rest is still die silent.


[ 9363.096013] INFO: task rpciod/0:745 blocked for more than 120
seconds.
[ 9363.102730] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 9363.110905] rpciod/0      D ffffc20000010e00     0   745      2
[ 9363.117049]  ffff8805244a9b10 0000000000000046 ffff8805244a9aa0
0000000000010e00
[ 9363.124876]  0000000000010e00 0000000000010e00 0000000000010e00
0000000000010e00
[ 9363.132689]  0000000000010e00 0000000000010e00 ffff880524a667c0
ffff880124192840
[ 9363.140506] Call Trace:
[ 9363.143138]  [<ffffffff805e4bf0>] schedule+0x1c/0x44
[ 9363.148303]  [<ffffffffa011b720>] nfs_idmap_id+0x1ed/0x287 [nfs]
[ 9363.154567]  [<ffffffffa011b7f3>] nfs_map_group_to_gid+0x39/0x4f
[nfs]
[ 9363.161332] [<ffffffffa011227d>] decode_attr_group +0x110/0x1af [nfs] [ 9363.168011] [<ffffffffa011277a>] decode_getfattr+0x45e/0x960 [nfs] [ 9363.174509] [<ffffffffa01175ef>] nfs4_xdr_dec_open+0xa3/0xef [nfs]
[ 9363.181020]  [<ffffffffa00776eb>] rpcauth_unwrap_resp+0x89/0xac
[sunrpc]
[ 9363.187935] [<ffffffffa006f857>] call_decode+0x14e/0x1d3 [sunrpc] [ 9363.194326] [<ffffffffa00768f3>] __rpc_execute+0x93/0x278 [sunrpc]
[ 9363.200809]  [<ffffffffa0076b4d>] rpc_async_schedule+0x23/0x39
[sunrpc]
[ 9363.207624]  [<ffffffff8026926b>] run_workqueue+0xc9/0x189
[ 9363.213300]  [<ffffffff80269415>] worker_thread+0xea/0x10f
[ 9363.218972]  [<ffffffff8026de90>] kthread+0x69/0xac
[ 9363.224022]  [<ffffffff8020d26a>] child_rip+0xa/0x20
[ 9363.229174] INFO: task ntpd:4161 blocked for more than 120 seconds.
[ 9363.235620] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 9363.243787] ntpd          D ffffc20000010e00     0  4161      1
[ 9363.249931]  ffff880125831bf8 0000000000000082 ffff880126b49ad0
0000000000010e00
[ 9363.257758]  0000000000010e00 0000000000010e00 0000000000010e00
0000000000010e00
[ 9363.265583]  0000000000010e00 0000000000010e00 ffff880122c380c0
ffff8801264184c0
[ 9363.273419] Call Trace:
[ 9363.276035]  [<ffffffff805e4bf0>] schedule+0x1c/0x44
[ 9363.281174]  [<ffffffff805e4c8e>] io_schedule+0x76/0xd0
[ 9363.286581]  [<ffffffff802c4c1c>] sync_page+0x54/0x6c
[ 9363.291815]  [<ffffffff802c4cb6>] __lock_page+0x82/0xb8
[ 9363.297224]  [<ffffffff802c4e7a>] find_lock_page+0x48/0x82
[ 9363.302892]  [<ffffffff802c57a4>] filemap_fault+0x183/0x346
[ 9363.308647]  [<ffffffff802dc7a6>] __do_fault+0x77/0x449
[ 9363.314059]  [<ffffffff802df33c>] handle_mm_fault+0x1fe/0x31b
[ 9363.319986]  [<ffffffff805e96f2>] do_page_fault+0x273/0x29e
[ 9363.325741]  [<ffffffff805e6f95>] page_fault+0x25/0x30
[ 9363.331068]  [<00007fdd26f110e0>] 0x7fdd26f110e0
[ 9363.335865] INFO: task bash:4261 blocked for more than 120 seconds.
[ 9363.342311] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 9363.350480] bash          D ffffc20000810e00     0  4261   4259
[ 9363.356611]  ffff880925175468 0000000000000082 0000000000000088
0000000000010e00
[ 9363.364441]  0000000000010e00 0000000000010e00 0000000000010e00
0000000000010e00
[ 9363.372253]  0000000000010e00 0000000000010e00 ffff88092649e780
ffff880d26522040
[ 9363.380078] Call Trace:
[ 9363.382709] [<ffffffff805e550d>] __mutex_lock_common +0x159/0x1fc [ 9363.388985] [<ffffffff805e55d7>] __mutex_lock_slowpath +0x27/0x3d
[ 9363.395262]  [<ffffffff805e5280>] mutex_lock+0x25/0x53
[ 9363.400600]  [<ffffffffa011b5c9>] nfs_idmap_id+0x96/0x287 [nfs]
[ 9363.406772] [<ffffffffa011b842>] nfs_map_name_to_uid +0x39/0x4f [nfs] [ 9363.413462] [<ffffffffa01120ce>] decode_attr_owner +0x110/0x1af [nfs] [ 9363.420140] [<ffffffffa0112c02>] decode_getfattr+0x8e6/0x960 [nfs]
[ 9363.426642]  [<ffffffffa01131d9>] nfs4_xdr_dec_access+0xfd/0x11e
[nfs]
[ 9363.433408]  [<ffffffffa00776eb>] rpcauth_unwrap_resp+0x89/0xac
[sunrpc]
[ 9363.440324] [<ffffffffa006f857>] call_decode+0x14e/0x1d3 [sunrpc]

Hmm... My guess is that you are deadlocked because rpciod appears to be hanging on an idmapper upcall due to an NFSv4 open() call. At the same time, since /etc/passwd and /etc/group are on an NFSv3 partition, the
idmapper will need to use rpciod in order to read them.

We therefore apparently need to move the idmapper upcall out of the
rpciod part of the open() call. I'll look into that...

Trond



--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[Index of Archives]     [Linux Filesystem Development]     [Linux USB Development]     [Linux Media Development]     [Video for Linux]     [Linux NILFS]     [Linux Audio Users]     [Yosemite Info]     [Linux SCSI]

  Powered by Linux