Frequent Crashes on rbd to nfs gateway Server

ilya.dryomov@xxxxxxxxxxx (Ilya Dryomov) · Wed, 24 Sep 2014 15:54:39 +0400

On Wed, Sep 24, 2014 at 12:20 PM, Micha Krause <micha at krausam.de> wrote:
> Hi,
>
>> So does it actually crash or it's just the blocked I/Os?  If it doesn't
>> crash, you should be able to get everything off dmesg.
>
>
> It's blocked I/Os. I just wrote another mail to the list with more dmesg
> output from a CentOS machine.
>
>
>>> dmesg:
>>>
>>> [18102.981064] INFO: task nfsd:2769 blocked for more than 120 seconds.
>>> [18102.981112]       Not tainted 3.14-0.bpo.1-amd64 #1
>>> [18102.981150] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>> [18102.981216] nfsd            D ffff88003fc14340     0  2769      2 0x00000000
>>> [18102.981218]  ffff88003bac6e20 0000000000000046 0000000000000000 ffff88003d47ada0
>>> [18102.981219]  0000000000014340 ffff88003ce31fd8 0000000000014340 ffff88003bac6e20
>>> [18102.981221]  ffff88003ce31728 ffff8800029539f0 7fffffffffffffff 7fffffffffffffff
>>> [18102.981223] Call Trace:
>>> [18102.981225]  [<ffffffff814eedbd>] ? schedule_timeout+0x1ed/0x250
>>> [18102.981231]  [<ffffffffa04b0f92>] ? _xfs_buf_find+0xd2/0x280 [xfs]
>>> [18102.981234]  [<ffffffff8117fc2c>] ? kmem_cache_alloc+0x1bc/0x1f0
>>> [18102.981236]  [<ffffffff814f193c>] ? __down_common+0x97/0xea
>>> [18102.981241]  [<ffffffffa04b0faa>] ? _xfs_buf_find+0xea/0x280 [xfs]
>>> [18102.981243]  [<ffffffff810aa697>] ? down+0x37/0x40
>>> [18102.981247]  [<ffffffffa04b0e02>] ? xfs_buf_lock+0x32/0xf0 [xfs]
>>> [18102.981252]  [<ffffffffa04b0faa>] ? _xfs_buf_find+0xea/0x280 [xfs]
>>> [18102.981257]  [<ffffffffa04b1215>] ? xfs_buf_get_map+0x35/0x1a0 [xfs]
>>> [18102.981263]  [<ffffffffa04b2153>] ? xfs_buf_read_map+0x33/0x130 [xfs]
>>> [18102.981269]  [<ffffffffa05161da>] ? xfs_trans_read_buf_map+0x34a/0x4f0 [xfs]
>>> [18102.981275]  [<ffffffffa05036f9>] ? xfs_imap_to_bp+0x69/0xf0 [xfs]
>>> [18102.981281]  [<ffffffffa0503bcd>] ? xfs_iread+0x7d/0x3f0 [xfs]
>>> [18102.981284]  [<ffffffff810e8939>] ? make_kgid+0x9/0x10
>>> [18102.981286]  [<ffffffff811b148e>] ? inode_init_always+0x10e/0x1d0
>>> [18102.981292]  [<ffffffffa04ba11a>] ? xfs_iget+0x2ba/0x810 [xfs]
>>> [18102.981298]  [<ffffffffa04fd9a6>] ? xfs_ialloc+0xe6/0x740 [xfs]
>>> [18102.981305]  [<ffffffffa04ca1ee>] ? kmem_zone_alloc+0x6e/0xf0 [xfs]
>>> [18102.981311]  [<ffffffffa04fe083>] ? xfs_dir_ialloc+0x83/0x300 [xfs]
>>> [18102.981317]  [<ffffffffa04c8e43>] ? xfs_trans_reserve+0x213/0x220 [xfs]
>>> [18102.981323]  [<ffffffffa04fe87e>] ? xfs_create+0x4fe/0x720 [xfs]
>>> [18102.981329]  [<ffffffffa04bfd02>] ? xfs_vn_mknod+0xd2/0x200 [xfs]
>>> [18102.981331]  [<ffffffff811a6b54>] ? vfs_create+0xe4/0x160
>>> [18102.981335]  [<ffffffffa0400d9e>] ? do_nfsd_create+0x53e/0x610 [nfsd]
>>> [18102.981339]  [<ffffffffa0407f4d>] ? nfsd3_proc_create+0x16d/0x250 [nfsd]
>>> [18102.981342]  [<ffffffffa03f9d74>] ? nfsd_dispatch+0xe4/0x230 [nfsd]
>>> [18102.981347]  [<ffffffffa035dd64>] ? svc_process_common+0x354/0x690 [sunrpc]
>>> [18102.981349]  [<ffffffff81096ab0>] ? try_to_wake_up+0x280/0x280
>>> [18102.981353]  [<ffffffffa035e3fb>] ? svc_process+0x10b/0x160 [sunrpc]
>>> [18102.981359]  [<ffffffffa03f96d7>] ? nfsd+0xb7/0x130 [nfsd]
>>> [18102.981363]  [<ffffffffa03f9620>] ? nfsd_destroy+0x70/0x70 [nfsd]
>>> [18102.981365]  [<ffffffff81086d6c>] ? kthread+0xbc/0xe0
>>> [18102.981367]  [<ffffffff81086cb0>] ? flush_kthread_worker+0xa0/0xa0
>>> [18102.981369]  [<ffffffff814faecc>] ? ret_from_fork+0x7c/0xb0
>>> [18102.981371]  [<ffffffff81086cb0>] ? flush_kthread_worker+0xa0/0xa0
>>
>>
>> Is that the only hung task in dmesg?
>
>
> I think it was; it could be that this message was repeated a few times.

Like I mentioned in my other reply, I'd be very interested in any
similar messages on kernels other than 3.15.*, 3.16.1 and 3.16.2.  A
single hung task stack trace is usually not enough to diagnose this
sort of problem.
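
If it helps when checking dmesg on other kernels, here is a minimal
sketch that counts the "blocked for more than N seconds" warnings per
task, so it is easy to see whether nfsd was the only hung task or one
of several.  The function name hung_tasks and the script name used
below are made up for illustration; the regular expression only assumes
the message format shown in the trace above.

```python
import re
import sys
from collections import Counter

# Matches the kernel's hung-task warning as it appears in the trace above, e.g.
# "INFO: task nfsd:2769 blocked for more than 120 seconds."
# The greedy (.+) keeps task names that themselves contain ':' (kworker/0:1).
HUNG_RE = re.compile(r"INFO: task (.+):(\d+) blocked for more than (\d+) seconds")


def hung_tasks(dmesg_text):
    """Count hung-task warnings per (task name, pid) in dmesg output.

    Mail quote prefixes such as '>>> ' are harmless because the pattern
    is searched for anywhere in the line.
    """
    counts = Counter()
    for line in dmesg_text.splitlines():
        m = HUNG_RE.search(line)
        if m:
            counts[(m.group(1), int(m.group(2)))] += 1
    return counts


if __name__ == "__main__":
    # Read dmesg output from stdin and print one summary line per task.
    for (name, pid), n in sorted(hung_tasks(sys.stdin.read()).items()):
        print(f"{name}:{pid} reported {n} time(s)")
```

Saved under a hypothetical name such as hung_tasks.py, it can be fed
with "dmesg | python3 hung_tasks.py".  If sysrq is enabled on the box,
"echo w > /proc/sysrq-trigger" makes the kernel dump all blocked tasks
to dmesg, which gives more than one stack trace to look at.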