On Fri, Apr 24, 2015 at 6:41 PM, Nikola Ciprich <nikola.ciprich@xxxxxxxxxxx> wrote:
> Hello once again,
>
> I seem to have hit one more problem today:
> 3-node test cluster, nodes running a 3.18.1 kernel,
> ceph-0.94.1, 3-replica pool, backed by SSD osds.

Does this mean the rbd device is mapped on a node that also runs one or
more osds?

>
> After mapping a volume using rbd and trying to zero it
> using dd:
>
> dd if=/dev/zero of=/dev/rbd0 bs=1M
>
> it was running fine for some time at ~200 MB/s,
> but the speed slowly dropped to ~70 MB/s, then the process
> hung and the following backtraces started to appear in dmesg:
>
> Apr 24 17:09:45 vfnphav1a kernel: [340710.888081] INFO: task kworker/u8:2:15884 blocked for more than 120 seconds.
> Apr 24 17:09:45 vfnphav1a kernel: [340710.895645] Not tainted 3.18.11lb6.01 #1
> Apr 24 17:09:45 vfnphav1a kernel: [340710.900612] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Apr 24 17:09:45 vfnphav1a kernel: [340710.909290] kworker/u8:2 D 0000000000000001 0 15884 2 0x00000000
> Apr 24 17:09:45 vfnphav1a kernel: [340710.917043] Workqueue: writeback bdi_writeback_workfn (flush-252:0)
> Apr 24 17:09:45 vfnphav1a kernel: [340710.923998] ffff880172b73608 0000000000000046 ffff88021424a850 0000000000004000
> Apr 24 17:09:45 vfnphav1a kernel: [340710.932595] ffff8801988d3120 0000000000011640 ffff880172b70010 0000000000011640
> Apr 24 17:09:45 vfnphav1a kernel: [340710.941193] 0000000000004000 0000000000011640 ffff8800d7689890 ffff8801988d3120
> Apr 24 17:09:45 vfnphav1a kernel: [340710.949799] Call Trace:
> Apr 24 17:09:45 vfnphav1a kernel: [340710.952746] [<ffffffff8149882e>] ? _raw_spin_unlock+0xe/0x30
> Apr 24 17:09:45 vfnphav1a kernel: [340710.959009] [<ffffffff8123ba6b>] ? queue_unplugged+0x5b/0xe0
> Apr 24 17:09:45 vfnphav1a kernel: [340710.965258] [<ffffffff81494149>] schedule+0x29/0x70
> Apr 24 17:09:45 vfnphav1a kernel: [340710.970728] [<ffffffff8149421c>] io_schedule+0x8c/0xd0
> Apr 24 17:09:45 vfnphav1a kernel: [340710.976462] [<ffffffff81239e95>] get_request+0x445/0x860
> Apr 24 17:09:45 vfnphav1a kernel: [340710.982366] [<ffffffff81086680>] ? bit_waitqueue+0x80/0x80
> Apr 24 17:09:45 vfnphav1a kernel: [340710.988443] [<ffffffff812358eb>] ? elv_merge+0xeb/0xf0
> Apr 24 17:09:45 vfnphav1a kernel: [340710.994167] [<ffffffff8123bdf8>] blk_queue_bio+0xc8/0x360
> Apr 24 17:09:45 vfnphav1a kernel: [340711.000159] [<ffffffff81239790>] generic_make_request+0xc0/0x100
> Apr 24 17:09:45 vfnphav1a kernel: [340711.006760] [<ffffffff81239841>] submit_bio+0x71/0x140
> Apr 24 17:09:45 vfnphav1a kernel: [340711.012489] [<ffffffff811b5aae>] _submit_bh+0x11e/0x170
> Apr 24 17:09:45 vfnphav1a kernel: [340711.018307] [<ffffffff811b5b10>] submit_bh+0x10/0x20
> Apr 24 17:09:45 vfnphav1a kernel: [340711.023865] [<ffffffff811b98e8>] __block_write_full_page.clone.0+0x198/0x340
> Apr 24 17:09:45 vfnphav1a kernel: [340711.031846] [<ffffffff811b9cb0>] ? I_BDEV+0x10/0x10
> Apr 24 17:09:45 vfnphav1a kernel: [340711.037313] [<ffffffff811b9cb0>] ? I_BDEV+0x10/0x10
> Apr 24 17:09:45 vfnphav1a kernel: [340711.042784] [<ffffffff811b9c5a>] block_write_full_page+0xba/0x100
> Apr 24 17:09:45 vfnphav1a kernel: [340711.049477] [<ffffffff811bab88>] blkdev_writepage+0x18/0x20
> Apr 24 17:09:45 vfnphav1a kernel: [340711.055642] [<ffffffff811231ca>] __writepage+0x1a/0x50
> Apr 24 17:09:45 vfnphav1a kernel: [340711.061374] [<ffffffff81124427>] write_cache_pages+0x1e7/0x4e0
> Apr 24 17:09:45 vfnphav1a kernel: [340711.067797] [<ffffffff811231b0>] ? set_page_dirty+0x60/0x60
> Apr 24 17:09:45 vfnphav1a kernel: [340711.073952] [<ffffffff81124774>] generic_writepages+0x54/0x80
> Apr 24 17:09:45 vfnphav1a kernel: [340711.080292] [<ffffffff811247c3>] do_writepages+0x23/0x40
> Apr 24 17:09:45 vfnphav1a kernel: [340711.086196] [<ffffffff811add39>] __writeback_single_inode+0x49/0x2c0
> Apr 24 17:09:45 vfnphav1a kernel: [340711.093131] [<ffffffff81086c8f>] ? wake_up_bit+0x2f/0x40
> Apr 24 17:09:45 vfnphav1a kernel: [340711.099028] [<ffffffff811af3b6>] writeback_sb_inodes+0x2d6/0x490
> Apr 24 17:09:45 vfnphav1a kernel: [340711.105625] [<ffffffff811af60e>] __writeback_inodes_wb+0x9e/0xd0
> Apr 24 17:09:45 vfnphav1a kernel: [340711.112223] [<ffffffff811af83b>] wb_writeback+0x1fb/0x320
> Apr 24 17:09:45 vfnphav1a kernel: [340711.118214] [<ffffffff811afa60>] wb_do_writeback+0x100/0x210
> Apr 24 17:09:45 vfnphav1a kernel: [340711.124466] [<ffffffff811afbe0>] bdi_writeback_workfn+0x70/0x250
> Apr 24 17:09:45 vfnphav1a kernel: [340711.131063] [<ffffffff814954de>] ? mutex_unlock+0xe/0x10
> Apr 24 17:09:45 vfnphav1a kernel: [340711.136974] [<ffffffffa02c4ef4>] ? bnx2x_release_phy_lock+0x24/0x30 [bnx2x]
> Apr 24 17:09:45 vfnphav1a kernel: [340711.144530] [<ffffffff8106529a>] process_one_work+0x13a/0x450
> Apr 24 17:09:45 vfnphav1a kernel: [340711.150872] [<ffffffff810656d2>] worker_thread+0x122/0x4f0
> Apr 24 17:09:45 vfnphav1a kernel: [340711.156944] [<ffffffff81086589>] ? __wake_up_common+0x59/0x90
> Apr 24 17:09:45 vfnphav1a kernel: [340711.163280] [<ffffffff810655b0>] ? process_one_work+0x450/0x450
> Apr 24 17:09:45 vfnphav1a kernel: [340711.169790] [<ffffffff8106a98e>] kthread+0xde/0x100
> Apr 24 17:09:45 vfnphav1a kernel: [340711.175253] [<ffffffff81050dc4>] ? do_exit+0x6e4/0xaa0
> Apr 24 17:09:45 vfnphav1a kernel: [340711.180987] [<ffffffff8106a8b0>] ? __init_kthread_worker+0x40/0x40
> Apr 24 17:09:45 vfnphav1a kernel: [340711.187757] [<ffffffff81498d88>] ret_from_fork+0x58/0x90
> Apr 24 17:09:45 vfnphav1a kernel: [340711.193652] [<ffffffff8106a8b0>] ? __init_kthread_worker+0x40/0x40
>
> The process started "running" again after some time, but it's excruciatingly
> slow, with speeds of about 40 KB/s.
> All ceph processes seem to be mostly idle.
>
> From the backtrace I'm not sure whether this could be a network adapter
> problem, since I see some bnx2x_ locking functions, but the network seems
> to be running fine otherwise and I didn't have any issues until I tried
> heavily using RBD.
>
> If I can provide some more information, please let me know.

Can you watch the osd sockets in netstat for a while and describe what you
are seeing, or forward a few representative samples?

Thanks,

                Ilya
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
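For anyone wanting to collect the samples Ilya asks for, a minimal sketch of
one way to do it is below. It is not prescribed anywhere in the thread: it
assumes the OSDs use Ceph's default 6800-7300 port range and that netstat is
available, and the 5-second interval and log path are arbitrary choices.

    # Sample TCP sockets whose local or remote port falls in the default OSD
    # port range (6800-7300) every 5 seconds, so Recv-Q/Send-Q and connection
    # states can be compared over time. Run on the client node (and/or on the
    # OSD nodes); stop with Ctrl-C. Root is needed for the -p (program) column.
    while true; do
        date
        netstat -tnp 2>/dev/null | awk '$4 ~ /:(6[89][0-9][0-9]|7[0-2][0-9][0-9]|7300)$/ || $5 ~ /:(6[89][0-9][0-9]|7[0-2][0-9][0-9]|7300)$/'
        sleep 5
    done >> /tmp/osd-sockets.log

Roughly speaking, a Send-Q that keeps growing on one connection would point at
that particular OSD, while queues that stay near zero while dd is stuck would
suggest the stall is not in the sockets themselves.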