Hi guys, Following up on my previous message. I've tried to repeat the same experiment with 3.16.2 kernel and I still have the same behaviour. The dd process got stuck after running dd for 3 times in a row. The iostat shows that the rbd0 device is fully utilised, but without any activity on the device itself: Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util rbd0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 135.00 0.00 0.00 0.00 0.00 100.00 I've also tried enabling and disabling rbd cache, which didn't make a difference. Could someone help me with debugging the issue and getting to the root cause? Thanks Andrei ----- Original Message ----- From: "Andrei Mikhailovsky" <andrei@xxxxxxxxxx> To: ceph-users at lists.ceph.com Sent: Sunday, 14 September, 2014 12:04:15 AM Subject: writing to rbd mapped device produces hang tasks Hello guys, I've been trying to map an rbd disk to run some testing and I've noticed that while I can successfully read from the rbd image mapped to /dev/rbdX, I am failing to reliably write to it. Sometimes write tests work perfectly well, especially if I am using large block sizes. But often writes hang for a considerable amount of time and I have kernel hang task messages (shown below) in my dmesg. the hang tasks show particularly frequently when using 4K block size. However, with large block sizes writes also sometimes hang, but for sure less frequent I am using simple dd tests like: dd if=/dev/zero of=<mountpoint/file> bs=4K count=250K. I am running firefly on Ubuntu 12.04 on all osd/mon servers. The rbd image is mapped on one of the osd servers. All osd servers are running kernel version 3.15.10-031510-generic. Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.439974] INFO: task jbd2/rbd0-8:3505 blocked for more than 120 seconds. Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.441586] Not tainted 3.15.10-031510-generic #201408132333 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.443022] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444862] jbd2/rbd0-8 D 0000000000000003 0 3505 2 0x00000000 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444870] ffff8803a10a7c48 0000000000000002 ffff88007963b288 ffff8803a10a7fd8 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444874] 0000000000014540 0000000000014540 ffff880469f63260 ffff880866969930 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444876] ffff8803a10a7c58 ffff8803a10a7d88 ffff88034d142100 ffff880848519824 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444879] Call Trace: Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444893] [<ffffffff81778449>] schedule+0x29/0x70 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444901] [<ffffffff812a0360>] jbd2_journal_commit_transaction+0x240/0x1510 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444908] [<ffffffff810a5045>] ? sched_clock_cpu+0x85/0xc0 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444920] [<ffffffff810a5cba>] ? arch_vtime_task_switch+0x8a/0x90 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444923] [<ffffffff810a5cfd>] ? vtime_common_task_switch+0x3d/0x50 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444928] [<ffffffff810b5420>] ? __wake_up_sync+0x20/0x20 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444933] [<ffffffff81079f9f>] ? try_to_del_timer_sync+0x4f/0x70 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444938] [<ffffffff812a4e88>] kjournald2+0xb8/0x240 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444941] [<ffffffff810b5420>] ? __wake_up_sync+0x20/0x20 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444943] [<ffffffff812a4dd0>] ? commit_timeout+0x10/0x10 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444949] [<ffffffff81091439>] kthread+0xc9/0xe0 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444952] [<ffffffff81091370>] ? flush_kthread_worker+0xb0/0xb0 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444965] [<ffffffff8178567c>] ret_from_fork+0x7c/0xb0 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444969] [<ffffffff81091370>] ? flush_kthread_worker+0xb0/0xb0 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.445141] INFO: task dd:21180 blocked for more than 120 seconds. Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.446595] Not tainted 3.15.10-031510-generic #201408132333 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.448070] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.449910] dd D 0000000000000002 0 21180 19562 0x00000002 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.449913] ffff880485a1b5d8 0000000000000002 ffff880485a1b5e8 ffff880485a1bfd8 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.449916] 0000000000014540 0000000000014540 ffff880341833260 ffff88011086cb90 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.449919] ffff880485a1b5a8 ffff88046fc94e40 ffff88011086cb90 ffffffff81204da0 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.449921] Call Trace: Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.449927] [<ffffffff81204da0>] ? __wait_on_buffer+0x30/0x30 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.449930] [<ffffffff81778449>] schedule+0x29/0x70 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.449934] [<ffffffff8177851f>] io_schedule+0x8f/0xd0 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.449936] [<ffffffff81204dae>] sleep_on_buffer+0xe/0x20 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.449940] [<ffffffff81778be2>] __wait_on_bit+0x62/0x90 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.449945] [<ffffffff8120a110>] ? bio_alloc_bioset+0xa0/0x1d0 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.449947] [<ffffffff81204da0>] ? __wait_on_buffer+0x30/0x30 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.449951] [<ffffffff81778c8c>] out_of_line_wait_on_bit+0x7c/0x90 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.449954] [<ffffffff810b54a0>] ? wake_atomic_t_function+0x40/0x40 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.449957] [<ffffffff81204d9e>] __wait_on_buffer+0x2e/0x30 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.449962] [<ffffffff8124eb28>] ext4_wait_block_bitmap+0xd8/0xe0 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.449969] [<ffffffff812888de>] ext4_mb_init_cache+0x1de/0x750 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.449974] [<ffffffff8115fb65>] ? find_or_create_page+0x65/0xb0 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.449977] [<ffffffff81288f0e>] ext4_mb_init_group+0xbe/0x130 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.449984] [<ffffffff8134bb59>] ? __get_request+0x339/0x560 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.449987] [<ffffffff8128948c>] ext4_mb_load_buddy+0x33c/0x360 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.449990] [<ffffffff81289bf3>] ext4_mb_find_by_goal+0xa3/0x310 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.449993] [<ffffffff8128a60e>] ext4_mb_regular_allocator+0x5e/0x450 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.449996] [<ffffffff8127fcf0>] ? ext4_ext_find_extent+0x220/0x2a0 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.449999] [<ffffffff8128db48>] ext4_mb_new_blocks+0x3f8/0x570 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.450002] [<ffffffff8127fbf0>] ? ext4_ext_find_extent+0x120/0x2a0 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.450004] [<ffffffff8127cabc>] ? ext4_ext_check_overlap.isra.28+0xbc/0xd0 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.450007] [<ffffffff812839a3>] ext4_ext_map_blocks+0xa13/0xb80 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.450010] [<ffffffff81290800>] ? mext_replace_branches+0x220/0x490 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.450014] [<ffffffff8125781f>] ext4_map_blocks+0x15f/0x550 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.450017] [<ffffffff81256110>] ? mpage_prepare_extent_to_map+0x2d0/0x320 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.450020] [<ffffffff81257c7e>] mpage_map_one_extent+0x6e/0x1a0 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.450023] [<ffffffff8125c34b>] mpage_map_and_submit_extent+0x4b/0x1e0 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.450026] [<ffffffff8125ca07>] ext4_writepages+0x527/0x7b0 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.450032] [<ffffffff8109f1f5>] ? check_preempt_curr+0x75/0xa0 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.450035] [<ffffffff810a2742>] ? default_wake_function+0x12/0x20 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.450038] [<ffffffff810b5436>] ? autoremove_wake_function+0x16/0x40 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.450044] [<ffffffff8116b3a0>] do_writepages+0x20/0x40 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.450046] [<ffffffff81160229>] __filemap_fdatawrite_range+0x59/0x60 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.450049] [<ffffffff81160abc>] filemap_flush+0x1c/0x20 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.450051] [<ffffffff81258129>] ext4_alloc_da_blocks+0x49/0xa0 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.450055] [<ffffffff81250d39>] ext4_release_file+0x79/0xc0 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.450059] [<ffffffff811d477d>] __fput+0xbd/0x250 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.450061] [<ffffffff811d495e>] ____fput+0xe/0x10 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.450066] [<ffffffff8108df37>] task_work_run+0xa7/0xe0 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.450072] [<ffffffff81013e67>] do_notify_resume+0xc7/0xd0 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.450075] [<ffffffff817859ea>] int_signal+0x12/0x17 Sep 13 21:24:45 arh-ibstorage2-ib kernel: [11894.974467] perf interrupt took too long (5003 > 5000), lowering kernel.perf_event_max_sample_rate to 25000 Has anyone notices similar behavior? Are there any cases where people use rbd mapped images in production? Thanks Andrei _______________________________________________ ceph-users mailing list ceph-users at lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.ceph.com/pipermail/ceph-users-ceph.com/attachments/20140914/ad7f0faf/attachment.htm>