Hello! I have a 3-node Ceph cluster running Ubuntu 14.04.3 with Hammer 0.94.2 from the ubuntu-cloud repository. My config and CRUSH map are attached below.

After attaching a volume created with Cinder, any of my OpenStack instances hangs after a short period of time with an "[sda]: abort" message in the VM's kernel log. When I map the volume directly on my compute node with

rbd map --name client.openstack --keyfile client.openstack.key openstack-hdd/volume-da53d8d0-b361-4697-94ed-218b92c1541e

I see the same thing: a small amount of data is written and then everything hangs:

Sep 15 16:36:24 compute001 kernel: [ 1620.258823] Key type ceph registered
Sep 15 16:36:24 compute001 kernel: [ 1620.259143] libceph: loaded (mon/osd proto 15/24)
Sep 15 16:36:24 compute001 kernel: [ 1620.263448] rbd: loaded (major 251)
Sep 15 16:36:24 compute001 kernel: [ 1620.264948] libceph: client13757843 fsid b490cb36-ab9b-4dd1-b3bf-c022061a977e
Sep 15 16:36:24 compute001 kernel: [ 1620.265359] libceph: mon2 10.0.66.3:6789 session established
Sep 15 16:36:24 compute001 kernel: [ 1620.275268] rbd0: p1
Sep 15 16:36:24 compute001 kernel: [ 1620.275484] rbd: rbd0: added with size 0xe600000
Sep 15 16:41:24 compute001 kernel: [ 1920.445112] INFO: task fio:31185 blocked for more than 120 seconds.
Sep 15 16:41:24 compute001 kernel: [ 1920.445484] Not tainted 3.16.0-49-generic #65~14.04.1-Ubuntu
Sep 15 16:41:24 compute001 kernel: [ 1920.445835] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 15 16:41:24 compute001 kernel: [ 1920.446286] fio D ffff881fffab30c0 0 31185 1 0x00000004
Sep 15 16:41:24 compute001 kernel: [ 1920.446295] ffff881fba167b60 0000000000000046 ffff881fac8cbd20 ffff881fba167fd8
Sep 15 16:41:24 compute001 kernel: [ 1920.446302] 00000000000130c0 00000000000130c0 ffff881fd2a18a30 ffff881fba167c88
Sep 15 16:41:24 compute001 kernel: [ 1920.446308] ffff881fba167c90 7fffffffffffffff ffff881fac8cbd20 ffff881fac8cbd20
Sep 15 16:41:24 compute001 kernel: [ 1920.446315] Call Trace:
Sep 15 16:41:24 compute001 kernel: [ 1920.446333] [<ffffffff8176aa19>] schedule+0x29/0x70
Sep 15 16:41:24 compute001 kernel: [ 1920.446338] [<ffffffff81769df9>] schedule_timeout+0x229/0x2a0
Sep 15 16:41:24 compute001 kernel: [ 1920.446350] [<ffffffff810b4c54>] ? __wake_up+0x44/0x50
Sep 15 16:41:24 compute001 kernel: [ 1920.446357] [<ffffffff810d4158>] ? __call_rcu_nocb_enqueue+0xc8/0xd0
Sep 15 16:41:24 compute001 kernel: [ 1920.446363] [<ffffffff8176b516>] wait_for_completion+0xa6/0x160
Sep 15 16:41:24 compute001 kernel: [ 1920.446370] [<ffffffff810a1b30>] ? wake_up_state+0x20/0x20
Sep 15 16:41:24 compute001 kernel: [ 1920.446380] [<ffffffff8121dbb0>] exit_aio+0xe0/0xf0
Sep 15 16:41:24 compute001 kernel: [ 1920.446388] [<ffffffff8106ae40>] mmput+0x30/0x120
Sep 15 16:41:24 compute001 kernel: [ 1920.446395] [<ffffffff8107031c>] do_exit+0x26c/0xa60
Sep 15 16:41:24 compute001 kernel: [ 1920.446401] [<ffffffff810aafd2>] ? dequeue_entity+0x142/0x5c0
Sep 15 16:41:24 compute001 kernel: [ 1920.446407] [<ffffffff81070b8f>] do_group_exit+0x3f/0xa0
Sep 15 16:41:24 compute001 kernel: [ 1920.446416] [<ffffffff81080690>] get_signal_to_deliver+0x1d0/0x6f0
Sep 15 16:41:24 compute001 kernel: [ 1920.446426] [<ffffffff81012538>] do_signal+0x48/0xad0
Sep 15 16:41:24 compute001 kernel: [ 1920.446434] [<ffffffff81094a1a>] ? hrtimer_cancel+0x1a/0x30
Sep 15 16:41:24 compute001 kernel: [ 1920.446440] [<ffffffff8121d0f7>] ? read_events+0x207/0x230
Sep 15 16:41:24 compute001 kernel: [ 1920.446445] [<ffffffff81094420>] ? hrtimer_get_res+0x50/0x50
Sep 15 16:41:24 compute001 kernel: [ 1920.446451] [<ffffffff81013029>] do_notify_resume+0x69/0xb0
Sep 15 16:41:24 compute001 kernel: [ 1920.446459] [<ffffffff8176ed4a>] int_signal+0x12/0x17
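Reproducing it from the compute node is simple. The fio invocation below is only an illustration of the kind of write I run against the mapped device, not my exact job:

rbd map --name client.openstack --keyfile client.openstack.key openstack-hdd/volume-da53d8d0-b361-4697-94ed-218b92c1541e
fio --name=rbdtest --filename=/dev/rbd0 --ioengine=libaio --direct=1 --rw=write --bs=4M --size=512M

A small amount of data gets written, then fio hangs in D state as in the trace above.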
At the same time I have no problems with CephFS mounted on this host via FUSE.

I rebuilt my cluster with an almost default config and ended up with strange behavior: with the CRUSH map named "crush-good" the cluster works fine. As soon as I remove the unused root "default", or even just the OSDs from the hosts in that root, the problem comes back. Adding the OSDs and hosts back to the "default" root fixes it again.

The hosts storage00[1-3] are listed in /etc/hosts, and [ssd|hdd]-st00[1-3] are listed there as well with their public IPs, even though I know this is not necessary.

All OSDs run on ext4, created with

mkfs.ext4 -L osd-[n] -m0 -Tlargefile /dev/drive

and mounted with noatime. All journals live on separate SSDs, two per host (one for the SSD OSDs, one for the HDD OSDs), created as 24 GB partitions.

crush-good (almost a copy of the example from the Ceph site :)):

# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0
device 10 osd.10
device 11 osd.11
device 12 osd.12
device 20 osd.20
device 30 osd.30
device 31 osd.31
device 32 osd.32
device 40 osd.40
device 50 osd.50
device 51 osd.51
device 52 osd.52

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host ssd-st001 {
    id -1        # do not change unnecessarily
    # weight 1.000
    alg straw
    hash 0    # rjenkins1
    item osd.0 weight 1.000
}
host ssd-st002 {
    id -2        # do not change unnecessarily
    # weight 1.000
    alg straw
    hash 0    # rjenkins1
    item osd.20 weight 1.000
}
host ssd-st003 {
    id -3        # do not change unnecessarily
    # weight 1.000
    alg straw
    hash 0    # rjenkins1
    item osd.40 weight 1.000
}
host hdd-st001 {
    id -4        # do not change unnecessarily
    # weight 3.000
    alg straw
    hash 0    # rjenkins1
    item osd.10 weight 1.000
    item osd.11 weight 1.000
    item osd.12 weight 1.000
}
host hdd-st002 {
    id -5        # do not change unnecessarily
    # weight 3.000
    alg straw
    hash 0    # rjenkins1
    item osd.30 weight 1.000
    item osd.31 weight 1.000
    item osd.32 weight 1.000
}
host hdd-st003 {
    id -6        # do not change unnecessarily
    # weight 3.000
    alg straw
    hash 0    # rjenkins1
    item osd.51 weight 1.000
    item osd.52 weight 1.000
    item osd.50 weight 1.000
}
root hdd {
    id -7        # do not change unnecessarily
    # weight 9.000
    alg straw
    hash 0    # rjenkins1
    item hdd-st001 weight 3.000
    item hdd-st002 weight 3.000
    item hdd-st003 weight 3.000
}
root ssd {
    id -8        # do not change unnecessarily
    # weight 3.000
    alg straw
    hash 0    # rjenkins1
    item ssd-st001 weight 1.000
    item ssd-st002 weight 1.000
    item ssd-st003 weight 1.000
}
host storage001 {
    id -9        # do not change unnecessarily
    # weight 4.000
    alg straw2
    hash 0    # rjenkins1
    item osd.0 weight 1.000
    item osd.10 weight 1.000
    item osd.11 weight 1.000
    item osd.12 weight 1.000
}
host storage002 {
    id -11        # do not change unnecessarily
    # weight 4.000
    alg straw2
    hash 0    # rjenkins1
    item osd.20 weight 1.000
    item osd.30 weight 1.000
    item osd.31 weight 1.000
    item osd.32 weight 1.000
}
host storage003 {
    id -12        # do not change unnecessarily
    # weight 4.000
    alg straw2
    hash 0    # rjenkins1
    item osd.52 weight 1.000
    item osd.51 weight 1.000
    item osd.50 weight 1.000
    item osd.40 weight 1.000
}
root default {
    id -10        # do not change unnecessarily
    # weight 12.000
    alg straw2
    hash 0    # rjenkins1
    item storage001 weight 4.000
    item storage002 weight 4.000
    item storage003 weight 4.000
}

# rules
rule data {
    ruleset 0
    type replicated
    min_size 2
    max_size 2
    step take hdd
    step chooseleaf firstn 0 type host
    step emit
}
rule metadata {
    ruleset 1
    type replicated
    min_size 0
    max_size 10
    step take hdd
    step chooseleaf firstn 0 type host
    step emit
}
rule rbd {
    ruleset 2
    type replicated
    min_size 0
    max_size 10
    step take hdd
    step chooseleaf firstn 0 type host
    step emit
}
rule hdd {
    ruleset 3
    type replicated
    min_size 0
    max_size 10
    step take hdd
    step chooseleaf firstn 0 type host
    step emit
}
rule ssd {
    ruleset 4
    type replicated
    min_size 0
    max_size 4
    step take ssd
    step chooseleaf firstn 0 type host
    step emit
}
rule ssd-primary {
    ruleset 5
    type replicated
    min_size 5
    max_size 10
    step take ssd
    step chooseleaf firstn 1 type host
    step emit
    step take hdd
    step chooseleaf firstn -1 type host
    step emit
}

# end crush map
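When switching between maps for these tests I go through crushtool, roughly like this (just a sketch, the file names are arbitrary; the --test line sanity-checks ruleset 3 with two replicas, matching the openstack-hdd pool listed further down):

ceph osd getcrushmap -o crush.bin
crushtool -d crush.bin -o crush.txt
# edit crush.txt (drop or re-add the "default" root), then recompile and check:
crushtool -c crush.txt -o crush.new
crushtool -i crush.new --test --rule 3 --num-rep 2 --show-bad-mappings
ceph osd setcrushmap -i crush.new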
My ceph.conf:

[global]
fsid = 85456792-2ded-4d61-a021-20f6038f2dee
mon_initial_members = storage001,storage002,storage003
public_network = 10.0.66.0/24
cluster_network = 10.0.65.0/24
auth cluster required = none
auth service required = none
auth client required = none
filestore_xattr_use_omap = true
osd_pool_default_size = 2
osd_pool_default_min_size = 1
osd_pool_default_pg_num = 256
osd_pool_default_pgp_num = 256

[mds]
mds_data = /var/lib/ceph/mds/mds.$id
keyring = /var/lib/ceph/mds/mds.$id/mds.$id.keyring

[mds.a]
public_addr = 10.0.66.1
cluster_addr = 10.0.65.1
host = storage001

[mds.b]
public_addr = 10.0.66.2
cluster_addr = 10.0.65.2
host = storage002

[mds.c]
public_addr = 10.0.66.3
cluster_addr = 10.0.65.3
host = storage003

[mon]
mon_host = 10.0.66.1,10.0.66.2,10.0.66.3

[mon.storage001]
mon_addr = 10.0.66.1
host = storage001

[mon.storage002]
mon_addr = 10.0.66.2
host = storage002

[mon.storage003]
mon_addr = 10.0.66.3
host = storage003

[client.openstack]
keyring = /etc/ceph/client.openstack.keyring

[osd]
osd_crush_update_on_start = false

[osd.0]
cluster_addr = 10.0.65.1
public_addr = 10.0.66.1
host = storage001

[osd.10]
cluster_addr = 10.0.65.1
public_addr = 10.0.66.1
host = storage001

[osd.11]
cluster_addr = 10.0.65.1
public_addr = 10.0.66.1
host = storage001

[osd.12]
cluster_addr = 10.0.65.1
public_addr = 10.0.66.1
host = storage001

[osd.20]
cluster_addr = 10.0.65.2
public_addr = 10.0.66.2
host = storage002

[osd.30]
cluster_addr = 10.0.65.2
public_addr = 10.0.66.2
host = storage002

[osd.31]
cluster_addr = 10.0.65.2
public_addr = 10.0.66.2
host = storage002

[osd.32]
cluster_addr = 10.0.65.2
public_addr = 10.0.66.2
host = storage002

[osd.40]
cluster_addr = 10.0.65.3
public_addr = 10.0.66.3
host = storage003

[osd.50]
cluster_addr = 10.0.65.3
public_addr = 10.0.66.3
host = storage003

[osd.51]
cluster_addr = 10.0.65.3
public_addr = 10.0.66.3
host = storage003

[osd.52]
cluster_addr = 10.0.65.3
public_addr = 10.0.66.3
host = storage003

My pools:

pool 0 'rbd' replicated size 2 min_size 1 crush_ruleset 3 object_hash rjenkins pg_num 64 pgp_num 64 last_change 92 flags hashpspool stripe_width 0
pool 4 'openstack-img' replicated size 2 min_size 1 crush_ruleset 3 object_hash rjenkins pg_num 512 pgp_num 512 last_change 187 flags hashpspool stripe_width 0
pool 5 'openstack-hdd' replicated size 2 min_size 1 crush_ruleset 3 object_hash rjenkins pg_num 512 pgp_num 512 last_change 114 flags hashpspool stripe_width 0
pool 6 'openstack-ssd' replicated size 2 min_size 1 crush_ruleset 4 object_hash rjenkins pg_num 512 pgp_num 512 last_change 118 flags hashpspool stripe_width 0
pool 7 'cephfs_metadata' replicated size 2 min_size 1 crush_ruleset 3 object_hash rjenkins pg_num 64 pgp_num 64 last_change 141 flags hashpspool stripe_width 0
pool 8 'cephfs_data' replicated size 2 min_size 1 crush_ruleset 3 object_hash rjenkins pg_num 128 pgp_num 128 last_change 145 flags hashpspool crash_replay_interval 45 stripe_width 0

The first pool was created by the Ceph setup itself and is not used by me; I have only changed its ruleset to 3.
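That ruleset change was just a single command (pool 'rbd' as listed above):

ceph osd pool set rbd crush_ruleset 3

To see where a rule would place a given object I use ceph osd map, for example (the object name here is arbitrary and does not have to exist):

ceph osd map openstack-hdd test-object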
So, why do I need the "default" root with OSDs in it? And why is this not described in the docs? Or maybe I am misunderstanding something?

-- 
WBR, Max A. Krasilnikov