Strange rbd hang with non-standard crush location

Hello!

I have a 3-node Ceph cluster running Ubuntu 14.04.3 with Hammer 0.94.2 from the
ubuntu-cloud repository. My config and crush map are attached below.

After attaching a volume with Cinder, any of my OpenStack instances hangs after a
short period of time with an "[sda]: abort" message in the VM's kernel log. When
mapping the volume directly on my compute node with

rbd map --name client.openstack --keyfile client.openstack.key openstack-hdd/volume-da53d8d0-b361-4697-94ed-218b92c1541e

I see the same thing: a small amount of data gets written and then it hangs
(fio reproduction sketched after the trace):

Sep 15 16:36:24 compute001 kernel: [ 1620.258823] Key type ceph registered
Sep 15 16:36:24 compute001 kernel: [ 1620.259143] libceph: loaded (mon/osd proto 15/24)
Sep 15 16:36:24 compute001 kernel: [ 1620.263448] rbd: loaded (major 251)
Sep 15 16:36:24 compute001 kernel: [ 1620.264948] libceph: client13757843 fsid b490cb36-ab9b-4dd1-b3bf-c022061a977e
Sep 15 16:36:24 compute001 kernel: [ 1620.265359] libceph: mon2 10.0.66.3:6789 session established
Sep 15 16:36:24 compute001 kernel: [ 1620.275268]  rbd0: p1
Sep 15 16:36:24 compute001 kernel: [ 1620.275484] rbd: rbd0: added with size 0xe600000
Sep 15 16:41:24 compute001 kernel: [ 1920.445112] INFO: task fio:31185 blocked for more than 120 seconds.
Sep 15 16:41:24 compute001 kernel: [ 1920.445484]       Not tainted 3.16.0-49-generic #65~14.04.1-Ubuntu
Sep 15 16:41:24 compute001 kernel: [ 1920.445835] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 15 16:41:24 compute001 kernel: [ 1920.446286] fio             D ffff881fffab30c0     0 31185      1 0x00000004
Sep 15 16:41:24 compute001 kernel: [ 1920.446295]  ffff881fba167b60 0000000000000046 ffff881fac8cbd20 ffff881fba167fd8
Sep 15 16:41:24 compute001 kernel: [ 1920.446302]  00000000000130c0 00000000000130c0 ffff881fd2a18a30 ffff881fba167c88
Sep 15 16:41:24 compute001 kernel: [ 1920.446308]  ffff881fba167c90 7fffffffffffffff ffff881fac8cbd20 ffff881fac8cbd20
Sep 15 16:41:24 compute001 kernel: [ 1920.446315] Call Trace:
Sep 15 16:41:24 compute001 kernel: [ 1920.446333]  [<ffffffff8176aa19>] schedule+0x29/0x70
Sep 15 16:41:24 compute001 kernel: [ 1920.446338]  [<ffffffff81769df9>] schedule_timeout+0x229/0x2a0
Sep 15 16:41:24 compute001 kernel: [ 1920.446350]  [<ffffffff810b4c54>] ? __wake_up+0x44/0x50
Sep 15 16:41:24 compute001 kernel: [ 1920.446357]  [<ffffffff810d4158>] ? __call_rcu_nocb_enqueue+0xc8/0xd0
Sep 15 16:41:24 compute001 kernel: [ 1920.446363]  [<ffffffff8176b516>] wait_for_completion+0xa6/0x160
Sep 15 16:41:24 compute001 kernel: [ 1920.446370]  [<ffffffff810a1b30>] ? wake_up_state+0x20/0x20
Sep 15 16:41:24 compute001 kernel: [ 1920.446380]  [<ffffffff8121dbb0>] exit_aio+0xe0/0xf0
Sep 15 16:41:24 compute001 kernel: [ 1920.446388]  [<ffffffff8106ae40>] mmput+0x30/0x120
Sep 15 16:41:24 compute001 kernel: [ 1920.446395]  [<ffffffff8107031c>] do_exit+0x26c/0xa60
Sep 15 16:41:24 compute001 kernel: [ 1920.446401]  [<ffffffff810aafd2>] ? dequeue_entity+0x142/0x5c0
Sep 15 16:41:24 compute001 kernel: [ 1920.446407]  [<ffffffff81070b8f>] do_group_exit+0x3f/0xa0
Sep 15 16:41:24 compute001 kernel: [ 1920.446416]  [<ffffffff81080690>] get_signal_to_deliver+0x1d0/0x6f0
Sep 15 16:41:24 compute001 kernel: [ 1920.446426]  [<ffffffff81012538>] do_signal+0x48/0xad0
Sep 15 16:41:24 compute001 kernel: [ 1920.446434]  [<ffffffff81094a1a>] ? hrtimer_cancel+0x1a/0x30
Sep 15 16:41:24 compute001 kernel: [ 1920.446440]  [<ffffffff8121d0f7>] ? read_events+0x207/0x230
Sep 15 16:41:24 compute001 kernel: [ 1920.446445]  [<ffffffff81094420>] ? hrtimer_get_res+0x50/0x50
Sep 15 16:41:24 compute001 kernel: [ 1920.446451]  [<ffffffff81013029>] do_notify_resume+0x69/0xb0
Sep 15 16:41:24 compute001 kernel: [ 1920.446459]  [<ffffffff8176ed4a>] int_signal+0x12/0x17
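
The hang above came from a small direct write test against the mapped device; the
exact fio options below are illustrative rather than the literal command, but a
write of roughly this shape is enough to stall:

fio --name=rbdtest --filename=/dev/rbd0p1 --ioengine=libaio --iodepth=16 \
    --direct=1 --rw=write --bs=4k --size=64M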

At the same time, I have no problems with CephFS mounted on this host via FUSE.

I rebuilt my cluster with an almost default config and ended up with strange
behavior: with the crush map named "crush-good" (attached below), the cluster works
fine. When I remove the unused root "default", or even just the OSDs from the hosts
under that root, the problem comes back. Adding the OSDs and hosts back into the
"default" root fixes it (see the command sketch right after this paragraph).
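
By "removing" and "adding" I mean operations along these lines (shown here for
osd.0 on storage001; the exact invocations I used may have differed, and the other
OSDs are handled the same way):

ceph osd crush remove osd.0 storage001                       # take osd.0 out of the "default" tree only
ceph osd crush add osd.0 1.0 root=default host=storage001    # add it back under "default"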

Hosts storage00[1-3] are listed in /etc/hosts, and even [ssd|hdd]-st00[1-3] are
listed there with their public IPs, although I know this is not necessary.
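
Roughly like this (public IPs taken from the config below):

10.0.66.1   storage001 ssd-st001 hdd-st001
10.0.66.2   storage002 ssd-st002 hdd-st002
10.0.66.3   storage003 ssd-st003 hdd-st003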

All OSDs run on ext4, created with:
mkfs.ext4 -L osd-[n] -m0 -Tlargefile /dev/drive
and mounted with noatime.

All journals lie on separate SSDs, two per host (one for the SSD OSDs, one for the
HDD OSDs), created as 24 GB partitions.
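
For one OSD that means something like the following (the partition label and the
symlink approach are only illustrative, not the literal device names I use):

# point the filestore journal of osd.10 at its 24 GB SSD partition, then initialize it
ln -s /dev/disk/by-partlabel/journal-hdd-10 /var/lib/ceph/osd/ceph-10/journal
ceph-osd -i 10 --mkjournal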

crush-good (almost a copy of the example from the Ceph site :)):

# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0
device 10 osd.10
device 11 osd.11
device 12 osd.12
device 20 osd.20
device 30 osd.30
device 31 osd.31
device 32 osd.32
device 40 osd.40
device 50 osd.50
device 51 osd.51
device 52 osd.52

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host ssd-st001 {
	id -1		# do not change unnecessarily
	# weight 1.000
	alg straw
	hash 0	# rjenkins1
	item osd.0 weight 1.000
}
host ssd-st002 {
	id -2		# do not change unnecessarily
	# weight 1.000
	alg straw
	hash 0	# rjenkins1
	item osd.20 weight 1.000
}
host ssd-st003 {
	id -3		# do not change unnecessarily
	# weight 1.000
	alg straw
	hash 0	# rjenkins1
	item osd.40 weight 1.000
}
host hdd-st001 {
	id -4		# do not change unnecessarily
	# weight 3.000
	alg straw
	hash 0	# rjenkins1
	item osd.10 weight 1.000
	item osd.11 weight 1.000
	item osd.12 weight 1.000
}
host hdd-st002 {
	id -5		# do not change unnecessarily
	# weight 3.000
	alg straw
	hash 0	# rjenkins1
	item osd.30 weight 1.000
	item osd.31 weight 1.000
	item osd.32 weight 1.000
}
host hdd-st003 {
	id -6		# do not change unnecessarily
	# weight 3.000
	alg straw
	hash 0	# rjenkins1
	item osd.51 weight 1.000
	item osd.52 weight 1.000
	item osd.50 weight 1.000
}
root hdd {
	id -7		# do not change unnecessarily
	# weight 9.000
	alg straw
	hash 0	# rjenkins1
	item hdd-st001 weight 3.000
	item hdd-st002 weight 3.000
	item hdd-st003 weight 3.000
}
root ssd {
	id -8		# do not change unnecessarily
	# weight 3.000
	alg straw
	hash 0	# rjenkins1
	item ssd-st001 weight 1.000
	item ssd-st002 weight 1.000
	item ssd-st003 weight 1.000
}
host storage001 {
	id -9		# do not change unnecessarily
	# weight 4.000
	alg straw2
	hash 0	# rjenkins1
	item osd.0 weight 1.000
	item osd.10 weight 1.000
	item osd.11 weight 1.000
	item osd.12 weight 1.000
}
host storage002 {
	id -11		# do not change unnecessarily
	# weight 4.000
	alg straw2
	hash 0	# rjenkins1
	item osd.20 weight 1.000
	item osd.30 weight 1.000
	item osd.31 weight 1.000
	item osd.32 weight 1.000
}
host storage003 {
	id -12		# do not change unnecessarily
	# weight 4.000
	alg straw2
	hash 0	# rjenkins1
	item osd.52 weight 1.000
	item osd.51 weight 1.000
	item osd.50 weight 1.000
	item osd.40 weight 1.000
}
root default {
	id -10		# do not change unnecessarily
	# weight 12.000
	alg straw2
	hash 0	# rjenkins1
	item storage001 weight 4.000
	item storage002 weight 4.000
	item storage003 weight 4.000
}

# rules
rule data {
	ruleset 0
	type replicated
	min_size 2
	max_size 2
	step take hdd
	step chooseleaf firstn 0 type host
	step emit
}
rule metadata {
	ruleset 1
	type replicated
	min_size 0
	max_size 10
	step take hdd
	step chooseleaf firstn 0 type host
	step emit
}
rule rbd {
	ruleset 2
	type replicated
	min_size 0
	max_size 10
	step take hdd
	step chooseleaf firstn 0 type host
	step emit
}
rule hdd {
	ruleset 3
	type replicated
	min_size 0
	max_size 10
	step take hdd
	step chooseleaf firstn 0 type host
	step emit
}
rule ssd {
	ruleset 4
	type replicated
	min_size 0
	max_size 4
	step take ssd
	step chooseleaf firstn 0 type host
	step emit
}
rule ssd-primary {
	ruleset 5
	type replicated
	min_size 5
	max_size 10
	step take ssd
	step chooseleaf firstn 1 type host
	step emit
	step take hdd
	step chooseleaf firstn -1 type host
	step emit
}

# end crush map
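
I swap maps with the usual decompile/compile/inject cycle, roughly like this
(file names are arbitrary):

ceph osd getcrushmap -o crush.bin            # dump the current binary map
crushtool -d crush.bin -o crush-good         # decompile to text for editing
crushtool -c crush-good -o crush-good.bin    # compile the edited text map
ceph osd setcrushmap -i crush-good.bin       # inject it into the cluster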

My ceph.conf:

[global]
fsid = 85456792-2ded-4d61-a021-20f6038f2dee
mon_initial_members = storage001,storage002,storage003
public_network = 10.0.66.0/24
cluster_network = 10.0.65.0/24
auth cluster required = none
auth service required = none
auth client required = none
filestore_xattr_use_omap = true
osd_pool_default_size = 2
osd_pool_default_min_size = 1
osd_pool_default_pg_num = 256
osd_pool_default_pgp_num = 256

[mds]
mds_data = /var/lib/ceph/mds/mds.$id
keyring = /var/lib/ceph/mds/mds.$id/mds.$id.keyring

[mds.a]
public_addr = 10.0.66.1
cluster_addr = 10.0.65.1
host = storage001

[mds.b]
public_addr = 10.0.66.2
cluster_addr = 10.0.65.2
host = storage002

[mds.c]
public_addr = 10.0.66.3
cluster_addr = 10.0.65.3
host = storage003

[mon]
mon_host = 10.0.66.1,10.0.66.2,10.0.66.3

[mon.storage001]
mon_addr = 10.0.66.1
host = storage001

[mon.storage002]
mon_addr = 10.0.66.2
host = storage002

[mon.storage003]
mon_addr = 10.0.66.3
host = storage003

[client.openstack]
keyring = /etc/ceph/client.openstack.keyring

[osd]
osd_crush_update_on_start = false

[osd.0]
cluster_addr = 10.0.65.1
public_addr = 10.0.66.1
host = storage001

[osd.10]
cluster_addr = 10.0.65.1
public_addr = 10.0.66.1
host = storage001

[osd.11]
cluster_addr = 10.0.65.1
public_addr = 10.0.66.1
host = storage001

[osd.12]
cluster_addr = 10.0.65.1
public_addr = 10.0.66.1
host = storage001

[osd.20]
cluster_addr = 10.0.65.2
public_addr = 10.0.66.2
host = storage002

[osd.30]
cluster_addr = 10.0.65.2
public_addr = 10.0.66.2
host = storage002

[osd.31]
cluster_addr = 10.0.65.2
public_addr = 10.0.66.2
host = storage002

[osd.32]
cluster_addr = 10.0.65.2
public_addr = 10.0.66.2
host = storage002

[osd.40]
cluster_addr = 10.0.65.3
public_addr = 10.0.66.3
host = storage003

[osd.50]
cluster_addr = 10.0.65.3
public_addr = 10.0.66.3
host = storage003

[osd.51]
cluster_addr = 10.0.65.3
public_addr = 10.0.66.3
host = storage003

[osd.52]
cluster_addr = 10.0.65.3
public_addr = 10.0.66.3
host = storage003

My pools:

pool 0 'rbd' replicated size 2 min_size 1 crush_ruleset 3 object_hash rjenkins pg_num 64 pgp_num 64 last_change 92 flags hashpspool stripe_width 0
pool 4 'openstack-img' replicated size 2 min_size 1 crush_ruleset 3 object_hash rjenkins pg_num 512 pgp_num 512 last_change 187 flags hashpspool stripe_width 0
pool 5 'openstack-hdd' replicated size 2 min_size 1 crush_ruleset 3 object_hash rjenkins pg_num 512 pgp_num 512 last_change 114 flags hashpspool stripe_width 0
pool 6 'openstack-ssd' replicated size 2 min_size 1 crush_ruleset 4 object_hash rjenkins pg_num 512 pgp_num 512 last_change 118 flags hashpspool stripe_width 0
pool 7 'cephfs_metadata' replicated size 2 min_size 1 crush_ruleset 3 object_hash rjenkins pg_num 64 pgp_num 64 last_change 141 flags hashpspool stripe_width 0
pool 8 'cephfs_data' replicated size 2 min_size 1 crush_ruleset 3 object_hash rjenkins pg_num 128 pgp_num 128 last_change 145 flags hashpspool crash_replay_interval 45 stripe_width 0

The first one was created by the Ceph setup and is not used by me; I have only changed its ruleset to 3.
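
The ruleset change was just the usual pool set, e.g.:

ceph osd pool set rbd crush_ruleset 3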

So, why do I need the "default" root with OSDs in it? And why is this not described in the docs? Or am I misunderstanding something?

-- 
WBR, Max A. Krasilnikov