Hi,
we are in the process of migrating our hosts to bluestore. Each host has
12 HDDs (a mix of 6 TB and 4 TB drives) and two Intel P3700 NVMe SSDs with
375 GB capacity.
The new bluestore OSDs are created by ceph-volume:

ceph-volume lvm create --bluestore --block.db /dev/nvmeXn1pY --data /dev/sdX1
Six OSDs share one SSD, each with a 30 GB partition for rocksdb; the
remaining space on each SSD is used as an additional SSD-based OSD without
specifying a separate block.db partition (roughly as sketched below).
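For illustration, the per-host layout corresponds roughly to the following
(a sketch only; the sgdisk usage, device names and partition numbers are
assumptions, not the exact commands we ran):

# 30 GB rocksdb partition for each of six HDD OSDs, leftover space as partition 7
for i in 1 2 3 4 5 6; do
    sgdisk --new=${i}:0:+30G /dev/nvme0n1
done
sgdisk --largest-new=7 /dev/nvme0n1

# one HDD-based OSD per rocksdb partition (sdX is a placeholder)
ceph-volume lvm create --bluestore --block.db /dev/nvme0n1p1 --data /dev/sdX1
# ... repeated for the other five HDDs and /dev/nvme0n1p2 ... p6

# the leftover partition becomes a separate SSD-based OSD
ceph-volume lvm create --bluestore --data /dev/nvme0n1p7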
Backfilling from the other nodes works fine for the HDD-based OSDs, but
is _really_ slow for the SSD-based ones. With filestore, moving our
cephfs metadata pool around was a matter of 10 minutes (350 MB, 8 million
objects, 1024 PGs). With bluestore, the remapped part of the pool (about
400 PGs, those affected by adding a new pair of SSD-based OSDs) did not
finish overnight...
OSD config section from ceph.conf:
[osd]
osd_scrub_sleep = 0.05
osd_journal_size = 10240
osd_scrub_chunk_min = 1
osd_scrub_chunk_max = 1
max_pg_per_osd_hard_ratio = 4.0
osd_max_pg_per_osd_hard_ratio = 4.0
bluestore_cache_size_hdd = 5368709120
mon_max_pg_per_osd = 400
Backfilling runs with max-backfills set to 20 during the day and 50 at
night, switched roughly as sketched below.
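The switch is essentially just injecting the option into all OSDs (the
scheduling around it is paraphrased, not our exact setup):

# day, e.g. from cron in the morning
ceph tell osd.* injectargs '--osd-max-backfills 20'
# night, e.g. from cron in the evening
ceph tell osd.* injectargs '--osd-max-backfills 50'

Some numbers from ceph pg dump for the most advanced backfilling cephfs
metadata PG, ten seconds apart: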
ceph pg dump | grep backfilling | grep -v undersized | sort -k4 -n -r | tail -n 1 \
  && sleep 10 && echo \
  && ceph pg dump | grep backfilling | grep -v undersized | sort -k4 -n -r | tail -n 1
dumped all
8.101 7581 0 0 4549 0 4194304 2488 2488 active+remapped+backfilling 2017-12-21 09:03:30.429605 543240'1012998 543248:1923733 [78,34,49] 78 [78,34,19] 78 522371'1009118 2017-12-18 16:11:29.755231 522371'1009118 2017-12-18 16:11:29.755231

dumped all
8.101 7580 0 0 4542 0 0 2489 2489 active+remapped+backfilling 2017-12-21 09:03:30.429605 543248'1012999 543250:1923755 [78,34,49] 78 [78,34,19] 78 522371'1009118 2017-12-18 16:11:29.755231 522371'1009118 2017-12-18 16:11:29.755231
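The seven-object figure is the drop in the misplaced count, i.e. the fifth
field, from 4549 to 4542. A minimal sketch for sampling just that rate
(assuming the fifth field of ceph pg dump is MISPLACED, as in the output
above; the PG id is the one from our output):

# sample the misplaced-object count of one PG twice, 10 seconds apart
PG=8.101
A=$(ceph pg dump 2>/dev/null | awk -v pg="$PG" '$1 == pg {print $5}')
sleep 10
B=$(ceph pg dump 2>/dev/null | awk -v pg="$PG" '$1 == pg {print $5}')
echo "$((A - B)) objects backfilled in 10 s"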
Seven objects in ten seconds does not sound sane to me, given that for
cephfs metadata essentially only omap key/value data has to be transferred.
Any hints on how to tune this?
Regards,
Burkhard