Re: Luminous with osd flapping, slow requests when deep scrubbing

Hi Andrei,

We have been using the script from [1] to define the number of PGs to deep-scrub in parallel. We currently use MAXSCRUBS=4; you could start with 1 to minimize the performance impact.
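
For reference, a minimal sketch of what such a cronjob can look like. This is not the exact script from [1]; the JSON field names and the jq-based selection are assumptions based on Luminous' "ceph pg dump" output:

#!/bin/bash
# Sketch: deep-scrub the PGs whose last deep scrub is oldest,
# at most MAXSCRUBS at a time.
MAXSCRUBS=4   # number of PGs to deep-scrub in parallel; start with 1

ceph pg dump --format json 2>/dev/null \
  | jq -r '.pg_stats | sort_by(.last_deep_scrub_stamp) | .[].pgid' \
  | head -n "$MAXSCRUBS" \
  | while read -r pg; do
      ceph pg deep-scrub "$pg"
    done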

And these are the scrub settings from our ceph.conf:

ceph:~ # grep scrub /etc/ceph/ceph.conf
osd_scrub_begin_hour = 0
osd_scrub_end_hour = 7
osd_scrub_sleep = 0.1
osd_deep_scrub_interval = 2419200

The osd_deep_scrub_interval is set to 4 weeks so that it doesn't interfere with our own interval, which is defined by the cronjob: it deep-scrubs a quarter of the PGs four times a week, so every PG has been deep-scrubbed within one week.
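
As a sanity check of the numbers: 2419200 s = 4 * 7 * 24 * 3600, i.e. exactly 4 weeks. The scheduling itself then lives entirely in cron; a hypothetical crontab layout (the script name and days are made up, the midnight start matches the osd_scrub_begin_hour = 0 window above) could be:

# Four runs per week, each working through roughly a quarter of the
# PGs (oldest deep-scrub stamp first), so all PGs are covered weekly.
0 0 * * 0,2,4,6  /usr/local/bin/deep-scrub-pgs.sh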

Regards,
Eugen

[1] https://www.formann.de/2015/05/cronjob-to-enable-timed-deep-scrubbing-in-a-ceph-cluster/


Quoting Andrei Mikhailovsky <andrei@xxxxxxxxxx>:

Hello,

I am currently running Luminous 12.2.8 on Ubuntu with the 4.15.0-36-generic kernel from the official Ubuntu repo. The cluster has 4 mon + osd servers. Each osd server has a total of 9 spinning osds and 1 ssd, serving the hdd and ssd pools respectively. The hdds are backed by S3710 ssds for journaling at a ratio of 1:5. The ssd pool osds do not use external journals. Ceph is used as primary storage for CloudStack; all vm disk images are stored on the cluster.

I have recently migrated all osds to bluestore, which was a long process with ups and downs, but I am happy to say the migration is done. During the migration I disabled scrubbing (both deep and standard). After re-enabling scrubbing I noticed the cluster started having a large number of slow requests and poor client IO (to the point of vms stalling for minutes). Further investigation showed that the slow requests happen because the osds are flapping: in a single day my logs have over 1000 entries reporting an osd going down, and this affects random osds. Disabling deep scrubbing stabilises the cluster; the osds no longer flap and the slow requests disappear. As a short-term solution I've kept deep scrubbing disabled, but I was hoping to fix the issue with your help.
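
For reference, scrubbing can be toggled cluster-wide via the noscrub / nodeep-scrub flags (the flags are standard; the log path and grep pattern below are assumptions for a default setup):

# Disable scrubbing cluster-wide, and re-enable it later:
ceph osd set noscrub
ceph osd set nodeep-scrub
ceph osd unset noscrub
ceph osd unset nodeep-scrub

# Rough count of "marked down" events in the cluster log:
grep -c 'marked down' /var/log/ceph/ceph.log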

At the moment, I am running the cluster with default settings apart from the following settings:

[global]
osd_disk_thread_ioprio_priority = 7
osd_disk_thread_ioprio_class = idle
osd_recovery_op_priority = 1

[osd]
debug_ms = 0
debug_auth = 0
debug_osd = 0
debug_bluestore = 0
debug_bluefs = 0
debug_bdev = 0
debug_rocksdb = 0


Could you share your experiences with deep scrubbing of bluestore osds? Are there any options I should set to keep the osds from flapping while client IO remains available?

Thanks

Andrei



_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


