Re: Luminous with osd flapping, slow requests when deep scrubbing

Christian Balzer <chibi@xxxxxxx> · Tue, 16 Oct 2018 16:51:36 +0900

Hello,

On Mon, 15 Oct 2018 12:26:50 +0100 (BST) Andrei Mikhailovsky wrote:

> Hello, 
> 
> I am currently running Luminous 12.2.8 on Ubuntu with the 4.15.0-36-generic kernel from the official Ubuntu repo. The cluster has 4 mon + osd servers. Each osd server has a total of 9 spinning osds and 1 ssd, for the hdd and ssd pools respectively. The hdds are backed by S3710 ssds for journaling with a ratio of 1:5. The ssd pool osds are not using external journals. Ceph is used as primary storage for Cloudstack - all vm disk images are stored on the cluster.
>

For the record, are you seeing the flapping only on HDD pools or with SSD
pools as well?

When migrating to Bluestore, did you see this starting to happen before
the migration was complete (and just on Bluestore OSDs of course)?

What's your HW like, in particular RAM? Current output of "free"?

If you didn't tune your bluestore cache you're likely just using a
fraction of the RAM for caching, making things a LOT harder for OSDs to
keep up when compared to filestore and the global (per node) page cache.

See the various bluestore cache threads on this list, one of them quite
recent.
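
As a rough illustration of what such tuning could look like in ceph.conf,
here is a minimal sketch. The option names are the Luminous bluestore
cache settings; the values are purely hypothetical and have to be sized
against the RAM actually free on each node (10 OSDs per node in this
case), keeping in mind that an OSD process uses noticeably more memory
than its configured cache size.

```ini
[osd]
# Luminous defaults are 1 GiB per HDD OSD and 3 GiB per SSD OSD.
# The values below are example numbers only, not a recommendation.
bluestore_cache_size_hdd = 4294967296   ; 4 GiB per HDD-backed OSD (assumed value)
bluestore_cache_size_ssd = 4294967296   ; 4 GiB per SSD-backed OSD (assumed value)
```

With 9 HDD OSDs and 1 SSD OSD per node, these example values would add
up to roughly 40 GiB of cache per node, so check "free" first.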

If your cluster was already close to the brink with filestore, just moving
it to bluestore would fit nicely with what you're seeing, especially given
how stressful and cache bypassing bluestore deep scrubbing is.

Regards,

Christian
> I have recently migrated all osds to bluestore, which was a long process with ups and downs, but I am happy to say that the migration is done. During the migration I disabled scrubbing (both deep and standard). After re-enabling scrubbing I noticed the cluster started having a large number of slow requests and poor client IO (to the point of vms stalling for minutes). Further investigation showed that the slow requests happen because of the osds flapping. In a single day my logs have over 1000 entries reporting an osd going down. This affects random osds. Disabling deep scrubbing stabilises the cluster: the osds no longer flap and the slow requests disappear. As a short-term solution I've disabled deep scrubbing, but I was hoping to fix the issue with your help.
> 
> At the moment, I am running the cluster with default settings apart from the following settings: 
> 
> [global] 
> osd_disk_thread_ioprio_priority = 7 
> osd_disk_thread_ioprio_class = idle 
> osd_recovery_op_priority = 1 
> 
> [osd] 
> debug_ms = 0 
> debug_auth = 0 
> debug_osd = 0 
> debug_bluestore = 0 
> debug_bluefs = 0 
> debug_bdev = 0 
> debug_rocksdb = 0 
> 
> 
> Could you share experiences with deep scrubbing of bluestore osds? Are there any options that I should set to make sure the osds are not flapping and the client IO is still available? 
> 
> Thanks 
> 
> Andrei 

-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Rakuten Communications
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com