Re: single OSDs cause cluster hiccups

Hi Igor,

Thanks for your reply.
I can verify that discard is disabled in our cluster:

10:03 root@node106b [fra]:~# ceph daemon osd.417 config show | grep discard
    "bdev_async_discard": "false",
    "bdev_enable_discard": "false",
[...]

So there must be something else causing the problems.
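Next time one of the OSDs blocks I'll try to catch it in the act through the admin socket, roughly like this (just a sketch; osd.417 is simply the most recently affected OSD):

    ceph daemon osd.417 dump_blocked_ops        # ops blocked longer than the warn threshold
    ceph daemon osd.417 dump_ops_in_flight      # all in-flight ops with their age and current state
    ceph daemon osd.417 dump_historic_slow_ops  # recent ops that exceeded the slow-op threshold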

Thanks,
Denny


> On 15.02.2019 12:41, Igor Fedotov <ifedotov@xxxxxxx> wrote:
> 
> Hi Denny,
> 
> I do not remember exactly when discards appeared in BlueStore, but they are disabled by default:
> 
> See bdev_enable_discard option.
> 
> 
> Thanks,
> 
> Igor
> 
> On 2/15/2019 2:12 PM, Denny Kreische wrote:
>> Hi,
>> 
>> two weeks ago we upgraded one of our ceph clusters from luminous 12.2.8 to mimic 13.2.4, cluster is SSD-only, bluestore-only, 68 nodes, 408 OSDs.
>> Somehow we have been seeing strange behaviour since then: single OSDs seem to block for around 5 minutes, and this causes the whole cluster and the connected applications to hang. This has happened 5 times during the last 10 days, at irregular times; it didn't happen before the upgrade.
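>> When it happens, we spot the blocking OSD roughly like this (a sketch using the standard CLI, nothing cluster-specific):
>> 
>> ceph health detail   # lists the slow requests and the OSDs reporting them
>> ceph osd perf        # per-OSD commit/apply latency; the blocked OSD stands out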
>> 
>> OSD log shows something like this (more log here: https://pastebin.com/6BYam5r4):
>> 
>> [...]
>> 2019-02-14 23:53:39.754 7f379a368700 -1 osd.417 340516 get_health_metrics reporting 3 slow ops, oldest is osd_op(client.84226977.0:5112539976 0.dff 0.1d783dff (undecoded) ondisk+read+known_if_redirected e340516)
>> 2019-02-14 23:53:40.706 7f379a368700 -1 osd.417 340516 get_health_metrics reporting 7 slow ops, oldest is osd_op(client.84226977.0:5112539976 0.dff 0.1d783dff (undecoded) ondisk+read+known_if_redirected e340516)
>> [...]
>> 
>> In this example osd.417 seems to have a problem. I can see the same log line in the logs of other OSDs that share placement groups with osd.417.
>> I assume that all placement groups involving osd.417 hang or are blocked while osd.417 is blocked.
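>> To see which placement groups (and peer OSDs) are involved, something like this should work (standard ceph CLI; 417 and the pg id are taken from the example above):
>> 
>> ceph pg ls-by-osd 417   # PGs whose acting set includes osd.417
>> ceph pg 0.dff query     # detailed state of the PG from the slow op above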
>> 
>> How can I see in detail what might cause a certain OSD to stop working?
>> 
>> The cluster contains SSDs from 3 different vendors (Micron, Samsung, Intel), but so far only the Micron disks have been affected. We had problems with the Micron SSDs earlier, with filestore (xfs): there it was fstrim that caused single OSDs to block for several minutes. We migrated to bluestore about a year ago. Just in case, is there any kind of SSD trim/discard happening in bluestore since mimic?
>> 
>> Thanks,
>> Denny
>> 
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@xxxxxxxxxxxxxx
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

-- 
Denny Kreische
IT System Engineer and Consultant

Am Teichdamm 20
04680 Colditz

Phone: 034381 55125
Mobile: 0176 2115 1457

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


