Hi Igor,

we store 400 TB of backups (RBD snapshots) on the cluster. Depending on the
schedule, we replace all data every one to two weeks, so we are deleting
data every day.

Yes, the OSDs are killed with messages like "heartbeat_check: no reply from
10.244.0.27:6852 osd.37 ever...", if that is what you mean.

All OSDs are reporting slow operation messages like this:

2020-11-12 08:12:09.564 7fb590569700 0 bluestore(/var/lib/ceph/osd/ceph-3) log_latency_fn slow operation observed for _txc_committed_kv, latency = 5.22663s, txc = 0x55c0357fb200
2020-11-12 08:26:10.759 7fb58254d700 0 bluestore(/var/lib/ceph/osd/ceph-3) log_latency slow operation observed for submit_transact, latency = 5.1896s

Thanks for the links. We are attempting the upgrade to 14.2.14 tonight and
are keeping bluefs_buffered_io=true; the exact commands we plan to run are
at the bottom of this mail.

Paul

On 19.11.20 at 16:28, Igor Fedotov wrote:
> Hi Paul,
>
> any chance you initiated massive data removal recently?
>
> Are there any suicide timeouts in OSD logs prior to the OSD failures? Any
> log output containing "slow operation observed" there?
>
> Please also note the following PR and tracker comments, which might be
> relevant for your case:
>
> https://github.com/ceph/ceph/pull/38044
>
> https://tracker.ceph.com/issues/45765#note-27
>
> Thanks,
>
> Igor
>
> On 11/17/2020 11:40 AM, Paul Kramme wrote:
>
>> Hello,
>>
>> currently we are experiencing problems with a cluster used for storing
>> RBD backups. Config:
>>
>> * 8 nodes, each with 6 HDD OSDs and 1 SSD used for blockdb and WAL
>> * k=4 m=2 EC
>> * dual 25GbE NIC
>> * v14.2.8
>>
>> ceph health detail shows the following messages:
>>
>> HEALTH_WARN BlueFS spillover detected on 1 OSD(s); 45 pgs not
>> deep-scrubbed in time; snap trim queue for 2 pg(s) >= 32768
>> (mon_osd_snap_trim_queue_warn_on); 1 slow ops, oldest one blocked for
>> 18629 sec, mon.cloud10-1517 has slow ops
>> BLUEFS_SPILLOVER BlueFS spillover detected on 1 OSD(s)
>> osd.0 spilled over 68 MiB metadata from 'db' device (35 GiB used of
>> 185 GiB) to slow device
>> PG_NOT_DEEP_SCRUBBED 45 pgs not deep-scrubbed in time
>> pg 18.3f5 not deep-scrubbed since 2020-09-03 21:58:28.316958
>> pg 18.3ed not deep-scrubbed since 2020-09-01 15:11:54.335935
>> [--- cut ---]
>> PG_SLOW_SNAP_TRIMMING snap trim queue for 2 pg(s) >= 32768
>> (mon_osd_snap_trim_queue_warn_on)
>> snap trim queue for pg 18.2c5 at 41630
>> snap trim queue for pg 18.d6 at 44079
>> longest queue on pg 18.d6 at 44079
>> try decreasing "osd snap trim sleep" and/or increasing "osd pg max
>> concurrent snap trims".
>> SLOW_OPS 1 slow ops, oldest one blocked for 18629 sec, mon.cloud10-1517
>> has slow ops
>>
>> We've made some observations on that cluster:
>>
>> * The BlueFS spillover goes away with "ceph tell osd.0 compact" but
>> comes back eventually
>> * The blockdb/WAL SSD is highly utilized, while the HDDs are not
>> * When one OSD fails, there is a cascade failure taking down many other
>> OSDs across all nodes. Most of the time, the cluster comes back when
>> setting the nodown flag and restarting all failed OSDs one by one
>> * Sometimes, especially during maintenance, "Long heartbeat ping times
>> on front/back interface seen, longest is 1390.076 msec" messages pop up
>> * Cluster performance deteriorates sharply when upgrading from
>> 14.2.8 to 14.2.11 or later, so we've rolled back to 14.2.8
>>
>> Of these problems, the OSD cascade failure is the most important, and has
>> been responsible for lengthy downtimes in the past few weeks.
>>
>> Do you have any ideas on how to combat these problems?
>>
>> Thank you,
>>
>> Paul
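P.S. For the record, this is roughly what we intend to run around the
upgrade to keep buffered BlueFS reads enabled. Treat it as a sketch of our
plan rather than a recommendation, and double-check the syntax before
copying it; as far as I can tell the option only takes effect after an OSD
restart:

    # pin the option in the config database so the newer default does not
    # flip it back to false after the upgrade
    ceph config set osd bluefs_buffered_io true

    # verify what a running OSD actually uses (run on the host carrying
    # osd.0; osd.0 is just an example)
    ceph daemon osd.0 config get bluefs_buffered_io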
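We are also considering the two knobs that the PG_SLOW_SNAP_TRIMMING
warning itself suggests, to work down the snap trim queue. The values below
are untested guesses for our HDD OSDs, not something we have validated:

    # check what the OSDs currently run with
    ceph config get osd osd_snap_trim_sleep
    ceph config get osd osd_pg_max_concurrent_snap_trims

    # then lower the sleep and/or raise the concurrency, as the warning
    # suggests; 4 concurrent trims is only an example value
    ceph config set osd osd_snap_trim_sleep 0
    ceph config set osd osd_pg_max_concurrent_snap_trims 4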