Re: EC cluster cascade failures and performance problems

Hi Paul,

Any chance you initiated a massive data removal recently?

Are there any suicide timeouts in the OSD logs prior to the OSD failures? Any log output containing "slow operation observed" there?
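
For example, something like this should show whether either pattern is present (assuming logs are written to the default /var/log/ceph/ location):

    grep -i "suicide timeout" /var/log/ceph/ceph-osd.*.log
    grep -i "slow operation observed" /var/log/ceph/ceph-osd.*.log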

Please also note the following PR and tracker comments, which might be relevant to your case.

https://github.com/ceph/ceph/pull/38044

https://tracker.ceph.com/issues/45765#note-27


Thanks,

Igor

On 11/17/2020 11:40 AM, Paul Kramme wrote:

Hello,

currently, we are experiencing problems with a cluster used for storing
RBD backups. Config:

* 8 nodes, each with 6 HDDs OSDs and 1 SSD used for blockdb and WAL
* k=4 m=2 EC (a rough profile sketch follows this list)
* dual 25GbE NIC
* v14.2.8
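
For reference, the EC setup above corresponds to a profile created roughly like this (profile name, pool name, PG count and failure domain are illustrative, not our actual values):

    ceph osd erasure-code-profile set backup-ec k=4 m=2 crush-failure-domain=host
    ceph osd pool create rbd-backups 1024 1024 erasure backup-ec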

ceph health detail shows the following messages:

HEALTH_WARN BlueFS spillover detected on 1 OSD(s); 45 pgs not
deep-scrubbed in time; snap trim queue for 2 pg(s) >= 32768
(mon_osd_snap_trim_queue_warn_on); 1 slow ops, oldest one blocked for
18629 sec, mon.cloud10-1517 has slow ops
BLUEFS_SPILLOVER BlueFS spillover detected on 1 OSD(s)
      osd.0 spilled over 68 MiB metadata from 'db' device (35 GiB used of
185 GiB) to slow device
PG_NOT_DEEP_SCRUBBED 45 pgs not deep-scrubbed in time
     pg 18.3f5 not deep-scrubbed since 2020-09-03 21:58:28.316958
     pg 18.3ed not deep-scrubbed since 2020-09-01 15:11:54.335935
[--- cut ---]
PG_SLOW_SNAP_TRIMMING snap trim queue for 2 pg(s) >= 32768
(mon_osd_snap_trim_queue_warn_on)
     snap trim queue for pg 18.2c5 at 41630
     snap trim queue for pg 18.d6 at 44079
     longest queue on pg 18.d6 at 44079
     try decreasing "osd snap trim sleep" and/or increasing "osd pg max
concurrent snap trims".
SLOW_OPS 1 slow ops, oldest one blocked for 18629 sec, mon.cloud10-1517
has slow ops
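
We have not yet tried the snap trim tuning suggested in the warning above; as far as we understand, those options can be adjusted at runtime roughly like this (values are illustrative only, not something we have tested):

    ceph daemon osd.0 config get osd_snap_trim_sleep          # inspect the current value (run on the node hosting osd.0)
    ceph config set osd osd_snap_trim_sleep 0                 # illustrative value; lower means faster trimming
    ceph config set osd osd_pg_max_concurrent_snap_trims 4    # illustrative value; higher allows more parallel trims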

We've made some observations on that cluster:
* The BlueFS spillover goes away with "ceph tell osd.0 compact" but
comes back eventually
* The blockdb/WAL SSD is highly utilized, while the HDDs are not
* When one OSD fails, a cascade failure takes down many other OSDs
across all nodes. Most of the time, the cluster comes back once we set
the nodown flag and restart all failed OSDs one by one (rough sequence
sketched after this list)
* Sometimes, especially during maintenance, "Long heartbeat ping times
on front/back interface seen, longest is 1390.076 msec" messages pop up
* Cluster performance deteriorated sharply after we upgraded from
14.2.8 to 14.2.11 or later, so we rolled back to 14.2.8
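
For completeness, the recovery sequence mentioned above is roughly the following (the OSD id is a placeholder; we restart each failed OSD in turn):

    ceph osd set nodown                # keep the monitors from marking flapping OSDs down
    systemctl restart ceph-osd@<id>    # restart each failed OSD, one at a time
    ceph -s                            # wait for PGs to peer and become active again
    ceph osd unset nodown              # clear the flag once the cluster is stable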

Of these problems, the OSD cascade failure is the most important and has
been responsible for lengthy downtimes in the past few weeks.

Do you have any ideas on how to combat these problems?

Thank you,

Paul

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


