EC cluster cascade failures and performance problems

Paul Kramme <p.kramme@xxxxxxxxxxxx> · Tue, 17 Nov 2020 09:40:49 +0100

Hello,

we are currently experiencing problems with a cluster used for storing
RBD backups. Configuration:

* 8 nodes, each with 6 HDD OSDs and 1 SSD used for block.db and WAL
* k=4 m=2 EC
* dual 25GbE NIC
* v14.2.8
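
As a rough illustration of what this layout implies, the following sketch works through the arithmetic. It uses only the figures listed above, except for the failure domain, which the list does not state; "host" is assumed here purely for illustration.

```python
# Figures taken from the configuration list above.
nodes = 8
osds_per_node = 6
k, m = 4, 2

total_osds = nodes * osds_per_node      # HDD OSDs in the cluster
shards_per_pg = k + m                   # each PG is spread over k+m OSDs
usable_fraction = k / (k + m)           # share of raw capacity holding data
raw_overhead = (k + m) / k              # raw bytes written per byte stored
osds_per_db_ssd = osds_per_node         # all OSDs of a node share one SSD

# ASSUMPTION (not stated in the post): the CRUSH failure domain is "host",
# so every PG needs k+m distinct hosts.
spare_hosts = nodes - shards_per_pg

print(f"total OSDs:            {total_osds}")
print(f"shards per PG:         {shards_per_pg}")
print(f"usable fraction:       {usable_fraction:.3f}")
print(f"raw overhead factor:   {raw_overhead:.2f}")
print(f"OSDs per block.db SSD: {osds_per_db_ssd}")
print(f"hosts beyond k+m:      {spare_hosts}")
```

Under that assumption, every PG touches 6 of the 8 hosts, and each SSD carries the metadata and WAL traffic of 6 OSDs.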

ceph health detail shows the following messages:

HEALTH_WARN BlueFS spillover detected on 1 OSD(s); 45 pgs not deep-scrubbed in time; snap trim queue for 2 pg(s) >= 32768 (mon_osd_snap_trim_queue_warn_on); 1 slow ops, oldest one blocked for 18629 sec, mon.cloud10-1517 has slow ops
BLUEFS_SPILLOVER BlueFS spillover detected on 1 OSD(s)
    osd.0 spilled over 68 MiB metadata from 'db' device (35 GiB used of 185 GiB) to slow device
PG_NOT_DEEP_SCRUBBED 45 pgs not deep-scrubbed in time
    pg 18.3f5 not deep-scrubbed since 2020-09-03 21:58:28.316958
    pg 18.3ed not deep-scrubbed since 2020-09-01 15:11:54.335935
[--- cut ---]
PG_SLOW_SNAP_TRIMMING snap trim queue for 2 pg(s) >= 32768 (mon_osd_snap_trim_queue_warn_on)
    snap trim queue for pg 18.2c5 at 41630
    snap trim queue for pg 18.d6 at 44079
    longest queue on pg 18.d6 at 44079
    try decreasing "osd snap trim sleep" and/or increasing "osd pg max concurrent snap trims".
SLOW_OPS 1 slow ops, oldest one blocked for 18629 sec, mon.cloud10-1517 has slow ops
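
For tracking these warnings over time, a small parser along the following lines could pull the per-PG figures out of the plain-text output. This is only a sketch: the sample text is copied from the output above, while the function name and the regular expressions are illustrative and cover just the two line shapes shown here.

```python
import re
from datetime import datetime

# Sample lines copied from the health output above.
SAMPLE = """\
PG_NOT_DEEP_SCRUBBED 45 pgs not deep-scrubbed in time
    pg 18.3f5 not deep-scrubbed since 2020-09-03 21:58:28.316958
    pg 18.3ed not deep-scrubbed since 2020-09-01 15:11:54.335935
PG_SLOW_SNAP_TRIMMING snap trim queue for 2 pg(s) >= 32768 (mon_osd_snap_trim_queue_warn_on)
    snap trim queue for pg 18.2c5 at 41630
    snap trim queue for pg 18.d6 at 44079
    longest queue on pg 18.d6 at 44079
"""

# Illustrative patterns; they match only the per-PG detail lines.
SCRUB_RE = re.compile(r"pg (\S+) not deep-scrubbed since (\S+ \S+)")
TRIM_RE = re.compile(r"snap trim queue for pg (\S+) at (\d+)")


def parse_health_detail(text):
    """Return (last deep-scrub time per PG, snap trim queue length per PG)."""
    scrubs, trims = {}, {}
    for raw in text.splitlines():
        line = raw.strip()
        match = SCRUB_RE.match(line)
        if match:
            scrubs[match.group(1)] = datetime.strptime(
                match.group(2), "%Y-%m-%d %H:%M:%S.%f"
            )
            continue
        match = TRIM_RE.match(line)
        if match:
            trims[match.group(1)] = int(match.group(2))
    return scrubs, trims


scrubs, trims = parse_health_detail(SAMPLE)
oldest_scrub_pg = min(scrubs, key=scrubs.get)
longest_trim_pg = max(trims, key=trims.get)
print("PG with oldest deep scrub:", oldest_scrub_pg)
print("PG with longest snap trim queue:", longest_trim_pg, trims[longest_trim_pg])
```

The ceph CLI also accepts `--format json`, which is likely more robust to parse than the text form used in this sketch.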

We've made some observations on that cluster:
* The BlueFS spillover goes away with "ceph tell osd.0 compact" but
comes back eventually
* The blockdb/WAL SSD is highly utilized, while the HDDs are not
* When one OSD fails, there is a cascade failure taking down many other
OSDs across all nodes. Most of the time, the cluster comes back when
setting the nodown flag and restarting all failed OSDs one by one
* Sometimes, especially during maintenance, "Long heartbeat ping times
on front/back interface seen, longest is 1390.076 msec" messages pop up
* The cluster performance deteriorates sharply when upgrading from
14.2.8 to 14.2.11 or later, so we've rolled back to 14.2.8
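
The recovery procedure from the cascade-failure item can be sketched as a command sequence. The sketch below only builds and prints the commands; it does not run anything. The OSD ids are hypothetical, the `ceph-osd@<id>` systemd unit name assumes a package-based (non-containerized) deployment, and clearing the flag again at the end is an assumption that the description above does not spell out.

```python
def recovery_commands(failed_osd_ids):
    """Build the command sequence: set nodown, restart OSDs one by one, unset."""
    cmds = [["ceph", "osd", "set", "nodown"]]
    for osd_id in failed_osd_ids:
        # ASSUMPTION: OSDs run as systemd units named ceph-osd@<id>; each
        # restart has to be issued on the node that hosts that OSD.
        cmds.append(["systemctl", "restart", f"ceph-osd@{osd_id}"])
    # ASSUMPTION: the flag is cleared once all OSDs are back up.
    cmds.append(["ceph", "osd", "unset", "nodown"])
    return cmds


# Hypothetical OSD ids, for illustration only.
for cmd in recovery_commands([3, 17, 42]):
    print(" ".join(cmd))
```

In practice one would wait for each OSD to rejoin before restarting the next; that check is left out of the sketch.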

Of these problems, the OSD cascade failure is the most important and has
been responsible for lengthy downtimes over the past few weeks.

Do you have any ideas on how to combat these problems?

Thank you,

Paul

-- 
Kind regards
  Paul Kramme
Your Profihost Team

-------------------------------
Profihost AG
Expo Plaza 1
30539 Hannover
Germany

Tel.: +49 (511) 5151 8181     | Fax.: +49 (511) 5151 8282
URL: http://www.profihost.com | E-Mail: info@xxxxxxxxxxxxx

Registered office: Hannover, VAT ID: DE813460827
Court of registration: Amtsgericht Hannover, registration no.: HRB 202350
Executive board: Cristoph Bluhm, Stefan Priebe, Marc Zocher, Dr. Claus Boyens,
Daniel Hagemeier
Supervisory board: Gabriele Pulvermüller (chair)