Hi Igor,

we store 400 TB of backups (RBD snapshots) on the cluster. Depending on the
schedule, we replace all data every one to two weeks, so we are deleting
data every day.

Yes, the OSDs are killed with messages like "heartbeat_check: no reply from
10.244.0.27:6852 osd.37 ever...", if that is what you mean.

All OSDs are reporting slow operation messages like this:

2020-11-12 08:12:09.564 7fb590569700 0 bluestore(/var/lib/ceph/osd/ceph-3) log_latency_fn slow operation observed for _txc_committed_kv, latency = 5.22663s, txc = 0x55c0357fb200
2020-11-12 08:26:10.759 7fb58254d700 0 bluestore(/var/lib/ceph/osd/ceph-3) log_latency slow operation observed for submit_transact, latency = 5.1896s

Thanks for the links. We are attempting the upgrade to 14.2.14 tonight and
are keeping bluefs_buffered_io=true; the exact commands we plan to run are
at the bottom of this mail.

Paul

On 19.11.20 at 16:28, Igor Fedotov wrote:
> Hi Paul,
>
> any chance you initiated massive data removal recently?
>
> Are there any suicide timeouts in OSD logs prior to the OSD failures? Any
> log output containing "slow operation observed" there?
>
> Please also note the following PR and tracker comments, which might be
> relevant for your case:
>
> https://github.com/ceph/ceph/pull/38044
>
> https://tracker.ceph.com/issues/45765#note-27
>
> Thanks,
>
> Igor
>
> On 11/17/2020 11:40 AM, Paul Kramme wrote:
>
>> Hello,
>>
>> currently we are experiencing problems with a cluster used for storing
>> RBD backups. Config:
>>
>> * 8 nodes, each with 6 HDD OSDs and 1 SSD used for blockdb and WAL
>> * k=4 m=2 EC
>> * dual 25GbE NIC
>> * v14.2.8
>>
>> ceph health detail shows the following messages:
>>
>> HEALTH_WARN BlueFS spillover detected on 1 OSD(s); 45 pgs not
>> deep-scrubbed in time; snap trim queue for 2 pg(s) >= 32768
>> (mon_osd_snap_trim_queue_warn_on); 1 slow ops, oldest one blocked for
>> 18629 sec, mon.cloud10-1517 has slow ops
>> BLUEFS_SPILLOVER BlueFS spillover detected on 1 OSD(s)
>> osd.0 spilled over 68 MiB metadata from 'db' device (35 GiB used of
>> 185 GiB) to slow device
>> PG_NOT_DEEP_SCRUBBED 45 pgs not deep-scrubbed in time
>> pg 18.3f5 not deep-scrubbed since 2020-09-03 21:58:28.316958
>> pg 18.3ed not deep-scrubbed since 2020-09-01 15:11:54.335935
>> [--- cut ---]
>> PG_SLOW_SNAP_TRIMMING snap trim queue for 2 pg(s) >= 32768
>> (mon_osd_snap_trim_queue_warn_on)
>> snap trim queue for pg 18.2c5 at 41630
>> snap trim queue for pg 18.d6 at 44079
>> longest queue on pg 18.d6 at 44079
>> try decreasing "osd snap trim sleep" and/or increasing "osd pg max
>> concurrent snap trims".
>> SLOW_OPS 1 slow ops, oldest one blocked for 18629 sec, mon.cloud10-1517
>> has slow ops
>>
>> We've made some observations on that cluster:
>>
>> * The BlueFS spillover goes away with "ceph tell osd.0 compact" but
>> comes back eventually
>> * The blockdb/WAL SSD is highly utilized, while the HDDs are not
>> * When one OSD fails, there is a cascade failure taking down many other
>> OSDs across all nodes. Most of the time, the cluster comes back when
>> setting the nodown flag and restarting all failed OSDs one by one
>> * Sometimes, especially during maintenance, "Long heartbeat ping times
>> on front/back interface seen, longest is 1390.076 msec" messages pop up
>> * Cluster performance deteriorates sharply when upgrading from
>> 14.2.8 to 14.2.11 or later, so we've rolled back to 14.2.8
>>
>> Of these problems, the OSD cascade failure is the most important, and has
>> been responsible for lengthy downtimes in the past few weeks.
>>
>> Do you have any ideas on how to combat these problems?
>>
>> Thank you,
>>
>> Paul
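P.S. For the record, this is roughly what we intend to run around the
upgrade to keep buffered BlueFS reads enabled. Treat it as a sketch of our
plan rather than a recommendation, and double-check the syntax before
copying it; as far as I can tell the option only takes effect after an OSD
restart:

    # pin the option in the config database so the newer default does not
    # flip it back to false after the upgrade
    ceph config set osd bluefs_buffered_io true

    # verify what a running OSD actually uses (run on the host carrying
    # osd.0; osd.0 is just an example)
    ceph daemon osd.0 config get bluefs_buffered_io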
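We are also considering the two knobs that the PG_SLOW_SNAP_TRIMMING
warning itself suggests, to work down the snap trim queue. The values below
are untested guesses for our HDD OSDs, not something we have validated:

    # check what the OSDs currently run with
    ceph config get osd osd_snap_trim_sleep
    ceph config get osd osd_pg_max_concurrent_snap_trims

    # then lower the sleep and/or raise the concurrency, as the warning
    # suggests; 4 concurrent trims is only an example value
    ceph config set osd osd_snap_trim_sleep 0
    ceph config set osd osd_pg_max_concurrent_snap_trims 4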