Hi Anthony

Thank you for your response. I am looking at the "OSDs highest latency of
write operations" panel of the Grafana dashboard found in the Ceph source at
./monitoring/grafana/dashboards/osds-overview.json. It is a topk graph that
uses ceph_osd_op_w_latency_sum / ceph_osd_op_w_latency_count.

During normal operations we sometimes see latency spikes of at most 4
seconds, but while bringing the rack back we saw a consistent increase in
latency for a lot of OSDs into the 20-second range.

The cluster has 1139 OSDs in total, of which we had 5 x 9 = 45 in
maintenance.

We did not throttle the backfilling process because we had successfully done
the same maintenance before on a few occasions for other racks without
problems. I will throttle backfills the next time we have the same sort of
maintenance on the next rack (a rough sketch of what I have in mind follows
below the quoted thread).

Can you elaborate a bit more on what exactly happens during the peering
process? I understand that the OSDs need to catch up. I also see that the
number of scrubs increases a lot when OSDs are brought back online. Is that
part of the peering process?

Thx, Marcel

> HDDs and concern for latency don't mix. That said, you don't specify
> what you mean by "latency". Does that mean average client write
> latency? median? P99? Something else?
>
> If you have a 15 node cluster and you took a third of it down for two
> hours then yeah you'll have a lot to catch up on when you come back.
> Bringing the nodes back one at a time can help, to spread out the peering.
> Did you throttle backfill/recovery tunables all the way down to 1? In a
> way that the restarted OSDs would use the throttled values as they boot?
>
>> On Nov 5, 2020, at 6:47 AM, Marcel Kuiper <ceph@xxxxxxxx> wrote:
>>
>> Hi
>>
>> We had a rack down for 2 hours for maintenance. 5 storage nodes were
>> involved. We had the noout and norebalance flags set before the start of
>> the maintenance.
>>
>> When the systems were brought back online we noticed a lot of OSDs with
>> high latency (in the 20-second range), mostly OSDs that are not on the
>> storage nodes that were down. It took about 20 minutes for things to
>> settle down.
>>
>> We're running Nautilus 14.2.11. The storage nodes run BlueStore and have
>> 9 x 8T HDDs plus 3 SSDs for RocksDB, each SSD carrying 3 x 123G LVs.
>>
>> - Can anyone give a reason for these high latencies?
>> - Is there a way to avoid or lower these latencies when bringing systems
>>   back into operation?
>>
>> Best Regards
>>
>> Marcel
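
Roughly what I have in mind for the next rack, assuming Nautilus defaults are
otherwise in place. osd_max_backfills and osd_recovery_max_active are the
backfill/recovery tunables Anthony refers to; the values are only a starting
point and not yet tested in our cluster:

  # Before taking the rack down (the flags we already use today):
  ceph osd set noout
  ceph osd set norebalance

  # Throttle backfill/recovery via the mon config store rather than
  # injectargs, so that the restarted OSDs also pick the values up as they
  # boot:
  ceph config set osd osd_max_backfills 1
  ceph config set osd osd_recovery_max_active 1

  # After the rack is back and peering/backfill has settled down again:
  ceph osd unset norebalance
  ceph osd unset noout

  # Drop the throttles again (falls back to the previous defaults):
  ceph config rm osd osd_max_backfills
  ceph config rm osd osd_recovery_max_active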
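
For completeness, the ratio the panel plots can also be read straight from an
OSD's admin socket on the storage node itself; a minimal sketch, assuming jq
is available and osd.0 stands in for a locally running OSD id:

  # Dump the perf counters of a local OSD and pick out the write latency
  # counter that the Prometheus module exports as
  # ceph_osd_op_w_latency_sum / ceph_osd_op_w_latency_count:
  ceph daemon osd.0 perf dump | jq '.osd.op_w_latency'

  # The counter contains "avgcount", "sum" and "avgtime"; sum divided by
  # avgcount is the average write latency in seconds, i.e. the value the
  # Grafana panel graphs per OSD.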