Re: high latency after maintenance]

Hi Marcel,

The peering process is the process Ceph OSDs use, on a per placement
group basis, to agree on the state of that placement group on each of
the involved OSDs.

In your case, 2/3 of the placement group metadata that needs to be
agreed upon and checked lives on the nodes that did not undergo
maintenance. You also need to consider that the acting primary OSD for
everything is now hosted on the OSDs that did not undergo any
maintenance.

This all means that all the 'heavy' lifting is done by the nodes that
stayed online until the recovery/backfilling process is completed.
Also consider that Ceph will, most likely, execute peering twice per
PG: once when the OSDs start again, and once when the recovery and
backfilling is finished.

I really don't want to say RTFM, but I don't think it is useful to
copy it all here:
https://docs.ceph.com/en/latest/dev/peering/#description-of-the-peering-process

Peering
the process of bringing all of the OSDs that store a Placement Group (PG) into agreement about the state of all of the objects (and their metadata) in that PG. Note that agreeing on the state does not mean that they all have the latest contents.
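
If you want to watch this happen on your own cluster, something like
the following gives a rough picture (just a sketch; 2.1f is a
placeholder pg id, use one from your own pools):

    # PGs that are not active+clean (peering, recovering, backfilling, ...)
    ceph pg dump pgs_brief | grep -v active+clean

    # detailed state and peering/recovery history of a single PG
    ceph pg 2.1f query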

Kind regards,

Wout
42on


________________________________________
From: Marcel Kuiper <ceph@xxxxxxxx>
Sent: Friday, 6 November 2020 10:23
To: ceph-users@xxxxxxx
Subject:  Re: high latency after maintenance]


Hi Anthony

Thank you for your response.

I am looking at the "OSDs highest latency of write operations" panel of
the Grafana dashboard found in the Ceph source in
./monitoring/grafana/dashboards/osds-overview.json. It is a topk graph
that uses ceph_osd_op_w_latency_sum / ceph_osd_op_w_latency_count.
During normal operations we sometimes see latency spikes of 4 seconds
at most, but while bringing the rack back we saw a consistent increase
in latency for a lot of OSDs into the 20 second range.
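
For reference, the number I am looking at boils down to roughly this
query (a sketch only; the dashboard may use irate and a different
window, and PROM_HOST is a placeholder for our Prometheus endpoint):

    # average write latency per op, top 10 OSDs
    curl -sG "http://PROM_HOST:9090/api/v1/query" \
      --data-urlencode 'query=topk(10, rate(ceph_osd_op_w_latency_sum[5m]) / rate(ceph_osd_op_w_latency_count[5m]))'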

The cluster has 1139 OSDs in total, of which we had 5 x 9 = 45 in
maintenance.

We did not throttle the backfilling process because we had successfully
done the same maintenance before on a few occasions for other racks
without problems. I will throttle backfills next time we have the same
sort of maintenance in the next rack.
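
Something like the following is what I have in mind for next time (a
sketch; the values probably need tuning, and since Nautilus the
settings in the central config should also be picked up by the OSDs as
they boot):

    # throttle recovery/backfill before bringing the rack back
    ceph config set osd osd_max_backfills 1
    ceph config set osd osd_recovery_max_active 1

    # and/or push it to the OSDs that are already running
    ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'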

Can you elaborate a bit more on what exactly happens during the peering
process? I understand that the OSDs need to catch up. I also see that
the number of scrubs increases a lot when OSDs are brought back online.
Is that part of the peering process?
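
If the scrubs turn out to be unrelated, I am also considering simply
keeping them out of the picture during the maintenance window with
something like this (assuming that is acceptable for us policy-wise):

    ceph osd set noscrub
    ceph osd set nodeep-scrub
    # ... maintenance and recovery ...
    ceph osd unset noscrub
    ceph osd unset nodeep-scrub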

Thx, Marcel


> HDDs and concern for latency don’t mix.  That said, you don’t specify
> what you mean by “latency”.  Does that mean average client write
> latency? Median? P99? Something else?
>
> If you have a 15 node cluster and you took a third of it down for two
> hours then yeah you’ll have a lot to catch up on when you come back.
> Bringing the nodes back one at a time can help, to spread out the peering.
>  Did you throttle backfill/recovery tunables all the way down to 1?  In a
> way that the restarted OSDs would use the throttled values as they boot?
>
>
>
>
>> On Nov 5, 2020, at 6:47 AM, Marcel Kuiper <ceph@xxxxxxxx> wrote:
>>
>> Hi
>>
>> We had a rack down for 2 hours for maintenance. 5 storage nodes were
>> involved. We had the noout and norebalance flags set before the start
>> of the maintenance.
>>
>> When the systems were brought back online we noticed a lot of OSDs
>> with high latency (in the 20 second range), mostly OSDs that are not
>> on the storage nodes that were down. It took about 20 minutes for
>> things to settle down.
>>
>> We're running Nautilus 14.2.11. The storage nodes run BlueStore and
>> have 9 x 8T HDDs and 3 x SSDs for RocksDB, each with 3 x 123G LVs.
>>
>> - Can anyone give a reason for these high latencies?
>> - Is there a way to avoid or lower these latencies when bringing systems
>> back into operation?
>>
>> Best Regards
>>
>> Marcel
>
>

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
