Re: How to slow down PG recovery when a failed OSD node comes back?

Thanks a lot for the tips. 

I do not see this PG recovery problem on the HDD system; the issue only happens on SSD. I checked the SSD-related parameter settings and did not find anything suspicious. Still in pursuit of the root cause...

samuel 



huxiaoyu@xxxxxxxxxxxx
 
From: Frank Schilder
Date: 2021-08-26 09:01
To: huxiaoyu@xxxxxxxxxxxx; ceph-users
Subject: Re: How to slow down PG recovery when a failed OSD node comes back?
For Luminous you should check the corresponding _ssd config values for osd_recovery_sleep and osd_max_backfills.
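 
For example, on Luminous you can inspect what a running OSD actually uses and adjust it on the fly. A minimal sketch (osd.0 is a placeholder for one of the SSD OSDs; run the daemon commands on the host where that OSD lives):
 
    # Values the running OSD is actually using
    ceph daemon osd.0 config get osd_recovery_sleep_ssd
    ceph daemon osd.0 config get osd_max_backfills
 
    # Throttle recovery on all OSDs at runtime; note that injectargs
    # changes do not persist across daemon restarts
    ceph tell osd.* injectargs '--osd_recovery_sleep_ssd=0.5'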
 
However, I don't think you should see a problem with the defaults in Luminous. In fact, I had good experience with making recovery even more aggressive than the defaults. You might want to look through the logs to see whether there are other problems, for example peering taking very long or other OSDs being marked down temporarily (the classic "a monitor marked me down but I'm still running"). It could be a network or CPU bottleneck.
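 
A quick way to check for flapping OSDs during the rejoin window, as a sketch (log locations and exact message wording can vary between versions and distributions):
 
    # On the OSD hosts: OSDs complaining they were wrongly marked down
    grep -i "wrongly marked me down" /var/log/ceph/ceph-osd.*.log
 
    # On a monitor host: the cluster log records the same events
    grep -i "marked down" /var/log/ceph/ceph.log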
 
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
 
________________________________________
From: huxiaoyu@xxxxxxxxxxxx <huxiaoyu@xxxxxxxxxxxx>
Sent: 25 August 2021 21:46:57
To: ceph-users
Subject: How to slow down PG recovery when a failed OSD node comes back?
 
Dear Cephers,
 
I have an all-flash 3-node Ceph cluster, each node with 8 SSDs as OSDs, running Ceph release 12.2.13. I have the following settings:
    osd_op_queue = wpq
    osd_op_queue_cut_off = high
and
    osd_recovery_sleep = 0.5
    osd_min_pg_log_entries = 3000
    osd_max_pg_log_entries = 10000
    osd_max_backfills = 1
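 
For what it's worth, these can be cross-checked against what a running OSD reports (a sketch, with osd.0 as a placeholder):
 
    ceph daemon osd.0 config show | egrep 'osd_recovery_sleep|osd_max_backfills|osd_op_queue'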
 
The problem I encountered is the following: after a failed OSD node comes back and re-joins, there is a 3-5 minute period during which the recovery workload overwhelms the system, making user IO almost stall. After these 3-5 minutes, the recovery process seems to calm down to a reasonable level, giving priority to the user IO workload.
 
What happens during these crazy 3-5 minutes, and how can I reduce the negative impact?
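 
Would something like gating recovery with cluster flags while the node rejoins help here? A sketch of what I have in mind:
 
    # Before the node rejoins: hold back recovery and backfill
    ceph osd set norecover
    ceph osd set nobackfill
 
    # Once peering has settled: release the flags and let the
    # throttled recovery proceed
    ceph osd unset norecover
    ceph osd unset nobackfill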
 
Any suggestions and comments are highly appreciated.
 
Best regards,
 
Samuel
 
 
 
huxiaoyu@xxxxxxxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
 