How to slow down PG recovery when a failed OSD node comes back?

Dear Cephers,

I have an all-flash, 3-node Ceph cluster with 8 SSDs per node as OSDs, running Ceph release 12.2.13. I have the following settings:
    osd_op_queue = wpq
    osd_op_queue_cut_off = high
and
    osd_recovery_sleep = 0.5
    osd_min_pg_log_entries = 3000
    osd_max_pg_log_entries = 10000
    osd_max_backfills = 1
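
Written as a ceph.conf fragment, the settings above would look roughly like the sketch below. Placing them all under the [osd] section is an assumption made for illustration; where they are actually set is not stated here.

```ini
# Sketch only: the [osd] section placement is assumed, the values are the ones listed above.
[osd]
# Op queue scheduler and cut-off
osd_op_queue = wpq
osd_op_queue_cut_off = high

# Recovery and backfill throttling
osd_recovery_sleep = 0.5
osd_max_backfills = 1

# PG log length bounds
osd_min_pg_log_entries = 3000
osd_max_pg_log_entries = 10000
```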

The problem I encountered is the following: after a failed OSD node comes back and rejoins the cluster, there is a 3-5 minute period during which the recovery workload overwhelms the system and user IO almost stalls. After these 3-5 minutes, the recovery process seems to calm down to a reasonable level and gives priority to the user IO workload.

What happens during those crazy 3-5 minutes, and how can I reduce the negative impact during that period?

Any suggestions and comments are highly appreciated.

Best regards,

Samuel



huxiaoyu@xxxxxxxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


