Re: Cluster down after network outage

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 7/12/23 09:53, Frank Schilder wrote:
Hi all,

we had a network outage tonight (power loss) and restored network in the morning. All OSDs were running during this period. After restoring network peering hell broke loose and the cluster has a hard time coming back up again. OSDs get marked down all the time and come back later. Peering never stops.

Below is the current status, I had all OSDs shown as up for a while, but many were not responsive. Are there some flags that help bringing things up in a sequence that causes less overload on the system?

osd_recovery_delay_start

We have that set on 60 seconds. So the OSD first gets some time to peer before starting recovery. That might help in this case. Worth a shot. Maybe increase it to 5 minutes or more just to get all OSDs stable before recovery starts going?

Good luck!

Gr. Stefan
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



[Index of Archives]     [Information on CEPH]     [Linux Filesystem Development]     [Ceph Development]     [Ceph Large]     [Ceph Dev]     [Linux USB Development]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [xfs]


  Powered by Linux