Nautilus Cluster Struggling to Come Back Online

As a sort of follow-up to my previous post: our Nautilus (14.2.16 on Ubuntu 18.04) cluster had an event that caused memory errors on many of the machines. The aftermath is that some OSDs hit (and continue to hit) this error, https://tracker.ceph.com/issues/48827, while others won't start for various reasons.

The OSDs that *will* start are, for the most part, badly behind the current map epoch.
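For context, we are gauging the lag roughly like this (osd.12 is just an example ID; the "daemon" command has to run on that OSD's host, against its admin socket):

  # current map epoch according to the monitors
  ceph osd dump | grep ^epoch

  # map range this OSD has actually caught up to
  # (compare oldest_map / newest_map in the output to the epoch above)
  ceph daemon osd.12 status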

It sounds very similar to this:
https://blog.noc.grnet.gr/2016/10/18/surviving-a-ceph-cluster-outage-the-hard-way/

We are having trouble getting things back online.

I think the path forward is to:
- set noup/nodown/noout/nobackfill and wait for the OSDs that run to come up (flag commands below); we were making good progress yesterday until some of the OSDs crashed with OOM errors. We are moving forward again, but understandably nervous.
- export the PGs from questionable OSDs, rebuild those OSDs, and import the PGs if necessary (very likely); repeat until we are up (ceph-objectstore-tool sketch after this list).
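For the record, the flag step is just the standard cluster flags, something like:

  ceph osd set noup         # OSDs are not automatically marked up
  ceph osd set nodown       # OSDs are not automatically marked down
  ceph osd set noout        # down OSDs are not marked out (no re-replication kicks in)
  ceph osd set nobackfill   # suspend backfill
  ceph osd set norebalance  # suspend rebalancing
  ceph osd set pause        # optional: pause client reads/writes entirely

  # and the reverse once things look healthy again, one flag at a time, e.g.
  ceph osd unset noup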
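The export/import step would be along these lines with ceph-objectstore-tool (osd.12, PG 2.1f, and the file path are just examples; the OSD has to be stopped first, and filestore OSDs would also need --journal-path):

  systemctl stop ceph-osd@12

  # export one PG from the suspect OSD
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 \
      --op export --pgid 2.1f --file /root/pg2.1f.export

  # after rebuilding the OSD (or on another OSD), import it back
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 \
      --op import --file /root/pg2.1f.export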

Any suggestions for increasing speed? We are using noup/nobackfill/norebalance/pause, but the epoch catch-up is taking a very long time. Any tips for keeping the epoch from moving forward, or for speeding up the OSDs' catching up? How can we estimate how long it should take?
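The best we have come up with so far is a crude extrapolation: sample an OSD's newest_map twice and divide the remaining epochs by the observed rate (osd.12 and the 60-second interval are just examples; run on that OSD's host, and this assumes jq is installed):

  target=$(ceph osd dump | awk '/^epoch/ {print $2}')   # where the cluster map is now
  e1=$(ceph daemon osd.12 status | jq .newest_map)
  sleep 60
  e2=$(ceph daemon osd.12 status | jq .newest_map)
  rate=$(( e2 - e1 ))                                    # epochs caught up per minute
  [ "$rate" -gt 0 ] && echo "~$(( (target - e2) / rate )) minutes left for osd.12"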

Thank you for any ideas or assistance anyone can provide.

Will
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


