Best guess: the recovery process doesn't actually stop; the mgr is dead
and simply no longer reports the progress.

And yeah, I can confirm that having a huge number of crash reports is a
problem (had a case where a monitoring script crashed due to a
radosgw-admin bug... lots of crash reports).

Paul

--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Thu, Apr 30, 2020 at 4:09 PM Francois Legrand <fleg@xxxxxxxxxxxxxx> wrote:

> Hi everybody (again),
> We recently had a lot of OSD crashes (more than 30 OSDs crashed). This
> is now fixed, but it triggered a huge rebalancing+recovery.
> At more or less the same time, we noticed that ceph crash ls (or any
> other ceph crash command) hangs forever and never returns.
> And finally, the recovery process stops regularly (after ~1 hour), but
> it can be restarted by restarting the mgr daemon (systemctl restart
> ceph-mgr.target on the active manager).
> There is nothing in the logs (the manager still works, the service is
> up, the dashboard is accessible, but the recovery simply stops).
> We also tried rebooting the managers, but that doesn't solve the
> problem.
> I guess these two problems are linked, but I'm not sure.
> Does anybody have a clue?
> Thanks.
> F.
>
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx