Hi all,
I recently encountered a situation where some partially removed OSDs
left a number of PGs in my cluster "stuck inactive". The eventual
solution was to tell Ceph the OSDs were "lost". Because all of the
affected PGs were replicated elsewhere in the cluster, no data was
lost.
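
For reference, the fix was along these lines (osd.12 is just an
example id):

  # list PGs stuck in the inactive state
  ceph pg dump_stuck inactive

  # declare an unrecoverable OSD lost so the stuck PGs can peer again
  ceph osd lost 12 --yes-i-really-mean-it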
Would it make sense, or be possible, for Ceph to detect this
situation automatically (PGs stuck inactive but fully replicated
elsewhere) and take action to unstick the cluster? E.g. it could
mark the offending OSDs lost, or mark them down and out to the same
effect.
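
To sketch the detection side (hypothetical, this is not an existing
Ceph feature; the pg id is just an example):

  # find PGs stuck inactive
  ceph pg dump_stuck inactive

  # a stuck PG's peering info lists the OSDs it is blocked by
  ceph pg 2.5 query

  # if every blocking OSD is permanently gone and the PG has a
  # complete copy elsewhere, an agent could issue the same
  # "ceph osd lost" call as above (or "ceph osd down"/"ceph osd out")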
Ideally anything that can be safely automated should be. :)
Thanks!
C.