Thank you! Setting norecover seems to have worked in terms of keeping the
OSDs up. I am glad my logs were of use in tracking this down. I am looking
forward to future updates. Let me know if you need anything else.

--
Adam

On Thu, Apr 5, 2018 at 10:13 PM, Josh Durgin <jdurgin@xxxxxxxxxx> wrote:
> On 04/05/2018 08:11 PM, Josh Durgin wrote:
>>
>> On 04/05/2018 06:15 PM, Adam Tygart wrote:
>>>
>>> Well, the cascading crashes are getting worse. I'm routinely seeing
>>> 8-10 of my 518 OSDs crash. I cannot start 2 of them without triggering
>>> 14 or so of them to crash repeatedly for more than an hour.
>>>
>>> I've run another one of them with more logging, debug osd = 20; debug
>>> ms = 1 (definitely more than one crash in there):
>>> http://people.cs.ksu.edu/~mozes/ceph-osd.422.log
>>>
>>> Anyone have any thoughts? My cluster feels like it is getting more and
>>> more unstable by the hour...
>>
>> Thanks to your logs, I think I've found the root cause. It looks like a
>> bug in the EC recovery code that's triggered by EC overwrites. I'm
>> working on a fix.
>>
>> For now I'd suggest setting the noout and norecover flags to avoid
>> hitting this bug any more by avoiding recovery. Backfilling with no
>> client I/O would also avoid the bug.
>
> I forgot to mention the tracker ticket for this bug is:
> http://tracker.ceph.com/issues/23195
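
P.S. For the list archives, the flags Josh suggested can be set with the
standard CLI. A minimal sketch (adapt to your own cluster):

  # stop OSDs from being marked out, and pause recovery cluster-wide
  ceph osd set noout
  ceph osd set norecover

  # confirm the flags took effect
  ceph osd dump | grep flags

  # once a fixed build is running, restore normal behavior
  ceph osd unset norecover
  ceph osd unset noout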
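
P.P.S. For anyone wanting to capture a log like the one above: those debug
settings can go in ceph.conf under [osd], or be injected at runtime. A rough
sketch (osd.422 matches the log I posted; substitute your own OSD id):

  # ceph.conf on the affected host
  [osd]
      debug osd = 20
      debug ms = 1

  # or at runtime, without restarting the daemon
  ceph tell osd.422 injectargs '--debug-osd 20 --debug-ms 1'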