Re: EC related osd crashes (luminous 12.2.4)

On 04/05/2018 08:11 PM, Josh Durgin wrote:
On 04/05/2018 06:15 PM, Adam Tygart wrote:
Well, the cascading crashes are getting worse. I'm routinely seeing
8-10 of my 518 osds crash. I cannot start 2 of them without triggering
14 or so of them to crash repeatedly for more than an hour.

I've run another one of them with more logging, debug osd = 20; debug
ms = 1 (definitely more than one crash in there):
http://people.cs.ksu.edu/~mozes/ceph-osd.422.log

Anyone have any thoughts? My cluster feels like it is getting more and
more unstable by the hour...

Thanks to your logs, I think I've found the root cause. It looks like a
bug in the EC recovery code that's triggered by EC overwrites. I'm
working on a fix.

For now I'd suggest setting the noout and norecover flags, which stops
recovery and so avoids hitting this bug. Backfilling with no client I/O
would also avoid the bug.
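
For anyone who wants to script the workaround, here is a minimal sketch.
It assumes the standard `ceph osd set <flag>` / `ceph osd unset <flag>`
CLI commands and a host with an admin keyring. The helper names
(flag_commands, apply_flags) are made up for illustration, and it only
prints the commands unless dry_run is turned off.

```python
import subprocess

# Flags named in the suggestion above.
FLAGS = ("noout", "norecover")


def flag_commands(action, flags=FLAGS):
    """Build `ceph osd set|unset <flag>` argument lists without running them."""
    if action not in ("set", "unset"):
        raise ValueError("action must be 'set' or 'unset'")
    return [["ceph", "osd", action, flag] for flag in flags]


def apply_flags(action, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in flag_commands(action):
        print(" ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError if the ceph CLI fails.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    apply_flags("set")  # dry run: prints the two commands only
```

Calling apply_flags("unset", dry_run=False) would clear both flags again
once a fixed build is installed and recovery can resume.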

I forgot to mention the tracker ticket for this bug is:
http://tracker.ceph.com/issues/23195