Re: Cascading Failure of OSDs

Hi,

Quentin Hartman wrote:

> Since I have been in ceph-land today, it reminded me that I needed to close
> the loop on this. I was finally able to isolate this problem down to a
> faulty NIC on the ceph cluster network. It "worked", but it was
> accumulating a huge number of Rx errors. My best guess is some receive
> buffer cache failed? Anyway, having a NIC go weird like that is totally
> consistent with all the weird problems I was seeing, the corrupted PGs, and
> the inability for the cluster to settle down.
> 
> As a result we've added NIC error rates to our monitoring suite on the
> cluster so we'll hopefully see this coming if it ever happens again.

Good for you. ;)

Could you post the command you use to get NIC error rates here?
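In the meantime, here is the sort of thing I have in mind. This is a minimal sketch of my own, not Quentin's actual setup. On Linux, `ip -s link` shows the cumulative RX error counters, and the kernel exposes the same counters as files under /sys/class/net/<iface>/statistics/. The Python below samples rx_errors twice and turns the difference into a rate. The function names are hypothetical.

```python
import time
from pathlib import Path

# Standard Linux location of per-interface counters (rx_errors, rx_packets, ...).
SYSFS_NET = Path("/sys/class/net")


def read_counter(iface: str, name: str, root: Path = SYSFS_NET) -> int:
    """Return one cumulative counter (e.g. rx_errors) for an interface."""
    return int((root / iface / "statistics" / name).read_text().strip())


def rx_error_rate(iface: str, interval: float = 10.0, root: Path = SYSFS_NET) -> float:
    """Rx errors per second, measured over `interval` seconds.

    The sysfs counters are cumulative since the driver was loaded, so a
    rate needs two samples. A value that keeps climbing is the warning sign.
    """
    before = read_counter(iface, "rx_errors", root)
    time.sleep(interval)
    after = read_counter(iface, "rx_errors", root)
    return (after - before) / interval
```

Calling `rx_error_rate("eth1")` (the interface name is just an example) from a monitoring check and alerting on anything persistently above zero would presumably have caught a NIC like the one described above.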

-- 
François Lafont
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com