Re: Cascading Failure of OSDs

Right now we're just scraping the output of ifconfig:

ifconfig p2p1 | grep -e 'RX\|TX' | grep packets | awk '{print $3}'

It's clunky, but it works. I'm sure there's a cleaner way, but this was expedient.
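(One cleaner option, as an untested sketch assuming a Linux host where the kernel exposes per-interface counters under sysfs, would be to read the error counters directly instead of parsing ifconfig's output format:

# Per-interface error counters straight from the kernel (Linux sysfs)
cat /sys/class/net/p2p1/statistics/rx_errors
cat /sys/class/net/p2p1/statistics/tx_errors

That avoids breaking if the ifconfig output layout ever changes.)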

QH


On Tue, Mar 31, 2015 at 5:05 PM, Francois Lafont <flafdivers@xxxxxxx> wrote:
Hi,

Quentin Hartman wrote:

> Since I have been in ceph-land today, it reminded me that I needed to close
> the loop on this. I was finally able to isolate this problem down to a
> faulty NIC on the ceph cluster network. It "worked", but it was
> accumulating a huge number of Rx errors. My best guess is some receive
> buffer cache failed? Anyway, having a NIC go weird like that is totally
> consistent with all the weird problems I was seeing, the corrupted PGs, and
> the inability for the cluster to settle down.
>
> As a result we've added NIC error rates to our monitoring suite on the
> cluster so we'll hopefully see this coming if it ever happens again.

Good for you. ;)

Could you post here the command that you use to get NIC error rates?

--
François Lafont
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

