Hi,

Quentin Hartman wrote:

> Since I have been in ceph-land today, it reminded me that I needed to close
> the loop on this. I was finally able to isolate this problem down to a
> faulty NIC on the ceph cluster network. It "worked", but it was
> accumulating a huge number of Rx errors. My best guess is some receive
> buffer cache failed? Anyway, having a NIC go weird like that is totally
> consistent with all the weird problems I was seeing, the corrupted PGs, and
> the inability for the cluster to settle down.
>
> As a result we've added NIC error rates to our monitoring suite on the
> cluster so we'll hopefully see this coming if it ever happens again.

Good for you. ;)

Could you post here the command that you use to get NIC error rates?

--
François Lafont