On Wed, Feb 8, 2017 at 10:25 AM, <george.vasilakakos@xxxxxxxxxx> wrote:
> Hi Greg,
>
>> Yes, "bad crc" indicates that the checksums on an incoming message did
>> not match what was provided — ie, the message got corrupted. You
>> shouldn't try and fix that by playing around with the peering settings
>> as it's not a peering bug.
>> Unless there's a bug in the messaging layer causing this (very
>> unlikely), you have bad hardware or a bad network configuration
>> (people occasionally talk about MTU settings?). Fix that and things
>> will work; don't, and the only software tweaks you could apply are
>> more likely to result in lost data than a happy cluster.
>> -Greg
>
> I thought of the network initially, but I didn't observe packet loss
> between the two hosts, and neither host is having trouble talking to
> the rest of its peers. It's just these two OSDs that can't talk to
> each other, so I figured it's not likely to be a network issue.
> Network monitoring does show virtually non-existent inbound traffic
> over those links compared to the other ports on the switch, but no
> other peerings fail.
>
> Is there anything you can suggest to drill down deeper?

Sadly no. It being a single route is indeed weird, and hopefully
somebody with more networking background can suggest a cause. :)

> Also, am I correct in assuming that I can pull one of these OSDs from
> the cluster as a last resort, to cause a remapping to a different OSD,
> to potentially give this a quick/temp fix and get the cluster serving
> I/O properly again?

I'd expect so!
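
To make the "bad crc" diagnosis above concrete, here is a minimal sketch
of the kind of check a messenger does on receipt. The frame layout below
is invented for illustration and uses zlib.crc32, whereas Ceph's
messenger has its own wire format and uses crc32c; the point is only
that the checksum carried with a message has to match a checksum
recomputed over the bytes that actually arrived, and flipping a single
bit in flight is enough to trip it.

import struct
import zlib

def pack_frame(payload: bytes) -> bytes:
    """Prepend a length and a CRC32 of the payload (hypothetical format)."""
    return struct.pack("!II", len(payload), zlib.crc32(payload)) + payload

def unpack_frame(frame: bytes) -> bytes:
    """Verify the CRC on receipt; a mismatch means corruption in flight."""
    length, expected = struct.unpack("!II", frame[:8])
    payload = frame[8:8 + length]
    actual = zlib.crc32(payload)
    if actual != expected:
        raise ValueError("bad crc: got 0x%08x, expected 0x%08x"
                         % (actual, expected))
    return payload

if __name__ == "__main__":
    frame = bytearray(pack_frame(b"osd_op payload"))
    frame[-1] ^= 0x01   # flip one bit, as flaky hardware or a NIC might
    try:
        unpack_frame(bytes(frame))
    except ValueError as e:
        print(e)        # -> bad crc: got 0x........, expected 0x........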
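
On the MTU angle Greg mentions: one quick check is whether jumbo-sized
datagrams can leave the host with the Don't Fragment bit set. The sketch
below is Linux-only (the numeric socket option values come from
<linux/in.h>), and the peer address and port are placeholders; the usual
shell equivalent on Linux is "ping -M do -s 8972 <peer>" for a 9000-byte
MTU. Note the caveat in the comments: this only catches a limit known to
the local kernel, not a switch silently dropping oversized frames
mid-path, so check interface and switch counters on both ends as well.

import socket

# Linux setsockopt constants (not exposed by the socket module everywhere).
IP_MTU_DISCOVER = 10   # option name
IP_PMTUDISC_DO = 2     # always set the Don't Fragment bit

def probe(host: str, port: int, payload_size: int) -> bool:
    """Try to send a UDP datagram of payload_size bytes with DF set.

    An EMSGSIZE error means the local interface MTU (or a cached path
    MTU) is smaller. A silent drop in the middle of the path will NOT
    raise here, so pair this with a listener or counters on the far end.
    """
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.IPPROTO_IP, IP_MTU_DISCOVER, IP_PMTUDISC_DO)
    try:
        s.sendto(b"\x00" * payload_size, (host, port))
        return True
    except OSError as e:
        print("send failed at %d bytes: %s" % (payload_size, e))
        return False
    finally:
        s.close()

if __name__ == "__main__":
    # 8972 = 9000-byte MTU minus 20 bytes IP header and 8 bytes UDP header.
    # 192.0.2.10:12345 is a placeholder for the peer OSD host.
    probe("192.0.2.10", 12345, 8972)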
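
And for the "pull one of these OSDs" last resort: marking the OSD out
(rather than stopping or removing it) should be enough to make CRUSH
remap its PGs and let the stuck PG peer with a different OSD, and it is
easily reversed once the link is fixed. A minimal sketch, assuming the
ceph CLI and admin credentials are available on the host; the OSD id
used here is a placeholder:

import subprocess

OSD_ID = "12"  # placeholder: one of the two OSDs that cannot peer

def ceph(*args: str) -> str:
    """Run a ceph CLI command and return its output."""
    return subprocess.check_output(("ceph",) + args, text=True)

if __name__ == "__main__":
    # "osd out" reweights the OSD to 0 without stopping the daemon,
    # so its PGs are remapped elsewhere and data migrates away.
    print(ceph("osd", "out", OSD_ID))
    print(ceph("-s"))   # watch recovery/backfill progress
    # Once the underlying network/hardware problem is fixed:
    # print(ceph("osd", "in", OSD_ID))   # bring it back and rebalance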