Re: PG stuck peering after host reboot

<george.vasilakakos@xxxxxxxxxx> · Wed, 8 Feb 2017 18:25:44 +0000

Hi Greg,

> Yes, "bad crc" indicates that the checksums on an incoming message did
> not match what was provided — ie, the message got corrupted. You
> shouldn't try and fix that by playing around with the peering settings
> as it's not a peering bug.
> Unless there's a bug in the messaging layer causing this (very
> unlikely), you have bad hardware or a bad network configuration
> (people occasionally talk about MTU settings?). Fix that and things
> will work; don't and the only software tweaks you could apply are more
> likely to result in lost data than a happy cluster.
> -Greg

I thought of the network initially, but I didn't observe any packet loss between the two hosts, and neither host is having trouble talking to the rest of its peers. It's only these two OSDs that can't talk to each other, so I figured a network issue was unlikely. Network monitoring does show virtually no inbound traffic over those links compared to the other ports on the switch, but no other peerings fail.
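
A plain ping would not necessarily reveal an MTU mismatch of the kind mentioned above, since default ping packets are small. Below is a minimal sketch of how a full-size, don't-fragment ping could be built to test that. It assumes Linux iputils ping (the `-M do` flag) and IPv4 without header options. The hostname is a placeholder, and the MTU values 1500 and 9000 are only common examples, not this cluster's actual settings.

```python
import shlex
from typing import List

IPV4_HEADER_BYTES = 20  # IPv4 header without options (assumption)
ICMP_HEADER_BYTES = 8   # ICMP echo header


def icmp_payload_for_mtu(mtu: int) -> int:
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    overhead = IPV4_HEADER_BYTES + ICMP_HEADER_BYTES
    if mtu <= overhead:
        raise ValueError("MTU too small to carry an ICMP echo payload")
    return mtu - overhead


def df_ping_command(host: str, mtu: int, count: int = 3) -> List[str]:
    """Build a ping command that sets the don't-fragment bit (Linux iputils)."""
    payload = icmp_payload_for_mtu(mtu)
    return ["ping", "-M", "do", "-s", str(payload), "-c", str(count), host]


if __name__ == "__main__":
    # "osd-peer.example" is a hypothetical hostname for the other OSD host.
    for mtu in (1500, 9000):
        print(shlex.join(df_ping_command("osd-peer.example", mtu)))
```

If the small-MTU command succeeds in both directions but the large one fails, that would point at a path that does not carry the larger frames end to end. It would not by itself explain the bad crc messages.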

Is there anything you can suggest to drill down deeper?
Also, am I correct in assuming that, as a last resort, I can pull one of these OSDs from the cluster to force a remapping to a different OSD, as a quick temporary fix to get the cluster serving I/O properly again?

Many thanks for your help,

George
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com