Re: Still CRUSH problems with 0.94.1 ? (explained)

On Wed, Apr 22, 2015 at 3:18 AM, fred@xxxxxxxxxx <fred@xxxxxxxxxx> wrote:
> Hi all,
>
> following up on my email from yesterday, I have interesting information
> confirming that the problem is not related to Hammer at all.
> Seeing nothing that explained the weird behavior, I reinstalled Giant and
> got the same symptoms, which led me to think it had to be hardware
> related...
>
> And it was!
>
> Our nodes are dual-attached: two 10 Gbps emX interfaces bonded with LACP
> (mode 4). The IP sits on the bond0 interface. I discovered that one of the
> em interfaces on a host wasn't talking and was in fact down.
>
> Even though bond0 is UP and the host communicates normally, a physical
> interface being down results in stuck/unclean/peering PGs!
>
> This leads me to think that on startup the ceph-osd processes bind
> themselves to one or the other of the two emX physical interfaces, whatever
> their state, and not to the bond0 interface as they should.
>
> If you bring the faulty interface back UP and restart ceph on the node, all
> the stuck/unclean/peering PGs disappear.
>
> Now say you lose a physical interface during normal operation (without
> restarting ceph on the node) and then generate some activity (like a pool
> create)...
> ->  peering/stuck/unclean PGs reappear, demonstrating the processes'
> attachment to the physical interface.
>
> Now the question, since this compromises redundancy: is this behavior by
> design?

This is definitely not the designed behavior, and we have mechanisms
in place to prevent it: OSDs heartbeat their peers on every (software)
NIC they use to communicate with each other!
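
As a quick check (just a sketch; the exact column layout varies a bit by
release), the OSD map itself lists the addresses each daemon has
registered, including its heartbeat addresses:

  ceph osd dump | grep '^osd'

Each osd line shows several ip:port entries (if I remember the order
right: public, cluster, and the two heartbeat endpoints), and all of them
should carry the bond0 IP rather than anything tied to a single emX
slave.
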
I'm not very familiar with the mechanics of link aggregation, but I
suspect the fault lies there somehow. Perhaps the heartbeat packets
are all being handled by the working NIC but a bunch of other messages
are traversing the faulty one and not getting rerouted? We do open up
new sockets for the heartbeating, so depending on your configuration
that could be what's going on.
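
If you want to double-check what the daemons and the bond are actually
doing, something along these lines should show it (just a sketch, adjust
the flags to taste):

  ss -tlnp | grep ceph-osd        # listening sockets of the OSD daemons
  cat /proc/net/bonding/bond0     # per-slave link and LACP state

The OSD sockets should all show the bond0 IP (the choice of physical
slave happens below the IP layer), and the bonding proc file will tell
you which slave the kernel thinks is down and whether it is still part of
the active aggregator.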

I have no idea what one needs to do to convince LACP or whatever that
two different sessions are actually part of the same flow and need to
use the same routing, though... :/
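
If it does turn out to be a hashing issue, the bond's transmit hash
policy is probably the knob that decides it: with layer2 hashing every
frame between the same two MACs goes out on the same slave, while
layer3+4 spreads TCP sessions across slaves by address and port.
Checking it is easy (again just a sketch; the paths assume the standard
Linux bonding driver):

  cat /sys/class/net/bond0/bonding/xmit_hash_policy
  grep -i "hash policy" /proc/net/bonding/bond0

Keep in mind this only covers the transmit side on the host; what the
switch hashes on is its own configuration.
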
-Greg


>
> Frederic
>
>
>> fred@xxxxxxxxxx <fred@xxxxxxxxxx> wrote on 21/04/15 at 15:03:
>>
>> Hi all,
>>
>> could there be a problem with the CRUSH function during a 'from scratch'
>> installation of 0.94.1-0?
>>
>> I have tested this many times, with ceph-deploy-1.5.22-0 and
>> ceph-deploy-1.5.23-0, on RHEL7.
>>
>> Each time, the new cluster ends up in a weird state I never saw with my
>> previously installed versions (0.94, 0.87.1):
>> - I've seen things perhaps linked to ceph-deploy-1.5.23-0, such as one or
>> more monitors being unable to form the cluster (with respawning 'python
>> /usr/sbin/ceph-create-keys' messages). But I think that's a separate part
>> of the issue.
>> - The main issue is visible as a warning on the health of the PGs as soon
>> as the cluster is formed enough to answer a 'ceph -s'.
>>
>> - here is a single-mon, almost empty, freshly installed cluster:
>>
>> ROOT > ceph -s
>>    cluster e581ab43-d0f5-4ea8-811f-94c8df16d044
>>     health HEALTH_WARN
>>            2 pgs degraded
>>            14 pgs peering
>>            4 pgs stale
>>            2 pgs stuck degraded
>>            25 pgs stuck inactive
>>            4 pgs stuck stale
>>            27 pgs stuck unclean
>>            2 pgs stuck undersized
>>            2 pgs undersized
>>            too few PGs per OSD (3 < min 30)
>>     monmap e1: 1 mons at {helga=10.10.10.64:6789/0}
>>            election epoch 2, quorum 0 helga
>>     osdmap e398: 60 osds: 60 up, 60 in; 2 remapped pgs
>>      pgmap v1553: 64 pgs, 1 pools, 0 bytes data, 0 objects
>>            2829 MB used, 218 TB / 218 TB avail
>>                  37 active+clean
>>                  12 peering
>>                  11 activating
>>                   2 stale+active+undersized+degraded
>>                   2 stale+remapped+peering
>>
>> Over time, the number of problem PGs grows. It literally explodes if we
>> put objects on the cluster.
>>
>> - a 'ceph health detail' shows, for example, entries like this one:
>> pg 0.22 is stuck inactive since forever, current state peering, last
>> acting [18,17,0]
>>
>> - A query on the PG shows:
>> ceph pg  0.22 query
>> {
>>    "state": "peering",
>> ../..
>>     "up": [
>>        18,
>>        17,
>>        0
>>    ],
>>           "blocked_by": [
>>                0,
>>                1,
>>                5,
>>                17
>>            ],
>> ../..
>> }
>>
>>
>> If my understanding of the ceph query is correct, OSDs 1, 5 and 17 have
>> nothing to do with this PG... Where do they come from??
>> Couldn't this be part of the "critical issues with CRUSH" that 0.94.1 is
>> meant to correct?
>>
>> Frederic
>>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




