Dumpling cluster can't resolve peering failures, ceph pg query blocks, auth failures in logs

Thanks, I did check on that too, as I'd seen this before and this was
"the usual drill", but alas, no, that wasn't the problem. This cluster
is having other issues too, though, so I probably need to look into
those first.
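
(For the archives, "checking the clocks" means something along the lines
of the following on each node, assuming ntpd is in use, plus keeping an
eye out for the MONs' own skew warnings:)

  ntpq -p                             # peer offsets should be a few ms at most
  ceph health detail | grep -i skew   # any "clock skew detected on mon.X" lines?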

Cheers,
Florian

On Mon, Sep 15, 2014 at 7:29 PM, Gregory Farnum <greg at inktank.com> wrote:
> Not sure, but have you checked the clocks on their nodes? Extreme
> clock drift often results in strange cephx errors.
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
>
>
> On Sun, Sep 14, 2014 at 11:03 PM, Florian Haas <florian at hastexo.com> wrote:
>> Hi everyone,
>>
>> [Keeping this on the -users list for now. Let me know if I should
>> cross-post to -devel.]
>>
>> I've been asked to help out on a Dumpling cluster (a system
>> "bequeathed" by one admin to the next; it is currently on 0.67.10, was
>> originally installed with 0.67.5, and has been updated a few times
>> since), and I'm seeing a rather odd issue there. The cluster is
>> relatively small: 3 MONs and 4 OSD nodes, each OSD node hosting a
>> rather non-ideal 12 OSDs, but the cluster's performance issues aren't
>> really the point here.
>>
>> "ceph health detail" shows a bunch of PGs peering, but the usual
>> troubleshooting steps don't really seem to work.
>>
>> For some PGs, "ceph pg <pgid> query" just blocks and never returns
>> anything. Adding --debug_ms=10 shows that it simply isn't getting a
>> response back from one of the OSDs it's trying to talk to, as if the
>> packets were being dropped on the floor or filtered out. However,
>> opening a plain TCP connection to the OSD's IP and port works
>> perfectly fine (netcat returns a Ceph signature).
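>>
>> (Concretely, that was along the lines of the following; the pgid and
>> the OSD address are placeholders to be filled in from "ceph health
>> detail":)
>>
>>   ceph --debug_ms=10 pg <pgid> query
>>   nc -v <osd-ip> <osd-port>    # connects fine, the OSD sends its banner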
>>
>> (Note, though, that because of a daemon flapping issue they at some
>> point set both "noout" and "nodown", so the cluster may not be
>> behaving as one would normally expect when OSDs fail to respond in
>> time.)
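>>
>> (For reference, the flags show up in the osdmap and could be cleared
>> once the flapping itself is dealt with:)
>>
>>   ceph osd dump | grep ^flags   # lists the flags currently set
>>   ceph osd unset noout
>>   ceph osd unset nodown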
>>
>> Then there are some PGs where "ceph pg <pgid> query" is a little more
>> verbose, though not exactly more successful:
>>
>> From ceph health detail:
>>
>> pg 6.c10 is stuck inactive for 1477.781394, current state peering,
>> last acting [85,16]
>>
>> ceph pg 6.b1 query:
>>
>> 2014-09-15 01:06:48.200418 7f29a6efc700  0 cephx: verify_reply
>> couldn't decrypt with error: error decoding block for decryption
>> 2014-09-15 01:06:48.200428 7f29a6efc700  0 -- 10.47.17.1:0/1020420 >>
>> 10.47.16.33:6818/15630 pipe(0x2c00b00 sd=4 :43263 s=1 pgs=0 cs=0 l=1
>> c=0x2c00d90).failed verifying authorize reply
>> 2014-09-15 01:06:48.200465 7f29a6efc700  0 -- 10.47.17.1:0/1020420 >>
>> 10.47.16.33:6818/15630 pipe(0x2c00b00 sd=4 :43263 s=1 pgs=0 cs=0 l=1
>> c=0x2c00d90).fault
>> 2014-09-15 01:06:48.201000 7f29a6efc700  0 cephx: verify_reply
>> couldn't decrypt with error: error decoding block for decryption
>> 2014-09-15 01:06:48.201008 7f29a6efc700  0 -- 10.47.17.1:0/1020420 >>
>> 10.47.16.33:6818/15630 pipe(0x2c00b00 sd=4 :43264 s=1 pgs=0 cs=0 l=1
>> c=0x2c00d90).failed verifying authorize reply
>>
>> Oops. Now the admins swear they didn't touch the keys, but they are
>> also (understandably) reluctant to just kill and redeploy all those
>> OSDs, as these issues are basically scattered over a bunch of PGs
>> touching many OSDs. How would they pinpoint this to be sure that
>> they're not being bitten by a bug or misconfiguration?
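>>
>> (The kind of check I have in mind is comparing, per OSD, the key the
>> MONs hand out with what's in the OSD's local keyring; the osd id and
>> path below are just examples assuming the default layout:)
>>
>>   ceph auth get osd.85
>>   cat /var/lib/ceph/osd/ceph-85/keyring
>>
>> If those two ever disagree, that would at least narrow this down to a
>> key problem rather than a messenger bug.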
>>
>> Not sure if people have seen this before; if so, I'd be grateful for
>> some input. Loïc, Sébastien perhaps? Or João, Greg, Sage?
>>
>> Thanks in advance for any insight people might be able to share. :)
>>
>> Cheers,
>> Florian

