Hi everyone,

[Keeping this on the -users list for now. Let me know if I should cross-post to -devel.]

I've been asked to help out on a Dumpling cluster (a system "bequeathed" by one admin to the next, currently on 0.67.10; it was originally installed with 0.67.5 and subsequently updated a few times), and I'm seeing a rather odd issue there. The cluster is relatively small: 3 MONs and 4 OSD nodes, each OSD node hosting a rather non-ideal 12 OSDs, but the resulting performance issues aren't really the point here.

"ceph health detail" shows a bunch of PGs peering, but the usual troubleshooting steps don't really seem to work. For some PGs, "ceph pg <pgid> query" just blocks and doesn't return anything. Adding --debug_ms=10 shows that it's simply not getting a response back from one of the OSDs it's trying to talk to, as if packets were being dropped on the floor or filtered out. However, opening a plain TCP connection to the OSD's IP and port works perfectly fine (netcat returns a Ceph signature).

(Note, though, that because of a daemon flapping issue they at some point set both "noout" and "nodown", so the cluster may not be behaving as normally expected when OSDs fail to respond in time.)
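For reference, this is roughly what I ran (the PG ID and the OSD address below are just examples from this cluster, nothing special about them):

```shell
# Query an affected PG with message debugging turned up; the request
# to the primary OSD goes out, but no reply ever comes back.
ceph pg 6.c10 query --debug_ms=10

# Yet a raw TCP connection to the same OSD's address and port works:
# the daemon immediately writes its banner ("ceph v027"), so the
# socket itself is fine.
nc 10.47.16.33 6818 | head -c 9
```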
Then there are some PGs where "ceph pg <pgid> query" is a little more verbose, though not exactly more successful.

From ceph health detail:

pg 6.c10 is stuck inactive for 1477.781394, current state peering, last acting [85,16]

ceph pg 6.b1 query:

2014-09-15 01:06:48.200418 7f29a6efc700 0 cephx: verify_reply couldn't decrypt with error: error decoding block for decryption
2014-09-15 01:06:48.200428 7f29a6efc700 0 -- 10.47.17.1:0/1020420 >> 10.47.16.33:6818/15630 pipe(0x2c00b00 sd=4 :43263 s=1 pgs=0 cs=0 l=1 c=0x2c00d90).failed verifying authorize reply
2014-09-15 01:06:48.200465 7f29a6efc700 0 -- 10.47.17.1:0/1020420 >> 10.47.16.33:6818/15630 pipe(0x2c00b00 sd=4 :43263 s=1 pgs=0 cs=0 l=1 c=0x2c00d90).fault
2014-09-15 01:06:48.201000 7f29a6efc700 0 cephx: verify_reply couldn't decrypt with error: error decoding block for decryption
2014-09-15 01:06:48.201008 7f29a6efc700 0 -- 10.47.17.1:0/1020420 >> 10.47.16.33:6818/15630 pipe(0x2c00b00 sd=4 :43264 s=1 pgs=0 cs=0 l=1 c=0x2c00d90).failed verifying authorize reply

Oops. Now the admins swear they didn't touch the keys, but they are also (understandably) reluctant to just kill and redeploy all those OSDs, as these issues are scattered over a bunch of PGs touching many OSDs. How would they pinpoint this, to make sure they're not being bitten by a bug or misconfiguration?

Not sure if people have seen this before; if so, I'd be grateful for some input. Loïc, Sébastien perhaps? Or João, Greg, Sage?

Thanks in advance for any insight people might be able to share. :)

Cheers,
Florian
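P.S.: For what it's worth, the first thing I was planning to have them try is comparing a suspect OSD's local keyring against what the MONs have on record, and ruling out clock skew, since cephx tickets are time-sensitive. Roughly this, assuming the default keyring location and using osd.85 from the example above (the node hostname is a placeholder):

```shell
# The key the MONs believe osd.85 has:
ceph auth get osd.85

# The key the OSD daemon is actually using (default location on the
# OSD node); if the two differ, cephx authorization will fail:
cat /var/lib/ceph/osd/ceph-85/keyring

# Also compare clocks across nodes; significant skew can produce
# cephx decryption/verification failures like the ones above:
date; ssh <osd-node> date
```

If the keys match and the clocks agree, I'd be back to suspecting a bug, which is why I'm asking here first.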