Not sure, but have you checked the clocks on their nodes? Extreme clock
drift often results in strange cephx errors.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com

On Sun, Sep 14, 2014 at 11:03 PM, Florian Haas <florian at hastexo.com> wrote:
> Hi everyone,
>
> [Keeping this on the -users list for now. Let me know if I should
> cross-post to -devel.]
>
> I've been asked to help out on a Dumpling cluster (a system
> "bequeathed" by one admin to the next, currently on 0.67.10, was
> originally installed with 0.67.5 and subsequently updated a few
> times), and I'm seeing a rather odd issue there. The cluster is
> relatively small, 3 MONs, 4 OSD nodes; each OSD node hosts a rather
> non-ideal 12 OSDs but its performance issues aren't really the point
> here.
>
> "ceph health detail" shows a bunch of PGs peering, but the usual
> troubleshooting steps don't really seem to work.
>
> For some PGs, "ceph pg <pgid> query" just blocks, doesn't return
> anything. Adding --debug_ms=10 shows that it's simply not getting a
> response back from one of the OSDs it's trying to talk to, as if
> packets dropped on the floor or were filtered out. However, opening a
> simple TCP connection to the OSD's IP and port works perfectly fine
> (netcat returns a Ceph signature).
>
> (Note, though, that because of a daemon flapping issue they at some
> point set both "noout" and "nodown", so the cluster may not be
> behaving as normally expected when OSDs fail to respond in time.)
>
> Then there are some PGs where "ceph pg <pgid> query" is a little more
> verbose, though not exactly more successful:
>
> From ceph health detail:
>
> pg 6.c10 is stuck inactive for 1477.781394, current state peering,
> last acting [85,16]
>
> ceph pg 6.b1 query:
>
> 2014-09-15 01:06:48.200418 7f29a6efc700 0 cephx: verify_reply
> couldn't decrypt with error: error decoding block for decryption
> 2014-09-15 01:06:48.200428 7f29a6efc700 0 -- 10.47.17.1:0/1020420 >>
> 10.47.16.33:6818/15630 pipe(0x2c00b00 sd=4 :43263 s=1 pgs=0 cs=0 l=1
> c=0x2c00d90).failed verifying authorize reply
> 2014-09-15 01:06:48.200465 7f29a6efc700 0 -- 10.47.17.1:0/1020420 >>
> 10.47.16.33:6818/15630 pipe(0x2c00b00 sd=4 :43263 s=1 pgs=0 cs=0 l=1
> c=0x2c00d90).fault
> 2014-09-15 01:06:48.201000 7f29a6efc700 0 cephx: verify_reply
> couldn't decrypt with error: error decoding block for decryption
> 2014-09-15 01:06:48.201008 7f29a6efc700 0 -- 10.47.17.1:0/1020420 >>
> 10.47.16.33:6818/15630 pipe(0x2c00b00 sd=4 :43264 s=1 pgs=0 cs=0 l=1
> c=0x2c00d90).failed verifying authorize reply
>
> Oops. Now the admins swear they didn't touch the keys, but they are
> also (understandably) reluctant to just kill and redeploy all those
> OSDs, as these issues are basically scattered over a bunch of PGs
> touching many OSDs. How would they pinpoint this to be sure that
> they're not being bitten by a bug or misconfiguration?
>
> Not sure if people have seen this before -- if so, I'd be grateful for
> some input. Loïc, Sébastien perhaps? Or João, Greg, Sage?
>
> Thanks in advance for any insight people might be able to share. :)
>
> Cheers,
> Florian
> _______________________________________________
> ceph-users mailing list
> ceph-users at lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
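To make the clock suggestion above concrete: one quick way to check for
skew on a Dumpling-era cluster is to compare clocks across the MON and OSD
nodes and see whether the monitors are already complaining. This is just a
sketch -- it assumes the nodes run ntpd (chrony users would look at
"chronyc sources" instead):

    # On each MON/OSD node: is the local clock in sync with its NTP peers?
    ntpq -p     # offset/jitter should be in the low-millisecond range
    date -u     # eyeballing this across all nodes also catches gross drift

    # Skew between the monitors themselves shows up directly in health output:
    ceph health detail | grep -i 'clock skew'

If significant skew turns up, fixing time sync on the affected nodes is
usually enough for the cephx errors to clear.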
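On the "they swear they didn't touch the keys" question: those verify_reply
decrypt failures typically point at either clock skew or a genuine key
mismatch, so one thing worth ruling out without redeploying anything is a
difference between the key the monitors have on file for an OSD and the
keyring that OSD loaded from disk. A sketch, using osd.85 from the
"last acting [85,16]" line purely as an example and assuming the default
osd data path:

    # What the monitors believe osd.85's key is:
    ceph auth get osd.85

    # What osd.85 is actually using (default keyring location;
    # adjust if osd_data points somewhere non-standard):
    cat /var/lib/ceph/osd/ceph-85/keyring

If those keys differ on any OSD, that's the misconfiguration; if they all
match and the clocks check out, it starts to look more like something worth
taking to -devel.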