Not sure, but have you checked the clocks on their nodes? Extreme clock
drift often results in strange cephx errors.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com

On Sun, Sep 14, 2014 at 11:03 PM, Florian Haas <florian at hastexo.com> wrote:
> Hi everyone,
>
> [Keeping this on the -users list for now. Let me know if I should
> cross-post to -devel.]
>
> I've been asked to help out on a Dumpling cluster (a system
> "bequeathed" by one admin to the next, currently on 0.67.10, was
> originally installed with 0.67.5 and subsequently updated a few
> times), and I'm seeing a rather odd issue there. The cluster is
> relatively small, 3 MONs, 4 OSD nodes; each OSD node hosts a rather
> non-ideal 12 OSDs but its performance issues aren't really the point
> here.
>
> "ceph health detail" shows a bunch of PGs peering, but the usual
> troubleshooting steps don't really seem to work.
>
> For some PGs, "ceph pg <pgid> query" just blocks, doesn't return
> anything. Adding --debug_ms=10 shows that it's simply not getting a
> response back from one of the OSDs it's trying to talk to, as if
> packets dropped on the floor or were filtered out. However, opening a
> simple TCP connection to the OSD's IP and port works perfectly fine
> (netcat returns a Ceph signature).
>
> (Note, though, that because of a daemon flapping issue they at some
> point set both "noout" and "nodown", so the cluster may not be
> behaving as normally expected when OSDs fail to respond in time.)
>
> Then there are some PGs where "ceph pg <pgid> query" is a little more
> verbose, though not exactly more successful:
>
> From ceph health detail:
>
> pg 6.c10 is stuck inactive for 1477.781394, current state peering,
> last acting [85,16]
>
> ceph pg 6.b1 query:
>
> 2014-09-15 01:06:48.200418 7f29a6efc700 0 cephx: verify_reply
> couldn't decrypt with error: error decoding block for decryption
> 2014-09-15 01:06:48.200428 7f29a6efc700 0 -- 10.47.17.1:0/1020420 >>
> 10.47.16.33:6818/15630 pipe(0x2c00b00 sd=4 :43263 s=1 pgs=0 cs=0 l=1
> c=0x2c00d90).failed verifying authorize reply
> 2014-09-15 01:06:48.200465 7f29a6efc700 0 -- 10.47.17.1:0/1020420 >>
> 10.47.16.33:6818/15630 pipe(0x2c00b00 sd=4 :43263 s=1 pgs=0 cs=0 l=1
> c=0x2c00d90).fault
> 2014-09-15 01:06:48.201000 7f29a6efc700 0 cephx: verify_reply
> couldn't decrypt with error: error decoding block for decryption
> 2014-09-15 01:06:48.201008 7f29a6efc700 0 -- 10.47.17.1:0/1020420 >>
> 10.47.16.33:6818/15630 pipe(0x2c00b00 sd=4 :43264 s=1 pgs=0 cs=0 l=1
> c=0x2c00d90).failed verifying authorize reply
>
> Oops. Now the admins swear they didn't touch the keys, but they are
> also (understandably) reluctant to just kill and redeploy all those
> OSDs, as these issues are basically scattered over a bunch of PGs
> touching many OSDs. How would they pinpoint this to be sure that
> they're not being bitten by a bug or misconfiguration?
>
> Not sure if people have seen this before -- if so, I'd be grateful for
> some input. Loïc, Sébastien perhaps? Or João, Greg, Sage?
>
> Thanks in advance for any insight people might be able to share. :)
>
> Cheers,
> Florian
> _______________________________________________
> ceph-users mailing list
> ceph-users at lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
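To make the clock suggestion above concrete: one quick way to check for
skew on a Dumpling-era cluster is to compare clocks across the MON and OSD
nodes and see whether the monitors are already complaining. This is just a
sketch -- it assumes the nodes run ntpd (chrony users would look at
"chronyc sources" instead):

    # On each MON/OSD node: is the local clock in sync with its NTP peers?
    ntpq -p     # offset/jitter should be in the low-millisecond range
    date -u     # eyeballing this across all nodes also catches gross drift

    # Skew between the monitors themselves shows up directly in health output:
    ceph health detail | grep -i 'clock skew'

If significant skew turns up, fixing time sync on the affected nodes is
usually enough for the cephx errors to clear.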
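On the "they swear they didn't touch the keys" question: those verify_reply
decrypt failures typically point at either clock skew or a genuine key
mismatch, so one thing worth ruling out without redeploying anything is a
difference between the key the monitors have on file for an OSD and the
keyring that OSD loaded from disk. A sketch, using osd.85 from the
"last acting [85,16]" line purely as an example and assuming the default
osd data path:

    # What the monitors believe osd.85's key is:
    ceph auth get osd.85

    # What osd.85 is actually using (default keyring location;
    # adjust if osd_data points somewhere non-standard):
    cat /var/lib/ceph/osd/ceph-85/keyring

If those keys differ on any OSD, that's the misconfiguration; if they all
match and the clocks check out, it starts to look more like something worth
taking to -devel.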