Hi everyone,

[Keeping this on the -users list for now. Let me know if I should cross-post to -devel.]

I've been asked to help out on a Dumpling cluster (a system "bequeathed" by one admin to the next, currently on 0.67.10; it was originally installed with 0.67.5 and subsequently updated a few times), and I'm seeing a rather odd issue there. The cluster is relatively small: 3 MONs and 4 OSD nodes, each OSD node hosting a rather non-ideal 12 OSDs, but the resulting performance issues aren't really the point here.

"ceph health detail" shows a bunch of PGs peering, but the usual troubleshooting steps don't really seem to work. For some PGs, "ceph pg <pgid> query" just blocks and doesn't return anything. Adding --debug_ms=10 shows that it's simply not getting a response back from one of the OSDs it's trying to talk to, as if packets were being dropped on the floor or filtered out. However, opening a plain TCP connection to the OSD's IP and port works perfectly fine (netcat returns a Ceph signature).

(Note, though, that because of a daemon flapping issue they at some point set both "noout" and "nodown", so the cluster may not be behaving as normally expected when OSDs fail to respond in time.)
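For reference, this is roughly what I ran (the PG ID and the OSD address below are just examples from this cluster, nothing special about them):

```shell
# Query an affected PG with message debugging turned up; the request
# to the primary OSD goes out, but no reply ever comes back.
ceph pg 6.c10 query --debug_ms=10

# Yet a raw TCP connection to the same OSD's address and port works:
# the daemon immediately writes its banner ("ceph v027"), so the
# socket itself is fine.
nc 10.47.16.33 6818 | head -c 9
```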
Then there are some PGs where "ceph pg <pgid> query" is a little more verbose, though not exactly more successful.

From ceph health detail:

pg 6.c10 is stuck inactive for 1477.781394, current state peering, last acting [85,16]

ceph pg 6.b1 query:

2014-09-15 01:06:48.200418 7f29a6efc700 0 cephx: verify_reply couldn't decrypt with error: error decoding block for decryption
2014-09-15 01:06:48.200428 7f29a6efc700 0 -- 10.47.17.1:0/1020420 >> 10.47.16.33:6818/15630 pipe(0x2c00b00 sd=4 :43263 s=1 pgs=0 cs=0 l=1 c=0x2c00d90).failed verifying authorize reply
2014-09-15 01:06:48.200465 7f29a6efc700 0 -- 10.47.17.1:0/1020420 >> 10.47.16.33:6818/15630 pipe(0x2c00b00 sd=4 :43263 s=1 pgs=0 cs=0 l=1 c=0x2c00d90).fault
2014-09-15 01:06:48.201000 7f29a6efc700 0 cephx: verify_reply couldn't decrypt with error: error decoding block for decryption
2014-09-15 01:06:48.201008 7f29a6efc700 0 -- 10.47.17.1:0/1020420 >> 10.47.16.33:6818/15630 pipe(0x2c00b00 sd=4 :43264 s=1 pgs=0 cs=0 l=1 c=0x2c00d90).failed verifying authorize reply

Oops. Now the admins swear they didn't touch the keys, but they are also (understandably) reluctant to just kill and redeploy all those OSDs, as these issues are scattered over a bunch of PGs touching many OSDs. How would they pinpoint this, to make sure they're not being bitten by a bug or misconfiguration?

Not sure if people have seen this before; if so, I'd be grateful for some input. Loïc, Sébastien perhaps? Or João, Greg, Sage?

Thanks in advance for any insight people might be able to share. :)

Cheers,
Florian
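P.S.: For what it's worth, the first thing I was planning to have them try is comparing a suspect OSD's local keyring against what the MONs have on record, and ruling out clock skew, since cephx tickets are time-sensitive. Roughly this, assuming the default keyring location and using osd.85 from the example above (the node hostname is a placeholder):

```shell
# The key the MONs believe osd.85 has:
ceph auth get osd.85

# The key the OSD daemon is actually using (default location on the
# OSD node); if the two differ, cephx authorization will fail:
cat /var/lib/ceph/osd/ceph-85/keyring

# Also compare clocks across nodes; significant skew can produce
# cephx decryption/verification failures like the ones above:
date; ssh <osd-node> date
```

If the keys match and the clocks agree, I'd be back to suspecting a bug, which is why I'm asking here first.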