On 15/11/2019 11:23, Simon Ironside wrote:
> Hi Florian,
>
> Any chance the key your compute nodes are using for the RBD pool is
> missing 'allow command "osd blacklist"' from its mon caps?
>
> Simon

Hi Simon,

I received this off-list but then subsequently saw this message pop up in the list archive, so I hope it's OK to reply on-list?

So that cap was indeed missing, thanks for the hint! However, I am still trying to understand how this is related to the issue we saw.

The only documentation-ish article that I found about osd blacklist caps is this:

https://access.redhat.com/solutions/3391211

We can also confirm a bunch of "access denied" messages in the mon logs when trying to blacklist an OSD. So the content of that article definitely applies to our situation; I'm just not sure I follow how the absence of that capability caused this issue.

The article talks about RBD watchers, not locks. To the best of my knowledge, a watcher operates like a lease on the image, which is periodically renewed. If it is not renewed within 30 seconds of client inactivity, the cluster considers the client dead. (Please correct me if I'm wrong.) For us, that didn't help. We had to actively remove locks with "rbd lock rm". Is the article using the wrong terms? Is there a link between watchers and locks that I'm unaware of?

Semi-relatedly, as I understand it, OSD blacklisting happens based either on an IP address or on a socket address (IP:port). While this comes in handy for host evacuation, it does not help with in-place recovery (see question 4 in my original message).

- If the blacklist happens based on IP address alone (which is what the client seems to be attempting, based on our log messages), then it would break recovery-in-place after a hard reboot altogether.

- Even if the client blacklisted based on an address:port pair, it would be very unlikely, but not impossible, for an RBD client to use the same source port to connect after the node recovers in place.
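
The two cases above can be written down as a toy model. This is only a sketch of the reasoning in this message, not Ceph's actual matching code: the `BlacklistEntry` class, the `is_blocked` function, the addresses and the port numbers are all hypothetical, and the model ignores any other fields a real blacklist entry may carry (an expiry time, for instance).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BlacklistEntry:
    """Hypothetical blacklist entry, used only to illustrate the argument.

    Assumption of this sketch: port=None means "match every port on this IP".
    """
    ip: str
    port: Optional[int] = None


def is_blocked(entry: BlacklistEntry, client_ip: str, client_port: int) -> bool:
    """Return True if a client connecting from client_ip:client_port matches the entry."""
    if entry.ip != client_ip:
        return False
    # An IP-only entry matches any source port; an IP:port entry needs an exact match.
    return entry.port is None or entry.port == client_port


if __name__ == "__main__":
    node_ip = "192.0.2.10"     # documentation-range address, made up
    port_before_crash = 40112  # made-up source port of the dead client
    port_after_reboot = 51876  # made-up source port after in-place recovery

    ip_only = BlacklistEntry(ip=node_ip)
    ip_and_port = BlacklistEntry(ip=node_ip, port=port_before_crash)

    # Case 1: IP-only entry, the recovered node is still locked out.
    print(is_blocked(ip_only, node_ip, port_after_reboot))

    # Case 2: IP:port entry, the recovered node gets through on a new port...
    print(is_blocked(ip_and_port, node_ip, port_after_reboot))

    # ...unless it happens to reuse the old source port.
    print(is_blocked(ip_and_port, node_ip, port_before_crash))
```

Under these assumptions, an IP-only entry keeps matching the node for as long as the entry exists, whereas an IP:port entry only matches again in the unlikely event that the same source port is reused.
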
So I am wondering: is this incorrect documentation, or incorrect behavior, or am I simply making dead-wrong assumptions?

Cheers,
Florian
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com