Hi,

I used kraken 11.1.1 from the official deb repo, which has the mentioned patch merged in, and it worked without problems. For reference, here are the steps I took to fix the cluster:

1) Set up a ceph client with the newest kraken version and make sure it can connect to the cluster.

2) Get the broken image id:

# rbd children $BASE_POOL/$BASE_IMAGE@$SNAP_NAME
2016-12-23 12:50:16.825034 7fa8d3271000 -1 librbd: Error looking up name for image id f5999a87a80af30 in pool $CLONE_POOL

Here the id is: f5999a87a80af30

3) Find the key in the rbd_children omap entries that records the child relationship. Search for the image id in the output of:

# rados -p $CLONE_POOL listomapvals rbd_children

The key printed just before the matching value is the one we're looking for. Dump its bytes to a file (see the note after step 4 for how these bytes are laid out):

# echo -ne "\x03\x00\x00\x00\x00\x00\x00\x00\x0f\x00\x00\x00\x62\x65\x30\x31\x38\x35\x37\x35\x66\x36\x65\x67\x30\x61\x35\xf7\x53\x00\x00\x00\x00\x00\x00" > /tmp/key.bin

Double check we've got the right key data - the output should contain the broken image id:

# rados -p $CLONE_POOL getomapval rbd_children --omap-key-file /tmp/key.bin

4) Remove the broken child relation:

# rados -p $CLONE_POOL rmomapkey rbd_children --omap-key-file /tmp/key.bin

After that I had no problems unprotecting and removing the snapshot.
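A note on the key bytes above: as far as I can tell they are simply the parent spec encoded field by field, so in our case they break down like this (double check against your own listomapvals output before removing anything):

  \x03\x00\x00\x00\x00\x00\x00\x00   parent pool id (3 here), 64-bit little-endian
  \x0f\x00\x00\x00                   parent image id length (15 here), 32-bit little-endian
  \x62 ... \x35                      the parent image id itself
  \xf7\x53\x00\x00\x00\x00\x00\x00   parent snapshot id, 64-bit little-endian

A quick sanity check is to compare the dumped file against the key bytes printed by listomapvals:

# hexdump -C /tmp/key.bin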
Thanks all for the help, one more production cluster fixed today :)

Bartek


On Thu, 22 Dec 2016 13:21:40 +0000
Sage Weil <sweil@xxxxxxxxxx> wrote:

> On Thu, 22 Dec 2016, Bartłomiej Święcki wrote:
> > Hi,
> >
> > I have problems running Kraken tools on a Hammer/Jewel cluster (official 11.1.0 debs),
> > it asserts:
> >
> > /build/ceph-11.1.0/src/mon/MonMap.cc: In function 'void MonMap::sanitize_mons(std::map<std::basic_string<char>, entity_addr_t>&)' thread 7fffd37fe700 time 2016-12-22 12:26:23.457058
> > /build/ceph-11.1.0/src/mon/MonMap.cc: 70: FAILED assert(mon_info.count(p.first))
>
> See http://tracker.ceph.com/issues/18265 and
> https://github.com/ceph/ceph/pull/12611
>
> sage
>
> > I tried to debug it a bit and it looks like mon_info has temporary mon names:
> >
> > (gdb) p mon_info
> > $1 = std::map with 3 elements = {["noname-a"] = {name = "noname-a", .....
> >
> > while it checks for a correct one:
> >
> > (gdb) p p
> > $2 = {first = "mon-01-690d38c0-2567-447b-bdfb-0edd137183db", ....
> >
> > Anyway, I was thinking about the missing image problem - maybe it would be easier
> > to recreate the removed image? Would restoring the rbd_header object be enough?
> >
> > P.S. Adding ceph-devel
> >
> > On Thu, 22 Dec 2016 10:10:09 +0100
> > Bartłomiej Święcki <bartlomiej.swiecki@xxxxxxxxxxxx> wrote:
> >
> > > Hi Jason,
> > >
> > > I'll test the kraken tools since this happened in production. Everything works there
> > > because the clone is flattened right after being created, and the production equivalent
> > > of the "test" user can access the image only after it has been flattened.
> > >
> > > The issue happened when someone accidentally removed a not-yet-flattened image
> > > using the user with weaker permissions. Good to hear this has been spotted
> > > already.
> > >
> > > Thanks for the help,
> > > Bartek
> > >
> > > On Wed, 21 Dec 2016 11:53:57 -0500
> > > Jason Dillaman <jdillama@xxxxxxxxxx> wrote:
> > >
> > > > You are unfortunately the second person today to hit an issue where
> > > > "rbd remove" incorrectly proceeds when it hits a corner-case error.
> > > >
> > > > First things first: when you configured your new user, you needed to give it
> > > > "rx" permissions on the parent image's pool. If you had attempted the clone
> > > > operation using the "test" user, the clone would have failed immediately
> > > > because of this.
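> > > >
> > > > For example (pool names taken from your reproduction below - this is only a
> > > > sketch, so adjust it to your setup), caps along these lines should let the
> > > > client open the parent image in the "test" pool:
> > > >
> > > > # ceph auth caps client.test mon 'allow r' osd 'allow class-read object_prefix rbd_children, allow rwx pool=rbd, allow rx pool=test'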
> > > >
> > > > Second, unless this is a test cluster where you can delete the "rbd_children"
> > > > object in the "rbd" pool (i.e. you don't have any additional clones in the rbd
> > > > pool) via the rados CLI, you will need to use the Kraken release candidate (or
> > > > master branch) version of the rados CLI to manually manipulate the
> > > > "rbd_children" object and remove the dangling reference to the deleted image.
> > > >
> > > > On Wed, Dec 21, 2016 at 6:57 AM, Bartłomiej Święcki
> > > > <bartlomiej.swiecki@xxxxxxxxxxxx> wrote:
> > > > > Hi,
> > > > >
> > > > > I'm currently investigating a case where a Ceph cluster ended up with inconsistent clone information.
> > > > >
> > > > > Here's what I did to quickly reproduce it:
> > > > > * Created a new cluster (tested on hammer 0.94.6 and jewel 10.2.3)
> > > > > * Created two pools: test and rbd
> > > > > * Created a base image in pool test, created a snapshot, protected it and created a clone of this snapshot in pool rbd:
> > > > > # rbd -p test create --size 10 --image-format 2 base
> > > > > # rbd -p test snap create base@base
> > > > > # rbd -p test snap protect base@base
> > > > > # rbd clone test/base@base rbd/destination
> > > > > * Created a new user called "test" with rwx permissions to the rbd pool only:
> > > > > caps: [mon] allow r
> > > > > caps: [osd] allow class-read object_prefix rbd_children, allow rwx pool=rbd
> > > > > * Using this newly created user I removed the cloned image in the rbd pool; it printed errors but removed the image anyway:
> > > > > # rbd --id test -p rbd rm destination
> > > > > 2016-12-21 11:50:03.758221 7f32b7459700 -1 librbd::image::OpenRequest: failed to retreive name: (1) Operation not permitted
> > > > > 2016-12-21 11:50:03.758288 7f32b6c58700 -1 librbd::image::RefreshParentRequest: failed to open parent image: (1) Operation not permitted
> > > > > 2016-12-21 11:50:03.758312 7f32b6c58700 -1 librbd::image::RefreshRequest: failed to refresh parent image: (1) Operation not permitted
> > > > > 2016-12-21 11:50:03.758333 7f32b6c58700 -1 librbd::image::OpenRequest: failed to refresh image: (1) Operation not permitted
> > > > > 2016-12-21 11:50:03.759366 7f32b6c58700 -1 librbd::ImageState: failed to open image: (1) Operation not permitted
> > > > > Removing image: 100% complete...done.
> > > > >
> > > > > At this point there's no cloned image but the original snapshot still has a reference to it:
> > > > >
> > > > > # rbd -p test snap unprotect base@base
> > > > > 2016-12-21 11:53:47.359060 7fee037fe700 -1 librbd::SnapshotUnprotectRequest: cannot unprotect: at least 1 child(ren) [29b0238e1f29] in pool 'rbd'
> > > > > 2016-12-21 11:53:47.359678 7fee037fe700 -1 librbd::SnapshotUnprotectRequest: encountered error: (16) Device or resource busy
> > > > > 2016-12-21 11:53:47.359691 7fee037fe700 -1 librbd::SnapshotUnprotectRequest: 0x7fee39ae9340 should_complete_error: ret_val=-16
> > > > > 2016-12-21 11:53:47.360627 7fee037fe700 -1 librbd::SnapshotUnprotectRequest: 0x7fee39ae9340 should_complete_error: ret_val=-16
> > > > > rbd: unprotecting snap failed: (16) Device or resource busy
> > > > >
> > > > > # rbd -p test children base@base
> > > > > rbd: listing children failed: (2) No such file or directory2016-12-21
> > > > > 11:53:08.716987 7ff2b2eaad80 -1 librbd: Error looking up name for image
> > > > > id 29b0238e1f29 in pool rbd
> > > > >
> > > > > Any ideas on how this could be fixed?
> > > > >
> > > > > Thanks,
> > > > > Bartek
> > > >
> > > > --
> > > > Jason

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com