Re: [ceph-users] Clone data inconsistency in hammer

Hi,

I have problems running the Kraken tools against a Hammer/Jewel cluster (official 11.1.0 debs);
they assert:

/build/ceph-11.1.0/src/mon/MonMap.cc: In function 'void MonMap::sanitize_mons(std::map<std::basic_string<char>, entity_addr_t>&)' thread 7fffd37fe700 time 2016-12-22 12:26:23.457058
/build/ceph-11.1.0/src/mon/MonMap.cc: 70: FAILED assert(mon_info.count(p.first))

I tried to debug it a bit and it looks like mon_info has temporary mon names:

(gdb) p mon_info
$1 = std::map with 3 elements = {["noname-a"] = {name = "noname-a", .....

while the assertion is looking for the configured name:

(gdb) p p
$2 = {first = "mon-01-690d38c0-2567-447b-bdfb-0edd137183db", ....
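
For comparison, this is roughly how I'm double-checking what names the cluster
itself has for the mons (standard commands, nothing Kraken-specific; the monmap
file path is just an example):

        # ceph mon dump
        # ceph mon getmap -o /tmp/monmap && monmaptool --print /tmp/monmap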


Anyway, I was thinking about the missing image problem - maybe it would be easier
to recreate the removed image? Would restoring the rbd_header object be enough?
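
If that is a viable route, I would first check what is actually left of the deleted
image in the pool (just a sketch - 29b0238e1f29 is the dangling child id from the
thread below, and I'm assuming the format 2 header object is named rbd_header.<id>):

        # rados -p rbd ls | grep 29b0238e1f29
        # rados -p rbd stat rbd_header.29b0238e1f29
        # rados -p rbd listomapvals rbd_header.29b0238e1f29

If the header is gone, recreating its omap entries by hand (presumably size,
features, object_prefix and the parent reference) sounds fragile, but it might be
enough to make "rbd children" and "rbd rm" behave again.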


P.S. Adding ceph-devel
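
P.P.S. Regarding manipulating rbd_children as Jason suggests below: before touching
anything I plan to just dump the object (again only a sketch - I'm assuming the omap
keys are binary-encoded parent specs, which would be why the newer rados CLI is
needed to remove the right entry):

        # rados -p rbd listomapkeys rbd_children
        # rados -p rbd listomapvals rbd_children

Only on a throwaway cluster would I try the blunt variant of deleting the whole
object:

        # rados -p rbd rm rbd_children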

On Thu, 22 Dec 2016 10:10:09 +0100
Bartłomiej Święcki <bartlomiej.swiecki@xxxxxxxxxxxx> wrote:

> Hi Jason,
> 
> I'll test the Kraken tools since this happened on production. Everything works
> there because clones are flattened right after being created, and the production
> equivalent of the "test" user can access an image only after it has been flattened.
> 
> The issue happened when someone accidentally removed a not-yet-flattened image
> using a user with weaker permissions. Good to hear this has been spotted
> already.
> 
> Thanks for help,
> Bartek
> 
> 
> 
> On Wed, 21 Dec 2016 11:53:57 -0500
> Jason Dillaman <jdillama@xxxxxxxxxx> wrote:
> 
> > You are unfortunately the second person today to hit an issue where
> > "rbd remove" incorrectly proceeds when it hits a corner-case error.
> > 
> > First things first, when you configure your new user, you needed to
> > give it "rx" permissions to the parent image's pool. If you attempted
> > the clone operation using the "test" user, the clone would have
> > immediately failed due to this issue.
> > 
> > Second, unless this is a test cluster where you can delete the
> > "rbd_children" object in the "rbd" pool (i.e. you don't have any
> > additional clones in the rbd pool) via the rados CLI, you will need to
> > use the Kraken release candidate (or master branch) version of the
> > rados CLI to manually manipulate the "rbd_children" object to remove
> > the dangling reference to the deleted image.
> > 
> > On Wed, Dec 21, 2016 at 6:57 AM, Bartłomiej Święcki
> > <bartlomiej.swiecki@xxxxxxxxxxxx> wrote:
> > > Hi,
> > >
> > > I'm currently investigating a case where a Ceph cluster ended up with inconsistent clone information.
> > >
> > > Here's what I did to quickly reproduce it:
> > > * Created new cluster (tested in hammer 0.94.6 and jewel 10.2.3)
> > > * Created two pools: test and rbd
> > > * Created base image in pool test, created snapshot, protected it and created clone of this snapshot in pool rbd:
> > >         # rbd -p test create --size 10 --image-format 2 base
> > >         # rbd -p test snap create base@base
> > >         # rbd -p test snap protect base@base
> > >         # rbd clone test/base@base rbd/destination
> > > * Created new user called "test" with rwx permissions to rbd pool only:
> > >         caps: [mon] allow r
> > >         caps: [osd] allow class-read object_prefix rbd_children, allow rwx pool=rbd
> > > * Using this newly created user I removed the cloned image in the rbd pool; there were errors but the image was finally removed:
> > >         # rbd --id test -p rbd rm destination
> > >         2016-12-21 11:50:03.758221 7f32b7459700 -1 librbd::image::OpenRequest: failed to retreive name: (1) Operation not permitted
> > >         2016-12-21 11:50:03.758288 7f32b6c58700 -1 librbd::image::RefreshParentRequest: failed to open parent image: (1) Operation not permitted
> > >         2016-12-21 11:50:03.758312 7f32b6c58700 -1 librbd::image::RefreshRequest: failed to refresh parent image: (1) Operation not permitted
> > >         2016-12-21 11:50:03.758333 7f32b6c58700 -1 librbd::image::OpenRequest: failed to refresh image: (1) Operation not permitted
> > >         2016-12-21 11:50:03.759366 7f32b6c58700 -1 librbd::ImageState: failed to open image: (1) Operation not permitted
> > >         Removing image: 100% complete...done.
> > >
> > > At this point there's no cloned image but the original snapshot still has a reference to it:
> > >
> > > # rbd -p test snap unprotect base@base
> > > 2016-12-21 11:53:47.359060 7fee037fe700 -1 librbd::SnapshotUnprotectRequest: cannot unprotect: at least 1 child(ren) [29b0238e1f29] in pool 'rbd'
> > > 2016-12-21 11:53:47.359678 7fee037fe700 -1 librbd::SnapshotUnprotectRequest: encountered error: (16) Device or resource busy
> > > 2016-12-21 11:53:47.359691 7fee037fe700 -1 librbd::SnapshotUnprotectRequest: 0x7fee39ae9340 should_complete_error: ret_val=-16
> > > 2016-12-21 11:53:47.360627 7fee037fe700 -1 librbd::SnapshotUnprotectRequest: 0x7fee39ae9340 should_complete_error: ret_val=-16
> > > rbd: unprotecting snap failed: (16) Device or resource busy
> > >
> > > # rbd -p test children base@base
> > > rbd: listing children failed: (2) No such file or directory
> > > 2016-12-21 11:53:08.716987 7ff2b2eaad80 -1 librbd: Error looking up name for image id 29b0238e1f29 in pool rbd
> > >
> > >
> > > Any ideas on how this could be fixed?
> > >
> > >
> > > Thanks,
> > > Bartek
> > > _______________________________________________
> > > ceph-users mailing list
> > > ceph-users@xxxxxxxxxxxxxx
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > 
> > 
> > 
> > -- 
> > Jason
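
(Per Jason's first point above: for the record, we will also extend the caps of the
production equivalent of the "test" user so it gets read access to the parent pool -
exact cap string still to be confirmed, but something along the lines of:

        # ceph auth caps client.test mon 'allow r' \
              osd 'allow class-read object_prefix rbd_children, allow rwx pool=rbd, allow rx pool=test'

so that cloning from an inaccessible parent fails up front.)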
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html



