Hello Jason,

> Just so I can attempt to repeat this:

Thanks.

> (1) you had an image that was built using Hammer clients and OSDs with
> exclusive lock disabled

Yes. It was created with the hammer rbd defaults.

> (2) you updated your clients and OSDs to Jewel
> (3) you restarted your OSDs and live-migrated your VMs to pick up the
> Jewel changes

No. I updated the clients only and did a live migration of all VMs to
load the jewel librbd. After that I updated the mons + restart and then
updated the osds + restart.

> (4) you enabled exclusive-lock, object-map, and fast-diff on a running VM

Yes.

> (5) you rebuilt the image's object map (while the VM was running?)

Yes.

> (6) things started breaking at this point

Yes, but not on all VMs and only while creating and deleting snapshots.

Greets,
Stefan

>
> On Sun, May 14, 2017 at 1:42 PM, Stefan Priebe - Profihost AG
> <s.priebe@xxxxxxxxxxxx> wrote:
>> I verified it. After a live migration of the VM I'm able to successfully
>> disable fast-diff,exclusive-lock,object-map.
>>
>> The problem only seems to occur at all if a client connected under
>> hammer without exclusive lock, then got upgraded to jewel and exclusive
>> lock got enabled.
>>
>> Greets,
>> Stefan
>>
>> Am 14.05.2017 um 19:33 schrieb Stefan Priebe - Profihost AG:
>>> Hello Jason,
>>>
>>> Am 14.05.2017 um 14:04 schrieb Jason Dillaman:
>>>> It appears as though there is client.27994090 at 10.255.0.13 that
>>>> currently owns the exclusive lock on that image. I am assuming the log
>>>> is from "rbd feature disable"?
>>> Yes.
>>>
>>>> If so, I can see that it attempts to
>>>> acquire the lock and the other side is not appropriately responding to
>>>> the request.
>>>>
>>>> Assuming your system is still in this state, is there any chance to
>>>> get debug rbd=20 logs from that client by using the client's asok file
>>>> and "ceph --admin-daemon /path/to/client/asok config set debug_rbd 20"
>>>> and re-run the attempt to disable exclusive lock?
>>>
>>> It's a VM running qemu with librbd. It seems there is no default socket,
>>> and I don't think there is a way to activate it later. I can try to
>>> activate it in ceph.conf and migrate the VM to another node. But I'm not
>>> sure whether the problem persists after migration or if librbd is
>>> more or less reinitialized.
>>>
>>>> Also, what version of Ceph is that client running?
>>> Client and server are on ceph 10.2.7.
>>>
>>> Greets,
>>> Stefan
>>>
>>>> Jason
>>>>
>>>> On Sun, May 14, 2017 at 1:55 AM, Stefan Priebe - Profihost AG
>>>> <s.priebe@xxxxxxxxxxxx> wrote:
>>>>> Hello Jason,
>>>>>
>>>>> As it still happens and VMs are crashing, I wanted to disable
>>>>> exclusive-lock,fast-diff again. But I noticed that there are images
>>>>> where the rbd command runs in an endless loop.
>>>>>
>>>>> I canceled the command after 60s and used --debug-rbd=20. Will send the
>>>>> log off list.
>>>>>
>>>>> Thanks!
>>>>>
>>>>> Greets,
>>>>> Stefan
>>>>>
>>>>> Am 13.05.2017 um 19:19 schrieb Stefan Priebe - Profihost AG:
>>>>>> Hello Jason,
>>>>>>
>>>>>> It seems to be related to fstrim and discard. I cannot reproduce it for
>>>>>> images where we don't use trim - but it is still the case that it works
>>>>>> fine for images created with jewel and not for images created pre jewel.
>>>>>> The only difference I can find is that the images created with jewel
>>>>>> also support deep-flatten.
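
For reference, the feature difference described above can be seen with
plain "rbd info"; the pool and image names below are placeholders, not
taken from this thread:

    # jewel-created images typically list: layering, exclusive-lock,
    # object-map, fast-diff, deep-flatten
    rbd info <pool>/<jewel-image>     | grep features
    rbd info <pool>/<pre-jewel-image> | grep features
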
>>>>>>
>>>>>> Greets,
>>>>>> Stefan
>>>>>>
>>>>>> Am 11.05.2017 um 22:28 schrieb Jason Dillaman:
>>>>>>> Assuming the only log messages you are seeing are the following:
>>>>>>>
>>>>>>> 2017-05-06 03:20:50.830626 7f7876a64700 -1
>>>>>>> librbd::object_map::InvalidateRequest: 0x7f7860004410 invalidating
>>>>>>> object map in-memory
>>>>>>> 2017-05-06 03:20:50.830634 7f7876a64700 -1
>>>>>>> librbd::object_map::InvalidateRequest: 0x7f7860004410 invalidating
>>>>>>> object map on-disk
>>>>>>> 2017-05-06 03:20:50.831250 7f7877265700 -1
>>>>>>> librbd::object_map::InvalidateRequest: 0x7f7860004410 should_complete: r=0
>>>>>>>
>>>>>>> It looks like that can only occur if somehow the object-map on disk is
>>>>>>> larger than the actual image size. If that's the case, how the image
>>>>>>> got into that state is unknown to me at this point.
>>>>>>>
>>>>>>> On Thu, May 11, 2017 at 3:23 PM, Stefan Priebe - Profihost AG
>>>>>>> <s.priebe@xxxxxxxxxxxx> wrote:
>>>>>>>> Hi Jason,
>>>>>>>>
>>>>>>>> It seems I can at least circumvent the crashes. Since I restarted ALL
>>>>>>>> osds after enabling exclusive lock and rebuilding the object maps,
>>>>>>>> there have been no new crashes.
>>>>>>>>
>>>>>>>> What still makes me wonder are those
>>>>>>>> librbd::object_map::InvalidateRequest: 0x7f7860004410 should_complete: r=0
>>>>>>>>
>>>>>>>> messages.
>>>>>>>>
>>>>>>>> Greets,
>>>>>>>> Stefan
>>>>>>>>
>>>>>>>> Am 08.05.2017 um 14:50 schrieb Stefan Priebe - Profihost AG:
>>>>>>>>> Hi,
>>>>>>>>> Am 08.05.2017 um 14:40 schrieb Jason Dillaman:
>>>>>>>>>> You are saying that you had v2 RBD images created against Hammer OSDs
>>>>>>>>>> and client libraries where exclusive lock, object map, etc. were never
>>>>>>>>>> enabled. You then upgraded the OSDs and clients to Jewel and at some
>>>>>>>>>> point enabled exclusive lock (and I'd assume object map) on these
>>>>>>>>>> images
>>>>>>>>>
>>>>>>>>> Yes, I did:
>>>>>>>>> for img in $(rbd -p cephstor5 ls -l | grep -v "@" | awk '{ print $1 }');
>>>>>>>>> do rbd -p cephstor5 feature enable $img
>>>>>>>>> exclusive-lock,object-map,fast-diff || echo $img; done
>>>>>>>>>
>>>>>>>>>> -- or were the exclusive lock and object map features already
>>>>>>>>>> enabled under Hammer?
>>>>>>>>>
>>>>>>>>> No, as they were not the rbd defaults.
>>>>>>>>>
>>>>>>>>>> The fact that you encountered an object map error on an export
>>>>>>>>>> operation is surprising to me. Does that error re-occur if you
>>>>>>>>>> perform the export again? If you can repeat it, it would be very
>>>>>>>>>> helpful if you could run the export with "--debug-rbd=20" and capture
>>>>>>>>>> the generated logs.
>>>>>>>>>
>>>>>>>>> No, I can't repeat it. It happens every night but for different images,
>>>>>>>>> and I never saw it for a VM twice. If I do the export again it works fine.
>>>>>>>>>
>>>>>>>>> I'm doing an rbd export or an rbd export-diff --from-snap; it depends on
>>>>>>>>> the VM and the days since the last snapshot.
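
A minimal sketch of the kind of nightly snapshot/export cycle described
above; the snapshot and file names are only examples, not the ones used
in this setup:

    # first run: full export from a fresh snapshot
    rbd snap create <pool>/<image>@backup-day1
    rbd export <pool>/<image>@backup-day1 /backup/<image>-day1.img

    # later runs: incremental diff against the previous snapshot
    rbd snap create <pool>/<image>@backup-day2
    rbd export-diff --from-snap backup-day1 \
        <pool>/<image>@backup-day2 /backup/<image>-day1-to-day2.diff
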
>>>>>>>>>
>>>>>>>>> Greets,
>>>>>>>>> Stefan
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Sat, May 6, 2017 at 2:38 PM, Stefan Priebe - Profihost AG
>>>>>>>>>> <s.priebe@xxxxxxxxxxxx> wrote:
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> Also, I'm getting these errors only for pre-jewel images:
>>>>>>>>>>>
>>>>>>>>>>> 2017-05-06 03:20:50.830626 7f7876a64700 -1
>>>>>>>>>>> librbd::object_map::InvalidateRequest: 0x7f7860004410 invalidating
>>>>>>>>>>> object map in-memory
>>>>>>>>>>> 2017-05-06 03:20:50.830634 7f7876a64700 -1
>>>>>>>>>>> librbd::object_map::InvalidateRequest: 0x7f7860004410 invalidating
>>>>>>>>>>> object map on-disk
>>>>>>>>>>> 2017-05-06 03:20:50.831250 7f7877265700 -1
>>>>>>>>>>> librbd::object_map::InvalidateRequest: 0x7f7860004410 should_complete: r=0
>>>>>>>>>>>
>>>>>>>>>>> while running export-diff.
>>>>>>>>>>>
>>>>>>>>>>> Stefan
>>>>>>>>>>>
>>>>>>>>>>> Am 06.05.2017 um 07:37 schrieb Stefan Priebe - Profihost AG:
>>>>>>>>>>>> Hello Jason,
>>>>>>>>>>>>
>>>>>>>>>>>> While doing further testing, it happens only with images that were
>>>>>>>>>>>> created with hammer, got upgraded to jewel AND got exclusive lock
>>>>>>>>>>>> enabled.
>>>>>>>>>>>>
>>>>>>>>>>>> Greets,
>>>>>>>>>>>> Stefan
>>>>>>>>>>>>
>>>>>>>>>>>> Am 04.05.2017 um 14:20 schrieb Jason Dillaman:
>>>>>>>>>>>>> Odd. Can you re-run "rbd rm" with "--debug-rbd=20" added to the
>>>>>>>>>>>>> command and post the resulting log to a new ticket at [1]? I'd also be
>>>>>>>>>>>>> interested if you could re-create that
>>>>>>>>>>>>> "librbd::object_map::InvalidateRequest" issue repeatably.
>>>>>>>>>>>>>
>>>>>>>>>>>>> [1] http://tracker.ceph.com/projects/rbd/issues
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, May 4, 2017 at 3:45 AM, Stefan Priebe - Profihost AG
>>>>>>>>>>>>> <s.priebe@xxxxxxxxxxxx> wrote:
>>>>>>>>>>>>>> Example:
>>>>>>>>>>>>>> # rbd rm cephstor2/vm-136-disk-1
>>>>>>>>>>>>>> Removing image: 99% complete...
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Stuck at 99% and never completes. This is an image which got corrupted
>>>>>>>>>>>>>> for an unknown reason.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Greets,
>>>>>>>>>>>>>> Stefan
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Am 04.05.2017 um 08:32 schrieb Stefan Priebe - Profihost AG:
>>>>>>>>>>>>>>> I'm not sure whether this is related, but our backup system uses rbd
>>>>>>>>>>>>>>> snapshots and sometimes reports messages like these:
>>>>>>>>>>>>>>> 2017-05-04 02:42:47.661263 7f3316ffd700 -1
>>>>>>>>>>>>>>> librbd::object_map::InvalidateRequest: 0x7f3310002570 should_complete: r=0
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Stefan
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Am 04.05.2017 um 07:49 schrieb Stefan Priebe - Profihost AG:
>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Since we upgraded from hammer to jewel 10.2.7 and enabled
>>>>>>>>>>>>>>>> exclusive-lock,object-map,fast-diff, we have had problems with
>>>>>>>>>>>>>>>> corrupted VM filesystems.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Sometimes the VMs just crash with FS errors and a restart can
>>>>>>>>>>>>>>>> solve the problem. Sometimes the whole VM is not even bootable and we
>>>>>>>>>>>>>>>> need to import a backup.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> All of them have the same problem that you can't revert to an older
>>>>>>>>>>>>>>>> snapshot. The rbd command just hangs at 99% forever.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Is this a known issue - anything we can check?
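
For context, the object map state discussed above can be inspected and
rebuilt with standard rbd commands; the names below are placeholders:

    rbd info <pool>/<image>                       # "flags: object map invalid" marks a broken map
    rbd object-map rebuild <pool>/<image>         # rebuild the map for the image head
    rbd object-map rebuild <pool>/<image>@<snap>  # each snapshot carries its own object map
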
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Greets,
>>>>>>>>>>>>>>>> Stefan
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>> ceph-users mailing list
>>>>>>>>>>>>>> ceph-users@xxxxxxxxxxxxxx
>>>>>>>>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com