I was able to re-create the issue where "rbd feature disable" hangs if the
client experienced a long comms failure with the OSDs, and I have a
proposed fix posted [1]. Unfortunately, I haven't been successful in
reproducing any stalled IO, discard issues, or errors logged by
export-diff. I'll keep trying to reproduce them, but if you can generate
debug-level logging from one of these events it would be greatly
appreciated (see the P.S. at the very bottom for a sketch of one way to
capture such logs).

[1] https://github.com/ceph/ceph/pull/15093

On Mon, May 15, 2017 at 1:29 PM, Stefan Priebe - Profihost AG
<s.priebe@xxxxxxxxxxxx> wrote:
> Hello Jason,
>> Just so I can attempt to repeat this:
>
> Thanks.
>
>> (1) you had an image that was built using Hammer clients and OSDs with
>> exclusive lock disabled
> Yes. It was created with the hammer rbd defaults.
>
>> (2) you updated your clients and OSDs to Jewel
>> (3) you restarted your OSDs and live-migrated your VMs to pick up the
>> Jewel changes
>
> No. I updated the clients only and did a live migration of all VMs to
> load up the jewel librbd.
>
> After that I updated the mons + restart and then updated the osds +
> restart.
>
>> (4) you enabled exclusive-lock, object-map, and fast-diff on a running VM
> Yes.
>
>> (5) you rebuilt the image's object map (while the VM was running?)
> Yes.
>
>> (6) things started breaking at this point
> Yes, but not on all VMs and only while creating and deleting snapshots.
>
> Greets,
> Stefan
>
>> On Sun, May 14, 2017 at 1:42 PM, Stefan Priebe - Profihost AG
>> <s.priebe@xxxxxxxxxxxx> wrote:
>>> I verified it. After a live migration of the VM I'm able to
>>> successfully disable fast-diff,exclusive-lock,object-map.
>>>
>>> The problem only seems to occur if a client connected under hammer
>>> without exclusive lock, then got upgraded to jewel and had exclusive
>>> lock enabled.
>>>
>>> Greets,
>>> Stefan
>>>
>>> On 14.05.2017 at 19:33, Stefan Priebe - Profihost AG wrote:
>>>> Hello Jason,
>>>>
>>>> On 14.05.2017 at 14:04, Jason Dillaman wrote:
>>>>> It appears as though there is client.27994090 at 10.255.0.13 that
>>>>> currently owns the exclusive lock on that image. I am assuming the
>>>>> log is from "rbd feature disable"?
>>>> Yes.
>>>>
>>>>> If so, I can see that it attempts to acquire the lock and the other
>>>>> side is not appropriately responding to the request.
>>>>>
>>>>> Assuming your system is still in this state, is there any chance to
>>>>> get debug rbd=20 logs from that client by using the client's asok
>>>>> file and "ceph --admin-daemon /path/to/client/asok config set
>>>>> debug_rbd 20" and re-run the attempt to disable exclusive lock?
>>>>
>>>> It's a VM running qemu with librbd. It seems there is no default
>>>> socket, and I don't think there is a way to activate it later. I can
>>>> try to activate it in ceph.conf and migrate the VM to another node,
>>>> but I'm not sure whether the problem persists after migration or if
>>>> librbd is essentially reinitialized.
>>>>
>>>>> Also, what version of Ceph is that client running?
>>>> Client and server are on ceph 10.2.7.
>>>>
>>>> Greets,
>>>> Stefan
>>>>
>>>>> Jason
>>>>>
>>>>> On Sun, May 14, 2017 at 1:55 AM, Stefan Priebe - Profihost AG
>>>>> <s.priebe@xxxxxxxxxxxx> wrote:
>>>>>> Hello Jason,
>>>>>>
>>>>>> as it still happens and VMs are crashing, I wanted to disable
>>>>>> exclusive-lock,fast-diff again. But I noticed that there are images
>>>>>> where the rbd command runs in an endless loop.
>>>>>>
>>>>>> I canceled the command after 60s and used --debug-rbd=20. Will send
>>>>>> the log off list.
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> Greets,
>>>>>> Stefan
>>>>>>
>>>>>> On 13.05.2017 at 19:19, Stefan Priebe - Profihost AG wrote:
>>>>>>> Hello Jason,
>>>>>>>
>>>>>>> it seems to be related to fstrim and discard. I cannot reproduce it
>>>>>>> for images where we don't use trim - but it's still the case that
>>>>>>> images created with jewel work fine while pre-jewel images do not.
>>>>>>> The only difference I can find is that the images created with
>>>>>>> jewel also support deep-flatten.
>>>>>>>
>>>>>>> Greets,
>>>>>>> Stefan
>>>>>>>
>>>>>>> On 11.05.2017 at 22:28, Jason Dillaman wrote:
>>>>>>>> Assuming the only log messages you are seeing are the following:
>>>>>>>>
>>>>>>>> 2017-05-06 03:20:50.830626 7f7876a64700 -1
>>>>>>>> librbd::object_map::InvalidateRequest: 0x7f7860004410 invalidating
>>>>>>>> object map in-memory
>>>>>>>> 2017-05-06 03:20:50.830634 7f7876a64700 -1
>>>>>>>> librbd::object_map::InvalidateRequest: 0x7f7860004410 invalidating
>>>>>>>> object map on-disk
>>>>>>>> 2017-05-06 03:20:50.831250 7f7877265700 -1
>>>>>>>> librbd::object_map::InvalidateRequest: 0x7f7860004410 should_complete: r=0
>>>>>>>>
>>>>>>>> it looks like that can only occur if somehow the object map on
>>>>>>>> disk is larger than the actual image size. If that's the case, how
>>>>>>>> the image got into that state is unknown to me at this point.
>>>>>>>>
>>>>>>>> On Thu, May 11, 2017 at 3:23 PM, Stefan Priebe - Profihost AG
>>>>>>>> <s.priebe@xxxxxxxxxxxx> wrote:
>>>>>>>>> Hi Jason,
>>>>>>>>>
>>>>>>>>> it seems I can at least circumvent the crashes. Since I restarted
>>>>>>>>> ALL osds after enabling exclusive lock and rebuilding the object
>>>>>>>>> maps, there have been no new crashes.
>>>>>>>>>
>>>>>>>>> What still makes me wonder are those
>>>>>>>>> librbd::object_map::InvalidateRequest: 0x7f7860004410 should_complete: r=0
>>>>>>>>>
>>>>>>>>> messages.
>>>>>>>>>
>>>>>>>>> Greets,
>>>>>>>>> Stefan
>>>>>>>>>
>>>>>>>>> On 08.05.2017 at 14:50, Stefan Priebe - Profihost AG wrote:
>>>>>>>>>> Hi,
>>>>>>>>>> On 08.05.2017 at 14:40, Jason Dillaman wrote:
>>>>>>>>>>> You are saying that you had v2 RBD images created against
>>>>>>>>>>> Hammer OSDs and client libraries where exclusive lock, object
>>>>>>>>>>> map, etc. were never enabled. You then upgraded the OSDs and
>>>>>>>>>>> clients to Jewel and at some point enabled exclusive lock (and
>>>>>>>>>>> I'd assume object map) on these images
>>>>>>>>>>
>>>>>>>>>> Yes, I did:
>>>>>>>>>> for img in $(rbd -p cephstor5 ls -l | grep -v "@" | awk '{ print $1 }');
>>>>>>>>>> do rbd -p cephstor5 feature enable $img
>>>>>>>>>> exclusive-lock,object-map,fast-diff || echo $img; done
>>>>>>>>>>
>>>>>>>>>>> -- or were the exclusive lock and object map features already
>>>>>>>>>>> enabled under Hammer?
>>>>>>>>>>
>>>>>>>>>> No, as they were not the rbd defaults.
>>>>>>>>>>
>>>>>>>>>>> The fact that you encountered an object map error on an export
>>>>>>>>>>> operation is surprising to me. Does that error re-occur if you
>>>>>>>>>>> perform the export again? If you can repeat it, it would be
>>>>>>>>>>> very helpful if you could run the export with "--debug-rbd=20"
>>>>>>>>>>> and capture the generated logs.
>>>>>>>>>>
>>>>>>>>>> No, I can't repeat it. It happens every night but for different
>>>>>>>>>> images, and I never saw it for a VM twice. If I do the export
>>>>>>>>>> again it works fine.
>>>>>>>>>>
>>>>>>>>>> I'm doing an rbd export or an rbd export-diff --from-snap,
>>>>>>>>>> depending on the VM and the days since the last snapshot.
>>>>>>>>>>
>>>>>>>>>> Greets,
>>>>>>>>>> Stefan
>>>>>>>>>>
>>>>>>>>>>> On Sat, May 6, 2017 at 2:38 PM, Stefan Priebe - Profihost AG
>>>>>>>>>>> <s.priebe@xxxxxxxxxxxx> wrote:
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> also, I'm getting these errors only for pre-jewel images:
>>>>>>>>>>>>
>>>>>>>>>>>> 2017-05-06 03:20:50.830626 7f7876a64700 -1
>>>>>>>>>>>> librbd::object_map::InvalidateRequest: 0x7f7860004410 invalidating
>>>>>>>>>>>> object map in-memory
>>>>>>>>>>>> 2017-05-06 03:20:50.830634 7f7876a64700 -1
>>>>>>>>>>>> librbd::object_map::InvalidateRequest: 0x7f7860004410 invalidating
>>>>>>>>>>>> object map on-disk
>>>>>>>>>>>> 2017-05-06 03:20:50.831250 7f7877265700 -1
>>>>>>>>>>>> librbd::object_map::InvalidateRequest: 0x7f7860004410 should_complete: r=0
>>>>>>>>>>>>
>>>>>>>>>>>> while running export-diff.
>>>>>>>>>>>>
>>>>>>>>>>>> Stefan
>>>>>>>>>>>>
>>>>>>>>>>>> On 06.05.2017 at 07:37, Stefan Priebe - Profihost AG wrote:
>>>>>>>>>>>>> Hello Jason,
>>>>>>>>>>>>>
>>>>>>>>>>>>> while doing further testing, it happens only with images that
>>>>>>>>>>>>> were created with hammer, got upgraded to jewel, AND had
>>>>>>>>>>>>> exclusive lock enabled.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Greets,
>>>>>>>>>>>>> Stefan
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 04.05.2017 at 14:20, Jason Dillaman wrote:
>>>>>>>>>>>>>> Odd. Can you re-run "rbd rm" with "--debug-rbd=20" added to
>>>>>>>>>>>>>> the command and post the resulting log to a new ticket at
>>>>>>>>>>>>>> [1]? I'd also be interested if you could re-create that
>>>>>>>>>>>>>> "librbd::object_map::InvalidateRequest" issue repeatably.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [1] http://tracker.ceph.com/projects/rbd/issues
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, May 4, 2017 at 3:45 AM, Stefan Priebe - Profihost AG
>>>>>>>>>>>>>> <s.priebe@xxxxxxxxxxxx> wrote:
>>>>>>>>>>>>>>> Example:
>>>>>>>>>>>>>>> # rbd rm cephstor2/vm-136-disk-1
>>>>>>>>>>>>>>> Removing image: 99% complete...
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Stuck at 99% and never completes. This is an image which
>>>>>>>>>>>>>>> got corrupted for an unknown reason.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Greets,
>>>>>>>>>>>>>>> Stefan
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 04.05.2017 at 08:32, Stefan Priebe - Profihost AG wrote:
>>>>>>>>>>>>>>>> I'm not sure whether this is related, but our backup
>>>>>>>>>>>>>>>> system uses rbd snapshots and sometimes reports messages
>>>>>>>>>>>>>>>> like these:
>>>>>>>>>>>>>>>> 2017-05-04 02:42:47.661263 7f3316ffd700 -1
>>>>>>>>>>>>>>>> librbd::object_map::InvalidateRequest: 0x7f3310002570 should_complete: r=0
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Stefan
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 04.05.2017 at 07:49, Stefan Priebe - Profihost AG wrote:
>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> since we've upgraded from hammer to jewel 10.2.7 and
>>>>>>>>>>>>>>>>> enabled exclusive-lock,object-map,fast-diff, we've had
>>>>>>>>>>>>>>>>> problems with corrupted VM filesystems.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Sometimes the VMs just crash with FS errors and a restart
>>>>>>>>>>>>>>>>> can solve the problem. Sometimes the whole VM is not even
>>>>>>>>>>>>>>>>> bootable and we need to import a backup.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> All of them have the same problem that you can't revert
>>>>>>>>>>>>>>>>> to an older snapshot. The rbd command just hangs at 99%
>>>>>>>>>>>>>>>>> forever.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Is this a known issue - anything we can check?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Greets,
>>>>>>>>>>>>>>>>> Stefan

--
Jason
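
P.S. To pull the logging suggestions from the thread above together in
one place, here is a minimal sketch of how the requested debug-level
librbd logs could be captured from a running qemu guest. The socket and
log paths below are only examples, qemu needs write access to both
locations, and the [client] section must be in place before the librbd
instance starts (e.g. via a VM restart or live migration):

    [client]
        # one socket/log file per librbd instance; $pid and $cctid keep
        # multiple disks of the same guest from colliding
        admin socket = /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok
        log file = /var/log/ceph/$cluster-$type.$id.$pid.$cctid.log

Once the guest exposes a socket, raise the verbosity at runtime and
re-run the hanging operation (substitute your actual socket path and
pool/image):

    ceph --admin-daemon /var/run/ceph/<client>.asok config set debug_rbd 20
    rbd feature disable <pool>/<image> fast-diff object-map exclusive-lock

For one-shot commands such as the nightly exports, the debug settings
can instead be passed directly on the command line, e.g.:

    rbd export-diff --from-snap <snap> <pool>/<image> <dest> \
        --debug-rbd=20 --log-file=/tmp/rbd-debug.log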
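
P.P.S. Regarding the recurring "invalidating object map" messages: an
image whose object map has been flagged invalid shows "flags: object map
invalid" in "rbd info" output, and the map can be rebuilt from the
image's actual objects. A short sketch, with <pool>/<image> as
placeholders:

    # check whether the object map is currently flagged invalid
    rbd info <pool>/<image>

    # rebuild the object map (the same operation Stefan ran in step (5)
    # of the summary at the top of this thread)
    rbd object-map rebuild <pool>/<image>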