Re: corrupted rbd filesystems since jewel

Hello Jason,

as it is still happening and VMs are crashing, I wanted to disable
exclusive-lock and fast-diff again. But I noticed that there are images
for which the rbd command runs in an endless loop.

I cancelled the command after 60 seconds and used --debug-rbd=20. I will
send the log off-list.
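
For reference, the disable sequence I am trying looks roughly like this
(pool and image names are placeholders; fast-diff and object-map have to
come first, since exclusive-lock cannot be disabled while they are still
enabled):

  rbd -p <pool> feature disable <image> fast-diff,object-map
  rbd -p <pool> feature disable <image> exclusive-lock --debug-rbd=20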

Thanks!

Greets,
Stefan

On 13.05.2017 at 19:19, Stefan Priebe - Profihost AG wrote:
> Hello Jason,
> 
> it seems to be related to fstrim and discard. I cannot reproduce it for
> images where we don't use trim. It is still the case that everything
> works fine for images created with Jewel, but not for images created
> before Jewel. The only difference I can find is that the images created
> with Jewel also support deep-flatten.
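>
> (I am comparing the feature sets with something like the following,
> image name being a placeholder:)
>
> rbd -p <pool> info <image> | grep features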
> 
> Greets,
> Stefan
> 
> On 11.05.2017 at 22:28, Jason Dillaman wrote:
>> Assuming the only log messages you are seeing are the following:
>>
>> 2017-05-06 03:20:50.830626 7f7876a64700 -1
>> librbd::object_map::InvalidateRequest: 0x7f7860004410 invalidating
>> object map in-memory
>> 2017-05-06 03:20:50.830634 7f7876a64700 -1
>> librbd::object_map::InvalidateRequest: 0x7f7860004410 invalidating
>> object map on-disk
>> 2017-05-06 03:20:50.831250 7f7877265700 -1
>> librbd::object_map::InvalidateRequest: 0x7f7860004410 should_complete: r=0
>>
>> It looks like that can only occur if somehow the object-map on disk is
>> larger than the actual image size. If that's the case, how the image
>> got into that state is unknown to me at this point.
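>>
>> As a rough sanity check: "rbd info" prints the expected object count
>> (e.g. "size 100 GB in 25600 objects" for a 4 MB object size, since
>> 100 * 1024 / 4 = 25600). An on-disk object map with more entries than
>> that would trigger exactly this invalidation path.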
>>
>> On Thu, May 11, 2017 at 3:23 PM, Stefan Priebe - Profihost AG
>> <s.priebe@xxxxxxxxxxxx> wrote:
>>> Hi Jason,
>>>
>>> it seems I can at least work around the crashes. Since I restarted ALL
>>> OSDs after enabling exclusive-lock and rebuilding the object maps,
>>> there have been no new crashes.
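>>>
>>> (For the record, the rebuild was done per image along these lines,
>>> pool and image being placeholders:)
>>>
>>> rbd -p <pool> object-map rebuild <image>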
>>>
>>> What still puzzles me are these
>>> librbd::object_map::InvalidateRequest: 0x7f7860004410 should_complete: r=0
>>>
>>> messages.
>>>
>>> Greets,
>>> Stefan
>>>
>>> On 08.05.2017 at 14:50, Stefan Priebe - Profihost AG wrote:
>>>> Hi,
>>>> On 08.05.2017 at 14:40, Jason Dillaman wrote:
>>>>> You are saying that you had v2 RBD images created against Hammer OSDs
>>>>> and client libraries where exclusive lock, object map, etc were never
>>>>> enabled. You then upgraded the OSDs and clients to Jewel and at some
>>>>> point enabled exclusive lock (and I'd assume object map) on these
>>>>> images
>>>>
>>>> Yes, I did:
>>>>
>>>> for img in $(rbd -p cephstor5 ls -l | grep -v "@" | awk '{ print $1 }'); do
>>>>   rbd -p cephstor5 feature enable $img \
>>>>     exclusive-lock,object-map,fast-diff || echo $img
>>>> done
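>>>>
>>>> (A slightly safer variant, assuming image names without whitespace,
>>>> would use plain "rbd ls", which lists only image names and avoids
>>>> both the header line and the "@" filtering:)
>>>>
>>>> for img in $(rbd -p cephstor5 ls); do
>>>>   rbd -p cephstor5 feature enable "$img" \
>>>>     exclusive-lock,object-map,fast-diff || echo "$img"
>>>> done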
>>>>
>>>>> -- or were the exclusive lock and object map features already
>>>>> enabled under Hammer?
>>>>
>>>> No, as they were not the rbd defaults.
>>>>
>>>>> The fact that you encountered an object map error on an export
>>>>> operation is surprising to me. Does that error recur if you perform
>>>>> the export again? If you can repeat it, it would be very helpful if
>>>>> you could run the export with "--debug-rbd=20" and capture the
>>>>> generated logs.
>>>>
>>>> No, I can't repeat it. It happens every night, but for different
>>>> images; I have never seen it for the same VM twice. If I do the
>>>> export again, it works fine.
>>>>
>>>> I'm running either an rbd export or an rbd export-diff --from-snap;
>>>> which one depends on the VM and the time since the last snapshot.
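>>>>
>>>> Roughly, with the snapshot names as placeholders:
>>>>
>>>> rbd export cephstor5/<image>@<snap> <dest-file>
>>>> rbd export-diff --from-snap <prev-snap> cephstor5/<image>@<snap> <dest-file>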
>>>>
>>>> Greets,
>>>> Stefan
>>>>
>>>>>
>>>>> On Sat, May 6, 2017 at 2:38 PM, Stefan Priebe - Profihost AG
>>>>> <s.priebe@xxxxxxxxxxxx> wrote:
>>>>>> Hi,
>>>>>>
>>>>>> Also, I'm getting these errors only for pre-Jewel images:
>>>>>>
>>>>>> 2017-05-06 03:20:50.830626 7f7876a64700 -1
>>>>>> librbd::object_map::InvalidateRequest: 0x7f7860004410 invalidating
>>>>>> object map in-memory
>>>>>> 2017-05-06 03:20:50.830634 7f7876a64700 -1
>>>>>> librbd::object_map::InvalidateRequest: 0x7f7860004410 invalidating
>>>>>> object map on-disk
>>>>>> 2017-05-06 03:20:50.831250 7f7877265700 -1
>>>>>> librbd::object_map::InvalidateRequest: 0x7f7860004410 should_complete: r=0
>>>>>>
>>>>>> while running export-diff.
>>>>>>
>>>>>> Stefan
>>>>>>
>>>>>> On 06.05.2017 at 07:37, Stefan Priebe - Profihost AG wrote:
>>>>>>> Hello Jason,
>>>>>>>
>>>>>>> Further testing shows it happens only with images that were
>>>>>>> created with Hammer, upgraded to Jewel, AND then had
>>>>>>> exclusive-lock enabled.
>>>>>>>
>>>>>>> Greets,
>>>>>>> Stefan
>>>>>>>
>>>>>>> On 04.05.2017 at 14:20, Jason Dillaman wrote:
>>>>>>>> Odd. Can you re-run "rbd rm" with "--debug-rbd=20" added to the
>>>>>>>> command and post the resulting log to a new ticket at [1]? I'd also be
>>>>>>>> interested if you could re-create that
>>>>>>>> "librbd::object_map::InvalidateRequest" issue repeatably.
>>>>>>>> [1] http://tracker.ceph.com/projects/rbd/issues
>>>>>>>>
>>>>>>>> On Thu, May 4, 2017 at 3:45 AM, Stefan Priebe - Profihost AG
>>>>>>>> <s.priebe@xxxxxxxxxxxx> wrote:
>>>>>>>>> Example:
>>>>>>>>> # rbd rm cephstor2/vm-136-disk-1
>>>>>>>>> Removing image: 99% complete...
>>>>>>>>>
>>>>>>>>> The command is stuck at 99% and never completes. This is an
>>>>>>>>> image that got corrupted for an unknown reason.
>>>>>>>>>
>>>>>>>>> Greets,
>>>>>>>>> Stefan
>>>>>>>>>
>>>>>>>>> On 04.05.2017 at 08:32, Stefan Priebe - Profihost AG wrote:
>>>>>>>>>> I'm not sure whether this is related, but our backup system
>>>>>>>>>> uses rbd snapshots and sometimes reports messages like these:
>>>>>>>>>> 2017-05-04 02:42:47.661263 7f3316ffd700 -1
>>>>>>>>>> librbd::object_map::InvalidateRequest: 0x7f3310002570 should_complete: r=0
>>>>>>>>>>
>>>>>>>>>> Stefan
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 04.05.2017 at 07:49, Stefan Priebe - Profihost AG wrote:
>>>>>>>>>>> Hello,
>>>>>>>>>>>
>>>>>>>>>>> since we upgraded from Hammer to Jewel 10.2.7 and enabled
>>>>>>>>>>> exclusive-lock, object-map, and fast-diff, we have had
>>>>>>>>>>> problems with corrupted VM filesystems.
>>>>>>>>>>>
>>>>>>>>>>> Sometimes the VMs just crash with filesystem errors and a
>>>>>>>>>>> restart solves the problem. Sometimes the whole VM is not even
>>>>>>>>>>> bootable and we need to restore from a backup.
>>>>>>>>>>>
>>>>>>>>>>> All of them share the same problem: you can't revert to an
>>>>>>>>>>> older snapshot. The rbd command just hangs at 99% forever.
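>>>>>>>>>>>
>>>>>>>>>>> (The revert in question is a plain "rbd snap rollback
>>>>>>>>>>> <pool>/<image>@<snap>"; it stops at 99% complete and never
>>>>>>>>>>> returns.)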
>>>>>>>>>>>
>>>>>>>>>>> Is this a known issue? Is there anything we can check?
>>>>>>>>>>>
>>>>>>>>>>> Greets,
>>>>>>>>>>> Stefan
>>>>>>>>>>>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


