Re: corrupted rbd filesystems since jewel

Assuming the only log messages you are seeing are the following:

2017-05-06 03:20:50.830626 7f7876a64700 -1
librbd::object_map::InvalidateRequest: 0x7f7860004410 invalidating
object map in-memory
2017-05-06 03:20:50.830634 7f7876a64700 -1
librbd::object_map::InvalidateRequest: 0x7f7860004410 invalidating
object map on-disk
2017-05-06 03:20:50.831250 7f7877265700 -1
librbd::object_map::InvalidateRequest: 0x7f7860004410 should_complete: r=0

It looks like that can only occur if somehow the object-map on disk is
larger than the actual image size. If that's the case, how the image
got into that state is unknown to me at this point.
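
If you want to double-check one of the affected images, a minimal sketch
(untested here; the pool/image name below is only a placeholder borrowed from
the examples later in this thread) would be:

# see whether the object map is currently flagged invalid, and the image size
rbd info cephstor5/vm-136-disk-1 | grep -E 'size|flags'

# discard and recompute the on-disk object map for that image
rbd object-map rebuild cephstor5/vm-136-disk-1

If the "flags:" line no longer shows "object map invalid" afterwards, the next
export/export-diff should not produce those InvalidateRequest messages.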

On Thu, May 11, 2017 at 3:23 PM, Stefan Priebe - Profihost AG
<s.priebe@xxxxxxxxxxxx> wrote:
> Hi Jason,
>
> it seems I can at least circumvent the crashes. Since I restarted ALL
> OSDs after enabling exclusive lock and rebuilding the object maps,
> there have been no new crashes.
>
> What still makes me wonder are those
> librbd::object_map::InvalidateRequest: 0x7f7860004410 should_complete: r=0
>
> messages.
>
> Greets,
> Stefan
>
> On 08.05.2017 at 14:50, Stefan Priebe - Profihost AG wrote:
>> Hi,
>> On 08.05.2017 at 14:40, Jason Dillaman wrote:
>>> You are saying that you had v2 RBD images created against Hammer OSDs
>>> and client libraries where exclusive lock, object map, etc were never
>>> enabled. You then upgraded the OSDs and clients to Jewel and at some
>>> point enabled exclusive lock (and I'd assume object map) on these
>>> images
>>
>> Yes, I did:
>> for img in $(rbd -p cephstor5 ls -l | grep -v "@" | awk '{ print $1 }');
>> do rbd -p cephstor5 feature enable $img
>> exclusive-lock,object-map,fast-diff || echo $img; done
>>
>>> -- or were the exclusive lock and object map features already
>>> enabled under Hammer?
>>
>> No, as they were not the rbd defaults.
>>
>>> The fact that you encountered an object map error on an export
>>> operation is surprising to me.  Does that error re-occur if you
>>> perform the export again? If you can repeat it, it would be very
>>> helpful if you could run the export with "--debug-rbd=20" and capture
>>> the generated logs.
>>
>> No, I can't repeat it. It happens every night, but for different images.
>> I never saw it for the same VM twice. If I do the export again, it works fine.
>>
>> I'm doing either an rbd export or an rbd export-diff --from-snap; it
>> depends on the VM and the days since the last snapshot.
>>
>> Greets,
>> Stefan
>>
>>>
>>> On Sat, May 6, 2017 at 2:38 PM, Stefan Priebe - Profihost AG
>>> <s.priebe@xxxxxxxxxxxx> wrote:
>>>> Hi,
>>>>
>>>> also i'm getting these errors only for pre jewel images:
>>>>
>>>> 2017-05-06 03:20:50.830626 7f7876a64700 -1
>>>> librbd::object_map::InvalidateRequest: 0x7f7860004410 invalidating
>>>> object map in-memory
>>>> 2017-05-06 03:20:50.830634 7f7876a64700 -1
>>>> librbd::object_map::InvalidateRequest: 0x7f7860004410 invalidating
>>>> object map on-disk
>>>> 2017-05-06 03:20:50.831250 7f7877265700 -1
>>>> librbd::object_map::InvalidateRequest: 0x7f7860004410 should_complete: r=0
>>>>
>>>> while running export-diff.
>>>>
>>>> Stefan
>>>>
>>>> On 06.05.2017 at 07:37, Stefan Priebe - Profihost AG wrote:
>>>>> Hello Jason,
>>>>>
>>>>> while doing further testing, I found it happens only with images that
>>>>> were created with hammer, got upgraded to jewel, AND had exclusive
>>>>> lock enabled.
>>>>>
>>>>> Greets,
>>>>> Stefan
>>>>>
>>>>>> On 04.05.2017 at 14:20, Jason Dillaman wrote:
>>>>>> Odd. Can you re-run "rbd rm" with "--debug-rbd=20" added to the
>>>>>> command and post the resulting log to a new ticket at [1]? I'd also be
>>>>>> interested if you could re-create that
>>>>>> "librbd::object_map::InvalidateRequest" issue repeatably.
>>>>>>
>>>>>> [1] http://tracker.ceph.com/projects/rbd/issues
>>>>>>
>>>>>> On Thu, May 4, 2017 at 3:45 AM, Stefan Priebe - Profihost AG
>>>>>> <s.priebe@xxxxxxxxxxxx> wrote:
>>>>>>> Example:
>>>>>>> # rbd rm cephstor2/vm-136-disk-1
>>>>>>> Removing image: 99% complete...
>>>>>>>
>>>>>>> Stuck at 99% and never completes. This is an image which got corrupted
>>>>>>> for an unknown reason.
>>>>>>>
>>>>>>> Greets,
>>>>>>> Stefan
>>>>>>>
>>>>>>>> On 04.05.2017 at 08:32, Stefan Priebe - Profihost AG wrote:
>>>>>>>> I'm not sure whether this is related, but our backup system uses rbd
>>>>>>>> snapshots and sometimes reports messages like these:
>>>>>>>> 2017-05-04 02:42:47.661263 7f3316ffd700 -1
>>>>>>>> librbd::object_map::InvalidateRequest: 0x7f3310002570 should_complete: r=0
>>>>>>>>
>>>>>>>> Stefan
>>>>>>>>
>>>>>>>>
>>>>>>>>> On 04.05.2017 at 07:49, Stefan Priebe - Profihost AG wrote:
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> since we upgraded from hammer to jewel 10.2.7 and enabled
>>>>>>>>> exclusive-lock,object-map,fast-diff, we've had problems with
>>>>>>>>> corrupted VM filesystems.
>>>>>>>>>
>>>>>>>>> Sometimes the VMs just crash with FS errors and a restart solves the
>>>>>>>>> problem. Sometimes the whole VM is not even bootable and we need to
>>>>>>>>> import a backup.
>>>>>>>>>
>>>>>>>>> All of them have the same problem: you can't revert to an older
>>>>>>>>> snapshot. The rbd command just hangs at 99% forever.
>>>>>>>>>
>>>>>>>>> Is this a known issue - anything we can check?
>>>>>>>>>
>>>>>>>>> Greets,
>>>>>>>>> Stefan
>>>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> ceph-users mailing list
>>>>>>> ceph-users@xxxxxxxxxxxxxx
>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>>>
>>>>>>
>>>>>>
>>>
>>>
>>>



-- 
Jason
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


