Re: corrupted rbd filesystems since jewel

It appears that client.27994090 at 10.255.0.13 currently owns the
exclusive lock on that image. I am assuming the log is from "rbd
feature disable"? If so, I can see that it attempts to acquire the
lock but the other side is not responding appropriately to the
request.
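
To double-check which client owns the lock, something like the
following should show it (the image spec is only an example):

  rbd status cephstor5/<image>    # lists current watchers
  rbd lock ls cephstor5/<image>   # shows the exclusive lock owner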

Assuming your system is still in this state, is there any chance you
could get debug_rbd=20 logs from that client via its asok file, using
"ceph --admin-daemon /path/to/client/asok config set debug_rbd 20",
and then re-run the attempt to disable exclusive-lock? Also, what
version of Ceph is that client running?
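
For example (the asok path below is only illustrative; use the actual
path on that client, and adjust pool/image):

  ceph --admin-daemon /var/run/ceph/ceph-client.27994090.asok \
      config set debug_rbd 20
  # dependent features have to be disabled before the lock itself:
  rbd -p cephstor5 feature disable <image> fast-diff,object-map,exclusive-lock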

Jason

On Sun, May 14, 2017 at 1:55 AM, Stefan Priebe - Profihost AG
<s.priebe@xxxxxxxxxxxx> wrote:
> Hello Jason,
>
> Since it still happens and VMs are crashing, I wanted to disable
> exclusive-lock and fast-diff again. But I noticed that there are
> images where the rbd command runs in an endless loop.
>
> I canceled the command after 60 s and re-ran it with --debug-rbd=20.
> I will send the log off-list.
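>
> For reference, the re-run looked roughly like this (log path is just
> an example):
>
>   rbd --debug-rbd=20 --log-file=/tmp/rbd-feature-disable.log \
>       -p cephstor5 feature disable <image> exclusive-lock,fast-diff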
>
> Thanks!
>
> Greets,
> Stefan
>
> On 13.05.2017 at 19:19, Stefan Priebe - Profihost AG wrote:
>> Hello Jason,
>>
>> it seems to be related to fstrim and discard. I cannot reproduce it
>> for images where we don't use trim, but it remains the case that
>> images created with Jewel work fine while images created pre-Jewel do
>> not. The only difference I can find is that the images created with
>> Jewel also support deep-flatten.
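>>
>> The feature difference is visible when comparing the images, e.g.
>> (image names are placeholders):
>>
>>   rbd -p cephstor5 info <pre-jewel-image> | grep features
>>   rbd -p cephstor5 info <jewel-image>     | grep features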
>>
>> Greets,
>> Stefan
>>
>> On 11.05.2017 at 22:28, Jason Dillaman wrote:
>>> Assuming the only log messages you are seeing are the following:
>>>
>>> 2017-05-06 03:20:50.830626 7f7876a64700 -1
>>> librbd::object_map::InvalidateRequest: 0x7f7860004410 invalidating
>>> object map in-memory
>>> 2017-05-06 03:20:50.830634 7f7876a64700 -1
>>> librbd::object_map::InvalidateRequest: 0x7f7860004410 invalidating
>>> object map on-disk
>>> 2017-05-06 03:20:50.831250 7f7877265700 -1
>>> librbd::object_map::InvalidateRequest: 0x7f7860004410 should_complete: r=0
>>>
>>> It looks like that can only occur if somehow the object-map on disk is
>>> larger than the actual image size. If that's the case, how the image
>>> got into that state is unknown to me at this point.
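>>>
>>> If an image is in that state, rebuilding the object map should bring
>>> it back in sync with the actual image size (image spec is only an
>>> example):
>>>
>>>   rbd object-map rebuild cephstor5/<image>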
>>>
>>> On Thu, May 11, 2017 at 3:23 PM, Stefan Priebe - Profihost AG
>>> <s.priebe@xxxxxxxxxxxx> wrote:
>>>> Hi Jason,
>>>>
>>>> it seems I can at least circumvent the crashes. Since I restarted
>>>> ALL OSDs after enabling exclusive-lock and rebuilding the object
>>>> maps, there have been no new crashes.
>>>>
>>>> What still puzzles me are these
>>>> librbd::object_map::InvalidateRequest: 0x7f7860004410 should_complete: r=0
>>>>
>>>> messages.
>>>>
>>>> Greets,
>>>> Stefan
>>>>
>>>> On 08.05.2017 at 14:50, Stefan Priebe - Profihost AG wrote:
>>>>> Hi,
>>>>> On 08.05.2017 at 14:40, Jason Dillaman wrote:
>>>>>> You are saying that you had v2 RBD images created against Hammer OSDs
>>>>>> and client libraries where exclusive lock, object map, etc were never
>>>>>> enabled. You then upgraded the OSDs and clients to Jewel and at some
>>>>>> point enabled exclusive lock (and I'd assume object map) on these
>>>>>> images
>>>>>
>>>>> Yes, I did:
>>>>>
>>>>> for img in $(rbd -p cephstor5 ls -l | grep -v "@" | awk '{ print $1 }'); do
>>>>>   rbd -p cephstor5 feature enable $img exclusive-lock,object-map,fast-diff \
>>>>>     || echo $img
>>>>> done
>>>>>
>>>>>> -- or were the exclusive lock and object map features already
>>>>>> enabled under Hammer?
>>>>>
>>>>> No, as they were not the rbd defaults.
>>>>>
>>>>>> The fact that you encountered an object map error on an export
>>>>>> operation is surprising to me. Does that error recur if you
>>>>>> perform the export again? If you can repeat it, it would be very
>>>>>> helpful if you could run the export with "--debug-rbd=20" and capture
>>>>>> the generated logs.
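>>>>>>
>>>>>> Something along these lines should capture it (paths are only
>>>>>> examples):
>>>>>>
>>>>>>   rbd --debug-rbd=20 --log-file=/tmp/rbd-export.log \
>>>>>>       export cephstor5/<image> - > /dev/null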
>>>>>
>>>>> No, I can't repeat it. It happens every night but for different
>>>>> images; I have never seen it twice for the same VM. If I do the
>>>>> export again, it works fine.
>>>>>
>>>>> I'm doing either an rbd export or an rbd export-diff --from-snap,
>>>>> depending on the VM and the days since the last snapshot.
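>>>>>
>>>>> Roughly like this (snapshot and path names vary per VM and day):
>>>>>
>>>>>   rbd export cephstor5/<image>@<snap> /backup/<image>.img
>>>>>   # or, when an older snapshot still exists:
>>>>>   rbd export-diff --from-snap <prev-snap> \
>>>>>       cephstor5/<image>@<cur-snap> /backup/<image>.diff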
>>>>>
>>>>> Greets,
>>>>> Stefan
>>>>>
>>>>>>
>>>>>> On Sat, May 6, 2017 at 2:38 PM, Stefan Priebe - Profihost AG
>>>>>> <s.priebe@xxxxxxxxxxxx> wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> Also, I'm getting these errors only for pre-Jewel images:
>>>>>>>
>>>>>>> 2017-05-06 03:20:50.830626 7f7876a64700 -1
>>>>>>> librbd::object_map::InvalidateRequest: 0x7f7860004410 invalidating
>>>>>>> object map in-memory
>>>>>>> 2017-05-06 03:20:50.830634 7f7876a64700 -1
>>>>>>> librbd::object_map::InvalidateRequest: 0x7f7860004410 invalidating
>>>>>>> object map on-disk
>>>>>>> 2017-05-06 03:20:50.831250 7f7877265700 -1
>>>>>>> librbd::object_map::InvalidateRequest: 0x7f7860004410 should_complete: r=0
>>>>>>>
>>>>>>> while running export-diff.
>>>>>>>
>>>>>>> Stefan
>>>>>>>
>>>>>>> On 06.05.2017 at 07:37, Stefan Priebe - Profihost AG wrote:
>>>>>>>> Hello Jason,
>>>>>>>>
>>>>>>>> while doing further testing, I found it happens only with images
>>>>>>>> that were created with Hammer, upgraded to Jewel, AND then had
>>>>>>>> exclusive-lock enabled.
>>>>>>>>
>>>>>>>> Greets,
>>>>>>>> Stefan
>>>>>>>>
>>>>>>>> On 04.05.2017 at 14:20, Jason Dillaman wrote:
>>>>>>>>> Odd. Can you re-run "rbd rm" with "--debug-rbd=20" added to the
>>>>>>>>> command and post the resulting log to a new ticket at [1]? I'd also be
>>>>>>>>> interested if you could re-create that
>>>>>>>>> "librbd::object_map::InvalidateRequest" issue repeatably.
>>>>>>>>>
>>>>>>>>> [1] http://tracker.ceph.com/projects/rbd/issues
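>>>>>>>>>
>>>>>>>>> For example (log path is just an illustration):
>>>>>>>>>
>>>>>>>>>   rbd --debug-rbd=20 --log-file=/tmp/rbd-rm.log rm cephstor2/vm-136-disk-1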
>>>>>>>>>
>>>>>>>>> On Thu, May 4, 2017 at 3:45 AM, Stefan Priebe - Profihost AG
>>>>>>>>> <s.priebe@xxxxxxxxxxxx> wrote:
>>>>>>>>>> Example:
>>>>>>>>>> # rbd rm cephstor2/vm-136-disk-1
>>>>>>>>>> Removing image: 99% complete...
>>>>>>>>>>
>>>>>>>>>> Stuck at 99% and never completes. This is an image which got corrupted
>>>>>>>>>> for an unknown reason.
>>>>>>>>>>
>>>>>>>>>> Greets,
>>>>>>>>>> Stefan
>>>>>>>>>>
>>>>>>>>>> On 04.05.2017 at 08:32, Stefan Priebe - Profihost AG wrote:
>>>>>>>>>>> I'm not sure whether this is related, but our backup system uses rbd
>>>>>>>>>>> snapshots and sometimes reports messages like these:
>>>>>>>>>>> 2017-05-04 02:42:47.661263 7f3316ffd700 -1
>>>>>>>>>>> librbd::object_map::InvalidateRequest: 0x7f3310002570 should_complete: r=0
>>>>>>>>>>>
>>>>>>>>>>> Stefan
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 04.05.2017 at 07:49, Stefan Priebe - Profihost AG wrote:
>>>>>>>>>>>> Hello,
>>>>>>>>>>>>
>>>>>>>>>>>> since we upgraded from Hammer to Jewel 10.2.7 and enabled
>>>>>>>>>>>> exclusive-lock,object-map,fast-diff, we have had problems with
>>>>>>>>>>>> corrupted VM filesystems.
>>>>>>>>>>>>
>>>>>>>>>>>> Sometimes the VMs just crash with FS errors and a restart solves
>>>>>>>>>>>> the problem. Sometimes the whole VM is not even bootable any more
>>>>>>>>>>>> and we need to import a backup.
>>>>>>>>>>>>
>>>>>>>>>>>> All of them share the same problem: you can't revert to an older
>>>>>>>>>>>> snapshot. The rbd command just hangs at 99% forever.
>>>>>>>>>>>>
>>>>>>>>>>>> Is this a known issue? Anything we can check?
>>>>>>>>>>>>
>>>>>>>>>>>> Greets,
>>>>>>>>>>>> Stefan
>>>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> ceph-users mailing list
>>>>>>>>>> ceph-users@xxxxxxxxxxxxxx
>>>>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>
>>>
>>>



-- 
Jason
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


