Re: corrupted rbd filesystems since jewel

Can you share your current OSD configuration? It's very curious that
your scrub is getting randomly stuck on a few objects for hours at a
time until an OSD is reset.
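
For reference, something like the following should dump the scrub-related
settings of a running OSD via its admin socket (assuming the default
socket path):

# ceph daemon osd.<id> config show | grep scrub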

On Wed, May 17, 2017 at 2:55 PM, Stefan Priebe - Profihost AG
<s.priebe@xxxxxxxxxxxx> wrote:
> Hello Jason,
>
> Minutes ago I had another case where I restarted the OSD which was shown
> in the objecter_requests output.
>
> It seems other scrubs and deep scrubs were hanging as well.
>
> Output before:
>                 4095 active+clean
>                    1 active+clean+scrubbing
>
> Output after restart:
>                 4084 active+clean
>                    7 active+clean+scrubbing+deep
>                    5 active+clean+scrubbing
>
> Both values are changing every few seconds again, doing a lot of scrubs
> and deep scrubs.
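>
> If it helps, the individual PGs in those states can be listed with
> something like:
>
> # ceph pg dump pgs_brief | grep scrub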
>
> Greets,
> Stefan
> Am 17.05.2017 um 20:36 schrieb Stefan Priebe - Profihost AG:
>> Hi,
>>
>> that command does not exist.
>>
>> But at least ceph -s permanently reports 1 pg in scrubbing with no change.
>>
>> Log attached as well.
>>
>> Greets,
>> Stefan
>> Am 17.05.2017 um 20:20 schrieb Jason Dillaman:
>>> Does your ceph status show pg 2.cebed0aa (still) scrubbing? Sure -- I
>>> can quickly scan the new log if you directly send it to me.
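>>>
>>> To check that specific PG directly, something like the following should
>>> show its current state and last-scrub timestamps:
>>>
>>> # ceph pg 2.cebed0aa query | grep -E '"state"|scrub_stamp'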
>>>
>>> On Wed, May 17, 2017 at 2:18 PM, Stefan Priebe - Profihost AG
>>> <s.priebe@xxxxxxxxxxxx> wrote:
>>>> I can send the OSD log, if you want?
>>>>
>>>> Stefan
>>>>
>>>> Am 17.05.2017 um 20:13 schrieb Stefan Priebe - Profihost AG:
>>>>> Hello Jason,
>>>>>
>>>>> the command
>>>>> # rados -p cephstor6 rm rbd_data.21aafa6b8b4567.0000000000000aaa
>>>>>
>>>>> hangs as well. Doing absolutely nothing... waiting forever.
>>>>>
>>>>> Greets,
>>>>> Stefan
>>>>>
>>>>> Am 17.05.2017 um 17:05 schrieb Jason Dillaman:
>>>>>> OSD 23 notes that object rbd_data.21aafa6b8b4567.0000000000000aaa is
>>>>>> waiting for a scrub. What happens if you run "rados -p <rbd pool> rm
>>>>>> rbd_data.21aafa6b8b4567.0000000000000aaa" (capturing the OSD 23 logs
>>>>>> during this)? If that succeeds while your VM remains blocked on that
>>>>>> remove op, it looks like there is some problem in the OSD where ops
>>>>>> queued on a scrub are not properly awoken when the scrub completes.
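>>>>>>
>>>>>> While the rm is pending you could also check the in-flight ops on that
>>>>>> OSD via its admin socket, something along the lines of:
>>>>>>
>>>>>> # ceph daemon osd.23 dump_ops_in_flight
>>>>>>
>>>>>> which should show whether the delete is still queued there behind the
>>>>>> scrub.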
>>>>>>
>>>>>> On Wed, May 17, 2017 at 10:57 AM, Stefan Priebe - Profihost AG
>>>>>> <s.priebe@xxxxxxxxxxxx> wrote:
>>>>>>> Hello Jason,
>>>>>>>
>>>>>>> after enabling the log and generating a gcore dump, the request was
>>>>>>> successful ;-(
>>>>>>>
>>>>>>> So the log only contains the successful request; that was all I was able
>>>>>>> to catch. I can send you the log on request.
>>>>>>>
>>>>>>> Luckily I had another VM on another cluster behaving the same way.
>>>>>>>
>>>>>>> This time osd.23:
>>>>>>> # ceph --admin-daemon
>>>>>>> /var/run/ceph/ceph-client.admin.22969.140085040783360.asok
>>>>>>> objecter_requests
>>>>>>> {
>>>>>>>     "ops": [
>>>>>>>         {
>>>>>>>             "tid": 18777,
>>>>>>>             "pg": "2.cebed0aa",
>>>>>>>             "osd": 23,
>>>>>>>             "object_id": "rbd_data.21aafa6b8b4567.0000000000000aaa",
>>>>>>>             "object_locator": "@2",
>>>>>>>             "target_object_id": "rbd_data.21aafa6b8b4567.0000000000000aaa",
>>>>>>>             "target_object_locator": "@2",
>>>>>>>             "paused": 0,
>>>>>>>             "used_replica": 0,
>>>>>>>             "precalc_pgid": 0,
>>>>>>>             "last_sent": "1.83513e+06s",
>>>>>>>             "attempts": 1,
>>>>>>>             "snapid": "head",
>>>>>>>             "snap_context": "28a43=[]",
>>>>>>>             "mtime": "2017-05-17 16:51:06.0.455475s",
>>>>>>>             "osd_ops": [
>>>>>>>                 "delete"
>>>>>>>             ]
>>>>>>>         }
>>>>>>>     ],
>>>>>>>     "linger_ops": [
>>>>>>>         {
>>>>>>>             "linger_id": 1,
>>>>>>>             "pg": "2.f0709c34",
>>>>>>>             "osd": 23,
>>>>>>>             "object_id": "rbd_header.21aafa6b8b4567",
>>>>>>>             "object_locator": "@2",
>>>>>>>             "target_object_id": "rbd_header.21aafa6b8b4567",
>>>>>>>             "target_object_locator": "@2",
>>>>>>>             "paused": 0,
>>>>>>>             "used_replica": 0,
>>>>>>>             "precalc_pgid": 0,
>>>>>>>             "snapid": "head",
>>>>>>>             "registered": "1"
>>>>>>>         }
>>>>>>>     ],
>>>>>>>     "pool_ops": [],
>>>>>>>     "pool_stat_ops": [],
>>>>>>>     "statfs_ops": [],
>>>>>>>     "command_ops": []
>>>>>>> }
>>>>>>>
>>>>>>> OSD Logfile of OSD 23 attached.
>>>>>>>
>>>>>>> Greets,
>>>>>>> Stefan
>>>>>>>
>>>>>>> Am 17.05.2017 um 16:26 schrieb Jason Dillaman:
>>>>>>>> On Wed, May 17, 2017 at 10:21 AM, Stefan Priebe - Profihost AG
>>>>>>>> <s.priebe@xxxxxxxxxxxx> wrote:
>>>>>>>>> You mean the request, whether it is successful or not? Which log
>>>>>>>>> level should be set to 20?
>>>>>>>>
>>>>>>>>
>>>>>>>> I'm hoping you can re-create the hung remove op when OSD logging is
>>>>>>>> increased -- "debug osd = 20" would be nice if you can turn it up that
>>>>>>>> high while attempting to capture the blocked op.
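>>>>>>>>
>>>>>>>> It can be bumped at runtime without a restart, for example something
>>>>>>>> like:
>>>>>>>>
>>>>>>>> # ceph tell osd.<id> injectargs '--debug-osd 20'
>>>>>>>>
>>>>>>>> and turned back down again once the blocked op has been captured.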
>>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>
>>>



-- 
Jason
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


