Re: corrupted rbd filesystems since jewel

Hello Jason,

Minutes ago I had another case where I restarted the OSD that was shown
in the objecter_requests output.
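
For reference, this is roughly the sequence I used (the asok path and
OSD id below are placeholders, adjust to your deployment and init
system):

# ceph --admin-daemon /var/run/ceph/ceph-client.admin.<pid>.<cookie>.asok objecter_requests
# systemctl restart ceph-osd@<id>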

It seems other scrubs and deep scrubs were hanging as well.

Output before:
                4095 active+clean
                   1 active+clean+scrubbing

Output after restart:
                4084 active+clean
                   7 active+clean+scrubbing+deep
                   5 active+clean+scrubbing

Both values keep changing every few seconds; the cluster is again doing
a lot of scrubs and deep scrubs.

Greets,
Stefan
On 17.05.2017 at 20:36, Stefan Priebe - Profihost AG wrote:
> Hi,
> 
> that command does not exist.
> 
> But at least ceph -s constantly reports 1 pg in scrubbing, with no change.
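> 
> One way to see which pg that is (just filtering the full pg dump):
> 
> # ceph pg dump | grep scrubbing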
> 
> Log attached as well.
> 
> Greets,
> Stefan
> On 17.05.2017 at 20:20, Jason Dillaman wrote:
>> Does your ceph status show pg 2.cebed0aa (still) scrubbing? Sure -- I
>> can quickly scan the new log if you directly send it to me.
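>>
>> If it helps, a quick way to check that PG's scrub state directly
>> should be:
>>
>> # ceph pg 2.cebed0aa query
>>
>> which includes the last_scrub / last_deep_scrub stamps in its output.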
>>
>> On Wed, May 17, 2017 at 2:18 PM, Stefan Priebe - Profihost AG
>> <s.priebe@xxxxxxxxxxxx> wrote:
>>> I can send the OSD log if you want?
>>>
>>> Stefan
>>>
>>> On 17.05.2017 at 20:13, Stefan Priebe - Profihost AG wrote:
>>>> Hello Jason,
>>>>
>>>> the command
>>>> # rados -p cephstor6 rm rbd_data.21aafa6b8b4567.0000000000000aaa
>>>>
>>>> hangs as well. Doing absolutely nothing... waiting forever.
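>>>>
>>>> While it hangs I can also look at the OSD side with something like:
>>>>
>>>> # ceph daemon osd.23 dump_ops_in_flight
>>>>
>>>> to see whether the delete shows up in the queue.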
>>>>
>>>> Greets,
>>>> Stefan
>>>>
>>>> On 17.05.2017 at 17:05, Jason Dillaman wrote:
>>>>> OSD 23 notes that object rbd_data.21aafa6b8b4567.0000000000000aaa is
>>>>> waiting for a scrub. What happens if you run "rados -p <rbd pool> rm
>>>>> rbd_data.21aafa6b8b4567.0000000000000aaa" (capturing the OSD 23 logs
>>>>> during this)? If that succeeds while your VM remains blocked on that
>>>>> remove op, it looks like there is some problem in the OSD where ops
>>>>> queued on a scrub are not properly awoken when the scrub completes.
>>>>>
>>>>> On Wed, May 17, 2017 at 10:57 AM, Stefan Priebe - Profihost AG
>>>>> <s.priebe@xxxxxxxxxxxx> wrote:
>>>>>> Hello Jason,
>>>>>>
>>>>>> After enabling the log and generating a gcore dump, the request was
>>>>>> successful ;-(
>>>>>>
>>>>>> So the log only contains the successful request; I can send it to you
>>>>>> if needed.
>>>>>>
>>>>>> Luckily I had another VM on another cluster behaving the same way.
>>>>>>
>>>>>> This time osd.23:
>>>>>> # ceph --admin-daemon
>>>>>> /var/run/ceph/ceph-client.admin.22969.140085040783360.asok
>>>>>> objecter_requests
>>>>>> {
>>>>>>     "ops": [
>>>>>>         {
>>>>>>             "tid": 18777,
>>>>>>             "pg": "2.cebed0aa",
>>>>>>             "osd": 23,
>>>>>>             "object_id": "rbd_data.21aafa6b8b4567.0000000000000aaa",
>>>>>>             "object_locator": "@2",
>>>>>>             "target_object_id": "rbd_data.21aafa6b8b4567.0000000000000aaa",
>>>>>>             "target_object_locator": "@2",
>>>>>>             "paused": 0,
>>>>>>             "used_replica": 0,
>>>>>>             "precalc_pgid": 0,
>>>>>>             "last_sent": "1.83513e+06s",
>>>>>>             "attempts": 1,
>>>>>>             "snapid": "head",
>>>>>>             "snap_context": "28a43=[]",
>>>>>>             "mtime": "2017-05-17 16:51:06.0.455475s",
>>>>>>             "osd_ops": [
>>>>>>                 "delete"
>>>>>>             ]
>>>>>>         }
>>>>>>     ],
>>>>>>     "linger_ops": [
>>>>>>         {
>>>>>>             "linger_id": 1,
>>>>>>             "pg": "2.f0709c34",
>>>>>>             "osd": 23,
>>>>>>             "object_id": "rbd_header.21aafa6b8b4567",
>>>>>>             "object_locator": "@2",
>>>>>>             "target_object_id": "rbd_header.21aafa6b8b4567",
>>>>>>             "target_object_locator": "@2",
>>>>>>             "paused": 0,
>>>>>>             "used_replica": 0,
>>>>>>             "precalc_pgid": 0,
>>>>>>             "snapid": "head",
>>>>>>             "registered": "1"
>>>>>>         }
>>>>>>     ],
>>>>>>     "pool_ops": [],
>>>>>>     "pool_stat_ops": [],
>>>>>>     "statfs_ops": [],
>>>>>>     "command_ops": []
>>>>>> }
>>>>>>
>>>>>> OSD log file of osd.23 attached.
>>>>>>
>>>>>> Greets,
>>>>>> Stefan
>>>>>>
>>>>>> On 17.05.2017 at 16:26, Jason Dillaman wrote:
>>>>>>> On Wed, May 17, 2017 at 10:21 AM, Stefan Priebe - Profihost AG
>>>>>>> <s.priebe@xxxxxxxxxxxx> wrote:
>>>>>>>> You mean I should capture the request regardless of whether it is
>>>>>>>> successful? Which log level should be set to 20?
>>>>>>>
>>>>>>>
>>>>>>> I'm hoping you can re-create the hung remove op when OSD logging is
>>>>>>> increased -- "debug osd = 20" would be nice if you can turn it up that
>>>>>>> high while attempting to capture the blocked op.
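>>>>>>>
>>>>>>> Something along these lines should do it at runtime, without an OSD
>>>>>>> restart (adjust the OSD id to the one holding the blocked op):
>>>>>>>
>>>>>>> # ceph tell osd.23 injectargs '--debug-osd 20/20'
>>>>>>>
>>>>>>> and turn it back down afterwards, e.g. with '--debug-osd 0/5'.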
>>>>>>>
>>>>>
>>>>>
>>>>>
>>
>>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


