Re: corrupted rbd filesystems since jewel

I can send the OSD log if you want.

Stefan

On 17.05.2017 at 20:13, Stefan Priebe - Profihost AG wrote:
> Hello Jason,
> 
> the command
> # rados -p cephstor6 rm rbd_data.21aafa6b8b4567.0000000000000aaa
> 
> hangs as well, doing absolutely nothing and waiting forever.
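> 
> While it hangs, the blocked op and the reason it is waiting should show
> up on the primary OSD's admin socket. A sketch, assuming shell access on
> the host carrying osd.23 (ops blocked behind a scrub usually carry a
> "waiting for scrub" flag point in the op events):
> 
> # ceph daemon osd.23 dump_ops_in_flight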
> 
> Greets,
> Stefan
> 
> On 17.05.2017 at 17:05, Jason Dillaman wrote:
>> OSD 23 notes that object rbd_data.21aafa6b8b4567.0000000000000aaa is
>> waiting for a scrub. What happens if you run "rados -p <rbd pool> rm
>> rbd_data.21aafa6b8b4567.0000000000000aaa" (capturing the OSD 23 logs
>> during this)? If that succeeds while your VM remains blocked on that
>> remove op, it looks like there is some problem in the OSD where ops
>> queued on a scrub are not properly awoken when the scrub completes.
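>>
>> A minimal sketch of that capture, assuming the default log path
>> /var/log/ceph/ceph-osd.23.log on the host of OSD 23:
>>
>> # on the OSD host, follow the log while the remove runs
>> tail -f /var/log/ceph/ceph-osd.23.log | tee osd23-rm.log
>> # in a second shell, issue the remove
>> rados -p <rbd pool> rm rbd_data.21aafa6b8b4567.0000000000000aaa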
>>
>> On Wed, May 17, 2017 at 10:57 AM, Stefan Priebe - Profihost AG
>> <s.priebe@xxxxxxxxxxxx> wrote:
>>> Hello Jason,
>>>
>>> after enabling the log and generating a gcore dump, the request was
>>> successful ;-(
>>>
>>> So the log only contains the successful request; that was all I was
>>> able to catch. I can send you the log on request.
>>>
>>> Luckily I had another VM on another cluster behaving the same way.
>>>
>>> This time osd.23:
>>> # ceph --admin-daemon
>>> /var/run/ceph/ceph-client.admin.22969.140085040783360.asok
>>> objecter_requests
>>> {
>>>     "ops": [
>>>         {
>>>             "tid": 18777,
>>>             "pg": "2.cebed0aa",
>>>             "osd": 23,
>>>             "object_id": "rbd_data.21aafa6b8b4567.0000000000000aaa",
>>>             "object_locator": "@2",
>>>             "target_object_id": "rbd_data.21aafa6b8b4567.0000000000000aaa",
>>>             "target_object_locator": "@2",
>>>             "paused": 0,
>>>             "used_replica": 0,
>>>             "precalc_pgid": 0,
>>>             "last_sent": "1.83513e+06s",
>>>             "attempts": 1,
>>>             "snapid": "head",
>>>             "snap_context": "28a43=[]",
>>>             "mtime": "2017-05-17 16:51:06.0.455475s",
>>>             "osd_ops": [
>>>                 "delete"
>>>             ]
>>>         }
>>>     ],
>>>     "linger_ops": [
>>>         {
>>>             "linger_id": 1,
>>>             "pg": "2.f0709c34",
>>>             "osd": 23,
>>>             "object_id": "rbd_header.21aafa6b8b4567",
>>>             "object_locator": "@2",
>>>             "target_object_id": "rbd_header.21aafa6b8b4567",
>>>             "target_object_locator": "@2",
>>>             "paused": 0,
>>>             "used_replica": 0,
>>>             "precalc_pgid": 0,
>>>             "snapid": "head",
>>>             "registered": "1"
>>>         }
>>>     ],
>>>     "pool_ops": [],
>>>     "pool_stat_ops": [],
>>>     "statfs_ops": [],
>>>     "command_ops": []
>>> }
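>>>
>>> For reference, the "pg" value above (2.cebed0aa) is pool 2 plus the raw
>>> object hash; the actual placement group is that hash folded onto the
>>> pool's pg_num. To check whether any PG in the pool is mid-scrub, a
>>> sketch:
>>>
>>> ceph pg dump pgs | grep -i scrub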
>>>
>>> The log file of OSD 23 is attached.
>>>
>>> Greets,
>>> Stefan
>>>
>>> On 17.05.2017 at 16:26, Jason Dillaman wrote:
>>>> On Wed, May 17, 2017 at 10:21 AM, Stefan Priebe - Profihost AG
>>>> <s.priebe@xxxxxxxxxxxx> wrote:
>>>>> You mean the request, no matter whether it is successful or not?
>>>>> Which log level should be set to 20?
>>>>
>>>>
>>>> I'm hoping you can re-create the hung remove op when OSD logging is
>>>> increased -- "debug osd = 20" would be nice if you can turn it up that
>>>> high while attempting to capture the blocked op.
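>>>>
>>>> One way to raise that at runtime without restarting the daemon (a
>>>> sketch; osd.23 assumed as the OSD in question, revert once the
>>>> blocked op is captured):
>>>>
>>>> ceph tell osd.23 injectargs '--debug-osd 20/20'
>>>> # ... reproduce the hang, then restore the previous level, e.g.:
>>>> ceph tell osd.23 injectargs '--debug-osd 1/5'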
>>>>
>>
>>
>>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


