Re: corrupted rbd filesystems since jewel

On 17.05.2017 at 21:21, Jason Dillaman wrote:
> Any chance you still have debug logs enabled on OSD 23 after you
> restarted it and the scrub froze again? 

No, but I can do that ;-) Hopefully it freezes again.
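
One way to turn it up on the fly without another restart (debug osd = 20
as you suggested earlier; adding debug ms = 1 on top is just my guess at
what else might help):

# ceph tell osd.23 injectargs '--debug-osd 20 --debug-ms 1'

and back to the defaults once it has frozen and I have grabbed the logs:

# ceph tell osd.23 injectargs '--debug-osd 0/5 --debug-ms 0/5'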

Stefan

> 
> On Wed, May 17, 2017 at 3:19 PM, Stefan Priebe - Profihost AG
> <s.priebe@xxxxxxxxxxxx> wrote:
>> Hello,
>>
>> Now it shows again:
>>>>                 4095 active+clean
>>>>                    1 active+clean+scrubbing
>>
>> and:
>> # ceph pg dump | grep -i scrub
>> dumped all in format plain
>> pg_stat:          2.aa
>> objects:          4040
>> mip:              0
>> degr:             0
>> misp:             0
>> unf:              0
>> bytes:            10128667136
>> log:              3010
>> disklog:          3010
>> state:            active+clean+scrubbing
>> state_stamp:      2017-05-11 09:37:37.962700
>> v:                181936'11196478
>> reported:         181936:8688051
>> up:               [23,41,9]
>> up_primary:       23
>> acting:           [23,41,9]
>> acting_primary:   23
>> last_scrub:       176730'10793226
>> scrub_stamp:      2017-05-10 03:43:20.849784
>> last_deep_scrub:  171715'10548192
>> deep_scrub_stamp: 2017-05-04 14:27:39.210713
>>
>> So it seems the same scrub is stuck again, even after restarting the
>> OSD. It just took some time until this PG was scrubbed again.
>>
>> Greets,
>> Stefan
>> On 17.05.2017 at 21:13, Jason Dillaman wrote:
>>> Can you share your current OSD configuration? It's very curious that
>>> your scrub is getting randomly stuck on a few objects for hours at a
>>> time until an OSD is reset.
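>>>
>>> If it helps, the scrub-related settings can be read straight from the
>>> OSD's admin socket, e.g. (the socket path here is only the packaging
>>> default and may differ on your hosts):
>>>
>>> # ceph --admin-daemon /var/run/ceph/ceph-osd.23.asok config show | grep scrub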
>>>
>>> On Wed, May 17, 2017 at 2:55 PM, Stefan Priebe - Profihost AG
>>> <s.priebe@xxxxxxxxxxxx> wrote:
>>>> Hello Jason,
>>>>
>>>> A few minutes ago I had another case where I restarted the OSD that was
>>>> shown in the objecter_requests output.
>>>>
>>>> It seems other scrubs and deep scrubs were hanging as well.
>>>>
>>>> Output before:
>>>>                 4095 active+clean
>>>>                    1 active+clean+scrubbing
>>>>
>>>> Output after restart:
>>>>                 4084 active+clean
>>>>                    7 active+clean+scrubbing+deep
>>>>                    5 active+clean+scrubbing
>>>>
>>>> Both values keep changing every few seconds; the cluster is doing a lot
>>>> of scrubs and deep scrubs again.
>>>>
>>>> Greets,
>>>> Stefan
>>>> On 17.05.2017 at 20:36, Stefan Priebe - Profihost AG wrote:
>>>>> Hi,
>>>>>
>>>>> That command does not exist.
>>>>>
>>>>> But at least ceph -s permanently reports 1 PG scrubbing, with no change.
>>>>>
>>>>> Log attached as well.
>>>>>
>>>>> Greets,
>>>>> Stefan
>>>>> On 17.05.2017 at 20:20, Jason Dillaman wrote:
>>>>>> Does your ceph status show pg 2.cebed0aa (still) scrubbing? Sure -- I
>>>>>> can quickly scan the new log if you directly send it to me.
>>>>>>
>>>>>> On Wed, May 17, 2017 at 2:18 PM, Stefan Priebe - Profihost AG
>>>>>> <s.priebe@xxxxxxxxxxxx> wrote:
>>>>>>> I can send the OSD log if you want?
>>>>>>>
>>>>>>> Stefan
>>>>>>>
>>>>>>> On 17.05.2017 at 20:13, Stefan Priebe - Profihost AG wrote:
>>>>>>>> Hello Jason,
>>>>>>>>
>>>>>>>> The command
>>>>>>>> # rados -p cephstor6 rm rbd_data.21aafa6b8b4567.0000000000000aaa
>>>>>>>>
>>>>>>>> hangs as well. It does absolutely nothing... just waiting forever.
>>>>>>>>
>>>>>>>> Greets,
>>>>>>>> Stefan
>>>>>>>>
>>>>>>>> On 17.05.2017 at 17:05, Jason Dillaman wrote:
>>>>>>>>> OSD 23 notes that object rbd_data.21aafa6b8b4567.0000000000000aaa is
>>>>>>>>> waiting for a scrub. What happens if you run "rados -p <rbd pool> rm
>>>>>>>>> rbd_data.21aafa6b8b4567.0000000000000aaa" (capturing the OSD 23 logs
>>>>>>>>> during this)? If that succeeds while your VM remains blocked on that
>>>>>>>>> remove op, it looks like there is some problem in the OSD where ops
>>>>>>>>> queued on a scrub are not properly awoken when the scrub completes.
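>>>>>>>>>
>>>>>>>>> While it hangs, the OSD's op tracker should also show the delete op as
>>>>>>>>> delayed, e.g. (equivalent to the --admin-daemon form; the exact event
>>>>>>>>> wording can vary between versions):
>>>>>>>>>
>>>>>>>>> # ceph daemon osd.23 dump_ops_in_flight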
>>>>>>>>>
>>>>>>>>> On Wed, May 17, 2017 at 10:57 AM, Stefan Priebe - Profihost AG
>>>>>>>>> <s.priebe@xxxxxxxxxxxx> wrote:
>>>>>>>>>> Hello Jason,
>>>>>>>>>>
>>>>>>>>>> After enabling the log and generating a gcore dump, the request was
>>>>>>>>>> successful ;-(
>>>>>>>>>>
>>>>>>>>>> So the log only contains the successful request; that is all I was able
>>>>>>>>>> to catch. I can send you the log on request.
>>>>>>>>>>
>>>>>>>>>> Luckily I had another VM on another cluster behaving the same way.
>>>>>>>>>>
>>>>>>>>>> This time osd.23:
>>>>>>>>>> # ceph --admin-daemon
>>>>>>>>>> /var/run/ceph/ceph-client.admin.22969.140085040783360.asok
>>>>>>>>>> objecter_requests
>>>>>>>>>> {
>>>>>>>>>>     "ops": [
>>>>>>>>>>         {
>>>>>>>>>>             "tid": 18777,
>>>>>>>>>>             "pg": "2.cebed0aa",
>>>>>>>>>>             "osd": 23,
>>>>>>>>>>             "object_id": "rbd_data.21aafa6b8b4567.0000000000000aaa",
>>>>>>>>>>             "object_locator": "@2",
>>>>>>>>>>             "target_object_id": "rbd_data.21aafa6b8b4567.0000000000000aaa",
>>>>>>>>>>             "target_object_locator": "@2",
>>>>>>>>>>             "paused": 0,
>>>>>>>>>>             "used_replica": 0,
>>>>>>>>>>             "precalc_pgid": 0,
>>>>>>>>>>             "last_sent": "1.83513e+06s",
>>>>>>>>>>             "attempts": 1,
>>>>>>>>>>             "snapid": "head",
>>>>>>>>>>             "snap_context": "28a43=[]",
>>>>>>>>>>             "mtime": "2017-05-17 16:51:06.0.455475s",
>>>>>>>>>>             "osd_ops": [
>>>>>>>>>>                 "delete"
>>>>>>>>>>             ]
>>>>>>>>>>         }
>>>>>>>>>>     ],
>>>>>>>>>>     "linger_ops": [
>>>>>>>>>>         {
>>>>>>>>>>             "linger_id": 1,
>>>>>>>>>>             "pg": "2.f0709c34",
>>>>>>>>>>             "osd": 23,
>>>>>>>>>>             "object_id": "rbd_header.21aafa6b8b4567",
>>>>>>>>>>             "object_locator": "@2",
>>>>>>>>>>             "target_object_id": "rbd_header.21aafa6b8b4567",
>>>>>>>>>>             "target_object_locator": "@2",
>>>>>>>>>>             "paused": 0,
>>>>>>>>>>             "used_replica": 0,
>>>>>>>>>>             "precalc_pgid": 0,
>>>>>>>>>>             "snapid": "head",
>>>>>>>>>>             "registered": "1"
>>>>>>>>>>         }
>>>>>>>>>>     ],
>>>>>>>>>>     "pool_ops": [],
>>>>>>>>>>     "pool_stat_ops": [],
>>>>>>>>>>     "statfs_ops": [],
>>>>>>>>>>     "command_ops": []
>>>>>>>>>> }
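>>>>>>>>>>
>>>>>>>>>> For reference, the PG and acting set for that object (pool cephstor6 on
>>>>>>>>>> this cluster) can be double-checked with:
>>>>>>>>>>
>>>>>>>>>> # ceph osd map cephstor6 rbd_data.21aafa6b8b4567.0000000000000aaa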
>>>>>>>>>>
>>>>>>>>>> OSD Logfile of OSD 23 attached.
>>>>>>>>>>
>>>>>>>>>> Greets,
>>>>>>>>>> Stefan
>>>>>>>>>>
>>>>>>>>>> On 17.05.2017 at 16:26, Jason Dillaman wrote:
>>>>>>>>>>> On Wed, May 17, 2017 at 10:21 AM, Stefan Priebe - Profihost AG
>>>>>>>>>>> <s.priebe@xxxxxxxxxxxx> wrote:
>>>>>>>>>>>> You mean the request no matter if it is successful or not? Which log
>>>>>>>>>>>> level should be set to 20?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I'm hoping you can re-create the hung remove op when OSD logging is
>>>>>>>>>>> increased -- "debug osd = 20" would be nice if you can turn it up that
>>>>>>>>>>> high while attempting to capture the blocked op.
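>>>>>>>>>>>
>>>>>>>>>>> If a restart is acceptable, the same level can also be set persistently
>>>>>>>>>>> in ceph.conf on the OSD host, e.g.:
>>>>>>>>>>>
>>>>>>>>>>> [osd]
>>>>>>>>>>>     debug osd = 20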
>>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>
>>>>>>
>>>
>>>
>>>
> 
> 
> 
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


