Hi,

thanks for the hint!! This did it. I indeed found stuck requests using
"ceph daemon mds.xxx objecter_requests". I then restarted the OSDs
involved in those requests one by one, and now the problems are gone and
the status is back to HEALTH_OK.

Thanks again
Dietmar
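For the archives, the sequence was roughly the following (the MDS name
is the one from this thread, the OSD id is just an example, and this
assumes systemd-managed OSDs):

# ceph daemon mds.cephmds-01 objecter_requests

Each stuck entry names the OSD it is waiting on. Then, on the host of
each involved OSD, one at a time:

# systemctl restart ceph-osd@12

waiting for "ceph -s" to settle before restarting the next one.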
On 7/17/19 9:08 AM, Yan, Zheng wrote:
> Check if there is any hang request in 'ceph daemon mds.xxx objecter_requests'
>
> On Tue, Jul 16, 2019 at 11:51 PM Dietmar Rieder
> <dietmar.rieder@xxxxxxxxxxx> wrote:
>>
>> On 7/16/19 4:11 PM, Dietmar Rieder wrote:
>>> Hi,
>>>
>>> We are running ceph version 14.1.2 with cephfs only.
>>>
>>> I just noticed that one of our pgs had scrub errors, which I could
>>> repair:
>>>
>>> # ceph health detail
>>> HEALTH_ERR 1 MDSs report slow metadata IOs; 1 MDSs report slow requests;
>>> 1 scrub errors; Possible data damage: 1 pg inconsistent
>>> MDS_SLOW_METADATA_IO 1 MDSs report slow metadata IOs
>>>     mdscephmds-01(mds.0): 3 slow metadata IOs are blocked > 30 secs,
>>> oldest blocked for 47743 secs
>>> MDS_SLOW_REQUEST 1 MDSs report slow requests
>>>     mdscephmds-01(mds.0): 2 slow requests are blocked > 30 secs
>>> OSD_SCRUB_ERRORS 1 scrub errors
>>> PG_DAMAGED Possible data damage: 1 pg inconsistent
>>>     pg 6.e0b is active+clean+inconsistent, acting
>>> [194,23,116,183,149,82,42,132,26]
>>>
>>> Apparently I was able to repair the pg:
>>>
>>> # rados list-inconsistent-pg hdd-ec-data-pool
>>> ["6.e0b"]
>>>
>>> # ceph pg repair 6.e0b
>>> instructing pg 6.e0bs0 on osd.194 to repair
>>>
>>> [...]
>>> 2019-07-16 15:07:13.700 7f851d720700 0 log_channel(cluster) log [DBG] :
>>> 6.e0b repair starts
>>> 2019-07-16 15:10:23.852 7f851d720700 0 log_channel(cluster) log [DBG] :
>>> 6.e0b repair ok, 0 fixed
>>> [...]
>>>
>>> However, I still have HEALTH_WARN due to slow metadata IOs.
>>>
>>> # ceph health detail
>>> HEALTH_WARN 1 MDSs report slow metadata IOs; 1 MDSs report slow requests
>>> MDS_SLOW_METADATA_IO 1 MDSs report slow metadata IOs
>>>     mdscephmds-01(mds.0): 3 slow metadata IOs are blocked > 30 secs,
>>> oldest blocked for 51123 secs
>>> MDS_SLOW_REQUEST 1 MDSs report slow requests
>>>     mdscephmds-01(mds.0): 5 slow requests are blocked > 30 secs
>>>
>>> I already rebooted all my client machines accessing the cephfs via the
>>> kernel client, but the HEALTH_WARN status is still the one above.
>>>
>>> In the MDS log I see tons of the following messages:
>>>
>>> [...]
>>> 2019-07-16 16:08:17.770 7f727fd2e700 0 log_channel(cluster) log [WRN] :
>>> slow request 1920.184123 seconds old, received at 2019-07-16
>>> 15:36:17.586647: client_request(client.3902814:84 getattr pAsLsXsFs
>>> #0x10001daa8ad 2019-07-16 15:36:17.585355 caller_uid=40059,
>>> caller_gid=50000{}) currently failed to rdlock, waiting
>>> 2019-07-16 16:08:19.069 7f7282533700 1 mds.cephmds-01 Updating MDS map
>>> to version 12642 from mon.0
>>> 2019-07-16 16:08:22.769 7f727fd2e700 0 log_channel(cluster) log [WRN] :
>>> 5 slow requests, 0 included below; oldest blocked for > 49539.644840 secs
>>> 2019-07-16 16:08:26.683 7f7282533700 1 mds.cephmds-01 Updating MDS map
>>> to version 12643 from mon.0
>>> [...]
>>>
>>> How can I get back to normal?
>>>
>>> I'd be grateful for any help
>>
>> After I restarted the 3 MDS daemons I got rid of the blocked client
>> requests, but there is still the slow metadata IOs warning:
>>
>> # ceph health detail
>> HEALTH_WARN 1 MDSs report slow metadata IOs
>> MDS_SLOW_METADATA_IO 1 MDSs report slow metadata IOs
>>     mdscephmds-01(mds.0): 2 slow metadata IOs are blocked > 30 secs,
>> oldest blocked for 563 secs
>>
>> The MDS log now has these messages every ~5 seconds:
>>
>> [...]
>> 2019-07-16 17:31:20.456 7f38947a2700 1 mds.cephmds-01 Updating MDS map
>> to version 13638 from mon.2
>> 2019-07-16 17:31:24.529 7f38947a2700 1 mds.cephmds-01 Updating MDS map
>> to version 13639 from mon.2
>> 2019-07-16 17:31:28.560 7f38947a2700 1 mds.cephmds-01 Updating MDS map
>> to version 13640 from mon.2
>> [...]
>>
>> What does this tell me? Can I do something about it?
>> For now I stopped all IO.
>>
>> Best
>> Dietmar
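For reference, a few related views that can help with this kind of
triage (command names as of Nautilus; the MDS name and pg id are the
ones from this thread):

# ceph daemon mds.cephmds-01 dump_ops_in_flight

shows the client requests the MDS currently has in flight and why they
are blocked, while

# ceph daemon mds.cephmds-01 objecter_requests

shows the OSD operations the MDS itself is waiting on (the check that
ultimately pointed at the stuck OSDs here). Before repairing an
inconsistent pg, the affected objects can also be inspected with

# rados list-inconsistent-obj 6.e0b --format=json-pretty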
--
_________________________________________
D i e t m a r R i e d e r, Mag.Dr.
Innsbruck Medical University
Biocenter - Division for Bioinformatics
Innrain 80, 6020 Innsbruck
Phone: +43 512 9003 71402
Fax: +43 512 9003 73100
Email: dietmar.rieder@xxxxxxxxxxx
Web: http://www.icbi.at
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com