On 7/16/19 5:34 PM, Dietmar Rieder wrote:
> On 7/16/19 4:11 PM, Dietmar Rieder wrote:
>> Hi,
>>
>> We are running ceph version 14.1.2 with cephfs only.
>>
>> I just noticed that one of our pgs had scrub errors, which I could repair:
>>
>> # ceph health detail
>> HEALTH_ERR 1 MDSs report slow metadata IOs; 1 MDSs report slow requests;
>> 1 scrub errors; Possible data damage: 1 pg inconsistent
>> MDS_SLOW_METADATA_IO 1 MDSs report slow metadata IOs
>>     mdscephmds-01(mds.0): 3 slow metadata IOs are blocked > 30 secs,
>> oldest blocked for 47743 secs
>> MDS_SLOW_REQUEST 1 MDSs report slow requests
>>     mdscephmds-01(mds.0): 2 slow requests are blocked > 30 secs
>> OSD_SCRUB_ERRORS 1 scrub errors
>> PG_DAMAGED Possible data damage: 1 pg inconsistent
>>     pg 6.e0b is active+clean+inconsistent, acting
>> [194,23,116,183,149,82,42,132,26]
>>
>> Apparently I was able to repair the pg:
>>
>> # rados list-inconsistent-pg hdd-ec-data-pool
>> ["6.e0b"]
>>
>> # ceph pg repair 6.e0b
>> instructing pg 6.e0bs0 on osd.194 to repair
>>
>> [...]
>> 2019-07-16 15:07:13.700 7f851d720700  0 log_channel(cluster) log [DBG] :
>> 6.e0b repair starts
>> 2019-07-16 15:10:23.852 7f851d720700  0 log_channel(cluster) log [DBG] :
>> 6.e0b repair ok, 0 fixed
>> [...]
>>
>> However, I still have HEALTH_WARN due to slow metadata IOs:
>>
>> # ceph health detail
>> HEALTH_WARN 1 MDSs report slow metadata IOs; 1 MDSs report slow requests
>> MDS_SLOW_METADATA_IO 1 MDSs report slow metadata IOs
>>     mdscephmds-01(mds.0): 3 slow metadata IOs are blocked > 30 secs,
>> oldest blocked for 51123 secs
>> MDS_SLOW_REQUEST 1 MDSs report slow requests
>>     mdscephmds-01(mds.0): 5 slow requests are blocked > 30 secs
>>
>> I already rebooted all my client machines accessing the cephfs via the
>> kernel client, but the HEALTH_WARN status is still the one above.
>>
>> In the MDS log I see tons of the following messages:
>>
>> [...]
>> 2019-07-16 16:08:17.770 7f727fd2e700  0 log_channel(cluster) log [WRN] :
>> slow request 1920.184123 seconds old, received at 2019-07-16
>> 15:36:17.586647: client_request(client.3902814:84 getattr pAsLsXsFs
>> #0x10001daa8ad 2019-07-16 15:36:17.585355 caller_uid=40059,
>> caller_gid=50000{}) currently failed to rdlock, waiting
>> 2019-07-16 16:08:19.069 7f7282533700  1 mds.cephmds-01 Updating MDS map
>> to version 12642 from mon.0
>> 2019-07-16 16:08:22.769 7f727fd2e700  0 log_channel(cluster) log [WRN] :
>> 5 slow requests, 0 included below; oldest blocked for > 49539.644840 secs
>> 2019-07-16 16:08:26.683 7f7282533700  1 mds.cephmds-01 Updating MDS map
>> to version 12643 from mon.0
>> [...]
>>
>> How can I get back to normal?
>>
>> I'd be grateful for any help.
>
>
> After I restarted the 3 MDS daemons I got rid of the blocked client
> requests, but the slow metadata IOs warning is still there:
>
> # ceph health detail
> HEALTH_WARN 1 MDSs report slow metadata IOs
> MDS_SLOW_METADATA_IO 1 MDSs report slow metadata IOs
>     mdscephmds-01(mds.0): 2 slow metadata IOs are blocked > 30 secs,
> oldest blocked for 563 secs
>
> The MDS log now has these messages every ~5 seconds:
>
> [...]
> 2019-07-16 17:31:20.456 7f38947a2700  1 mds.cephmds-01 Updating MDS map
> to version 13638 from mon.2
> 2019-07-16 17:31:24.529 7f38947a2700  1 mds.cephmds-01 Updating MDS map
> to version 13639 from mon.2
> 2019-07-16 17:31:28.560 7f38947a2700  1 mds.cephmds-01 Updating MDS map
> to version 13640 from mon.2
> [...]
>
> What does this tell me? Can I do something about it?
> For now I have stopped all IO.
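The "slow request" lines quoted above have a regular shape, so they can be parsed mechanically, e.g. to see which client is blocked and on which lock state. A minimal Python sketch, assuming only the plain-text log format shown above (this parses the quoted text, it is not a Ceph API):

```python
import re

# Matches the MDS "slow request" lines quoted from the log above.
SLOW_RE = re.compile(
    r"slow request (?P<age>[\d.]+) seconds old, received at "
    r"(?P<received>\S+ \S+): client_request\(client\.(?P<client>\d+)"
    r".* currently (?P<state>.+)"
)

# The sample line from the MDS log above, joined onto one line:
line = ("slow request 1920.184123 seconds old, received at 2019-07-16 "
        "15:36:17.586647: client_request(client.3902814:84 getattr pAsLsXsFs "
        "#0x10001daa8ad 2019-07-16 15:36:17.585355 caller_uid=40059, "
        "caller_gid=50000{}) currently failed to rdlock, waiting")

m = SLOW_RE.search(line)
if m:
    # e.g. "client 3902814 blocked 1920.184123s: failed to rdlock, waiting"
    print(f"client {m.group('client')} blocked {m.group('age')}s: "
          f"{m.group('state')}")
```

Run over the whole MDS log, this would show whether the blocked requests all come from the same client session, which is what restarting the MDS daemons (below) ended up clearing.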
I have now waited about 12 hours with no IO (cephfs was mounted, but no users were accessing it), and the slow metadata IOs warning is still there:

# ceph health detail
HEALTH_WARN 1 MDSs report slow metadata IOs; 1 MDSs report slow requests
MDS_SLOW_METADATA_IO 1 MDSs report slow metadata IOs
    mdscephmds-01(mds.0): 2 slow metadata IOs are blocked > 30 secs,
oldest blocked for 40194 secs
MDS_SLOW_REQUEST 1 MDSs report slow requests
    mdscephmds-01(mds.0): 1 slow requests are blocked > 30 secs

"ceph fs dump" gives the following output:

# ceph fs dump
dumped fsmap epoch 24544
e24544
enable_multiple, ever_enabled_multiple: 0,0
compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable
ranges,3=default file layouts on dirs,4=dir inode in separate object,
5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor
table,9=file layout v2,10=snaprealm v2}
legacy client fscid: 3

Filesystem 'cephfs' (3)
fs_name cephfs
epoch 24544
flags 3c
created 2017-10-05 13:04:39.518807
modified 2019-07-17 08:39:46.316309
tableserver 0
root 0
session_timeout 60
session_autoclose 300
max_file_size 1099511627776
min_compat_client -1 (unspecified)
last_failure 0
last_failure_osd_epoch 10365
compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable
ranges,3=default file layouts on dirs,4=dir inode in separate object,
5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor
table,9=file layout v2,10=snaprealm v2}
max_mds 1
in 0
up {0=3944424}
failed
damaged
stopped
data_pools [6,4]
metadata_pool 5
inline_data disabled
balancer
standby_count_wanted 1
3944424: [v2:10.0.3.21:6800/1174400705,v1:10.0.3.21:6803/1174400705]
'cephmds-01' mds.0.16442 up:active seq 10249
3914531: [v2:10.0.3.22:6800/4207539690,v1:10.0.3.22:6801/4207539690]
'cephmds-02' mds.0.0 up:standby-replay seq 33

Standby daemons:

3914555: [v2:10.0.3.23:6800/1847716317,v1:10.0.3.23:6801/1847716317]
'cephmds-03' mds.-1.0 up:standby seq 2

What can be the reason for the slow metadata IOs warning after hours with
no client IO? Does anyone have an idea how to fix this?

Best,
Dietmar
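To watch whether the warning ever ages out (or whether the same requests stay stuck, as they do here), the relevant numbers can be pulled out of the `ceph health detail` text shown above. A minimal sketch; the function name and return shape are my own, and it parses only the plain-text output, not any Ceph interface:

```python
import re

# Sample `ceph health detail` output, taken from the report above.
HEALTH = """\
HEALTH_WARN 1 MDSs report slow metadata IOs; 1 MDSs report slow requests
MDS_SLOW_METADATA_IO 1 MDSs report slow metadata IOs
    mdscephmds-01(mds.0): 2 slow metadata IOs are blocked > 30 secs, oldest blocked for 40194 secs
MDS_SLOW_REQUEST 1 MDSs report slow requests
    mdscephmds-01(mds.0): 1 slow requests are blocked > 30 secs
"""

def slow_metadata_ios(health_detail):
    """Return (count, oldest_blocked_secs) from MDS_SLOW_METADATA_IO, or None."""
    m = re.search(
        r"(\d+) slow metadata IOs are blocked > 30 secs, "
        r"oldest blocked for (\d+) secs",
        health_detail,
    )
    return (int(m.group(1)), int(m.group(2))) if m else None

print(slow_metadata_ios(HEALTH))  # (2, 40194)
```

Polling this periodically makes it obvious whether "oldest blocked" keeps growing, i.e. the same metadata IOs remain stuck rather than new ones arriving.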
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com