On 7/16/19 5:34 PM, Dietmar Rieder wrote:
> On 7/16/19 4:11 PM, Dietmar Rieder wrote:
>> Hi,
>>
>> We are running ceph version 14.1.2 with cephfs only.
>>
>> I just noticed that one of our pgs had scrub errors, which I could repair:
>>
>> # ceph health detail
>> HEALTH_ERR 1 MDSs report slow metadata IOs; 1 MDSs report slow requests;
>> 1 scrub errors; Possible data damage: 1 pg inconsistent
>> MDS_SLOW_METADATA_IO 1 MDSs report slow metadata IOs
>>     mdscephmds-01(mds.0): 3 slow metadata IOs are blocked > 30 secs,
>> oldest blocked for 47743 secs
>> MDS_SLOW_REQUEST 1 MDSs report slow requests
>>     mdscephmds-01(mds.0): 2 slow requests are blocked > 30 secs
>> OSD_SCRUB_ERRORS 1 scrub errors
>> PG_DAMAGED Possible data damage: 1 pg inconsistent
>>     pg 6.e0b is active+clean+inconsistent, acting
>> [194,23,116,183,149,82,42,132,26]
>>
>> Apparently I was able to repair the pg:
>>
>> # rados list-inconsistent-pg hdd-ec-data-pool
>> ["6.e0b"]
>>
>> # ceph pg repair 6.e0b
>> instructing pg 6.e0bs0 on osd.194 to repair
>>
>> [...]
>> 2019-07-16 15:07:13.700 7f851d720700  0 log_channel(cluster) log [DBG] :
>> 6.e0b repair starts
>> 2019-07-16 15:10:23.852 7f851d720700  0 log_channel(cluster) log [DBG] :
>> 6.e0b repair ok, 0 fixed
>> [...]
>>
>> However, I still have HEALTH_WARN due to slow metadata IOs:
>>
>> # ceph health detail
>> HEALTH_WARN 1 MDSs report slow metadata IOs; 1 MDSs report slow requests
>> MDS_SLOW_METADATA_IO 1 MDSs report slow metadata IOs
>>     mdscephmds-01(mds.0): 3 slow metadata IOs are blocked > 30 secs,
>> oldest blocked for 51123 secs
>> MDS_SLOW_REQUEST 1 MDSs report slow requests
>>     mdscephmds-01(mds.0): 5 slow requests are blocked > 30 secs
>>
>> I already rebooted all my client machines accessing the cephfs via the
>> kernel client, but the HEALTH_WARN status is still the one above.
>>
>> In the MDS log I see tons of the following messages:
>>
>> [...]
>> 2019-07-16 16:08:17.770 7f727fd2e700  0 log_channel(cluster) log [WRN] :
>> slow request 1920.184123 seconds old, received at 2019-07-16
>> 15:36:17.586647: client_request(client.3902814:84 getattr pAsLsXsFs
>> #0x10001daa8ad 2019-07-16 15:36:17.585355 caller_uid=40059,
>> caller_gid=50000{}) currently failed to rdlock, waiting
>> 2019-07-16 16:08:19.069 7f7282533700  1 mds.cephmds-01 Updating MDS map
>> to version 12642 from mon.0
>> 2019-07-16 16:08:22.769 7f727fd2e700  0 log_channel(cluster) log [WRN] :
>> 5 slow requests, 0 included below; oldest blocked for > 49539.644840 secs
>> 2019-07-16 16:08:26.683 7f7282533700  1 mds.cephmds-01 Updating MDS map
>> to version 12643 from mon.0
>> [...]
>>
>> How can I get back to normal?
>>
>> I'd be grateful for any help.
>
>
> After I restarted the 3 MDS daemons I got rid of the blocked client
> requests, but the slow metadata IOs warning is still there:
>
> # ceph health detail
> HEALTH_WARN 1 MDSs report slow metadata IOs
> MDS_SLOW_METADATA_IO 1 MDSs report slow metadata IOs
>     mdscephmds-01(mds.0): 2 slow metadata IOs are blocked > 30 secs,
> oldest blocked for 563 secs
>
> The MDS log now has these messages every ~5 seconds:
>
> [...]
> 2019-07-16 17:31:20.456 7f38947a2700  1 mds.cephmds-01 Updating MDS map
> to version 13638 from mon.2
> 2019-07-16 17:31:24.529 7f38947a2700  1 mds.cephmds-01 Updating MDS map
> to version 13639 from mon.2
> 2019-07-16 17:31:28.560 7f38947a2700  1 mds.cephmds-01 Updating MDS map
> to version 13640 from mon.2
> [...]
>
> What does this tell me? Can I do something about it?
> For now I have stopped all IO.
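The "slow request" lines quoted above have a regular shape, so they can be parsed mechanically, e.g. to see which client is blocked and on which lock state. A minimal Python sketch, assuming only the plain-text log format shown above (this parses the quoted text, it is not a Ceph API):

```python
import re

# Matches the MDS "slow request" lines quoted from the log above.
SLOW_RE = re.compile(
    r"slow request (?P<age>[\d.]+) seconds old, received at "
    r"(?P<received>\S+ \S+): client_request\(client\.(?P<client>\d+)"
    r".* currently (?P<state>.+)"
)

# The sample line from the MDS log above, joined onto one line:
line = ("slow request 1920.184123 seconds old, received at 2019-07-16 "
        "15:36:17.586647: client_request(client.3902814:84 getattr pAsLsXsFs "
        "#0x10001daa8ad 2019-07-16 15:36:17.585355 caller_uid=40059, "
        "caller_gid=50000{}) currently failed to rdlock, waiting")

m = SLOW_RE.search(line)
if m:
    # e.g. "client 3902814 blocked 1920.184123s: failed to rdlock, waiting"
    print(f"client {m.group('client')} blocked {m.group('age')}s: "
          f"{m.group('state')}")
```

Run over the whole MDS log, this would show whether the blocked requests all come from the same client session, which is what restarting the MDS daemons (below) ended up clearing.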
I have now waited about 12 hours with no IO (cephfs was mounted, but no users were accessing it), and the slow metadata IOs warning is still there:

# ceph health detail
HEALTH_WARN 1 MDSs report slow metadata IOs; 1 MDSs report slow requests
MDS_SLOW_METADATA_IO 1 MDSs report slow metadata IOs
    mdscephmds-01(mds.0): 2 slow metadata IOs are blocked > 30 secs,
oldest blocked for 40194 secs
MDS_SLOW_REQUEST 1 MDSs report slow requests
    mdscephmds-01(mds.0): 1 slow requests are blocked > 30 secs

"ceph fs dump" gives the following output:

# ceph fs dump
dumped fsmap epoch 24544
e24544
enable_multiple, ever_enabled_multiple: 0,0
compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable
ranges,3=default file layouts on dirs,4=dir inode in separate object,
5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor
table,9=file layout v2,10=snaprealm v2}
legacy client fscid: 3

Filesystem 'cephfs' (3)
fs_name cephfs
epoch 24544
flags 3c
created 2017-10-05 13:04:39.518807
modified 2019-07-17 08:39:46.316309
tableserver 0
root 0
session_timeout 60
session_autoclose 300
max_file_size 1099511627776
min_compat_client -1 (unspecified)
last_failure 0
last_failure_osd_epoch 10365
compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable
ranges,3=default file layouts on dirs,4=dir inode in separate object,
5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor
table,9=file layout v2,10=snaprealm v2}
max_mds 1
in 0
up {0=3944424}
failed
damaged
stopped
data_pools [6,4]
metadata_pool 5
inline_data disabled
balancer
standby_count_wanted 1
3944424: [v2:10.0.3.21:6800/1174400705,v1:10.0.3.21:6803/1174400705]
'cephmds-01' mds.0.16442 up:active seq 10249
3914531: [v2:10.0.3.22:6800/4207539690,v1:10.0.3.22:6801/4207539690]
'cephmds-02' mds.0.0 up:standby-replay seq 33

Standby daemons:

3914555: [v2:10.0.3.23:6800/1847716317,v1:10.0.3.23:6801/1847716317]
'cephmds-03' mds.-1.0 up:standby seq 2

What can be the reason for the slow metadata IOs warning after hours with
no client IO? Does anyone have an idea how to fix this?

Best,
Dietmar
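To watch whether the warning ever ages out (or whether the same requests stay stuck, as they do here), the relevant numbers can be pulled out of the `ceph health detail` text shown above. A minimal sketch; the function name and return shape are my own, and it parses only the plain-text output, not any Ceph interface:

```python
import re

# Sample `ceph health detail` output, taken from the report above.
HEALTH = """\
HEALTH_WARN 1 MDSs report slow metadata IOs; 1 MDSs report slow requests
MDS_SLOW_METADATA_IO 1 MDSs report slow metadata IOs
    mdscephmds-01(mds.0): 2 slow metadata IOs are blocked > 30 secs, oldest blocked for 40194 secs
MDS_SLOW_REQUEST 1 MDSs report slow requests
    mdscephmds-01(mds.0): 1 slow requests are blocked > 30 secs
"""

def slow_metadata_ios(health_detail):
    """Return (count, oldest_blocked_secs) from MDS_SLOW_METADATA_IO, or None."""
    m = re.search(
        r"(\d+) slow metadata IOs are blocked > 30 secs, "
        r"oldest blocked for (\d+) secs",
        health_detail,
    )
    return (int(m.group(1)), int(m.group(2))) if m else None

print(slow_metadata_ios(HEALTH))  # (2, 40194)
```

Polling this periodically makes it obvious whether "oldest blocked" keeps growing, i.e. the same metadata IOs remain stuck rather than new ones arriving.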
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com