Re: Health error: 1 MDSs report slow metadata IOs, 1 MDSs report slow requests

Burkhard Linke <Burkhard.Linke@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx> · Tue, 24 Sep 2019 13:35:42 +0200

Hi,

You need to fix the inactive PGs first. They are probably also the
reason for the blocked requests.
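
To see which PGs count as inactive, the state summary from the `ceph -s`
output quoted below can be tallied. This is only a rough sketch: the counts
are copied from that output, and `is_active` is a hypothetical helper that
assumes a PG is active exactly when its state string contains the `active`
flag. On a live cluster, `ceph pg dump_stuck inactive` lists the affected
PGs directly.

```python
# PG state counts copied from the "pgs:" section of the quoted `ceph -s` output.
PG_STATES = {
    "active+clean": 8714,
    "active+remapped+backfill_toofull": 90,
    "active+undersized+degraded+remapped+backfill_toofull": 26,
    "peering": 16,
    "remapped+peering": 16,
    "active+remapped+backfill_wait": 7,
    "activating": 1,
    "active+recovery_wait+degraded": 1,
    "active+recovery_wait+undersized+remapped": 1,
}


def is_active(state: str) -> bool:
    """Assumption: a PG serves I/O only if 'active' is one of its '+'-joined flags."""
    return "active" in state.split("+")


# Keep only the states that lack the 'active' flag.
inactive_states = {s: n for s, n in PG_STATES.items() if not is_active(s)}

total_pgs = sum(PG_STATES.values())
inactive_pgs = sum(inactive_states.values())
inactive_percent = 100 * inactive_pgs / total_pgs

for state, count in inactive_states.items():
    print(f"{count:5d}  {state}")
print(f"{inactive_pgs} of {total_pgs} PGs inactive ({inactive_percent:.3f}%)")
```

The tally gives 33 of 8872 PGs (about 0.372%), which matches the
"33 pgs inactive" and "0.372% pgs not active" lines in the health output:
the peering, remapped+peering and activating PGs are the ones to look at.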

Regards,

Burkhard

On 9/24/19 1:30 PM, Thomas wrote:
Hi,
ceph health reports
1 MDSs report slow metadata IOs
1 MDSs report slow requests

This is the complete output of ceph -s:
root@ld3955:~# ceph -s
   cluster:
     id:     6b1b5117-6e08-4843-93d6-2da3cf8a6bae
     health: HEALTH_ERR
             1 MDSs report slow metadata IOs
             1 MDSs report slow requests
             72 nearfull osd(s)
             1 pool(s) nearfull
             Reduced data availability: 33 pgs inactive, 32 pgs peering
             Degraded data redundancy: 123285/153918525 objects degraded (0.080%), 27 pgs degraded, 27 pgs undersized
             Degraded data redundancy (low space): 116 pgs backfill_toofull
             3 pools have too many placement groups
             54 slow requests are blocked > 32 sec
             179 stuck requests are blocked > 4096 sec

   services:
     mon: 3 daemons, quorum ld5505,ld5506,ld5507 (age 21h)
     mgr: ld5507(active, since 21h), standbys: ld5506, ld5505
     mds: pve_cephfs:1 {0=ld3955=up:active} 1 up:standby
     osd: 368 osds: 368 up, 368 in; 140 remapped pgs

   data:
     pools:   6 pools, 8872 pgs
     objects: 51.31M objects, 196 TiB
     usage:   591 TiB used, 561 TiB / 1.1 PiB avail
     pgs:     0.372% pgs not active
              123285/153918525 objects degraded (0.080%)
              621911/153918525 objects misplaced (0.404%)
              8714 active+clean
              90   active+remapped+backfill_toofull
              26   active+undersized+degraded+remapped+backfill_toofull
              16   peering
              16   remapped+peering
              7    active+remapped+backfill_wait
              1    activating
              1    active+recovery_wait+degraded
              1    active+recovery_wait+undersized+remapped

In the log I find these relevant entries:
2019-09-24 13:24:37.073695 mds.ld3955 [WRN] 2 slow requests, 0 included below; oldest blocked for > 18618.873983 secs
2019-09-24 13:24:42.073757 mds.ld3955 [WRN] 2 slow requests, 0 included below; oldest blocked for > 18623.874055 secs
2019-09-24 13:24:47.073852 mds.ld3955 [WRN] 2 slow requests, 0 included below; oldest blocked for > 18628.874149 secs
2019-09-24 13:24:52.073941 mds.ld3955 [WRN] 2 slow requests, 0 included below; oldest blocked for > 18633.874237 secs
2019-09-24 13:24:57.074073 mds.ld3955 [WRN] 2 slow requests, 0 included below; oldest blocked for > 18638.874354 secs
2019-09-24 13:25:02.074118 mds.ld3955 [WRN] 2 slow requests, 0 included below; oldest blocked for > 18643.874415 secs
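
Each of these warnings carries the age of the oldest blocked request, so
subtracting that age from the log timestamp gives a rough idea of when the
request first got stuck. A minimal sketch, assuming the log line format
shown above; `blocked_since` is a hypothetical helper and not part of Ceph.

```python
import re
from datetime import datetime, timedelta

# First and last warning from the log excerpt, each joined onto a single line.
LOG_LINES = [
    "2019-09-24 13:24:37.073695 mds.ld3955 [WRN] 2 slow requests, "
    "0 included below; oldest blocked for > 18618.873983 secs",
    "2019-09-24 13:25:02.074118 mds.ld3955 [WRN] 2 slow requests, "
    "0 included below; oldest blocked for > 18643.874415 secs",
]

# Capture the leading "date time" stamp and the reported age in seconds.
PATTERN = re.compile(r"^(\S+ \S+) .*oldest blocked for > ([\d.]+) secs")


def blocked_since(line: str) -> datetime:
    """Estimate when the oldest request was blocked: log time minus reported age."""
    match = PATTERN.match(line)
    if match is None:
        raise ValueError(f"unrecognised log line: {line!r}")
    logged_at = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f")
    return logged_at - timedelta(seconds=float(match.group(2)))


start_times = [blocked_since(line) for line in LOG_LINES]
for start in start_times:
    print(start.isoformat(sep=" "))
```

Both entries point to about 08:14:18 on 2019-09-24, i.e. the same request
has been stuck for more than five hours. That time can be compared with the
moment the PGs became inactive.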

CephFS resides on a pool "hdd" with dedicated HDDs (4x 17 1.6TB).
This pool is also used for RBDs.

Question:
How can I identify the 2 slow requests?
And how can I kill these requests?

Regards
Thomas
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx