Re: ceph status reports: slow ops - this is related to long-running process /usr/bin/ceph-osd

On 10/8/19 3:53 PM, Thomas wrote:
> Hi,
> ceph status reports:
> root@ld3955:~# ceph -s
>   cluster:
>     id:     6b1b5117-6e08-4843-93d6-2da3cf8a6bae
>     health: HEALTH_ERR
>             1 filesystem is degraded
>             1 filesystem has a failed mds daemon
>             1 filesystem is offline
>             insufficient standby MDS daemons available
>             4 nearfull osd(s)
>             1 pool(s) nearfull
>             Reduced data availability: 59 pgs inactive, 16 pgs peering
>             Degraded data redundancy: 597/153910758 objects degraded
> (0.000%), 2 pgs degraded, 1 pg undersized
>             Degraded data redundancy (low space): 23 pgs backfill_toofull
>             1 pgs not deep-scrubbed in time
>             4 pgs not scrubbed in time
>             3 pools have too many placement groups
>             164 slow requests are blocked > 32 sec
>             1082 stuck requests are blocked > 4096 sec
>             1490 slow ops, oldest one blocked for 19711 sec, daemons
> [osd,0,osd,175,osd,186,osd,5,osd,6,osd,63,osd,68,osd,9,mon,ld5505,mon,ld5506]...
> have slow ops.
> 
>   services:
>     mon: 3 daemons, quorum ld5505,ld5506,ld5507 (age 5h)
>     mgr: ld5507(active, since 5h), standbys: ld5506, ld5505
>     mds: pve_cephfs:0/1, 1 failed
>     osd: 419 osds: 416 up, 416 in; 6024 remapped pgs
> 
>   data:
>     pools:   6 pools, 8864 pgs
>     objects: 51.30M objects, 196 TiB
>     usage:   594 TiB used, 907 TiB / 1.5 PiB avail
>     pgs:     0.666% pgs not active
>              597/153910758 objects degraded (0.000%)
>              52964415/153910758 objects misplaced (34.412%)
>              5954 active+remapped+backfill_wait
>              2786 active+clean
>              40   active+remapped+backfilling
>              35   activating
>              23   active+remapped+backfill_wait+backfill_toofull
>              16   peering
>              7    activating+remapped
>              1    activating+undersized+degraded
>              1    active+clean+scrubbing
>              1    active+recovering+degraded
> 
>   io:
>     client:   3.5 KiB/s wr, 0 op/s rd, 0 op/s wr
>     recovery: 551 MiB/s, 137 objects/s
> 
> I'm concerned about the slow ops on osd.0 and osd.9.
> On the affected OSD node I can see the 2 corresponding ceph-osd services running for hours:
> ceph       14795       1 99 09:58 ?        08:49:22 /usr/bin/ceph-osd -f
> --cluster ceph --id 9 --setuser ceph --setgroup ceph
> ceph       15394       1 99 09:58 ?        07:10:00 /usr/bin/ceph-osd -f
> --cluster ceph --id 0 --setuser ceph --setgroup ceph
> 
> In the corresponding OSD logs I can find messages like these repeating:
> root@ld5505:~# tail -f /var/log/ceph/ceph-osd.0.log
> 2019-10-08 15:35:32.830 7ff60c7cc700 -1 osd.0 233323 get_health_metrics
> reporting 236 slow ops, oldest is osd_pg_create(e233257 38.0:199987)
> 2019-10-08 15:35:33.806 7ff60c7cc700 -1 osd.0 233323 get_health_metrics
> reporting 236 slow ops, oldest is osd_pg_create(e233257 38.0:199987)
> 2019-10-08 15:35:34.842 7ff60c7cc700 -1 osd.0 233323 get_health_metrics
> reporting 236 slow ops, oldest is osd_pg_create(e233257 38.0:199987)
> 2019-10-08 15:35:35.862 7ff60c7cc700 -1 osd.0 233323 get_health_metrics
> reporting 236 slow ops, oldest is osd_pg_create(e233257 38.0:199987)
> 

This caught my attention, as I have seen this happen twice on a cluster.

I created an issue in the tracker as I think it might be the same thing:
https://tracker.ceph.com/issues/44184
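
In the meantime, the admin socket of the affected OSDs is usually the
quickest way to see what those blocked requests actually are. A rough
sketch, run on the node hosting the OSD (osd.0 / osd.9 as in your output):

# ops currently in flight on this OSD, with their age and the state
# they are blocked in
ceph daemon osd.0 dump_ops_in_flight
ceph daemon osd.0 dump_blocked_ops

# recently completed slow ops, with a per-event timeline showing where
# the time was spent
ceph daemon osd.0 dump_historic_ops

# oldest_map / newest_map show how far the OSD has caught up with the
# osdmaps, which can be a hint when osd_pg_create messages pile up
ceph daemon osd.0 status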

Wido

> root@ld5505:~# tail -f /var/log/ceph/ceph-osd.9.log
> 2019-10-08 15:35:38.822 7f8957599700 -1 osd.9 233407 get_health_metrics
> reporting 818 slow ops, oldest is osd_op(client.53385387.0:23 30.f7
> 30.bcc140f7 (undecoded) ondisk+retry+read+known_if_redirected e233362)
> 2019-10-08 15:35:39.854 7f8957599700 -1 osd.9 233407 get_health_metrics
> reporting 818 slow ops, oldest is osd_op(client.53385387.0:23 30.f7
> 30.bcc140f7 (undecoded) ondisk+retry+read+known_if_redirected e233362)
> 2019-10-08 15:35:40.850 7f8957599700 -1 osd.9 233407 get_health_metrics
> reporting 818 slow ops, oldest is osd_op(client.53385387.0:23 30.f7
> 30.bcc140f7 (undecoded) ondisk+retry+read+known_if_redirected e233362)
> 2019-10-08 15:35:41.862 7f8957599700 -1 osd.9 233407 get_health_metrics
> reporting 818 slow ops, oldest is osd_op(client.53385387.0:23 30.f7
> 30.bcc140f7 (undecoded) ondisk+retry+read+known_if_redirected e233362)
> 
> Question:
> How can I analyse and solve the issue with slow ops?
> 
> THX
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
> 
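
To the "how can I analyse this" question above, a rough starting point is
to let the cluster tell you which daemons and PGs are involved, and then
check whether the blocked client ops map to the inactive/peering PGs from
the status output:

# lists exactly which OSDs/monitors report slow ops and which PGs are
# inactive, nearfull or backfill_toofull
ceph health detail

# PGs stuck inactive (peering/activating) and the OSDs they map to
ceph pg dump_stuck inactive

# drill down into a single PG, e.g. the one from the osd.9 slow op above
ceph pg 30.f7 query
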
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



