Re: Identify slow ops

Thomas Schneider <74cmonty@xxxxxxxxx> · Mon, 23 Mar 2020 09:40:22 +0100

Hi,

I have upgraded to 14.2.8 and rebooted all nodes sequentially, including
all 3 MON services.
However, the slow ops are still reported, and the blocked time keeps
increasing.
# ceph -s
  cluster:
    id:     6b1b5117-6e08-4843-93d6-2da3cf8a6bae
    health: HEALTH_WARN
            17 daemons have recently crashed
            2263 slow ops, oldest one blocked for 183885 sec, mon.ld5505 has slow ops

  services:
    mon: 3 daemons, quorum ld5505,ld5506,ld5507 (age 2d)
    mgr: ld5505(active, since 2d), standbys: ld5506, ld5507
    mds: cephfs:2 {0=ld4257=up:active,1=ld5508=up:active} 2 up:standby-replay 3 up:standby
    osd: 442 osds: 441 up (since 38h), 441 in (since 38h)

  data:
    pools:   7 pools, 19628 pgs
    objects: 68.65M objects, 262 TiB
    usage:   786 TiB used, 744 TiB / 1.5 PiB avail
    pgs:     19628 active+clean

  io:
    client:   3.3 KiB/s rd, 3.1 MiB/s wr, 7 op/s rd, 25 op/s wr

I have the impression that this is not a harmless bug anymore.

Please advise how to proceed.
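
To put the blocked time into perspective, here is a minimal sketch that
parses the slow ops summary line from the output above and converts the
seconds into days and hours. The regular expression and the helper names
(parse_slow_ops, format_duration) are my own and not part of any Ceph
tooling; the input line is copied from the ceph -s output with the line
wrap removed.

```python
import re

# Summary line copied from the `ceph -s` output above (line wrap removed).
HEALTH_LINE = (
    "2263 slow ops, oldest one blocked for 183885 sec, mon.ld5505 has slow ops"
)

# Pattern for the slow ops summary; the group names are hypothetical helpers.
SLOW_OPS_RE = re.compile(
    r"(?P<count>\d+) slow ops, oldest one blocked for (?P<secs>\d+) sec, "
    r"(?P<daemon>\S+) has slow ops"
)


def parse_slow_ops(line):
    """Return (count, blocked_seconds, daemon), or None if the line does not match."""
    match = SLOW_OPS_RE.search(line)
    if match is None:
        return None
    return int(match.group("count")), int(match.group("secs")), match.group("daemon")


def format_duration(seconds):
    """Render a number of seconds as days, hours and minutes."""
    days, rest = divmod(seconds, 86400)
    hours, rest = divmod(rest, 3600)
    minutes = rest // 60
    return f"{days}d {hours}h {minutes}m"


count, secs, daemon = parse_slow_ops(HEALTH_LINE)
print(f"{daemon}: {count} slow ops, oldest blocked for {format_duration(secs)}")
```

For the line above this gives 2d 3h 4m, which is roughly the quorum age
of the mons (2d), so the oldest op may have been blocked since shortly
after the restart.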

THX

On 17.02.2020 at 18:31, Paul Emmerich wrote:
> that's probably just https://tracker.ceph.com/issues/43893
> (a harmless bug)
>
> Restart the mons to get rid of the message
>
> Paul
>
> --
> Paul Emmerich
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH
> Freseniusstr. 31h
> 81247 München
> www.croit.io
> Tel: +49 89 1896585 90
>
> On Mon, Feb 17, 2020 at 2:59 PM Thomas Schneider <74cmonty@xxxxxxxxx> wrote:
>> Hi,
>>
>> the current output of ceph -s reports a warning:
>> 2 slow ops, oldest one blocked for 347335 sec, mon.ld5505 has slow ops
>> This time is increasing.
>>
>> root@ld3955:~# ceph -s
>>   cluster:
>>     id:     6b1b5117-6e08-4843-93d6-2da3cf8a6bae
>>     health: HEALTH_WARN
>>             9 daemons have recently crashed
>>             2 slow ops, oldest one blocked for 347335 sec, mon.ld5505 has slow ops
>>
>>   services:
>>     mon: 3 daemons, quorum ld5505,ld5506,ld5507 (age 3d)
>>     mgr: ld5507(active, since 8m), standbys: ld5506, ld5505
>>     mds: cephfs:2 {0=ld5507=up:active,1=ld5505=up:active} 2 up:standby-replay 3 up:standby
>>     osd: 442 osds: 442 up (since 8d), 442 in (since 9d)
>>
>>   data:
>>     pools:   7 pools, 19628 pgs
>>     objects: 65.78M objects, 251 TiB
>>     usage:   753 TiB used, 779 TiB / 1.5 PiB avail
>>     pgs:     19628 active+clean
>>
>>   io:
>>     client:   427 KiB/s rd, 22 MiB/s wr, 851 op/s rd, 647 op/s wr
>>
>> The details are as follows:
>> root@ld3955:~# ceph health detail
>> HEALTH_WARN 9 daemons have recently crashed; 2 slow ops, oldest one blocked for 347755 sec, mon.ld5505 has slow ops
>> RECENT_CRASH 9 daemons have recently crashed
>>     mds.ld4464 crashed on host ld4464 at 2020-02-09 07:33:59.131171Z
>>     mds.ld5506 crashed on host ld5506 at 2020-02-09 07:42:52.036592Z
>>     mds.ld4257 crashed on host ld4257 at 2020-02-09 07:47:44.369505Z
>>     mds.ld4464 crashed on host ld4464 at 2020-02-09 06:10:24.515912Z
>>     mds.ld5507 crashed on host ld5507 at 2020-02-09 07:13:22.400268Z
>>     mds.ld4257 crashed on host ld4257 at 2020-02-09 06:48:34.742475Z
>>     mds.ld5506 crashed on host ld5506 at 2020-02-09 06:10:24.680648Z
>>     mds.ld4465 crashed on host ld4465 at 2020-02-09 06:52:33.204855Z
>>     mds.ld5506 crashed on host ld5506 at 2020-02-06 07:59:37.089007Z
>> SLOW_OPS 2 slow ops, oldest one blocked for 347755 sec, mon.ld5505 has slow ops
>>
>> There's no error on services (mgr, mon, osd).
>>
>> Can you please advise how to identify the root cause of these slow ops?
>>
>> THX
>>
>> _______________________________________________
>> ceph-users mailing list -- ceph-users@xxxxxxx
>> To unsubscribe send an email to ceph-users-leave@xxxxxxx