Re: Identify slow ops

Hi,

There's no issue with full OSDs / pools anymore after setting the weight on specific
OSDs, but the slow ops warning persists:
  cluster:
    id:     6b1b5117-6e08-4843-93d6-2da3cf8a6bae
    health: HEALTH_WARN
            2 slow ops, oldest one blocked for 345057 sec, mon.ld5505
has slow ops

  services:
    mon: 3 daemons, quorum ld5505,ld5506,ld5507 (age 3d)
    mgr: ld5505(active, since 2w), standbys: ld5506, ld5507
    mds: cephfs:2 {0=ld5507=up:active,1=ld5505=up:active} 2
up:standby-replay 3 up:standby
    osd: 442 osds: 442 up (since 4w), 442 in (since 3d)

  data:
    pools:   7 pools, 19628 pgs
    objects: 67.21M objects, 256 TiB
    usage:   769 TiB used, 762 TiB / 1.5 PiB avail
    pgs:     19628 active+clean

  io:
    client:   458 KiB/s rd, 21 MiB/s wr, 907 op/s rd, 666 op/s wr
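
To see what mon.ld5505 is actually holding, here is a minimal sketch, assuming the monitor's admin socket on ld5505 exposes the usual ops dump (run on the mon host itself):

# list the ops currently in flight on the monitor, including how long
# each one has been blocked and which state it is stuck in
ceph daemon mon.ld5505 ops

# cross-check against the cluster-wide health message
ceph health detail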


Regards
Thomas

On 08.03.2020 at 00:33, ceph@xxxxxxxxxx wrote:
> Hi Thomas,
>
> I would first try to get more space, as Ceph will block I/O when your disks are full - perhaps your PGs are unbalanced.
>
> Does ceph osd df tree give any hint?
>
> Or is this already resolved?
>
> Hth
> Mehmet 
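
As an illustration of the balance check suggested above, a minimal sketch (the OSD id and weight below are placeholders, not values from this cluster):

# per-OSD utilisation; outliers show up in the %USE and VAR columns
ceph osd df tree

# if one OSD sits far above the average, its weight can be lowered, e.g.
ceph osd reweight osd.123 0.95   # osd id and weight are placeholders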
>
> On 5 March 2020 at 09:26:13 CET, Thomas Schneider <74cmonty@xxxxxxxxx> wrote:
>> Hi,
>>
>> I have stopped all 3 MON services sequentially.
>> After starting the 3 MON services again, the slow ops were gone.
>> However, after just 1 minute of MON service uptime, the slow ops are back again, and the blocked time is increasing constantly.
>>
>> root@ld3955:/home/ceph-scripts
>> # ceph -w
>>   cluster:
>>     id:     6b1b5117-6e08-4843-93d6-2da3cf8a6bae
>>     health: HEALTH_WARN
>>             17 nearfull osd(s)
>>             1 pool(s) nearfull
>>             2 slow ops, oldest one blocked for 63 sec, mon.ld5505 has
>> slow ops
>>
>>   services:
>>     mon: 3 daemons, quorum ld5505,ld5506,ld5507 (age 67s)
>>     mgr: ld5505(active, since 11d), standbys: ld5506, ld5507
>>     mds: cephfs:2 {0=ld5507=up:active,1=ld5505=up:active} 2
>> up:standby-replay 3 up:standby
>>     osd: 442 osds: 442 up (since 4w), 442 in (since 4w); 10 remapped
>> pgs
>>
>>   data:
>>     pools:   7 pools, 19628 pgs
>>     objects: 72.14M objects, 275 TiB
>>     usage:   826 TiB used, 705 TiB / 1.5 PiB avail
>>     pgs:     16920/216422157 objects misplaced (0.008%)
>>              19618 active+clean
>>              10    active+remapped+backfilling
>>
>>   io:
>>     client:   454 KiB/s rd, 15 MiB/s wr, 905 op/s rd, 463 op/s wr
>>     recovery: 125 MiB/s, 31 objects/s
>>
>>
>> 2020-03-05 09:21:48.647440 mon.ld5505 [WRN] Health check update: 2 slow
>> ops, oldest one blocked for 63 sec, mon.ld5505 has slow ops (SLOW_OPS)
>> 2020-03-05 09:21:53.648708 mon.ld5505 [WRN] Health check update: 2 slow
>> ops, oldest one blocked for 68 sec, mon.ld5505 has slow ops (SLOW_OPS)
>> 2020-03-05 09:21:58.650186 mon.ld5505 [WRN] Health check update: 2 slow
>> ops, oldest one blocked for 73 sec, mon.ld5505 has slow ops (SLOW_OPS)
>> 2020-03-05 09:22:03.651447 mon.ld5505 [WRN] Health check update: 2 slow
>> ops, oldest one blocked for 78 sec, mon.ld5505 has slow ops (SLOW_OPS)
>> 2020-03-05 09:22:08.653066 mon.ld5505 [WRN] Health check update: 2 slow
>> ops, oldest one blocked for 83 sec, mon.ld5505 has slow ops (SLOW_OPS)
>> 2020-03-05 09:22:13.654699 mon.ld5505 [WRN] Health check update: 2 slow
>> ops, oldest one blocked for 88 sec, mon.ld5505 has slow ops (SLOW_OPS)
>> 2020-03-05 09:22:18.655912 mon.ld5505 [WRN] Health check update: 2 slow
>> ops, oldest one blocked for 93 sec, mon.ld5505 has slow ops (SLOW_OPS)
>> 2020-03-05 09:22:23.657263 mon.ld5505 [WRN] Health check update: 2 slow
>> ops, oldest one blocked for 98 sec, mon.ld5505 has slow ops (SLOW_OPS)
>> 2020-03-05 09:22:28.658514 mon.ld5505 [WRN] Health check update: 2 slow
>> ops, oldest one blocked for 103 sec, mon.ld5505 has slow ops (SLOW_OPS)
>> 2020-03-05 09:22:33.659965 mon.ld5505 [WRN] Health check update: 2 slow
>> ops, oldest one blocked for 108 sec, mon.ld5505 has slow ops (SLOW_OPS)
>> 2020-03-05 09:22:38.661360 mon.ld5505 [WRN] Health check update: 2 slow
>> ops, oldest one blocked for 113 sec, mon.ld5505 has slow ops (SLOW_OPS)
>> 2020-03-05 09:22:43.662727 mon.ld5505 [WRN] Health check update: 2 slow
>> ops, oldest one blocked for 118 sec, mon.ld5505 has slow ops (SLOW_OPS)
>> 2020-03-05 09:22:48.663940 mon.ld5505 [WRN] Health check update: 2 slow
>> ops, oldest one blocked for 123 sec, mon.ld5505 has slow ops (SLOW_OPS)
>> 2020-03-05 09:22:53.685451 mon.ld5505 [WRN] Health check update: 2 slow
>> ops, oldest one blocked for 128 sec, mon.ld5505 has slow ops (SLOW_OPS)
>> 2020-03-05 09:22:58.691603 mon.ld5505 [WRN] Health check update: 2 slow
>> ops, oldest one blocked for 133 sec, mon.ld5505 has slow ops (SLOW_OPS)
>> 2020-03-05 09:23:03.692841 mon.ld5505 [WRN] Health check update: 2 slow
>> ops, oldest one blocked for 138 sec, mon.ld5505 has slow ops (SLOW_OPS)
>> 2020-03-05 09:23:08.694502 mon.ld5505 [WRN] Health check update: 2 slow
>> ops, oldest one blocked for 143 sec, mon.ld5505 has slow ops (SLOW_OPS)
>> 2020-03-05 09:23:13.695991 mon.ld5505 [WRN] Health check update: 2 slow
>> ops, oldest one blocked for 148 sec, mon.ld5505 has slow ops (SLOW_OPS)
>> 2020-03-05 09:23:18.697689 mon.ld5505 [WRN] Health check update: 2 slow
>> ops, oldest one blocked for 153 sec, mon.ld5505 has slow ops (SLOW_OPS)
>> 2020-03-05 09:23:23.698945 mon.ld5505 [WRN] Health check update: 2 slow
>> ops, oldest one blocked for 158 sec, mon.ld5505 has slow ops (SLOW_OPS)
>> 2020-03-05 09:23:28.700331 mon.ld5505 [WRN] Health check update: 2 slow
>> ops, oldest one blocked for 163 sec, mon.ld5505 has slow ops (SLOW_OPS)
>> 2020-03-05 09:23:33.701754 mon.ld5505 [WRN] Health check update: 2 slow
>> ops, oldest one blocked for 168 sec, mon.ld5505 has slow ops (SLOW_OPS)
>> 2020-03-05 09:23:38.703021 mon.ld5505 [WRN] Health check update: 2 slow
>> ops, oldest one blocked for 173 sec, mon.ld5505 has slow ops (SLOW_OPS)
>> 2020-03-05 09:23:43.704396 mon.ld5505 [WRN] Health check update: 2 slow
>> ops, oldest one blocked for 178 sec, mon.ld5505 has slow ops (SLOW_OPS)
>>
>> I have the impression that this is not a harmless bug anymore.
>>
>> Please advise how to proceed.
>>
>> THX
>>
>>
>> On 17.02.2020 at 18:31, Paul Emmerich wrote:
>>> That's probably just https://tracker.ceph.com/issues/43893 (a harmless bug).
>>>
>>> Restart the mons to get rid of the message
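
A sketch of that restart, assuming systemd-managed mons with the packaged ceph-mon@<host> unit names; restart one monitor at a time and wait for quorum before touching the next:

# on each mon host in turn (ld5505, ld5506, ld5507)
systemctl restart ceph-mon@ld5505

# confirm all three mons are back in quorum before the next restart
ceph -s | grep quorum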
>>>
>>> Paul
>>>
>>> --
>>> Paul Emmerich
>>> Looking for help with your Ceph cluster? Contact us at https://croit.io
>>> croit GmbH, Freseniusstr. 31h, 81247 München
>>> www.croit.io, Tel: +49 89 1896585 90
>>>
>>> On Mon, Feb 17, 2020 at 2:59 PM Thomas Schneider <74cmonty@xxxxxxxxx> wrote:
>>>> Hi,
>>>>
>>>> the current output of ceph -s reports a warning:
>>>> 2 slow ops, oldest one blocked for 347335 sec, mon.ld5505 has slow ops
>>>> This time is increasing.
>>>>
>>>> root@ld3955:~# ceph -s
>>>>   cluster:
>>>>     id:     6b1b5117-6e08-4843-93d6-2da3cf8a6bae
>>>>     health: HEALTH_WARN
>>>>             9 daemons have recently crashed
>>>>             2 slow ops, oldest one blocked for 347335 sec, mon.ld5505 has slow ops
>>>>
>>>>   services:
>>>>     mon: 3 daemons, quorum ld5505,ld5506,ld5507 (age 3d)
>>>>     mgr: ld5507(active, since 8m), standbys: ld5506, ld5505
>>>>     mds: cephfs:2 {0=ld5507=up:active,1=ld5505=up:active} 2
>>>> up:standby-replay 3 up:standby
>>>>     osd: 442 osds: 442 up (since 8d), 442 in (since 9d)
>>>>
>>>>   data:
>>>>     pools:   7 pools, 19628 pgs
>>>>     objects: 65.78M objects, 251 TiB
>>>>     usage:   753 TiB used, 779 TiB / 1.5 PiB avail
>>>>     pgs:     19628 active+clean
>>>>
>>>>   io:
>>>>     client:   427 KiB/s rd, 22 MiB/s wr, 851 op/s rd, 647 op/s wr
>>>>
>>>> The details are as follows:
>>>> root@ld3955:~# ceph health detail
>>>> HEALTH_WARN 9 daemons have recently crashed; 2 slow ops, oldest one
>>>> blocked for 347755 sec, mon.ld5505 has slow ops
>>>> RECENT_CRASH 9 daemons have recently crashed
>>>>     mds.ld4464 crashed on host ld4464 at 2020-02-09 07:33:59.131171Z
>>>>     mds.ld5506 crashed on host ld5506 at 2020-02-09 07:42:52.036592Z
>>>>     mds.ld4257 crashed on host ld4257 at 2020-02-09 07:47:44.369505Z
>>>>     mds.ld4464 crashed on host ld4464 at 2020-02-09 06:10:24.515912Z
>>>>     mds.ld5507 crashed on host ld5507 at 2020-02-09 07:13:22.400268Z
>>>>     mds.ld4257 crashed on host ld4257 at 2020-02-09 06:48:34.742475Z
>>>>     mds.ld5506 crashed on host ld5506 at 2020-02-09 06:10:24.680648Z
>>>>     mds.ld4465 crashed on host ld4465 at 2020-02-09 06:52:33.204855Z
>>>>     mds.ld5506 crashed on host ld5506 at 2020-02-06 07:59:37.089007Z
>>>> SLOW_OPS 2 slow ops, oldest one blocked for 347755 sec, mon.ld5505 has slow ops
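
For the RECENT_CRASH entries listed above, the crash module can show the stored backtraces; a generic sketch (<crash-id> is a placeholder taken from the ls output, not a real id from this cluster):

ceph crash ls                # list recent crashes and their ids
ceph crash info <crash-id>   # metadata and backtrace of a single crash
ceph crash archive-all       # acknowledge them after review, clearing the warning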
>>>>
>>>> There's no error on services (mgr, mon, osd).
>>>>
>>>> Can you please advise how to identify the root cause of these slow ops?
>>>> THX
>>>>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



