Hi Thomas,

I would first try to get more space, as Ceph will block I/O when your disks are full. Perhaps your PGs are unbalanced; does "ceph osd df tree" give any hint? Or is this already resolved? (A rough sketch of the commands I would start with follows at the end of this mail.)

HTH
Mehmet

On 5 March 2020 09:26:13 CET, Thomas Schneider <74cmonty@xxxxxxxxx> wrote:
>Hi,
>
>I have stopped all 3 MON services sequentially.
>After starting the 3 MON services again, the slow ops were gone.
>However, just after 1 min. of MON service uptime, the slow ops are back
>again, and the blocked time is increasing constantly.
>
>root@ld3955:/home/ceph-scripts
># ceph -w
>  cluster:
>    id:     6b1b5117-6e08-4843-93d6-2da3cf8a6bae
>    health: HEALTH_WARN
>            17 nearfull osd(s)
>            1 pool(s) nearfull
>            2 slow ops, oldest one blocked for 63 sec, mon.ld5505 has slow ops
>
>  services:
>    mon: 3 daemons, quorum ld5505,ld5506,ld5507 (age 67s)
>    mgr: ld5505(active, since 11d), standbys: ld5506, ld5507
>    mds: cephfs:2 {0=ld5507=up:active,1=ld5505=up:active} 2 up:standby-replay 3 up:standby
>    osd: 442 osds: 442 up (since 4w), 442 in (since 4w); 10 remapped pgs
>
>  data:
>    pools:   7 pools, 19628 pgs
>    objects: 72.14M objects, 275 TiB
>    usage:   826 TiB used, 705 TiB / 1.5 PiB avail
>    pgs:     16920/216422157 objects misplaced (0.008%)
>             19618 active+clean
>             10    active+remapped+backfilling
>
>  io:
>    client:   454 KiB/s rd, 15 MiB/s wr, 905 op/s rd, 463 op/s wr
>    recovery: 125 MiB/s, 31 objects/s
>
>2020-03-05 09:21:48.647440 mon.ld5505 [WRN] Health check update: 2 slow ops, oldest one blocked for 63 sec, mon.ld5505 has slow ops (SLOW_OPS)
>2020-03-05 09:21:53.648708 mon.ld5505 [WRN] Health check update: 2 slow ops, oldest one blocked for 68 sec, mon.ld5505 has slow ops (SLOW_OPS)
>2020-03-05 09:21:58.650186 mon.ld5505 [WRN] Health check update: 2 slow ops, oldest one blocked for 73 sec, mon.ld5505 has slow ops (SLOW_OPS)
>2020-03-05 09:22:03.651447 mon.ld5505 [WRN] Health check update: 2 slow ops, oldest one blocked for 78 sec, mon.ld5505 has slow ops (SLOW_OPS)
>2020-03-05 09:22:08.653066 mon.ld5505 [WRN] Health check update: 2 slow ops, oldest one blocked for 83 sec, mon.ld5505 has slow ops (SLOW_OPS)
>2020-03-05 09:22:13.654699 mon.ld5505 [WRN] Health check update: 2 slow ops, oldest one blocked for 88 sec, mon.ld5505 has slow ops (SLOW_OPS)
>2020-03-05 09:22:18.655912 mon.ld5505 [WRN] Health check update: 2 slow ops, oldest one blocked for 93 sec, mon.ld5505 has slow ops (SLOW_OPS)
>2020-03-05 09:22:23.657263 mon.ld5505 [WRN] Health check update: 2 slow ops, oldest one blocked for 98 sec, mon.ld5505 has slow ops (SLOW_OPS)
>2020-03-05 09:22:28.658514 mon.ld5505 [WRN] Health check update: 2 slow ops, oldest one blocked for 103 sec, mon.ld5505 has slow ops (SLOW_OPS)
>2020-03-05 09:22:33.659965 mon.ld5505 [WRN] Health check update: 2 slow ops, oldest one blocked for 108 sec, mon.ld5505 has slow ops (SLOW_OPS)
>2020-03-05 09:22:38.661360 mon.ld5505 [WRN] Health check update: 2 slow ops, oldest one blocked for 113 sec, mon.ld5505 has slow ops (SLOW_OPS)
>2020-03-05 09:22:43.662727 mon.ld5505 [WRN] Health check update: 2 slow ops, oldest one blocked for 118 sec, mon.ld5505 has slow ops (SLOW_OPS)
>2020-03-05 09:22:48.663940 mon.ld5505 [WRN] Health check update: 2 slow ops, oldest one blocked for 123 sec, mon.ld5505 has slow ops (SLOW_OPS)
>2020-03-05 09:22:53.685451 mon.ld5505 [WRN] Health check update: 2 slow ops, oldest one blocked for 128 sec, mon.ld5505 has slow ops (SLOW_OPS)
>2020-03-05 09:22:58.691603 mon.ld5505 [WRN] Health check update: 2 slow ops, oldest one blocked for 133 sec, mon.ld5505 has slow ops (SLOW_OPS)
>2020-03-05 09:23:03.692841 mon.ld5505 [WRN] Health check update: 2 slow ops, oldest one blocked for 138 sec, mon.ld5505 has slow ops (SLOW_OPS)
>2020-03-05 09:23:08.694502 mon.ld5505 [WRN] Health check update: 2 slow ops, oldest one blocked for 143 sec, mon.ld5505 has slow ops (SLOW_OPS)
>2020-03-05 09:23:13.695991 mon.ld5505 [WRN] Health check update: 2 slow ops, oldest one blocked for 148 sec, mon.ld5505 has slow ops (SLOW_OPS)
>2020-03-05 09:23:18.697689 mon.ld5505 [WRN] Health check update: 2 slow ops, oldest one blocked for 153 sec, mon.ld5505 has slow ops (SLOW_OPS)
>2020-03-05 09:23:23.698945 mon.ld5505 [WRN] Health check update: 2 slow ops, oldest one blocked for 158 sec, mon.ld5505 has slow ops (SLOW_OPS)
>2020-03-05 09:23:28.700331 mon.ld5505 [WRN] Health check update: 2 slow ops, oldest one blocked for 163 sec, mon.ld5505 has slow ops (SLOW_OPS)
>2020-03-05 09:23:33.701754 mon.ld5505 [WRN] Health check update: 2 slow ops, oldest one blocked for 168 sec, mon.ld5505 has slow ops (SLOW_OPS)
>2020-03-05 09:23:38.703021 mon.ld5505 [WRN] Health check update: 2 slow ops, oldest one blocked for 173 sec, mon.ld5505 has slow ops (SLOW_OPS)
>2020-03-05 09:23:43.704396 mon.ld5505 [WRN] Health check update: 2 slow ops, oldest one blocked for 178 sec, mon.ld5505 has slow ops (SLOW_OPS)
>
>I have the impression that this is not a harmless bug anymore.
>
>Please advise how to proceed.
>
>THX
>
>
>On 17.02.2020 at 18:31, Paul Emmerich wrote:
>> that's probably just https://tracker.ceph.com/issues/43893
>> (a harmless bug)
>>
>> Restart the mons to get rid of the message
>>
>> Paul
>>
>> --
>> Paul Emmerich
>>
>> Looking for help with your Ceph cluster? Contact us at https://croit.io
>>
>> croit GmbH
>> Freseniusstr. 31h
>> 81247 München
>> www.croit.io
>> Tel: +49 89 1896585 90
>>
>> On Mon, Feb 17, 2020 at 2:59 PM Thomas Schneider <74cmonty@xxxxxxxxx> wrote:
>>> Hi,
>>>
>>> the current output of ceph -s reports a warning:
>>> 2 slow ops, oldest one blocked for 347335 sec, mon.ld5505 has slow ops
>>> This time is increasing.
>>>
>>> root@ld3955:~# ceph -s
>>>   cluster:
>>>     id:     6b1b5117-6e08-4843-93d6-2da3cf8a6bae
>>>     health: HEALTH_WARN
>>>             9 daemons have recently crashed
>>>             2 slow ops, oldest one blocked for 347335 sec, mon.ld5505 has slow ops
>>>
>>>   services:
>>>     mon: 3 daemons, quorum ld5505,ld5506,ld5507 (age 3d)
>>>     mgr: ld5507(active, since 8m), standbys: ld5506, ld5505
>>>     mds: cephfs:2 {0=ld5507=up:active,1=ld5505=up:active} 2 up:standby-replay 3 up:standby
>>>     osd: 442 osds: 442 up (since 8d), 442 in (since 9d)
>>>
>>>   data:
>>>     pools:   7 pools, 19628 pgs
>>>     objects: 65.78M objects, 251 TiB
>>>     usage:   753 TiB used, 779 TiB / 1.5 PiB avail
>>>     pgs:     19628 active+clean
>>>
>>>   io:
>>>     client:   427 KiB/s rd, 22 MiB/s wr, 851 op/s rd, 647 op/s wr
>>>
>>> The details are as follows:
>>> root@ld3955:~# ceph health detail
>>> HEALTH_WARN 9 daemons have recently crashed; 2 slow ops, oldest one blocked for 347755 sec, mon.ld5505 has slow ops
>>> RECENT_CRASH 9 daemons have recently crashed
>>>     mds.ld4464 crashed on host ld4464 at 2020-02-09 07:33:59.131171Z
>>>     mds.ld5506 crashed on host ld5506 at 2020-02-09 07:42:52.036592Z
>>>     mds.ld4257 crashed on host ld4257 at 2020-02-09 07:47:44.369505Z
>>>     mds.ld4464 crashed on host ld4464 at 2020-02-09 06:10:24.515912Z
>>>     mds.ld5507 crashed on host ld5507 at 2020-02-09 07:13:22.400268Z
>>>     mds.ld4257 crashed on host ld4257 at 2020-02-09 06:48:34.742475Z
>>>     mds.ld5506 crashed on host ld5506 at 2020-02-09 06:10:24.680648Z
>>>     mds.ld4465 crashed on host ld4465 at 2020-02-09 06:52:33.204855Z
>>>     mds.ld5506 crashed on host ld5506 at 2020-02-06 07:59:37.089007Z
>>> SLOW_OPS 2 slow ops, oldest one blocked for 347755 sec, mon.ld5505 has slow ops
>>>
>>> There's no error on the services (mgr, mon, osd).
>>>
>>> Can you please advise how to identify the root cause of these slow ops?
>>>
>>> THX
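P.S.: Here is the command sketch I mentioned above. It is untested against your cluster; mon.ld5505 is taken from your output, the balancer step is only a suggestion and the upmap mode needs clients that are Luminous or newer, and the restart at the end is simply what Paul already recommended for the tracker bug.

# Per-OSD utilization and PG count; look for nearfull OSDs that sit well above the average
ceph osd df tree

# On the host running mon.ld5505: ask the monitor which ops it currently reports as stuck
ceph daemon mon.ld5505 ops

# If the data distribution is the problem, let the upmap balancer even out the PGs
ceph balancer mode upmap
ceph balancer on

# On the mon host: restarting the monitor clears the stale SLOW_OPS counter from the bug Paul linked
systemctl restart ceph-mon@ld5505

If the blocked time keeps climbing even though utilization looks fine after rebalancing, the tracker bug Paul linked is the more likely explanation than full disks.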