Eugen, I haven't tried. The first thing I did was restart the OSDs to
restore the services that were stuck. This morning it happened again, on a
totally random disk. If it happens again (I hope not), I will run that
command.

I don't understand why SLOW_OPS appears without any evidence of a failure.
Is there no mechanism to recover automatically from this event?
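For reference, a minimal sketch of what catching those blocked ops can look
like on a cephadm-managed cluster; osd.20 is only an example id taken from
the warnings quoted below, and the exact invocation may differ between
releases:

  # show which daemons are currently flagged with SLOW_OPS
  ceph health detail

  # on the host that runs the affected OSD, enter its container and query
  # the admin socket for the ops that are currently blocked
  cephadm enter --name osd.20
  ceph daemon osd.20 dump_blocked_ops

  # recently completed ops are kept for a while as well
  ceph daemon osd.20 dump_historic_ops

Each dumped op carries the list of events it has passed through together
with timestamps, which usually narrows the stall down to the local disk, a
peer OSD or the network.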
On Fri, Dec 16, 2022 at 11:20, Eugen Block <eblock@xxxxxx> wrote:

> Have you tried catching an OSD's dump_blocked_ops with cephadm?
>
> Quoting Murilo Morais <murilo@xxxxxxxxxxxxxx>:
>
>> Eugen, thanks for answering.
>>
>> I understand that there is not enough memory, but after yesterday's
>> episode I managed to recover a lot of the memory that was in use.
>> Right now I can't upgrade to 48 GB, but it's already planned.
>>
>> Until then everything was normal, but at this very moment it has
>> happened again, without high disk traffic and without much RAM in use.
>> This leads me to believe that the problem is not due to lack of memory,
>> as there is a lot of free memory at the moment.
>>
>> On Wed, Dec 14, 2022 at 13:34, Eugen Block <eblock@xxxxxx> wrote:
>>
>>> With 12 OSDs and a default of 4 GB RAM per OSD you would at least
>>> require 48 GB, usually a little more. Even if you reduced the memory
>>> target per OSD it doesn't mean they can deal with the workload. There
>>> was a thread explaining that a couple of weeks ago.
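To put those numbers into commands (only the stock option names are assumed
here, nothing specific to this cluster): the per-OSD memory target can be
inspected and, if really needed, lowered, but that only shrinks the
BlueStore caches, it does not reduce the work the OSDs have to absorb:

  # effective memory target per OSD (the default is 4 GiB)
  ceph config get osd osd_memory_target

  # 12 OSDs per host x 4 GiB is ~48 GiB, before the OS, MON/MGR daemons
  # and page cache; lowering the target trades cache hit rate for
  # headroom, e.g. roughly 3 GiB per OSD (example value only):
  ceph config set osd osd_memory_target 3221225472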
>>>
>>> Quoting Murilo Morais <murilo@xxxxxxxxxxxxxx>:
>>>
>>>> Good morning everyone.
>>>>
>>>> Guys, today my cluster had a "problem": it was showing SLOW_OPS.
>>>> Restarting the OSDs that were reporting it solved everything (there
>>>> were VMs stuck because of this), but what I'm racking my brain over
>>>> is the reason for the SLOW_OPS in the first place.
>>>>
>>>> In the logs I saw that the problem started at 04:00 AM and continued
>>>> until 07:50 AM (when I restarted the OSDs).
>>>>
>>>> I suspect some exaggerated settings that I applied during the initial
>>>> setup for a test and forgot about, which may have caused the high RAM
>>>> usage (leaving at most 400 MB free out of 32 GB): I set 512 PGs on
>>>> two pools, one of which is the affected one.
>>>>
>>>> In the logs I saw that the problem started when some VMs began their
>>>> backup jobs, increasing writes a little (to a maximum of 300 MBps).
>>>> After a few seconds a disk started to show this WARN and also this
>>>> line:
>>>>
>>>> Dec 14 04:01:01 dcs1.evocorp ceph-mon[639148]: 69 slow requests (by
>>>> type [ 'delayed' : 65 'waiting for sub ops' : 4 ] most affected pool
>>>> [ 'cephfs.ds_disk.data' : 69])
>>>>
>>>> Then it presented these:
>>>>
>>>> Dec 14 04:01:02 dcs1.evocorp ceph-mon[639148]: log_channel(cluster)
>>>> log [WRN] : Health check update: 0 slow ops, oldest one blocked for
>>>> 36 sec, daemons [osd.20,osd.5] have slow ops. (SLOW_OPS)
>>>> [...]
>>>> Dec 14 05:52:01 dcs1.evocorp ceph-mon[639148]: log_channel(cluster)
>>>> log [WRN] : Health check update: 149 slow ops, oldest one blocked for
>>>> 6696 sec, daemons [osd.20,osd.5,osd.50] have slow ops. (SLOW_OPS)
>>>>
>>>> I've already checked SMART and all disks are OK, I've checked the
>>>> graphs generated in Grafana and none of the disks saturate, and there
>>>> haven't been any network-related incidents; in other words, I haven't
>>>> identified any other problem that could cause this.
>>>>
>>>> What could have caused this event? What can I do to prevent it from
>>>> happening again?
>>>>
>>>> Below is some information about the cluster:
>>>> 5 machines, each with 32 GB RAM, 2 processors and 12 x 3 TB SAS
>>>> disks, connected through 40 Gb interfaces.
>>>>
>>>> # ceph osd tree
>>>> ID   CLASS  WEIGHT     TYPE NAME      STATUS  REWEIGHT  PRI-AFF
>>>>  -1         163.73932  root default
>>>>  -3          32.74786      host dcs1
>>>>   0    hdd    2.72899          osd.0      up   1.00000  1.00000
>>>>   1    hdd    2.72899          osd.1      up   1.00000  1.00000
>>>>   2    hdd    2.72899          osd.2      up   1.00000  1.00000
>>>>   3    hdd    2.72899          osd.3      up   1.00000  1.00000
>>>>   4    hdd    2.72899          osd.4      up   1.00000  1.00000
>>>>   5    hdd    2.72899          osd.5      up   1.00000  1.00000
>>>>   6    hdd    2.72899          osd.6      up   1.00000  1.00000
>>>>   7    hdd    2.72899          osd.7      up   1.00000  1.00000
>>>>   8    hdd    2.72899          osd.8      up   1.00000  1.00000
>>>>   9    hdd    2.72899          osd.9      up   1.00000  1.00000
>>>>  10    hdd    2.72899          osd.10     up   1.00000  1.00000
>>>>  11    hdd    2.72899          osd.11     up   1.00000  1.00000
>>>>  -5          32.74786      host dcs2
>>>>  12    hdd    2.72899          osd.12     up   1.00000  1.00000
>>>>  13    hdd    2.72899          osd.13     up   1.00000  1.00000
>>>>  14    hdd    2.72899          osd.14     up   1.00000  1.00000
>>>>  15    hdd    2.72899          osd.15     up   1.00000  1.00000
>>>>  16    hdd    2.72899          osd.16     up   1.00000  1.00000
>>>>  17    hdd    2.72899          osd.17     up   1.00000  1.00000
>>>>  18    hdd    2.72899          osd.18     up   1.00000  1.00000
>>>>  19    hdd    2.72899          osd.19     up   1.00000  1.00000
>>>>  20    hdd    2.72899          osd.20     up   1.00000  1.00000
>>>>  21    hdd    2.72899          osd.21     up   1.00000  1.00000
>>>>  22    hdd    2.72899          osd.22     up   1.00000  1.00000
>>>>  23    hdd    2.72899          osd.23     up   1.00000  1.00000
>>>>  -7          32.74786      host dcs3
>>>>  24    hdd    2.72899          osd.24     up   1.00000  1.00000
>>>>  25    hdd    2.72899          osd.25     up   1.00000  1.00000
>>>>  26    hdd    2.72899          osd.26     up   1.00000  1.00000
>>>>  27    hdd    2.72899          osd.27     up   1.00000  1.00000
>>>>  28    hdd    2.72899          osd.28     up   1.00000  1.00000
>>>>  29    hdd    2.72899          osd.29     up   1.00000  1.00000
>>>>  30    hdd    2.72899          osd.30     up   1.00000  1.00000
>>>>  31    hdd    2.72899          osd.31     up   1.00000  1.00000
>>>>  32    hdd    2.72899          osd.32     up   1.00000  1.00000
>>>>  33    hdd    2.72899          osd.33     up   1.00000  1.00000
>>>>  34    hdd    2.72899          osd.34     up   1.00000  1.00000
>>>>  35    hdd    2.72899          osd.35     up   1.00000  1.00000
>>>>  -9          32.74786      host dcs4
>>>>  36    hdd    2.72899          osd.36     up   1.00000  1.00000
>>>>  37    hdd    2.72899          osd.37     up   1.00000  1.00000
>>>>  38    hdd    2.72899          osd.38     up   1.00000  1.00000
>>>>  39    hdd    2.72899          osd.39     up   1.00000  1.00000
>>>>  40    hdd    2.72899          osd.40     up   1.00000  1.00000
>>>>  41    hdd    2.72899          osd.41     up   1.00000  1.00000
>>>>  42    hdd    2.72899          osd.42     up   1.00000  1.00000
>>>>  43    hdd    2.72899          osd.43     up   1.00000  1.00000
>>>>  44    hdd    2.72899          osd.44     up   1.00000  1.00000
>>>>  45    hdd    2.72899          osd.45     up   1.00000  1.00000
>>>>  46    hdd    2.72899          osd.46     up   1.00000  1.00000
>>>>  47    hdd    2.72899          osd.47     up   1.00000  1.00000
>>>> -11          32.74786      host dcs5
>>>>  48    hdd    2.72899          osd.48     up   1.00000  1.00000
>>>>  49    hdd    2.72899          osd.49     up   1.00000  1.00000
>>>>  50    hdd    2.72899          osd.50     up   1.00000  1.00000
>>>>  51    hdd    2.72899          osd.51     up   1.00000  1.00000
>>>>  52    hdd    2.72899          osd.52     up   1.00000  1.00000
>>>>  53    hdd    2.72899          osd.53     up   1.00000  1.00000
>>>>  54    hdd    2.72899          osd.54     up   1.00000  1.00000
>>>>  55    hdd    2.72899          osd.55     up   1.00000  1.00000
>>>>  56    hdd    2.72899          osd.56     up   1.00000  1.00000
>>>>  57    hdd    2.72899          osd.57     up   1.00000  1.00000
>>>>  58    hdd    2.72899          osd.58     up   1.00000  1.00000
>>>>  59    hdd    2.72899          osd.59     up   1.00000  1.00000

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx