Slow ops aren't necessarily an indication of a failure. Does the
affected OSD log anything during that time? Are snap-trims or
deep-scrubs running when the slow ops occur? Do you have CephFS
mounted on the OSD servers? What does 'ceph -s' show when that
occurs? Are there any other warnings besides the slow ops?
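
In case it helps, this is roughly how I check for that (a sketch; the grep
pattern is only an illustration):

# ceph -s
# ceph health detail
# ceph pg dump pgs_brief | grep -E 'scrub|snaptrim'

The last command shows whether any PGs are scrubbing, deep-scrubbing or
snap-trimming while the slow ops are being reported.
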
Quoting Murilo Morais <murilo@xxxxxxxxxxxxxx>:
Eugen, I haven't tried. The first thing I did was restart the OSDs to
restore services that were stuck. This morning it happened again on a
totally random disk. If it happens again (I hope not) I will run this
command.
I don't understand why SLOW_OPS appears without any evidence of failure. Is
there no mechanism to automatically recover from this event?
On Fri, Dec 16, 2022 at 11:20, Eugen Block <eblock@xxxxxx> wrote:
Have you tried catching an OSD's dump_blocked_ops with cephadm?
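
For example (a sketch; osd.20 is taken from your log, and the exact
invocation may vary with the cephadm release):

# cephadm enter --name osd.20
# ceph daemon osd.20 dump_blocked_ops
# ceph daemon osd.20 dump_ops_in_flight

The first command opens a shell inside the OSD container so that the
daemon's admin socket is reachable for the two dump commands.
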
Quoting Murilo Morais <murilo@xxxxxxxxxxxxxx>:
> Eugen, thanks for answering.
>
> I understand that there is not enough memory, but I managed to recover a
> lot of the memory that was in use.
> Right now I can't upgrade to 48 GB, but it's already planned.
>
> After yesterday's episode I managed to recover a lot of the memory that
> was in use.
>
> Until then everything was normal, but at this very moment it happened
> again, without high disk traffic and not much RAM in use. This leads me
> to believe that the problem is not due to lack of memory, as there is a
> lot of free memory at the moment.
>
> On Wed, Dec 14, 2022 at 13:34, Eugen Block <eblock@xxxxxx> wrote:
>
>> With 12 OSDs and a default of 4 GB of RAM per OSD you would require at
>> least 48 GB, usually a little more. Even if you reduced the memory
>> target per OSD, it doesn't mean they can deal with the workload. There
>> was a thread explaining that a couple of weeks ago.
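>>
>> For reference, the memory target can be checked and adjusted like this (a
>> sketch; the 2 GiB value is only an illustration, not a recommendation):
>>
>> # ceph config get osd osd_memory_target
>> # ceph config set osd osd_memory_target 2147483648
>>
>> The default is 4294967296 (4 GiB) per OSD, which is where the roughly
>> 48 GB for 12 OSDs comes from.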
>>
>> Quoting Murilo Morais <murilo@xxxxxxxxxxxxxx>:
>>
>> > Good morning everyone.
>> >
>> > Guys, today my cluster had a "problem": it was showing SLOW_OPS. When I
>> > restarted the OSDs that were showing this problem everything was solved
>> > (there were VMs stuck because of this). What I'm breaking my head over
>> > is the reason for the SLOW_OPS.
>> >
>> > In the logs I saw that the problem started at 04:00 AM and continued
>> > until 07:50 AM (when I restarted the OSDs).
>> >
>> > I suspect some exaggerated settings that I applied during a test in the
>> > initial setup and forgot about, which may have caused the high RAM usage
>> > (leaving at most 400 MB of the 32 GB free): I had set 512 PGs on two
>> > pools, one of which was the affected one.
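>> >
>> > For what it's worth, the pool settings can be reviewed like this (a
>> > sketch; cephfs.ds_disk.data is the pool named in the log below):
>> >
>> > # ceph osd pool get cephfs.ds_disk.data pg_num
>> > # ceph osd pool autoscale-status
>> > # ceph osd pool set cephfs.ds_disk.data pg_autoscale_mode on
>> >
>> > The last command lets the autoscaler adjust pg_num instead of keeping
>> > the value left over from the test.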
>> >
>> > In the logs I saw that the problem started when some VMs began to
>> > perform backup actions, increasing writes a little (to a maximum of
>> > 300 MBps); after a few seconds one disk started to show this WARN and
>> > also this line:
>> > Dec 14 04:01:01 dcs1.evocorp ceph-mon[639148]: 69 slow requests (by type
>> > [ 'delayed' : 65 'waiting for sub ops' : 4 ] most affected pool [
>> > 'cephfs.ds_disk.data' : 69])
>> >
>> > Then it presented these:
>> > Dec 14 04:01:02 dcs1.evocorp ceph-mon[639148]: log_channel(cluster) log
>> > [WRN] : Health check update: 0 slow ops, oldest one blocked for 36 sec,
>> > daemons [osd.20,osd.5 ] have slow ops. (SLOW_OPS)
>> > [...]
>> > Dec 14 05:52:01 dcs1.evocorp ceph-mon[639148]: log_channel(cluster) log
>> > [WRN] : Health check update: 149 slow ops, oldest one blocked for 6696
>> > sec, daemons [osd.20,osd.5 ,osd.50] have slow ops. (SLOW_OPS)
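>> >
>> > For reference, the blocked ops could have been dumped from the OSDs
>> > named there before restarting (a sketch, run on the respective OSD's
>> > host):
>> >
>> > # ceph daemon osd.20 dump_blocked_ops
>> > # ceph daemon osd.20 dump_historic_slow_ops
>> >
>> > The per-op "flag_point" field in that output should show what each
>> > request was waiting on.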
>> >
>> > I've already checked SMART and all disks are OK, I've checked the graphs
>> > generated in Grafana and none of the disks saturate, and there haven't
>> > been any incidents related to the network. In other words, I haven't
>> > identified any other problem that could cause this.
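>> >
>> > The kind of checks I mean, for reference (a sketch; /dev/sdX is only a
>> > placeholder for a suspect disk):
>> >
>> > # ceph osd perf
>> > # iostat -x 5
>> > # smartctl -a /dev/sdX
>> >
>> > 'ceph osd perf' shows the per-OSD commit/apply latency as seen by Ceph
>> > itself, which complements the Grafana and SMART views.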
>> >
>> > What could have caused this event? What can I do to prevent it from
>> > happening again?
>> >
>> > Below is some information about the cluster:
>> > 5 machines with 32 GB of RAM, 2 processors and 12 3 TB SAS disks,
>> > connected through 40 Gb interfaces.
>> >
>> > # ceph osd tree
>> > ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
>> > -1 163.73932 root default
>> > -3 32.74786 host dcs1
>> > 0 hdd 2.72899 osd.0 up 1.00000 1.00000
>> > 1 hdd 2.72899 osd.1 up 1.00000 1.00000
>> > 2 hdd 2.72899 osd.2 up 1.00000 1.00000
>> > 3 hdd 2.72899 osd.3 up 1.00000 1.00000
>> > 4 hdd 2.72899 osd.4 up 1.00000 1.00000
>> > 5 hdd 2.72899 osd.5 up 1.00000 1.00000
>> > 6 hdd 2.72899 osd.6 up 1.00000 1.00000
>> > 7 hdd 2.72899 osd.7 up 1.00000 1.00000
>> > 8 hdd 2.72899 osd.8 up 1.00000 1.00000
>> > 9 hdd 2.72899 osd.9 up 1.00000 1.00000
>> > 10 hdd 2.72899 osd.10 up 1.00000 1.00000
>> > 11 hdd 2.72899 osd.11 up 1.00000 1.00000
>> > -5 32.74786 host dcs2
>> > 12 hdd 2.72899 osd.12 up 1.00000 1.00000
>> > 13 hdd 2.72899 osd.13 up 1.00000 1.00000
>> > 14 hdd 2.72899 osd.14 up 1.00000 1.00000
>> > 15 hdd 2.72899 osd.15 up 1.00000 1.00000
>> > 16 hdd 2.72899 osd.16 up 1.00000 1.00000
>> > 17 hdd 2.72899 osd.17 up 1.00000 1.00000
>> > 18 hdd 2.72899 osd.18 up 1.00000 1.00000
>> > 19 hdd 2.72899 osd.19 up 1.00000 1.00000
>> > 20 hdd 2.72899 osd.20 up 1.00000 1.00000
>> > 21 hdd 2.72899 osd.21 up 1.00000 1.00000
>> > 22 hdd 2.72899 osd.22 up 1.00000 1.00000
>> > 23 hdd 2.72899 osd.23 up 1.00000 1.00000
>> > -7 32.74786 host dcs3
>> > 24 hdd 2.72899 osd.24 up 1.00000 1.00000
>> > 25 hdd 2.72899 osd.25 up 1.00000 1.00000
>> > 26 hdd 2.72899 osd.26 up 1.00000 1.00000
>> > 27 hdd 2.72899 osd.27 up 1.00000 1.00000
>> > 28 hdd 2.72899 osd.28 up 1.00000 1.00000
>> > 29 hdd 2.72899 osd.29 up 1.00000 1.00000
>> > 30 hdd 2.72899 osd.30 up 1.00000 1.00000
>> > 31 hdd 2.72899 osd.31 up 1.00000 1.00000
>> > 32 hdd 2.72899 osd.32 up 1.00000 1.00000
>> > 33 hdd 2.72899 osd.33 up 1.00000 1.00000
>> > 34 hdd 2.72899 osd.34 up 1.00000 1.00000
>> > 35 hdd 2.72899 osd.35 up 1.00000 1.00000
>> > -9 32.74786 host dcs4
>> > 36 hdd 2.72899 osd.36 up 1.00000 1.00000
>> > 37 hdd 2.72899 osd.37 up 1.00000 1.00000
>> > 38 hdd 2.72899 osd.38 up 1.00000 1.00000
>> > 39 hdd 2.72899 osd.39 up 1.00000 1.00000
>> > 40 hdd 2.72899 osd.40 up 1.00000 1.00000
>> > 41 hdd 2.72899 osd.41 up 1.00000 1.00000
>> > 42 hdd 2.72899 osd.42 up 1.00000 1.00000
>> > 43 hdd 2.72899 osd.43 up 1.00000 1.00000
>> > 44 hdd 2.72899 osd.44 up 1.00000 1.00000
>> > 45 hdd 2.72899 osd.45 up 1.00000 1.00000
>> > 46 hdd 2.72899 osd.46 up 1.00000 1.00000
>> > 47 hdd 2.72899 osd.47 up 1.00000 1.00000
>> > -11 32.74786 host dcs5
>> > 48 hdd 2.72899 osd.48 up 1.00000 1.00000
>> > 49 hdd 2.72899 osd.49 up 1.00000 1.00000
>> > 50 hdd 2.72899 osd.50 up 1.00000 1.00000
>> > 51 hdd 2.72899 osd.51 up 1.00000 1.00000
>> > 52 hdd 2.72899 osd.52 up 1.00000 1.00000
>> > 53 hdd 2.72899 osd.53 up 1.00000 1.00000
>> > 54 hdd 2.72899 osd.54 up 1.00000 1.00000
>> > 55 hdd 2.72899 osd.55 up 1.00000 1.00000
>> > 56 hdd 2.72899 osd.56 up 1.00000 1.00000
>> > 57 hdd 2.72899 osd.57 up 1.00000 1.00000
>> > 58 hdd 2.72899 osd.58 up 1.00000 1.00000
>> > 59 hdd 2.72899 osd.59 up 1.00000 1.00000
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx