Slow ops aren't necessarily an indication of a failure. Does the
affected OSD log anything during that time? Are snap-trims or
deep-scrubs running when the slow ops occur? Do you have CephFS
mounted on the OSD servers? What does 'ceph -s' show when that
occurs? Are there any other warnings besides the slow ops?
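
In case it helps, this is roughly how I check for that (a sketch; the grep
pattern is only an illustration):

# ceph -s
# ceph health detail
# ceph pg dump pgs_brief | grep -E 'scrub|snaptrim'

The last command shows whether any PGs are scrubbing, deep-scrubbing or
snap-trimming while the slow ops are being reported.
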
Quoting Murilo Morais <murilo@xxxxxxxxxxxxxx>:
Eugen, I haven't tried. The first thing I did was restart the OSDs to
restore services that were stuck. This morning it happened again on a
totally random disk. If it happens again (I hope not) I will run this
command.
I don't understand why SLOW_OPS appears without any evidence of failure. Is
there no mechanism to automatically recover from this event?
On Fri, Dec 16, 2022 at 11:20, Eugen Block <eblock@xxxxxx> wrote:
Have you tried catching an OSD's dump_blocked_ops with cephadm?
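
For example (a sketch; osd.20 is taken from your log, and the exact
invocation may vary with the cephadm release):

# cephadm enter --name osd.20
# ceph daemon osd.20 dump_blocked_ops
# ceph daemon osd.20 dump_ops_in_flight

The first command opens a shell inside the OSD container so that the
daemon's admin socket is reachable for the two dump commands.
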
Quoting Murilo Morais <murilo@xxxxxxxxxxxxxx>:
> Eugen, thanks for answering.
>
> I understand that there is not enough memory, but I managed to recover a
> lot of the memory that was in use.
> Right now I can't upgrade to 48 GB, but it's already planned.
>
> After yesterday's episode I managed to recover a lot of the memory that
> was in use.
>
> Until then everything was normal, but at this very moment it happened
> again, without high disk traffic and not much RAM in use. This leads me
> to believe that the problem is not due to lack of memory, as there is a
> lot of free memory at the moment.
>
> On Wed, Dec 14, 2022 at 13:34, Eugen Block <eblock@xxxxxx> wrote:
>
>> With 12 OSDs and a default of 4 GB of RAM per OSD you would require at
>> least 48 GB, usually a little more. Even if you reduced the memory
>> target per OSD, it doesn't mean they can deal with the workload. There
>> was a thread explaining that a couple of weeks ago.
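>>
>> For reference, the memory target can be checked and adjusted like this (a
>> sketch; the 2 GiB value is only an illustration, not a recommendation):
>>
>> # ceph config get osd osd_memory_target
>> # ceph config set osd osd_memory_target 2147483648
>>
>> The default is 4294967296 (4 GiB) per OSD, which is where the roughly
>> 48 GB for 12 OSDs comes from.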
>>
>> Quoting Murilo Morais <murilo@xxxxxxxxxxxxxx>:
>>
>> > Good morning everyone.
>> >
>> > Guys, today my cluster had a "problem": it was showing SLOW_OPS. When I
>> > restarted the OSDs that were showing this problem everything was solved
>> > (there were VMs stuck because of this). What I'm breaking my head over
>> > is the reason for the SLOW_OPS.
>> >
>> > In the logs I saw that the problem started at 04:00 AM and continued
>> > until 07:50 AM (when I restarted the OSDs).
>> >
>> > I suspect some exaggerated settings that I applied during a test in the
>> > initial setup and forgot about, which may have caused the high RAM usage
>> > (leaving at most 400 MB of the 32 GB free): I had set 512 PGs on two
>> > pools, one of which was the affected one.
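>> >
>> > For what it's worth, the pool settings can be reviewed like this (a
>> > sketch; cephfs.ds_disk.data is the pool named in the log below):
>> >
>> > # ceph osd pool get cephfs.ds_disk.data pg_num
>> > # ceph osd pool autoscale-status
>> > # ceph osd pool set cephfs.ds_disk.data pg_autoscale_mode on
>> >
>> > The last command lets the autoscaler adjust pg_num instead of keeping
>> > the value left over from the test.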
>> >
>> > In the logs I saw that the problem started when some VMs began to
>> > perform backup actions, increasing writes a little (to a maximum of
>> > 300 MBps); after a few seconds one disk started to show this WARN and
>> > also this line:
>> > Dec 14 04:01:01 dcs1.evocorp ceph-mon[639148]: 69 slow requests (by type
>> > [ 'delayed' : 65 'waiting for sub ops' : 4 ] most affected pool [
>> > 'cephfs.ds_disk.data' : 69])
>> >
>> > Then it presented these:
>> > Dec 14 04:01:02 dcs1.evocorp ceph-mon[639148]: log_channel(cluster) log
>> > [WRN] : Health check update: 0 slow ops, oldest one blocked for 36 sec,
>> > daemons [osd.20,osd.5 ] have slow ops. (SLOW_OPS)
>> > [...]
>> > Dec 14 05:52:01 dcs1.evocorp ceph-mon[639148]: log_channel(cluster) log
>> > [WRN] : Health check update: 149 slow ops, oldest one blocked for 6696
>> > sec, daemons [osd.20,osd.5 ,osd.50] have slow ops. (SLOW_OPS)
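>> >
>> > For reference, the blocked ops could have been dumped from the OSDs
>> > named there before restarting (a sketch, run on the respective OSD's
>> > host):
>> >
>> > # ceph daemon osd.20 dump_blocked_ops
>> > # ceph daemon osd.20 dump_historic_slow_ops
>> >
>> > The per-op "flag_point" field in that output should show what each
>> > request was waiting on.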
>> >
>> > I've already checked SMART and all disks are OK, I've checked the graphs
>> > generated in Grafana and none of the disks saturate, and there haven't
>> > been any incidents related to the network. In other words, I haven't
>> > identified any other problem that could cause this.
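>> >
>> > The kind of checks I mean, for reference (a sketch; /dev/sdX is only a
>> > placeholder for a suspect disk):
>> >
>> > # ceph osd perf
>> > # iostat -x 5
>> > # smartctl -a /dev/sdX
>> >
>> > 'ceph osd perf' shows the per-OSD commit/apply latency as seen by Ceph
>> > itself, which complements the Grafana and SMART views.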
>> >
>> > What could have caused this event? What can I do to prevent it from
>> > happening again?
>> >
>> > Below is some information about the cluster:
>> > 5 machines with 32 GB of RAM, 2 processors and 12 3 TB SAS disks,
>> > connected through 40 Gb interfaces.
>> >
>> > # ceph osd tree
>> > ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
>> > -1 163.73932 root default
>> > -3 32.74786 host dcs1
>> > 0 hdd 2.72899 osd.0 up 1.00000 1.00000
>> > 1 hdd 2.72899 osd.1 up 1.00000 1.00000
>> > 2 hdd 2.72899 osd.2 up 1.00000 1.00000
>> > 3 hdd 2.72899 osd.3 up 1.00000 1.00000
>> > 4 hdd 2.72899 osd.4 up 1.00000 1.00000
>> > 5 hdd 2.72899 osd.5 up 1.00000 1.00000
>> > 6 hdd 2.72899 osd.6 up 1.00000 1.00000
>> > 7 hdd 2.72899 osd.7 up 1.00000 1.00000
>> > 8 hdd 2.72899 osd.8 up 1.00000 1.00000
>> > 9 hdd 2.72899 osd.9 up 1.00000 1.00000
>> > 10 hdd 2.72899 osd.10 up 1.00000 1.00000
>> > 11 hdd 2.72899 osd.11 up 1.00000 1.00000
>> > -5 32.74786 host dcs2
>> > 12 hdd 2.72899 osd.12 up 1.00000 1.00000
>> > 13 hdd 2.72899 osd.13 up 1.00000 1.00000
>> > 14 hdd 2.72899 osd.14 up 1.00000 1.00000
>> > 15 hdd 2.72899 osd.15 up 1.00000 1.00000
>> > 16 hdd 2.72899 osd.16 up 1.00000 1.00000
>> > 17 hdd 2.72899 osd.17 up 1.00000 1.00000
>> > 18 hdd 2.72899 osd.18 up 1.00000 1.00000
>> > 19 hdd 2.72899 osd.19 up 1.00000 1.00000
>> > 20 hdd 2.72899 osd.20 up 1.00000 1.00000
>> > 21 hdd 2.72899 osd.21 up 1.00000 1.00000
>> > 22 hdd 2.72899 osd.22 up 1.00000 1.00000
>> > 23 hdd 2.72899 osd.23 up 1.00000 1.00000
>> > -7 32.74786 host dcs3
>> > 24 hdd 2.72899 osd.24 up 1.00000 1.00000
>> > 25 hdd 2.72899 osd.25 up 1.00000 1.00000
>> > 26 hdd 2.72899 osd.26 up 1.00000 1.00000
>> > 27 hdd 2.72899 osd.27 up 1.00000 1.00000
>> > 28 hdd 2.72899 osd.28 up 1.00000 1.00000
>> > 29 hdd 2.72899 osd.29 up 1.00000 1.00000
>> > 30 hdd 2.72899 osd.30 up 1.00000 1.00000
>> > 31 hdd 2.72899 osd.31 up 1.00000 1.00000
>> > 32 hdd 2.72899 osd.32 up 1.00000 1.00000
>> > 33 hdd 2.72899 osd.33 up 1.00000 1.00000
>> > 34 hdd 2.72899 osd.34 up 1.00000 1.00000
>> > 35 hdd 2.72899 osd.35 up 1.00000 1.00000
>> > -9 32.74786 host dcs4
>> > 36 hdd 2.72899 osd.36 up 1.00000 1.00000
>> > 37 hdd 2.72899 osd.37 up 1.00000 1.00000
>> > 38 hdd 2.72899 osd.38 up 1.00000 1.00000
>> > 39 hdd 2.72899 osd.39 up 1.00000 1.00000
>> > 40 hdd 2.72899 osd.40 up 1.00000 1.00000
>> > 41 hdd 2.72899 osd.41 up 1.00000 1.00000
>> > 42 hdd 2.72899 osd.42 up 1.00000 1.00000
>> > 43 hdd 2.72899 osd.43 up 1.00000 1.00000
>> > 44 hdd 2.72899 osd.44 up 1.00000 1.00000
>> > 45 hdd 2.72899 osd.45 up 1.00000 1.00000
>> > 46 hdd 2.72899 osd.46 up 1.00000 1.00000
>> > 47 hdd 2.72899 osd.47 up 1.00000 1.00000
>> > -11 32.74786 host dcs5
>> > 48 hdd 2.72899 osd.48 up 1.00000 1.00000
>> > 49 hdd 2.72899 osd.49 up 1.00000 1.00000
>> > 50 hdd 2.72899 osd.50 up 1.00000 1.00000
>> > 51 hdd 2.72899 osd.51 up 1.00000 1.00000
>> > 52 hdd 2.72899 osd.52 up 1.00000 1.00000
>> > 53 hdd 2.72899 osd.53 up 1.00000 1.00000
>> > 54 hdd 2.72899 osd.54 up 1.00000 1.00000
>> > 55 hdd 2.72899 osd.55 up 1.00000 1.00000
>> > 56 hdd 2.72899 osd.56 up 1.00000 1.00000
>> > 57 hdd 2.72899 osd.57 up 1.00000 1.00000
>> > 58 hdd 2.72899 osd.58 up 1.00000 1.00000
>> > 59 hdd 2.72899 osd.59 up 1.00000 1.00000
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx