Eugen, I haven't tried. The first thing I did was restart the OSDs to
restore the services that were stuck. This morning it happened again, on a
totally random disk. If it happens again (I hope not), I will run that
command.

I don't understand why SLOW_OPS appears without any evidence of a failure.
Is there no mechanism to recover automatically from this event?
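For reference, a minimal sketch of what catching those blocked ops can look
like on a cephadm-managed cluster; osd.20 is only an example id taken from
the warnings quoted below, and the exact invocation may differ between
releases:

  # show which daemons are currently flagged with SLOW_OPS
  ceph health detail

  # on the host that runs the affected OSD, enter its container and query
  # the admin socket for the ops that are currently blocked
  cephadm enter --name osd.20
  ceph daemon osd.20 dump_blocked_ops

  # recently completed ops are kept for a while as well
  ceph daemon osd.20 dump_historic_ops

Each dumped op carries the list of events it has passed through together
with timestamps, which usually narrows the stall down to the local disk, a
peer OSD or the network.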
On Fri, Dec 16, 2022 at 11:20, Eugen Block <eblock@xxxxxx> wrote:

> Have you tried catching an OSD's dump_blocked_ops with cephadm?
>
> Quoting Murilo Morais <murilo@xxxxxxxxxxxxxx>:
>
>> Eugen, thanks for answering.
>>
>> I understand that there is not enough memory, but after yesterday's
>> episode I managed to recover a lot of the memory that was in use.
>> Right now I can't upgrade to 48 GB, but it's already planned.
>>
>> Until then everything was normal, but at this very moment it has
>> happened again, without high disk traffic and without much RAM in use.
>> This leads me to believe that the problem is not due to lack of memory,
>> as there is a lot of free memory at the moment.
>>
>> On Wed, Dec 14, 2022 at 13:34, Eugen Block <eblock@xxxxxx> wrote:
>>
>>> With 12 OSDs and a default of 4 GB RAM per OSD you would at least
>>> require 48 GB, usually a little more. Even if you reduced the memory
>>> target per OSD it doesn't mean they can deal with the workload. There
>>> was a thread explaining that a couple of weeks ago.
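To put those numbers into commands (only the stock option names are assumed
here, nothing specific to this cluster): the per-OSD memory target can be
inspected and, if really needed, lowered, but that only shrinks the
BlueStore caches, it does not reduce the work the OSDs have to absorb:

  # effective memory target per OSD (the default is 4 GiB)
  ceph config get osd osd_memory_target

  # 12 OSDs per host x 4 GiB is ~48 GiB, before the OS, MON/MGR daemons
  # and page cache; lowering the target trades cache hit rate for
  # headroom, e.g. roughly 3 GiB per OSD (example value only):
  ceph config set osd osd_memory_target 3221225472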
>>>
>>> Quoting Murilo Morais <murilo@xxxxxxxxxxxxxx>:
>>>
>>>> Good morning everyone.
>>>>
>>>> Guys, today my cluster had a "problem": it was showing SLOW_OPS.
>>>> Restarting the OSDs that were reporting it solved everything (there
>>>> were VMs stuck because of this), but what I'm racking my brain over
>>>> is the reason for the SLOW_OPS in the first place.
>>>>
>>>> In the logs I saw that the problem started at 04:00 AM and continued
>>>> until 07:50 AM (when I restarted the OSDs).
>>>>
>>>> I suspect some exaggerated settings that I applied during the initial
>>>> setup for a test and forgot about, which may have caused the high RAM
>>>> usage (leaving at most 400 MB free out of 32 GB): I set 512 PGs on
>>>> two pools, one of which is the affected one.
>>>>
>>>> In the logs I saw that the problem started when some VMs began their
>>>> backup jobs, increasing writes a little (to a maximum of 300 MBps).
>>>> After a few seconds a disk started to show this WARN and also this
>>>> line:
>>>>
>>>> Dec 14 04:01:01 dcs1.evocorp ceph-mon[639148]: 69 slow requests (by
>>>> type [ 'delayed' : 65 'waiting for sub ops' : 4 ] most affected pool
>>>> [ 'cephfs.ds_disk.data' : 69])
>>>>
>>>> Then it presented these:
>>>>
>>>> Dec 14 04:01:02 dcs1.evocorp ceph-mon[639148]: log_channel(cluster)
>>>> log [WRN] : Health check update: 0 slow ops, oldest one blocked for
>>>> 36 sec, daemons [osd.20,osd.5] have slow ops. (SLOW_OPS)
>>>> [...]
>>>> Dec 14 05:52:01 dcs1.evocorp ceph-mon[639148]: log_channel(cluster)
>>>> log [WRN] : Health check update: 149 slow ops, oldest one blocked for
>>>> 6696 sec, daemons [osd.20,osd.5,osd.50] have slow ops. (SLOW_OPS)
>>>>
>>>> I've already checked SMART and all disks are OK, I've checked the
>>>> graphs generated in Grafana and none of the disks saturate, and there
>>>> haven't been any network-related incidents; in other words, I haven't
>>>> identified any other problem that could cause this.
>>>>
>>>> What could have caused this event? What can I do to prevent it from
>>>> happening again?
>>>>
>>>> Below is some information about the cluster:
>>>> 5 machines, each with 32 GB RAM, 2 processors and 12 x 3 TB SAS
>>>> disks, connected through 40 Gb interfaces.
>>>>
>>>> # ceph osd tree
>>>> ID   CLASS  WEIGHT     TYPE NAME      STATUS  REWEIGHT  PRI-AFF
>>>>  -1         163.73932  root default
>>>>  -3          32.74786      host dcs1
>>>>   0    hdd    2.72899          osd.0      up   1.00000  1.00000
>>>>   1    hdd    2.72899          osd.1      up   1.00000  1.00000
>>>>   2    hdd    2.72899          osd.2      up   1.00000  1.00000
>>>>   3    hdd    2.72899          osd.3      up   1.00000  1.00000
>>>>   4    hdd    2.72899          osd.4      up   1.00000  1.00000
>>>>   5    hdd    2.72899          osd.5      up   1.00000  1.00000
>>>>   6    hdd    2.72899          osd.6      up   1.00000  1.00000
>>>>   7    hdd    2.72899          osd.7      up   1.00000  1.00000
>>>>   8    hdd    2.72899          osd.8      up   1.00000  1.00000
>>>>   9    hdd    2.72899          osd.9      up   1.00000  1.00000
>>>>  10    hdd    2.72899          osd.10     up   1.00000  1.00000
>>>>  11    hdd    2.72899          osd.11     up   1.00000  1.00000
>>>>  -5          32.74786      host dcs2
>>>>  12    hdd    2.72899          osd.12     up   1.00000  1.00000
>>>>  13    hdd    2.72899          osd.13     up   1.00000  1.00000
>>>>  14    hdd    2.72899          osd.14     up   1.00000  1.00000
>>>>  15    hdd    2.72899          osd.15     up   1.00000  1.00000
>>>>  16    hdd    2.72899          osd.16     up   1.00000  1.00000
>>>>  17    hdd    2.72899          osd.17     up   1.00000  1.00000
>>>>  18    hdd    2.72899          osd.18     up   1.00000  1.00000
>>>>  19    hdd    2.72899          osd.19     up   1.00000  1.00000
>>>>  20    hdd    2.72899          osd.20     up   1.00000  1.00000
>>>>  21    hdd    2.72899          osd.21     up   1.00000  1.00000
>>>>  22    hdd    2.72899          osd.22     up   1.00000  1.00000
>>>>  23    hdd    2.72899          osd.23     up   1.00000  1.00000
>>>>  -7          32.74786      host dcs3
>>>>  24    hdd    2.72899          osd.24     up   1.00000  1.00000
>>>>  25    hdd    2.72899          osd.25     up   1.00000  1.00000
>>>>  26    hdd    2.72899          osd.26     up   1.00000  1.00000
>>>>  27    hdd    2.72899          osd.27     up   1.00000  1.00000
>>>>  28    hdd    2.72899          osd.28     up   1.00000  1.00000
>>>>  29    hdd    2.72899          osd.29     up   1.00000  1.00000
>>>>  30    hdd    2.72899          osd.30     up   1.00000  1.00000
>>>>  31    hdd    2.72899          osd.31     up   1.00000  1.00000
>>>>  32    hdd    2.72899          osd.32     up   1.00000  1.00000
>>>>  33    hdd    2.72899          osd.33     up   1.00000  1.00000
>>>>  34    hdd    2.72899          osd.34     up   1.00000  1.00000
>>>>  35    hdd    2.72899          osd.35     up   1.00000  1.00000
>>>>  -9          32.74786      host dcs4
>>>>  36    hdd    2.72899          osd.36     up   1.00000  1.00000
>>>>  37    hdd    2.72899          osd.37     up   1.00000  1.00000
>>>>  38    hdd    2.72899          osd.38     up   1.00000  1.00000
>>>>  39    hdd    2.72899          osd.39     up   1.00000  1.00000
>>>>  40    hdd    2.72899          osd.40     up   1.00000  1.00000
>>>>  41    hdd    2.72899          osd.41     up   1.00000  1.00000
>>>>  42    hdd    2.72899          osd.42     up   1.00000  1.00000
>>>>  43    hdd    2.72899          osd.43     up   1.00000  1.00000
>>>>  44    hdd    2.72899          osd.44     up   1.00000  1.00000
>>>>  45    hdd    2.72899          osd.45     up   1.00000  1.00000
>>>>  46    hdd    2.72899          osd.46     up   1.00000  1.00000
>>>>  47    hdd    2.72899          osd.47     up   1.00000  1.00000
>>>> -11          32.74786      host dcs5
>>>>  48    hdd    2.72899          osd.48     up   1.00000  1.00000
>>>>  49    hdd    2.72899          osd.49     up   1.00000  1.00000
>>>>  50    hdd    2.72899          osd.50     up   1.00000  1.00000
>>>>  51    hdd    2.72899          osd.51     up   1.00000  1.00000
>>>>  52    hdd    2.72899          osd.52     up   1.00000  1.00000
>>>>  53    hdd    2.72899          osd.53     up   1.00000  1.00000
>>>>  54    hdd    2.72899          osd.54     up   1.00000  1.00000
>>>>  55    hdd    2.72899          osd.55     up   1.00000  1.00000
>>>>  56    hdd    2.72899          osd.56     up   1.00000  1.00000
>>>>  57    hdd    2.72899          osd.57     up   1.00000  1.00000
>>>>  58    hdd    2.72899          osd.58     up   1.00000  1.00000
>>>>  59    hdd    2.72899          osd.59     up   1.00000  1.00000

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx