With 12 OSDs and the default of 4 GB RAM per OSD you would need at least
48 GB per node, usually a little more. And even if you reduced the memory
target per OSD, that doesn't mean the OSDs can cope with the workload.
There was a thread on this list explaining that a couple of weeks ago.
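You can verify and, if necessary, adjust the target via the config
database; the 2 GiB value below is only there to illustrate the syntax,
not a recommendation:

# ceph config get osd osd_memory_target
# ceph config set osd osd_memory_target 2147483648

Keep in mind that BlueStore treats this as a target, not a hard limit,
so the OSDs still need some headroom on top of it.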
Quoting Murilo Morais <murilo@xxxxxxxxxxxxxx>:
Good morning everyone.
Guys, today my cluster had a "problem": it was showing SLOW_OPS, and
restarting the OSDs that reported it resolved everything (some VMs were
stuck because of this). What I'm racking my brain over is the reason for
the SLOW_OPS.
In the logs I saw that the problem started at 04:00 AM and continued until
07:50 AM (when I restarted the OSDs).
I suspect some exaggerated settings that I applied during the initial
setup while running a test and then forgot about, which may have caused
the high RAM usage (at the peak only about 400 MB of the 32 GB were
free). Specifically, I set 512 PGs on two pools, one of which was the
affected pool.
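To confirm whether those settings are still in effect I can check the
pools directly; the autoscale-status command assumes the pg_autoscaler
module is enabled, and the pool name is the affected one from the
warning below:

# ceph osd pool ls detail
# ceph osd pool autoscale-status
# ceph osd pool get cephfs.ds_disk.data pg_num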
The logs show that the problem started when some VMs began their backup
jobs, which increased writes a little (to a maximum of 300 MB/s); after a
few seconds one OSD started to show this WARN, along with this line:
Dec 14 04:01:01 dcs1.evocorp ceph-mon[639148]: 69 slow requests (by type [
'delayed' : 65 'waiting for sub ops' : 4 ] most affected pool [
'cephfs.ds_disk.data' : 69])
Then it showed these:
Dec 14 04:01:02 dcs1.evocorp ceph-mon[639148]: log_channel(cluster) log
[WRN] : Health check update: 0 slow ops, oldest one blocked for 36 sec,
daemons [osd.20,osd.5 ] have slow ops. (SLOW_OPS)
[...]
Dec 14 05:52:01 dcs1.evocorp ceph-mon[639148]: log_channel(cluster) log
[WRN] : Health check update: 149 slow ops, oldest one blocked for 6696 sec,
daemons [osd.20,osd.5 ,osd.50] have slow ops. (SLOW_OPS)
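If it happens again I can dump the slow requests from the affected OSDs
directly on the hosts carrying them (osd.20 below is just one of the
daemons named in the warning) to see where the time was spent:

# ceph daemon osd.20 dump_ops_in_flight
# ceph daemon osd.20 dump_historic_slow_ops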
I've already checked SMART and all disks are OK, I've checked the graphs
in Grafana and none of the disks saturate, and there weren't any
network-related incidents. In other words, I haven't identified any other
problem that could cause this.
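For reference, Ceph itself also exposes per-OSD latencies and memory
usage, which I can pull like this (dump_mempools has to run on the host
carrying the OSD in question):

# ceph osd perf
# ceph daemon osd.20 dump_mempools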
What could have caused this event? What can I do to prevent it from
happening again?
Below is some information about the cluster:
5 machines, each with 32 GB RAM, 2 processors and 12 x 3 TB SAS disks,
connected through 40 Gb interfaces.
# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 163.73932 root default
-3 32.74786 host dcs1
0 hdd 2.72899 osd.0 up 1.00000 1.00000
1 hdd 2.72899 osd.1 up 1.00000 1.00000
2 hdd 2.72899 osd.2 up 1.00000 1.00000
3 hdd 2.72899 osd.3 up 1.00000 1.00000
4 hdd 2.72899 osd.4 up 1.00000 1.00000
5 hdd 2.72899 osd.5 up 1.00000 1.00000
6 hdd 2.72899 osd.6 up 1.00000 1.00000
7 hdd 2.72899 osd.7 up 1.00000 1.00000
8 hdd 2.72899 osd.8 up 1.00000 1.00000
9 hdd 2.72899 osd.9 up 1.00000 1.00000
10 hdd 2.72899 osd.10 up 1.00000 1.00000
11 hdd 2.72899 osd.11 up 1.00000 1.00000
-5 32.74786 host dcs2
12 hdd 2.72899 osd.12 up 1.00000 1.00000
13 hdd 2.72899 osd.13 up 1.00000 1.00000
14 hdd 2.72899 osd.14 up 1.00000 1.00000
15 hdd 2.72899 osd.15 up 1.00000 1.00000
16 hdd 2.72899 osd.16 up 1.00000 1.00000
17 hdd 2.72899 osd.17 up 1.00000 1.00000
18 hdd 2.72899 osd.18 up 1.00000 1.00000
19 hdd 2.72899 osd.19 up 1.00000 1.00000
20 hdd 2.72899 osd.20 up 1.00000 1.00000
21 hdd 2.72899 osd.21 up 1.00000 1.00000
22 hdd 2.72899 osd.22 up 1.00000 1.00000
23 hdd 2.72899 osd.23 up 1.00000 1.00000
-7 32.74786 host dcs3
24 hdd 2.72899 osd.24 up 1.00000 1.00000
25 hdd 2.72899 osd.25 up 1.00000 1.00000
26 hdd 2.72899 osd.26 up 1.00000 1.00000
27 hdd 2.72899 osd.27 up 1.00000 1.00000
28 hdd 2.72899 osd.28 up 1.00000 1.00000
29 hdd 2.72899 osd.29 up 1.00000 1.00000
30 hdd 2.72899 osd.30 up 1.00000 1.00000
31 hdd 2.72899 osd.31 up 1.00000 1.00000
32 hdd 2.72899 osd.32 up 1.00000 1.00000
33 hdd 2.72899 osd.33 up 1.00000 1.00000
34 hdd 2.72899 osd.34 up 1.00000 1.00000
35 hdd 2.72899 osd.35 up 1.00000 1.00000
-9 32.74786 host dcs4
36 hdd 2.72899 osd.36 up 1.00000 1.00000
37 hdd 2.72899 osd.37 up 1.00000 1.00000
38 hdd 2.72899 osd.38 up 1.00000 1.00000
39 hdd 2.72899 osd.39 up 1.00000 1.00000
40 hdd 2.72899 osd.40 up 1.00000 1.00000
41 hdd 2.72899 osd.41 up 1.00000 1.00000
42 hdd 2.72899 osd.42 up 1.00000 1.00000
43 hdd 2.72899 osd.43 up 1.00000 1.00000
44 hdd 2.72899 osd.44 up 1.00000 1.00000
45 hdd 2.72899 osd.45 up 1.00000 1.00000
46 hdd 2.72899 osd.46 up 1.00000 1.00000
47 hdd 2.72899 osd.47 up 1.00000 1.00000
-11 32.74786 host dcs5
48 hdd 2.72899 osd.48 up 1.00000 1.00000
49 hdd 2.72899 osd.49 up 1.00000 1.00000
50 hdd 2.72899 osd.50 up 1.00000 1.00000
51 hdd 2.72899 osd.51 up 1.00000 1.00000
52 hdd 2.72899 osd.52 up 1.00000 1.00000
53 hdd 2.72899 osd.53 up 1.00000 1.00000
54 hdd 2.72899 osd.54 up 1.00000 1.00000
55 hdd 2.72899 osd.55 up 1.00000 1.00000
56 hdd 2.72899 osd.56 up 1.00000 1.00000
57 hdd 2.72899 osd.57 up 1.00000 1.00000
58 hdd 2.72899 osd.58 up 1.00000 1.00000
59 hdd 2.72899 osd.59 up 1.00000 1.00000
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx