Have you tried catching the dump_blocked_ops output of one of the affected OSDs via cephadm?
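
A rough sketch of how that could look on a cephadm deployment, assuming
osd.20 is one of the affected daemons (adjust the name, and run it on the
host that carries that OSD):

# cephadm enter --name osd.20
# ceph daemon osd.20 dump_ops_in_flight
# ceph daemon osd.20 dump_blocked_ops
# ceph daemon osd.20 dump_historic_ops

The per-op output usually contains a flag_point and an events list, which
should tell whether an op was stuck waiting on the disk, on sub ops from
other OSDs, or on something else.
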
Quoting Murilo Morais <murilo@xxxxxxxxxxxxxx>:
Eugen, thanks for answering.

I understand that there may not be enough memory, but after yesterday's
episode I managed to free up a lot of the memory that was in use. Right
now I can't upgrade to 48 GB, but it's already planned.

Until now everything was normal, but at this very moment it happened
again, this time without high disk traffic and without much RAM in use.
This leads me to believe that the problem is not a lack of memory, since
there is plenty of free memory at the moment.
On Wed, Dec 14, 2022 at 13:34, Eugen Block <eblock@xxxxxx> wrote:
With 12 OSDs and a default of 4 GB RAM per OSD you would require at
least 48 GB, usually a little more. Even if you reduce the memory target
per OSD, that doesn't mean the OSDs can deal with the workload. There
was a thread explaining that a couple of weeks ago.
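
Just as a sketch of how to check and (temporarily) lower it; the value
2147483648 (2 GiB) below is only an example and no substitute for adding
RAM:

# ceph config get osd osd_memory_target
# ceph config set osd osd_memory_target 2147483648

In recent releases 'ceph orch ps' also lists the memory use and limit per
daemon, so you can verify whether the OSDs actually stay near the target.
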
Quoting Murilo Morais <murilo@xxxxxxxxxxxxxx>:
> Good morning everyone.
>
> Guys, today my cluster had a "problem": it was showing SLOW_OPS.
> Restarting the OSDs that were showing this problem solved everything
> (there were VMs stuck because of it), but what I'm racking my brain
> over is the reason for the SLOW_OPS in the first place.
>
> In the logs I saw that the problem started at 04:00 AM and continued
> until 07:50 AM (when I restarted the OSDs).
>
> I suspect some exaggerated settings that I applied during the initial
> setup for a test and then forgot about: I set 512 PGs on two pools, one
> of which was the affected one. That may have caused the high RAM usage,
> leaving at most 400 MB free out of 32 GB.
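
As a side note, the PG counts are easy to sanity-check; the pool name
below is simply the one from your log:

# ceph osd pool get cephfs.ds_disk.data pg_num
# ceph osd pool autoscale-status

autoscale-status shows the current pg_num of each pool next to the value
the autoscaler would pick if it differs, which makes it easier to judge
whether 512 is really oversized for this cluster.
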
>
> In the logs I saw that the problem started when some VMs began running
> backup jobs, which increased writes a little (to a maximum of 300 MBps).
> After a few seconds one disk started to show this WARN and also this
> line:
>
> Dec 14 04:01:01 dcs1.evocorp ceph-mon[639148]: 69 slow requests (by type
> [ 'delayed' : 65 'waiting for sub ops' : 4 ] most affected pool [
> 'cephfs.ds_disk.data' : 69])
>
> Then it logged these:
> Dec 14 04:01:02 dcs1.evocorp ceph-mon[639148]: log_channel(cluster) log
> [WRN] : Health check update: 0 slow ops, oldest one blocked for 36 sec,
> daemons [osd.20,osd.5 ] have slow ops. (SLOW_OPS)
> [...]
> Dec 14 05:52:01 dcs1.evocorp ceph-mon[639148]: log_channel(cluster) log
> [WRN] : Health check update: 149 slow ops, oldest one blocked for 6696
> sec, daemons [osd.20,osd.5 ,osd.50] have slow ops. (SLOW_OPS)
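
While such a warning is active, the quickest way to see which OSDs are
involved and for how long the oldest op has been blocked is:

# ceph health detail

That is also the moment to grab the dump_blocked_ops output from the
listed OSDs, before restarting them.
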
>
> I've already checked SMART and all disks are OK, I've checked the
> graphs generated in Grafana and none of the disks saturate, and there
> haven't been any incidents related to the network. In other words, I
> haven't identified any other problem that could cause this.
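
One more thing that might be worth checking: 'waiting for sub ops' often
points at the replica side or the network in between. Recent releases
track the heartbeat ping times between OSDs, e.g. (again with osd.20 just
as an example, from inside 'cephadm enter --name osd.20'):

# ceph daemon osd.20 dump_osd_network

By default it only reports entries above a 1000 ms threshold (a lower
threshold in ms can be passed as an argument), and persistent outliers
also show up as OSD_SLOW_PING_TIME_BACK/FRONT health warnings.
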
>
> What could have caused this event? What can I do to prevent it from
> happening again?
>
> Below is some information about the cluster:
> 5 machines, each with 32 GB RAM, 2 processors and 12 x 3 TB SAS disks,
> connected through 40 Gb interfaces.
>
> # ceph osd tree
> ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
> -1 163.73932 root default
> -3 32.74786 host dcs1
> 0 hdd 2.72899 osd.0 up 1.00000 1.00000
> 1 hdd 2.72899 osd.1 up 1.00000 1.00000
> 2 hdd 2.72899 osd.2 up 1.00000 1.00000
> 3 hdd 2.72899 osd.3 up 1.00000 1.00000
> 4 hdd 2.72899 osd.4 up 1.00000 1.00000
> 5 hdd 2.72899 osd.5 up 1.00000 1.00000
> 6 hdd 2.72899 osd.6 up 1.00000 1.00000
> 7 hdd 2.72899 osd.7 up 1.00000 1.00000
> 8 hdd 2.72899 osd.8 up 1.00000 1.00000
> 9 hdd 2.72899 osd.9 up 1.00000 1.00000
> 10 hdd 2.72899 osd.10 up 1.00000 1.00000
> 11 hdd 2.72899 osd.11 up 1.00000 1.00000
> -5 32.74786 host dcs2
> 12 hdd 2.72899 osd.12 up 1.00000 1.00000
> 13 hdd 2.72899 osd.13 up 1.00000 1.00000
> 14 hdd 2.72899 osd.14 up 1.00000 1.00000
> 15 hdd 2.72899 osd.15 up 1.00000 1.00000
> 16 hdd 2.72899 osd.16 up 1.00000 1.00000
> 17 hdd 2.72899 osd.17 up 1.00000 1.00000
> 18 hdd 2.72899 osd.18 up 1.00000 1.00000
> 19 hdd 2.72899 osd.19 up 1.00000 1.00000
> 20 hdd 2.72899 osd.20 up 1.00000 1.00000
> 21 hdd 2.72899 osd.21 up 1.00000 1.00000
> 22 hdd 2.72899 osd.22 up 1.00000 1.00000
> 23 hdd 2.72899 osd.23 up 1.00000 1.00000
> -7 32.74786 host dcs3
> 24 hdd 2.72899 osd.24 up 1.00000 1.00000
> 25 hdd 2.72899 osd.25 up 1.00000 1.00000
> 26 hdd 2.72899 osd.26 up 1.00000 1.00000
> 27 hdd 2.72899 osd.27 up 1.00000 1.00000
> 28 hdd 2.72899 osd.28 up 1.00000 1.00000
> 29 hdd 2.72899 osd.29 up 1.00000 1.00000
> 30 hdd 2.72899 osd.30 up 1.00000 1.00000
> 31 hdd 2.72899 osd.31 up 1.00000 1.00000
> 32 hdd 2.72899 osd.32 up 1.00000 1.00000
> 33 hdd 2.72899 osd.33 up 1.00000 1.00000
> 34 hdd 2.72899 osd.34 up 1.00000 1.00000
> 35 hdd 2.72899 osd.35 up 1.00000 1.00000
> -9 32.74786 host dcs4
> 36 hdd 2.72899 osd.36 up 1.00000 1.00000
> 37 hdd 2.72899 osd.37 up 1.00000 1.00000
> 38 hdd 2.72899 osd.38 up 1.00000 1.00000
> 39 hdd 2.72899 osd.39 up 1.00000 1.00000
> 40 hdd 2.72899 osd.40 up 1.00000 1.00000
> 41 hdd 2.72899 osd.41 up 1.00000 1.00000
> 42 hdd 2.72899 osd.42 up 1.00000 1.00000
> 43 hdd 2.72899 osd.43 up 1.00000 1.00000
> 44 hdd 2.72899 osd.44 up 1.00000 1.00000
> 45 hdd 2.72899 osd.45 up 1.00000 1.00000
> 46 hdd 2.72899 osd.46 up 1.00000 1.00000
> 47 hdd 2.72899 osd.47 up 1.00000 1.00000
> -11 32.74786 host dcs5
> 48 hdd 2.72899 osd.48 up 1.00000 1.00000
> 49 hdd 2.72899 osd.49 up 1.00000 1.00000
> 50 hdd 2.72899 osd.50 up 1.00000 1.00000
> 51 hdd 2.72899 osd.51 up 1.00000 1.00000
> 52 hdd 2.72899 osd.52 up 1.00000 1.00000
> 53 hdd 2.72899 osd.53 up 1.00000 1.00000
> 54 hdd 2.72899 osd.54 up 1.00000 1.00000
> 55 hdd 2.72899 osd.55 up 1.00000 1.00000
> 56 hdd 2.72899 osd.56 up 1.00000 1.00000
> 57 hdd 2.72899 osd.57 up 1.00000 1.00000
> 58 hdd 2.72899 osd.58 up 1.00000 1.00000
> 59 hdd 2.72899 osd.59 up 1.00000 1.00000
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx