With 12 OSDs and the default of 4 GB RAM per OSD you would need at least
48 GB per node, usually a little more. And even if you reduced the memory
target per OSD, that doesn't mean the OSDs can cope with the workload.
There was a thread on this list explaining that a couple of weeks ago.
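You can verify and, if necessary, adjust the target via the config
database; the 2 GiB value below is only there to illustrate the syntax,
not a recommendation:

# ceph config get osd osd_memory_target
# ceph config set osd osd_memory_target 2147483648

Keep in mind that BlueStore treats this as a target, not a hard limit,
so the OSDs still need some headroom on top of it.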
Quoting Murilo Morais <murilo@xxxxxxxxxxxxxx>:
Good morning everyone.
Guys, today my cluster had a "problem": it was showing SLOW_OPS, and
restarting the OSDs that reported it resolved everything (some VMs were
stuck because of this). What I'm racking my brain over is the reason for
the SLOW_OPS.
In the logs I saw that the problem started at 04:00 AM and continued until
07:50 AM (when I restarted the OSDs).
I suspect some exaggerated settings that I applied during the initial
setup while running a test and then forgot about, which may have caused
the high RAM usage (at the peak only about 400 MB of the 32 GB were
free). Specifically, I set 512 PGs on two pools, one of which was the
affected pool.
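To confirm whether those settings are still in effect I can check the
pools directly; the autoscale-status command assumes the pg_autoscaler
module is enabled, and the pool name is the affected one from the
warning below:

# ceph osd pool ls detail
# ceph osd pool autoscale-status
# ceph osd pool get cephfs.ds_disk.data pg_num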
The logs show that the problem started when some VMs began their backup
jobs, which increased writes a little (to a maximum of 300 MB/s); after a
few seconds one OSD started to show this WARN, along with this line:
Dec 14 04:01:01 dcs1.evocorp ceph-mon[639148]: 69 slow requests (by type [
'delayed' : 65 'waiting for sub ops' : 4 ] most affected pool [
'cephfs.ds_disk.data' : 69])
Then it showed these:
Dec 14 04:01:02 dcs1.evocorp ceph-mon[639148]: log_channel(cluster) log
[WRN] : Health check update: 0 slow ops, oldest one blocked for 36 sec,
daemons [osd.20,osd.5 ] have slow ops. (SLOW_OPS)
[...]
Dec 14 05:52:01 dcs1.evocorp ceph-mon[639148]: log_channel(cluster) log
[WRN] : Health check update: 149 slow ops, oldest one blocked for 6696 sec,
daemons [osd.20,osd.5 ,osd.50] have slow ops. (SLOW_OPS)
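If it happens again I can dump the slow requests from the affected OSDs
directly on the hosts carrying them (osd.20 below is just one of the
daemons named in the warning) to see where the time was spent:

# ceph daemon osd.20 dump_ops_in_flight
# ceph daemon osd.20 dump_historic_slow_ops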
I've already checked SMART and all disks are OK, I've checked the graphs
in Grafana and none of the disks saturate, and there weren't any
network-related incidents. In other words, I haven't identified any other
problem that could cause this.
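For reference, Ceph itself also exposes per-OSD latencies and memory
usage, which I can pull like this (dump_mempools has to run on the host
carrying the OSD in question):

# ceph osd perf
# ceph daemon osd.20 dump_mempools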
What could have caused this event? What can I do to prevent it from
happening again?
Below is some information about the cluster:
5 machines, each with 32 GB RAM, 2 processors and 12 x 3 TB SAS disks,
connected through 40 Gb interfaces.
# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 163.73932 root default
-3 32.74786 host dcs1
0 hdd 2.72899 osd.0 up 1.00000 1.00000
1 hdd 2.72899 osd.1 up 1.00000 1.00000
2 hdd 2.72899 osd.2 up 1.00000 1.00000
3 hdd 2.72899 osd.3 up 1.00000 1.00000
4 hdd 2.72899 osd.4 up 1.00000 1.00000
5 hdd 2.72899 osd.5 up 1.00000 1.00000
6 hdd 2.72899 osd.6 up 1.00000 1.00000
7 hdd 2.72899 osd.7 up 1.00000 1.00000
8 hdd 2.72899 osd.8 up 1.00000 1.00000
9 hdd 2.72899 osd.9 up 1.00000 1.00000
10 hdd 2.72899 osd.10 up 1.00000 1.00000
11 hdd 2.72899 osd.11 up 1.00000 1.00000
-5 32.74786 host dcs2
12 hdd 2.72899 osd.12 up 1.00000 1.00000
13 hdd 2.72899 osd.13 up 1.00000 1.00000
14 hdd 2.72899 osd.14 up 1.00000 1.00000
15 hdd 2.72899 osd.15 up 1.00000 1.00000
16 hdd 2.72899 osd.16 up 1.00000 1.00000
17 hdd 2.72899 osd.17 up 1.00000 1.00000
18 hdd 2.72899 osd.18 up 1.00000 1.00000
19 hdd 2.72899 osd.19 up 1.00000 1.00000
20 hdd 2.72899 osd.20 up 1.00000 1.00000
21 hdd 2.72899 osd.21 up 1.00000 1.00000
22 hdd 2.72899 osd.22 up 1.00000 1.00000
23 hdd 2.72899 osd.23 up 1.00000 1.00000
-7 32.74786 host dcs3
24 hdd 2.72899 osd.24 up 1.00000 1.00000
25 hdd 2.72899 osd.25 up 1.00000 1.00000
26 hdd 2.72899 osd.26 up 1.00000 1.00000
27 hdd 2.72899 osd.27 up 1.00000 1.00000
28 hdd 2.72899 osd.28 up 1.00000 1.00000
29 hdd 2.72899 osd.29 up 1.00000 1.00000
30 hdd 2.72899 osd.30 up 1.00000 1.00000
31 hdd 2.72899 osd.31 up 1.00000 1.00000
32 hdd 2.72899 osd.32 up 1.00000 1.00000
33 hdd 2.72899 osd.33 up 1.00000 1.00000
34 hdd 2.72899 osd.34 up 1.00000 1.00000
35 hdd 2.72899 osd.35 up 1.00000 1.00000
-9 32.74786 host dcs4
36 hdd 2.72899 osd.36 up 1.00000 1.00000
37 hdd 2.72899 osd.37 up 1.00000 1.00000
38 hdd 2.72899 osd.38 up 1.00000 1.00000
39 hdd 2.72899 osd.39 up 1.00000 1.00000
40 hdd 2.72899 osd.40 up 1.00000 1.00000
41 hdd 2.72899 osd.41 up 1.00000 1.00000
42 hdd 2.72899 osd.42 up 1.00000 1.00000
43 hdd 2.72899 osd.43 up 1.00000 1.00000
44 hdd 2.72899 osd.44 up 1.00000 1.00000
45 hdd 2.72899 osd.45 up 1.00000 1.00000
46 hdd 2.72899 osd.46 up 1.00000 1.00000
47 hdd 2.72899 osd.47 up 1.00000 1.00000
-11 32.74786 host dcs5
48 hdd 2.72899 osd.48 up 1.00000 1.00000
49 hdd 2.72899 osd.49 up 1.00000 1.00000
50 hdd 2.72899 osd.50 up 1.00000 1.00000
51 hdd 2.72899 osd.51 up 1.00000 1.00000
52 hdd 2.72899 osd.52 up 1.00000 1.00000
53 hdd 2.72899 osd.53 up 1.00000 1.00000
54 hdd 2.72899 osd.54 up 1.00000 1.00000
55 hdd 2.72899 osd.55 up 1.00000 1.00000
56 hdd 2.72899 osd.56 up 1.00000 1.00000
57 hdd 2.72899 osd.57 up 1.00000 1.00000
58 hdd 2.72899 osd.58 up 1.00000 1.00000
59 hdd 2.72899 osd.59 up 1.00000 1.00000
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx