Hi Joachim
> I have the following starting points:
> Has the read and write latency also changed?
I don't really understand how I'd measure the latency. What we measure
regularly is the following run (a latency-only sketch follows its output):
fio --filename=/dev/vdb --direct=1 --rw=randrw \
    --refill_buffers --norandommap --randrepeat=0 \
    --ioengine=libaio --bs=4k --rwmixread=40 \
    --iodepth=16 --numjobs=16 --runtime=60 --group_reporting \
    --name=4k-mixed-read-write
4k-mixed-read-write: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=16
...
fio-3.7
Starting 16 processes
Jobs: 7 (f=7): [_(3),m(1),_(1),m(1),_(1),m(2),_(2),m(3),_(2)][10.2%][r=92KiB/s,w=592KiB/s][r=23,w=148 IOPS][eta 08m:59s]
4k-mixed-read-write: (groupid=0, jobs=16): err= 0: pid=8208: Sat Jul 9 17:47:59 2022
read: IOPS=4528, BW=17.7MiB/s (18.5MB/s)(1085MiB/61366msec)
slat (usec): min=4, max=33583, avg=13.77, stdev=70.01
clat (usec): min=169, max=5236.2k, avg=6859.71, stdev=49067.43
lat (usec): min=412, max=5236.2k, avg=6876.42, stdev=49067.49
clat percentiles (usec):
| 1.00th=[ 955], 5.00th=[ 1270], 10.00th=[ 1549],
| 20.00th=[ 1958], 30.00th=[ 2343], 40.00th=[ 2704],
| 50.00th=[ 3064], 60.00th=[ 3523], 70.00th=[ 4015],
| 80.00th=[ 4686], 90.00th=[ 5735], 95.00th=[ 7504],
| 99.00th=[ 105382], 99.50th=[ 185598], 99.90th=[ 583009],
| 99.95th=[ 809501], 99.99th=[1988101]
bw ( KiB/s): min= 7, max= 2192, per=6.46%, avg=1169.56, stdev=439.96, samples=1900
iops : min= 1, max= 548, avg=292.36, stdev=110.02, samples=1900
write: IOPS=6781, BW=26.5MiB/s (27.8MB/s)(1626MiB/61366msec)
slat (usec): min=5, max=9466, avg=14.18, stdev=39.69
clat (usec): min=1273, max=5206.5k, avg=32766.99, stdev=108764.53
lat (usec): min=1281, max=5206.6k, avg=32784.08, stdev=108764.60
clat percentiles (msec):
| 1.00th=[ 3], 5.00th=[ 4], 10.00th=[ 5], 20.00th=[ 7],
| 30.00th=[ 9], 40.00th=[ 12], 50.00th=[ 16], 60.00th=[ 21],
| 70.00th=[ 27], 80.00th=[ 33], 90.00th=[ 53], 95.00th=[ 106],
| 99.00th=[ 259], 99.50th=[ 426], 99.90th=[ 1502], 99.95th=[ 2333],
| 99.99th=[ 4245]
bw ( KiB/s): min= 6, max= 3064, per=6.40%, avg=1735.15, stdev=653.97, samples=1917
iops : min= 1, max= 766, avg=433.75, stdev=163.51, samples=1917
lat (usec) : 250=0.01%, 500=0.01%, 750=0.04%, 1000=0.53%
lat (msec) : 2=8.10%, 4=23.79%, 10=26.66%, 20=15.32%, 50=18.67%
lat (msec) : 100=3.33%, 250=2.81%, 500=0.44%, 750=0.14%, 1000=0.04%
cpu : usr=0.84%, sys=1.67%, ctx=777034, majf=0, minf=580
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=277884,416138,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=16
Run status group 0 (all jobs):
READ: bw=17.7MiB/s (18.5MB/s), 17.7MiB/s-17.7MiB/s (18.5MB/s-18.5MB/s), io=1085MiB (1138MB), run=61366-61366msec
WRITE: bw=26.5MiB/s (27.8MB/s), 26.5MiB/s-26.5MiB/s (27.8MB/s-27.8MB/s), io=1626MiB (1705MB), run=61366-61366msec
Disk stats (read/write):
vdb: ios=277964/416121, merge=0/7, ticks=1883252/13426595, in_queue=7721698, util=100.00
These values are about half as high as before the disk failure.
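For what it's worth, the clat/lat lines above already are fio's completion latency for this mixed workload at queue depth 16 with 16 jobs. A minimal latency-only sketch, assuming the same /dev/vdb canary device (the job name is just a label), would drop the queuing so that clat/lat approximate the per-I/O latency:

# Latency probe: QD=1, single job, same 4k random mixed pattern
fio --filename=/dev/vdb --direct=1 --rw=randrw --rwmixread=40 \
    --ioengine=libaio --bs=4k --iodepth=1 --numjobs=1 \
    --runtime=60 --time_based --group_reporting --name=4k-latency-probe
# Comparing the clat percentiles of such a run against older baselines would
# show whether the latency itself has changed, not just the throughput.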
> How does the utilisation of the hard disks look (iostat)?
An example output of one storage node:
avg-cpu: %user %nice %system %iowait %steal %idle
3.97 0.00 2.85 0.74 0.00 92.45
Device:  rrqm/s  wrqm/s  r/s  w/s  rkB/s  wkB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
nvme1n1  0.00  0.06  1.16  17.01  98.42  227.18  35.84  1.00  0.06  0.27  0.05  55.02  99.95
nvme2n1  0.00  0.31  5.14  83.86  380.94  2736.85  70.06  1.02  0.29  0.19  0.29  11.23  99.97
nvme0n1  0.27  17.17  240.90  1190.43  15757.58  36318.89  72.77  5.69  3.28  0.20  3.91  0.70  100.00
sda  0.19  0.97  0.22  29.11  11.29  2848.95  195.02  0.05  1.69  0.27  1.70  0.23  0.67
sdd  0.05  3.62  24.85  161.70  2217.69  5640.61  84.25  0.73  3.93  0.58  4.45  0.17  3.22
sdc  0.05  1.75  26.03  152.82  1959.02  5376.83  82.03  0.45  2.52  0.55  2.85  0.18  3.25
sde  0.06  4.52  28.41  154.77  2444.19  4660.30  77.57  0.59  3.22  0.53  3.72  0.15  2.68
sdf  0.08  5.16  28.99  192.36  3150.86  7024.10  91.94  1.39  6.27  0.74  7.11  0.18  3.93
sdg  0.05  3.39  26.78  181.90  2592.68  7402.03  95.79  1.30  6.21  0.67  7.03  0.18  3.86
sdh  0.11  5.33  39.63  231.45  3214.61  7086.68  76.00  1.41  5.22  0.63  6.00  0.16  4.40
md1  0.00  0.00  0.49  25.44  14.30  2846.64  220.69  0.00  0.00  0.00  0.00  0.00  0.00
md0  0.00  0.00  0.00  0.00  0.12  0.01  51.55  0.00  0.00  0.00  0.00  0.00  0.00
sdi  0.01  0.95  0.08  29.14  3.33  2848.95  195.26  0.05  1.71  0.28  1.71  0.24  0.70
sdk  0.08  4.89  26.78  196.32  2292.12  6333.01  77.32  0.90  4.01  0.59  4.48  0.17  3.76
sdj  0.05  1.89  18.91  126.88  2278.79  4909.42  98.61  0.53  3.62  0.76  4.05  0.20  2.88
sdm  0.07  4.83  35.83  203.28  3197.73  7063.41  85.83  0.87  3.66  0.65  4.19  0.18  4.20
sdl  0.07  4.60  28.30  183.15  2414.73  6478.64  84.12  0.83  3.91  0.59  4.42  0.17  3.64
sdp  0.05  3.30  26.35  154.14  2485.64  6156.47  95.76  1.00  5.56  0.67  6.39  0.18  3.30
sdn  0.07  2.66  27.31  153.58  3237.96  6450.18  107.12  0.97  5.35  0.79  6.16  0.20  3.56
sdq  0.06  3.33  27.89  129.79  2768.62  4435.46  91.37  0.59  3.76  0.62  4.44  0.19  2.98
sdo  0.08  7.37  41.22  234.76  2468.76  7442.73  71.83  1.61  5.83  0.55  6.75  0.16  4.38
sdr  0.10  6.78  37.88  240.03  2680.16  7912.17  76.23  1.12  4.02  0.58  4.56  0.17  4.60
sds  0.07  3.84  31.48  206.88  2272.33  6364.27  72.47  0.74  3.08  0.55  3.47  0.16  3.92
sdt  0.07  2.96  28.18  158.38  2888.65  5934.89  94.59  1.09  5.83  0.71  6.74  0.18  3.41
sdu  0.04  2.30  24.03  123.84  2058.85  5431.40  101.31  0.49  3.29  0.56  3.81  0.16  2.39
dm-0  0.00  0.00  0.26  20.79  13.29  2845.47  271.73  0.07  3.55  0.27  3.59  0.88  1.85
dm-1  0.00  0.00  0.23  0.29  0.98  1.17  8.18  0.00  2.26  0.26  3.85  0.11  0.01
sdb  0.04  1.01  17.49  74.93  2301.97  16607.51  409.22  0.90  9.72  0.75  11.81  0.53  4.88
Since the first three devices are NVMe SSDs, the %util value is not
meaningful for them.
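Instead of %util, what we could watch on the NVMe devices (just a sketch, using the same sysstat iostat as above) are the per-I/O wait times and the queue size:

# Three 5-second samples of extended stats, filtered to the NVMe devices;
# r_await/w_await and avgqu-sz are the columns of interest here, not %util
iostat -xd 5 3 | grep -E 'Device|nvme'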
> Check the fragmentation and saturation of the file system.
The following is an example of the fragmentation of one disk:
[0] storage-node-001(cloud_ceph production)~ #xfs_db -r /dev/sde1
xfs_db> frag -d
actual 4994, ideal 776, fragmentation factor 84.46%
Note, this number is largely meaningless.
Files on this filesystem average 6.44 extents per file
xfs_db> frag -f
actual 533266, ideal 394548, fragmentation factor 26.01%
Note, this number is largely meaningless.
Files on this filesystem average 1.35 extents per file
xfs_db>
I'm not really sure what you mean by the saturation of the file system;
'ceph osd df' below shows how full the OSDs are (a free-space check sketch
follows that output).
[0] storage-node-001(cloud_ceph production)~ #ceph osd df
ID WEIGHT REWEIGHT SIZE USE AVAIL %USE VAR PGS
...
...
0 1.74100 1.00000 1782G 984G 798G 55.21 1.50 44
1 1.74100 1.00000 1782G 1148G 634G 64.41 1.75 57
2 1.74100 1.00000 1782G 930G 852G 52.19 1.42 44
4 1.74100 1.00000 1782G 1149G 633G 64.48 1.75 61
5 1.74100 1.00000 1782G 991G 791G 55.62 1.51 53
6 1.74100 1.00000 1782G 1084G 698G 60.84 1.66 52
8 1.74100 1.00000 1782G 1036G 746G 58.15 1.58 51
9 1.74100 1.00000 1782G 906G 876G 50.83 1.38 46
36 1.74100 1.00000 1782G 1086G 695G 60.96 1.66 53
39 1.74100 1.00000 1782G 1196G 586G 67.12 1.83 55
44 1.74100 1.00000 1782G 1137G 644G 63.82 1.74 54
47 1.74100 1.00000 1782G 1073G 709G 60.22 1.64 59
48 1.74100 1.00000 1782G 1144G 638G 64.21 1.75 59
51 1.74100 1.00000 1782G 1035G 746G 58.10 1.58 45
54 1.74100 1.00000 1782G 1131G 651G 63.46 1.73 55
17 1.74100 1.00000 1782G 770G 1012G 43.22 1.18 43
29 1.74100 1.00000 1782G 1086G 695G 60.96 1.66 53
158 1.74080 1.00000 1782G 844G 938G 47.35 1.29 44
...
...
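If "saturation" refers to the free space on the XFS file systems themselves rather than the OSD fill level, a read-only check we could run per OSD disk (a sketch, reusing /dev/sde1 from the xfs_db example above) is the free-space histogram:

# Summary histogram of free extents; many small free extents would indicate
# free-space fragmentation even where the overall usage looks unremarkable
xfs_db -r -c "freesp -s" /dev/sde1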
> What about the leveldb compaction?
We've compacted all three mons without any positive impact:
[0] ctrl-vm-020(cloud ceph (monitor))~ #time ceph compact
compacted leveldb in 3.067166
real 0m3.436s
user 0m0.275s
sys 0m0.045s
[0] storage-node-001(cloud_ceph production)~ #time ceph compact
compacted leveldb in 3.190430
real 0m3.480s
user 0m0.180s
sys 0m0.037s
[0] storage-node-002(cloud_ceph production)~ #time ceph compact
compacted leveldb in 2.284909
real 0m2.559s
user 0m0.170s
sys 0m0.041s
The fio benchmark as shown above has not changed.
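For reference, a sketch of how we could check whether the mon stores were large enough to matter before compacting; the path is assumed from the default Jewel layout (/var/lib/ceph/mon/<cluster>-<id>/store.db):

# Size of each monitor's leveldb store; run on every mon host
du -sh /var/lib/ceph/mon/*/store.db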
> Has the CPU load on the nodes changed?
No, it is more or less the same as before. An example:
top - 18:03:02 up 3 days, 10:24, 2 users, load average: 3.19, 2.74, 2.83
Tasks:693 total, 1 running,692 sleeping, 0 stopped, 0 zombie
top - 18:04:09 up 3 days, 10:25, 2 users, load average: 3.04, 2.79, 2.85
Tasks:697 total, 1 running,696 sleeping, 0 stopped, 0 zombie
%Cpu0 : 5.9 us, 3.6 sy, 0.0 ni,88.5 id, 0.3 wa, 0.0 hi, 1.6 si, 0.0 st
%Cpu1 : 5.6 us, 3.7 sy, 0.0 ni,89.4 id, 0.3 wa, 0.0 hi, 1.0 si, 0.0 st
%Cpu2 : 7.3 us, 4.0 sy, 0.0 ni,87.7 id, 0.7 wa, 0.0 hi, 0.3 si, 0.0 st
%Cpu3 : 7.0 us, 3.6 sy, 0.0 ni,88.4 id, 0.7 wa, 0.0 hi, 0.3 si, 0.0 st
%Cpu4 : 7.6 us, 3.6 sy, 0.0 ni,88.1 id, 0.7 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu5 : 7.9 us, 3.3 sy, 0.0 ni,87.5 id, 1.0 wa, 0.0 hi, 0.3 si, 0.0 st
%Cpu6 : 7.3 us, 4.0 sy, 0.0 ni,86.8 id, 1.7 wa, 0.0 hi, 0.3 si, 0.0 st
%Cpu7 : 6.6 us, 4.0 sy, 0.0 ni,88.1 id, 1.0 wa, 0.0 hi, 0.3 si, 0.0 st
%Cpu8 : 7.4 us, 3.7 sy, 0.0 ni,88.0 id, 0.7 wa, 0.0 hi, 0.3 si, 0.0 st
%Cpu9 : 6.6 us, 4.0 sy, 0.0 ni,87.7 id, 1.0 wa, 0.0 hi, 0.7 si, 0.0 st
%Cpu10 : 6.9 us, 3.3 sy, 0.0 ni,88.8 id, 0.7 wa, 0.0 hi, 0.3 si, 0.0 st
%Cpu11 : 6.9 us, 4.2 sy, 0.0 ni,87.6 id, 1.0 wa, 0.0 hi, 0.3 si, 0.0 st
%Cpu12 : 3.6 us, 1.7 sy, 0.0 ni,94.1 id, 0.3 wa, 0.0 hi, 0.3 si, 0.0 st
%Cpu13 : 3.3 us, 3.0 sy, 0.0 ni,93.4 id, 0.3 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu14 : 3.9 us, 2.6 sy, 0.0 ni,93.4 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu15 : 3.6 us, 2.3 sy, 0.0 ni,94.1 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu16 : 4.3 us, 2.3 sy, 0.0 ni,92.7 id, 0.7 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu17 : 3.9 us, 2.6 sy, 0.0 ni,93.1 id, 0.3 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu18 : 3.9 us, 2.3 sy, 0.0 ni,92.4 id, 1.3 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu19 : 3.0 us, 2.3 sy, 0.0 ni,91.4 id, 3.3 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu20 : 3.3 us, 2.0 sy, 0.0 ni,94.4 id, 0.3 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu21 : 3.3 us, 2.6 sy, 0.0 ni,93.7 id, 0.3 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu22 : 2.6 us, 2.3 sy, 0.0 ni,94.7 id, 0.3 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu23 : 3.6 us, 2.3 sy, 0.0 ni,93.8 id, 0.3 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu24 : 2.0 us, 0.7 sy, 0.0 ni,97.0 id, 0.0 wa, 0.0 hi, 0.3 si, 0.0 st
%Cpu25 : 2.0 us, 1.3 sy, 0.0 ni,96.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu26 : 2.3 us, 1.0 sy, 0.0 ni,96.1 id, 0.3 wa, 0.0 hi, 0.3 si, 0.0 st
%Cpu27 : 2.3 us, 1.7 sy, 0.0 ni,95.7 id, 0.0 wa, 0.0 hi, 0.3 si, 0.0 st
%Cpu28 : 1.7 us, 2.3 sy, 0.0 ni,95.7 id, 0.0 wa, 0.0 hi, 0.3 si, 0.0 st
%Cpu29 : 2.3 us, 1.0 sy, 0.0 ni,96.3 id, 0.0 wa, 0.0 hi, 0.3 si, 0.0 st
%Cpu30 : 2.0 us, 1.0 sy, 0.0 ni,96.7 id, 0.0 wa, 0.0 hi, 0.3 si, 0.0 st
%Cpu31 : 2.6 us, 1.0 sy, 0.0 ni,95.8 id, 0.3 wa, 0.0 hi, 0.3 si, 0.0 st
%Cpu32 : 2.6 us, 1.0 sy, 0.0 ni,96.1 id, 0.0 wa, 0.0 hi, 0.3 si, 0.0 st
%Cpu33 : 3.0 us, 1.0 sy, 0.0 ni,95.4 id, 0.0 wa, 0.0 hi, 0.7 si, 0.0 st
%Cpu34 : 2.3 us, 1.0 sy, 0.0 ni,96.4 id, 0.0 wa, 0.0 hi, 0.3 si, 0.0 st
%Cpu35 : 2.0 us, 0.7 sy, 0.0 ni,97.0 id, 0.0 wa, 0.0 hi, 0.3 si, 0.0 st
%Cpu36 : 0.3 us, 1.0 sy, 0.0 ni,98.4 id, 0.3 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu37 : 1.0 us, 1.0 sy, 0.0 ni,97.7 id, 0.0 wa, 0.0 hi, 0.3 si, 0.0 st
%Cpu38 : 1.3 us, 0.7 sy, 0.0 ni,98.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu39 : 0.7 us, 0.7 sy, 0.0 ni,98.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu40 : 0.3 us, 0.3 sy, 0.0 ni,99.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu41 : 1.0 us, 1.0 sy, 0.0 ni,97.4 id, 0.3 wa, 0.0 hi, 0.3 si, 0.0 st
%Cpu42 : 0.3 us, 0.3 sy, 0.0 ni,99.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu43 : 1.0 us, 0.7 sy, 0.0 ni,98.4 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu44 : 1.0 us, 1.0 sy, 0.0 ni,98.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu45 : 0.7 us, 1.0 sy, 0.0 ni,98.4 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu46 : 0.7 us, 0.7 sy, 0.0 ni,98.0 id, 0.7 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu47 : 0.7 us, 0.3 sy, 0.0 ni,99.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem :13174224+total, 3221368 free,30296824 used,98224048 buff/cache
KiB Swap: 8388604 total, 8312828 free, 75776 used.88300832 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
11097 ceph 20 0 6995508 1.4g 5956 S 63.3 1.1 3911:53 ceph-osd
10380 ceph 20 0 4538152 591284 13052 S 24.0 0.4 464:08.13 ceph-osd
17588 ceph 20 0 4934448 690908 11932 S 24.0 0.5 802:59.36 ceph-osd
11770 ceph 20 0 4384968 606812 6504 S 16.2 0.5 652:43.91 ceph-osd
18986 ceph 20 0 4434068 672880 5868 S 13.0 0.5 636:27.15 ceph-osd
8731 ceph 20 0 4781112 697388 5876 S 12.7 0.5 820:54.88 ceph-osd
...
...
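One ceph-osd (pid 11097) is noticeably busier than the rest. A sketch of how we could pull that daemon's internal latency counters via its admin socket, run on the host carrying the OSD; osd.0 is a placeholder ID, and the counter names are as I recall them for Jewel/FileStore, so they may differ slightly:

# Dump the OSD's perf counters and pick out the latency-related ones
ceph daemon osd.0 perf dump | python -m json.tool \
    | grep -A 3 -E 'op_r_latency|op_w_latency|journal_latency'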
During rebalancing (recovery), we saw recovery I/O between a few
100 MB/s and 3500 MB/s. The following output is a typical example:
recovery io 1702 MB/s, 426 objects/s
client io 551 MB/s rd, 131 MB/s wr, 4433 op/s rd, 3510 op/s wr
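For completeness (the rebalancing has finished by now), a sketch of how the recovery throttling in effect could be inspected on one OSD; osd.0 is again a placeholder and the option names are the Jewel ones:

# Show the recovery/backfill throttles the OSD is actually running with
ceph daemon osd.0 config show \
    | grep -E 'osd_max_backfills|osd_recovery_max_active|osd_recovery_op_priority'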
With best regards, Michael
On 09/07/2022 16.08, Joachim Kraftmayer wrote:
Hi Michael,
I have the following starting points:
Has the read and write latency also changed?
How does the utilisation of the hard disks look (iostat)?
Check the fragmentation and saturation of the file system.
What about the leveldb compaction?
Has the CPU load on the nodes changed?
Regards, Joachim
On 09.07.2022 15:18, Michael Eichenberger
<michael.eichenberger@xxxxxxxxxxxxxxxxx> wrote:
Hi all,
We are currently running out of ideas as to what causes a 50% performance
decrease in disk I/O after removing an OSD from our cluster.
Performance measurement is regularly done by a canary virtual machine,
which runs an hourly disk I/O measurement. Performance degradation is
also reported by customers on other virtual machines.
Symptoms:
- After taking a failed disk out of our Ceph cluster ('ceph osd out X'),
the canary VM measures a 50% performance degradation.
- Finishing rebalancing did not have an impact on performance.
- 'recovery io' as reported by ceph status is as high as usual.
- 'client io' as reported by ceph status is significantly lower than
usual; peaks are approximately a factor of 10 lower than 'recovery io',
which was not the case before.
Actions/checks done, all without any impact on performance:
- Logs do not show any indication of failures or irregularities
(but we found a flapping OSD, which we also took out, without
further performance impact!).
- No full or near-full OSDs; PGs are balanced across OSDs.
- Network operates in the usual manner (and did not change); no
saturation or high usage on links; bonds are OK; MTU settings checked and OK.
- Crush map does not show any unexpected entries.
- Reboot of mons (one after another).
- Reboot of storage nodes (one after another).
Cluster information:
- Version: ceph version 10.2.11 (Jewel)
- Operating System: CentOS 7
- 3 mons, 180 OSD on 8 storage nodes
- 382 TB used, 661 TB / 1043 TB avail
- OSDs on NVMe, SSD and HDD; pools are mapped to a single device type
(no mixed pools).
- All OSDs use FileStore.
Since we have currently run out of ideas as to what could cause these
performance problems, we appreciate any hint that increases the
probability of finding a solution!
Thanks in advance.
With best regards, Michael
--
stepping stone AG
Wasserwerkgasse 7
CH-3011 Bern
Telefon: +41 31 332 53 63
www.stepping-stone.ch
michael.eichenberger@xxxxxxxxxxxxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx