On 07/02/15 13:49, German Anders wrote:
output from iostat:
CEPHOSD01:
Device:            rrqm/s   wrqm/s      r/s      w/s    rMB/s    wMB/s avgrq-sz avgqu-sz    await  r_await  w_await    svctm    %util
sdc(ceph-0)          0.00     0.00     1.00   389.00     0.00    35.98   188.96    60.32   120.12    16.00   120.39     1.26    49.20
sdd(ceph-1)          0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00
sdf(ceph-2)          0.00     1.00     6.00   521.00     0.02    60.72   236.05   143.10   309.75   484.00   307.74     1.90   100.00
sdg(ceph-3)          0.00     0.00    11.00   535.00     0.04    42.41   159.22   139.25   279.72   394.18   277.37     1.83   100.00
sdi(ceph-4)          0.00     1.00     4.00   560.00     0.02    54.87   199.32   125.96   187.07   562.00   184.39     1.65    93.20
sdj(ceph-5)          0.00     0.00     0.00   566.00     0.00    61.41   222.19   109.13   169.62     0.00   169.62     1.53    86.40
sdl(ceph-6)          0.00     0.00     8.00     0.00     0.09     0.00    23.00     0.12    12.00    12.00     0.00     2.50     2.00
sdm(ceph-7)          0.00     0.00     2.00   481.00     0.01    44.59   189.12   116.64   241.41   268.00   241.30     2.05    99.20
sdn(ceph-8)          0.00     0.00     1.00     0.00     0.00     0.00     8.00     0.01     8.00     8.00     0.00     8.00     0.80
fioa                 0.00     0.00     0.00  1016.00     0.00    19.09    38.47     0.00     0.06     0.00     0.06     0.00     0.00

Device:            rrqm/s   wrqm/s      r/s      w/s    rMB/s    wMB/s avgrq-sz avgqu-sz    await  r_await  w_await    svctm    %util
sdc(ceph-0)          0.00     1.00    10.00   278.00     0.04    26.07   185.69    60.82   257.97   309.60   256.12     2.83    81.60
sdd(ceph-1)          0.00     0.00     2.00     0.00     0.02     0.00    20.00     0.02    10.00    10.00     0.00    10.00     2.00
sdf(ceph-2)          0.00     1.00     6.00   579.00     0.02    54.16   189.68   142.78   246.55   328.67   245.70     1.71   100.00
sdg(ceph-3)          0.00     0.00    10.00    75.00     0.05     5.32   129.41     4.94   185.08    11.20   208.27     4.05    34.40
sdi(ceph-4)          0.00     0.00    19.00   147.00     0.09    12.61   156.63    17.88   230.89   114.32   245.96     3.37    56.00
sdj(ceph-5)          0.00     1.00     2.00   629.00     0.01    43.66   141.72   143.00   223.35   426.00   222.71     1.58   100.00
sdl(ceph-6)          0.00     0.00    10.00     0.00     0.04     0.00     8.00     0.16    18.40    18.40     0.00     5.60     5.60
sdm(ceph-7)          0.00     0.00    11.00     4.00     0.05     0.01     8.00     0.48    35.20    25.82    61.00    14.13    21.20
sdn(ceph-8)          0.00     0.00     9.00     0.00     0.07     0.00    15.11     0.07     8.00     8.00     0.00     4.89     4.40
fioa                 0.00     0.00     0.00  6415.00     0.00   125.81    40.16     0.00     0.14     0.00     0.14     0.00     0.00
CEPHOSD02:
Device:            rrqm/s   wrqm/s      r/s      w/s    rMB/s    wMB/s avgrq-sz avgqu-sz    await  r_await  w_await    svctm    %util
sdc1(ceph-9)         0.00     0.00    13.00     0.00     0.11     0.00    16.62     0.17    13.23    13.23     0.00     4.92     6.40
sdd1(ceph-10)        0.00     0.00    15.00     0.00     0.13     0.00    18.13     0.26    17.33    17.33     0.00     1.87     2.80
sdf1(ceph-11)        0.00     0.00    22.00   650.00     0.11    51.75   158.04   143.27   212.07   308.55   208.81     1.49   100.00
sdg1(ceph-12)        0.00     0.00    12.00   282.00     0.05    54.60   380.68    13.16   120.52   352.00   110.67     2.91    85.60
sdi1(ceph-13)        0.00     0.00     1.00     0.00     0.00     0.00     8.00     0.01     8.00     8.00     0.00     8.00     0.80
sdj1(ceph-14)        0.00     0.00    20.00     0.00     0.08     0.00     8.00     0.26    12.80    12.80     0.00     3.60     7.20
sdl1(ceph-15)        0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00
sdm1(ceph-16)        0.00     0.00    20.00   424.00     0.11    32.20   149.05    89.69   235.30   243.00   234.93     2.14    95.20
sdn1(ceph-17)        0.00     0.00     5.00   411.00     0.02    45.47   223.94    98.32   182.28  1057.60   171.63     2.40   100.00

Device:            rrqm/s   wrqm/s      r/s      w/s    rMB/s    wMB/s avgrq-sz avgqu-sz    await  r_await  w_await    svctm    %util
sdc1(ceph-9)         0.00     0.00    26.00   383.00     0.11    34.32   172.44    86.92   258.64   297.08   256.03     2.29    93.60
sdd1(ceph-10)        0.00     0.00     8.00    31.00     0.09     1.86   101.95     0.84   178.15    94.00   199.87     6.46    25.20
sdf1(ceph-11)        0.00     1.00     5.00   409.00     0.05    48.34   239.34    90.94   219.43   383.20   217.43     2.34    96.80
sdg1(ceph-12)        0.00     0.00     0.00   238.00     0.00     1.64    14.12    58.34   143.60     0.00   143.60     1.83    43.60
sdi1(ceph-13)        0.00     0.00    11.00     0.00     0.05     0.00    10.18     0.16    14.18    14.18     0.00     5.09     5.60
sdj1(ceph-14)        0.00     0.00     1.00     0.00     0.00     0.00     8.00     0.02    16.00    16.00     0.00    16.00     1.60
sdl1(ceph-15)        0.00     0.00     1.00     0.00     0.03     0.00    64.00     0.01    12.00    12.00     0.00    12.00     1.20
sdm1(ceph-16)        0.00     1.00     4.00   587.00     0.03    50.09   173.69   143.32   244.97   296.00   244.62     1.69   100.00
sdn1(ceph-17)        0.00     0.00     0.00   375.00     0.00    23.68   129.34    69.76   182.51     0.00   182.51     2.47    92.80
If this iostat output is typical, it seems you are limited by random
writes on a subset of your OSDs: you have 9 on each server, but only
4 to 6 of them are taking writes at any time, and the ratio of wMB/s
to w/s points to a moderately random access pattern.
You should find out why. You may have a configuration problem, or the
access to your rbds may be concentrated on a few 4MB sections (the
default object size) of the devices.
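To put numbers on that: dividing wMB/s by w/s gives the average write
size each disk sees, and %util tells you which OSDs are actually doing
the work. Here is a rough Python sketch of that arithmetic using a few
of the CEPHOSD01 values quoted above (the numbers are copied from the
iostat output, the 80% threshold and the rest are just my illustration):

# (w/s, wMB/s, %util) copied from the first CEPHOSD01 iostat sample.
osds = {
    "sdc(ceph-0)": (389.0, 35.98, 49.2),
    "sdf(ceph-2)": (521.0, 60.72, 100.0),
    "sdg(ceph-3)": (535.0, 42.41, 100.0),
    "sdi(ceph-4)": (560.0, 54.87, 93.2),
    "sdj(ceph-5)": (566.0, 61.41, 86.4),
    "sdl(ceph-6)": (0.0, 0.00, 2.0),
    "sdm(ceph-7)": (481.0, 44.59, 99.2),
    "sdn(ceph-8)": (0.0, 0.00, 0.8),
}

for dev, (w_per_s, wmb_per_s, util) in osds.items():
    # Average write size in KB: small values (far below the 4MB object
    # size) mean the pattern reaching the disk is fairly random.
    avg_kb = (wmb_per_s * 1024 / w_per_s) if w_per_s else 0.0
    busy = "busy" if util > 80 else "mostly idle"
    print(f"{dev}: {avg_kb:6.1f} KB/write, {util:5.1f}% util -> {busy}")

For the busy disks this works out to roughly 80-120KB per write, far
below the 4MB object size, which is what I mean by a moderately random
pattern concentrated on a subset of the OSDs.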
The other OSD server had pretty much the same load.
The config of the OSDs is the following:
- 2x Intel Xeon E5-2609 v2 @ 2.50GHz (4C)
- 128G RAM
- 2x 120G SSD Intel SSDSC2BB12 (RAID-1) for OS
- 2x 10GbE ADPT DP
- Journals are configured to run on RAMDISK (TMPFS), but on the first
OSD server we have the journals going to a FusionIO (/dev/fioa)
adapter with battery.
I suppose this is not yet in production (TMPFS journals). You only
have 128G RAM for 9 OSDs: what is the size of your journals when you
use TMPFS, and more importantly, what is the value of filestore max
sync interval?
I'm not sure how the OSDs will react to a journal with multi-GB/s
write bandwidth: the default filestore max sync interval might be too
high (it should be low enough to prevent the journal from filling up).
On the other hand, a low max interval will prevent the OS from
reordering writes to the hard drives, which helps avoid too much
random IO.
So there are two causes I can see that might lead to performance
problems:
- the IO load might not be distributed to all your OSDs, limiting your
total bandwidth,
- you might get IO freezes when your TMPFS journals fill up during
very high bursts (probably unlikely, but the consequences might be
dire; see the sketch below).
Another problem I see is cost: FusionIO speed and cost (and TMPFS
speed) are probably overkill for the journals in your case. With your
setup, 2x Intel DC S3500 would probably be enough (unless you need
more write endurance).
With what you save by not using a FusionIO card in each server, you
could probably add more servers and get far better performance
overall.
If you do, use a 10GB journal size and a filestore max sync interval
that allows only half of it to be written between syncs. With
2x 500MB/s of write bandwidth divided between 9 balanced OSDs, each
journal sees roughly 110MB/s, so you could use 30s with room to spare.
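Spelling that out in a quick Python sketch (my own illustration; the
numbers are the ones from the paragraph above, and the option names in
the comments are the usual osd journal size, in MB, and filestore max
sync interval, in seconds):

# Sizing check for: osd journal size = 10240, filestore max sync interval = 30.
journal_bw_mb_s = 2 * 500      # two DC S3500s per server at ~500MB/s each
osds_per_server = 9
journal_mb = 10 * 1024         # 10GB journal per OSD

per_osd_mb_s = journal_bw_mb_s / osds_per_server   # ~111MB/s per OSD journal
written_per_sync = per_osd_mb_s * 30               # data written in one 30s interval
print(f"per-OSD journal write rate: {per_osd_mb_s:.0f} MB/s")
print(f"written in one 30s interval: {written_per_sync:.0f} MB "
      f"({100 * written_per_sync / journal_mb:.0f}% of a 10GB journal)")
# ~3333 MB, about a third of the journal -- comfortably under the
# "no more than half" rule of thumb, hence "30s with room to spare".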
This assumes you can distribute IOs to all OSDs; you might have to
convert your rbds to a lower order or use striping to achieve this
if you have atypical access patterns.
Lionel
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com