Re: speedup ceph / scaling / find the bottleneck

iostat output from "iostat -x -t 5" while doing 4k random writes:


06/29/2012 03:20:55 PM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          31,63    0,00   52,64    0,78    0,00   14,95

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sdb 0,00 690,40 0,00 3143,60 0,00 33958,80 10,80 2,68 0,85 0,08 24,08
sdc 0,00 1069,80 0,00 5151,60 0,00 54693,00 10,62 8,31 1,61 0,06 29,68
sdd 0,00 581,00 0,00 2762,80 0,00 27809,00 10,07 2,45 0,89 0,08 21,12
sde 0,00 820,00 0,00 4208,20 0,00 43457,40 10,33 4,00 0,95 0,07 28,56
sda 0,00 0,00 0,00 0,40 0,00 9,60 24,00 0,00 0,00 0,00 0,00

06/29/2012 03:21:00 PM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          29,68    0,00   52,89    0,98    0,00   16,45

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sdb 0,00 1046,60 0,00 5544,20 0,00 57938,00 10,45 6,08 1,10 0,06 32,08
sdc 0,00 115,60 0,00 3483,60 0,00 29368,00 8,43 3,45 0,99 0,06 21,36
sdd 0,00 1143,20 0,00 5991,00 0,00 62607,40 10,45 6,03 1,01 0,06 35,20
sde 0,00 1070,00 0,00 5561,60 0,00 58207,20 10,47 5,76 1,04 0,07 38,08
sda 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00

06/29/2012 03:21:05 PM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          29,69    0,00   53,06    0,60    0,00   16,65

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sdb 0,00 199,60 0,00 4484,40 0,00 41338,20 9,22 1,96 0,44 0,07 30,56
sdc 0,00 766,60 0,00 3616,20 0,00 38829,00 10,74 3,62 1,00 0,07 25,68
sdd 0,00 149,20 0,00 5066,60 0,00 45793,60 9,04 4,48 0,89 0,06 28,48
sde 0,00 150,00 0,00 4328,80 0,00 36496,00 8,43 2,96 0,68 0,07 32,40
sda 0,00 0,00 0,00 0,40 0,00 35,20 88,00 0,00 0,00 0,00 0,00

06/29/2012 03:21:10 PM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          29,11    0,00   46,58    0,50    0,00   23,81

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sdb 0,00 881,20 0,00 3077,20 0,00 33382,80 10,85 3,44 1,12 0,06 18,16
sdc 0,00 867,60 0,00 5098,40 0,00 52056,20 10,21 5,65 1,11 0,05 24,32
sdd 0,00 864,40 0,00 2759,00 0,00 30321,60 10,99 3,39 1,23 0,06 17,36
sde 0,00 846,20 0,00 3193,40 0,00 36795,60 11,52 3,48 1,09 0,06 19,92
sda 0,00 0,00 0,00 1,40 0,00 11,20 8,00 0,01 4,57 2,29 0,32
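
(Aside on reading these figures: iostat reports rsec/s and wsec/s in 512-byte sectors, so the per-disk write rates can be converted to MB/s with a minimal Python sketch like the one below. The values are copied from the 15:20:55 sample above; treating sdb-sde as the OSD data disks is an assumption.)

    # Convert iostat wsec/s figures (512-byte sectors) into MB/s per disk
    # and per node. Values copied from the 15:20:55 sample above; treating
    # sdb-sde as the OSD data disks is an assumption.

    def wsec_to_mb_per_s(wsec_per_s: float) -> float:
        """iostat sectors are 512 bytes, so wsec/s * 512 = bytes/s."""
        return wsec_per_s * 512 / 1e6

    sample = {"sdb": 33958.80, "sdc": 54693.00, "sdd": 27809.00, "sde": 43457.40}

    per_disk = {dev: wsec_to_mb_per_s(v) for dev, v in sample.items()}
    for dev, mb in per_disk.items():
        print(f"{dev}: {mb:.1f} MB/s")                         # roughly 14-28 MB/s per disk
    print(f"node total: {sum(per_disk.values()):.1f} MB/s")    # roughly 82 MB/s

The resulting node total of roughly 82 MB/s sits inside the 20-100 MB/s range reported in the quoted message below, once scrubbing had been ruled out.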


On 29.06.2012 15:16, Stefan Priebe - Profihost AG wrote:
Big sorry. Ceph was scrubbing during my last test. I didn't recognize this.

When I redo the test I see writes between 20 MB/s and 100 MB/s. That is
OK. Sorry.

Stefan

On 29.06.2012 15:11, Stefan Priebe - Profihost AG wrote:
Another BIG hint.

While doing random 4k I/O from one VM I achieve 14k IOPS. This is
around 54 MB/s. But EACH ceph-osd machine is writing between 500 MB/s and
750 MB/s. What are they writing?!

Just an idea:
Do they completely rewrite EACH 4 MB block for every 4k write?
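
(A rough back-of-the-envelope check of that question, as a minimal Python sketch. The 4-node count is an assumption taken from the "4 osd nodes" results further down; the per-node write rate uses the midpoint of the 500-750 MB/s range quoted above.)

    # Back-of-the-envelope write amplification check (illustrative only).

    client_iops = 14_000                  # observed 4k random write IOPS from one VM
    client_mb_s = client_iops * 4 / 1024  # 4 KiB writes -> ~54 MB/s, as quoted above

    osd_nodes = 4                         # assumed number of osd nodes
    osd_write_mb_s = (500 + 750) / 2      # midpoint of the observed 500-750 MB/s per node

    aggregate_osd_mb_s = osd_nodes * osd_write_mb_s
    amplification = aggregate_osd_mb_s / client_mb_s
    full_object_rewrite = 4 * 1024 / 4    # rewriting a whole 4 MB object per 4k write

    print(f"client write rate       : {client_mb_s:.1f} MB/s")
    print(f"aggregate osd writes    : {aggregate_osd_mb_s:.0f} MB/s")
    print(f"observed amplification  : ~{amplification:.0f}x")
    print(f"full-4MB-rewrite factor : {full_object_rewrite:.0f}x")

The observed factor of roughly 46x is far below the ~1024x that rewriting a full 4 MB object per 4k write would imply, which fits the newer message above: the extra traffic came from scrubbing.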

Stefan

On 29.06.2012 15:02, Stefan Priebe - Profihost AG wrote:
On 29.06.2012 13:49, Mark Nelson wrote:
I'll try to replicate your findings in house.  I've got some other
things I have to do today, but hopefully I can take a look next
week. If
I recall correctly, in the other thread you said that sequential writes
are using much less CPU time on your systems?

Random 4k writes: 10% idle
Seq 4k writes: !! 99,7% !! idle
Seq 4M writes: 90% idle


 >  Do you see better scaling in that case?

3 osd nodes:
1 VM:
Rand 4k writes: 7000 iops
Seq 4k writes: 19900 iops

2 VMs:
Rand 4k writes: 6000 iops each
Seq 4k writes: 4000 iops (VM 1)
Seq 4k writes: 18500 iops (VM 2)


4 osd nodes:
1 VM:
Rand 4k writes: 14400 iops
Seq 4k writes: 19000 iops

2 VMs:
Rand 4k writes: 7000 iops each
Seq 4k writes: 18000 iops each
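
(To put these runs side by side, a minimal Python sketch that aggregates the per-VM random 4k figures above by cluster size; all numbers are copied from this message.)

    # Aggregate the random 4k write results above by osd node count
    # (per-VM iops copied from the figures in this message).

    results = {
        3: {"1 VM": [7000], "2 VMs": [6000, 6000]},
        4: {"1 VM": [14400], "2 VMs": [7000, 7000]},
    }

    for nodes, runs in results.items():
        for label, per_vm_iops in runs.items():
            total = sum(per_vm_iops)
            print(f"{nodes} osd nodes, {label}: {total} iops total "
                  f"({total / nodes:.0f} iops per node)")

With two VMs, the random 4k aggregate only moves from 12000 to 14000 iops when going from 3 to 4 osd nodes.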



To figure out where CPU is being used, you could try various options:
oprofile, perf, valgrind, strace.  Each has its own advantages.

Here's how you can create a simple callgraph with perf:

http://lwn.net/Articles/340010/
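
(A minimal sketch of how such a capture could be scripted from Python; it assumes perf is installed and simply wraps perf record -g plus perf report, with OSD_PID as a placeholder for the target ceph-osd process id.)

    # Minimal sketch: capture ~10 seconds of call-graph samples from a running
    # ceph-osd with perf, then print the report. Assumes perf is installed;
    # OSD_PID is a placeholder for the target ceph-osd process id.
    import subprocess

    OSD_PID = "12345"  # placeholder

    subprocess.run(
        ["perf", "record", "-g", "-p", OSD_PID, "--", "sleep", "10"],
        check=True,
    )
    report = subprocess.run(
        ["perf", "report", "--stdio"],
        check=True, capture_output=True, text=True,
    )
    print(report.stdout)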
10s perf data output while doing random 4k writes:
https://raw.github.com/gist/2c16136faebec381ae35/09e6de68a5461a198430a9ec19dfd5392f276706/gistfile1.txt




Stefan




