On 07/09/2013 06:47 AM, Sebastien Han wrote:
Hi Mark,
Yes, write-back caching is enabled since we have a BBU. The current
cache policy of the controller is: WriteBack, ReadAheadNone and Direct.
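For reference, this is roughly how those settings can be verified (the MegaCli
binary path, adapter selection and device name are just examples and may differ
on other setups):
$ sudo /opt/MegaRAID/MegaCli/MegaCli64 -LDGetProp -Cache -LAll -aALL   # controller cache policy per logical drive
$ sudo hdparm -W /dev/sdb                                              # reports whether the disk's own write cache is enabled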
FYI, both the journal and the filestore live on the same disks: sd*1 is the
journal partition and sd*2 is the filestore partition.
To give you a bit more detail about the behaviour, here's what I see when I
do a "rados put":
Device:    rrqm/s  wrqm/s     r/s     w/s   rMB/s   wMB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
sdc          0.00    0.00    0.00    0.00    0.00    0.00      0.00      0.00   0.00     0.00     0.00   0.00   0.00
sdc1         0.00    0.00    0.00    0.00    0.00    0.00      0.00      0.00   0.00     0.00     0.00   0.00   0.00
sdc2         0.00    0.00    0.00    0.00    0.00    0.00      0.00      0.00   0.00     0.00     0.00   0.00   0.00
sde          0.00    0.00    0.00    0.00    0.00    0.00      0.00      0.00   0.00     0.00     0.00   0.00   0.00
sde1         0.00    0.00    0.00    0.00    0.00    0.00      0.00      0.00   0.00     0.00     0.00   0.00   0.00
sde2         0.00    0.00    0.00    0.00    0.00    0.00      0.00      0.00   0.00     0.00     0.00   0.00   0.00
sdd          0.00    0.00    0.00    0.00    0.00    0.00      0.00      0.00   0.00     0.00     0.00   0.00   0.00
sdd1         0.00    0.00    0.00    0.00    0.00    0.00      0.00      0.00   0.00     0.00     0.00   0.00   0.00
sdd2         0.00    0.00    0.00    0.00    0.00    0.00      0.00      0.00   0.00     0.00     0.00   0.00   0.00
sdb          0.00    0.30    0.00  100.40    0.00   25.63    522.85      0.07   0.69     0.00     0.69   0.08   0.80
sdb1         0.00    0.00    0.00   48.10    0.00   12.83    546.08      0.03   0.67     0.00     0.67   0.09   0.44
sdb2         0.00    0.30    0.00   49.00    0.00   12.81    535.27      0.04   0.75     0.00     0.75   0.07   0.36
And here's what I see with a "rados bench" with a concurrency of 1 on a
pool with only one copy.
Device:    rrqm/s  wrqm/s     r/s     w/s   rMB/s   wMB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
sdc          0.00    8.20    0.00   71.20    0.00   16.12    463.78      0.05   0.77     0.00     0.77   0.08   0.60
sdc1         0.00    0.00    0.00   30.20    0.00    8.02    543.63      0.02   0.81     0.00     0.81   0.11   0.32
sdc2         0.00    8.20    0.00   38.80    0.00    8.11    427.93      0.03   0.78     0.00     0.78   0.07   0.28
sde          0.00    1.20    0.00   57.70    0.00   14.42    511.94      0.06   0.96     0.00     0.96   0.14   0.80
sde1         0.00    0.00    0.00   27.20    0.00    7.21    543.24      0.02   0.81     0.00     0.81   0.10   0.28
sde2         0.00    1.20    0.00   28.50    0.00    7.21    518.01      0.03   1.16     0.00     1.16   0.18   0.52
sdd          0.00    1.50    0.00   78.40    0.00   19.24    502.50      0.08   1.08     0.00     1.08   0.12   0.92
sdd1         0.00    0.00    0.00   36.30    0.00    9.62    542.74      0.03   0.88     0.00     0.88   0.09   0.32
sdd2         0.00    1.50    0.00   39.40    0.00    9.62    499.86      0.05   1.33     0.00     1.33   0.15   0.60
sdb          0.00    2.10    0.00   80.70    0.00   20.04    508.49      0.06   0.69     0.00     0.69   0.07   0.60
sdb1         0.00    0.00    0.00   37.80    0.00   10.02    542.92      0.02   0.62     0.00     0.62   0.07   0.28
sdb2         0.00    2.10    0.00   40.10    0.00   10.02    511.54      0.03   0.80     0.00     0.80   0.08   0.32
This definitely looks like way too much to me…
I also tried to reproduce the rados bench behaviour by looping the "rados
put" command, and then I got something like the iostat output below.
Device:    rrqm/s  wrqm/s     r/s     w/s   rMB/s   wMB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
sdc          0.00    0.00    0.00    0.00    0.00    0.00      0.00      0.00   0.00     0.00     0.00   0.00   0.00
sdc1         0.00    0.00    0.00    0.00    0.00    0.00      0.00      0.00   0.00     0.00     0.00   0.00   0.00
sdc2         0.00    0.00    0.00    0.00    0.00    0.00      0.00      0.00   0.00     0.00     0.00   0.00   0.00
sde          0.00    0.00    0.00  589.00    0.00  152.15    529.03      0.55   0.93     0.00     0.93   0.12   6.80
sde1         0.00    0.00    0.00  285.00    0.00   76.15    547.20      0.22   0.77     0.00     0.77   0.13   3.60
sde2         0.00    0.00    0.00  285.00    0.00   76.00    546.13      0.32   1.14     0.00     1.14   0.11   3.20
sdd          0.00    0.00    0.00    0.00    0.00    0.00      0.00      0.00   0.00     0.00     0.00   0.00   0.00
sdd1         0.00    0.00    0.00    0.00    0.00    0.00      0.00      0.00   0.00     0.00     0.00   0.00   0.00
sdd2         0.00    0.00    0.00    0.00    0.00    0.00      0.00      0.00   0.00     0.00     0.00   0.00   0.00
sdb          0.00    0.00    0.00    0.00    0.00    0.00      0.00      0.00   0.00     0.00     0.00   0.00   0.00
sdb1         0.00    0.00    0.00    0.00    0.00    0.00      0.00      0.00   0.00     0.00     0.00   0.00   0.00
sdb2         0.00    0.00    0.00    0.00    0.00    0.00      0.00      0.00   0.00     0.00     0.00   0.00   0.00

Device:    rrqm/s  wrqm/s     r/s     w/s   rMB/s   wMB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
sdc          0.00    0.00    0.00    0.00    0.00    0.00      0.00      0.00   0.00     0.00     0.00   0.00   0.00
sdc1         0.00    0.00    0.00    0.00    0.00    0.00      0.00      0.00   0.00     0.00     0.00   0.00   0.00
sdc2         0.00    0.00    0.00    0.00    0.00    0.00      0.00      0.00   0.00     0.00     0.00   0.00   0.00
sde          0.00    0.00    0.00    0.00    0.00    0.00      0.00      0.00   0.00     0.00     0.00   0.00   0.00
sde1         0.00    0.00    0.00    0.00    0.00    0.00      0.00      0.00   0.00     0.00     0.00   0.00   0.00
sde2         0.00    0.00    0.00    0.00    0.00    0.00      0.00      0.00   0.00     0.00     0.00   0.00   0.00
sdd          0.00    0.00    0.00  250.00    0.00   64.07    524.83      0.30   1.20     0.00     1.20   0.11   2.80
sdd1         0.00    0.00    0.00  121.00    0.00   32.07    542.74      0.16   1.32     0.00     1.32   0.13   1.60
sdd2         0.00    0.00    0.00  120.00    0.00   32.00    546.13      0.14   1.17     0.00     1.17   0.10   1.20
sdb          0.00    0.00    0.00    0.00    0.00    0.00      0.00      0.00   0.00     0.00     0.00   0.00   0.00
sdb1         0.00    0.00    0.00    0.00    0.00    0.00      0.00      0.00   0.00     0.00     0.00   0.00   0.00
sdb2         0.00    0.00    0.00    0.00    0.00    0.00      0.00      0.00   0.00     0.00     0.00   0.00   0.00
As you can see, sde and sdd were hit but not at the same time.
Any more ideas?
Hrm! Were you using 4MB of data with rados put? Also, I don't know how
much extra latency running "rados put" would add from start to finish.
Is it slower than RADOS bench when you loop it? It may not show much
concurrency if the writes on the OSDs are finishing quickly and waiting
on the next operation to start.
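If it helps, one way to get a feel for the per-operation overhead of "rados
put" might be to time a single put directly, something like this (the object
name and source file are placeholders):
$ time rados -p 07:59:54_performance put timing_test ./4mb_object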
Another thing to look at with RADOS bench is the average latency being
reported at each reporting interval. Is it better than you'd expect for
a 4MB write to a single spinning disk? That might tell you how much of
an effect cache is having.
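As a rough sanity check (assuming ~100-150 MB/s sequential throughput for a
single 7200 RPM disk), a 4 MB write needs on the order of 30 ms of transfer
time alone (4 MiB / ~130 MB/s ≈ 32 ms) before any seek or commit overhead, so
per-op latencies well below that with -t 1 would suggest cache is absorbing
the writes. The summary latency can be pulled out with something like this
(the exact label may vary between Ceph versions):
$ sudo rados -p 07:59:54_performance bench 60 write -b 4194304 -t 1 --no-cleanup 2>&1 | grep -i latency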
––––
Sébastien Han
Cloud Engineer
"Always give 100%. Unless you're giving blood."
Phone: +33 (0)1 49 70 99 72 – Mobile: +33 (0)6 52 84 44 70
Email: sebastien.han@xxxxxxxxxxxx – Skype: han.sbastien
Address: 10, rue de la Victoire – 75009 Paris
Web: www.enovance.com – Twitter: @enovance
On Jul 9, 2013, at 1:11 PM, Mark Nelson <mark.nelson@xxxxxxxxxxx> wrote:
On 07/09/2013 03:20 AM, Sebastien Han wrote:
Hi all,
While running some benchmarks with the internal rados benchmarker, I
noticed something really strange. First of all, this is the command line I
used to run it:
$ sudo rados -p 07:59:54_performance bench 300 write -b 4194304 -t 1
--no-cleanup
So I want to test I/O with a concurrency of 1. I had a look at the
code and also straced the process, and I noticed that the IOs are sent one
by one, sequentially. Thus it does what I expect from it.
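For the record, the sequential behaviour shows up with something along these
lines (the syscall filter is only an example; the client may use different
syscalls depending on the messenger/build):
$ sudo strace -f -tt -e trace=sendmsg,recvmsg rados -p 07:59:54_performance bench 300 write -b 4194304 -t 1 --no-cleanup 2>&1 | less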
However, while monitoring the disk usage on all my OSDs, I found that they
were all loaded (writing to both journals and filestores), which is kind of
weird since the IOs are sent one by one. I was expecting that only one OSD
at a time would be writing.
Obviously there is no replication going on since I changed the rep size
to 1.
$ ceph osd dump |grep "07:59:54_performance"
pool 323 '07:59:54_performance' rep size 1 min_size 1 crush_ruleset 0
object_hash rjenkins pg_num 2048 pgp_num 2048 last_change 1306 owner 0
Thanks in advance, guys.
Hi Sebastien,
Do your SAS controllers have write-back cache, perchance? If so, you
may be seeing the OSDs immediately acknowledge journal writes before
the data actually makes it to disk, allowing the client to send the
next operation before the first one is truly committed. Even
without WB cache, you'll probably see a lot of concurrent journal
writes and OSD writes, since the OSD writes are just happening in the
background via buffered IO. There's also drive cache helping out here,
unless you've explicitly disabled it. I suspect some combination of the
above is what you are seeing.
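For example, watching how much dirty data is sitting in the page cache on an
OSD node while the benchmark runs can give a feel for the buffered writeback,
and sdparm can report whether a SAS drive's own write cache is on (purely
illustrative; the device name below is an example):
$ watch -n 1 'grep -E "^(Dirty|Writeback):" /proc/meminfo'
$ sudo sdparm --get=WCE /dev/sdb    # WCE = drive write cache enable bit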
Mark
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com