Re: RADOS Bench strange behavior


 



Hrm!  Were you using 4MB of data with rados put?  
Also, I don't know how much extra latency running "rados put" would add from start to finish. Is it slower than RADOS bench when you loop it?  It may not show much concurrency if the writes on the OSDs are finishing quickly and waiting on the next operation to start.

The file was 1 GB; that was just for the test. I tried again with 4M objects: the loop took around 60 seconds to put 1000 objects of 4M, against 35 seconds for rados bench. So rados bench looks almost 2 times faster.
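For reference, the loop was something along these lines (the object naming and the 4M source file are illustrative, not the exact script I ran):

$ for i in $(seq 1 1000); do rados -p 07:59:54_performance put object_$i ./file_4M; done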
I assume rados put returns the ack as soon as the object is stored on the backend filesystem.

Another thing to look at with RADOS bench is the average latency being reported at each reporting interval.  Is it better than you'd expect for a 4MB write to a single spinning disk?  That might tell you how much of an effect cache is having.

The average latency was 0.0378687 s. The disks are Seagate ST91000640SS, 931.012 GB. The average latency announced by Seagate is 4.16 ms, thus it's better than expected.
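(Rough arithmetic for context, assuming 4 MiB per write: 4 MiB / 0.0379 s ≈ 110 MB/s per operation, which is in the same ballpark as the sequential rate of a single 7200 RPM disk.)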

––––
Sébastien Han
Cloud Engineer

"Always give 100%. Unless you're giving blood."









Phone : +33 (0)1 49 70 99 72 – Mobile : +33 (0)6 52 84 44 70
Email : sebastien.han@xxxxxxxxxxxx – Skype : han.sbastien
Address : 10, rue de la Victoire – 75009 Paris
Web : www.enovance.com – Twitter : @enovance

On Jul 9, 2013, at 2:19 PM, Mark Nelson <mark.nelson@xxxxxxxxxxx> wrote:

On 07/09/2013 06:47 AM, Sebastien Han wrote:
Hi Mark,

Yes, write-back caching is enabled since we have a BBU. See the current
cache policy of the controller: WriteBack, ReadAheadNone and Direct.
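If it's an LSI MegaRAID controller (the WriteBack/ReadAheadNone/Direct wording suggests so), the policy can be double-checked with something like the following; the exact binary name varies by install:

$ sudo MegaCli -LDInfo -LAll -aAll | grep -i 'cache policy'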

FYI, both journal and filestore are stored on the same disks, thus sd*1
is the journal and sd*2 is the filestore.

To give you a little more detail about the behaviour, here's what I
see when I do a "rados put":
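(Extended iostat output below; the invocation was roughly the following, with the interval and device list assumed:)

$ iostat -xm -p sdb,sdc,sdd,sde 10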

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdc               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdc1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdc2              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sde               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sde1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sde2              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdd               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdd1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdd2              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdb               0.00     0.30    0.00  100.40     0.00    25.63   522.85     0.07    0.69    0.00    0.69   0.08   0.80
sdb1              0.00     0.00    0.00   48.10     0.00    12.83   546.08     0.03    0.67    0.00    0.67   0.09   0.44
sdb2              0.00     0.30    0.00   49.00     0.00    12.81   535.27     0.04    0.75    0.00    0.75   0.07   0.36

And here's what I see with a "rados bench" with a concurrency of 1 on a
pool with only one copy.

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdc               0.00     8.20    0.00   71.20     0.00    16.12   463.78     0.05    0.77    0.00    0.77   0.08   0.60
sdc1              0.00     0.00    0.00   30.20     0.00     8.02   543.63     0.02    0.81    0.00    0.81   0.11   0.32
sdc2              0.00     8.20    0.00   38.80     0.00     8.11   427.93     0.03    0.78    0.00    0.78   0.07   0.28
sde               0.00     1.20    0.00   57.70     0.00    14.42   511.94     0.06    0.96    0.00    0.96   0.14   0.80
sde1              0.00     0.00    0.00   27.20     0.00     7.21   543.24     0.02    0.81    0.00    0.81   0.10   0.28
sde2              0.00     1.20    0.00   28.50     0.00     7.21   518.01     0.03    1.16    0.00    1.16   0.18   0.52
sdd               0.00     1.50    0.00   78.40     0.00    19.24   502.50     0.08    1.08    0.00    1.08   0.12   0.92
sdd1              0.00     0.00    0.00   36.30     0.00     9.62   542.74     0.03    0.88    0.00    0.88   0.09   0.32
sdd2              0.00     1.50    0.00   39.40     0.00     9.62   499.86     0.05    1.33    0.00    1.33   0.15   0.60
sdb               0.00     2.10    0.00   80.70     0.00    20.04   508.49     0.06    0.69    0.00    0.69   0.07   0.60
sdb1              0.00     0.00    0.00   37.80     0.00    10.02   542.92     0.02    0.62    0.00    0.62   0.07   0.28
sdb2              0.00     2.10    0.00   40.10     0.00    10.02   511.54     0.03    0.80    0.00    0.80   0.08   0.32

This definitely looks like way too much to me…

I also tried to reproduce the rados bench behaviour by looping the "rados
put" command, and then I got something like:

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdc               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdc1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdc2              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sde               0.00     0.00    0.00  589.00     0.00   152.15   529.03     0.55    0.93    0.00    0.93   0.12   6.80
sde1              0.00     0.00    0.00  285.00     0.00    76.15   547.20     0.22    0.77    0.00    0.77   0.13   3.60
sde2              0.00     0.00    0.00  285.00     0.00    76.00   546.13     0.32    1.14    0.00    1.14   0.11   3.20
sdd               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdd1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdd2              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdb1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdb2              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdc               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdc1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdc2              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sde               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sde1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sde2              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdd               0.00     0.00    0.00  250.00     0.00    64.07   524.83     0.30    1.20    0.00    1.20   0.11   2.80
sdd1              0.00     0.00    0.00  121.00     0.00    32.07   542.74     0.16    1.32    0.00    1.32   0.13   1.60
sdd2              0.00     0.00    0.00  120.00     0.00    32.00   546.13     0.14    1.17    0.00    1.17   0.10   1.20
sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdb1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdb2              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00

As you can see, sde and sdd were hit but not at the same time.

Any more ideas?

Hrm!  Were you using 4MB of data with rados put?  Also, I don't know how much extra latency running "rados put" would add from start to finish. Is it slower than RADOS bench when you loop it?  It may not show much concurrency if the writes on the OSDs are finishing quickly and waiting on the next operation to start.

Another thing to look at with RADOS bench is the average latency being reported at each reporting interval.  Is it better than you'd expect for a 4MB write to a single spinning disk?  That might tell you how much of an effect cache is having.


––––
Sébastien Han
Cloud Engineer

"Always give 100%. Unless you're giving blood."









Phone : +33 (0)1 49 70 99 72 – Mobile : +33 (0)6 52 84 44 70
Email : sebastien.han@xxxxxxxxxxxx – Skype : han.sbastien
Address : 10, rue de la Victoire – 75009 Paris
Web : www.enovance.com – Twitter : @enovance

On Jul 9, 2013, at 1:11 PM, Mark Nelson <mark.nelson@xxxxxxxxxxx> wrote:

On 07/09/2013 03:20 AM, Sebastien Han wrote:
Hi all,

While running some benchmarks with the built-in rados benchmark tool, I
noticed something really strange. First of all, this is the line I used
to run it:

$ sudo rados -p 07:59:54_performance bench 300 write -b 4194304 -t 1
--no-cleanup

So I want to test I/O with a concurrency of 1. I had a look at the
code and also straced the process, and I noticed that the IOs are sent one
by one, sequentially. Thus it does what I expect from it.
However, while monitoring disk usage on all my OSDs, I found out
that they were all loaded (writing, both journals and filestore), which
is kind of weird since all the IOs are sent one by one. I was expecting
that only one OSD at a time would be writing.

Obviously there is no replication going on since I changed the rep size
to 1.
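(Set with something along these lines; exact invocation is from memory:)

$ ceph osd pool set 07:59:54_performance size 1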

$ ceph osd dump |grep "07:59:54_performance"
pool 323 '07:59:54_performance' rep size 1 min_size 1 crush_ruleset 0
object_hash rjenkins pg_num 2048 pgp_num 2048 last_change 1306 owner 0

Thanks in advance, guys.


Hi Sebastien,

Do your SAS controllers have writeback cache perchance?  If so, you
may be seeing the OSDs immediately acknowledge journal writes before
the data actually makes it to disk, allowing the client to send the
next operation before the first is truly committed.  Even
without WB cache, you'll probably see a lot of concurrent journal
writes and OSD writes since the OSD writes are just happening in the
background via buffered IO.  There's also drive cache helping out here
unless you've explicitly disabled it.  I suspect some combination of
the above is what you are seeing.
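
One way to check the drive-cache part, assuming sdparm is available for these SAS drives, would be something like:

$ sudo sdparm --get=WCE /dev/sdb    # 1 = volatile write cache enabled
$ sudo sdparm --clear=WCE /dev/sdb  # disable it for a re-test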

Mark




_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
