Re: RADOS Bench strange behavior

Hi Mark,

Yes, write-back caching is enabled since we have a BBU. The current cache policy on the controller is: WriteBack, ReadAheadNone, Direct.
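
For reference, that's the "Current Cache Policy" line reported by the controller CLI; on an LSI MegaRAID controller something along the lines of the following shows it (the exact binary name and flags vary between versions, so treat this as a sketch):

$ megacli -LDInfo -Lall -aALL | grep -i 'cache policy'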

FYI, both the journal and the filestore are stored on the same disks: sd*1 is the journal partition and sd*2 is the filestore partition.
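
In case it's useful, this is the kind of quick check that confirms which partition backs what, assuming the default OSD data paths under /var/lib/ceph/osd (adjust if your deployment mounts things elsewhere):

$ mount | grep /var/lib/ceph/osd
$ ls -l /var/lib/ceph/osd/ceph-*/journal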

To give you a bit more insight into the behaviour, here's what I see during a single "rados put":
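
These tables are extended iostat output in megabytes; an invocation roughly like the following produces the same layout (the 10-second interval and the device list are illustrative):

$ iostat -xm -p sdb,sdc,sdd,sde 10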

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdc               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdc1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdc2              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sde               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sde1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sde2              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdd               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdd1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdd2              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdb               0.00     0.30    0.00  100.40     0.00    25.63   522.85     0.07    0.69    0.00    0.69   0.08   0.80
sdb1              0.00     0.00    0.00   48.10     0.00    12.83   546.08     0.03    0.67    0.00    0.67   0.09   0.44
sdb2              0.00     0.30    0.00   49.00     0.00    12.81   535.27     0.04    0.75    0.00    0.75   0.07   0.36

And here's what I see with a "rados bench" with a concurrency of 1 on a pool with only one copy.

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdc               0.00     8.20    0.00   71.20     0.00    16.12   463.78     0.05    0.77    0.00    0.77   0.08   0.60
sdc1              0.00     0.00    0.00   30.20     0.00     8.02   543.63     0.02    0.81    0.00    0.81   0.11   0.32
sdc2              0.00     8.20    0.00   38.80     0.00     8.11   427.93     0.03    0.78    0.00    0.78   0.07   0.28
sde               0.00     1.20    0.00   57.70     0.00    14.42   511.94     0.06    0.96    0.00    0.96   0.14   0.80
sde1              0.00     0.00    0.00   27.20     0.00     7.21   543.24     0.02    0.81    0.00    0.81   0.10   0.28
sde2              0.00     1.20    0.00   28.50     0.00     7.21   518.01     0.03    1.16    0.00    1.16   0.18   0.52
sdd               0.00     1.50    0.00   78.40     0.00    19.24   502.50     0.08    1.08    0.00    1.08   0.12   0.92
sdd1              0.00     0.00    0.00   36.30     0.00     9.62   542.74     0.03    0.88    0.00    0.88   0.09   0.32
sdd2              0.00     1.50    0.00   39.40     0.00     9.62   499.86     0.05    1.33    0.00    1.33   0.15   0.60
sdb               0.00     2.10    0.00   80.70     0.00    20.04   508.49     0.06    0.69    0.00    0.69   0.07   0.60
sdb1              0.00     0.00    0.00   37.80     0.00    10.02   542.92     0.02    0.62    0.00    0.62   0.07   0.28
sdb2              0.00     2.10    0.00   40.10     0.00    10.02   511.54     0.03    0.80    0.00    0.80   0.08   0.32

This definitely looks like way too much activity to me…

I also tried to reproduce the rados bench behaviour by looping the "rados put" command; the loop was roughly along these lines (the object names, the count, and the input file are placeholders):
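
$ for i in $(seq 1 300); do rados -p 07:59:54_performance put loop_obj_$i /tmp/obj_4M; done

Then I got something like: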

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdc               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdc1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdc2              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sde               0.00     0.00    0.00  589.00     0.00   152.15   529.03     0.55    0.93    0.00    0.93   0.12   6.80
sde1              0.00     0.00    0.00  285.00     0.00    76.15   547.20     0.22    0.77    0.00    0.77   0.13   3.60
sde2              0.00     0.00    0.00  285.00     0.00    76.00   546.13     0.32    1.14    0.00    1.14   0.11   3.20
sdd               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdd1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdd2              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdb1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdb2              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdc               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdc1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdc2              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sde               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sde1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sde2              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdd               0.00     0.00    0.00  250.00     0.00    64.07   524.83     0.30    1.20    0.00    1.20   0.11   2.80
sdd1              0.00     0.00    0.00  121.00     0.00    32.07   542.74     0.16    1.32    0.00    1.32   0.13   1.60
sdd2              0.00     0.00    0.00  120.00     0.00    32.00   546.13     0.14    1.17    0.00    1.17   0.10   1.20
sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdb1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdb2              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00

As you can see, sde and sdd were hit but not at the same time.

Any more ideas?

––––
Sébastien Han
Cloud Engineer

"Always give 100%. Unless you're giving blood."

Phone : +33 (0)1 49 70 99 72 – Mobile : +33 (0)6 52 84 44 70
Email : sebastien.han@xxxxxxxxxxxx – Skype : han.sbastien
Address : 10, rue de la Victoire – 75009 Paris
Web : www.enovance.com – Twitter : @enovance

On Jul 9, 2013, at 1:11 PM, Mark Nelson <mark.nelson@xxxxxxxxxxx> wrote:

On 07/09/2013 03:20 AM, Sebastien Han wrote:
Hi all,

While running some benchmarks with the internal rados benchmarker I
noticed something really strange. First of all, this is the line I used
to run it:

$ sudo rados -p 07:59:54_performance bench 300 write -b 4194304 -t 1
--no-cleanup

So I want to test IO with a concurrency of 1. I had a look at the code
and also straced the process, and I noticed that the IOs are sent one by
one, sequentially, so it does what I expect from it (a rough strace
sketch is below).
However, while monitoring the disk usage on all my OSDs, I found out
that they were all loaded (writing, on both journals and filestores),
which is kind of weird since all the IOs are sent one by one. I was
expecting only one OSD at a time to be writing.
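
For anyone who wants to reproduce the check, something along these lines
shows each request going out and its reply coming back before the next
one is sent (the strace flags are illustrative):

$ strace -f -tt -e trace=network rados -p 07:59:54_performance bench 300 write -b 4194304 -t 1 --no-cleanup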

Obviously there is no replication going on since I changed the rep size
to 1.

$ ceph osd dump |grep "07:59:54_performance"
pool 323 '07:59:54_performance' rep size 1 min_size 1 crush_ruleset 0
object_hash rjenkins pg_num 2048 pgp_num 2048 last_change 1306 owner 0

Thanks in advance, guys.


Hi Sebastien,

Do your SAS controllers have writeback cache, perchance?  If so, you may be seeing the OSDs immediately acknowledge journal writes before the data actually makes it to disk, allowing the client to send the next operation before the first one is truly committed.  Even without WB cache, you'll probably see a lot of concurrent journal and OSD writes, since the OSD data writes just happen in the background via buffered IO.  There's also the drive cache helping out here, unless you've explicitly disabled it.  I suspect some combination of the above is what you are seeing.
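
If you want to rule the drive cache out specifically, for plain SATA drives a quick check/disable looks something like the following; SAS drives behind the controller usually need sdparm or the controller's own CLI instead:

$ sudo hdparm -W /dev/sdb      # show the current write-cache setting
$ sudo hdparm -W0 /dev/sdb     # disable the on-drive write cache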

Mark

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
