On 10/2/2013 3:50 PM, Sage Weil wrote:
On Wed, 2 Oct 2013, Eric Lee Green wrote:
By contrast, that same dd to an iSCSI volume exported by one of the servers
wrote at 240 megabytes per second, roughly a factor of five difference.
Can you see what 'rados -p rbd bench 60 write' tells you?
Pretty much the same as what I got with the dd smoketest:
Total time run: 62.526671
Total writes made: 770
Write size: 4194304
Bandwidth (MB/sec): 49.259
Stddev Bandwidth: 36.0099
Max bandwidth (MB/sec): 120
Min bandwidth (MB/sec): 0
Average Latency: 1.29088
Stddev Latency: 1.75083
Max latency: 11.2005
Min latency: 0.102783
[root@stack1 ~]#
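(A quick sanity check on those numbers, assuming the bench was left at
its default of 16 concurrent operations: 16 ops x 4 MB / 1.29 s average
latency works out to roughly 50 MB/s, which matches the reported 49.259
MB/sec. If that assumption holds, per-write latency rather than raw
bandwidth is what's limiting things. A follow-up worth trying, not run
here, is to vary the queue depth and see how the number moves:

    rados -p rbd bench 60 write -t 1    # single outstanding write, like dd
    rados -p rbd bench 60 write -t 32   # more parallelism

If -t 32 scales up while -t 1 collapses, single-stream latency is the
limiting factor.)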
I suspect the problem here is an unfortunate combination of what dd does
(1 outstanding write at a time) and what iSCSI is probably doing
(acknowledging the write before it is written to the disk; I'm guessing a
write to /dev/* doesn't also send a SCSI flush). This lets you approach
the disk or network bandwidth even though the client/app (dd) is only
dispatching a single 512K IO at a time.
My experience is that while what dd is doing is not reflective of what a
filesystem does, with a block size that large it doesn't matter: 512 KB
outstanding is sufficient that the latency of issuing writes is no longer
an issue. What I get from dd is within a few percent of what I get
copying a large file onto the filesystem or doing other similar tasks
that stream data onto (or off of) the drive.
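(To put rough numbers on that: at 240 MB/s a 512 KB write completes in
about 2 ms, so one outstanding IO keeps the device busy almost
continuously. Single-stream throughput is roughly block size divided by
per-write latency, though, so if each 512 KB write instead took 50 ms to
be acknowledged, one IO at a time would top out around 10 MB/s no matter
how fast the disks are. Those figures are illustrative, not measured.)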
I'm curious if the iSCSI number changes if you add oflag=direct or
oflag=sync.
It's also worth pointing out that no sane file system would do what dd is
doing (a single outstanding IO), except perhaps during commit/sync time
when it is carefully ordering IOs. You might want to try the dd to a file
inside a mounted fs instead of to the raw device.
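(For reference, the suggested variants would look something like the
following; the device path, mount point, and sizes here are placeholders,
not the exact commands from the earlier test:

    dd if=/dev/zero of=/dev/sdX bs=512k count=2000 oflag=direct  # raw device, bypass page cache
    dd if=/dev/zero of=/dev/sdX bs=512k count=2000 oflag=sync    # raw device, synchronous writes
    dd if=/dev/zero of=/mnt/test/bigfile bs=512k count=2000 conv=fsync  # file on a mounted fs, flushed at the end

The oflag=direct/sync runs take caching out of the picture on the iSCSI
side, and the last one is the 'file inside a mounted fs' case.)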
Well, I created a hidden directory in one of the ceph data stores and
copied a file into it to see what dd would do in that case. Note that
this is with 512-byte blocks (dd's default), copying from one xfs
filesystem to another xfs filesystem:
[root@storage1 .t]# dd if=/export/home1/linux.tgz of=linux.tgz
9177080+0 records in
9177080+0 records out
4698664960 bytes (4.7 GB) copied, 15.5054 s, 303 MB/s
[root@storage1 .t]#
That is about what I would expect when reading from one SAS channel and
writing to the other (3 Gbit/s SAS channels). I've benchmarked this
combination for streaming writes before, and it can sustain that
bandwidth pretty much indefinitely.
My conclusion at the moment is that a) ceph isn't a good match for my
infrastructure, since it really wants its own dedicated hardware with no
RAID, and b) even on dedicated hardware I should not expect much more
than the above for single-stream writes, though aggregate write
performance should scale. Unfortunately, (b) doesn't describe my
workload, where aggregate bandwidth requirements are modest but burst
(single-stream) bandwidth requirements are high.
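(If the aggregate case ever becomes worth verifying, a rough way to
check it is several simultaneous streams, e.g. to separate RBD images;
the device names below are placeholders:

    for i in 0 1 2 3; do
        dd if=/dev/zero of=/dev/rbd$i bs=4M count=1000 oflag=direct &
    done
    wait

If the combined rate climbs well past the ~50 MB/s single-stream figure,
the cluster is latency-bound per stream rather than bandwidth-bound
overall.)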