Re: Ceph performance improvement

On 22/08/12 09:54, Denis Fondras wrote:

> The only point that prevents me from using it at datacenter-scale is
> performance.
>
> Here are some figures:
> * Test with "dd" on the OSD server (on drive
> /dev/disk/by-id/scsi-SATA_WDC_WD30EZRX-00_WD-WMAWZ0152201):
> # dd if=/dev/zero of=testdd bs=4k count=4M
> 17179869184 bytes (17 GB) written, 123,746 s, 139 MB/s

That looks like you're writing to a filesystem on that disk, rather than the block device itself -- but let's say you've got 139MB/sec (1112Mbit/sec) of straight-line performance.

Note: this is already faster than your network link can go -- you can, at best, only achieve 120MB/sec over your gigabit link.
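
If you want a figure that reflects the disk rather than the page cache, you could re-run with larger blocks and an explicit flush -- the sizes below are only illustrative:

# dd if=/dev/zero of=testdd bs=4M count=4096 conv=fdatasync

conv=fdatasync makes dd flush before reporting the rate, so the figure includes the time taken to get the data onto the platters. A non-destructive sequential read of the raw device is also possible, e.g.:

# dd if=/dev/disk/by-id/scsi-SATA_WDC_WD30EZRX-00_WD-WMAWZ0152201 of=/dev/null bs=4M count=1024 iflag=direct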

> * Test with "dd" from the client using RBD:
> # dd if=/dev/zero of=testdd bs=4k count=4M
> 17179869184 bytes (17 GB) written, 406,941 s, 42,2 MB/s

Is this a dd to the RBD device directly, or is this a write to a file in a filesystem created on top of it?

dd will write blocks synchronously -- that is, it will write one block, wait for the write to complete, then write the next block, and so on. Because of the durability guarantees provided by Ceph, this will result in dd doing a lot of waiting around while writes are being sent over the network and written out on your OSD.

(With the default replication count of 2, probably twice over -- though I'm not exactly sure what Ceph does when it only has one OSD to work with.)
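
One way to see how much of the gap is down to dd issuing a single small request at a time would be to retry with larger blocks and direct I/O against the RBD device itself -- note that /dev/rbd0 below is just a guess at how the image is mapped on your client, and writing to it will destroy any filesystem already on it:

# dd if=/dev/zero of=/dev/rbd0 bs=4M count=1024 oflag=direct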

> * Test with unpacking and deleting OpenBSD/5.1 src.tar.gz from the
> client using RBD:
> # time tar xzf src.tar.gz
> real    0m26.955s
> user    0m9.233s
> sys     0m11.425s

Ignoring networking and storage for a moment, this also isn't a fair test: you're comparing the decompress-and-unpack time of a 139MB tarball on a 3GHz Pentium 4 with 1GB of RAM against the same operation on a quad-core Xeon E5 with 8GB.

Even ignoring the relative CPU difference, unless you're doing something clever that you haven't described, there's no guarantee that the files in the latter case have actually been written to disk -- you have enough memory on your server for it to buffer all of those writes in RAM. You'd need to add a sync() call or similar at the end of your timing run to ensure that all of those writes have actually been committed to disk.
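
The usual idiom is something like the following, which includes the final flush in the measured time:

# time sh -c 'tar xzf src.tar.gz && sync'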

> * Test with "dd" from the client using CephFS:
> # dd if=/dev/zero of=testdd bs=4k count=4M
> 17179869184 bytes (17 GB) written, 338,29 s, 50,8 MB/s

Again, the synchronous nature of 'dd' is probably severely affecting apparent performance. I'd suggest looking at some other tools, like fio, bonnie++, or iozone, which might generate more representative load.

(Or, if you have a specific use-case in mind, something that generates an IO pattern like what you'll be using in production would be ideal!)
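
As a rough sketch, an fio run along these lines will keep several requests in flight and give a much more representative number than dd -- the job parameters and the /mnt/ceph-test directory are only examples, so point it at wherever your RBD or CephFS mount actually lives:

# fio --name=seqwrite --directory=/mnt/ceph-test --rw=write --bs=4M --size=4G --ioengine=libaio --iodepth=16 --direct=1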

Cheers,
David
--
David McBride <dwm37@xxxxxxxxx>
Unix Specialist, University Computing Service