Re: To put journals to SSD or not?


 



How do you test the random behavior of the disks; what's a good setup?
If I understand correctly, Ceph writes in 4 MB blocks, and I also expect roughly a 50%/50% read/write ratio for our workloads. What else do I have to take into consideration?
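Would something like the following fio run be a reasonable starting point? (This is only a sketch: it assumes fio is installed on the test node, and the file path, size and queue depth are placeholders, not anything Ceph-specific.)

fio --name=randrw-test --filename=/mnt/fio-testfile --size=4G --direct=1 --ioengine=libaio --rw=randrw --rwmixread=50 --bs=4k --iodepth=16 --runtime=60 --time_based --group_reporting

That should show the mixed random read/write throughput and IOPS of a single disk without the Ceph layer involved.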

Also, what I don't yet understand: in my performance tests I get pretty nice rados bench results:
(the OSD nodes have a 1Gb public and a 1Gb sync interface; the test node has a 10Gb NIC on the public network)

rados bench -p test 30 write --no-cleanup
Bandwidth (MB/sec):     128.801 (here the 1Gb sync network is clearly the bottleneck)

rados bench -p test 30 seq
Bandwidth (MB/sec):    303.140 (here it's the 1Gb public interface of the 3 nodes)

But if I run a sequential workload against an RBD device from a pool with the same settings as the test pool above, the results are as follows:

sudo dd if=/dev/zero of=/mnt/testfile bs=4M count=100 oflag=direct
419430400 bytes (419 MB) copied, 5.97832 s, 70.2 MB/s

I cannot identify the bottleneck here: no network interface is at its limit, the CPUs are below 10%, and iostat shows all disks working with OK numbers. The only difference I see is that ceph -w shows many more ops than with the rados bench. Any idea how I could identify the bottleneck here? Or is it just the single dd thread?
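One thing I could try, to rule out the single dd thread, is running several dd writers in parallel and comparing the aggregate throughput (the paths below are just placeholders for files on the same RBD mount):

for i in 1 2 3 4; do dd if=/dev/zero of=/mnt/testfile$i bs=4M count=100 oflag=direct & done; wait

If the combined throughput gets close to the rados bench numbers, the limit is probably the single-threaded dd rather than the cluster.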

Regards
Andi


-----Original Message-----
From: ceph-users-bounces@xxxxxxxxxxxxxx [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Martin Rudat
Sent: Monday, 2 September 2013 01:44
To: ceph-users@xxxxxxxxxxxxxx
Subject: Re:  To put journals to SSD or not?

On 2013-09-02 05:19, Fuchs, Andreas (SwissTXT) wrote:
> Reading through the documentation and talking to several people leads to the conclusion that it's best practice to place the journal of an OSD instance on a separate SSD to speed up writes.
>
> But is this true? I have 3 new Dell servers available for testing, with 12 x 4 TB SATA and 2 x 100 GB SSD disks. I don't have the exact specs at hand, but tests show:
>
> The SATA disks' sequential write speed is 300 MB/s. The SSD, which is in a RAID1 config, manages only 270 MB/s! It was probably not the most expensive model.
>
> When we put the journals on the OSDs, I can expect a sequential write speed of 12 x 150 MB/s (one write to the journal, one to the disk), which is 1800 MB/s per node.
The thing is that, unless you've got a magical workload, you're not going to see sequential write speeds from your spinning disks. At a minimum, a write to the journal at the beginning of the disk followed by a write to the data area elsewhere on the disk performs the same as random I/O, because the disk has to seek, on average, half-way across the platter each time it commits a new transaction. This gets worse when you also take random reads into account, which cause even more seeks.

Sequential read on the disks I've got is about 180 MB/s (they're cheap, slow disks); random read/write on the array seems to peak around 10 MB/s per disk.

I'd benchmark your random I/O performance and use that to choose how many SSDs you need, and how fast they have to be.
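As a rough sketch (assuming GNU dd; the mount point and count below are only placeholders), you can get an idea of how an SSD behaves under the journal's synchronous write pattern with something like:

dd if=/dev/zero of=/mnt/ssd-test/journal-probe bs=4k count=100000 oflag=direct,dsync

An SSD that holds up under sync writes like that is a much better journal candidate than one that only looks good in sequential benchmarks.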

I've actually got a 4-disk external hot-swap SATA cage on order that connects over a USB3 or eSATA link. Sequential read/write, even with the slow disks I've got, will saturate the link, but filled with spinning disks doing random I/O there should be plenty of headroom available. It'll be interesting to see whether it's a worthwhile investment compared to having to open a computer up to change disks.

--
Martin Rudat


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



