Re: Poor write/random read/random write performance

On 08/19/2013 08:59 AM, Da Chun Ng wrote:
> Thanks very much, Mark!
> Yes, I put the data and journal on the same disk, no SSD in my environment.
> My controllers are general SATA II.

Ok, so in this case the lack of a writeback (WB) cache on the controller and
the absence of SSDs for journals are probably having an effect.

> 
> Some more questions inline below.
> 
> ------------------------------------------------------------------------
> Date: Mon, 19 Aug 2013 07:48:23 -0500
> From: mark.nelson@xxxxxxxxxxx
> To: ceph-users@xxxxxxxxxxxxxx
> Subject: Re:  Poor write/random read/random write performance
> 
> On 08/19/2013 06:28 AM, Da Chun Ng wrote:
> 
>     I have a 3-node, 15-OSD Ceph cluster setup:
>     * 15 7200 RPM SATA disks, 5 for each node.
>     * 10G network
>     * Intel(R) Xeon(R) CPU E5-2620 (6 cores) @ 2.00GHz for each node.
>     * 64G RAM for each node.
> 
>     I deployed the cluster with ceph-deploy, and created a new data pool
>     for cephfs.
>     Both the data and metadata pools are set with replica size 3.
>     Then mounted the cephfs on one of the three nodes, and tested the
>     performance with fio.
> 
>     The sequential read performance looks good:
>     fio -direct=1 -iodepth 1 -thread -rw=read -ioengine=libaio -bs=16K
>     -size=1G -numjobs=16 -group_reporting -name=mytest -runtime 60
>     read : io=10630MB, bw=181389KB/s, iops=11336 , runt= 60012msec
> 
> 
> Sounds like readahead and/or caching is helping out a lot here. Btw, you
> might want to make sure this is actually coming from the disks with
> iostat or collectl or something.
> 
> I ran "sync && echo 3 | tee /proc/sys/vm/drop_caches" on all the nodes 
> before every test. I used collectl to watch every disk IO, the numbers 
> should match. I think readahead is helping here.

Ok, good!  I suspect that readahead is indeed helping.
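
If you want to double check on a future run, something like this (iostat
comes with the sysstat package; swap in the actual device names of the five
OSD disks on each node, the ones below are just placeholders) shows
per-device throughput while fio is running:

  # extended stats in KB, 5 second samples, restricted to the OSD disks
  iostat -xk 5 sde sdf sdg sdh sdi

If the rkB/s summed across the disks roughly matches what fio reports, the
reads really are hitting the spindles; if it's much lower, the page cache
or readahead is serving a lot of it.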

> 
> 
>     But the sequential write/random read/random write performance is
>     very poor:
>     fio -direct=1 -iodepth 1 -thread -rw=write -ioengine=libaio -bs=16K
>     -size=256M -numjobs=16 -group_reporting -name=mytest -runtime 60
>     write: io=397280KB, bw=6618.2KB/s, iops=413 , runt= 60029msec
> 
> 
> One thing to keep in mind is that unless you have SSDs in this system, 
> you will be doing 2 writes for every client write to the spinning disks 
> (since data and journals will both be on the same disk).
> 
> So let's do the math:
> 
> 6618.2KB/s * 3 replication * 2 (journal + data writes) * 1024 
> (KB->bytes) / 16384 (write size in bytes) / 15 drives = ~165 IOPS / drive
> 
> If there is no write coalescing going on, this isn't terrible.  If there 
> is, this is terrible.
> 
> How can I know if there is write coalescing going on?

Look in collectl at the average IO sizes going to the disks.  I bet they
will be 16KB.  If you were to look further with blktrace and
seekwatcher, I bet you'd see lots of seeking between OSD data writes and
journal writes, since there is no controller cache helping smooth things
out (and your journals are on the same drives).
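
If you want to check, something like this is a reasonable starting point
(a rough sketch; /dev/sdi is just the disk from your blockdev output and
the trace file names are arbitrary):

  # per-disk detail, 2 second samples; compare KB/s to IOs/s to get the
  # average request size hitting each drive
  collectl -sD -oT -i2

  # trace one OSD disk for 60 seconds while fio runs, then plot the seeks
  blktrace -d /dev/sdi -o osd-trace -w 60
  seekwatcher -t osd-trace -o osd-trace.png

If the seekwatcher plot shows the head bouncing between two LBA regions
(the journal and the OSD data), that's the seek overhead I mentioned.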

> 
> Have you tried buffered writes with the sync engine at the same IO size?
> 
> Do you mean as below?
> fio -direct=0 -iodepth 1 -thread -rw=write -ioengine=sync -bs=16K 
> -size=256M -numjobs=16 -group_reporting -name=mytest -runtime 60

Yeah, that'd work.

> 
>     fio -direct=1 -iodepth 1 -thread -rw=randread -ioengine=libaio
>     -bs=16K -size=256M -numjobs=16 -group_reporting -name=mytest -runtime 60
>     read : io=665664KB, bw=11087KB/s, iops=692 , runt= 60041msec
> 
> 
> In this case:
> 
> 11087 * 1024 (KB->bytes) / 16384 / 15 = ~46 IOPS / drive.
> 
> Definitely not great!  You might want to try fiddling with read ahead 
> both on the CephFS client and on the block devices under the OSDs 
> themselves.
> 
> Could you please tell me how to enable read ahead on the CephFS client?

It's one of the mount options:

http://ceph.com/docs/master/man/8/mount.ceph/
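
The exact option depends on your kernel client version, but it's the rsize
(and, on newer kernels, rasize) mount option.  Roughly something like this,
with placeholder monitor address, secret file, and size:

  # e.g. allow up to 4MB reads/readahead on the kernel client
  mount -t ceph 192.168.1.10:6789:/ /mnt/cephfs \
        -o name=admin,secretfile=/etc/ceph/admin.secret,rsize=4194304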

> 
> For the block devices under the OSDs, the read ahead value is:
> [root@ceph0 ~]# blockdev --getra /dev/sdi
> 256
> How big is appropriate for it?

To be honest I've seen different results depending on the hardware.  I'd
try anywhere from 32KB to 2048KB.  (Note that blockdev reports readahead
in 512-byte sectors, so your current value of 256 is 128KB.)
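
For example, to try 2048KB on one OSD disk (blockdev takes the value in
512-byte sectors, and sdi is just the device from your output):

  # 2048KB = 4096 x 512-byte sectors
  blockdev --setra 4096 /dev/sdi
  blockdev --getra /dev/sdi   # should now report 4096

This doesn't survive a reboot, so once you settle on a value you'd want to
set it from a udev rule or rc.local.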

> 
> One thing I did notice back during bobtail is that increasing the number 
> of osd op threads seemed to help small object read performance.  It 
> might be worth looking at too.
> 
> http://ceph.com/community/ceph-bobtail-jbod-performance-tuning/#4kbradosread
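
For reference, that's the osd op threads option in the [osd] section of
ceph.conf on each node, followed by restarting the OSDs.  I believe the
default at the time was 2; the value below is just an example to
experiment with:

  [osd]
      osd op threads = 4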
> 
> Other than that, if you really want to dig into this, you can use tools 
> like iostat, collectl, blktrace, and seekwatcher to try and get a feel 
> for what the IO going to the OSDs looks like.  That can help when 
> diagnosing this sort of thing.
> 
>     fio -direct=1 -iodepth 1 -thread -rw=randwrite -ioengine=libaio
>     -bs=16K -size=256M -numjobs=16 -group_reporting -name=mytest -runtime 60
>     write: io=361056KB, bw=6001.1KB/s, iops=375 , runt= 60157msec
> 
> 
> 6001.1KB/s * 3 replication * 2 (journal + data writes) * 1024 
> (KB->bytes) / 16384 (write size in bytes) / 15 drives = ~150 IOPS / drive
> 
> 
>     I am mostly surprised by the seq write performance compared to the
>     raw SATA disk performance (it can get 4127 IOPS when mounted with
>     ext4). My CephFS only gets 1/10 of the raw disk's performance.
> 
> 
> 7200 RPM spinning disks typically top out at something like 150 IOPS 
> (and some are lower).  With 15 disks, to hit 4127 IOPS you were probably 
> seeing some write coalescing effects (or if these were random reads, 
> some benefit to read ahead).
> 
> 
>     How can I tune my cluster to improve the sequential write/random
>     read/random write performance?
> 
> I don't know what kind of controller you have, but in cases where 
> journals are on the same disks as the data, using writeback cache helps 
> a lot because the controller can coalesce the direct IO journal writes 
> in cache and just do big periodic dumps to the drives.  That really 
> reduces seek overhead for the writes.  Using SSDs for the journals 
> accomplishes much of the same effect, and lets you get faster large IO 
> writes too, but in many chassis there is a density (and cost) trade-off.
> 
> Hope this helps!
> 
> Mark
> 

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



