Thank you! Testing now.
How about pg_num? I'm using the default of 64; I tried (100 * osd_num) / replica_size, but it surprisingly decreased the performance.

> Date: Mon, 19 Aug 2013 11:33:30 -0500
> From: mark.nelson@xxxxxxxxxxx
> To: dachun.ng@xxxxxxxxxxx
> CC: ceph-users@xxxxxxxxxxxxxx
> Subject: Re: [ceph-users] Poor write/random read/random write performance
>
> On 08/19/2013 08:59 AM, Da Chun Ng wrote:
>
> > Thanks very much! Mark.
> > Yes, I put the data and journal on the same disk; there are no SSDs in
> > my environment.
> > My controllers are plain SATA II.
>
> Ok, so in this case the lack of WB cache on the controller and no SSDs
> for journals is probably having an effect.
>
> > Some more questions below in blue.
> >
> > ------------------------------------------------------------------------
> > Date: Mon, 19 Aug 2013 07:48:23 -0500
> > From: mark.nelson@xxxxxxxxxxx
> > To: ceph-users@xxxxxxxxxxxxxx
> > Subject: Re: [ceph-users] Poor write/random read/random write performance
> >
> > On 08/19/2013 06:28 AM, Da Chun Ng wrote:
> >
> > > I have a 3-node, 15-OSD Ceph cluster:
> > > * 15 7200 RPM SATA disks, 5 per node.
> > > * 10G network.
> > > * Intel(R) Xeon(R) CPU E5-2620 (6 cores) 2.00GHz per node.
> > > * 64G RAM per node.
> > >
> > > I deployed the cluster with ceph-deploy, and created a new data pool
> > > for CephFS.
> > > Both the data and metadata pools are set with replica size 3.
> > > I then mounted the CephFS on one of the three nodes, and tested the
> > > performance with fio.
> > >
> > > The sequential read performance looks good:
> > > fio -direct=1 -iodepth 1 -thread -rw=read -ioengine=libaio -bs=16K
> > > -size=1G -numjobs=16 -group_reporting -name=mytest -runtime 60
> > > read : io=10630MB, bw=181389KB/s, iops=11336, runt= 60012msec
> >
> > Sounds like readahead and/or caching is helping out a lot here. Btw, you
> > might want to make sure this is actually coming from the disks with
> > iostat or collectl or something.
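For reference, the (100 * osd_num) / replica_size rule of thumb I tried works out like this for this cluster (a sketch only; it rounds up to a power of two per the usual guidance, and the numbers come from the 15-OSD, replica-size-3 setup quoted below):

```shell
# Rule-of-thumb pg_num for 15 OSDs and replica size 3, rounded up to
# the next power of two.
osds=15
replicas=3
target=$(( 100 * osds / replicas ))   # 500
pg=1
while [ "$pg" -lt "$target" ]; do
    pg=$(( pg * 2 ))
done
echo "suggested pg_num: $pg"          # prints "suggested pg_num: 512"
```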
>
> > I ran "sync && echo 3 | tee /proc/sys/vm/drop_caches" on all the nodes
> > before every test. I used collectl to watch every disk IO, and the
> > numbers should match. I think readahead is helping here.
>
> Ok, good! I suspect that readahead is indeed helping.
>
> > > But the sequential write/random read/random write performance is
> > > very poor:
> > > fio -direct=1 -iodepth 1 -thread -rw=write -ioengine=libaio -bs=16K
> > > -size=256M -numjobs=16 -group_reporting -name=mytest -runtime 60
> > > write: io=397280KB, bw=6618.2KB/s, iops=413, runt= 60029msec
> >
> > One thing to keep in mind is that unless you have SSDs in this system,
> > you will be doing 2 writes for every client write to the spinning disks
> > (since data and journals will both be on the same disk).
> >
> > So let's do the math:
> >
> > 6618.2KB/s * 3 replication * 2 (journal + data writes) * 1024
> > (KB->bytes) / 16384 (write size in bytes) / 15 drives = ~165 IOPS / drive
> >
> > If there is no write coalescing going on, this isn't terrible. If there
> > is, this is terrible.
> >
> > How can I know if there is write coalescing going on?
>
> Look in collectl at the average IO sizes going to the disks. I bet they
> will be 16KB. If you were to look further with blktrace and
> seekwatcher, I bet you'd see lots of seeking between OSD data writes and
> journal writes since there is no controller cache helping smooth things
> out (and your journals are on the same drives).
>
> > Have you tried buffered writes with the sync engine at the same IO size?
> >
> > Do you mean as below?
> > fio -direct=0 -iodepth 1 -thread -rw=write -ioengine=sync -bs=16K
> > -size=256M -numjobs=16 -group_reporting -name=mytest -runtime 60
>
> Yeah, that'd work.
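Mark's back-of-envelope math above can be checked mechanically (a sketch; the bandwidth figure is the fio result quoted above, truncated to whole KB/s):

```shell
# Per-drive IOPS implied by the sequential-write result:
# client KB/s * replicas * (journal + data) / IO size in KB / drives.
bw_kbs=6618                  # fio: bw=6618.2KB/s (integer part)
replicas=3
writes_per_client_write=2    # journal + data on the same spindle
io_kb=16
drives=15
iops=$(( bw_kbs * replicas * writes_per_client_write / io_kb / drives ))
echo "~${iops} IOPS per drive"   # prints "~165 IOPS per drive"
```

This matches Mark's ~165 IOPS/drive estimate, which is close to the practical ceiling of a 7200 RPM spindle doing seeky small writes.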
>
> > > fio -direct=1 -iodepth 1 -thread -rw=randread -ioengine=libaio
> > > -bs=16K -size=256M -numjobs=16 -group_reporting -name=mytest -runtime 60
> > > read : io=665664KB, bw=11087KB/s, iops=692, runt= 60041msec
> >
> > In this case:
> >
> > 11087 * 1024 (KB->bytes) / 16384 / 15 = ~46 IOPS / drive.
> >
> > Definitely not great! You might want to try fiddling with readahead
> > both on the CephFS client and on the block devices under the OSDs
> > themselves.
> >
> > Could you please tell me how to enable readahead on the CephFS client?
>
> It's one of the mount options:
>
> http://ceph.com/docs/master/man/8/mount.ceph/
>
> > For the block devices under the OSDs, the readahead value is:
> > [root@ceph0 ~]# blockdev --getra /dev/sdi
> > 256
> > How big is appropriate for it?
>
> To be honest I've seen different results depending on the hardware. I'd
> try anywhere from 32KB to 2048KB.
>
> > One thing I did notice back during bobtail is that increasing the number
> > of osd op threads seemed to help small object read performance. It
> > might be worth looking at too.
> >
> > http://ceph.com/community/ceph-bobtail-jbod-performance-tuning/#4kbradosread
> >
> > Other than that, if you really want to dig into this, you can use tools
> > like iostat, collectl, blktrace, and seekwatcher to try and get a feel
> > for what the IO going to the OSDs looks like. That can help when
> > diagnosing this sort of thing.
>
> > > fio -direct=1 -iodepth 1 -thread -rw=randwrite -ioengine=libaio
> > > -bs=16K -size=256M -numjobs=16 -group_reporting -name=mytest -runtime 60
> > > write: io=361056KB, bw=6001.1KB/s, iops=375, runt= 60157msec
> >
> > 6001.1KB/s * 3 replication * 2 (journal + data writes) * 1024
> > (KB->bytes) / 16384 (write size in bytes) / 15 drives = ~150 IOPS / drive
> >
> > > I am mostly surprised by the seq write performance compared to the
> > > raw SATA disk performance (it can get 4127 IOPS when mounted with
> > > ext4).
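An untested sketch of the two readahead knobs Mark mentions above (the device names, monitor address, and mount point are placeholders, not taken from this thread):

```shell
# Block-device readahead under the OSDs: blockdev --setra takes 512-byte
# sectors, so 4096 sectors = 2048KB, the top of Mark's suggested range.
# /dev/sd[e-i] stands in for the five data disks on one node.
for dev in /dev/sd[e-i]; do
    blockdev --setra 4096 "$dev"
done
blockdev --getra /dev/sdi    # verify the new value

# CephFS client readahead at mount time. On newer kernels the readahead
# option is rasize (bytes); older man pages list rsize instead. Check
# the mount.ceph man page Mark links for your kernel's options.
mount -t ceph 192.168.0.1:/ /mnt/cephfs -o name=admin,rasize=2097152
```

These commands need root and real devices, so treat them as a template to adapt rather than something to paste in verbatim; benchmark each readahead value, since Mark notes the best setting varies by hardware.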
> > > My CephFS only gets 1/10 of the raw disk's performance.
> >
> > 7200 RPM spinning disks typically top out at something like 150 IOPS
> > (and some are lower). With 15 disks, to hit 4127 IOPS you were probably
> > seeing some write coalescing effects (or if these were random reads,
> > some benefit from readahead).
> >
> > > How can I tune my cluster to improve the sequential write/random
> > > read/random write performance?
> >
> > I don't know what kind of controller you have, but in cases where
> > journals are on the same disks as the data, using writeback cache helps
> > a lot because the controller can coalesce the direct IO journal writes
> > in cache and just do big periodic dumps to the drives. That really
> > reduces seek overhead for the writes. Using SSDs for the journals
> > accomplishes much of the same effect, and lets you get faster large IO
> > writes too, but in many chassis there is a density (and cost) trade-off.
> >
> > Hope this helps!
> >
> > Mark
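If SSDs do get added later, the journal placement Mark describes is just a per-OSD setting in ceph.conf. A hypothetical fragment (the partition path, journal size, and thread count are made up for illustration, not from this thread):

```ini
[osd]
    ; journal size in MB, shared default for all OSDs
    osd journal size = 5120
    ; bobtail-era tuning from the article Mark links above:
    ; more op threads can help small-object reads
    osd op threads = 8

[osd.0]
    ; point this OSD's journal at an SSD partition instead of a
    ; file on the data disk (hypothetical device path)
    osd journal = /dev/sdj1
```

Each OSD would get its own `[osd.N]` section pointing at its own SSD partition; the OSD must be stopped and its journal flushed/recreated before switching.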
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com