Re: Poor write/random read/random write performance

Thank you! Testing now.

How about pg_num? I'm using the default of 64. I tried (100 * osd_num) / replica_size, but surprisingly that decreased performance.
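For reference, the rule of thumb works out like this for this cluster (a sketch; 512 is just the raw figure rounded up to the next power of two):

```shell
# Rule-of-thumb PG sizing: (100 * num_osds) / replica_size,
# rounded up to the next power of two.
osds=15
replicas=3
raw=$(( 100 * osds / replicas ))            # 500
pg_num=1
while [ "$pg_num" -lt "$raw" ]; do
    pg_num=$(( pg_num * 2 ))
done
echo "suggested pg_num: $pg_num"            # prints: suggested pg_num: 512
```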

> Date: Mon, 19 Aug 2013 11:33:30 -0500
> From: mark.nelson@xxxxxxxxxxx
> To: dachun.ng@xxxxxxxxxxx
> CC: ceph-users@xxxxxxxxxxxxxx
> Subject: Re: [ceph-users] Poor write/random read/random write performance
>
> On 08/19/2013 08:59 AM, Da Chun Ng wrote:
> > Thanks very much! Mark.
> > Yes, I put the data and journal on the same disk, no SSD in my environment.
> > My controllers are general SATA II.
>
> Ok, so in this case the lack of WB cache on the controller and no SSDs
> for journals is probably having an effect.
>
> >
> > Some more questions below in blue.
> >
> > ------------------------------------------------------------------------
> > Date: Mon, 19 Aug 2013 07:48:23 -0500
> > From: mark.nelson@xxxxxxxxxxx
> > To: ceph-users@xxxxxxxxxxxxxx
> > Subject: Re: [ceph-users] Poor write/random read/random write performance
> >
> > On 08/19/2013 06:28 AM, Da Chun Ng wrote:
> >
> > I have a 3 nodes, 15 osds ceph cluster setup:
> > * 15 7200 RPM SATA disks, 5 for each node.
> > * 10G network
> > * Intel(R) Xeon(R) CPU E5-2620(6 cores) 2.00GHz, for each node.
> > * 64G Ram for each node.
> >
> > I deployed the cluster with ceph-deploy, and created a new data pool
> > for cephfs.
> > Both the data and metadata pools are set with replica size 3.
> > Then mounted the cephfs on one of the three nodes, and tested the
> > performance with fio.
> >
> > The sequential read performance looks good:
> > fio -direct=1 -iodepth 1 -thread -rw=read -ioengine=libaio -bs=16K
> > -size=1G -numjobs=16 -group_reporting -name=mytest -runtime 60
> > read : io=10630MB, bw=181389KB/s, iops=11336 , runt= 60012msec
> >
> >
> > Sounds like readahead and or caching is helping out a lot here. Btw, you
> > might want to make sure this is actually coming from the disks with
> > iostat or collectl or something.
> >
> > I ran "sync && echo 3 | tee /proc/sys/vm/drop_caches" on all the nodes
> > before every test. I used collectl to watch every disk IO, the numbers
> > should match. I think readahead is helping here.
>
> Ok, good! I suspect that readahead is indeed helping.
>
> >
> >
> > But the sequential write/random read/random write performance is
> > very poor:
> > fio -direct=1 -iodepth 1 -thread -rw=write -ioengine=libaio -bs=16K
> > -size=256M -numjobs=16 -group_reporting -name=mytest -runtime 60
> > write: io=397280KB, bw=6618.2KB/s, iops=413 , runt= 60029msec
> >
> >
> > One thing to keep in mind is that unless you have SSDs in this system,
> > you will be doing 2 writes for every client write to the spinning disks
> > (since data and journals will both be on the same disk).
> >
> > So let's do the math:
> >
> > 6618.2KB/s * 3 replication * 2 (journal + data writes) * 1024
> > (KB->bytes) / 16384 (write size in bytes) / 15 drives = ~165 IOPS / drive
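The arithmetic above can be checked directly (numbers taken from the fio run above):

```shell
# Per-drive IOPS implied by the client bandwidth, 3x replication,
# the journal+data double write, 16KB IOs, and 15 drives.
awk 'BEGIN {
    bw_kb = 6618.2
    printf "%.0f IOPS per drive\n", bw_kb * 3 * 2 * 1024 / 16384 / 15
}'
# prints: 165 IOPS per drive
```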
> >
> > If there is no write coalescing going on, this isn't terrible. If there
> > is, this is terrible.
> >
> > How can I know if there is write coalescing going on?
>
> look in collectl at the average IO sizes going to the disks. I bet they
> will be 16KB. If you were to look further with blktrace and
> seekwatcher, I bet you'd see lots of seeking between OSD data writes and
> journal writes since there is no controller cache helping smooth things
> out (and your journals are on the same drives).
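A sketch of those checks (the device name sdi is only an illustration; run these on an OSD node):

```shell
# Per-disk detail every 2 seconds; the request-size column shows whether
# writes are being coalesced beyond 16KB.
collectl -sD -i 2

# Trace the block layer for 30 seconds, then render a seek plot:
blktrace -d /dev/sdi -o sdi -w 30
seekwatcher -t sdi.blktrace.0 -o sdi-seeks.png
```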
>
> >
> > Have you tried buffered writes with the sync engine at the same IO size?
> >
> > Do you mean as below?
> > fio -direct=0-iodepth 1 -thread -rw=write -ioengine=sync-bs=16K
> > -size=256M -numjobs=16 -group_reporting -name=mytest -runtime 60
>
> Yeah, that'd work.
>
> >
> > fio -direct=1 -iodepth 1 -thread -rw=randread -ioengine=libaio
> > -bs=16K -size=256M -numjobs=16 -group_reporting -name=mytest -runtime 60
> > read : io=665664KB, bw=11087KB/s, iops=692 , runt= 60041msec
> >
> >
> > In this case:
> >
> > 11087 * 1024 (KB->bytes) / 16384 / 15 = ~46 IOPS / drive.
> >
> > Definitely not great! You might want to try fiddling with read ahead
> > both on the CephFS client and on the block devices under the OSDs
> > themselves.
> >
> > Could you please tell me how to enable read ahead on the CephFS client?
>
> It's one of the mount options:
>
> http://ceph.com/docs/master/man/8/mount.ceph/
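A sketch of such a mount (the monitor address, mount point, and secret file are placeholders; at the time, the kernel client's readahead cap was the rsize mount option, in bytes):

```shell
mount -t ceph 192.168.0.1:6789:/ /mnt/cephfs \
    -o name=admin,secretfile=/etc/ceph/admin.secret,rsize=2097152
```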
>
> >
> > For the block devices under the OSDs, the read ahead value is:
> > [root@ceph0 ~]# blockdev --getra /dev/sdi
> > 256
> > How big is appropriate for it?
>
> To be honest, I've seen different results depending on the hardware. I'd
> try anywhere from 32KB to 2048KB.
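Note that blockdev reports readahead in 512-byte sectors, so the 256 above is only 128KB. A sketch of the conversion (the --setra line is commented out because it touches a real device; sdi is an example name):

```shell
echo "$(( 256 * 512 / 1024 ))KB"        # current readahead: prints 128KB
echo "$(( 2048 * 1024 / 512 )) sectors" # for 2048KB: prints 4096 sectors
# blockdev --setra 4096 /dev/sdi        # apply on each OSD data disk
```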
>
> >
> > One thing I did notice back during bobtail is that increasing the number
> > of osd op threads seemed to help small object read performance. It
> > might be worth looking at too.
> >
> > http://ceph.com/community/ceph-bobtail-jbod-performance-tuning/#4kbradosread
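A sketch of what that tuning looks like in ceph.conf (the default at the time was 2; 8 is only an example value to experiment with):

```ini
[osd]
    # default is 2; the bobtail tuning article above saw gains with more
    osd op threads = 8
```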
> >
> > Other than that, if you really want to dig into this, you can use tools
> > like iostat, collectl, blktrace, and seekwatcher to try and get a feel
> > for what the IO going to the OSDs looks like. That can help when
> > diagnosing this sort of thing.
> >
> > fio -direct=1 -iodepth 1 -thread -rw=randwrite -ioengine=libaio
> > -bs=16K -size=256M -numjobs=16 -group_reporting -name=mytest -runtime 60
> > write: io=361056KB, bw=6001.1KB/s, iops=375 , runt= 60157msec
> >
> >
> > 6001.1KB/s * 3 replication * 2 (journal + data writes) * 1024
> > (KB->bytes) / 16384 (write size in bytes) / 15 drives = ~150 IOPS / drive
> >
> >
> > I am mostly surprised by the sequential write performance compared to
> > the raw SATA disk performance (it can get 4127 IOPS when mounted with
> > ext4). My cephfs gets only 1/10 the performance of the raw disk.
> >
> >
> > 7200 RPM spinning disks typically top out at something like 150 IOPS
> > (and some are lower). With 15 disks, to hit 4127 IOPS you were probably
> > seeing some write coalescing effects (or if these were random reads,
> > some benefit to read ahead).
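Following that reasoning: even if the 4127 IOPS were spread across all 15 spindles, each drive would still be doing more than a 7200 RPM disk can sustain on its own:

```shell
awk 'BEGIN { printf "%.0f IOPS per drive\n", 4127 / 15 }'
# prints: 275 IOPS per drive -- above the ~150 a 7200 RPM disk tops out at,
# so coalescing or readahead must have helped the raw ext4 number.
```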
> >
> >
> > How can I tune my cluster to improve the sequential write/random
> > read/random write performance?
> >
> > I don't know what kind of controller you have, but in cases where
> > journals are on the same disks as the data, using writeback cache helps
> > a lot because the controller can coalesce the direct IO journal writes
> > in cache and just do big periodic dumps to the drives. That really
> > reduces seek overhead for the writes. Using SSDs for the journals
> > accomplishes much of the same effect, and lets you get faster large IO
> > writes too, but in many chassis there is a density (and cost) trade-off.
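For the SSD-journal variant, each OSD's journal can be pointed at an SSD partition in ceph.conf (the device name here is a placeholder):

```ini
[osd.0]
    # journal on a dedicated SSD partition instead of the data disk
    osd journal = /dev/sde1
```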
> >
> > Hope this helps!
> >
> > Mark
> >
> >
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
