Re: Poor write/random read/random write performance

On 08/21/2013 07:58 AM, Da Chun Ng wrote:
> Mark,
> 
> I tried with journal aio = true, and op thread = 4, but it made little
> difference.
> Then I tried to enlarge the read ahead value on both the osd block devices
> and the cephfs client. It did improve overall performance somewhat,
> especially the sequential read performance, but it still did not help
> the write/random read/random write performance much.

One thing that might be worth trying on a test node (don't use this in
production!) is putting your journals in RAM via /dev/shm.  That might
give you an idea of how much your journal writes are conflicting with
data writes.

If you want to try this, you'll need to set journal dio = false.
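
As a rough sketch, a test-only ceph.conf fragment for that might look
something like this (journal path and size here are just placeholders):

  [osd]
      journal dio = false
      osd journal = /dev/shm/osd.$id.journal
      osd journal size = 1024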

> 
> I tried to change the placement group number to (100 *
> osd_num)/replica_size. It does not decrease the overall performance this
> time, but it does not improve it either.
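
For reference, with 15 OSDs and 3x replication that formula works out to
(100 * 15) / 3 = 500 PGs; it's common to round up to a power of two and
set it per pool, e.g. (pool name is a placeholder):

  ceph osd pool set <pool> pg_num 512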
> 
>  > Date: Mon, 19 Aug 2013 12:31:07 -0500
>  > From: mark.nelson@xxxxxxxxxxx
>  > To: dachun.ng@xxxxxxxxxxx
>  > CC: ceph-users@xxxxxxxxxxxxxx
>  > Subject: Re:  Poor write/random read/random write performance
>  >
>  > On 08/19/2013 12:05 PM, Da Chun Ng wrote:
>  > > Thank you! Testing now.
>  > >
 > > > How about pg num? I'm using the default size of 64; I tried with
 > > > (100 * osd_num)/replica_size, but it surprisingly decreased performance.
>  >
>  > Oh! That's odd! Typically you would want more than that. Most likely
>  > you aren't distributing PGs very evenly across OSDs with 64. More PGs
>  > shouldn't decrease performance unless the monitors are behaving badly.
>  > We saw some issues back in early cuttlefish but you should be fine with
>  > many more PGs.
>  >
>  > Mark
>  >
>  > >
>  > > > Date: Mon, 19 Aug 2013 11:33:30 -0500
>  > > > From: mark.nelson@xxxxxxxxxxx
>  > > > To: dachun.ng@xxxxxxxxxxx
>  > > > CC: ceph-users@xxxxxxxxxxxxxx
 > > > > Subject: Re:  Poor write/random read/random write performance
>  > > >
>  > > > On 08/19/2013 08:59 AM, Da Chun Ng wrote:
>  > > > > Thanks very much! Mark.
 > > > > > Yes, I put the data and journal on the same disk, no SSD in my
 > > > > > environment.
>  > > > > My controllers are general SATA II.
>  > > >
 > > > > Ok, so in this case the lack of WB cache on the controller and no SSDs
 > > > > for journals is probably having an effect.
>  > > >
>  > > > >
>  > > > > Some more questions below in blue.
>  > > > >
>  > > > >
 > > > > > ------------------------------------------------------------------------
>  > > > > Date: Mon, 19 Aug 2013 07:48:23 -0500
>  > > > > From: mark.nelson@xxxxxxxxxxx
>  > > > > To: ceph-users@xxxxxxxxxxxxxx
 > > > > > Subject: Re:  Poor write/random read/random write performance
>  > > > >
>  > > > > On 08/19/2013 06:28 AM, Da Chun Ng wrote:
>  > > > >
 > > > > > I have a 3-node, 15-OSD Ceph cluster setup:
 > > > > > * 15 7200 RPM SATA disks, 5 per node.
 > > > > > * 10G network
 > > > > > * Intel(R) Xeon(R) CPU E5-2620 (6 cores) 2.00GHz per node.
 > > > > > * 64G RAM per node.
>  > > > >
 > > > > > I deployed the cluster with ceph-deploy, and created a new data pool
>  > > > > for cephfs.
>  > > > > Both the data and metadata pools are set with replica size 3.
>  > > > > Then mounted the cephfs on one of the three nodes, and tested the
>  > > > > performance with fio.
>  > > > >
>  > > > > The sequential read performance looks good:
>  > > > > fio -direct=1 -iodepth 1 -thread -rw=read -ioengine=libaio -bs=16K
>  > > > > -size=1G -numjobs=16 -group_reporting -name=mytest -runtime 60
>  > > > > read : io=10630MB, bw=181389KB/s, iops=11336 , runt= 60012msec
>  > > > >
>  > > > >
 > > > > > Sounds like readahead and/or caching is helping out a lot here. Btw,
 > > > > > you might want to make sure this is actually coming from the disks
 > > > > > with iostat or collectl or something.
>  > > > >
 > > > > > I ran "sync && echo 3 | tee /proc/sys/vm/drop_caches" on all the nodes
 > > > > > before every test. I used collectl to watch every disk IO; the numbers
 > > > > > should match. I think readahead is helping here.
>  > > >
>  > > > Ok, good! I suspect that readahead is indeed helping.
>  > > >
>  > > > >
>  > > > >
>  > > > > But the sequential write/random read/random write performance is
>  > > > > very poor:
>  > > > > fio -direct=1 -iodepth 1 -thread -rw=write -ioengine=libaio -bs=16K
>  > > > > -size=256M -numjobs=16 -group_reporting -name=mytest -runtime 60
>  > > > > write: io=397280KB, bw=6618.2KB/s, iops=413 , runt= 60029msec
>  > > > >
>  > > > >
 > > > > > One thing to keep in mind is that unless you have SSDs in this system,
 > > > > > you will be doing 2 writes for every client write to the spinning disks
 > > > > > (since data and journals will both be on the same disk).
>  > > > >
>  > > > > So let's do the math:
>  > > > >
>  > > > > 6618.2KB/s * 3 replication * 2 (journal + data writes) * 1024
 > > > > > (KB->bytes) / 16384 (write size in bytes) / 15 drives = ~165 IOPS/drive
>  > > > >
 > > > > > If there is no write coalescing going on, this isn't terrible. If
 > > > > > there is, this is terrible.
>  > > > >
>  > > > > How can I know if there is write coalescing going on?
>  > > >
 > > > > look in collectl at the average IO sizes going to the disks. I bet they
 > > > > will be 16KB. If you were to look further with blktrace and
 > > > > seekwatcher, I bet you'd see lots of seeking between OSD data writes and
 > > > > journal writes since there is no controller cache helping smooth things
 > > > > out (and your journals are on the same drives).
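
For example, one way to check the average request size actually hitting a
disk is iostat's avgrq-sz column (reported in 512-byte sectors, so 32
sectors = 16KB), or collectl's disk detail view. /dev/sdi is just the
device name from this thread:

  iostat -x 1 /dev/sdi
  collectl -sD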
>  > > >
>  > > > >
 > > > > > Have you tried buffered writes with the sync engine at the same IO size?
 > > > > >
 > > > > > Do you mean as below?
 > > > > > fio -direct=0 -iodepth 1 -thread -rw=write -ioengine=sync -bs=16K
 > > > > > -size=256M -numjobs=16 -group_reporting -name=mytest -runtime 60
>  > > >
>  > > > Yeah, that'd work.
>  > > >
>  > > > >
>  > > > > fio -direct=1 -iodepth 1 -thread -rw=randread -ioengine=libaio
 > > > > > -bs=16K -size=256M -numjobs=16 -group_reporting -name=mytest -runtime 60
>  > > > > read : io=665664KB, bw=11087KB/s, iops=692 , runt= 60041msec
>  > > > >
>  > > > >
>  > > > > In this case:
>  > > > >
>  > > > > 11087 * 1024 (KB->bytes) / 16384 / 15 = ~46 IOPS / drive.
>  > > > >
 > > > > > Definitely not great! You might want to try fiddling with read ahead
>  > > > > both on the CephFS client and on the block devices under the OSDs
>  > > > > themselves.
>  > > > >
 > > > > > Could you please tell me how to enable read ahead on the CephFS client?
>  > > >
>  > > > It's one of the mount options:
>  > > >
>  > > > http://ceph.com/docs/master/man/8/mount.ceph/
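
For the kernel client this is a mount option; the exact option name
depends on your kernel version, but something along these lines (monitor
address and mount point are placeholders; rasize is in bytes):

  mount -t ceph 192.168.0.1:6789:/ /mnt/cephfs -o name=admin,rasize=4194304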
>  > > >
>  > > > >
>  > > > > For the block devices under the OSDs, the read ahead value is:
>  > > > > [root@ceph0 ~]# blockdev --getra /dev/sdi
>  > > > > 256
>  > > > > How big is appropriate for it?
>  > > >
 > > > > To be honest I've seen different results depending on the hardware. I'd
 > > > > try anywhere from 32KB to 2048KB.
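
As a reference point, blockdev values are in 512-byte sectors, so the
current 256 is 128KB. Setting, say, 2MB of readahead would look like:

  blockdev --setra 4096 /dev/sdi    # 4096 sectors * 512 bytes = 2MB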
>  > > >
>  > > > >
 > > > > > One thing I did notice back during bobtail is that increasing the number
>  > > > > of osd op threads seemed to help small object read performance. It
>  > > > > might be worth looking at too.
>  > > > >
>  > > > >
 > > > > > http://ceph.com/community/ceph-bobtail-jbod-performance-tuning/#4kbradosread
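
For reference, that tunable is a ceph.conf setting; a value of 4 (as
tried earlier in this thread) would look like:

  [osd]
      osd op threads = 4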
>  > > > >
 > > > > > Other than that, if you really want to dig into this, you can use tools
 > > > > > like iostat, collectl, blktrace, and seekwatcher to try and get a feel
 > > > > > for what the IO going to the OSDs looks like. That can help when
 > > > > > diagnosing this sort of thing.
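
A rough capture-and-graph workflow with those tools might look like this
(device name taken from this thread; seekwatcher needs matplotlib):

  blktrace -d /dev/sdi -o sdi -w 60    # trace 60 seconds of block IO
  blkparse -i sdi                      # human-readable event dump
  seekwatcher -t sdi -o sdi.png        # plot seeks/throughput over time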
>  > > > >
>  > > > > fio -direct=1 -iodepth 1 -thread -rw=randwrite -ioengine=libaio
 > > > > > -bs=16K -size=256M -numjobs=16 -group_reporting -name=mytest -runtime 60
>  > > > > write: io=361056KB, bw=6001.1KB/s, iops=375 , runt= 60157msec
>  > > > >
>  > > > >
>  > > > > 6001.1KB/s * 3 replication * 2 (journal + data writes) * 1024
 > > > > > (KB->bytes) / 16384 (write size in bytes) / 15 drives = ~150 IOPS/drive
>  > > > >
>  > > > >
 > > > > > I am mostly surprised by the seq write performance compared to the
 > > > > > raw SATA disk performance (it can reach 4127 IOPS when mounted with
 > > > > > ext4). My cephfs gets only 1/10 the performance of the raw disk.
>  > > > >
>  > > > >
 > > > > > 7200 RPM spinning disks typically top out at something like 150 IOPS
 > > > > > (and some are lower). With 15 disks, to hit 4127 IOPS you were probably
 > > > > > seeing some write coalescing effects (or if these were random reads,
 > > > > > some benefit to read ahead).
>  > > > >
>  > > > >
>  > > > > How can I tune my cluster to improve the sequential write/random
>  > > > > read/random write performance?
>  > > > >
>  > > > > I don't know what kind of controller you have, but in cases where
 > > > > > journals are on the same disks as the data, using writeback cache helps
 > > > > > a lot because the controller can coalesce the direct IO journal writes
 > > > > > in cache and just do big periodic dumps to the drives. That really
 > > > > > reduces seek overhead for the writes. Using SSDs for the journals
 > > > > > accomplishes much of the same effect, and lets you get faster large IO
 > > > > > writes too, but in many chassis there is a density (and cost) trade-off.
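
As a sketch, pointing each OSD's journal at its own SSD partition in
ceph.conf might look like this (device names here are hypothetical):

  [osd.0]
      osd journal = /dev/sdj1
  [osd.1]
      osd journal = /dev/sdj2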
>  > > > >
>  > > > > Hope this helps!
>  > > > >
>  > > > > Mark

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



