On 08/21/2013 07:58 AM, Da Chun Ng wrote:
> Mark,
>
> I tried with journal aio = true and osd op threads = 4, but it made
> little difference.
> Then I tried to enlarge the readahead value both on the OSD block
> devices and the CephFS client. It did improve overall performance
> somewhat, especially sequential read performance, but it still doesn't
> help much with the write/random read/random write performance.

One thing that might be worth trying on a test node (don't use this in
production!) is putting your journals in RAM via /dev/shm. That might
give you an idea of how much your journal writes are conflicting with
data writes. If you want to try this, you'll need to set journal dio =
false.
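
A rough sketch of what that test could look like, assuming a sysvinit
setup with five OSDs (osd.0 through osd.4) on the test node; the journal
path, size, and OSD IDs below are placeholders to adapt to your layout:

    # Stop the OSDs on this test node (sysvinit syntax; adjust the
    # command and the OSD IDs to your own environment).
    sudo service ceph stop osd

    # In ceph.conf, under [osd], point the journal at tmpfs and disable
    # direct IO on the journal (tmpfs does not support O_DIRECT):
    #   journal dio = false
    #   osd journal = /dev/shm/journal.$id
    #   osd journal size = 1024

    # Flush the old journals, create the new tmpfs-backed ones, restart.
    for id in 0 1 2 3 4; do
        sudo ceph-osd -i $id --flush-journal
        sudo ceph-osd -i $id --mkjournal
    done
    sudo service ceph start osd
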
> I tried to change the placement group number to
> (100 * osd_num)/replica_size. It does not decrease the overall
> performance this time, but it does not improve it either.
>
> > Date: Mon, 19 Aug 2013 12:31:07 -0500
> > From: mark.nelson@xxxxxxxxxxx
> > To: dachun.ng@xxxxxxxxxxx
> > CC: ceph-users@xxxxxxxxxxxxxx
> > Subject: Re: Poor write/random read/random write performance
> >
> > On 08/19/2013 12:05 PM, Da Chun Ng wrote:
> > > Thank you! Testing now.
> > >
> > > How about pg num? I'm using the default of 64, as I tried with
> > > (100 * osd_num)/replica_size, but it decreased the performance
> > > surprisingly.
> >
> > Oh! That's odd! Typically you would want more than that. Most likely
> > you aren't distributing PGs very evenly across OSDs with 64. More PGs
> > shouldn't decrease performance unless the monitors are behaving badly.
> > We saw some issues back in early cuttlefish, but you should be fine
> > with many more PGs.
> >
> > Mark
> >
> > > > Date: Mon, 19 Aug 2013 11:33:30 -0500
> > > > From: mark.nelson@xxxxxxxxxxx
> > > > To: dachun.ng@xxxxxxxxxxx
> > > > CC: ceph-users@xxxxxxxxxxxxxx
> > > > Subject: Re: Poor write/random read/random write performance
> > > >
> > > > On 08/19/2013 08:59 AM, Da Chun Ng wrote:
> > > > > Thanks very much! Mark.
> > > > > Yes, I put the data and journal on the same disk, no SSD in my
> > > > > environment.
> > > > > My controllers are plain SATA II.
> > > >
> > > > Ok, so in this case the lack of WB cache on the controller and no
> > > > SSDs for journals is probably having an effect.
> > > >
> > > > > Some more questions below in blue.
> > > > >
> > > > > ------------------------------------------------------------------------
> > > > > Date: Mon, 19 Aug 2013 07:48:23 -0500
> > > > > From: mark.nelson@xxxxxxxxxxx
> > > > > To: ceph-users@xxxxxxxxxxxxxx
> > > > > Subject: Re: Poor write/random read/random write performance
> > > > >
> > > > > On 08/19/2013 06:28 AM, Da Chun Ng wrote:
> > > > >
> > > > > I have a 3-node, 15-OSD Ceph cluster:
> > > > > * 15 7200 RPM SATA disks, 5 per node.
> > > > > * 10G network.
> > > > > * Intel(R) Xeon(R) CPU E5-2620 (6 cores) @ 2.00GHz per node.
> > > > > * 64G RAM per node.
> > > > >
> > > > > I deployed the cluster with ceph-deploy and created a new data
> > > > > pool for CephFS.
> > > > > Both the data and metadata pools are set with replica size 3.
> > > > > Then I mounted the CephFS on one of the three nodes and tested
> > > > > the performance with fio.
> > > > >
> > > > > The sequential read performance looks good:
> > > > > fio -direct=1 -iodepth 1 -thread -rw=read -ioengine=libaio -bs=16K
> > > > > -size=1G -numjobs=16 -group_reporting -name=mytest -runtime 60
> > > > > read : io=10630MB, bw=181389KB/s, iops=11336 , runt= 60012msec
> > > > >
> > > > > Sounds like readahead and/or caching is helping out a lot here.
> > > > > Btw, you might want to make sure this is actually coming from the
> > > > > disks with iostat or collectl or something.
> > > > >
> > > > > I ran "sync && echo 3 | tee /proc/sys/vm/drop_caches" on all the
> > > > > nodes before every test. I used collectl to watch every disk's
> > > > > IO; the numbers should match. I think readahead is helping here.
> > > >
> > > > Ok, good! I suspect that readahead is indeed helping.
> > > >
> > > > > But the sequential write/random read/random write performance is
> > > > > very poor:
> > > > > fio -direct=1 -iodepth 1 -thread -rw=write -ioengine=libaio -bs=16K
> > > > > -size=256M -numjobs=16 -group_reporting -name=mytest -runtime 60
> > > > > write: io=397280KB, bw=6618.2KB/s, iops=413 , runt= 60029msec
> > > > >
> > > > > One thing to keep in mind is that unless you have SSDs in this
> > > > > system, you will be doing 2 writes for every client write to the
> > > > > spinning disks (since data and journals will both be on the same
> > > > > disk).
> > > > >
> > > > > So let's do the math:
> > > > >
> > > > > 6618.2KB/s * 3 replication * 2 (journal + data writes) * 1024
> > > > > (KB->bytes) / 16384 (write size in bytes) / 15 drives = ~165
> > > > > IOPS / drive
> > > > >
> > > > > If there is no write coalescing going on, this isn't terrible.
> > > > > If there is, this is terrible.
> > > > >
> > > > > How can I know if there is write coalescing going on?
> > > >
> > > > Look in collectl at the average IO sizes going to the disks. I bet
> > > > they will be 16KB. If you were to look further with blktrace and
> > > > seekwatcher, I bet you'd see lots of seeking between OSD data
> > > > writes and journal writes since there is no controller cache
> > > > helping smooth things out (and your journals are on the same
> > > > drives).
> > > >
> > > > > Have you tried buffered writes with the sync engine at the same
> > > > > IO size?
> > > > >
> > > > > Do you mean as below?
> > > > > fio -direct=0 -iodepth 1 -thread -rw=write -ioengine=sync -bs=16K
> > > > > -size=256M -numjobs=16 -group_reporting -name=mytest -runtime 60
> > > >
> > > > Yeah, that'd work.
> > > >
> > > > > fio -direct=1 -iodepth 1 -thread -rw=randread -ioengine=libaio
> > > > > -bs=16K -size=256M -numjobs=16 -group_reporting -name=mytest
> > > > > -runtime 60
> > > > > read : io=665664KB, bw=11087KB/s, iops=692 , runt= 60041msec
> > > > >
> > > > > In this case:
> > > > >
> > > > > 11087 * 1024 (KB->bytes) / 16384 / 15 = ~46 IOPS / drive.
> > > > >
> > > > > Definitely not great! You might want to try fiddling with
> > > > > readahead both on the CephFS client and on the block devices
> > > > > under the OSDs themselves.
> > > > >
> > > > > Could you please tell me how to enable readahead on the CephFS
> > > > > client?
> > > >
> > > > It's one of the mount options:
> > > >
> > > > http://ceph.com/docs/master/man/8/mount.ceph/
> > > >
> > > > > For the block devices under the OSDs, the readahead value is:
> > > > > [root@ceph0 ~]# blockdev --getra /dev/sdi
> > > > > 256
> > > > > How big a value is appropriate for it?
> > > >
> > > > To be honest, I've seen different results depending on the
> > > > hardware. I'd try anywhere from 32KB to 2048KB.
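
To make those two readahead knobs concrete, a minimal sketch; the
monitor address, secret file, mount point, device name, and sizes are
placeholders, and the client option name (rasize on newer kernel
clients, rsize on older ones) is described in the mount.ceph man page
linked above:

    # CephFS kernel client: request a larger readahead at mount time.
    # 8388608 bytes (8 MB) is just an example value.
    mount -t ceph 192.168.1.10:6789:/ /mnt/cephfs \
        -o name=admin,secretfile=/etc/ceph/admin.secret,rasize=8388608

    # OSD block devices: blockdev counts in 512-byte sectors, so the
    # current value of 256 is 128 KB. 4096 sectors = 2048 KB, the top of
    # the range suggested above. Not persistent across reboots.
    blockdev --setra 4096 /dev/sdi
    blockdev --getra /dev/sdi    # verify
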
> > > > > One thing I did notice back during bobtail is that increasing
> > > > > the number of OSD op threads seemed to help small object read
> > > > > performance. It might be worth looking at too.
> > > > >
> > > > > http://ceph.com/community/ceph-bobtail-jbod-performance-tuning/#4kbradosread
> > > > >
> > > > > Other than that, if you really want to dig into this, you can
> > > > > use tools like iostat, collectl, blktrace, and seekwatcher to try
> > > > > and get a feel for what the IO going to the OSDs looks like. That
> > > > > can help when diagnosing this sort of thing.
> > > > >
> > > > > fio -direct=1 -iodepth 1 -thread -rw=randwrite -ioengine=libaio
> > > > > -bs=16K -size=256M -numjobs=16 -group_reporting -name=mytest
> > > > > -runtime 60
> > > > > write: io=361056KB, bw=6001.1KB/s, iops=375 , runt= 60157msec
> > > > >
> > > > > 6001.1KB/s * 3 replication * 2 (journal + data writes) * 1024
> > > > > (KB->bytes) / 16384 (write size in bytes) / 15 drives = ~150
> > > > > IOPS / drive
> > > > >
> > > > > I am mostly surprised by the sequential write performance
> > > > > compared to the raw SATA disk performance (it can reach 4127
> > > > > IOPS when mounted with ext4). My CephFS only gets 1/10 of the
> > > > > raw disk's performance.
> > > > >
> > > > > 7200 RPM spinning disks typically top out at something like 150
> > > > > IOPS (and some are lower). With 15 disks, to hit 4127 IOPS you
> > > > > were probably seeing some write coalescing effects (or if these
> > > > > were random reads, some benefit from readahead).
> > > > >
> > > > > How can I tune my cluster to improve the sequential write/random
> > > > > read/random write performance?
> > > > >
> > > > > I don't know what kind of controller you have, but in cases
> > > > > where journals are on the same disks as the data, using
> > > > > writeback cache helps a lot because the controller can coalesce
> > > > > the direct IO journal writes in cache and just do big periodic
> > > > > dumps to the drives. That really reduces seek overhead for the
> > > > > writes. Using SSDs for the journals accomplishes much of the
> > > > > same effect, and lets you get faster large IO writes too, but in
> > > > > many chassis there is a density (and cost) trade-off.
> > > > >
> > > > > Hope this helps!
> > > > >
> > > > > Mark
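
Coming back to the placement group sizing discussed earlier in the
thread: (100 * 15 OSDs) / 3 replicas works out to 500 PGs, which would
usually be rounded up to 512. A minimal sketch of retrying that, where
"data" is only a placeholder for the actual name of the CephFS data pool
you created:

    # (100 * 15) / 3 = 500; round up to 512.
    ceph osd pool set data pg_num 512
    ceph osd pool set data pgp_num 512   # raise pgp_num after pg_num
    ceph osd pool get data pg_num        # verify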