Thank you! Testing now.
How about pg_num? I'm using the default of 64; I tried (100 * osd_num) / replica_size, but it surprisingly decreased the performance.

> Date: Mon, 19 Aug 2013 11:33:30 -0500
> From: mark.nelson@xxxxxxxxxxx
> To: dachun.ng@xxxxxxxxxxx
> CC: ceph-users@xxxxxxxxxxxxxx
> Subject: Re: [ceph-users] Poor write/random read/random write performance
>
> On 08/19/2013 08:59 AM, Da Chun Ng wrote:
>
> > Thanks very much! Mark.
> > Yes, I put the data and journal on the same disk; there are no SSDs in
> > my environment.
> > My controllers are plain SATA II.
>
> Ok, so in this case the lack of WB cache on the controller and no SSDs
> for journals is probably having an effect.
>
> > Some more questions below in blue.
> >
> > ------------------------------------------------------------------------
> > Date: Mon, 19 Aug 2013 07:48:23 -0500
> > From: mark.nelson@xxxxxxxxxxx
> > To: ceph-users@xxxxxxxxxxxxxx
> > Subject: Re: [ceph-users] Poor write/random read/random write performance
> >
> > On 08/19/2013 06:28 AM, Da Chun Ng wrote:
> >
> > > I have a 3-node, 15-OSD Ceph cluster:
> > > * 15 7200 RPM SATA disks, 5 per node.
> > > * 10G network.
> > > * Intel(R) Xeon(R) CPU E5-2620 (6 cores) 2.00GHz per node.
> > > * 64G RAM per node.
> > >
> > > I deployed the cluster with ceph-deploy, and created a new data pool
> > > for CephFS.
> > > Both the data and metadata pools are set with replica size 3.
> > > I then mounted the CephFS on one of the three nodes, and tested the
> > > performance with fio.
> > >
> > > The sequential read performance looks good:
> > > fio -direct=1 -iodepth 1 -thread -rw=read -ioengine=libaio -bs=16K
> > > -size=1G -numjobs=16 -group_reporting -name=mytest -runtime 60
> > > read : io=10630MB, bw=181389KB/s, iops=11336, runt= 60012msec
> >
> > Sounds like readahead and/or caching is helping out a lot here. Btw, you
> > might want to make sure this is actually coming from the disks with
> > iostat or collectl or something.
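For reference, the (100 * osd_num) / replica_size rule of thumb I tried works out like this for this cluster (a sketch only; it rounds up to a power of two per the usual guidance, and the numbers come from the 15-OSD, replica-size-3 setup quoted below):

```shell
# Rule-of-thumb pg_num for 15 OSDs and replica size 3, rounded up to
# the next power of two.
osds=15
replicas=3
target=$(( 100 * osds / replicas ))   # 500
pg=1
while [ "$pg" -lt "$target" ]; do
    pg=$(( pg * 2 ))
done
echo "suggested pg_num: $pg"          # prints "suggested pg_num: 512"
```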
>
> > I ran "sync && echo 3 | tee /proc/sys/vm/drop_caches" on all the nodes
> > before every test. I used collectl to watch every disk IO, and the
> > numbers should match. I think readahead is helping here.
>
> Ok, good! I suspect that readahead is indeed helping.
>
> > > But the sequential write/random read/random write performance is
> > > very poor:
> > > fio -direct=1 -iodepth 1 -thread -rw=write -ioengine=libaio -bs=16K
> > > -size=256M -numjobs=16 -group_reporting -name=mytest -runtime 60
> > > write: io=397280KB, bw=6618.2KB/s, iops=413, runt= 60029msec
> >
> > One thing to keep in mind is that unless you have SSDs in this system,
> > you will be doing 2 writes for every client write to the spinning disks
> > (since data and journals will both be on the same disk).
> >
> > So let's do the math:
> >
> > 6618.2KB/s * 3 replication * 2 (journal + data writes) * 1024
> > (KB->bytes) / 16384 (write size in bytes) / 15 drives = ~165 IOPS / drive
> >
> > If there is no write coalescing going on, this isn't terrible. If there
> > is, this is terrible.
> >
> > How can I know if there is write coalescing going on?
>
> Look in collectl at the average IO sizes going to the disks. I bet they
> will be 16KB. If you were to look further with blktrace and
> seekwatcher, I bet you'd see lots of seeking between OSD data writes and
> journal writes since there is no controller cache helping smooth things
> out (and your journals are on the same drives).
>
> > Have you tried buffered writes with the sync engine at the same IO size?
> >
> > Do you mean as below?
> > fio -direct=0 -iodepth 1 -thread -rw=write -ioengine=sync -bs=16K
> > -size=256M -numjobs=16 -group_reporting -name=mytest -runtime 60
>
> Yeah, that'd work.
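Mark's back-of-envelope math above can be checked mechanically (a sketch; the bandwidth figure is the fio result quoted above, truncated to whole KB/s):

```shell
# Per-drive IOPS implied by the sequential-write result:
# client KB/s * replicas * (journal + data) / IO size in KB / drives.
bw_kbs=6618                  # fio: bw=6618.2KB/s (integer part)
replicas=3
writes_per_client_write=2    # journal + data on the same spindle
io_kb=16
drives=15
iops=$(( bw_kbs * replicas * writes_per_client_write / io_kb / drives ))
echo "~${iops} IOPS per drive"   # prints "~165 IOPS per drive"
```

This matches Mark's ~165 IOPS/drive estimate, which is close to the practical ceiling of a 7200 RPM spindle doing seeky small writes.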
>
> > > fio -direct=1 -iodepth 1 -thread -rw=randread -ioengine=libaio
> > > -bs=16K -size=256M -numjobs=16 -group_reporting -name=mytest -runtime 60
> > > read : io=665664KB, bw=11087KB/s, iops=692, runt= 60041msec
> >
> > In this case:
> >
> > 11087 * 1024 (KB->bytes) / 16384 / 15 = ~46 IOPS / drive.
> >
> > Definitely not great! You might want to try fiddling with readahead
> > both on the CephFS client and on the block devices under the OSDs
> > themselves.
> >
> > Could you please tell me how to enable readahead on the CephFS client?
>
> It's one of the mount options:
>
> http://ceph.com/docs/master/man/8/mount.ceph/
>
> > For the block devices under the OSDs, the readahead value is:
> > [root@ceph0 ~]# blockdev --getra /dev/sdi
> > 256
> > How big is appropriate for it?
>
> To be honest I've seen different results depending on the hardware. I'd
> try anywhere from 32KB to 2048KB.
>
> > One thing I did notice back during bobtail is that increasing the number
> > of osd op threads seemed to help small object read performance. It
> > might be worth looking at too.
> >
> > http://ceph.com/community/ceph-bobtail-jbod-performance-tuning/#4kbradosread
> >
> > Other than that, if you really want to dig into this, you can use tools
> > like iostat, collectl, blktrace, and seekwatcher to try and get a feel
> > for what the IO going to the OSDs looks like. That can help when
> > diagnosing this sort of thing.
>
> > > fio -direct=1 -iodepth 1 -thread -rw=randwrite -ioengine=libaio
> > > -bs=16K -size=256M -numjobs=16 -group_reporting -name=mytest -runtime 60
> > > write: io=361056KB, bw=6001.1KB/s, iops=375, runt= 60157msec
> >
> > 6001.1KB/s * 3 replication * 2 (journal + data writes) * 1024
> > (KB->bytes) / 16384 (write size in bytes) / 15 drives = ~150 IOPS / drive
> >
> > > I am mostly surprised by the seq write performance compared to the
> > > raw SATA disk performance (it can get 4127 IOPS when mounted with
> > > ext4).
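An untested sketch of the two readahead knobs Mark mentions above (the device names, monitor address, and mount point are placeholders, not taken from this thread):

```shell
# Block-device readahead under the OSDs: blockdev --setra takes 512-byte
# sectors, so 4096 sectors = 2048KB, the top of Mark's suggested range.
# /dev/sd[e-i] stands in for the five data disks on one node.
for dev in /dev/sd[e-i]; do
    blockdev --setra 4096 "$dev"
done
blockdev --getra /dev/sdi    # verify the new value

# CephFS client readahead at mount time. On newer kernels the readahead
# option is rasize (bytes); older man pages list rsize instead. Check
# the mount.ceph man page Mark links for your kernel's options.
mount -t ceph 192.168.0.1:/ /mnt/cephfs -o name=admin,rasize=2097152
```

These commands need root and real devices, so treat them as a template to adapt rather than something to paste in verbatim; benchmark each readahead value, since Mark notes the best setting varies by hardware.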
> > > My CephFS only gets 1/10 of the raw disk's performance.
> >
> > 7200 RPM spinning disks typically top out at something like 150 IOPS
> > (and some are lower). With 15 disks, to hit 4127 IOPS you were probably
> > seeing some write coalescing effects (or if these were random reads,
> > some benefit from readahead).
> >
> > > How can I tune my cluster to improve the sequential write/random
> > > read/random write performance?
> >
> > I don't know what kind of controller you have, but in cases where
> > journals are on the same disks as the data, using writeback cache helps
> > a lot because the controller can coalesce the direct IO journal writes
> > in cache and just do big periodic dumps to the drives. That really
> > reduces seek overhead for the writes. Using SSDs for the journals
> > accomplishes much of the same effect, and lets you get faster large IO
> > writes too, but in many chassis there is a density (and cost) trade-off.
> >
> > Hope this helps!
> >
> > Mark
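If SSDs do get added later, the journal placement Mark describes is just a per-OSD setting in ceph.conf. A hypothetical fragment (the partition path, journal size, and thread count are made up for illustration, not from this thread):

```ini
[osd]
    ; journal size in MB, shared default for all OSDs
    osd journal size = 5120
    ; bobtail-era tuning from the article Mark links above:
    ; more op threads can help small-object reads
    osd op threads = 8

[osd.0]
    ; point this OSD's journal at an SSD partition instead of a
    ; file on the data disk (hypothetical device path)
    osd journal = /dev/sdj1
```

Each OSD would get its own `[osd.N]` section pointing at its own SSD partition; the OSD must be stopped and its journal flushed/recreated before switching.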
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com