On Fri, 6 Nov 2015, Chen, Xiaoxi wrote:
> Can we simplify the case, since cephFS and RGW both have a dedicated
> metadata pool? ---- Then we can solve this in deployment: using OSDs
> with the keyvaluestore backend (on SSD) for that pool should be a best
> fit.

I think that's a good approach for the current code (FileStore and/or
KeyValueStore).  But for NewStore I'd like to solve this problem
directly so that it can be used for both cases.  Rocksdb has a
mechanism for moving lower-level ssts to a slower device based on a
total size threshold on the main device; hopefully this can be used so
that we can give it both an ssd and an hdd.  (A rough sketch is at the
bottom of this mail, below the quoted thread.)

sage

> Thus for New-Newstore, we just focus on the data pool?
>
> From: Sage Weil [mailto:sage@xxxxxxxxxxxx]
> Sent: Friday, November 6, 2015 1:11 AM
> To: Ning Yao; Chen, Xiaoxi
> Cc: Xue, Chendi; Samuel Just; ceph-devel@xxxxxxxxxxxxxxx
> Subject: Re: Specify omap path for filestore
>
> Yes.  The hard part here in my view is the allocation of space between
> ssd and hdd when the amount of omap data can vary widely, from very
> little for rbd to the entire pool for rgw indexes or cephfs metadata.
>
> sage
>
> On November 5, 2015 11:33:48 AM GMT+01:00, Ning Yao <zay11022@xxxxxxxxx> wrote:
>
> Agreed!  It really depends on the use case.  But the SSD is still not
> heavily loaded under a small-write use case; on this point, I would
> assume the newstore overlay does much better?  It seems we could go
> further with NewStore and let the store use the raw device directly
> based on onode_t and data_map (which can act as the inode of a
> filesystem), so that the whole HDD's iops go to real data without the
> interference of the filesystem journal and inode get/set.
>
> Regards
> Ning Yao
>
> 2015-11-04 23:19 GMT+08:00 Chen, Xiaoxi <xiaoxi.chen@xxxxxxxxx>:
>
> Hi Ning,
>
> Yes, we don't save any IO, and may even need more IO because of read
> amplification by LevelDB.  But the tradeoff is using SSD IOPS instead
> of HDD IOPS: the cost per IOPS on an SSD (10K+ IOPS per $100) is two
> orders of magnitude lower than on an HDD (~100 IOPS per $100).
>
> Some use cases:
>
> 1. When we have enough load, moving any load off the HDD definitely
> helps.  Omap is the thing that can most easily be moved out to SSD;
> note that the omap workload is not intensive but random, which fits
> nicely on the SSD that is already working as the journal.
>
> 2. We could even set max_inline_xattr to 0, which forces all xattrs to
> omap (on SSD).  That reduces the inode size, so more inodes can be
> cached in memory.  Again, the SSD is more than fast enough for this,
> even when shared with the journal.
>
> 3. In the RGW case, we will have some container objects with tons of
> omap; moving that omap to SSD is a clear optimization.
>
> -Xiaoxi
>
> -----Original Message-----
> From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Xue, Chendi
> Sent: Wednesday, November 4, 2015 4:15 PM
> To: Ning Yao
> Cc: Samuel Just; ceph-devel@xxxxxxxxxxxxxxx
> Subject: RE: Specify omap path for filestore
>
> Hi, Ning
>
> Thanks for the advice; we did do the things you suggested in our
> performance tuning work.  Actually, tuning the usage of memory was the
> first thing we tried.
>
> Firstly, my guess is that the omap-to-SSD benefit only shows up under
> a quite intensive workload: we use 140 VMs doing randwrite at QD 8
> each, so we drive each HDD to 95%+ utilization.
>
> We hoped for, and tested, tuning up the inode memory size and the fd
> cache size, since I believe that if inodes can always be hit in
> memory, that definitely benefits more than using omap.
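>
> (Concretely, the knobs I mean are roughly the ceph.conf fragment
> below.  The option names are the ones from your mail further down; the
> values are only an illustration of the kind of settings we
> experimented with, not an exact record of our configuration:)
>
>   [osd]
>   # keep xattrs inline in the XFS inode instead of spilling to omap
>   filestore_max_inline_xattr_xfs = 10
>   filestore_max_inline_xattr_size_xfs = 65535
>   # cache more open fds / omap headers / object contexts in memory
>   filestore_fd_cache_size = 10240
>   filestore_omap_header_cache_size = 10240
>   osd_pg_object_context_cache_count = 64
>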
> Sadly, our server only has 32 GB of memory in total.  Even with the
> xattr size set to 65535 as originally configured and the fd cache size
> set to 10240, as I remember, we still gained only a little performance
> and it could lead to OOM of the OSD.  That is why we came up with the
> solution of moving omap out to an SSD device.
>
> Another reason to move omap out is that it helps performance analysis:
> omap goes through the key/value store (leveldb), and each rbd request
> causes one or more 4k inode operations, which leads to a
> frontend-to-backend throughput ratio of 1:5.8, and that 5.8 is not
> easy to explain.
>
> Also, we can get more randwrite iops if there are no sequential writes
> to the HDD: when an HDD handles randwrite iops plus some omap
> (leveldb) writes, we only get about 175 write iops per HDD with the
> disk utilization nearly full; when the HDD handles only randwrite,
> without any omap writes, we get about 325 write iops per HDD at nearly
> full utilization.
>
> For the system data please refer to the url below; "omap on HDD" is
> before mapping omap to the other device, "omap on SSD" is after.
>
> http://xuechendi.github.io/data/
>
> Best regards,
> Chendi
>
> -----Original Message-----
> From: Ning Yao [mailto:zay11022@xxxxxxxxx]
> Sent: Wednesday, November 4, 2015 3:09 PM
> To: Xue, Chendi <chendi.xue@xxxxxxxxx>
> Cc: Samuel Just <sjust@xxxxxxxxxx>; ceph-devel@xxxxxxxxxxxxxxx
> Subject: Re: Specify omap path for filestore
>
> Hi, Chendi,
>
> I don't think it will be a big improvement compared with the normal
> way of using FileStore (enable filestore_max_inline_xattr_xfs and tune
> filestore_fd_cache_size, osd_pg_object_context_cache_count and
> filestore_omap_header_cache_size properly to achieve a high hit rate).
> Did you enable filestore_max_inline_xattr in the first test?  If not,
> the result may be reasonable.  In my previous test, I remember only
> about a 20%~30% improvement.  And can you also provide the cpu cost
> per op on the osd node?
>
> Regards
> Ning Yao
>
> 2015-10-30 10:04 GMT+08:00 Xue, Chendi <chendi.xue@xxxxxxxxx>:
>
> Hi, Sam
>
> Last week I introduced how we saw the benefit of moving omap to a
> separate device.  Here is the pull request:
>
> https://github.com/ceph/ceph/pull/6421
>
> I have tested redeploying and restarting the ceph cluster on my setup,
> and the code works fine.  One open question: do you think I should
> *DELETE* all the files under omap_path first?  I notice that if old pg
> data is left there, the osd daemon may run into chaos, but I am not
> sure whether the DELETE should be left to the user.  Any thoughts?
>
> I also paste some of the data I talked about, on the rbd-to-osd write
> iops ratio when doing randwrite to an rbd device.
>
> ====== Here is some data ======
>
> We use 4 clients, 35 VMs each, to test rbd randwrite.
> 4 osd physical nodes, each with 10 HDDs as osds and 2 SSDs as journals
> 2 replicas
> filestore_max_inline_xattr_xfs = 0
> filestore_max_inline_xattr_size_xfs = 0
>
> Before moving omap to a separate SSD, we saw a frontend-to-backend
> iops ratio of 1:5.8 (rbd side total iops 1206, hdd total iops 7034).
> As we discussed, the 5.8 consists of the 2 replica writes plus the
> inode and omap writes.
>
>   runid:       332
>   op_size:     4k
>   op_type:     randwrite
>   QD:          qd8
>   engine:      qemurbd
>   serverNum:   4
>   clientNum:   4
>   rbdNum:      140
>   runtime:     400 sec
>   fio_iops:    1206.000
>   fio_bw:      4.987 MB/s
>   fio_latency: 884.617 msec
>   osd_iops:    7034.975
>   osd_bw:      47.407 MB/s
>   osd_latency: 242.620 msec
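>
> (For reference, the second run below differs only in where the omap
> directory lives.  As a rough sketch of its ceph.conf: the omap path
> option name here, filestore_omap_backend_path, is only my shorthand
> for whatever the pull request ends up calling it, and the paths are
> placeholders.)
>
>   [osd]
>   filestore_max_inline_xattr_xfs = 0
>   filestore_max_inline_xattr_size_xfs = 0
>   # put the filestore omap (leveldb) directory on the SSD instead of
>   # leaving it under the osd data dir on the HDD
>   filestore_omap_backend_path = /mnt/ssd/osd.$id/omap
>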
> And after moving omap to a separate SSD, the frontend-to-backend ratio
> drops to 1:2.6 (rbd side total iops 5006, hdd total iops 13089).
>
>   runid:       326
>   op_size:     4k
>   op_type:     randwrite
>   QD:          qd8
>   engine:      qemurbd
>   serverNum:   4
>   clientNum:   4
>   rbdNum:      140
>   runtime:     400 sec
>   fio_iops:    5006.000
>   fio_bw:      19.822 MB/s
>   fio_latency: 222.296 msec
>   osd_iops:    13089.020
>   osd_bw:      82.897 MB/s
>   osd_latency: 482.203 msec
>
> Best regards,
> Chendi
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> in the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
> --
> Sent from Kaiten Mail. Please excuse my brevity.
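P.S.: To make the rocksdb mechanism I mentioned above concrete, I am
thinking of the db_paths option: sst files stay on the first path while
its total size is under the target, and the larger, colder levels spill
onto the later (slower) path once it is exceeded.  A rough, untested
sketch follows; the paths and size thresholds are made up for
illustration, and how newstore would actually wire this up is still an
open question.

  #include <cassert>
  #include "rocksdb/db.h"
  #include "rocksdb/options.h"

  int main() {
    rocksdb::Options opt;
    opt.create_if_missing = true;

    // Fast device: sst files live here while the total stays under the
    // ~10 GB target_size.
    opt.db_paths.emplace_back("/ssd/newstore/db", 10ULL << 30);
    // Slow device: once the ssd budget is exceeded, the larger, colder
    // levels are placed here instead.
    opt.db_paths.emplace_back("/hdd/newstore/db.slow", 1000ULL << 30);

    // The primary db dir (MANIFEST, CURRENT, info logs, and the WAL
    // unless wal_dir is set) stays on the fast device.
    rocksdb::DB* db = nullptr;
    rocksdb::Status s = rocksdb::DB::Open(opt, "/ssd/newstore/db", &db);
    assert(s.ok());

    delete db;
    return 0;
  }

The open part is still how to pick the ssd target size, given how widely
the omap share varies between rbd and the rgw index / cephfs metadata
cases discussed earlier in the thread.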