On Fri, 6 Nov 2015, Chen, Xiaoxi wrote:
> Can we simplify the case, since cephFS and RGW both have a dedicated
> metadata pool? ---- Then we can solve this in deployment: using OSDs
> with the keyvaluestore backend (on SSD) for that pool should be a best
> fit.

I think that's a good approach for the current code (FileStore and/or
KeyValueStore).  But for NewStore I'd like to solve this problem
directly so that it can be used for both cases.  Rocksdb has a
mechanism for moving lower-level ssts to a slower device based on a
total size threshold on the main device; hopefully this can be used so
that we can give it both an ssd and an hdd.  (A rough sketch is at the
bottom of this mail, below the quoted thread.)

sage

> Thus for New-Newstore, we just focus on the data pool?
>
> From: Sage Weil [mailto:sage@xxxxxxxxxxxx]
> Sent: Friday, November 6, 2015 1:11 AM
> To: Ning Yao; Chen, Xiaoxi
> Cc: Xue, Chendi; Samuel Just; ceph-devel@xxxxxxxxxxxxxxx
> Subject: Re: Specify omap path for filestore
>
> Yes.  The hard part here in my view is the allocation of space between
> ssd and hdd when the amount of omap data can vary widely, from very
> little for rbd to the entire pool for rgw indexes or cephfs metadata.
>
> sage
>
> On November 5, 2015 11:33:48 AM GMT+01:00, Ning Yao <zay11022@xxxxxxxxx> wrote:
>
> Agreed!  It really depends on the use case.  But the SSD is still not
> heavily loaded under a small-write use case; on this point, I would
> assume the newstore overlay does much better?  It seems we could go
> further with NewStore and let the store use the raw device directly
> based on onode_t and data_map (which can act as the inode of a
> filesystem), so that the whole HDD's iops go to real data without the
> interference of the filesystem journal and inode get/set.
>
> Regards
> Ning Yao
>
> 2015-11-04 23:19 GMT+08:00 Chen, Xiaoxi <xiaoxi.chen@xxxxxxxxx>:
>
> Hi Ning,
>
> Yes, we don't save any IO, and may even need more IO because of read
> amplification by LevelDB.  But the tradeoff is using SSD IOPS instead
> of HDD IOPS: the cost per IOPS on an SSD (10K+ IOPS per $100) is two
> orders of magnitude lower than on an HDD (~100 IOPS per $100).
>
> Some use cases:
>
> 1. When we have enough load, moving any load off the HDD definitely
> helps.  Omap is the thing that can most easily be moved out to SSD;
> note that the omap workload is not intensive but random, which fits
> nicely on the SSD that is already working as the journal.
>
> 2. We could even set max_inline_xattr to 0, which forces all xattrs to
> omap (on SSD).  That reduces the inode size, so more inodes can be
> cached in memory.  Again, the SSD is more than fast enough for this,
> even when shared with the journal.
>
> 3. In the RGW case, we will have some container objects with tons of
> omap; moving that omap to SSD is a clear optimization.
>
> -Xiaoxi
>
> -----Original Message-----
> From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Xue, Chendi
> Sent: Wednesday, November 4, 2015 4:15 PM
> To: Ning Yao
> Cc: Samuel Just; ceph-devel@xxxxxxxxxxxxxxx
> Subject: RE: Specify omap path for filestore
>
> Hi, Ning
>
> Thanks for the advice; we did do the things you suggested in our
> performance tuning work.  Actually, tuning the usage of memory was the
> first thing we tried.
>
> Firstly, my guess is that the omap-to-SSD benefit only shows up under
> a quite intensive workload: we use 140 VMs doing randwrite at QD 8
> each, so we drive each HDD to 95%+ utilization.
>
> We hoped for, and tested, tuning up the inode memory size and the fd
> cache size, since I believe that if inodes can always be hit in
> memory, that definitely benefits more than using omap.
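>
> (Concretely, the knobs I mean are roughly the ceph.conf fragment
> below.  The option names are the ones from your mail further down; the
> values are only an illustration of the kind of settings we
> experimented with, not an exact record of our configuration:)
>
>   [osd]
>   # keep xattrs inline in the XFS inode instead of spilling to omap
>   filestore_max_inline_xattr_xfs = 10
>   filestore_max_inline_xattr_size_xfs = 65535
>   # cache more open fds / omap headers / object contexts in memory
>   filestore_fd_cache_size = 10240
>   filestore_omap_header_cache_size = 10240
>   osd_pg_object_context_cache_count = 64
>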
> Sadly, our server only has 32 GB of memory in total.  Even with the
> xattr size set to 65535 as originally configured and the fd cache size
> set to 10240, as I remember, we still gained only a little performance
> and it could lead to OOM of the OSD.  That is why we came up with the
> solution of moving omap out to an SSD device.
>
> Another reason to move omap out is that it helps performance analysis:
> omap goes through the key/value store (leveldb), and each rbd request
> causes one or more 4k inode operations, which leads to a
> frontend-to-backend throughput ratio of 1:5.8, and that 5.8 is not
> easy to explain.
>
> Also, we can get more randwrite iops if there are no sequential writes
> to the HDD: when an HDD handles randwrite iops plus some omap
> (leveldb) writes, we only get about 175 write iops per HDD with the
> disk utilization nearly full; when the HDD handles only randwrite,
> without any omap writes, we get about 325 write iops per HDD at nearly
> full utilization.
>
> For the system data please refer to the url below; "omap on HDD" is
> before mapping omap to the other device, "omap on SSD" is after.
>
> http://xuechendi.github.io/data/
>
> Best regards,
> Chendi
>
> -----Original Message-----
> From: Ning Yao [mailto:zay11022@xxxxxxxxx]
> Sent: Wednesday, November 4, 2015 3:09 PM
> To: Xue, Chendi <chendi.xue@xxxxxxxxx>
> Cc: Samuel Just <sjust@xxxxxxxxxx>; ceph-devel@xxxxxxxxxxxxxxx
> Subject: Re: Specify omap path for filestore
>
> Hi, Chendi,
>
> I don't think it will be a big improvement compared with the normal
> way of using FileStore (enable filestore_max_inline_xattr_xfs and tune
> filestore_fd_cache_size, osd_pg_object_context_cache_count and
> filestore_omap_header_cache_size properly to achieve a high hit rate).
> Did you enable filestore_max_inline_xattr in the first test?  If not,
> the result may be reasonable.  In my previous test, I remember only
> about a 20%~30% improvement.  And can you also provide the cpu cost
> per op on the osd node?
>
> Regards
> Ning Yao
>
> 2015-10-30 10:04 GMT+08:00 Xue, Chendi <chendi.xue@xxxxxxxxx>:
>
> Hi, Sam
>
> Last week I introduced how we saw the benefit of moving omap to a
> separate device.  Here is the pull request:
>
> https://github.com/ceph/ceph/pull/6421
>
> I have tested redeploying and restarting the ceph cluster on my setup,
> and the code works fine.  One open question: do you think I should
> *DELETE* all the files under omap_path first?  I notice that if old pg
> data is left there, the osd daemon may run into chaos, but I am not
> sure whether the DELETE should be left to the user.  Any thoughts?
>
> I also paste some of the data I talked about, on the rbd-to-osd write
> iops ratio when doing randwrite to an rbd device.
>
> ====== Here is some data ======
>
> We use 4 clients, 35 VMs each, to test rbd randwrite.
> 4 osd physical nodes, each with 10 HDDs as osds and 2 SSDs as journals
> 2 replicas
> filestore_max_inline_xattr_xfs = 0
> filestore_max_inline_xattr_size_xfs = 0
>
> Before moving omap to a separate SSD, we saw a frontend-to-backend
> iops ratio of 1:5.8 (rbd side total iops 1206, hdd total iops 7034).
> As we discussed, the 5.8 consists of the 2 replica writes plus the
> inode and omap writes.
>
>   runid:       332
>   op_size:     4k
>   op_type:     randwrite
>   QD:          qd8
>   engine:      qemurbd
>   serverNum:   4
>   clientNum:   4
>   rbdNum:      140
>   runtime:     400 sec
>   fio_iops:    1206.000
>   fio_bw:      4.987 MB/s
>   fio_latency: 884.617 msec
>   osd_iops:    7034.975
>   osd_bw:      47.407 MB/s
>   osd_latency: 242.620 msec
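>
> (For reference, the second run below differs only in where the omap
> directory lives.  As a rough sketch of its ceph.conf: the omap path
> option name here, filestore_omap_backend_path, is only my shorthand
> for whatever the pull request ends up calling it, and the paths are
> placeholders.)
>
>   [osd]
>   filestore_max_inline_xattr_xfs = 0
>   filestore_max_inline_xattr_size_xfs = 0
>   # put the filestore omap (leveldb) directory on the SSD instead of
>   # leaving it under the osd data dir on the HDD
>   filestore_omap_backend_path = /mnt/ssd/osd.$id/omap
>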
> And after moving omap to a separate SSD, the frontend-to-backend ratio
> drops to 1:2.6 (rbd side total iops 5006, hdd total iops 13089).
>
>   runid:       326
>   op_size:     4k
>   op_type:     randwrite
>   QD:          qd8
>   engine:      qemurbd
>   serverNum:   4
>   clientNum:   4
>   rbdNum:      140
>   runtime:     400 sec
>   fio_iops:    5006.000
>   fio_bw:      19.822 MB/s
>   fio_latency: 222.296 msec
>   osd_iops:    13089.020
>   osd_bw:      82.897 MB/s
>   osd_latency: 482.203 msec
>
> Best regards,
> Chendi
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> in the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
> --
> Sent from Kaiten Mail. Please excuse my brevity.
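P.S.: To make the rocksdb mechanism I mentioned above concrete, I am
thinking of the db_paths option: sst files stay on the first path while
its total size is under the target, and the larger, colder levels spill
onto the later (slower) path once it is exceeded.  A rough, untested
sketch follows; the paths and size thresholds are made up for
illustration, and how newstore would actually wire this up is still an
open question.

  #include <cassert>
  #include "rocksdb/db.h"
  #include "rocksdb/options.h"

  int main() {
    rocksdb::Options opt;
    opt.create_if_missing = true;

    // Fast device: sst files live here while the total stays under the
    // ~10 GB target_size.
    opt.db_paths.emplace_back("/ssd/newstore/db", 10ULL << 30);
    // Slow device: once the ssd budget is exceeded, the larger, colder
    // levels are placed here instead.
    opt.db_paths.emplace_back("/hdd/newstore/db.slow", 1000ULL << 30);

    // The primary db dir (MANIFEST, CURRENT, info logs, and the WAL
    // unless wal_dir is set) stays on the fast device.
    rocksdb::DB* db = nullptr;
    rocksdb::Status s = rocksdb::DB::Open(opt, "/ssd/newstore/db", &db);
    assert(s.ok());

    delete db;
    return 0;
  }

The open part is still how to pick the ssd target size, given how widely
the omap share varies between rbd and the rgw index / cephfs metadata
cases discussed earlier in the thread.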