Re: BlueStore Cache Ratios

Hey Mark,

Thanks a lot for the info. You should really write this up and post it somewhere :)

First of all, I am sorry if I say something wrong; I am still learning about this topic and may well be speaking out of ignorance.

Second, I understand that the ratios are a way of controlling priorities: they ensure that bloom filters and indexes don't get paged out of the cache, which really makes sense.

Also, the 512MB restriction kind of makes sense, but I don't really know whether it would make sense to give more space to the rocksdb block cache (say, 1GB). I think only testing can answer that, because it really depends on the workload.

What I don't understand is why data isn't cached at all, even if there is free space for it. I understand the priority order would be bloom filters and indexes >> metadata >> data, but if there is space left over for data, why not use it? Maybe ratios of 0.90 k/v, 0.09 metadata and 0.01 data would make more sense, something like the sketch below.
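If I understand the options correctly, that split would look something like this in ceph.conf (option names as I read them for Luminous; the values are only my guess, untested):

  [osd]
  # total BlueStore cache per OSD on SSD (3 GB is the default)
  bluestore_cache_size_ssd = 3221225472
  # hypothetical split: 90% rocksdb k/v, 9% onode metadata, the remaining 1% data
  bluestore_cache_kv_ratio = 0.90
  bluestore_cache_meta_ratio = 0.09
  # the 512 MB k/v cap would also need raising, or the 90% share is never reached
  bluestore_cache_kv_max = 3221225472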


On 11/10/2017 at 15:44, Mark Nelson wrote:
Hi Jorge,

I was sort of responsible for all of this. :)

So basically there are different caches in different places:

- rocksdb cache
- rocksdb block cache (which can be configured to include filters and indexes)
- rocksdb compressed block cache
- bluestore onode cache

The bluestore onode cache is the only one that stores onode/extent/blob metadata before it is encoded, i.e. it's bigger but has a lower CPU impact. The next step down is the regular rocksdb block cache, where the data has already been encoded but is not compressed. Optionally, we could also compress the data and then cache it in rocksdb's compressed block cache.

Finally, rocksdb can set memory aside for bloom filters and indexes, but we configure those to go into the block cache so we get better accounting of how memory is being used (otherwise it's difficult to control how much memory the indexes and filters get). The downside is that bloom filters and indexes can theoretically get paged out under heavy cache pressure, so we set them to high priority in the block cache and also pin the L0 filters/indexes to help avoid this.
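In rocksdb terms, that filter/index handling corresponds to table options along these lines (a sketch from memory; the names come from rocksdb's BlockBasedTableOptions, and the exact string we ship in bluestore_rocksdb_options may differ):

  # rocksdb BlockBasedTableOptions (sketch, not the exact shipped defaults)
  cache_index_and_filter_blocks=true                      # charge filters/indexes to the block cache
  cache_index_and_filter_blocks_with_high_priority=true   # evict them last
  pin_l0_filter_and_index_blocks_in_cache=true            # never evict the L0 ones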

In the testing I did earlier this year, what I saw is that in low-memory scenarios it's almost always best to give all of the cache to rocksdb's block cache. Once you hit about the 512MB mark, we start seeing bigger gains from giving additional memory to bluestore's onode cache. So we devised a mechanism where you can decide where to cut over (see the sketch below). It's quite possible that on very fast CPUs it might make sense to use the rocksdb compressed cache, and if you have a huge number of objects these ratios might change. The values we have now were the best jack-of-all-trades values we found.
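The cutover knob itself is the k/v cap (512MB being the jack-of-all-trades default mentioned above):

  # k/v gets cache up to this cap; anything past it spills to the onode cache
  bluestore_cache_kv_max = 536870912   # 512 MB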

Mark

On 10/11/2017 08:32 AM, Jorge Pinilla López wrote:
Okay, thanks for the explanation. So from the 3GB of cache (the default
for SSD), only 0.5GB goes to k/v and 2.5GB goes to metadata.

Is there a way of knowing how much k/v, metadata and data is being stored,
and how full the cache is, so I can adjust my ratios? I was thinking of
ratios like 0.9 k/v, 0.07 meta, 0.03 data, but that is pure speculation; I
don't have any real data.

On 11/10/2017 at 14:32, Mohamad Gebai wrote:
Hi Jorge,

On 10/10/2017 07:23 AM, Jorge Pinilla López wrote:
Are the .99 KV, .01 metadata and .0 data ratios right? They seem a little
disproportionate.
Yes, this is correct.

Also, .99 KV with a 3GB cache for SSD means almost all 3GB would be used
for KV, but there is also another option called bluestore_cache_kv_max,
which defaults to 512MB. So what is the rest of the cache used for?
Nothing? Shouldn't kv_max be higher, or the KV ratio lower?
Anything over the *cache_kv_max value goes to the metadata cache. You
can look in your logs to see the final values of kv, metadata and data
cache ratios. To get data cache, you need to lower the ratios of
metadata and kv caches.
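With the SSD defaults from this thread, the split works out as below. You
can also inspect the live values through the OSD admin socket (commands
from memory; treat them as pointers rather than gospel):

  # default SSD split, using the values from this thread:
  #   total cache = 3 GB
  #   kv share    = min(0.99 * 3 GB, bluestore_cache_kv_max = 512 MB) = 512 MB
  #   meta share  = 3 GB - 512 MB = ~2.5 GB   (the kv overflow goes here)
  #   data share  = 0.0 * 3 GB = 0
  ceph daemon osd.0 dump_mempools                        # per-pool cache usage
  ceph daemon osd.0 config show | grep bluestore_cache   # effective values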

Mohamad





--

Jorge Pinilla López
jorpilo@xxxxxxxxx
Computer Engineering student
Systems area intern (SICUZ)
Universidad de Zaragoza
PGP-KeyID: A34331932EBC715A

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
