We changed these settings. Our config is now:
bluestore_rocksdb_options = "compression=kSnappyCompression,max_write_buffer_number=16,min_write_buffer_number_to_merge=3,recycle_log_file_num=16,compaction_style=kCompactionStyleLevel,write_buffer_size=50331648,target_file_size_base=50331648,max_background_compactions=31,level0_file_num_compaction_trigger=4,level0_slowdown_writes_trigger=32,level0_stop_writes_trigger=64,num_levels=5,max_bytes_for_level_base=603979776,max_bytes_for_level_multiplier=10,compaction_threads=32,flusher_threads=8"
It can be changed without redeploying the OSDs. It changes the SST file sizes and when compaction is triggered. The additional improvement is Snappy compression; we rebuilt Ceph with support for it. I can create a PR for it if you want :)
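For reference, one way to roll out a change like this (a sketch, assuming a Mimic/Nautilus-style centralized config; on older releases put the line under [osd] in ceph.conf instead):

    ceph config set osd bluestore_rocksdb_options "<the option string above>"
    systemctl restart ceph-osd@<id>   # restart OSDs one at a time; RocksDB only re-reads these options when the OSD reopens the DB

As far as I understand, compression only applies to SST files written after the restart, so existing data gets compressed gradually as compaction rewrites it.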
Best Regards,
Rafał Wądołowski
Cloud & Security Engineer
On 25.06.2019 22:16, Christian Wuerdig wrote:
The sizes are determined by rocksdb settings - some details can be found here: https://tracker.ceph.com/issues/24361

One thing to note: in this thread http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-October/030775.html it's noted that rocksdb can use up to 100% extra space during compaction, so if you want to avoid spillover during compaction, safer values would be 6/60/600 GB.
You can change max_bytes_for_level_base and max_bytes_for_level_multiplier to suit your needs better, but I'm not sure if that can be changed on the fly or if you have to re-create the OSDs for the new values to apply.
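For what it's worth, the 3/30/300 GB (and 6/60/600 GB) figures roughly fall out of the default level sizing (assuming the stock defaults of max_bytes_for_level_base=268435456, i.e. 256 MB, and max_bytes_for_level_multiplier=10; a custom base or multiplier shifts these numbers accordingly):

    L1 ~ 256 MB
    L2 ~ 2.5 GB
    L3 ~ 25 GB
    L4 ~ 250 GB

A block.db partition only avoids spillover if it can hold a complete set of levels, so the useful sizes are roughly the cumulative sums (~0.3 GB, ~3 GB, ~30 GB, ~300 GB), and doubling those for compaction headroom gives the 6/60/600 GB suggestion above.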
On Tue, 25 Jun 2019 at 18:06, Rafał Wądołowski <rwadolowski@xxxxxxxxxxxxxx> wrote:
Why did you select these specific sizes? Are there any tests/research on it?
Best Regards,
Rafał Wądołowski
On 24.06.2019 13:05, Konstantin Shalygin wrote:
Hi,

I have been thinking a bit about rocksdb and EC pools: since a RADOS object written to an EC(k+m) pool is split into several smaller pieces, the OSD will receive many more small objects than it would in a replicated setup. This must mean that rocksdb also needs to handle that many more entries and will grow faster. This will have an impact when using bluestore on slow HDDs with the DB on SSD drives, where the faster-growing rocksdb might result in spillover to the slow store - if not taken into account when designing the disk layout. Are my thoughts on the right track or am I missing something? Has somebody done any measurements on rocksdb growth, comparing replica vs EC?

If you don't want to be affected by spillover of block.db, use a 3/30/300 GB partition for your block.db.
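As an aside, a quick way to check whether an OSD has already spilled over (a sketch; the exact counter names may vary between releases):

    ceph daemon osd.<id> perf dump | grep -E '"(db|slow)_(total|used)_bytes"'

If slow_used_bytes is greater than zero, part of the DB already lives on the slow device; newer releases also raise a dedicated BlueFS spillover health warning for this.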
k
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com