On 3/12/19 7:24 AM, Benjamin Zapiec wrote:
Hello, I was wondering why my block.db is nearly empty and I started to investigate. Ceph's recommendation is that block.db should be at least 4% of the size of block. So my OSD configuration looks like this:

    wal.db   - not explicitly specified
    block.db - 250GB of SSD storage
    block    - 6TB
By default we currently use four 256MB WAL buffers, so 2GB should be enough for a standalone WAL, though in most cases you are better off just leaving it on block.db as you did below.
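To spell out that arithmetic, here is a back-of-the-envelope sketch. The buffer count and size correspond to the max_write_buffer_number=4 and write_buffer_size=268435456 settings shipped in the default bluestore_rocksdb_options, but treat the exact values as release-dependent:

    # Back-of-the-envelope WAL sizing from the rocksdb buffer settings.
    # Assumes up to `buffers` write buffers of `buffer_size` bytes can be
    # alive at once; defaults may differ between Ceph releases.
    buffer_size = 256 * 1024**2   # write_buffer_size = 256MB
    buffers = 4                   # max_write_buffer_number = 4

    active_wal = buffers * buffer_size
    print(f"WAL working set: ~{active_wal / 1024**3:.0f}GB")          # ~1GB
    print(f"Comfortable wal.db size: ~{2 * active_wal / 1024**3:.0f}GB")  # 2GB headroom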
Since the WAL is written to block.db if no separate WAL device is configured, I didn't configure one. With a size of 250GB we are slightly above 4%.
The WAL will only use about 1GB of that, FWIW.
So everything should be "fine". But the block.db only contains about 10GB of data.
If this is an RBD workload, that's quite possible as RBD tends to use far less metadata than RGW.
I figured out that an object in block.db gets "amplified", so the space consumption is much higher than what the object itself would need.
Data in the DB in general will suffer space amplification, and it gets worse the more levels in rocksdb you have, as multiple levels may hold copies of the same data from different points in time. The bigger issue is that currently an entire level has to fit on the DB device, i.e. if level 0 takes 1GB, level 1 takes 10GB, level 2 takes 100GB, and level 3 takes 1000GB, you will only get levels 0, 1, and 2 on a 250GB block.db.
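As a concrete sketch of that whole-level-must-fit rule, using the illustrative 1GB level 0 and 10x multiplier from the example above (not measured values):

    def levels_on_db(db_bytes, level0_bytes=1 * 1024**3, multiplier=10):
        """Return the rocksdb level sizes that fit entirely on the DB
        device, given a level is only placed there if all of it fits."""
        levels, used, size = [], 0, level0_bytes
        while used + size <= db_bytes:
            levels.append(size)
            used += size
            size *= multiplier
        return levels

    for i, size in enumerate(levels_on_db(250 * 1024**3)):
        print(f"level {i}: {size / 1024**3:.0f}GB on block.db")
    # levels 0-2 (1 + 10 + 100 GB) fit; level 3 (1000GB) spills to the slow device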
I'm using Ceph as the storage backend for OpenStack, and raw images with a size of 10GB and more are common. So if I understand this correctly, I have to consider that a 10GB image may consume 100GB of block.db.
The DB holds metadata for the images (and some metadata for bluestore). This is going to be a very small fraction of the overall data size but is really important. Whenever we do a write to an object we first try to read some metadata about it (if it exists). Having those read attempts happen quickly is really important to make sure that the write happens quickly.
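To illustrate why those metadata reads sit directly on the write path, here is a purely conceptual sketch (not BlueStore's actual code; the dicts stand in for the in-memory onode cache and rocksdb):

    onode_cache = {}   # stand-in for BlueStore's in-memory onode/kv cache
    kv_store = {}      # stand-in for rocksdb on block.db

    def write_object(name, data):
        onode = onode_cache.get(name)
        if onode is None:                 # cache miss:
            onode = kv_store.get(name)    # DB read happens *before* the write
            if onode is None:
                onode = {"size": 0}       # brand-new object
            onode_cache[name] = onode
        onode["size"] = len(data)
        kv_store[name] = onode            # metadata update lands in rocksdb
        # the object data itself would go to the block device here

If that kv_store lookup is slow because the metadata lives on spinning disk, every write that misses the cache is slow too.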
Besides the fact that the images may have a size of 100GB and are only used for initial reads until all changed blocks get written to an SSD-only pool, I was asking myself whether I need a block.db at all, and whether it would be better to save the SSD space used for block.db and just create a 10GB wal.db instead.
See above. Also, rocksdb periodically has to compact data, and with lots of metadata (and as a result lots of levels) it can get pretty slow. Having rocksdb on fast storage helps speed that process up and avoid write stalls due to level-0 compaction (higher-level compactions can happen in alternate threads).
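If you do want to experiment with compaction parallelism, the relevant knobs are plain rocksdb options passed through bluestore_rocksdb_options. A sketch of assembling that string follows; the option names are standard rocksdb tunables, but whether to change them depends on your release and workload:

    # Assemble a bluestore_rocksdb_options value for ceph.conf.
    # Illustrative only -- verify option names against your Ceph release.
    opts = {
        "max_background_compactions": 4,   # parallel compactions above level 0
        "max_write_buffer_number": 4,      # the WAL buffers discussed above
        "write_buffer_size": 268435456,    # 256MB per buffer
    }
    print(",".join(f"{k}={v}" for k, v in opts.items()))
    # paste the output into ceph.conf as bluestore_rocksdb_options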
Has anyone done this before? Anyone who had sufficient SSD space but stuck with just a wal.db to save SSD space? If I'm correct, block.db will never be used for huge images. And even if it were used for one or two images, would this make sense? The images are used initially to read all unchanged blocks; after a while each VM should access the images pool less and less due to the changes made in the VM.
The DB is there primarily to store metadata. RBD doesn't use a lot of space but may do a lot of reads from the DB if it can't keep all of the bluestore onodes in its own in-memory cache (the kv cache). RGW uses the DB much more heavily, and in some cases you may see 40-50% space usage if you have tiny RGW objects (~4KB).
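To get a feel for how object size drives that fraction, here is a rough estimator. The per-object metadata overhead below is a placeholder assumption for illustration, not a measured Ceph figure:

    def db_fraction(object_bytes, metadata_bytes):
        """Fraction of an object's total footprint that is DB metadata."""
        return metadata_bytes / (object_bytes + metadata_bytes)

    meta = 4 * 1024   # ASSUMED ~4KB of onode/omap metadata per object
    for size in (4 * 1024, 64 * 1024, 4 * 1024**2):
        print(f"{size // 1024}KB objects -> "
              f"{db_fraction(size, meta):.1%} of footprint in the DB")
    # tiny 4KB objects -> ~50%; 4MB RBD-sized objects -> ~0.1%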
See this spreadsheet for more info: https://drive.google.com/file/d/1Ews2WR-y5k3TMToAm0ZDsm7Gf_fwvyFw/view?usp=sharing

Mark
Any thoughts about this?

Best regards