Re: Ceph block storage - block.db useless?

On 3/12/19 7:24 AM, Benjamin Zapiec wrote:
> Hello,
>
> I was wondering why my Ceph block.db is nearly empty, so I started
> to investigate.
>
> Ceph's recommendation is that block.db should be at least 4% of
> the size of block. So my OSD configuration looks like this:
>
> wal.db   - not explicitly specified
> block.db - 250GB of SSD storage
> block    - 6TB
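
For context, a layout like that would typically be created with ceph-volume along these lines (just a sketch; the device names are assumptions):

    ceph-volume lvm create --bluestore \
        --data /dev/sdb \
        --block.db /dev/nvme0n1p1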


By default we currently use four 256MB WAL buffers.  2GB should be enough, though in most cases you are better off just leaving the WAL on block.db as you did below.
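
Those buffers come from the RocksDB tuning BlueStore ships with; you can inspect it on a live OSD (osd.0 is an assumption, and the exact defaults may differ by release):

    # Relevant excerpt of the default bluestore_rocksdb_options:
    #   write_buffer_size=268435456   -> each WAL buffer is 256MB
    #   max_write_buffer_number=4     -> up to four buffers
    ceph daemon osd.0 config get bluestore_rocksdb_options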


> Since the WAL is written to block.db if no separate WAL device is
> specified, I didn't configure one. With a size of 250GB we are
> slightly above 4%.


The WAL will only use about 1GB of that, FWIW.



> So everything should be "fine". But the block.db only contains
> about 10GB of data.


If this is an RBD workload, that's quite possible as RBD tends to use far less metadata than RGW.
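
You can check exactly how much of the DB device is in use via the admin socket (a quick sanity check; osd.0 is an assumption, substitute your OSD id):

    # db_used_bytes vs. db_total_bytes shows DB usage; a non-zero
    # slow_used_bytes means DB data has spilled onto the slow device.
    ceph daemon osd.0 perf dump bluefs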



> I figured out that an object in block.db gets "amplified", so
> the space consumption is much higher than the object itself
> would need.


Data in the DB in general will suffer space amplification, and it gets worse the more levels in RocksDB you have, as multiple levels may hold copies of the same data from different points in time.  The bigger issue is that currently an entire level has to fit on the DB device.  I.e., if level 0 takes 1GB, level 1 takes 10GB, level 2 takes 100GB, and level 3 takes 1000GB, you will only get levels 0, 1, and 2 on a 250GB block.db.
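
For concreteness, with the RocksDB defaults BlueStore has shipped with (max_bytes_for_level_base = 256MB, level multiplier = 10; treat these numbers as assumptions and verify on your release), the level sizes work out roughly to:

    L1:               256MB
    L2:  256MB * 10 = 2.56GB
    L3: 2.56GB * 10 = 25.6GB   (cumulative: ~28GB)
    L4: 25.6GB * 10 = 256GB    (cumulative: ~285GB)

A 250GB block.db fits everything through L3 (~28GB) but not L4, which is where the often-quoted ~30GB / ~300GB DB sizing steps come from.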



> I'm using Ceph as the storage backend for OpenStack, and raw images
> with a size of 10GB and more are common. So if I understand this
> correctly, I have to consider that a 10GB image may consume 100GB
> of block.db.


The DB holds metadata for the images (and some metadata for bluestore).  This is going to be a very small fraction of the overall data size but is really important.  Whenever we do a write to an object we first try to read some metadata about it (if it exists).  Having those read attempts happen quickly is really important to make sure that the write happens quickly.



> Besides the fact that an image may have a size of 100GB and
> images are only used for initial reads until all changed
> blocks get written to an SSD-only pool, I was asking myself
> whether I need a block.db at all, or whether it would be better
> to save the SSD space used for block.db and just create a
> 10GB wal.db.


See above.  Also, RocksDB periodically has to compact data, and with lots of metadata (and as a result lots of levels) it can get pretty slow.  Having RocksDB on fast storage helps speed that process up and avoids write stalls due to level-0 compaction (higher-level compactions can happen in alternate threads).
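
If compaction latency does become a problem, you can also trigger a full compaction manually during a quiet period (assuming osd.0; this just asks the OSD's RocksDB to compact):

    ceph daemon osd.0 compact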



> Has anyone done this before? Anyone who had sufficient SSD space
> but stuck with wal.db to save SSD space?
>
> If I'm correct, the block.db will never be used for huge images.
> And even though it may be used for one or two images, does this
> make sense? The images are used initially to read all unchanged
> blocks from them. After a while each VM should access the images
> pool less and less due to the changes made in the VM.


The DB is there primarily to store metadata.  RBD doesn't use a lot of space but may do a lot of reads from the DB if it can't keep all of the bluestore onodes in its own in-memory cache (the kv cache).  RGW uses the DB much more heavily, and in some cases you may see 40-50% space usage if you have tiny RGW objects (~4KB).  See this spreadsheet for more info:


https://drive.google.com/file/d/1Ews2WR-y5k3TMToAm0ZDsm7Gf_fwvyFw/view?usp=sharing
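
You can get a feel for whether the onode cache is keeping up from the BlueStore perf counters (a rough check; osd.0 and the exact counter names are assumptions that may vary by release):

    # Many more misses than hits means onode lookups are falling
    # through to RocksDB, i.e. to block.db.
    ceph daemon osd.0 perf dump bluestore | grep -E 'onode_(hits|misses)'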


Mark



> Any thoughts about this?
>
> Best regards


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com