On 3/12/19 7:24 AM, Benjamin Zapiec wrote:
Hello, I was wondering why my block.db is nearly empty and I started to investigate. Ceph's recommendation is that block.db should be at least 4% of the size of block. So my OSD configuration looks like this:

    wal.db   - not explicitly specified
    block.db - 250GB of SSD storage
    block    - 6TB
By default we currently use four 256MB WAL buffers, so 2GB should be enough for a standalone WAL, though in most cases you are better off just leaving it on block.db as you did below.
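To spell out that arithmetic, here is a back-of-the-envelope sketch. The buffer count and size correspond to the max_write_buffer_number=4 and write_buffer_size=268435456 settings shipped in the default bluestore_rocksdb_options, but treat the exact values as release-dependent:

    # Back-of-the-envelope WAL sizing from the rocksdb buffer settings.
    # Assumes up to `buffers` write buffers of `buffer_size` bytes can be
    # alive at once; defaults may differ between Ceph releases.
    buffer_size = 256 * 1024**2   # write_buffer_size = 256MB
    buffers = 4                   # max_write_buffer_number = 4

    active_wal = buffers * buffer_size
    print(f"WAL working set: ~{active_wal / 1024**3:.0f}GB")          # ~1GB
    print(f"Comfortable wal.db size: ~{2 * active_wal / 1024**3:.0f}GB")  # 2GB headroom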
Since the WAL is written to block.db if no separate WAL device is configured, I didn't configure one. With a size of 250GB we are slightly above 4%.
The WAL will only use about 1GB of that, FWIW.
So everything should be "fine". But the block.db only contains about 10GB of data.
If this is an RBD workload, that's quite possible as RBD tends to use far less metadata than RGW.
I figured out that an object in block.db gets "amplified", so the space consumption is much higher than what the object itself would need.
Data in the DB in general will suffer space amplification, and it gets worse the more levels in rocksdb you have, as multiple levels may hold copies of the same data from different points in time. The bigger issue is that currently an entire level has to fit on the DB device, i.e. if level 0 takes 1GB, level 1 takes 10GB, level 2 takes 100GB, and level 3 takes 1000GB, you will only get levels 0, 1, and 2 on a 250GB block.db.
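As a concrete sketch of that whole-level-must-fit rule, using the illustrative 1GB level 0 and 10x multiplier from the example above (not measured values):

    def levels_on_db(db_bytes, level0_bytes=1 * 1024**3, multiplier=10):
        """Return the rocksdb level sizes that fit entirely on the DB
        device, given a level is only placed there if all of it fits."""
        levels, used, size = [], 0, level0_bytes
        while used + size <= db_bytes:
            levels.append(size)
            used += size
            size *= multiplier
        return levels

    for i, size in enumerate(levels_on_db(250 * 1024**3)):
        print(f"level {i}: {size / 1024**3:.0f}GB on block.db")
    # levels 0-2 (1 + 10 + 100 GB) fit; level 3 (1000GB) spills to the slow device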
I'm using Ceph as the storage backend for OpenStack, and raw images with a size of 10GB and more are common. So if I understand this correctly, I have to consider that a 10GB image may consume 100GB of block.db.
The DB holds metadata for the images (and some metadata for bluestore). This is going to be a very small fraction of the overall data size but is really important. Whenever we do a write to an object we first try to read some metadata about it (if it exists). Having those read attempts happen quickly is really important to make sure that the write happens quickly.
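To illustrate why those metadata reads sit directly on the write path, here is a purely conceptual sketch (not BlueStore's actual code; the dicts stand in for the in-memory onode cache and rocksdb):

    onode_cache = {}   # stand-in for BlueStore's in-memory onode/kv cache
    kv_store = {}      # stand-in for rocksdb on block.db

    def write_object(name, data):
        onode = onode_cache.get(name)
        if onode is None:                 # cache miss:
            onode = kv_store.get(name)    # DB read happens *before* the write
            if onode is None:
                onode = {"size": 0}       # brand-new object
            onode_cache[name] = onode
        onode["size"] = len(data)
        kv_store[name] = onode            # metadata update lands in rocksdb
        # the object data itself would go to the block device here

If that kv_store lookup is slow because the metadata lives on spinning disk, every write that misses the cache is slow too.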
Besides the fact that the images may have a size of 100GB and are only used for initial reads until all changed blocks get written to an SSD-only pool, I was asking myself whether I need a block.db at all, and whether it would be better to save the SSD space used for block.db and just create a 10GB wal.db instead.
See above. Also, rocksdb periodically has to compact data, and with lots of metadata (and as a result lots of levels) it can get pretty slow. Having rocksdb on fast storage helps speed that process up and avoid write stalls due to level-0 compaction (higher-level compactions can happen in alternate threads).
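If you do want to experiment with compaction parallelism, the relevant knobs are plain rocksdb options passed through bluestore_rocksdb_options. A sketch of assembling that string follows; the option names are standard rocksdb tunables, but whether to change them depends on your release and workload:

    # Assemble a bluestore_rocksdb_options value for ceph.conf.
    # Illustrative only -- verify option names against your Ceph release.
    opts = {
        "max_background_compactions": 4,   # parallel compactions above level 0
        "max_write_buffer_number": 4,      # the WAL buffers discussed above
        "write_buffer_size": 268435456,    # 256MB per buffer
    }
    print(",".join(f"{k}={v}" for k, v in opts.items()))
    # paste the output into ceph.conf as bluestore_rocksdb_options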
Has anyone done this before? Anyone who had sufficient SSD space but stuck with just a wal.db to save SSD space? If I'm correct, block.db will never be used for huge images. And even if it were used for one or two images, would this make sense? The images are used initially to read all unchanged blocks; after a while each VM should access the images pool less and less due to the changes made in the VM.
The DB is there primarily to store metadata. RBD doesn't use a lot of space but may do a lot of reads from the DB if it can't keep all of the bluestore onodes in its own in-memory cache (the kv cache). RGW uses the DB much more heavily, and in some cases you may see 40-50% space usage if you have tiny RGW objects (~4KB).
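To get a feel for how object size drives that fraction, here is a rough estimator. The per-object metadata overhead below is a placeholder assumption for illustration, not a measured Ceph figure:

    def db_fraction(object_bytes, metadata_bytes):
        """Fraction of an object's total footprint that is DB metadata."""
        return metadata_bytes / (object_bytes + metadata_bytes)

    meta = 4 * 1024   # ASSUMED ~4KB of onode/omap metadata per object
    for size in (4 * 1024, 64 * 1024, 4 * 1024**2):
        print(f"{size // 1024}KB objects -> "
              f"{db_fraction(size, meta):.1%} of footprint in the DB")
    # tiny 4KB objects -> ~50%; 4MB RBD-sized objects -> ~0.1%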
See this spreadsheet for more info: https://drive.google.com/file/d/1Ews2WR-y5k3TMToAm0ZDsm7Gf_fwvyFw/view?usp=sharing

Mark
Any thoughts about this?

Best regards