Re: Collecting BlueStore per Object DB overhead

Primarily RGW usage: 270M objects, 857TB data/1195TB raw, EC 8+3 in the RGW data pool, and fewer than 200K objects in all other pools. OSDs 366 and 367 are NVMe OSDs; the rest are 10TB disks holding data/DB with a 2GB NVMe WAL partition each. The only things on the NVMe OSDs are the RGW metadata pools. Only 2 servers in the cluster are on BlueStore; the rest are still FileStore.

osd.319 _onodes_=164010 db_used_bytes=14433648640 avg_obj_size=23392454 overhead_per_obj=88004
osd.352 _onodes_=162395 db_used_bytes=12957253632 avg_obj_size=23440441 overhead_per_obj=79788
osd.357 _onodes_=159920 db_used_bytes=14039384064 avg_obj_size=24208736 overhead_per_obj=87790
osd.356 _onodes_=164420 db_used_bytes=13006536704 avg_obj_size=23155304 overhead_per_obj=79105
osd.355 _onodes_=164086 db_used_bytes=13021216768 avg_obj_size=23448898 overhead_per_obj=79356
osd.354 _onodes_=164665 db_used_bytes=13026459648 avg_obj_size=23357786 overhead_per_obj=79108
osd.353 _onodes_=164575 db_used_bytes=14099152896 avg_obj_size=23377114 overhead_per_obj=85670
osd.359 _onodes_=163922 db_used_bytes=13991149568 avg_obj_size=23397323 overhead_per_obj=85352
osd.358 _onodes_=164805 db_used_bytes=12706643968 avg_obj_size=23160121 overhead_per_obj=77101
osd.364 _onodes_=163009 db_used_bytes=14926479360 avg_obj_size=23552838 overhead_per_obj=91568
osd.365 _onodes_=163639 db_used_bytes=13615759360 avg_obj_size=23541130 overhead_per_obj=83206
osd.362 _onodes_=164505 db_used_bytes=13152288768 avg_obj_size=23324698 overhead_per_obj=79950
osd.363 _onodes_=164395 db_used_bytes=13104054272 avg_obj_size=23157437 overhead_per_obj=79710
osd.360 _onodes_=163484 db_used_bytes=14292090880 avg_obj_size=23347543 overhead_per_obj=87421
osd.361 _onodes_=164140 db_used_bytes=12977176576 avg_obj_size=23498778 overhead_per_obj=79061
osd.366 _onodes_=1516 db_used_bytes=7509901312 avg_obj_size=5743370 overhead_per_obj=4953760
osd.367 _onodes_=1435 db_used_bytes=7992246272 avg_obj_size=6419719 overhead_per_obj=5569509
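
For reference, overhead_per_obj above is just db_used_bytes divided by the onode count (osd.319: 14433648640 / 164010 ≈ 88004 bytes). Below is a minimal sketch of that calculation against a live OSD; the counter locations (bluefs/db_used_bytes and bluestore/bluestore_onodes in 'ceph daemon osd.N perf dump') are my assumption of what the gist reads and may differ per release.

#!/usr/bin/env python3
# Rough sketch (not the gist verbatim): derive overhead_per_obj for one
# or more OSDs from their admin sockets. Run on the OSD host with access
# to /var/run/ceph, e.g.: ./overhead.py 319 352
# The counter names below are assumptions; verify them on your release.
import json
import subprocess
import sys

def perf_dump(osd_id):
    out = subprocess.check_output(
        ['ceph', 'daemon', 'osd.%d' % osd_id, 'perf', 'dump'])
    return json.loads(out.decode('utf-8'))

def overhead_per_obj(osd_id):
    perf = perf_dump(osd_id)
    onodes = perf['bluestore']['bluestore_onodes']  # object (onode) count
    db_used = perf['bluefs']['db_used_bytes']       # RocksDB bytes in use
    return db_used // onodes if onodes else 0

if __name__ == '__main__':
    for osd in sys.argv[1:]:
        print('osd.%s overhead_per_obj=%d' % (osd, overhead_per_obj(int(osd))))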

On Tue, May 1, 2018 at 1:57 AM Wido den Hollander <wido@xxxxxxxx> wrote:


On 04/30/2018 10:25 PM, Gregory Farnum wrote:
>
>
> On Thu, Apr 26, 2018 at 11:36 AM Wido den Hollander <wido@xxxxxxxx> wrote:
>
>     Hi,
>
>     I've been investigating the per object overhead for BlueStore, as I've
>     seen this become a topic for a lot of people who want to store a lot
>     of small objects in Ceph using BlueStore.
>
>     I've written a piece of Python code which can be run on a server
>     running OSDs and will print the overhead.
>
>     https://gist.github.com/wido/b1328dd45aae07c45cb8075a24de9f1f
>
>     Feedback on this script is welcome, but also the output of what people
>     are observing.
>
>     The results from my tests are below, but what I see is that the overhead
>     seems to range from 10kB to 30kB per object.
>
>     On RBD-only clusters the overhead seems to be around 11kB, but on
>     clusters with a RGW workload the overhead goes higher to 20kB.
>
>
> This change seems implausible as RGW always writes full objects, whereas
> RBD will frequently write pieces of them and do overwrites.
> I'm not sure what all knobs are available and which diagnostics
> BlueStore exports, but is it possible you're looking at the total
> RocksDB data store rather than the per-object overhead? The distinction
> here being that the RocksDB instance will also store "client" (ie, RGW)
> omap data and xattrs, in addition to the actual BlueStore onodes.

Yes, that is possible. But in the end, the number of onodes is the number
of objects you store, and you want to know how many bytes the RocksDB
database uses for them.

I do agree that RGW doesn't do partial writes and has more metadata, but
eventually that all has to be stored.
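
If we want to see where those bytes actually sit, one option (with the OSD stopped) is to count keys per RocksDB prefix. A rough sketch below; it assumes ceph-kvstore-tool's bluestore-kv backend (Luminous+), 'list' output of one "<prefix>\t<key>" pair per line, and the usual prefixes (O = object/onode, M = omap, C = collection), so treat both the invocation and the prefix meanings as things to verify on your release.

#!/usr/bin/env python3
# Count RocksDB keys per prefix on a *stopped* OSD, to get a feel for
# how much of the DB is onodes versus omap and other metadata.
# This counts keys, not bytes, but it shows what dominates.
# Usage (hypothetical path): ./prefix_count.py /var/lib/ceph/osd/ceph-319
import collections
import subprocess
import sys

def prefix_counts(osd_path):
    out = subprocess.check_output(
        ['ceph-kvstore-tool', 'bluestore-kv', osd_path, 'list'])
    counts = collections.Counter()
    for line in out.decode('utf-8', 'replace').splitlines():
        if line:
            counts[line.split('\t', 1)[0]] += 1
    return counts

if __name__ == '__main__':
    for prefix, count in sorted(prefix_counts(sys.argv[1]).items()):
        print('%-2s %d' % (prefix, count))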

We just need to come up with some good numbers on how to size the DB.

Currently I assume a 10GB:1TB ratio and that is working out, but with
people wanting to use 12TB disks we need to pin those numbers down even
further. Otherwise you will need a lot of SSD space if you want to store
the DB on SSD.
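
To make that concrete, a quick back-of-the-envelope sketch comparing both approaches for a 12TB disk; the 10-30kB overhead and the 50kB object size are just the figures discussed in this thread, not new measurements.

#!/usr/bin/env python3
# Back-of-the-envelope DB sizing: compare the 10GB-per-1TB rule of thumb
# with a per-object estimate at the 30kB worst case seen in this thread.
GB = 1024 ** 3
TB = 1024 ** 4

def db_by_ratio(disk_tb, gb_per_tb=10):
    return disk_tb * gb_per_tb * GB

def db_by_objects(num_objects, overhead_bytes):
    return num_objects * overhead_bytes

if __name__ == '__main__':
    disk_tb = 12
    print('ratio rule:   %d GB' % (db_by_ratio(disk_tb) / GB))
    # The same disk filled with 50kB objects at 30kB overhead each:
    objects = int(disk_tb * TB / (50 * 1024))
    print('50kB objects: %d' % objects)
    print('per-object:   %d GB' % (db_by_objects(objects, 30 * 1024) / GB))

The gap between those two numbers is exactly why the 10GB:1TB rule stops working once the average object gets small.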

Wido

> -Greg
>  
>
>
>     I know that partial overwrites and appends contribute to higher overhead
>     on objects and I'm trying to investigate this and share my information
>     with the community.
>
>     I have two use-cases that want to store >2 billion objects with an avg
>     object size of 50kB (8 - 80kB), and the RocksDB overhead is likely to
>     become a big problem.
>
>     Anybody willing to share the overhead they are seeing with what
>     use-case?
>
>     The more data we have on this the better we can estimate how DBs need to
>     be sized for BlueStore deployments.
>
>     Wido
>
>     # Cluster #1
>     osd.25 _onodes_=178572 db_used_bytes=2188378112 avg_obj_size=6196529 overhead=12254
>     osd.20 _onodes_=209871 db_used_bytes=2307915776 avg_obj_size=5452002 overhead=10996
>     osd.10 _onodes_=195502 db_used_bytes=2395996160 avg_obj_size=6013645 overhead=12255
>     osd.30 _onodes_=186172 db_used_bytes=2393899008 avg_obj_size=6359453 overhead=12858
>     osd.1 _onodes_=169911 db_used_bytes=1799356416 avg_obj_size=4890883 overhead=10589
>     osd.0 _onodes_=199658 db_used_bytes=2028994560 avg_obj_size=4835928 overhead=10162
>     osd.15 _onodes_=204015 db_used_bytes=2384461824 avg_obj_size=5722715 overhead=11687
>
>     # Cluster #2
>     osd.1 _onodes_=221735 db_used_bytes=2773483520 avg_obj_size=5742992 overhead_per_obj=12508
>     osd.0 _onodes_=196817 db_used_bytes=2651848704 avg_obj_size=6454248 overhead_per_obj=13473
>     osd.3 _onodes_=212401 db_used_bytes=2745171968 avg_obj_size=6004150 overhead_per_obj=12924
>     osd.2 _onodes_=185757 db_used_bytes=3567255552 avg_obj_size=5359974 overhead_per_obj=19203
>     osd.5 _onodes_=198822 db_used_bytes=3033530368 avg_obj_size=6765679 overhead_per_obj=15257
>     osd.4 _onodes_=161142 db_used_bytes=2136997888 avg_obj_size=6377323 overhead_per_obj=13261
>     osd.7 _onodes_=158951 db_used_bytes=1836056576 avg_obj_size=5247527 overhead_per_obj=11551
>     osd.6 _onodes_=178874 db_used_bytes=2542796800 avg_obj_size=6539688 overhead_per_obj=14215
>     osd.9 _onodes_=195166 db_used_bytes=2538602496 avg_obj_size=6237672 overhead_per_obj=13007
>     osd.8 _onodes_=203946 db_used_bytes=3279945728 avg_obj_size=6523555 overhead_per_obj=16082
>
>     # Cluster 3
>     osd.133 _onodes_=68558 db_used_bytes=15868100608 avg_obj_size=14743206 overhead_per_obj=231455
>     osd.132 _onodes_=60164 db_used_bytes=13911457792 avg_obj_size=14539445 overhead_per_obj=231225
>     osd.137 _onodes_=62259 db_used_bytes=15597568000 avg_obj_size=15138484 overhead_per_obj=250527
>     osd.136 _onodes_=70361 db_used_bytes=14540603392 avg_obj_size=13729154 overhead_per_obj=206657
>     osd.135 _onodes_=68003 db_used_bytes=12285116416 avg_obj_size=12877744 overhead_per_obj=180655
>     osd.134 _onodes_=64962 db_used_bytes=14056161280 avg_obj_size=15923550 overhead_per_obj=216375
>     osd.139 _onodes_=68016 db_used_bytes=20782776320 avg_obj_size=13619345 overhead_per_obj=305557
>     osd.138 _onodes_=66209 db_used_bytes=12850298880 avg_obj_size=14593418 overhead_per_obj=194086
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
