Re: Bluestore OSD_DATA, WAL & DB

> On 17 October 2017 at 14:21, Mark Nelson <mnelson@xxxxxxxxxx> wrote:
> 
> 
> 
> 
> On 10/17/2017 01:54 AM, Wido den Hollander wrote:
> >
> >> On 16 October 2017 at 18:14, Richard Hesketh <richard.hesketh@xxxxxxxxxxxx> wrote:
> >>
> >>
> >> On 16/10/17 13:45, Wido den Hollander wrote:
> >>>> On 26 September 2017 at 16:39, Mark Nelson <mnelson@xxxxxxxxxx> wrote:
> >>>> On 09/26/2017 01:10 AM, Dietmar Rieder wrote:
> >>>>> thanks David,
> >>>>>
> >>>>> that confirms what I was assuming. Too bad that there is no
> >>>>> estimate/method to calculate the db partition size.
> >>>>
> >>>> It's possible that we might be able to get ranges for certain kinds of
> >>>> scenarios.  Maybe if you do lots of small random writes on RBD, you can
> >>>> expect a typical metadata size of X per object.  Or maybe if you do lots
> >>>> of large sequential object writes in RGW, it's more like Y.  I think
> >>>> it's probably going to be tough to make it accurate for everyone though.
> >>>
> >>> So I did a quick test. I wrote 75,000 objects to a BlueStore device:
> >>>
> >>> root@alpha:~# ceph daemon osd.0 perf dump|jq '.bluestore.bluestore_onodes'
> >>> 75085
> >>> root@alpha:~#
> >>>
> >>> I then saw the RocksDB database was about 459MB in size:
> >>>
> >>> root@alpha:~# ceph daemon osd.0 perf dump|jq '.bluefs.db_used_bytes'
> >>> 459276288
> >>> root@alpha:~#
> >>>
> >>> 459276288 / 75085 = 6116
> >>>
> >>> So about 6KB of RocksDB data per object.
> >>>
> >>> Let's say I want to store 1M objects in a single OSD: I would then need ~6GB of DB space.
> >>>
> >>> Is this a safe assumption? Do you think that 6KB is normal? Low? High?
> >>>
> >>> There aren't many of these numbers out there for BlueStore right now so I'm trying to gather some numbers.
> >>>
> >>> Wido
> >>
> >> If I check for the same stats on OSDs in my production cluster I see similar but variable values:
> >>
> >> root@vm-ds-01:~/ceph-conf# for i in {0..9} ; do echo -n "osd.$i db per object: " ; expr `ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph daemon osd.$i perf dump | jq '.bluestore.bluestore_onodes'` ; done
> >> osd.0 db per object: 7490
> >> osd.1 db per object: 7523
> >> osd.2 db per object: 7378
> >> osd.3 db per object: 7447
> >> osd.4 db per object: 7233
> >> osd.5 db per object: 7393
> >> osd.6 db per object: 7074
> >> osd.7 db per object: 7967
> >> osd.8 db per object: 7253
> >> osd.9 db per object: 7680
> >>
> >> root@vm-ds-02:~# for i in {10..19} ; do echo -n "osd.$i db per object: " ; expr `ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph daemon osd.$i perf dump | jq '.bluestore.bluestore_onodes'` ; done
> >> osd.10 db per object: 5168
> >> osd.11 db per object: 5291
> >> osd.12 db per object: 5476
> >> osd.13 db per object: 4978
> >> osd.14 db per object: 5252
> >> osd.15 db per object: 5461
> >> osd.16 db per object: 5135
> >> osd.17 db per object: 5126
> >> osd.18 db per object: 9336
> >> osd.19 db per object: 4986
> >>
> >> root@vm-ds-03:~# for i in {20..29} ; do echo -n "osd.$i db per object: " ; expr `ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph daemon osd.$i perf dump | jq '.bluestore.bluestore_onodes'` ; done
> >> osd.20 db per object: 5115
> >> osd.21 db per object: 4844
> >> osd.22 db per object: 5063
> >> osd.23 db per object: 5486
> >> osd.24 db per object: 5228
> >> osd.25 db per object: 4966
> >> osd.26 db per object: 5047
> >> osd.27 db per object: 5021
> >> osd.28 db per object: 5321
> >> osd.29 db per object: 5150
> >>
> >> root@vm-ds-04:~# for i in {30..39} ; do echo -n "osd.$i db per object: " ; expr `ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph daemon osd.$i perf dump | jq '.bluestore.bluestore_onodes'` ; done
> >> osd.30 db per object: 6658
> >> osd.31 db per object: 6445
> >> osd.32 db per object: 6259
> >> osd.33 db per object: 6691
> >> osd.34 db per object: 6513
> >> osd.35 db per object: 6628
> >> osd.36 db per object: 6779
> >> osd.37 db per object: 6819
> >> osd.38 db per object: 6677
> >> osd.39 db per object: 6689
> >>
> >> root@vm-ds-05:~# for i in {40..49} ; do echo -n "osd.$i db per object: " ; expr `ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph daemon osd.$i perf dump | jq '.bluestore.bluestore_onodes'` ; done
> >> osd.40 db per object: 5335
> >> osd.41 db per object: 5203
> >> osd.42 db per object: 5552
> >> osd.43 db per object: 5188
> >> osd.44 db per object: 5218
> >> osd.45 db per object: 5157
> >> osd.46 db per object: 4956
> >> osd.47 db per object: 5370
> >> osd.48 db per object: 5117
> >> osd.49 db per object: 5313
> >>
> >> I'm not sure why there is so much variance (these nodes are basically identical). I think db_used_bytes includes the WAL, at least in my case, as I don't have a separate WAL device. I'm not sure how big the WAL is relative to the metadata, and hence how much it throws these numbers off, but ~6KB/object seems like a reasonable value for back-of-envelope calculations.
> >>
> >
> > Yes, judging from your numbers 6KB/object seems reasonable. More data points are welcome in this case.
> >
> > Some input from a BlueStore dev would be helpful as well, to check that we are not drawing the wrong conclusions here.
> >
> > Wido
> 
> I would be very careful about drawing too many conclusions given a 
> single snapshot in time, especially if there haven't been a lot of 
> partial object rewrites yet.  Just on the surface, 6KB/object feels low 
> (especially if they are moderately large objects), but perhaps if 
> they've never been rewritten this is a reasonable lower bound.  This is 
> important because things like 4MB RBD objects that are regularly 
> rewritten might behave a lot differently than RGW objects that are 
> written once and then never rewritten.
> 

Thanks for the feedback. Indeed, we have to be cautious in this case. If 6KB/object feels low to you, it probably is.

I'm testing with a 1GB WAL and a 50GB DB on an SSD in front of a 4TB disk, which seems to hold up fine. Space isn't really the issue, but "use as much as is available" doesn't tell people much.

If I have a 1TB NVMe for 10 disks, should I give 100GB of DB to each OSD? Those are the things people want to know, so we need numbers to figure them out.
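
To make the arithmetic in this thread concrete, here is a rough sketch in bash. The function names are made up for illustration, and it only does the divisions and multiplications discussed above: it assumes the bytes-per-object average stays constant, which (as Mark points out) may well not hold once objects get rewritten.

```shell
#!/usr/bin/env bash
# Back-of-envelope BlueStore DB sizing. All function names are hypothetical
# helpers; the inputs are the perf counters already used in this thread.

# db_bytes_per_object DB_USED_BYTES ONODES
# Integer average of RocksDB bytes per object (same math as the expr loops).
db_bytes_per_object() {
    local db_used=$1 onodes=$2
    if [ "$onodes" -le 0 ]; then
        echo "onode count must be positive" >&2
        return 1
    fi
    echo $(( db_used / onodes ))
}

# projected_db_bytes BYTES_PER_OBJECT OBJECT_COUNT
# DB space needed if every object costs BYTES_PER_OBJECT (an assumption).
projected_db_bytes() {
    echo $(( $1 * $2 ))
}

# db_share_bytes DEVICE_BYTES OSD_COUNT
# Even split of a shared DB device across OSDs.
db_share_bytes() {
    echo $(( $1 / $2 ))
}

# Numbers from the single-OSD test earlier in the thread.
per_obj=$(db_bytes_per_object 459276288 75085)
echo "bytes per object:             $per_obj"
echo "DB bytes for 1M objects:      $(projected_db_bytes "$per_obj" 1000000)"
echo "DB share per OSD (1TB / 10):  $(db_share_bytes 1000000000000 10)"
```

On a live OSD the first two arguments would come from `.bluefs.db_used_bytes` and `.bluestore.bluestore_onodes` in `ceph daemon osd.N perf dump`, as in the loops quoted below. Treat the result as a lower bound, not as a sizing rule.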

Wido

> Also, note that Marco is seeing much different numbers in his recent 
> post to the thread.
> 
> Mark
> 
> >
> >> [bonus hilarity]
> >> On my all-in-one-SSD OSDs, because bluestore reports them entirely as db space, I get results like:
> >>
> >> root@vm-hv-01:~# for i in {60..65} ; do echo -n "osd.$i db per object: " ; expr `ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph daemon osd.$i perf dump | jq '.bluestore.bluestore_onodes'` ; done
> >> osd.60 db per object: 80273
> >> osd.61 db per object: 68859
> >> osd.62 db per object: 45560
> >> osd.63 db per object: 38209
> >> osd.64 db per object: 48258
> >> osd.65 db per object: 50525
> >>
> >> Rich
> >>
> > _______________________________________________
> > ceph-users mailing list
> > ceph-users@xxxxxxxxxxxxxx
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >