> On 17 October 2017 at 14:21, Mark Nelson <mnelson@xxxxxxxxxx> wrote:
>
> On 10/17/2017 01:54 AM, Wido den Hollander wrote:
> >
> >> On 16 October 2017 at 18:14, Richard Hesketh <richard.hesketh@xxxxxxxxxxxx> wrote:
> >>
> >> On 16/10/17 13:45, Wido den Hollander wrote:
> >>>> On 26 September 2017 at 16:39, Mark Nelson <mnelson@xxxxxxxxxx> wrote:
> >>>> On 09/26/2017 01:10 AM, Dietmar Rieder wrote:
> >>>>> thanks David,
> >>>>>
> >>>>> that's confirming what I was assuming. Too bad that there is no estimate/method to calculate the db partition size.
> >>>>
> >>>> It's possible that we might be able to get ranges for certain kinds of scenarios. Maybe if you do lots of small random writes on RBD, you can expect a typical metadata size of X per object. Or maybe if you do lots of large sequential object writes in RGW, it's more like Y. I think it's probably going to be tough to make it accurate for everyone, though.
> >>>
> >>> So I did a quick test. I wrote 75,000 objects to a BlueStore device:
> >>>
> >>> root@alpha:~# ceph daemon osd.0 perf dump|jq '.bluestore.bluestore_onodes'
> >>> 75085
> >>> root@alpha:~#
> >>>
> >>> I then saw that the RocksDB database was 450MB in size:
> >>>
> >>> root@alpha:~# ceph daemon osd.0 perf dump|jq '.bluefs.db_used_bytes'
> >>> 459276288
> >>> root@alpha:~#
> >>>
> >>> 459276288 / 75085 = 6116
> >>>
> >>> So that is about 6kB of RocksDB data per object.
> >>>
> >>> Let's say I want to store 1M objects on a single OSD; I would then need ~6GB of DB space.
> >>>
> >>> Is this a safe assumption? Do you think that 6kB is normal? Low? High?
> >>>
> >>> There aren't many of these numbers out there for BlueStore right now, so I'm trying to gather some.
> >>>
> >>> Wido
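
For anyone who wants to repeat this check on their own OSDs, the calculation above boils down to something like the script below. It is an untested sketch: it assumes jq is installed, that it runs on the host with access to the OSD's admin socket, and that the OSD already holds some objects; the OSD_ID and TARGET_OBJECTS parameters are just examples.

#!/bin/bash
# Rough bytes-per-object check for one BlueStore OSD, plus a projection
# for a target object count. Integer math only; rough numbers are all we need.
OSD_ID=${1:-0}                 # which OSD to sample
TARGET_OBJECTS=${2:-1000000}   # how many objects you expect on this OSD

# Take one perf dump and parse it twice, so both values come from the same snapshot
DUMP=$(ceph daemon osd."$OSD_ID" perf dump)
DB_BYTES=$(echo "$DUMP" | jq '.bluefs.db_used_bytes')
ONODES=$(echo "$DUMP" | jq '.bluestore.bluestore_onodes')

PER_OBJECT=$((DB_BYTES / ONODES))
PROJECTED_GB=$((PER_OBJECT * TARGET_OBJECTS / 1000 / 1000 / 1000))

echo "osd.$OSD_ID: $ONODES onodes, $DB_BYTES DB bytes used, ~$PER_OBJECT bytes per object"
echo "osd.$OSD_ID: ~${PROJECTED_GB}GB of DB space for $TARGET_OBJECTS objects at this ratio"

Keep in mind Richard's remark further down that db_used_bytes may include the WAL when there is no separate WAL device, so the per-object number is only a rough indication.
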
> >> If I check for the same stats on the OSDs in my production cluster, I see similar but variable values:
> >>
> >> root@vm-ds-01:~/ceph-conf# for i in {0..9} ; do echo -n "osd.$i db per object: " ; expr `ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph daemon osd.$i perf dump | jq '.bluestore.bluestore_onodes'` ; done
> >> osd.0 db per object: 7490
> >> osd.1 db per object: 7523
> >> osd.2 db per object: 7378
> >> osd.3 db per object: 7447
> >> osd.4 db per object: 7233
> >> osd.5 db per object: 7393
> >> osd.6 db per object: 7074
> >> osd.7 db per object: 7967
> >> osd.8 db per object: 7253
> >> osd.9 db per object: 7680
> >>
> >> root@vm-ds-02:~# for i in {10..19} ; do echo -n "osd.$i db per object: " ; expr `ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph daemon osd.$i perf dump | jq '.bluestore.bluestore_onodes'` ; done
> >> osd.10 db per object: 5168
> >> osd.11 db per object: 5291
> >> osd.12 db per object: 5476
> >> osd.13 db per object: 4978
> >> osd.14 db per object: 5252
> >> osd.15 db per object: 5461
> >> osd.16 db per object: 5135
> >> osd.17 db per object: 5126
> >> osd.18 db per object: 9336
> >> osd.19 db per object: 4986
> >>
> >> root@vm-ds-03:~# for i in {20..29} ; do echo -n "osd.$i db per object: " ; expr `ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph daemon osd.$i perf dump | jq '.bluestore.bluestore_onodes'` ; done
> >> osd.20 db per object: 5115
> >> osd.21 db per object: 4844
> >> osd.22 db per object: 5063
> >> osd.23 db per object: 5486
> >> osd.24 db per object: 5228
> >> osd.25 db per object: 4966
> >> osd.26 db per object: 5047
> >> osd.27 db per object: 5021
> >> osd.28 db per object: 5321
> >> osd.29 db per object: 5150
> >>
> >> root@vm-ds-04:~# for i in {30..39} ; do echo -n "osd.$i db per object: " ; expr `ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph daemon osd.$i perf dump | jq '.bluestore.bluestore_onodes'` ; done
> >> osd.30 db per object: 6658
> >> osd.31 db per object: 6445
> >> osd.32 db per object: 6259
> >> osd.33 db per object: 6691
> >> osd.34 db per object: 6513
> >> osd.35 db per object: 6628
> >> osd.36 db per object: 6779
> >> osd.37 db per object: 6819
> >> osd.38 db per object: 6677
> >> osd.39 db per object: 6689
> >>
> >> root@vm-ds-05:~# for i in {40..49} ; do echo -n "osd.$i db per object: " ; expr `ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph daemon osd.$i perf dump | jq '.bluestore.bluestore_onodes'` ; done
> >> osd.40 db per object: 5335
> >> osd.41 db per object: 5203
> >> osd.42 db per object: 5552
> >> osd.43 db per object: 5188
> >> osd.44 db per object: 5218
> >> osd.45 db per object: 5157
> >> osd.46 db per object: 4956
> >> osd.47 db per object: 5370
> >> osd.48 db per object: 5117
> >> osd.49 db per object: 5313
> >>
> >> I'm not sure why there is so much variance (these nodes are basically identical), and I think that db_used_bytes includes the WAL, at least in my case, as I don't have a separate WAL device. I'm not sure how big the WAL is relative to the metadata and hence how much this might be thrown off, but ~6kB/object seems like a reasonable value to take for back-of-envelope calculations.
> >
> > Yes, judging from your numbers, 6kB/object seems reasonable. More data points are welcome in this case.
> >
> > Some input from a BlueStore dev might be helpful as well, to make sure we are not drawing the wrong conclusions here.
> >
> > Wido
>
> I would be very careful about drawing too many conclusions given a single snapshot in time, especially if there haven't been a lot of partial object rewrites yet. Just on the surface, 6KB/object feels low (especially if they are moderately large objects), but perhaps if they've never been rewritten this is a reasonable lower bound. This is important because things like 4MB RBD objects that are regularly rewritten might behave a lot differently than RGW objects that are written once and then never rewritten.

Thanks for the feedback. Indeed, we have to be cautious in this case. So 6kB/object feels low to you, so it probably is. I'm testing with a 1GB WAL and a 50GB DB on an SSD in front of a 4TB disk, and that seems to hold up fine. It's not that space itself is really the issue, but "use as much as is available" doesn't tell people much. If I have a 1TB NVMe device for 10 disks, should I give each OSD a 100GB DB partition? Those are the things people want to know. So we need numbers to figure these things out.
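
To make that concrete, the kind of back-of-the-envelope math I mean is something like this. Again just a rough sketch: the device size and OSD count are the example from my question above, and the ~6kB/object figure is only the average observed in this thread, with Mark's caveat about rewrites still applying.

#!/bin/bash
# Split one fast device evenly across the OSDs it serves and estimate how
# many objects per OSD that DB space would cover at an assumed ratio.
NVME_BYTES=$((1000 * 1000 * 1000 * 1000))   # e.g. a 1TB NVMe device
NUM_OSDS=10                                 # OSDs sharing that device
BYTES_PER_OBJECT=6000                       # ~6kB per object as measured above; likely a lower bound

DB_PER_OSD=$((NVME_BYTES / NUM_OSDS))
OBJECTS_PER_OSD=$((DB_PER_OSD / BYTES_PER_OBJECT))

echo "DB partition per OSD: $((DB_PER_OSD / 1000 / 1000 / 1000))GB"
echo "Objects per OSD before that DB space is filled: ~$OBJECTS_PER_OSD"

With those numbers a 100GB DB partition per OSD would cover somewhere around 16 million objects, if the ratio holds, and that is exactly the kind of thing I would like more data points on.
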
Wido

> Also, note that Marco is seeing much different numbers in his recent post to the thread.
>
> Mark
>
> >> [bonus hilarity]
> >> On my all-in-one-SSD OSDs, because bluestore reports the entire device as db space, I get results like:
> >>
> >> root@vm-hv-01:~# for i in {60..65} ; do echo -n "osd.$i db per object: " ; expr `ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph daemon osd.$i perf dump | jq '.bluestore.bluestore_onodes'` ; done
> >> osd.60 db per object: 80273
> >> osd.61 db per object: 68859
> >> osd.62 db per object: 45560
> >> osd.63 db per object: 38209
> >> osd.64 db per object: 48258
> >> osd.65 db per object: 50525
> >>
> >> Rich

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com