> -----Original Message-----
> From: Nick Fisk [mailto:nick@xxxxxxxxxx]
> Sent: 19 October 2018 08:15
> To: 'Igor Fedotov' <ifedotov@xxxxxxx>; ceph-users@xxxxxxxxxxxxxx
> Subject: RE: slow_used_bytes - SlowDB being used despite lots of space free in BlockDB on SSD?
>
> > -----Original Message-----
> > From: Igor Fedotov [mailto:ifedotov@xxxxxxx]
> > Sent: 19 October 2018 01:03
> > To: nick@xxxxxxxxxx; ceph-users@xxxxxxxxxxxxxx
> > Subject: Re: slow_used_bytes - SlowDB being used despite lots of space free in BlockDB on SSD?
> >
> > On 10/18/2018 7:49 PM, Nick Fisk wrote:
> > > Hi,
> > >
> > > Ceph Version = 12.2.8
> > > 8TB spinner with 20G SSD partition
> > >
> > > Perf dump shows the following:
> > >
> > > "bluefs": {
> > >     "gift_bytes": 0,
> > >     "reclaim_bytes": 0,
> > >     "db_total_bytes": 21472731136,
> > >     "db_used_bytes": 3467640832,
> > >     "wal_total_bytes": 0,
> > >     "wal_used_bytes": 0,
> > >     "slow_total_bytes": 320063143936,
> > >     "slow_used_bytes": 4546625536,
> > >     "num_files": 124,
> > >     "log_bytes": 11833344,
> > >     "log_compactions": 4,
> > >     "logged_bytes": 316227584,
> > >     "files_written_wal": 2,
> > >     "files_written_sst": 4375,
> > >     "bytes_written_wal": 204427489105,
> > >     "bytes_written_sst": 248223463173
> > >
> > > Am I reading that correctly, about 3.4GB used out of 20GB on the SSD, yet 4.5GB of DB is stored on the spinning disk?
> >
> > Correct. Most probably the rationale for this is the layered scheme RocksDB uses to keep its sst files. Each level has a maximum
> > threshold (determined by the level number, a base value and a corresponding multiplier - see max_bytes_for_level_base &
> > max_bytes_for_level_multiplier at https://github.com/facebook/rocksdb/wiki/RocksDB-Tuning-Guide).
> > If the next level (at its max size) doesn't fit into the space available on the DB volume, it is spilled over entirely to the slow device.
> > IIRC level_base is about 250MB and the multiplier is 10, so the third level needs 25GB and hence doesn't fit into your DB volume.
> >
> > In fact a DB volume of 20GB is VERY small for an 8TB OSD - just 0.25% of the slow one. AFAIR the current recommendation is about 4%.
>
> Thanks Igor. These nodes were designed back in the filestore days, when small 10DWPD SSDs were all the rage. I might be able to
> shrink the OS/swap partition and get each DB partition up to 25/26GB, but they are not going to get any bigger than that, as that's the
> NVMe completely filled. And I'm then going to have to effectively wipe all the disks I've done so far and re-backfill. ☹ Are there any
> tunables to change this behaviour post OSD deployment to move the data back onto the SSD?
>
> On a related note, does frequently accessed data move onto the SSD, or is the overspill a one-way ticket? I would assume writes would
> cause data in RocksDB to be written back into L0 and work its way down, but I'm not sure about reads?
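
To check I've understood Igor's level arithmetic, here is a rough sketch of the sizing. It assumes a 256MB max_bytes_for_level_base and a 10x multiplier (which I believe is roughly what BlueStore ends up with, but the real values come from bluestore_rocksdb_options), and it assumes a level only stays on the SSD if the whole level, at its maximum size, still fits in what is left of the DB volume - so treat the numbers as ballpark, not gospel:

    #!/usr/bin/env python3
    # Rough sketch of the RocksDB level sizing described above.
    # Assumes max_bytes_for_level_base = 256MB and max_bytes_for_level_multiplier = 10;
    # check bluestore_rocksdb_options / the RocksDB tuning guide for the real values.

    GIB = 1024 ** 3
    LEVEL_BASE = 256 * 1024 ** 2   # assumed max size of L1
    MULTIPLIER = 10

    def level_placement(db_bytes, num_levels=4):
        """For each level, return (level, max size in bytes, fits on DB device?).

        A level is assumed to stay on the fast device only if the whole level,
        at its maximum size, still fits in the space left on the DB volume;
        otherwise it spills to the slow device.
        """
        remaining = db_bytes
        result = []
        for n in range(1, num_levels + 1):
            level_max = LEVEL_BASE * MULTIPLIER ** (n - 1)
            fits = level_max <= remaining
            if fits:
                remaining -= level_max
            result.append((n, level_max, fits))
        return result

    for db_gib in (20, 30, 40):
        print(f"block.db = {db_gib}G")
        for n, level_max, fits in level_placement(db_gib * GIB):
            where = "SSD" if fits else "spills to slow device"
            print(f"  L{n}: max {level_max / GIB:7.2f}G -> {where}")

With those assumptions a 20G block.db keeps L1 and L2 on the SSD but loses L3 to the spinner, which matches the first perf dump above, and 0.25 + 2.5 + 25 comes to a little under 28G, which lines up with the figure I get to below.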
>
> This is from a similar, slightly newer node with 10TB spinners and a 40G partition:
>
> "bluefs": {
>     "gift_bytes": 0,
>     "reclaim_bytes": 0,
>     "db_total_bytes": 53684985856,
>     "db_used_bytes": 10380902400,
>     "wal_total_bytes": 0,
>     "wal_used_bytes": 0,
>     "slow_total_bytes": 400033841152,
>     "slow_used_bytes": 0,
>     "num_files": 165,
>     "log_bytes": 15683584,
>     "log_compactions": 8,
>     "logged_bytes": 384712704,
>     "files_written_wal": 2,
>     "files_written_sst": 11317,
>     "bytes_written_wal": 564218701044,
>     "bytes_written_sst": 618268958848
>
> So I see your point about the 25G level size making it overspill the partition, as it's obvious in this case that the 10G of DB used is
> completely stored on the SSD. These OSDs are about 70% full, so I'm not expecting a massive increase in usage. Although if I move to EC
> pools I should expect maybe a doubling in objects, so db_used might double, but it should hopefully still be within the 40G.
>
> The 4% rule would not be workable in my case: there are 12x10TB disks in these nodes, so I would need nearly 5TB worth of SSD, which would
> likely cost a similar amount to the whole node + disks. I get the fact that any recommendation needs to take the worst case into
> account, but I would imagine for a lot of simple RBD-only use cases this number is quite inflated.
>
> So I think the lesson from this is that despite whatever DB usage you may think you may end up with, always make sure your SSD
> partition is bigger than 26GB (L0+L1)?

Ok, so after some reading [1], a slight correction: block.db needs to be a minimum of around 28G (L1+L2+L3) to make sure L3 fits on the
SSD. For most RBD workloads (or any other largish-object workloads) the metadata will likely fit well within this limit.

[1] https://www.spinics.net/lists/ceph-devel/msg39315.html

> > > Am I also understanding correctly that BlueFS has reserved 300G of space on the spinning disk?
> >
> > Right.
>
> > > Found a previous bug tracker for something which looks exactly the same case, but should be fixed now:
> > > https://tracker.ceph.com/issues/22264
> > >
> > > Thanks,
> > > Nick
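
For anyone else wanting to spot this, below is roughly the check I've been doing by hand against the perf dumps above, wrapped in a small script. It assumes you run it on the OSD host with access to the admin sockets and that the bluefs counters look like the dumps quoted above; the OSD ids are just placeholders for whatever lives on your node:

    #!/usr/bin/env python3
    # Rough sketch: report BlueFS spillover per OSD via the admin socket.
    # Assumes "ceph daemon osd.N perf dump" works locally and returns a "bluefs"
    # section like the ones quoted above. OSD_IDS is a placeholder list.

    import json
    import subprocess

    OSD_IDS = [0, 1, 2]  # placeholder - list the OSD ids hosted on this node
    GIB = 1024 ** 3

    for osd in OSD_IDS:
        out = subprocess.check_output(["ceph", "daemon", f"osd.{osd}", "perf", "dump"])
        bluefs = json.loads(out)["bluefs"]
        db_used = bluefs["db_used_bytes"] / GIB
        db_total = bluefs["db_total_bytes"] / GIB
        slow_used = bluefs["slow_used_bytes"] / GIB
        note = "  <-- DB spilling onto slow device" if bluefs["slow_used_bytes"] else ""
        print(f"osd.{osd}: db {db_used:.1f}G / {db_total:.1f}G, slow {slow_used:.1f}G{note}")

Anything reporting a non-zero slow figure is in the same boat as my 8TB/20G OSDs above.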