> -----Original Message-----
> From: Nick Fisk [mailto:nick@xxxxxxxxxx]
> Sent: 19 October 2018 08:15
> To: 'Igor Fedotov' <ifedotov@xxxxxxx>; ceph-users@xxxxxxxxxxxxxx
> Subject: RE: slow_used_bytes - SlowDB being used despite lots of space free in BlockDB on SSD?
>
> > -----Original Message-----
> > From: Igor Fedotov [mailto:ifedotov@xxxxxxx]
> > Sent: 19 October 2018 01:03
> > To: nick@xxxxxxxxxx; ceph-users@xxxxxxxxxxxxxx
> > Subject: Re: slow_used_bytes - SlowDB being used despite lots of space free in BlockDB on SSD?
> >
> > On 10/18/2018 7:49 PM, Nick Fisk wrote:
> > > Hi,
> > >
> > > Ceph Version = 12.2.8
> > > 8TB spinner with 20G SSD partition
> > >
> > > Perf dump shows the following:
> > >
> > > "bluefs": {
> > >     "gift_bytes": 0,
> > >     "reclaim_bytes": 0,
> > >     "db_total_bytes": 21472731136,
> > >     "db_used_bytes": 3467640832,
> > >     "wal_total_bytes": 0,
> > >     "wal_used_bytes": 0,
> > >     "slow_total_bytes": 320063143936,
> > >     "slow_used_bytes": 4546625536,
> > >     "num_files": 124,
> > >     "log_bytes": 11833344,
> > >     "log_compactions": 4,
> > >     "logged_bytes": 316227584,
> > >     "files_written_wal": 2,
> > >     "files_written_sst": 4375,
> > >     "bytes_written_wal": 204427489105,
> > >     "bytes_written_sst": 248223463173
> > >
> > > Am I reading that correctly, about 3.4GB used out of 20GB on the SSD, yet 4.5GB of DB is stored on the spinning disk?
> >
> > Correct. Most probably the rationale for this is the layered scheme RocksDB uses to keep its sst files. Each level has a maximum
> > threshold (determined by the level number, a base value and a corresponding multiplier - see max_bytes_for_level_base &
> > max_bytes_for_level_multiplier at https://github.com/facebook/rocksdb/wiki/RocksDB-Tuning-Guide).
> > If the next level (at its max size) doesn't fit into the space available on the DB volume, it is spilled over entirely to the slow device.
> > IIRC level_base is about 250MB and the multiplier is 10, so the third level needs 25GB and hence doesn't fit into your DB volume.
> >
> > In fact a DB volume of 20GB is VERY small for an 8TB OSD - just 0.25% of the slow one. AFAIR the current recommendation is about 4%.
>
> Thanks Igor. These nodes were designed back in the filestore days, when small 10DWPD SSDs were all the rage. I might be able to
> shrink the OS/swap partition and get each DB partition up to 25/26GB, but they are not going to get any bigger than that, as that's the
> NVMe completely filled. And I'm then going to have to effectively wipe all the disks I've done so far and re-backfill. ☹ Are there any
> tunables to change this behaviour post OSD deployment to move the data back onto the SSD?
>
> On a related note, does frequently accessed data move onto the SSD, or is the overspill a one-way ticket? I would assume writes would
> cause data in RocksDB to be written back into L0 and work its way down, but I'm not sure about reads?
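
To check I've understood Igor's level arithmetic, here is a rough sketch of the sizing. It assumes a 256MB max_bytes_for_level_base and a 10x multiplier (which I believe is roughly what BlueStore ends up with, but the real values come from bluestore_rocksdb_options), and it assumes a level only stays on the SSD if the whole level, at its maximum size, still fits in what is left of the DB volume - so treat the numbers as ballpark, not gospel:

    #!/usr/bin/env python3
    # Rough sketch of the RocksDB level sizing described above.
    # Assumes max_bytes_for_level_base = 256MB and max_bytes_for_level_multiplier = 10;
    # check bluestore_rocksdb_options / the RocksDB tuning guide for the real values.

    GIB = 1024 ** 3
    LEVEL_BASE = 256 * 1024 ** 2   # assumed max size of L1
    MULTIPLIER = 10

    def level_placement(db_bytes, num_levels=4):
        """For each level, return (level, max size in bytes, fits on DB device?).

        A level is assumed to stay on the fast device only if the whole level,
        at its maximum size, still fits in the space left on the DB volume;
        otherwise it spills to the slow device.
        """
        remaining = db_bytes
        result = []
        for n in range(1, num_levels + 1):
            level_max = LEVEL_BASE * MULTIPLIER ** (n - 1)
            fits = level_max <= remaining
            if fits:
                remaining -= level_max
            result.append((n, level_max, fits))
        return result

    for db_gib in (20, 30, 40):
        print(f"block.db = {db_gib}G")
        for n, level_max, fits in level_placement(db_gib * GIB):
            where = "SSD" if fits else "spills to slow device"
            print(f"  L{n}: max {level_max / GIB:7.2f}G -> {where}")

With those assumptions a 20G block.db keeps L1 and L2 on the SSD but loses L3 to the spinner, which matches the first perf dump above, and 0.25 + 2.5 + 25 comes to a little under 28G, which lines up with the figure I get to below.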
>
> This is from a similar, slightly newer node with 10TB spinners and a 40G partition:
>
> "bluefs": {
>     "gift_bytes": 0,
>     "reclaim_bytes": 0,
>     "db_total_bytes": 53684985856,
>     "db_used_bytes": 10380902400,
>     "wal_total_bytes": 0,
>     "wal_used_bytes": 0,
>     "slow_total_bytes": 400033841152,
>     "slow_used_bytes": 0,
>     "num_files": 165,
>     "log_bytes": 15683584,
>     "log_compactions": 8,
>     "logged_bytes": 384712704,
>     "files_written_wal": 2,
>     "files_written_sst": 11317,
>     "bytes_written_wal": 564218701044,
>     "bytes_written_sst": 618268958848
>
> So I see your point about the 25G level size making it overspill the partition, as it's obvious in this case that the 10G of DB used is
> completely stored on the SSD. These OSDs are about 70% full, so I'm not expecting a massive increase in usage. Although if I move to EC
> pools I should expect maybe a doubling in objects, so db_used might double, but it should hopefully still be within the 40G.
>
> The 4% rule would not be workable in my case: there are 12x10TB disks in these nodes, so I would need nearly 5TB worth of SSD, which would
> likely cost a similar amount to the whole node + disks. I get the fact that any recommendation needs to take the worst case into
> account, but I would imagine for a lot of simple RBD-only use cases this number is quite inflated.
>
> So I think the lesson from this is that despite whatever DB usage you may think you may end up with, always make sure your SSD
> partition is bigger than 26GB (L0+L1)?

Ok, so after some reading [1], a slight correction: block.db needs to be a minimum of around 28G (L1+L2+L3) to make sure L3 fits on the
SSD. For most RBD workloads (or any other largish-object workloads) the metadata will likely fit well within this limit.

[1] https://www.spinics.net/lists/ceph-devel/msg39315.html

> > > Am I also understanding correctly that BlueFS has reserved 300G of space on the spinning disk?
> >
> > Right.
>
> > > Found a previous bug tracker for something which looks exactly the same case, but should be fixed now:
> > > https://tracker.ceph.com/issues/22264
> > >
> > > Thanks,
> > > Nick
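
For anyone else wanting to spot this, below is roughly the check I've been doing by hand against the perf dumps above, wrapped in a small script. It assumes you run it on the OSD host with access to the admin sockets and that the bluefs counters look like the dumps quoted above; the OSD ids are just placeholders for whatever lives on your node:

    #!/usr/bin/env python3
    # Rough sketch: report BlueFS spillover per OSD via the admin socket.
    # Assumes "ceph daemon osd.N perf dump" works locally and returns a "bluefs"
    # section like the ones quoted above. OSD_IDS is a placeholder list.

    import json
    import subprocess

    OSD_IDS = [0, 1, 2]  # placeholder - list the OSD ids hosted on this node
    GIB = 1024 ** 3

    for osd in OSD_IDS:
        out = subprocess.check_output(["ceph", "daemon", f"osd.{osd}", "perf", "dump"])
        bluefs = json.loads(out)["bluefs"]
        db_used = bluefs["db_used_bytes"] / GIB
        db_total = bluefs["db_total_bytes"] / GIB
        slow_used = bluefs["slow_used_bytes"] / GIB
        note = "  <-- DB spilling onto slow device" if bluefs["slow_used_bytes"] else ""
        print(f"osd.{osd}: db {db_used:.1f}G / {db_total:.1f}G, slow {slow_used:.1f}G{note}")

Anything reporting a non-zero slow figure is in the same boat as my 8TB/20G OSDs above.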