Re: SAS vs SATA for OSD - WAL+DB sizing.

In releases before … Pacific I think, there are certain discrete capacities that the DB will actually utilize: the sums of the RocksDB levels.  Lots of discussion in the archives.  AIUI in those releases, with a 500 GB BlueStore WAL+DB device, with default settings you'll only actually use ~300 GB most of the time, though the extra might accelerate compaction.  With Pacific I believe code was merged that shards the OSD's RocksDB to make better use of arbitrary partition / device sizes.
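
As a back-of-the-envelope sketch (not Ceph code; the real values on a given cluster come from bluestore_rocksdb_options, and I'm assuming the common RocksDB defaults of max_bytes_for_level_base = 256 MiB with a 10x multiplier), something like the following shows why only whole levels count and where that ~300 GB step comes from:

    # Rough illustration only: estimate how much of a DB device the whole
    # RocksDB levels can actually occupy, assuming a level that doesn't fit
    # entirely on the fast device ends up spilling to the slow one.
    def usable_db_bytes(device_bytes,
                        level_base=256 * 1024**2,  # max_bytes_for_level_base (assumed default)
                        multiplier=10,             # max_bytes_for_level_multiplier (assumed default)
                        max_levels=7):
        total, level_size = 0, level_base
        for _ in range(max_levels):
            if total + level_size > device_bytes:
                break                              # this level no longer fits -> it would spill
            total += level_size
            level_size *= multiplier
        return total

    gib = 1024**3
    for dev_gib in (30, 60, 300, 500):
        print(f"{dev_gib:>4} GiB device -> ~{usable_db_bytes(dev_gib * gib) / gib:.0f} GiB of whole levels")
    # 500 GiB lands at ~278 GiB (L1+L2+L3+L4), i.e. the "~300 GB" step above.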

With older releases one can (or so I’ve read) game this a bit by carefully adjusting RocksDB’s max_bytes_for_level_base (via bluestore_rocksdb_options); ISTR that Karan did that for his impressive 10 Billion Object exercise.
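
(Reusing usable_db_bytes() from the sketch above, and purely as a hypothetical illustration rather than a recommendation: raising max_bytes_for_level_base so that the sum of whole levels lands closer to the actual partition size is the sort of gaming I mean.  The 460 MiB value below is just an example.)

    gib = 1024**3
    print(usable_db_bytes(500 * gib) / gib)                            # defaults     -> ~278 GiB usable
    print(usable_db_bytes(500 * gib, level_base=460 * 1024**2) / gib)  # 460 MiB base -> ~499 GiB usable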

I’ve seen threads on the list over the past couple of years that seemed to show spillover despite the DB device not being fully utilized; I hope that’s since been addressed.

My understanding is that with column family sharding, compaction only takes place on a fraction of the DB at any one time, so the transient space it needs (and thus the exposure to spillover) should be smaller.

I may of course be out of my Vulcan mind, but HTH.

— aad

> On Jun 3, 2021, at 5:29 PM, Dave Hall <kdhall@xxxxxxxxxxxxxx> wrote:
> 
> Mark,
> 
> We are running a mix of RGW, RBD, and CephFS.  Our CephFS is pretty big,
> but we're moving a lot of it to RGW.  What prompted me to go looking for a
> guideline was a high frequency of Spillover warnings as our cluster filled
> up past the 50% mark.  That was with 14.2.9, I think.  I understand that
> some things have changed since, but I think I'd like to have the
> flexibility and performance of a generous WAL+DB - the cluster is used to
> store research data, and the usage pattern is tending to change as the
> research evolves.  No telling what our mix will be a year from now.
> 
> -Dave
> 
> --
> Dave Hall
> Binghamton University
> kdhall@xxxxxxxxxxxxxx
> 607-760-2328 (Cell)
> 607-777-4641 (Office)
> 
> 
> On Thu, Jun 3, 2021 at 7:39 PM Mark Nelson <mnelson@xxxxxxxxxx> wrote:
> 
>> FWIW, those guidelines try to be sort of a one-size-fits-all
>> recommendation that may not apply to your situation.  Typically RBD has
>> pretty low metadata overhead so you can get away with smaller DB
>> partitions.  4% should easily be enough.  If you are running heavy RGW
>> write workloads with small objects, you will almost certainly use more
>> than 4% for metadata (I've seen worst case up to 50%, but that was
>> before column family sharding which should help to some extent).  Having
>> said that, bluestore will roll the higher rocksdb levels over to the
>> slow device and keep the WAL, L0, and other lower LSM levels on the
>> fast device.  It's not necessarily the end of the world if you end up
>> with some of the more rarely used metadata on the HDD, but having it on
>> flash certainly is nice.
>> 
>> 
>> Mark
>> 
>> 
>> On 6/3/21 5:18 PM, Dave Hall wrote:
>>> Anthony,
>>> 
>>> I had recently found a reference in the Ceph docs that indicated
>>> something
>>> like 40GB per TB for WAL+DB space.  For a 12TB HDD that comes out to
>>> 480GB.  If this is no longer the guideline I'd be glad to save a couple
>>> dollars.
>>> 
>>> -Dave
>>> 
>>> --
>>> Dave Hall
>>> Binghamton University
>>> kdhall@xxxxxxxxxxxxxx
>>> 
>>> On Thu, Jun 3, 2021 at 6:10 PM Anthony D'Atri <anthony.datri@xxxxxxxxx>
>>> wrote:
>>> 
>>>> Agreed.  I think oh …. maybe 15-20 years ago there was often a wider
>>>> difference between SAS and SATA drives, but with modern queuing etc. my
>>>> sense is that there is less of an advantage.  Seek and rotational
>>>> latency I
>>>> suspect dwarf interface differences wrt performance.  The HBA may be a
>>>> bigger bottleneck (and way more trouble).
>>>> 
>>>> 500 GB NVMe seems like a lot per HDD, are you using that as WAL+DB with
>>>> RGW, or as dmcache or something?
>>>> 
>>>> Depending on your constraints, QLC flash might be more competitive than
>>>> you think ;)
>>>> 
>>>> — aad
>>>> 
>>>> 
>>>>> I suspect the behavior of the controller and the behavior of the drive
>>>> firmware will end up mattering more than SAS vs SATA.  As always it's
>>>> best
>>>> if you can test it first before committing to buying a pile of them.
>>>> Historically I have seen SATA drives that have performed well as far as
>>>> HDDs go though.
>>>>> 
>>>>> Mark
>>>>> 
>>>>> On 6/3/21 4:25 PM, Dave Hall wrote:
>>>>>> Hello,
>>>>>> 
>>>>>> We're planning another batch of OSD nodes for our cluster.  Our prior
>>>>> nodes
>>>>>> have been 8 x 12TB SAS drives plus 500GB NVMe per HDD.  Due to market
>>>>>> circumstances and the shortage of drives those 12TB SAS drives are in
>>>>> short
>>>>>> supply.
>>>>>> 
>>>>>> Our integrator has offered an option of 8 x 14TB SATA drives (still
>>>>>> Enterprise).  For Ceph, will the switch to SATA carry a performance
>>>>>> difference that I should be concerned about?
>>>>>> 
>>>>>> Thanks.
>>>>>> 
>>>>>> -Dave
>>>>>> 
>>>>>> --
>>>>>> Dave Hall
>>>>>> Binghamton University
>>>>>> kdhall@xxxxxxxxxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



