We are starting to use 18TB spindles and have loads of cold data with only a thin layer of hot data. One 4TB or 8TB NVMe drive as a cache in front of 6x 18TB HDDs will provide close to, or even matching, SSD performance for the hot data at a reasonable extra cost per TB of storage. My plan is to wait 1-2 more years for PCIe NVMe prices to drop and then start using this method.

The second advantage is that one can continue to deploy colocated HDD OSDs, as the WAL/DB will certainly land and stay in the cache. The cache can be added to existing OSDs without redeployment. In addition, dm-cache uses a hit-count method for deciding promotion to cache, which works very differently from promotion to Ceph cache pools; dm-cache can afford this due to its local nature. In particular, it doesn't promote on just one access, which means that a weekly or monthly backup will not flush the entire cache every time.

All-SSD pools for this data (CephFS on an EC pool on HDD) will be unaffordable for us for a long time. Not to mention that these large SSDs are almost certainly QLC, which have much lower sustained throughput than the 18TB He-drives (they do have higher IOPS, but that is not so relevant for our FS workloads). The cache method will provide at least the additional IOPS that WAL/DB devices would, but, due to its size, data caching as well.

We need to go NVMe because the servers we plan to use (R740xd2) offer their largest-capacity configuration as 24x HDD + 4x PCIe NVMe. You can choose either 2 extra drives or 4 PCIe NVMe, but not both. So the NVMe devices cannot be exchanged for fast SSDs, as those would eat drive slots.

There were a few threads over the past 1-2 years where people dropped in some of these observations, and I just took note of them. It is used in production already and, from what I gathered, people are happy with it. It is much easier than WAL/DB partitions, and all the sizing problems for L0/L1/... are sorted trivially.

With NVMe sizes growing rapidly beyond what WAL/DB devices can utilize, and since LVM is the new OSD device, using LVM dm-cache seems to be the way forward for me. I append a rough, untested sketch of the LVM commands below the quoted thread.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Anthony D'Atri <anthony.datri@xxxxxxxxx>
Sent: 16 November 2020 03:00:38
To: Frank Schilder
Subject: Re: Re: which of cpu frequency and number of threads servers osd better?

Thanks. I’m curious how the economics for that compare with just using all SSDs:

* HDDs are cheaper
* But colo SSDs are operationally simpler
* And depending on configuration you can provision a cheaper HBA

> On Nov 14, 2020, at 2:04 AM, Frank Schilder <frans@xxxxxx> wrote:
>
> My plan is to use at least 500GB NVMe per HDD OSD. I have not started that yet, but there are threads of other people sharing their experience. If you go beyond 300GB per OSD, apparently the WAL/DB options cannot really use the extra capacity. With dm-cache or the like you would additionally start holding hot data in cache.
>
> Ideally, I can split a 4TB or even an 8TB NVMe over 6 OSDs.
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Anthony D'Atri <anthony.datri@xxxxxxxxx>
> Sent: 14 November 2020 10:57:57
> To: Frank Schilder
> Subject: Re: Re: which of cpu frequency and number of threads servers osd better?
>
> Guten Tag.
>
>> My plan for the future is to use dm-cache for LVM OSDs instead of WAL/DB device.
>
> Do you have any insights into the benefits of that approach instead of WAL/DB, and of dm-cache vs bcache vs dm-writecache vs … ? And any for sizing the cache device and handling failures? Presumably the DB will be active enough that it will persist in the cache, so sizing should be, at a minimum, enough to hold 2 copies of the DB to accommodate compaction?
>
> I have an existing RGW cluster on HDDs that utilizes a cache tier; the high water mark is set fairly low so that it doesn't fill up, something that apparently happened last Christmas. I've been wanting to get a feel for OSD cache as an alternative to deprecated and fussy cache tiering, as well as something like a Varnish cache on RGW load balancers to short-circuit small requests.
>
> — Anthony
>
>
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
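
P.S. Here is the sketch mentioned above: a rough, untested outline of how one could attach an NVMe-backed dm-cache to an existing ceph-volume OSD LV using stock LVM tooling. All device, VG and LV names and the sizes are placeholders for illustration only (ceph-volume's real names look like ceph-<uuid>/osd-block-<uuid>); treat it as an assumption-laden sketch, not a tested recipe.

    # Names and sizes below are placeholders. Partition the NVMe so that
    # each OSD's volume group gets its own PV, then add one partition to
    # this OSD's VG.
    pvcreate /dev/nvme0n1p1
    vgextend ceph-vg-osd0 /dev/nvme0n1p1

    # Carve out a cache volume on the NVMe partition, e.g. ~600GB when
    # splitting a 4TB drive over 6 OSDs.
    lvcreate -n osd0-cache -L 600G ceph-vg-osd0 /dev/nvme0n1p1

    # Attach it to the OSD's block LV as a dm-cache. Writeback gives the
    # WAL/DB-like write benefit; writethrough is the safer default.
    # Stopping the OSD before converting is the conservative approach.
    lvconvert --type cache --cachevol osd0-cache --cachemode writeback \
        ceph-vg-osd0/osd-block-osd0

    # To detach later (dirty blocks are flushed back to the HDD):
    # lvconvert --splitcache ceph-vg-osd0/osd-block-osd0

The default dm-cache policy (smq) tracks hotness over multiple accesses before promoting, which is what keeps a single sequential pass such as a backup from washing out the cache.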