We are starting to use 18TB spindles and have loads of cold data with only a thin layer of hot data. One 4TB or 8TB NVMe drive as a cache in front of 6x 18TB HDDs will provide close to, or even matching, SSD performance for the hot data at a reasonable extra cost per TB of storage. My plan is to wait 1-2 more years for PCIe NVMe prices to drop and then start using this method.

The second advantage is that one can continue to deploy colocated HDD OSDs, as the WAL/DB will certainly land and stay in the cache. The cache can be added to existing OSDs without redeployment. In addition, dm-cache uses a hit-count method for deciding promotion to cache, which works very differently from promotion to Ceph cache pools; dm-cache can afford this due to its local nature. In particular, it doesn't promote on just one access, which means that a weekly or monthly backup will not flush the entire cache every time.

All-SSD pools for this data (CephFS on an EC pool on HDD) will be unaffordable for us for a long time. Not to mention that these large SSDs are almost certainly QLC, which have much lower sustained throughput than the 18TB He-drives (they do have higher IOPS, but that is not so relevant for our FS workloads). The cache method will provide at least the additional IOPS that WAL/DB devices would, but, due to its size, data caching as well.

We need to go NVMe because the servers we plan to use (R740xd2) offer their largest-capacity configuration as 24x HDD + 4x PCIe NVMe. You can choose either 2 extra drives or 4 PCIe NVMe, but not both. So the NVMe devices cannot be exchanged for fast SSDs, as those would eat drive slots.

There were a few threads over the past 1-2 years where people dropped in some of these observations, and I just took note of them. It is used in production already and, from what I gathered, people are happy with it. It is much easier than WAL/DB partitions, and all the sizing problems for L0/L1/... are sorted trivially.

With NVMe sizes growing rapidly beyond what WAL/DB devices can utilize, and since LVM is the new OSD device, using LVM dm-cache seems to be the way forward for me. I append a rough, untested sketch of the LVM commands below the quoted thread.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Anthony D'Atri <anthony.datri@xxxxxxxxx>
Sent: 16 November 2020 03:00:38
To: Frank Schilder
Subject: Re: Re: which of cpu frequency and number of threads servers osd better?

Thanks. I’m curious how the economics for that compare with just using all SSDs:

* HDDs are cheaper
* But colo SSDs are operationally simpler
* And depending on configuration you can provision a cheaper HBA

> On Nov 14, 2020, at 2:04 AM, Frank Schilder <frans@xxxxxx> wrote:
>
> My plan is to use at least 500GB NVMe per HDD OSD. I have not started that yet, but there are threads of other people sharing their experience. If you go beyond 300GB per OSD, apparently the WAL/DB options cannot really use the extra capacity. With dm-cache or the like you would additionally start holding hot data in cache.
>
> Ideally, I can split a 4TB or even an 8TB NVMe over 6 OSDs.
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Anthony D'Atri <anthony.datri@xxxxxxxxx>
> Sent: 14 November 2020 10:57:57
> To: Frank Schilder
> Subject: Re: Re: which of cpu frequency and number of threads servers osd better?
>
> Guten Tag.
>
>> My plan for the future is to use dm-cache for LVM OSDs instead of WAL/DB device.
>
> Do you have any insights into the benefits of that approach instead of WAL/DB, and of dm-cache vs bcache vs dm-writecache vs … ? And any for sizing the cache device and handling failures? Presumably the DB will be active enough that it will persist in the cache, so sizing should be, at a minimum, enough to hold 2 copies of the DB to accommodate compaction?
>
> I have an existing RGW cluster on HDDs that utilizes a cache tier; the high water mark is set fairly low so that it doesn't fill up, something that apparently happened last Christmas. I've been wanting to get a feel for OSD cache as an alternative to deprecated and fussy cache tiering, as well as something like a Varnish cache on RGW load balancers to short-circuit small requests.
>
> — Anthony
>
>
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
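
P.S. Here is the sketch mentioned above: a rough, untested outline of how one could attach an NVMe-backed dm-cache to an existing ceph-volume OSD LV using stock LVM tooling. All device, VG and LV names and the sizes are placeholders for illustration only (ceph-volume's real names look like ceph-<uuid>/osd-block-<uuid>); treat it as an assumption-laden sketch, not a tested recipe.

    # Names and sizes below are placeholders. Partition the NVMe so that
    # each OSD's volume group gets its own PV, then add one partition to
    # this OSD's VG.
    pvcreate /dev/nvme0n1p1
    vgextend ceph-vg-osd0 /dev/nvme0n1p1

    # Carve out a cache volume on the NVMe partition, e.g. ~600GB when
    # splitting a 4TB drive over 6 OSDs.
    lvcreate -n osd0-cache -L 600G ceph-vg-osd0 /dev/nvme0n1p1

    # Attach it to the OSD's block LV as a dm-cache. Writeback gives the
    # WAL/DB-like write benefit; writethrough is the safer default.
    # Stopping the OSD before converting is the conservative approach.
    lvconvert --type cache --cachevol osd0-cache --cachemode writeback \
        ceph-vg-osd0/osd-block-osd0

    # To detach later (dirty blocks are flushed back to the HDD):
    # lvconvert --splitcache ceph-vg-osd0/osd-block-osd0

The default dm-cache policy (smq) tracks hotness over multiple accesses before promoting, which is what keeps a single sequential pass such as a backup from washing out the cache.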