Re: Performance improvement suggestion

Cache tiering is deprecated.

> On Feb 20, 2024, at 17:03, Özkan Göksu <ozkangksu@xxxxxxxxx> wrote:
> 
> Hello.
> 
> I haven't tested it personally, but what about a rep 1 (single-replica) NVMe
> write-cache pool backed by another rep 2 pool?
> 
> In theory, it has the potential to do exactly what you are looking for.
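
For anyone curious what that would look like in practice, here is a minimal sketch of how such a tier could be wired up with the (now deprecated) cache-tiering commands, driven from Python via the ceph CLI. The pool names, PG counts, and replica counts are made up for illustration, and recent releases also require mon_allow_pool_size_one to be enabled before a size-1 pool is accepted:

    import subprocess

    def ceph(*args):
        # Thin wrapper around the ceph CLI; raises if a command fails.
        subprocess.run(["ceph", *args], check=True)

    # Hypothetical base pool on HDDs with two replicas.
    ceph("osd", "pool", "create", "base-pool", "128")
    ceph("osd", "pool", "set", "base-pool", "size", "2")

    # Hypothetical single-replica NVMe pool to act as the write cache.
    ceph("osd", "pool", "create", "nvme-cache", "64")
    ceph("osd", "pool", "set", "nvme-cache", "size", "1",
         "--yes-i-really-mean-it")

    # Attach the NVMe pool as a writeback cache tier in front of the base pool.
    ceph("osd", "tier", "add", "base-pool", "nvme-cache")
    ceph("osd", "tier", "cache-mode", "nvme-cache", "writeback")
    ceph("osd", "tier", "set-overlay", "base-pool", "nvme-cache")

    # Cache tiering needs a hit set to track which objects are hot.
    ceph("osd", "pool", "set", "nvme-cache", "hit_set_type", "bloom")

Of course, with size=1 on the cache tier, any dirty object that has not yet been flushed to the base pool is gone if that NVMe OSD dies, which is the same trade-off discussed further down the thread.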
> 
> 
> On Thu, Feb 1, 2024 at 20:54, quaglio@xxxxxxxxxx <quaglio@xxxxxxxxxx>
> wrote:
> 
>> 
>> 
>> Ok Anthony,
>> 
>> I understood what you said, and I trust the professional history and
>> experience you have.
>> 
>> Anyway, could there be a configuration flag to make this happen, along the
>> lines of the flags that already exist, such as "--yes-i-really-mean-it"?
>> 
>> This way, the default storage behavior would remain as it is, while still
>> allowing situations like the one I mentioned to be possible.
>> 
>> Such a flag would permit some rules to be relaxed (even if they are not
>> acceptable by default).
>> Likewise, there are already mechanisms, such as lazyio, that make
>> exceptions to the standard guarantees.
>> 
>> 
>> To be clear: it's just a suggestion.
>> If this type of functionality is not of interest, that's OK.
>> 
>> 
>> Rafael.
>> 
>> ------------------------------
>> 
>> *De: *"Anthony D'Atri" <anthony.datri@xxxxxxxxx>
>> *Enviada: *2024/02/01 12:10:30
>> *Para: *quaglio@xxxxxxxxxx
>> *Cc: * ceph-users@xxxxxxx
>> *Assunto: *  Re: Performance improvement suggestion
>> 
>> 
>> 
>>> I didn't say I would accept the risk of losing data.
>> 
>> That's implicit in what you suggest, though.
>> 
>>> I just said that it would be interesting if the objects were first
>> recorded only in the primary OSD.
>> 
>> What happens when that host / drive smokes before it can replicate? What
>> happens if a secondary OSD gets a read op before the primary updates it?
>> Swift object storage users have to code around this potential inconsistency.
>> It's a non-starter for block storage.
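
A client that only wants to stop blocking on every individual write can already do that on its own side with asynchronous I/O, while the cluster keeps replicating synchronously. A minimal sketch with the librados Python bindings; the pool and object names are invented for illustration:

    import rados

    # Connect with the local ceph.conf; "testpool" is a hypothetical pool.
    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()
    ioctx = cluster.open_ioctx("testpool")

    # Issue a batch of writes without waiting on each one individually.
    completions = []
    for i in range(8):
        comp = ioctx.aio_write_full(f"object-{i}", b"payload")
        completions.append(comp)

    # ... the application can do useful work here ...

    # Each completion fires only once the acting set has acknowledged the
    # write, so durability is unchanged; only the waiting is batched.
    for comp in completions:
        comp.wait_for_complete()

    ioctx.close()
    cluster.shutdown()

The OSDs still replicate every write before acknowledging it; the application just chooses when to block.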
>> 
>> This is similar to why RoC HBAs (which are a badly outdated thing to begin
>> with) will only enter writeback mode if they have a BBU / supercap -- and
>> of course only if their firmware and hardware aren't pervasively buggy.
>> Guess how I know this?
>> 
>>> This way it would greatly increase performance (both for IOPS and
>> throughput).
>> 
>> It might increase low-QD IOPS for a single client on slow media with
>> certain networking. Depending on media, it wouldn't increase throughput.
>> 
>> Consider QEMU drive-mirror. If you're doing RF=3 replication, you use 3x
>> the network resources between the client and the servers.
>> 
>>> Later (in the background), record the replicas. This situation would
>> avoid leaving users/software waiting for the recording response from all
>> replicas when the storage is overloaded.
>> 
>> If one makes the mistake of using HDDs, they're going to be overloaded no
>> matter how one slices and dices the ops. Ya just canna squeeze IOPS from a
>> stone. Throughput is going to be limited by the SATA interface and seeking
>> no matter what.
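
To put rough numbers on that, here is a back-of-envelope calculation assuming a typical 7200 RPM SATA drive; the seek time and media rate are ballpark assumptions, not measurements:

    # Random IOPS are bounded by average seek time plus rotational latency.
    rpm = 7200
    avg_seek_ms = 8.5                        # ballpark average seek
    half_rotation_ms = 0.5 * 60_000 / rpm    # ~4.2 ms average rotational latency
    service_time_ms = avg_seek_ms + half_rotation_ms
    print(f"~{1000 / service_time_ms:.0f} random IOPS per spindle")  # ~80

    # Sequential throughput is bounded by the media rate, itself well under
    # the SATA III line rate of 6 Gb/s (~600 MB/s after 8b/10b encoding).
    sata_ceiling_mb_s = 6000 / 10
    media_rate_mb_s = 250                    # ballpark outer-track rate
    print(f"SATA ceiling ~{sata_ceiling_mb_s:.0f} MB/s, media ~{media_rate_mb_s} MB/s")

Even a modest data-center SSD delivers tens of thousands of random IOPS per device, so the gap is orders of magnitude, not percentages.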
>> 
>>> Where I work, performance is very important and we don't have money to
>> make an entire cluster only with NVMe.
>> 
>> If there isn't money, then it isn't very important. But as I've written
>> before, NVMe clusters *do not cost appreciably more than spinners* unless
>> your procurement processes are bad. In fact they can cost significantly
>> less. This is especially true with object storage and archival where one
>> can leverage QLC.
>> 
>> * Buy generic drives from a VAR, not channel drives through a chassis
>> brand. Far less markup, and moreover you get the full 5 year warranty, not
>> just 3 years. And you can painlessly RMA drives yourself - you don't have
>> to spend hours going back and forth with $chassisvendor's TAC arguing about
>> every single RMA. I've found that this is so bad that it is more economical
>> to just throw away a failed component worth < USD 500 than to RMA it. Do
>> you pay for extended warranty / support? That's expensive too.
>> 
>> * Certain chassis brands that shall remain nameless push RoC HBAs hard, with
>> extreme markups -- list prices as high as USD 2000. Per server, eschewing
>> those abominations makes up for a lot of the drive-only unit economics.
>> 
>> * But this is the part that lots of people don't get: You don't just stack
>> up the drives on a desk and use them. They go into *servers* that cost
>> money and *racks* that cost money. They take *power* that costs money.
>> 
>> * $ / IOPS are FAR better for ANY SSD than for HDDs
>> 
>> * RUs cost money, so do chassis and switches
>> 
>> * Drive failures cost money
>> 
>> * So does having your people and applications twiddle their thumbs waiting
>> for stuff to happen. I worked for a supercomputer company that put
>> low-memory, low-end diskless workstations on engineers' desks. They spent
>> lots of time doing nothing, waiting for their applications to respond. This
>> company no longer exists.
>> 
>> * So does the risk of taking *weeks* to heal from a drive failure
>> 
>> Punch honest numbers into
>> https://www.snia.org/forums/cmsi/programs/TCOcalc
>> 
>> I walked through this with a certain global company. QLC SSDs were
>> demonstrated to have like 30% lower TCO than spinners. Part of the equation
>> is that they were accustomed to limiting HDD size to 8 TB because of the
>> bottlenecks, and thus requiring more servers, more switch ports, more DC
>> racks, more rack/stack time, more administrative overhead. You can fit 1.9
>> PB of raw SSD capacity in a 1U server. That same RU will hold at most 88 TB
>> of the largest spinners you can get today. 22 TIMES the density. And since
>> many applications can barely tolerate the spinner bottlenecks, capping
>> spinner size at even 10 TB makes that more like 40 TIMES better density
>> with SSDs.
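
For anyone who wants to check the arithmetic behind those multipliers, using the figures quoted above (the four-drive assumption behind the 88 TB figure is mine):

    ssd_tb_per_ru = 1900    # ~1.9 PB of raw flash per 1U (figure above)
    hdd_tb_per_ru = 88      # largest spinners per 1U (figure above)
    print(round(ssd_tb_per_ru / hdd_tb_per_ru, 1))   # ~21.6 -> the "22 TIMES" above

    # Assuming the 88 TB comes from four 22 TB drives, capping each drive at
    # 10 TB for IOPS-per-TB reasons leaves 40 TB per RU:
    print(round(ssd_tb_per_ru / (4 * 10), 1))        # ~47.5 -> roughly the "40 TIMES"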
>> 
>> 
>>> However, I don't think it's desirable to lose the functionality of the
>> replicas.
>>> I'm just suggesting another way to increase performance without losing
>> that functionality.
>>> 
>>> 
>>> Rafael.
>>> 
>>> 
>>> De: "Anthony D'Atri" <anthony.datri@xxxxxxxxx>
>>> Enviada: 2024/01/31 17:04:08
>>> Para: quaglio@xxxxxxxxxx
>>> Cc: ceph-users@xxxxxxx
>>> Assunto: Re:  Performance improvement suggestion
>>> 
>>> Would you be willing to accept the risk of data loss?
>>> 
>>>> 
>>>> On Jan 31, 2024, at 2:48 PM, quaglio@xxxxxxxxxx wrote:
>>>> 
>>>> Hello everybody,
>>>> I would like to make a suggestion for improving performance in Ceph
>> architecture.
>>>> I don't know if this group would be the best place or if my proposal is
>> correct.
>>>> 
>>>> My suggestion concerns the page
>> https://docs.ceph.com/en/latest/architecture/, specifically the end of the
>> section "Smart Daemons Enable Hyperscale".
>>>> 
>>>> The client needs to "wait" for the configured number of replicas to be
>> written before it receives an OK and continues. This way, if any of the
>> disks holding the PG to be updated is slow, the client is left waiting.
>>>> 
>>>> It would be possible to:
>>>> 
>>>> 1-) Write only to the primary OSD
>>>> 2-) Write the other replicas in the background (in the same way as when an
>> OSD fails and the PG becomes "degraded").
>>>> 
>>>> This way, the client gets a faster response when writing to storage,
>> improving latency and performance (throughput and IOPS).
>>>> 
>>>> I would find it acceptable to have a window of time (seconds) until all
>> replicas are consistent (written asynchronously) in exchange for improved
>> performance.
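
To make the trade-off concrete: today a replicated write is acknowledged only after every OSD in the acting set has persisted it, while the suggestion is to acknowledge after the primary alone. A deliberately simplified sketch of the two acknowledgment policies follows (an illustration of the idea only, not Ceph's actual I/O path):

    import concurrent.futures

    def persist(osd: str, data: bytes) -> str:
        # Stand-in for an OSD writing the object durably.
        return f"{osd} stored {len(data)} bytes"

    def write_current(acting_set: list[str], data: bytes) -> None:
        # Current behavior: the client is acknowledged only after the primary
        # and every replica have persisted the write.
        with concurrent.futures.ThreadPoolExecutor() as pool:
            futures = [pool.submit(persist, osd, data) for osd in acting_set]
            concurrent.futures.wait(futures)   # the slowest OSD sets the latency
        print("ack to client")

    def write_proposed(acting_set: list[str], data: bytes) -> None:
        # Proposed behavior: acknowledge after the primary alone and let the
        # replicas catch up in the background -- the window during which a
        # primary failure would lose an already-acknowledged write.
        primary, *replicas = acting_set
        persist(primary, data)
        print("ack to client")
        pool = concurrent.futures.ThreadPoolExecutor()
        for osd in replicas:
            pool.submit(persist, osd, data)    # not waited on
        pool.shutdown(wait=False)

    write_current(["osd.1", "osd.2", "osd.3"], b"payload")
    write_proposed(["osd.1", "osd.2", "osd.3"], b"payload")

The latency win comes entirely from that unprotected window, which is what the rest of the thread is arguing about.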
>>>> 
>>>> Could you evaluate this scenario?
>>>> 
>>>> 
>>>> Rafael.
>>>> 
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



