Cache tiering is deprecated.

> On Feb 20, 2024, at 17:03, Özkan Göksu <ozkangksu@xxxxxxxxx> wrote:
>
> Hello.
>
> I didn't test it personally, but what about a rep 1 write cache pool on
> NVMe, backed by another rep 2 pool?
>
> It has the potential to be exactly what you are looking for, in theory.
>
>
> On Thu, Feb 1, 2024 at 20:54, quaglio@xxxxxxxxxx <quaglio@xxxxxxxxxx> wrote:
>
>>
>> Ok Anthony,
>>
>> I understood what you said. I also believe in all the professional history
>> and experience you have.
>>
>> Anyway, could there be a configuration flag to make this happen? Just like
>> the ones that already exist, such as "--yes-i-really-mean-it"?
>>
>> This way, the default storage behavior would remain as it is, but it would
>> allow situations like the one I mentioned to be possible.
>>
>> It would permit some rules to be relaxed (even if they are not OK at first
>> glance). Likewise, there are already mechanisms like lazyio that make
>> exceptions to the standard behavior.
>>
>> To be clear: it is just a suggestion. If this type of functionality is not
>> interesting, that is OK.
>>
>>
>> Rafael.
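For reference, Özkan's idea maps onto Ceph's cache-tiering commands, which
are deprecated and discouraged upstream. A minimal sketch, assuming
hypothetical pool names "base-pool" and "nvme-cache", illustrative PG counts,
and an NVMe-only CRUSH rule already applied to the cache pool; note that with
size=1 on the cache, any single NVMe failure loses whatever has not yet been
flushed to the base pool:

    # Deprecated feature; shown only to illustrate the idea being discussed.
    ceph osd pool create base-pool 128 128 replicated
    ceph osd pool set base-pool size 2
    ceph osd pool create nvme-cache 128 128 replicated
    # Recent releases refuse size=1 unless this mon option is enabled first:
    ceph config set global mon_allow_pool_size_one true
    ceph osd pool set nvme-cache size 1 --yes-i-really-mean-it
    # Put the cache pool in front of the base pool in writeback mode:
    ceph osd tier add base-pool nvme-cache
    ceph osd tier cache-mode nvme-cache writeback
    ceph osd tier set-overlay base-pool nvme-cache
    ceph osd pool set nvme-cache hit_set_type bloom

In this arrangement writes are acknowledged by the single NVMe replica and
flushed to the rep 2 base pool later, which is exactly the unprotected window
the rest of this thread is about.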
>> ------------------------------
>>
>> From: "Anthony D'Atri" <anthony.datri@xxxxxxxxx>
>> Sent: 2024/02/01 12:10:30
>> To: quaglio@xxxxxxxxxx
>> Cc: ceph-users@xxxxxxx
>> Subject: Re: Performance improvement suggestion
>>
>>
>>> I didn't say I would accept the risk of losing data.
>>
>> That's implicit in what you suggest, though.
>>
>>> I just said that it would be interesting if the objects were first
>>> recorded only in the primary OSD.
>>
>> What happens when that host / drive smokes before it can replicate? What
>> happens if a secondary OSD gets a read op before the primary updates it?
>> Swift object storage users have to code around this potential. It's a
>> non-starter for block storage.
>>
>> This is similar to why RoC HBAs (which are a badly outdated thing to begin
>> with) will only enter writeback mode if they have a BBU / supercap -- and
>> of course only if their firmware and hardware aren't pervasively buggy.
>> Guess how I know this?
>>
>>> This way it would greatly increase performance (both for IOPS and
>>> throughput).
>>
>> It might increase low-QD IOPS for a single client on slow media with
>> certain networking. Depending on media, it wouldn't increase throughput.
>>
>> Consider QEMU drive-mirror. If you're doing RF=3 replication, you use 3x
>> the network resources between the client and the servers.
>>
>>> Later (in the background), record the replicas. This would avoid leaving
>>> users/software waiting for the write acknowledgement from all replicas
>>> when the storage is overloaded.
>>
>> If one makes the mistake of using HDDs, they're going to be overloaded no
>> matter how one slices and dices the ops. Ya just canna squeeze IOPS from a
>> stone. Throughput is going to be limited by the SATA interface and seeking
>> no matter what.
>>
>>> Where I work, performance is very important and we don't have the money
>>> to build an entire cluster only with NVMe.
>>
>> If there isn't money, then it isn't very important. But as I've written
>> before, NVMe clusters *do not cost appreciably more than spinners* unless
>> your procurement processes are bad. In fact they can cost significantly
>> less. This is especially true with object storage and archival, where one
>> can leverage QLC.
>>
>> * Buy generic drives from a VAR, not channel drives through a chassis
>> brand. Far less markup, and moreover you get the full 5-year warranty, not
>> just 3 years. And you can painlessly RMA drives yourself -- you don't have
>> to spend hours going back and forth with $chassisvendor's TAC arguing
>> about every single RMA. I've found that this is so bad that it is more
>> economical to just throw away a failed component worth < USD 500 than to
>> RMA it. Do you pay for extended warranty / support? That's expensive too.
>>
>> * Certain chassis brands who shall remain nameless push RoC HBAs hard,
>> with extreme markups -- list prices as high as USD 2000. Per server,
>> eschewing those abominations makes up for a lot of the drive-only unit
>> economics.
>>
>> * But this is the part that lots of people don't get: you don't just stack
>> up the drives on a desk and use them. They go into *servers* that cost
>> money and *racks* that cost money. They take *power* that costs money.
>>
>> * $ / IOPS is FAR better for ANY SSD than for HDDs.
>>
>> * RUs cost money; so do chassis and switches.
>>
>> * Drive failures cost money.
>>
>> * So does having your people and applications twiddle their thumbs waiting
>> for stuff to happen. I worked for a supercomputer company that put
>> low-memory, low-end diskless workstations on engineers' desks. They spent
>> lots of time doing nothing, waiting for their applications to respond.
>> That company no longer exists.
>>
>> * So does the risk of taking *weeks* to heal from a drive failure.
>>
>> Punch honest numbers into
>> https://www.snia.org/forums/cmsi/programs/TCOcalc
>>
>> I walked through this with a certain global company. QLC SSDs were
>> demonstrated to have something like 30% lower TCO than spinners. Part of
>> the equation is that they were accustomed to limiting HDD size to 8 TB
>> because of the bottlenecks, thus requiring more servers, more switch
>> ports, more DC racks, more rack/stack time, and more administrative
>> overhead. You can fit 1.9 PB of raw SSD capacity in a 1U server. That same
>> RU will hold at most 88 TB of the largest spinners you can get today: 22
>> TIMES the density. And since many applications can barely tolerate spinner
>> bottlenecks even at that size, capping spinner size at 10 TB makes the gap
>> more like 40 TIMES in favor of SSDs.
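For anyone checking the density figures above, the arithmetic is roughly as
follows; the 32 x 61.44 TB flash configuration is an assumption used to reach
~1.9 PB per RU, since the message only cites the totals:

    # ~1.9 PB of QLC flash vs. 88 TB of spinners in the same 1U:
    echo "scale=1; (32 * 61.44) / (4 * 22)" | bc   # -> 22.3x, the "22 TIMES" figure
    # Capping HDDs at 10 TB (4 x 10 TB per RU) widens the gap further:
    echo "scale=1; (32 * 61.44) / (4 * 10)" | bc   # -> 49.1x, i.e. "40 TIMES" or more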
>>
>>> However, I don't think it's interesting to lose the functionality of the
>>> replicas. I'm just suggesting another way to increase performance without
>>> losing that functionality.
>>>
>>>
>>> Rafael.
>>>
>>>
>>> From: "Anthony D'Atri" <anthony.datri@xxxxxxxxx>
>>> Sent: 2024/01/31 17:04:08
>>> To: quaglio@xxxxxxxxxx
>>> Cc: ceph-users@xxxxxxx
>>> Subject: Re: Performance improvement suggestion
>>>
>>> Would you be willing to accept the risk of data loss?
>>>
>>>> On Jan 31, 2024, at 2:48 PM, quaglio@xxxxxxxxxx wrote:
>>>>
>>>> Hello everybody,
>>>> I would like to make a suggestion for improving performance in the Ceph
>>>> architecture. I don't know if this group is the best place or if my
>>>> proposal is correct.
>>>>
>>>> My suggestion concerns https://docs.ceph.com/en/latest/architecture/, at
>>>> the end of the section "Smart Daemons Enable Hyperscale".
>>>>
>>>> The client needs to wait for the configured number of replicas to be
>>>> written (so that the client receives an OK and continues). This way, if
>>>> there is slowness on any of the disks on which the PG will be updated,
>>>> the client is left waiting.
>>>>
>>>> It would be possible to:
>>>>
>>>> 1-) Only record on the primary OSD.
>>>> 2-) Write the other replicas in the background (the same way as when an
>>>> OSD fails: "degraded").
>>>>
>>>> This way, the client gets a faster response when writing to storage,
>>>> improving latency and performance (throughput and IOPS).
>>>>
>>>> I would find it plausible to accept a period of time (seconds) until all
>>>> replicas are OK (written asynchronously) at the expense of improving
>>>> performance.
>>>>
>>>> Could you evaluate this scenario?
>>>>
>>>>
>>>> Rafael.

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
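A quick way to quantify the low-queue-depth effect discussed earlier in the
thread is the stock rados bench tool; a rough sketch, assuming a scratch pool
named "testpool" that you can safely write to:

    # Queue depth 1: every write waits out the full replication round trip.
    rados bench -p testpool 30 write -t 1 --no-cleanup
    # Queue depth 32: replica acknowledgements overlap, so aggregate IOPS and
    # throughput recover without weakening any consistency guarantee.
    rados bench -p testpool 30 write -t 32 --no-cleanup
    # Remove the benchmark objects afterwards.
    rados -p testpool cleanup

If the second run is dramatically faster in aggregate, the bottleneck is
per-op latency rather than media or network throughput, and client-side
asynchronous or parallel I/O will usually help more than relaxing the replica
acknowledgement semantics would.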