Thanks Casey, I understand the design, status, and further work. Give us some time to discuss the design; we will let you know if we have any ideas. Thanks!
-Chunmei

> -----Original Message-----
> From: Casey Bodley <cbodley@xxxxxxxxxx>
> Sent: Tuesday, August 15, 2023 9:44 AM
> To: Liu, Chunmei <chunmei.liu@xxxxxxxxx>
> Cc: Cheng, Yingxin <yingxin.cheng@xxxxxxxxx>; Feng, Hualong <hualong.feng@xxxxxxxxx>; Tang, Guifeng <guifeng.tang@xxxxxxxxx>; mbenjami <mbenjami@xxxxxxxxxx>; Mark Kogan <mkogan@xxxxxxxxxx>; Marcus Watts <mwatts@xxxxxxxxxx>; Gabriel BenHanokh <gbenhano@xxxxxxxxxx>; dev@xxxxxxx; seenafallah@xxxxxxxxx
> Subject: Re: rgw: interest in isa-l_crypto for md5 acceleration
>
> thanks Chunmei (and cc ceph dev list),
>
> "RGWPutObj uses async_md5 for ETag" is a first draft at hooking this up for rgw, and the pr description in https://github.com/ceph/ceph/pull/52488 includes a todo list for further work
>
> the first step is to run this async hashing in parallel with filter->process(), which is what applies compression/encryption and ultimately writes the data to rados. ideally this would mask most of the latency from md5
>
> the pr adds a single instance of async_md5::Batch that runs on a strand executor of rgw's thread pool. this means the md5 calculations can run on any available thread, but will only utilize one core at a time. we might improve utilization by adding more Batch instances, but that could reduce the probability that rgw requests can fill those batches within the "batch timeout". this timeout is a free parameter that would need tuning
>
> i'd like to determine how effective one instance is at offloading the md5 calculations. if the md5 part (almost) always completes before filter->process() does, that's probably good enough. if not, we'd want a way to scale the number of instances based on latency or load
>
> unfortunately, Mark and i are seeing crashes while testing this rgw integration.
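[A sketch of the single-Batch-with-timeout design described above. This is a hypothetical, simplified Python model, not the async_md5 code from the PR: the names Md5Batch, MAX_LANES, and poll() are illustrative, and hashlib stands in for isa-l's 16-lane multi-buffer manager. In the real design, submissions are serialized on a strand executor and a full batch is hashed across SIMD lanes in parallel.]

```python
import hashlib
import time

MAX_LANES = 16          # one AVX512 multi-buffer md5 batch hashes up to 16 streams
BATCH_TIMEOUT = 0.001   # seconds; the free parameter that needs tuning

class Md5Batch:
    """Collects hash requests and flushes them as one multi-buffer batch."""

    def __init__(self, timeout=BATCH_TIMEOUT):
        self.timeout = timeout
        self.pending = []        # list of (data, completion callback)
        self.deadline = None

    def submit(self, data, on_complete):
        if not self.pending:
            # first entry in a new batch starts the timeout clock
            self.deadline = time.monotonic() + self.timeout
        self.pending.append((data, on_complete))
        if len(self.pending) == MAX_LANES:
            self.flush()         # full batch: all lanes utilized

    def poll(self):
        # called periodically; forces a partial batch once the timeout
        # expires, bounding the latency added by waiting for more submissions
        if self.pending and time.monotonic() >= self.deadline:
            self.flush()

    def flush(self):
        batch, self.pending = self.pending, []
        for data, on_complete in batch:
            # the multi-buffer implementation computes these lanes in parallel;
            # hashlib computes them serially here
            on_complete(hashlib.md5(data).hexdigest())
```

[With more than one Md5Batch instance, each instance sees fewer submissions per timeout window, so partial flushes become more likely; that is the utilization/latency tradeoff mentioned above.]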
> i added a comment about these crashes in https://github.com/ceph/ceph/pull/52385#pullrequestreview-1578597699, and will follow up there
>
> finally, regarding the base async/batching library in https://github.com/ceph/ceph/pull/52385, i'd like to explore ways to extend that to cover other types of hashes and other libraries like QAT. do you think QAT would work well under the same async/batching interface?
>
> overall, does this sound like a reasonable design for rgw?
>
> On Mon, Aug 14, 2023 at 5:52 PM Liu, Chunmei <chunmei.liu@xxxxxxxxx> wrote:
> >
> > Hi Casey,
> >
> > In your following two PRs:
> > https://github.com/ceph/ceph/pull/52385 builds an asynchronous batching library on top of the isa-l crypto library's multi-buffer md5 facilities.
> > https://github.com/ceph/ceph/pull/52488 rgw/op: RGWPutObj uses async_md5 for ETag.
> > It seems rgw can do async md5 batching. I am wondering what the current work status is: are all features implemented, or are there other features that need to be implemented in a next step? What can the Intel team help with here?
> >
> > Thanks!
> > -Chunmei
> >
> > > -----Original Message-----
> > > From: Cheng, Yingxin <yingxin.cheng@xxxxxxxxx>
> > > Sent: Thursday, August 10, 2023 10:32 PM
> > > To: Casey Bodley <cbodley@xxxxxxxxxx>; Liu, Chunmei <chunmei.liu@xxxxxxxxx>; Feng, Hualong <hualong.feng@xxxxxxxxx>
> > > Cc: Tang, Guifeng <guifeng.tang@xxxxxxxxx>; mbenjami <mbenjami@xxxxxxxxxx>; Mark Kogan <mkogan@xxxxxxxxxx>; Marcus Watts <mwatts@xxxxxxxxxx>; Gabriel BenHanokh <gbenhano@xxxxxxxxxx>
> > > Subject: RE: rgw: interest in isa-l_crypto for md5 acceleration
> > >
> > > Include Chunmei.
> > >
> > > Regards,
> > > -Yingxin
> > >
> > > > -----Original Message-----
> > > > From: Cheng, Yingxin
> > > > Sent: Thursday, July 13, 2023 3:54 PM
> > > > To: Casey Bodley <cbodley@xxxxxxxxxx>; Feng, Hualong <hualong.feng@xxxxxxxxx>
> > > > Cc: Tang, Guifeng <guifeng.tang@xxxxxxxxx>; mbenjami <mbenjami@xxxxxxxxxx>; Mark Kogan <mkogan@xxxxxxxxxx>; Marcus Watts <mwatts@xxxxxxxxxx>; Gabriel BenHanokh <gbenhano@xxxxxxxxxx>
> > > > Subject: RE: rgw: interest in isa-l_crypto for md5 acceleration
> > > >
> > > > Merging another RGW thread here; it is discussing the same thing.
> > > >
> > > > I'm also learning, and will try to answer below:
> > > >
> > > > > But when less than 100% CPU is in use, using AVX512 to save cores may cause a slowdown.
> > > >
> > > > > well put, this is the part i'm struggling with too. it feels like a tradeoff between cpu usage and added latency (both from thread synchronization and waiting for full batches)
> > > >
> > > > Yeah, there are combined effects from both software and hardware: latency vs batching, CPU downclocking (which should have less impact on later CPU models such as SPR), sync vs async, etc.
> > > >
> > > > > I'm not sure if the current work is an informed implementation with respect to the isa-l interface; I'm fairly sure it is with regard to boost::asio and related topics.
> > > >
> > > > My understanding is that the asynchronous way has the best opportunity to fully use the 16-lane AVX512 acceleration by batching the requests, and falling back to the synchronous way is worth considering for small sizes or low depth. But the decisions need to be based on real test results.
> > > >
> > > > > Is there anything we can do to spread the load evenly on SSE/AVX/AVX2/AVX512 units?
> > > > > I assume that each physical core has a standalone unit - can we make sure to employ them all in parallel?
> > > >
> > > > SIMD is a set of CPU instructions rather than an off-loadable device. They can only be executed synchronously from a thread. So if the parallelism exceeds 16 lanes, it should be reasonable to start another worker thread. And if there are only 8 outstanding lanes at the moment, AVX2 should be a better choice than the heavier AVX512. I'm not sure yet whether the isa-l library is intelligent enough to pick the right instruction set, or whether it is manually controlled.
> > > >
> > > > > This handles only MD5, which certainly matters to us, but I suspect we strongly need acceleration for sha-256 and sha-512.
> > > >
> > > > It looks like isa-l supports these optimizations, which is the recommended way because there might be multiple options available at the same time (AVX512 and sha-ni). Looking at the source code, the library is able to detect the availability and select the best possible option.
> > > >
> > > > Regards,
> > > > -Yingxin
> > > >
> > > > > -----Original Message-----
> > > > > From: Casey Bodley <cbodley@xxxxxxxxxx>
> > > > > Sent: Thursday, July 13, 2023 3:19 AM
> > > > > To: Feng, Hualong <hualong.feng@xxxxxxxxx>
> > > > > Cc: Cheng, Yingxin <yingxin.cheng@xxxxxxxxx>; Tang, Guifeng <guifeng.tang@xxxxxxxxx>; mbenjami <mbenjami@xxxxxxxxxx>; Mark Kogan <mkogan@xxxxxxxxxx>
> > > > > Subject: Re: rgw: interest in isa-l_crypto for md5 acceleration
> > > > >
> > > > > thanks Hualong, (cc Matt and Mark)
> > > > >
> > > > > On Wed, Jul 12, 2023 at 6:22 AM Feng, Hualong <hualong.feng@xxxxxxxxx> wrote:
> > > > > >
> > > > > > Hi Casey
> > > > > >
> > > > > > Our team is interested in this, but we need to study more details.
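[Yingxin's reasoning above, picking the lighter AVX2 path when few lanes are outstanding and adding worker threads once the parallelism exceeds 16 lanes, could be expressed as a small planner. This is an illustrative sketch only: the function name, return shape, and thresholds are hypothetical, and isa-l's own multibinary dispatch may already make the instruction-set choice internally.]

```python
import math

AVX512_LANES = 16   # md5 multi-buffer lanes per AVX512 batch
AVX2_LANES = 8      # lanes per AVX2 batch

def plan(outstanding):
    """Pick a SIMD width and worker-thread count for `outstanding` hash streams."""
    if outstanding == 0:
        return {"width": None, "workers": 0}
    if outstanding <= AVX2_LANES:
        # few streams: AVX2 fills its 8 lanes and avoids AVX512 downclocking
        return {"width": "avx2", "workers": 1}
    # beyond 16 concurrent streams a single thread's SIMD unit can't keep up,
    # since the instructions execute synchronously; add more worker threads,
    # each driving its own 16-lane batch
    return {"width": "avx512",
            "workers": math.ceil(outstanding / AVX512_LANES)}
```

[The thresholds would need to come from real test results, as noted above; downclocking behavior in particular varies by CPU generation.]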
> > > > > >
> > > > > > We have looked into md5 implemented with AVX512 before. We know that md5 can only be calculated serially for a single object, and cannot be split for concurrent calculation. The basic idea of the AVX512 implementation is to divide a core into 16 lanes, so that 16 md5 calculations can proceed at the same time. So when we use md5 implemented with AVX512, we need to wait for multiple request objects to calculate md5 at the same time.
> > > > > >
> > > > > > So there are two difficulties to consider:
> > > > > > 1. When calculating, we need to fill all the lanes on a core as much as possible. Only in this way can the advantages of AVX512/AVX2 be realized.
> > > > > > 2. What we have seen so far is the comparison between using AVX512/AVX2 on a single core and not using it. When we actually run RGW, only if RGW or the machine it runs on is already at 100% CPU will using AVX512 increase the calculation speed of md5. But when less than 100% CPU is in use, using AVX512 to save cores may cause a slowdown. So how do we know under what circumstances we should use AVX instructions in code?
> > > > >
> > > > > well put, this is the part i'm struggling with too.
> > > > > it feels like a tradeoff between cpu usage and added latency (both from thread synchronization and waiting for full batches)
> > > > >
> > > > > if we could track the rate of hash updates per second, we might use that to decide whether we're likely to get a full batch within some limit of acceptable latency
> > > > >
> > > > > RGWPutObj could probably mask some of this latency by running these asynchronous hashes in parallel while reading the next 4MB chunk from the frontend
> > > > >
> > > > > in https://github.com/ceph/ceph/pull/52385 i introduced the concept of a batch_timeout, which can force the processing of a partial batch. bounding this latency seemed like a necessary part of the model. if RGWPutObj can mask this latency, then we might use that batch_timeout to avoid the need to track a global hash rate
> > > > >
> > > > > > We have implemented a POC before, using QAT to implement the hash algorithms in ceph (SHA256, MD5, SHA1, SHA512, HMACSHA256, HMACSHA1),
> > > > >
> > > > > very cool, thanks. what is the benefit of QAT here, compared to something like isa-l_crypto that just uses AVX instructions? is QAT able to offload some of this? would the use of QAT rule out any hardware (like AMD cpus) that would otherwise support AVX?
> > > > >
> > > > > > However, due to md5 security concerns and the lack of a convenient alternative framework in ceph, it is temporarily blocked.
> > > > > > https://github.com/ceph/ceph/compare/main...hualongfeng:ceph:hash_qat_mode
> > > > >
> > > > > i'll reach out to our security contact to get some more clarity on this stuff. i know that openssl's FIPS certification is important to downstream products.
> > > > > but md5 isn't being used as a cryptographic hash here and etag isn't used for security, so i've assumed we could use other md5 implementations there. i'm less sure about the SHA family, but i know Matt's interested in using those for checksumming in rgw
> > > > > >
> > > > > > Thanks
> > > > > > -Hualong
> > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Casey Bodley <cbodley@xxxxxxxxxx>
> > > > > > > Sent: Tuesday, July 11, 2023 11:06 PM
> > > > > > > To: Feng, Hualong <hualong.feng@xxxxxxxxx>
> > > > > > > Subject: rgw: interest in isa-l_crypto for md5 acceleration
> > > > > > >
> > > > > > > hey Hualong,
> > > > > > >
> > > > > > > ceph is already using intel's isa-l_crypto library for crypto acceleration. i just started looking into its multi-buffer md5 implementation (https://github.com/intel/isa-l_crypto/blob/master/include/md5_mb.h) for use in rgw to vectorize our ETag calculations (a feature tracked in https://tracker.ceph.com/issues/61646). i've started some initial work in https://github.com/ceph/ceph/pull/52385, but we'll still need to decide how best to integrate that into rgw
> > > > > > >
> > > > > > > would your team be interested in collaborating on this? we'd love your input on the design for rgw, and how best to measure and tune its performance

_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx
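[The latency-masking idea that comes up several times above, overlapping the asynchronous md5 for a chunk with filter->process() (or with reading the next chunk from the frontend), reduces to running two operations per chunk in parallel and joining before the next serial md5 update. A minimal sketch under stated assumptions: the names put_object and write_chunk are hypothetical, and hashlib plus a thread pool stand in for the batched isa-l path and the rados write.]

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

def put_object(chunks, write_chunk, executor):
    """Hash each chunk in the background while write_chunk() runs.

    write_chunk stands in for filter->process(): compression/encryption
    plus the write to rados. md5 is serial per object, so each update
    must finish before the next one starts, but it can overlap with the
    write of the same chunk; ideally the write is the slower of the two,
    which masks the md5 latency entirely.
    """
    md5 = hashlib.md5()
    for data in chunks:
        future = executor.submit(md5.update, data)  # async hash, e.g. a Batch submit
        write_chunk(data)     # runs concurrently with the hash
        future.result()       # join before the next serial md5 update
    return md5.hexdigest()    # becomes the object's ETag
```

[If the hash (almost) always completes before the write does, as hoped above, the join is effectively free and one Batch instance may be enough.]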