thanks Chunmei (and cc ceph dev list),

"RGWPutObj uses async_md5 for ETag" is a first draft at hooking this up for rgw, and the pr description in https://github.com/ceph/ceph/pull/52488 includes a todo list for further work. the first step is to run this async hashing in parallel with filter->process(), which is what applies compression/encryption and ultimately writes the data to rados. ideally this would mask most of the latency from md5.

the pr adds a single instance of async_md5::Batch that runs on a strand executor of rgw's thread pool. this means the md5 calculations can run on any available thread, but will only utilize one core at a time. we might improve utilization by adding more Batch instances, but that could reduce the probability that rgw requests can fill those batches within the "batch timeout". this timeout is a free parameter that would need tuning.

i'd like to determine how effective one instance is at offloading the md5 calculations. if the md5 part (almost) always completes before filter->process() does, that's probably good enough. if not, we'd want a way to scale the number of instances based on latency or load.

unfortunately, Mark and i are seeing crashes while testing this rgw integration. i added a comment about this in https://github.com/ceph/ceph/pull/52385#pullrequestreview-1578597699, and will follow up there.

finally, regarding the base async/batching library in https://github.com/ceph/ceph/pull/52385, i'd like to explore ways to extend that to cover other types of hashes and other libraries like QAT. do you think QAT would work well under the same async/batching interface?

overall, does this sound like a reasonable design for rgw?
On Mon, Aug 14, 2023 at 5:52 PM Liu, Chunmei <chunmei.liu@xxxxxxxxx> wrote:
>
> Hi Casey,
>
> In your following two PRs:
> https://github.com/ceph/ceph/pull/52385 builds an asynchronous batching library on top of the isa-l crypto library's multi-buffer md5 facilities.
> https://github.com/ceph/ceph/pull/52488 rgw/op: RGWPutObj uses async_md5 for ETag.
> Seems rgw can do async md5 batching. I am wondering what the current work status is. Are all features implemented, or do other features still need to be implemented in a next step? How can the intel team help here?
>
> Thanks!
> -Chunmei
>
> > -----Original Message-----
> > From: Cheng, Yingxin <yingxin.cheng@xxxxxxxxx>
> > Sent: Thursday, August 10, 2023 10:32 PM
> > To: Casey Bodley <cbodley@xxxxxxxxxx>; Liu, Chunmei <chunmei.liu@xxxxxxxxx>; Feng, Hualong <hualong.feng@xxxxxxxxx>
> > Cc: Tang, Guifeng <guifeng.tang@xxxxxxxxx>; mbenjami <mbenjami@xxxxxxxxxx>; Mark Kogan <mkogan@xxxxxxxxxx>; Marcus Watts <mwatts@xxxxxxxxxx>; Gabriel BenHanokh <gbenhano@xxxxxxxxxx>
> > Subject: RE: rgw: interest in isa-l_crypto for md5 acceleration
> >
> > Include Chunmei.
> >
> > Regards,
> > -Yingxin
> >
> > > -----Original Message-----
> > > From: Cheng, Yingxin
> > > Sent: Thursday, July 13, 2023 3:54 PM
> > > To: Casey Bodley <cbodley@xxxxxxxxxx>; Feng, Hualong <hualong.feng@xxxxxxxxx>
> > > Cc: Tang, Guifeng <guifeng.tang@xxxxxxxxx>; mbenjami <mbenjami@xxxxxxxxxx>; Mark Kogan <mkogan@xxxxxxxxxx>; Marcus Watts <mwatts@xxxxxxxxxx>; Gabriel BenHanokh <gbenhano@xxxxxxxxxx>
> > > Subject: RE: rgw: interest in isa-l_crypto for md5 acceleration
> > >
> > > Merge another RGW thread here, it is discussing the same thing.
> > >
> > > I'm also learning and will try to answer below:
> > >
> > > > But when less than 100% of the CPU is in use, using AVX512 to save cores may cause a slowdown.
> > >
> > > > well put, this is the part i'm struggling with too.
> > > > it feels like a tradeoff between cpu usage and added latency (both from thread synchronization and waiting for full batches)
> > >
> > > Yeah, there are synthetic effects from both software and hardware: latency vs batching, CPU downclocking (which should have less impact on later CPU models such as SPR), sync vs async, etc.
> > >
> > > > I'm not sure if the current work is an informed implementation with respect to the isa-l interface; I'm fairly sure it is with regard to boost::asio and related topics.
> > >
> > > My understanding is that the asynchronous way has the best opportunity to fully use the 16-lane AVX512 acceleration by batching the requests, and falling back to the synchronous way is worth considering under small sizes or low depth. But the decisions need to be based on real test results.
> > >
> > > > Is there anything we can do to spread the load evenly on SSE/AVX/AVX2/AVX512 units? I assume that each physical core has a standalone unit - can we make sure to employ them all in parallel?
> > >
> > > SIMD is a set of CPU instructions rather than an off-loadable device; it can only be executed synchronously from a thread. So if the parallelism exceeds 16 lanes, it should be reasonable to start another worker thread. And if there are only 8 outstanding lanes at the moment, AVX2 should be a better choice than the heavier AVX512. I'm not sure yet whether the isa-l library is intelligent enough to pick the right instruction or whether it is manually controlled.
> > >
> > > > This handles only MD5, which certainly matters to us, but I suspect we strongly need acceleration for sha-256 and sha-512.
> > >
> > > Looks like isa-l supports these optimizations, which is the recommended way because there might be multiple options available at the same time (AVX512 and sha-ni).
> > > Looking at the source code, the library is able to detect the availability and select the best possible option.
> > >
> > > Regards,
> > > -Yingxin
> > >
> > > > -----Original Message-----
> > > > From: Casey Bodley <cbodley@xxxxxxxxxx>
> > > > Sent: Thursday, July 13, 2023 3:19 AM
> > > > To: Feng, Hualong <hualong.feng@xxxxxxxxx>
> > > > Cc: Cheng, Yingxin <yingxin.cheng@xxxxxxxxx>; Tang, Guifeng <guifeng.tang@xxxxxxxxx>; mbenjami <mbenjami@xxxxxxxxxx>; Mark Kogan <mkogan@xxxxxxxxxx>
> > > > Subject: Re: rgw: interest in isa-l_crypto for md5 acceleration
> > > >
> > > > thanks Hualong, (cc Matt and Mark)
> > > >
> > > > On Wed, Jul 12, 2023 at 6:22 AM Feng, Hualong <hualong.feng@xxxxxxxxx> wrote:
> > > > >
> > > > > Hi Casey
> > > > >
> > > > > Our team is interested in this, but we need to study more details.
> > > > >
> > > > > We have learned about md5 implemented with AVX512 before. We know that md5 can only be calculated serially for an object, and cannot be split for concurrent calculation. The basic idea of AVX512 here is to divide a core into 16 lanes, and these 16 lanes can calculate at the same time. So when we use md5 implemented with AVX512, we need to wait for multiple request objects to calculate md5 at the same time.
> > > > >
> > > > > So here are two difficulties to consider:
> > > > > 1. When calculating, we need to fill all the lanes on a core as much as possible. Only in this way can the advantages of AVX512/AVX2 be realized.
> > > > > 2. What we have seen so far is the comparison between using AVX512/AVX2 on a single core and not using it. But when we actually run RGW: if RGW, or the machine it runs on, is already at 100% CPU, then using AVX512 will increase the calculation speed of md5.
> > > > > But when less than 100% of the CPU is in use, using AVX512 to save cores may cause a slowdown. So how do we know under what circumstances we should use AVX instructions in code?
> > > >
> > > > well put, this is the part i'm struggling with too. it feels like a tradeoff between cpu usage and added latency (both from thread synchronization and waiting for full batches)
> > > >
> > > > if we could track the rate of hash updates per second, we might use that to decide whether we're likely to get a full batch within some limit of acceptable latency
> > > >
> > > > RGWPutObj could probably mask some of this latency by running these asynchronous hashes in parallel while reading the next 4MB chunk from the frontend
> > > >
> > > > in https://github.com/ceph/ceph/pull/52385 i introduced the concept of a batch_timeout, which can force the processing of a partial batch. bounding this latency seemed like a necessary part of the model. if RGWPutObj can mask this latency, then we might use that batch_timeout to avoid the need to track a global hash rate
> > > >
> > > > > We have implemented a POC before, using QAT to implement the hash algorithms in ceph (SHA256, MD5, SHA1, SHA512, HMACSHA256, HMACSHA1),
> > > >
> > > > very cool, thanks. what is the benefit of QAT here, compared to something like isa-l_crypto that just uses AVX instructions? is QAT able to offload some of this? would the use of QAT rule out any hardware (like AMD cpus) that would otherwise support AVX?
> > > >
> > > > > However, due to md5 security concerns and the lack of a convenient alternative framework in ceph, it is temporarily blocked.
> > > > > https://github.com/ceph/ceph/compare/main...hualongfeng:ceph:hash_qat_mode
> > > >
> > > > i'll reach out to our security contact to get some more clarity on this stuff. i know that openssl's FIPS certification is important to downstream products. but md5 isn't a cryptographic hash and etag isn't used for security, so i've assumed we could use other md5 implementations there. i'm less sure about the SHA family, but i know Matt's interested in using those for checksumming in rgw
> > > > >
> > > > > Thanks
> > > > > -Hualong
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Casey Bodley <cbodley@xxxxxxxxxx>
> > > > > > Sent: Tuesday, July 11, 2023 11:06 PM
> > > > > > To: Feng, Hualong <hualong.feng@xxxxxxxxx>
> > > > > > Subject: rgw: interest in isa-l_crypto for md5 acceleration
> > > > > >
> > > > > > hey Hualong,
> > > > > >
> > > > > > ceph is already using intel's isa-l_crypto library for crypto acceleration. i just started looking into its multi-buffer md5 implementation (https://github.com/intel/isa-l_crypto/blob/master/include/md5_mb.h) for use in rgw to vectorize our ETag calculations (a feature tracked in https://tracker.ceph.com/issues/61646). i've started some initial work in https://github.com/ceph/ceph/pull/52385, but we'll still need to decide how best to integrate that into rgw
> > > > > >
> > > > > > would your team be interested in collaborating on this? we'd love your input on the design for rgw, and how best to measure and tune its performance

_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx