Thanks Casey, I understand the design, status, and further work. Give us some time to discuss the design; we will let you know if we have any ideas. Thanks!
-Chunmei

> -----Original Message-----
> From: Casey Bodley <cbodley@xxxxxxxxxx>
> Sent: Tuesday, August 15, 2023 9:44 AM
> To: Liu, Chunmei <chunmei.liu@xxxxxxxxx>
> Cc: Cheng, Yingxin <yingxin.cheng@xxxxxxxxx>; Feng, Hualong <hualong.feng@xxxxxxxxx>; Tang, Guifeng <guifeng.tang@xxxxxxxxx>; mbenjami <mbenjami@xxxxxxxxxx>; Mark Kogan <mkogan@xxxxxxxxxx>; Marcus Watts <mwatts@xxxxxxxxxx>; Gabriel BenHanokh <gbenhano@xxxxxxxxxx>; dev@xxxxxxx; seenafallah@xxxxxxxxx
> Subject: Re: rgw: interest in isa-l_crypto for md5 acceleration
>
> thanks Chunmei (and cc ceph dev list),
>
> "RGWPutObj uses async_md5 for ETag" is a first draft at hooking this up for rgw, and the pr description in https://github.com/ceph/ceph/pull/52488 includes a todo list for further work
>
> the first step is to run this async hashing in parallel with filter->process(), which is what applies compression/encryption and ultimately writes the data to rados. ideally this would mask most of the latency from md5
>
> the pr adds a single instance of async_md5::Batch that runs on a strand executor of rgw's thread pool. this means the md5 calculations can run on any available thread, but will only utilize one core at a time. we might improve utilization by adding more Batch instances, but that could reduce the probability that rgw requests can fill those batches within the "batch timeout". this timeout is a free parameter that would need tuning
>
> i'd like to determine how effective one instance is at offloading the md5 calculations. if the md5 part (almost) always completes before filter->process() does, that's probably good enough. if not, we'd want a way to scale the number of instances based on latency or load
>
> unfortunately, Mark and i are seeing crashes while testing this rgw integration.
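[A sketch of the single-Batch-with-timeout design described above. This is a hypothetical, simplified Python model, not the async_md5 code from the PR: the names Md5Batch, MAX_LANES, and poll() are illustrative, and hashlib stands in for isa-l's 16-lane multi-buffer manager. In the real design, submissions are serialized on a strand executor and a full batch is hashed across SIMD lanes in parallel.]

```python
import hashlib
import time

MAX_LANES = 16          # one AVX512 multi-buffer md5 batch hashes up to 16 streams
BATCH_TIMEOUT = 0.001   # seconds; the free parameter that needs tuning

class Md5Batch:
    """Collects hash requests and flushes them as one multi-buffer batch."""

    def __init__(self, timeout=BATCH_TIMEOUT):
        self.timeout = timeout
        self.pending = []        # list of (data, completion callback)
        self.deadline = None

    def submit(self, data, on_complete):
        if not self.pending:
            # first entry in a new batch starts the timeout clock
            self.deadline = time.monotonic() + self.timeout
        self.pending.append((data, on_complete))
        if len(self.pending) == MAX_LANES:
            self.flush()         # full batch: all lanes utilized

    def poll(self):
        # called periodically; forces a partial batch once the timeout
        # expires, bounding the latency added by waiting for more submissions
        if self.pending and time.monotonic() >= self.deadline:
            self.flush()

    def flush(self):
        batch, self.pending = self.pending, []
        for data, on_complete in batch:
            # the multi-buffer implementation computes these lanes in parallel;
            # hashlib computes them serially here
            on_complete(hashlib.md5(data).hexdigest())
```

[With more than one Md5Batch instance, each instance sees fewer submissions per timeout window, so partial flushes become more likely; that is the utilization/latency tradeoff mentioned above.]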
> i added a comment about these crashes in https://github.com/ceph/ceph/pull/52385#pullrequestreview-1578597699, and will follow up there
>
> finally, regarding the base async/batching library in https://github.com/ceph/ceph/pull/52385, i'd like to explore ways to extend that to cover other types of hashes and other libraries like QAT. do you think QAT would work well under the same async/batching interface?
>
> overall, does this sound like a reasonable design for rgw?
>
> On Mon, Aug 14, 2023 at 5:52 PM Liu, Chunmei <chunmei.liu@xxxxxxxxx> wrote:
> >
> > Hi Casey,
> >
> > In your following two PRs:
> > https://github.com/ceph/ceph/pull/52385 builds an asynchronous batching library on top of the isa-l crypto library's multi-buffer md5 facilities.
> > https://github.com/ceph/ceph/pull/52488 rgw/op: RGWPutObj uses async_md5 for ETag.
> > It seems rgw can do async md5 batching. I am wondering what the current work status is: are all features implemented, or are there other features that need to be implemented in a next step? What can the Intel team help with here?
> >
> > Thanks!
> > -Chunmei
> >
> > > -----Original Message-----
> > > From: Cheng, Yingxin <yingxin.cheng@xxxxxxxxx>
> > > Sent: Thursday, August 10, 2023 10:32 PM
> > > To: Casey Bodley <cbodley@xxxxxxxxxx>; Liu, Chunmei <chunmei.liu@xxxxxxxxx>; Feng, Hualong <hualong.feng@xxxxxxxxx>
> > > Cc: Tang, Guifeng <guifeng.tang@xxxxxxxxx>; mbenjami <mbenjami@xxxxxxxxxx>; Mark Kogan <mkogan@xxxxxxxxxx>; Marcus Watts <mwatts@xxxxxxxxxx>; Gabriel BenHanokh <gbenhano@xxxxxxxxxx>
> > > Subject: RE: rgw: interest in isa-l_crypto for md5 acceleration
> > >
> > > Include Chunmei.
> > >
> > > Regards,
> > > -Yingxin
> > >
> > > > -----Original Message-----
> > > > From: Cheng, Yingxin
> > > > Sent: Thursday, July 13, 2023 3:54 PM
> > > > To: Casey Bodley <cbodley@xxxxxxxxxx>; Feng, Hualong <hualong.feng@xxxxxxxxx>
> > > > Cc: Tang, Guifeng <guifeng.tang@xxxxxxxxx>; mbenjami <mbenjami@xxxxxxxxxx>; Mark Kogan <mkogan@xxxxxxxxxx>; Marcus Watts <mwatts@xxxxxxxxxx>; Gabriel BenHanokh <gbenhano@xxxxxxxxxx>
> > > > Subject: RE: rgw: interest in isa-l_crypto for md5 acceleration
> > > >
> > > > Merging another RGW thread here; it is discussing the same thing.
> > > >
> > > > I'm also learning, and will try to answer below:
> > > >
> > > > > But when less than 100% CPU is in use, using AVX512 to save cores may cause a slowdown.
> > > >
> > > > > well put, this is the part i'm struggling with too. it feels like a tradeoff between cpu usage and added latency (both from thread synchronization and waiting for full batches)
> > > >
> > > > Yeah, there are combined effects from both software and hardware: latency vs batching, CPU downclocking (which should have less impact on later CPU models such as SPR), sync vs async, etc.
> > > >
> > > > > I'm not sure if the current work is an informed implementation with respect to the isa-l interface; I'm fairly sure it is with regard to boost::asio and related topics.
> > > >
> > > > My understanding is that the asynchronous way has the best opportunity to fully use the 16-lane AVX512 acceleration by batching the requests, and falling back to the synchronous way is worth considering for small sizes or low depth. But the decisions need to be based on real test results.
> > > >
> > > > > Is there anything we can do to spread the load evenly on SSE/AVX/AVX2/AVX512 units?
> > > > > I assume that each physical core has a standalone unit - can we make sure to employ them all in parallel?
> > > >
> > > > SIMD is a set of CPU instructions rather than an off-loadable device. They can only be executed synchronously from a thread. So if the parallelism exceeds 16 lanes, it should be reasonable to start another worker thread. And if there are only 8 outstanding lanes at the moment, AVX2 should be a better choice than the heavier AVX512. I'm not sure yet whether the isa-l library is intelligent enough to pick the right instruction set, or whether it is manually controlled.
> > > >
> > > > > This handles only MD5, which certainly matters to us, but I suspect we strongly need acceleration for sha-256 and sha-512.
> > > >
> > > > It looks like isa-l supports these optimizations, which is the recommended way because there might be multiple options available at the same time (AVX512 and sha-ni). Looking at the source code, the library is able to detect the availability and select the best possible option.
> > > >
> > > > Regards,
> > > > -Yingxin
> > > >
> > > > > -----Original Message-----
> > > > > From: Casey Bodley <cbodley@xxxxxxxxxx>
> > > > > Sent: Thursday, July 13, 2023 3:19 AM
> > > > > To: Feng, Hualong <hualong.feng@xxxxxxxxx>
> > > > > Cc: Cheng, Yingxin <yingxin.cheng@xxxxxxxxx>; Tang, Guifeng <guifeng.tang@xxxxxxxxx>; mbenjami <mbenjami@xxxxxxxxxx>; Mark Kogan <mkogan@xxxxxxxxxx>
> > > > > Subject: Re: rgw: interest in isa-l_crypto for md5 acceleration
> > > > >
> > > > > thanks Hualong, (cc Matt and Mark)
> > > > >
> > > > > On Wed, Jul 12, 2023 at 6:22 AM Feng, Hualong <hualong.feng@xxxxxxxxx> wrote:
> > > > > >
> > > > > > Hi Casey
> > > > > >
> > > > > > Our team is interested in this, but we need to study more details.
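[Yingxin's reasoning above, picking the lighter AVX2 path when few lanes are outstanding and adding worker threads once the parallelism exceeds 16 lanes, could be expressed as a small planner. This is an illustrative sketch only: the function name, return shape, and thresholds are hypothetical, and isa-l's own multibinary dispatch may already make the instruction-set choice internally.]

```python
import math

AVX512_LANES = 16   # md5 multi-buffer lanes per AVX512 batch
AVX2_LANES = 8      # lanes per AVX2 batch

def plan(outstanding):
    """Pick a SIMD width and worker-thread count for `outstanding` hash streams."""
    if outstanding == 0:
        return {"width": None, "workers": 0}
    if outstanding <= AVX2_LANES:
        # few streams: AVX2 fills its 8 lanes and avoids AVX512 downclocking
        return {"width": "avx2", "workers": 1}
    # beyond 16 concurrent streams a single thread's SIMD unit can't keep up,
    # since the instructions execute synchronously; add more worker threads,
    # each driving its own 16-lane batch
    return {"width": "avx512",
            "workers": math.ceil(outstanding / AVX512_LANES)}
```

[The thresholds would need to come from real test results, as noted above; downclocking behavior in particular varies by CPU generation.]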
> > > > > >
> > > > > > We have looked into md5 implemented with AVX512 before. We know that md5 can only be calculated serially for a single object, and cannot be split for concurrent calculation. The basic idea of the AVX512 implementation is to divide a core into 16 lanes, so that 16 md5 calculations can proceed at the same time. So when we use md5 implemented with AVX512, we need to wait for multiple request objects to calculate md5 at the same time.
> > > > > >
> > > > > > So there are two difficulties to consider:
> > > > > > 1. When calculating, we need to fill all the lanes on a core as much as possible. Only in this way can the advantages of AVX512/AVX2 be realized.
> > > > > > 2. What we have seen so far is the comparison between using AVX512/AVX2 on a single core and not using it. When we actually run RGW, only if RGW or the machine it runs on is already at 100% CPU will using AVX512 increase the calculation speed of md5. But when less than 100% CPU is in use, using AVX512 to save cores may cause a slowdown. So how do we know under what circumstances we should use AVX instructions in code?
> > > > >
> > > > > well put, this is the part i'm struggling with too.
> > > > > it feels like a tradeoff between cpu usage and added latency (both from thread synchronization and waiting for full batches)
> > > > >
> > > > > if we could track the rate of hash updates per second, we might use that to decide whether we're likely to get a full batch within some limit of acceptable latency
> > > > >
> > > > > RGWPutObj could probably mask some of this latency by running these asynchronous hashes in parallel while reading the next 4MB chunk from the frontend
> > > > >
> > > > > in https://github.com/ceph/ceph/pull/52385 i introduced the concept of a batch_timeout, which can force the processing of a partial batch. bounding this latency seemed like a necessary part of the model. if RGWPutObj can mask this latency, then we might use that batch_timeout to avoid the need to track a global hash rate
> > > > >
> > > > > > We have implemented a POC before, using QAT to implement the hash algorithms in ceph (SHA256, MD5, SHA1, SHA512, HMACSHA256, HMACSHA1),
> > > > >
> > > > > very cool, thanks. what is the benefit of QAT here, compared to something like isa-l_crypto that just uses AVX instructions? is QAT able to offload some of this? would the use of QAT rule out any hardware (like AMD cpus) that would otherwise support AVX?
> > > > >
> > > > > > However, due to md5 security concerns and the lack of a convenient alternative framework in ceph, it is temporarily blocked.
> > > > > > https://github.com/ceph/ceph/compare/main...hualongfeng:ceph:hash_qat_mode
> > > > >
> > > > > i'll reach out to our security contact to get some more clarity on this stuff. i know that openssl's FIPS certification is important to downstream products.
> > > > > but md5 isn't being used as a cryptographic hash here and etag isn't used for security, so i've assumed we could use other md5 implementations there. i'm less sure about the SHA family, but i know Matt's interested in using those for checksumming in rgw
> > > > > >
> > > > > > Thanks
> > > > > > -Hualong
> > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Casey Bodley <cbodley@xxxxxxxxxx>
> > > > > > > Sent: Tuesday, July 11, 2023 11:06 PM
> > > > > > > To: Feng, Hualong <hualong.feng@xxxxxxxxx>
> > > > > > > Subject: rgw: interest in isa-l_crypto for md5 acceleration
> > > > > > >
> > > > > > > hey Hualong,
> > > > > > >
> > > > > > > ceph is already using intel's isa-l_crypto library for crypto acceleration. i just started looking into its multi-buffer md5 implementation (https://github.com/intel/isa-l_crypto/blob/master/include/md5_mb.h) for use in rgw to vectorize our ETag calculations (a feature tracked in https://tracker.ceph.com/issues/61646). i've started some initial work in https://github.com/ceph/ceph/pull/52385, but we'll still need to decide how best to integrate that into rgw
> > > > > > >
> > > > > > > would your team be interested in collaborating on this? we'd love your input on the design for rgw, and how best to measure and tune its performance

_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx
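[The latency-masking idea that comes up several times above, overlapping the asynchronous md5 for a chunk with filter->process() (or with reading the next chunk from the frontend), reduces to running two operations per chunk in parallel and joining before the next serial md5 update. A minimal sketch under stated assumptions: the names put_object and write_chunk are hypothetical, and hashlib plus a thread pool stand in for the batched isa-l path and the rados write.]

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

def put_object(chunks, write_chunk, executor):
    """Hash each chunk in the background while write_chunk() runs.

    write_chunk stands in for filter->process(): compression/encryption
    plus the write to rados. md5 is serial per object, so each update
    must finish before the next one starts, but it can overlap with the
    write of the same chunk; ideally the write is the slower of the two,
    which masks the md5 latency entirely.
    """
    md5 = hashlib.md5()
    for data in chunks:
        future = executor.submit(md5.update, data)  # async hash, e.g. a Batch submit
        write_chunk(data)     # runs concurrently with the hash
        future.result()       # join before the next serial md5 update
    return md5.hexdigest()    # becomes the object's ETag
```

[If the hash (almost) always completes before the write does, as hoped above, the join is effectively free and one Batch instance may be enough.]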