thanks Chunmei (and cc ceph dev list),

"RGWPutObj uses async_md5 for ETag" is a first draft at hooking this up for rgw, and the pr description in https://github.com/ceph/ceph/pull/52488 includes a todo list for further work. the first step is to run this async hashing in parallel with filter->process(), which is what applies compression/encryption and ultimately writes the data to rados. ideally this would mask most of the latency from md5.

the pr adds a single instance of async_md5::Batch that runs on a strand executor of rgw's thread pool. this means the md5 calculations can run on any available thread, but will only utilize one core at a time. we might improve utilization by adding more Batch instances, but that could reduce the probability that rgw requests can fill those batches within the "batch timeout". this timeout is a free parameter that would need tuning.

i'd like to determine how effective one instance is at offloading the md5 calculations. if the md5 part (almost) always completes before filter->process() does, that's probably good enough. if not, we'd want a way to scale the number of instances based on latency or load.

unfortunately, Mark and i are seeing crashes while testing this rgw integration. i added a comment about this in https://github.com/ceph/ceph/pull/52385#pullrequestreview-1578597699, and will follow up there.

finally, regarding the base async/batching library in https://github.com/ceph/ceph/pull/52385, i'd like to explore ways to extend that to cover other types of hashes and other libraries like QAT. do you think QAT would work well under the same async/batching interface?

overall, does this sound like a reasonable design for rgw?
On Mon, Aug 14, 2023 at 5:52 PM Liu, Chunmei <chunmei.liu@xxxxxxxxx> wrote:
>
> Hi Casey,
>
> In your following two PRs:
> https://github.com/ceph/ceph/pull/52385 builds an asynchronous batching library on top of the isa-l crypto library's multi-buffer md5 facilities.
> https://github.com/ceph/ceph/pull/52488 rgw/op: RGWPutObj uses async_md5 for ETag.
> Seems rgw can do async md5 batching. I am wondering what the current work status is. Are all features implemented, or do other features still need to be implemented in a next step? How can the intel team help here?
>
> Thanks!
> -Chunmei
>
> > -----Original Message-----
> > From: Cheng, Yingxin <yingxin.cheng@xxxxxxxxx>
> > Sent: Thursday, August 10, 2023 10:32 PM
> > To: Casey Bodley <cbodley@xxxxxxxxxx>; Liu, Chunmei <chunmei.liu@xxxxxxxxx>; Feng, Hualong <hualong.feng@xxxxxxxxx>
> > Cc: Tang, Guifeng <guifeng.tang@xxxxxxxxx>; mbenjami <mbenjami@xxxxxxxxxx>; Mark Kogan <mkogan@xxxxxxxxxx>; Marcus Watts <mwatts@xxxxxxxxxx>; Gabriel BenHanokh <gbenhano@xxxxxxxxxx>
> > Subject: RE: rgw: interest in isa-l_crypto for md5 acceleration
> >
> > Include Chunmei.
> >
> > Regards,
> > -Yingxin
> >
> > > -----Original Message-----
> > > From: Cheng, Yingxin
> > > Sent: Thursday, July 13, 2023 3:54 PM
> > > To: Casey Bodley <cbodley@xxxxxxxxxx>; Feng, Hualong <hualong.feng@xxxxxxxxx>
> > > Cc: Tang, Guifeng <guifeng.tang@xxxxxxxxx>; mbenjami <mbenjami@xxxxxxxxxx>; Mark Kogan <mkogan@xxxxxxxxxx>; Marcus Watts <mwatts@xxxxxxxxxx>; Gabriel BenHanokh <gbenhano@xxxxxxxxxx>
> > > Subject: RE: rgw: interest in isa-l_crypto for md5 acceleration
> > >
> > > Merge another RGW thread here, it is discussing the same thing.
> > >
> > > I'm also learning and will try to answer below:
> > >
> > > > But when less than 100% of the CPU is in use, using AVX512 to save cores may cause a slowdown.
> > >
> > > > well put, this is the part i'm struggling with too.
> > > > it feels like a tradeoff between cpu usage and added latency (both from thread synchronization and waiting for full batches)
> > >
> > > Yeah, there are synthetic effects from both software and hardware: latency vs batching, CPU downclocking (which should have less impact on later CPU models such as SPR), sync vs async, etc.
> > >
> > > > I'm not sure if the current work is an informed implementation with respect to the isa-l interface; I'm fairly sure it is with regard to boost::asio and related topics.
> > >
> > > My understanding is that the asynchronous way has the best opportunity to fully use the 16-lane AVX512 acceleration by batching the requests, and falling back to the synchronous way is worth considering under small sizes or low depth. But the decisions need to be based on real test results.
> > >
> > > > Is there anything we can do to spread the load evenly on SSE/AVX/AVX2/AVX512 units? I assume that each physical core has a standalone unit - can we make sure to employ them all in parallel?
> > >
> > > SIMD is a set of CPU instructions rather than an off-loadable device; it can only be executed synchronously from a thread. So if the parallelism exceeds 16 lanes, it should be reasonable to start another worker thread. And if there are only 8 outstanding lanes at the moment, AVX2 should be a better choice than the heavier AVX512. I'm not sure yet whether the isa-l library is intelligent enough to pick the right instruction or whether it is manually controlled.
> > >
> > > > This handles only MD5, which certainly matters to us, but I suspect we strongly need acceleration for sha-256 and sha-512.
> > >
> > > Looks like isa-l supports these optimizations, which is the recommended way because there might be multiple options available at the same time (AVX512 and sha-ni).
> > > Looking at the source code, the library is able to detect the availability and select the best possible option.
> > >
> > > Regards,
> > > -Yingxin
> > >
> > > > -----Original Message-----
> > > > From: Casey Bodley <cbodley@xxxxxxxxxx>
> > > > Sent: Thursday, July 13, 2023 3:19 AM
> > > > To: Feng, Hualong <hualong.feng@xxxxxxxxx>
> > > > Cc: Cheng, Yingxin <yingxin.cheng@xxxxxxxxx>; Tang, Guifeng <guifeng.tang@xxxxxxxxx>; mbenjami <mbenjami@xxxxxxxxxx>; Mark Kogan <mkogan@xxxxxxxxxx>
> > > > Subject: Re: rgw: interest in isa-l_crypto for md5 acceleration
> > > >
> > > > thanks Hualong, (cc Matt and Mark)
> > > >
> > > > On Wed, Jul 12, 2023 at 6:22 AM Feng, Hualong <hualong.feng@xxxxxxxxx> wrote:
> > > > >
> > > > > Hi Casey
> > > > >
> > > > > Our team is interested in this, but we need to study more details.
> > > > >
> > > > > We have learned about md5 implemented with AVX512 before. We know that md5 can only be calculated serially for an object, and cannot be split for concurrent calculation. The basic idea of AVX512 here is to divide a core into 16 lanes, and these 16 lanes can calculate at the same time. So when we use md5 implemented with AVX512, we need to wait for multiple request objects to calculate md5 at the same time.
> > > > >
> > > > > So here are two difficulties to consider:
> > > > > 1. When calculating, we need to fill all the lanes on a core as much as possible. Only in this way can the advantages of AVX512/AVX2 be realized.
> > > > > 2. What we have seen so far is the comparison between using AVX512/AVX2 on a single core and not using it. But when we actually run RGW: if RGW, or the machine it runs on, is already at 100% CPU, then using AVX512 will increase the calculation speed of md5.
> > > > > But when less than 100% of the CPU is in use, using AVX512 to save cores may cause a slowdown. So how do we know under what circumstances we should use AVX instructions in code?
> > > >
> > > > well put, this is the part i'm struggling with too. it feels like a tradeoff between cpu usage and added latency (both from thread synchronization and waiting for full batches)
> > > >
> > > > if we could track the rate of hash updates per second, we might use that to decide whether we're likely to get a full batch within some limit of acceptable latency
> > > >
> > > > RGWPutObj could probably mask some of this latency by running these asynchronous hashes in parallel while reading the next 4MB chunk from the frontend
> > > >
> > > > in https://github.com/ceph/ceph/pull/52385 i introduced the concept of a batch_timeout, which can force the processing of a partial batch. bounding this latency seemed like a necessary part of the model. if RGWPutObj can mask this latency, then we might use that batch_timeout to avoid the need to track a global hash rate
> > > >
> > > > > We have implemented a POC before, using QAT to implement the hash algorithms in ceph (SHA256, MD5, SHA1, SHA512, HMACSHA256, HMACSHA1),
> > > >
> > > > very cool, thanks. what is the benefit of QAT here, compared to something like isa-l_crypto that just uses AVX instructions? is QAT able to offload some of this? would the use of QAT rule out any hardware (like AMD cpus) that would otherwise support AVX?
> > > >
> > > > > However, due to md5 security concerns and the lack of a convenient alternative framework in ceph, it is temporarily blocked.
> > > > > https://github.com/ceph/ceph/compare/main...hualongfeng:ceph:hash_qat_mode
> > > >
> > > > i'll reach out to our security contact to get some more clarity on this stuff. i know that openssl's FIPS certification is important to downstream products. but md5 isn't a cryptographic hash and etag isn't used for security, so i've assumed we could use other md5 implementations there. i'm less sure about the SHA family, but i know Matt's interested in using those for checksumming in rgw
> > > > >
> > > > > Thanks
> > > > > -Hualong
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Casey Bodley <cbodley@xxxxxxxxxx>
> > > > > > Sent: Tuesday, July 11, 2023 11:06 PM
> > > > > > To: Feng, Hualong <hualong.feng@xxxxxxxxx>
> > > > > > Subject: rgw: interest in isa-l_crypto for md5 acceleration
> > > > > >
> > > > > > hey Hualong,
> > > > > >
> > > > > > ceph is already using intel's isa-l_crypto library for crypto acceleration. i just started looking into its multi-buffer md5 implementation (https://github.com/intel/isa-l_crypto/blob/master/include/md5_mb.h) for use in rgw to vectorize our ETag calculations (a feature tracked in https://tracker.ceph.com/issues/61646). i've started some initial work in https://github.com/ceph/ceph/pull/52385, but we'll still need to decide how best to integrate that into rgw
> > > > > >
> > > > > > would your team be interested in collaborating on this? we'd love your input on the design for rgw, and how best to measure and tune its performance

_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx