Re: RGW encrypt is implemented by qat batch and queue mode

Casey Bodley <cbodley@xxxxxxxxxx> · Wed, 12 Oct 2022 11:10:57 -0400

On Thu, Sep 22, 2022 at 9:31 PM Feng, Hualong <hualong.feng@xxxxxxxxx> wrote:
>
>
>
> > -----Original Message-----
> > From: Casey Bodley <cbodley@xxxxxxxxxx>
> > Sent: Wednesday, September 21, 2022 10:20 PM
> > To: Feng, Hualong <hualong.feng@xxxxxxxxx>
> > Cc: Mark Kogan <mkogan@xxxxxxxxxx>; Tang, Guifeng
> > <guifeng.tang@xxxxxxxxx>; Ma, Jianpeng <jianpeng.ma@xxxxxxxxx>;
> > dev@xxxxxxx
> > Subject: Re: RGW encrypt is implemented by qat batch and queue mode
> >
> > On Mon, Sep 19, 2022 at 4:06 AM Feng, Hualong <hualong.feng@xxxxxxxxx>
> > wrote:
> > >
> > > Hi Mark, Casey
> > >
> > >
> > >
> > > Could you spare some time to help review these two PRs or add them to
> > your plan?
> > >
> > >
> > >
> > > The PR link is below:
> > >
> > > https://github.com/ceph/ceph/pull/47040
> > >
> > > https://github.com/ceph/ceph/pull/47845
> > >
> > >
> > >
> > > I reimplemented the qat encryption plugin. Since the existing RGW
> > encryption uses 4KB as an encryption unit, the performance is poor when the
> > qat batch interface is not used. Now I have reimplemented the encryption
> > plug-in using the qat batch interface, which is done in two PRs. PR47040 is
> > used to realize that when the encrypted data block is larger than 128KB, 32
> > pieces of 4K data are taken out for a batch submission each time. PR47845 is
> > based on PR47040, each time the encrypted data block is smaller than 128KB,
> > it is put into a buffer queue first, and when 32 pieces of 4K data or timeout
> > can be reached, a batch submission is performed.
> > >
> > >
> > >
> > >
> > >
> > > The performance result is below, and moreover, the higher the CPU usage,
> > the more obvious the effect of qat.
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > From the flame graph, the proportion of the encryption plug-in
> > implemented by qat in the RGWPutObj::execute function is lower than that of
> > the encryption plug-in implemented by isal.
> > >
> > >
> > >
> > >
> > >
> > > Thanks
> > >
> > > -Hualong
> >
> > hey Hualong et al, (cc dev list)
> >
> > thanks for reaching out, this really helps me understand what those PRs are
> > trying to accomplish
> >
> > in general i'm concerned about the need for threads, locking, and buffering
> > down in the crypto plugins. ideally this stuff would be under the application's
> > control. in radosgw, we've been trying to eliminate any blocking waits on
> > condition variables in our i/o path now that requests are handled in
> > coroutines - instead of blocking an entire thread, we just suspend the
> > coroutine and run others in the meantime
>
> I agree with your view, but now crypto function calls are still using the synchronous interface. If we don't want the plugin to contain condition variables, we need to implement the plugin in an asynchronous way and provide an asynchronous interface. This requires the RGW to call the interface to make changes.
>
> And the number of QAT instances is difficult to keep consistent with the number of threads. The number of QAT instance(hardware resources) is limited. When the number of instances is less than the number of threads, we still need to wait for the free instance. If the asynchronous interface is used, we can use the queue as a buffer to avoid blocking the current thread while waiting for a free instance.
>
> If it is still a synchronous interface, there is no good way to eliminate the condition variable. Do you have a better suggestion here?

below you suggest that we could fall back to CPU processing for small
object uploads. could we use that same fallback in the cases where
we'd otherwise have to block waiting for a QAT instance?

>
> > seeing that graph by object size, my first impression was that radosgw should
> > be using bigger buffers.
>
> Use a bigger buffer? You mean we should change the encrypted CHUNK_SIZE, from the current 4096B, to a bigger one?
> Or other buffers?

sorry not the CHUNK_SIZE, but the total amount of data we can feed
into QAT at a time. i see in https://github.com/ceph/ceph/pull/47040
that you've found the loop in AES_256_CBC::cbc_transform() which
breaks the input into CHUNK_SIZEd pieces, and you converted that loop
into a single batch() call - that part looks great

if each call to cbc_transform() is getting large enough buffers, it
could acquire exclusive access to one QAT instance, feed all of its
data through it, then release the instance back to a pool. for a large
object workload it seems like this strategy would best utilize the
hardware resources, because you never have to coordinate a single
batch across several threads. you just need to acquire/release access
to a QAT instance every 4MB, which you can use for 32 batches
(assuming batch size is 32*4k=128k?) at a time

>
> > GetObj and PutObj are both reading data in 4MB
> > chunks, maybe we can find a way to use the qat batch interfaces within those
> > chunks?
>
> Yes, they are both reading data in 4MB.
> But when the object we put is larger than 4MB, the data block size when calling the encryption function is not necessarily 4MB.
>
> Such as the below that put an object, but the data block size that the encryption function using is 64KB
> PUT
> /examplebucket/chunkObject.txt
>
> content-encoding:aws-chunked
> content-length:66824
> host:s3.amazonaws.com
> x-amz-content-sha256:STREAMING-AWS4-HMAC-SHA256-PAYLOAD
> x-amz-date:20130524T000000Z
> x-amz-decoded-content-length:66560
> x-amz-storage-class:REDUCED_REDUNDANCY
> Authorization:AWS4-HMAC-SHA256 Credential=AKIAIOSFODNN7EXAMPLE/20130524/us-east-1/s3/aws4_request,SignedHeaders=content-encoding;content-length;host;x-amz-content-sha256;x-amz-date;x-amz-decoded-content-length;x-amz-storage-class,Signature=4f232c4386841ef735655705268965c44a0e4690baa4adea153f7db9fa80a0a9
> ---------------
> 10000;chunk-signature=ad80c730a21e5b8d04586a2213dd63b9a0e99e0e2307b0ade35a65485a288648
> <65536-bytes>
> ---------------
> 10000;chunk-signature=ad80c730a21e5b8d04586a2213dd63b9a0e99e0e2307b0ade35a65485a288648
> <65536-bytes>

are you sure that this http-level chunking has an effect on the buffer
sizes that encryption sees? it may cause the buffers to be segmented
at 64k, but encryption and decryption both call bufferlist::c_str() to
reallocate a single contiguous buffer:
https://github.com/ceph/ceph/blob/9aa8bed/src/rgw/rgw_crypt.cc#L490

so i'd still expect this loop in RGWPutObj::execute() to read up to
rgw_max_chunk_size at a time:
https://github.com/ceph/ceph/blob/fc01eeb7/src/rgw/rgw_op.cc#L4111-L4141

if there are cases where the RGWPutObj_BlockEncrypt filter isn't
getting large enough buffers, we can use the same strategy as
https://github.com/ceph/ceph/pull/21479, where we improved compression
ratios by adding a buffering filter in front

>
> > that could avoid the need for cross-thread queues and
> > synchronization. compared to your approach in
> > https://github.com/ceph/ceph/pull/47845, i imagine this would show less of a
> > benefit for small object uploads, but more of a benefit for the big ones. do
> > you think this could work?
>
> If in order to avoid the need for cross-thread queues and show less of a benefit for small object uploads, we can turn small objects to CPU processing. Only for big object, we use QAT batch api.
>
> Hi Casey
>
> Thanks for your reply. The detail message and some question are above.
>
> Thanks
> -Hualong
>

all of my feedback here relates to large objects, though you've really
focused on small objects in https://github.com/ceph/ceph/pull/47845.
for small object workloads, i do agree that the queuing and thread
synchronization is necessary to take advantage of this batching

it's just hard for me to tell whether that extra complexity is worth
it. we've tried to minimize any synchronization between rgw's requests
so that we're able to scale to thousands of concurrent
requests/connections. at scale, i'd worry that lock contention here
would negate some of the gains from QAT

in workloads with a mix of small and large objects, i think we'd make
the best use of QAT if we applied it to the larger objects (>= 128k?)
where we can use it most efficiently

_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx