Shoveling data into and out of the crypto subsystem

Simon Richter <Simon.Richter@xxxxxxxxxx> · Mon, 18 Oct 2021 16:22:22 +0200

Hi,

I'm building a small accelerator card that should provide crypto
primitives, and I'm wondering how large data transfers to and from
userspace are supposed to work -- especially if the data is file-backed
and larger than available memory.

For testing, I created an 8 GiB file of random data and ran kcapi-dgst
on it under strace:

    $ strace kcapi-dgst -c sha256 -i test8G.bin --hex
    [...]
    openat(AT_FDCWD, 0x7ffc7e4b5896, O_RDONLY|O_CLOEXEC) = 6
    fstat(6, 0x7ffc7e4a5da0)                = 0
    mmap(NULL, 8589934592, PROT_READ, MAP_SHARED, 6, 0) = 0x7f8d911cf000
    accept(3, NULL, NULL)                   = 7
    sendmsg(7, 0x7ffc7e4a5ca0, MSG_MORE)    = 2147479552
    vmsplice(5, 0x7ffc7e4a5d00, 1, SPLICE_F_MORE|SPLICE_F_GIFT) = 4095
    splice(4, NULL, 7, NULL, 4095, SPLICE_F_MORE) = 4095
    sendmsg(7, 0x7ffc7e4a5ca0, MSG_MORE)    = 2147479552
    vmsplice(5, 0x7ffc7e4a5d00, 1, SPLICE_F_MORE|SPLICE_F_GIFT) = 4095
    splice(4, NULL, 7, NULL, 4095, SPLICE_F_MORE) = 4095
    sendmsg(7, 0x7ffc7e4a5ca0, MSG_MORE)    = 2147479552
    vmsplice(5, 0x7ffc7e4a5d00, 1, SPLICE_F_MORE|SPLICE_F_GIFT) = 4095
    splice(4, NULL, 7, NULL, 4095, SPLICE_F_MORE) = 4095
    sendmsg(7, 0x7ffc7e4a5ca0, MSG_MORE)    = 2147479552
    vmsplice(5, 0x7ffc7e4a5d00, 1, SPLICE_F_MORE|SPLICE_F_GIFT) = 4095
    splice(4, NULL, 7, NULL, 4095, SPLICE_F_MORE) = 4095
    sendto(7, 0x7f8f911ceffc, 4, MSG_MORE, NULL, 0) = 4
    recvmsg(7, 0x7ffc7e4a5cd0, 0)           = 32
    fstat(1, 0x7ffc7e4a5bc0)                = 0
    munmap(0x7f8d911cf000, 0)               = -1 EINVAL (Invalid argument)

This seems wrong to me:

 - Every sendmsg() call transfers 2 GiB - 4 KiB (2147479552 bytes). That
probably makes sense when trying to keep every transfer page-aligned.
 - Each vmsplice()/splice() pair transfers 4095 bytes -- that would
likely trigger a copy and leave the file position unaligned afterwards.
 - The last sendto() call then cleans up the remaining four bytes and
still uses MSG_MORE, even though no data follows.
 - The munmap() call is just confused: it passes a length of 0 and fails
with EINVAL.
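
As a sanity check on the numbers in the trace, the byte counts can be added up. The sketch below is only arithmetic on values copied from the strace output; the 4 KiB page size and the function names are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Byte counts copied from the strace output above. */
#define SENDMSG_BYTES 2147479552ULL /* return value of each sendmsg()   */
#define SPLICE_BYTES  4095ULL       /* each vmsplice()/splice() pair    */
#define TAIL_BYTES    4ULL          /* final sendto()                   */
#define ROUNDS        4ULL          /* sendmsg + splice rounds in trace */
#define FILE_BYTES    8589934592ULL /* mmap() length, i.e. file size    */
#define PAGE_BYTES    4096ULL       /* assumption: 4 KiB pages          */

/* Total number of bytes the trace pushes into the hash socket. */
uint64_t trace_total_bytes(void)
{
    return ROUNDS * (SENDMSG_BYTES + SPLICE_BYTES) + TAIL_BYTES;
}

/* Offset into the file after `rounds` complete sendmsg + splice rounds. */
uint64_t offset_after_round(uint64_t rounds)
{
    assert(rounds <= ROUNDS); /* the trace only shows four rounds */
    return rounds * (SENDMSG_BYTES + SPLICE_BYTES);
}
```

Each round adds up to 2^31 - 1 bytes, so the offset is one byte short of 2 GiB after the first round, and four rounds plus the 4-byte tail cover the full 8589934592-byte mapping. That would be consistent with the input being split into INT_MAX-sized pieces, though that is a guess from the numbers, not something the trace itself confirms.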

Is that the optimal way to transfer data from disk to an ahash?
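
For comparison, a plain bounded-buffer path into the same AF_ALG hash socket can be sketched as below. This is a minimal baseline, not a claim about the optimal or zero-copy route: it reads the file in fixed-size chunks and feeds each chunk with send(..., MSG_MORE). The function names and the 64 KiB chunk size are assumptions chosen for illustration, and the sketch assumes a kernel with AF_ALG and the "sha256" hash available.

```c
#include <assert.h>
#include <errno.h>
#include <linux/if_alg.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef AF_ALG
#define AF_ALG 38 /* older libc headers may not define it */
#endif

#define CHUNK_BYTES (64 * 1024) /* arbitrary choice for this sketch */
#define SHA256_LEN  32

/* Send all of buf with MSG_MORE, retrying on short sends and EINTR. */
static int send_all(int fd, const unsigned char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = send(fd, buf, len, MSG_MORE);
        if (n < 0 && errno == EINTR)
            continue;
        if (n <= 0)
            return -1;
        assert((size_t)n <= len);
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}

/*
 * Hash everything readable from in_fd with SHA-256 via AF_ALG.
 * Returns 0 and fills digest on success, -1 on any error
 * (including a kernel without AF_ALG or without "sha256").
 */
int afalg_sha256_fd(int in_fd, unsigned char digest[SHA256_LEN])
{
    struct sockaddr_alg sa;
    unsigned char buf[CHUNK_BYTES];
    int tfm, op = -1, ret = -1;
    ssize_t n;

    memset(&sa, 0, sizeof(sa));
    sa.salg_family = AF_ALG;
    strcpy((char *)sa.salg_type, "hash");
    strcpy((char *)sa.salg_name, "sha256");

    tfm = socket(AF_ALG, SOCK_SEQPACKET, 0);
    if (tfm < 0)
        return -1;
    if (bind(tfm, (struct sockaddr *)&sa, sizeof(sa)) < 0)
        goto out;
    op = accept(tfm, NULL, NULL); /* per-operation socket */
    if (op < 0)
        goto out;

    for (;;) {
        n = read(in_fd, buf, sizeof(buf));
        if (n < 0 && errno == EINTR)
            continue;
        if (n < 0)
            goto out;
        if (n == 0)
            break; /* end of input */
        if (send_all(op, buf, (size_t)n) < 0)
            goto out;
    }

    /* Reading from the operation socket returns the final digest. */
    if (read(op, digest, SHA256_LEN) != SHA256_LEN)
        goto out;
    ret = 0;
out:
    if (op >= 0)
        close(op);
    close(tfm);
    return ret;
}
```

A caller would open() the file and pass the descriptor to afalg_sha256_fd(). Memory use stays bounded by the chunk size regardless of file size, at the cost of an extra pass through a userspace buffer for every chunk.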

My PCIe device can operate directly on DMA memory. As I understand the
crypto API, the "src" scatterlist can be mapped using dma_map_sg(), so
the data must somehow already be in DMA-capable memory at that point.
That makes me suspect that the data was copied several times along the
way, since the result of mmap() is unsuitable for DMA.

Crypto+mm questions so far:

 - How does flow control work for the 2 GB sendmsg() from the mmap()ed
region if the data needs to be made available for DMA -- presumably I
can't dma_map_sg() all of the pages if I only have 4 GB of physical
memory?
 - Is there a zero-copy path for disk->crypto that can be used with
large data blobs?
 - Are there suitable paths for crypto->disk (for encryption and 
compression)?
 - If the device implements PCIe Address Translation Services (ATS) and
the Page Request Interface (PRI), can I use the IOMMU to pin pages
instead of doing that in a driver? That is, can a crypto driver indicate
that the scatterlist may refer to virtual memory that need not be pinned
or even present yet, and can this be used to avoid copies or partial
mappings?

Crypto-only questions so far:

 - The ahash interface still seems to expect the result to be filled in
on return, whereas I expected it to wait until I invoke a completion
callback. Am I missing something, or do I need to suspend the current
thread and wake it up from an interrupt? Can I somehow report completion
from an interrupt handler? Does it make sense to make interrupts
CPU-affine?
 - The result pointer for ahash points to vmalloc()ed memory. Is there a
way to get a DMA buffer instead? (Not that there's a performance
difference here, but space in the result DMA buffer is otherwise another
resource I need to track.)
 - The POWER9 NX driver has a separate interface for gzip
compression/decompression of large blobs. Is there a technical reason
why it cannot implement the crypto API?

Basically my goal is to have fast gzip compression and decompression 
support with the same interface on both of my workstations, one of which 
has an FPGA card, and the other has two POWER9 CPUs with NX. :)

   Simon
