Hi,

I'm building a small accelerator card that should provide crypto primitives, and I'm wondering how large data transfers to and from userspace are supposed to work -- especially if the data is file-backed and larger than available memory.

For testing, I've created an 8GB random file, and used kcapi-dgst on it:

    $ strace kcapi-dgst -c sha256 -i test8G.bin --hex
    [...]
    openat(AT_FDCWD, 0x7ffc7e4b5896, O_RDONLY|O_CLOEXEC) = 6
    fstat(6, 0x7ffc7e4a5da0) = 0
    mmap(NULL, 8589934592, PROT_READ, MAP_SHARED, 6, 0) = 0x7f8d911cf000
    accept(3, NULL, NULL) = 7
    sendmsg(7, 0x7ffc7e4a5ca0, MSG_MORE) = 2147479552
    vmsplice(5, 0x7ffc7e4a5d00, 1, SPLICE_F_MORE|SPLICE_F_GIFT) = 4095
    splice(4, NULL, 7, NULL, 4095, SPLICE_F_MORE) = 4095
    sendmsg(7, 0x7ffc7e4a5ca0, MSG_MORE) = 2147479552
    vmsplice(5, 0x7ffc7e4a5d00, 1, SPLICE_F_MORE|SPLICE_F_GIFT) = 4095
    splice(4, NULL, 7, NULL, 4095, SPLICE_F_MORE) = 4095
    sendmsg(7, 0x7ffc7e4a5ca0, MSG_MORE) = 2147479552
    vmsplice(5, 0x7ffc7e4a5d00, 1, SPLICE_F_MORE|SPLICE_F_GIFT) = 4095
    splice(4, NULL, 7, NULL, 4095, SPLICE_F_MORE) = 4095
    sendmsg(7, 0x7ffc7e4a5ca0, MSG_MORE) = 2147479552
    vmsplice(5, 0x7ffc7e4a5d00, 1, SPLICE_F_MORE|SPLICE_F_GIFT) = 4095
    splice(4, NULL, 7, NULL, 4095, SPLICE_F_MORE) = 4095
    sendto(7, 0x7f8f911ceffc, 4, MSG_MORE, NULL, 0) = 4
    recvmsg(7, 0x7ffc7e4a5cd0, 0) = 32
    fstat(1, 0x7ffc7e4a5bc0) = 0
    munmap(0x7f8d911cf000, 0) = -1 EINVAL (Invalid argument)

This seems wrong to me:

- Every sendmsg call is 2GB - 4kB. That probably makes sense when trying to keep every transfer page aligned.
- The vmsplice()/splice() transfers 4095 bytes -- that would likely trigger a copy and leave the file pointer unaligned afterwards.
- The last sendto() call then cleans up the remaining four bytes and still uses MSG_MORE.
- The munmap() call is just confused.

Is that the optimal way to transfer data from disk to an ahash?

Now my PCIe device can operate directly on DMA memory, and the way I've understood the crypto API is that the "src" scatterlist can be mapped using dma_map_sg, so somehow the data is in DMA memory at this point. That makes me suspect that the data was copied several times in between, as the result of mmap() is unsuitable for DMA.
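
As a sanity check on the byte counts in the trace, here is a small C sketch that adds them up. The numbers are copied from the strace output; the macro and function names are made up for illustration, and the 4 KiB page size is an assumption. It only restates what the trace shows and says nothing about why kcapi-dgst picks these sizes.

```c
#include <assert.h>
#include <stdint.h>

/* Return values copied from the strace output above. */
#define TRACE_SENDMSG_BYTES UINT64_C(2147479552) /* each sendmsg() */
#define TRACE_SPLICE_BYTES  UINT64_C(4095)       /* each vmsplice()/splice() */
#define TRACE_TAIL_BYTES    UINT64_C(4)          /* final sendto() */
#define TRACE_ROUNDS        UINT64_C(4)          /* sendmsg+vmsplice+splice groups */
#define TRACE_MMAP_LENGTH   UINT64_C(8589934592) /* length passed to mmap() */
#define ASSUMED_PAGE_SIZE   UINT64_C(4096)       /* assumption: 4 KiB pages */

/* Bytes fed to the hash by one sendmsg() + splice() round. */
uint64_t trace_bytes_per_round(void)
{
    return TRACE_SENDMSG_BYTES + TRACE_SPLICE_BYTES;
}

/* Bytes fed to the hash over the whole trace. */
uint64_t trace_total_bytes(void)
{
    return TRACE_ROUNDS * trace_bytes_per_round() + TRACE_TAIL_BYTES;
}

/* Assert the relations that can be read off the trace. */
void trace_check_accounting(void)
{
    /* Each sendmsg() moves 2^31 bytes minus one page, so it is page aligned. */
    assert(TRACE_SENDMSG_BYTES == (UINT64_C(1) << 31) - ASSUMED_PAGE_SIZE);
    assert(TRACE_SENDMSG_BYTES % ASSUMED_PAGE_SIZE == 0);
    /* One round moves 2^31 - 1 bytes, which is not page aligned. */
    assert(trace_bytes_per_round() == (UINT64_C(1) << 31) - 1);
    assert(trace_bytes_per_round() % ASSUMED_PAGE_SIZE != 0);
    /* Four rounds plus the 4-byte tail cover the whole mapping. */
    assert(trace_total_bytes() == TRACE_MMAP_LENGTH);
}
```

So each round moves 2^31 - 1 bytes, and the four rounds plus the 4-byte sendto() add up to exactly the 8589934592 bytes that were mapped. The per-round total looks like a limit of INT_MAX bytes per round, but that is a guess from the numbers, not something the trace proves.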

crypto+mm questions so far:

- How does flow control work for the 2GB sendmsg(mmap()) if the data needs to be made available for DMA -- presumably I can't dma_map_sg() all of the pages if I have 4 GB physical memory?
- Is there a zerocopy path for disk->crypto that can be used with large data blobs?
- Are there suitable paths for crypto->disk (for encryption and compression)?
- If the device implements PCIe Address Translation and Page Request Interface, can I use the IOMMU to pin pages instead of doing that in a driver, i.e. can a crypto driver indicate that the scatterlist can refer to virtual memory that need not be pinned or even present yet, and can this be used to avoid copies or partial mappings?
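
For comparison with the mmap()/splice() sequence in the trace, here is a minimal userspace sketch of the plain copying path: read() the file into a bounded buffer and send() each piece to an AF_ALG hash socket with MSG_MORE. It is not zero-copy and does not answer the questions above; it only shows that userspace can cap the amount of data handed to the kernel per call. The function name and the chunk parameter are invented for this example, and it assumes a kernel with AF_ALG and a "sha256" hash available.

```c
#include <linux/if_alg.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hash everything readable from fd with SHA-256 via AF_ALG, handing the
 * kernel at most `chunk` bytes per send(). Returns 0 on success and -1 on
 * any error (including AF_ALG not being available).
 */
int afalg_sha256_fd(int fd, size_t chunk, unsigned char digest[32])
{
    struct sockaddr_alg sa = {
        .salg_family = AF_ALG,
        .salg_type = "hash",
        .salg_name = "sha256",
    };
    unsigned char *buf = NULL;
    int tfm = -1, op = -1, rc = -1;
    ssize_t n;

    if (chunk == 0)
        return -1;

    /* The bound socket selects the algorithm; accept() yields the op socket. */
    tfm = socket(AF_ALG, SOCK_SEQPACKET, 0);
    if (tfm < 0)
        goto out;
    if (bind(tfm, (struct sockaddr *)&sa, sizeof(sa)) < 0)
        goto out;
    op = accept(tfm, NULL, NULL);
    if (op < 0)
        goto out;

    buf = malloc(chunk);
    if (!buf)
        goto out;

    /* Copying path: at most `chunk` bytes are in the user buffer at a time. */
    while ((n = read(fd, buf, chunk)) > 0) {
        ssize_t off = 0;

        while (off < n) {
            /* MSG_MORE tells the kernel that more data will follow. */
            ssize_t sent = send(op, buf + off, (size_t)(n - off), MSG_MORE);

            if (sent <= 0)
                goto out;
            off += sent;
        }
    }
    if (n < 0)
        goto out;

    /* Reading from the op socket finalizes the hash and returns the digest. */
    if (read(op, digest, 32) == 32)
        rc = 0;
out:
    free(buf);
    if (op >= 0)
        close(op);
    if (tfm >= 0)
        close(tfm);
    return rc;
}
```

A caller would open the file, pass the descriptor with a chunk size such as 1 MiB, and print the 32 digest bytes as hex. Every byte is copied at least once on this path, which is exactly the cost the zerocopy question above is trying to avoid.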

Crypto-only questions so far:

- The ahash interface seems to still expect the result to be filled out on return, when I kind of expected it to wait for me to call the completion callback. Am I missing something, or do I need to suspend the current thread and wake it up from an interrupt? Can I somehow report completion from an interrupt handler? Does it make sense to make interrupts CPU affine?
- The result pointer for ahash points to vmalloc()ed memory -- is there a way to get a DMA buffer instead (not that there's a performance difference here, but space in the result DMA buffer is another resource I need to track otherwise)?
- The POWER9 NX driver has a separate interface for gzip compression/decompression of large blobs. Is there a technical reason why it cannot implement the crypto API?

Basically my goal is to have fast gzip compression and decompression support with the same interface on both of my workstations, one of which has an FPGA card, and the other has two POWER9 CPUs with NX. :)
Simon