On Tue, Sep 27, 2022 at 12:45 PM Adel Abouchaev <adel.abushaev@xxxxxxxxx> wrote:
>
>
> On 9/25/22 11:04 AM, Willem de Bruijn wrote:
> >>> The patch seems to get the crypto_ctx by doing a connection hash table
> >>> lookup in the sendmsg(), which is not good from the performance side.
> >>> One QUIC connection can go over multiple UDP sockets, but I don't
> >>> think one socket can be used by multiple QUIC connections. So why not
> >>> save the ctx in the socket instead?
> >> A single socket could have multiple connections originating from it,
> >> having different destinations, if the socket is not connected. An
> >> optimization could be made for connected sockets to cache the context
> >> and save time on a lookup. The measurement of kernel operations timing
> >> did not reveal a significant amount of time spent in this lookup, due to
> >> a relatively small number of connections per socket in general. A shared
> >> table across multiple sockets might show a different performance
> >> profile.
> > I'm late to this patch series, sorry. High quality implementation. I
> > have a few design questions similar to Xin.
> >
> > If multiplexing, instead of looking up a connection by { address, port,
> > variable length connection ID }, perhaps return a connection table
> > index on setsockopt and use that in sendmsg.
>
>
> It was deliberate not to return anything other than 0 from
> setsockopt(), as defined in the spec for the function. Even though it
> says "shall", the doc says that 0 is the only value for successful
> operation. This was the reason not to use setsockopt() for any
> bidirectional transfer of data and/or status. A more sophisticated
> approach with netlink sockets would be more suitable for that. The second
> reason is the API asymmetry for Tx and Rx which will be introduced - the

I thought the cover letter indicated that due to the asymmetry of most
QUIC workloads, only Tx offload is implemented. You do intend to add Rx
later?

> Rx will still need to match on the address, port and cid. The third
> reason is that in current implementations there are no more than a few
> connections per socket,

This is very different from how we do QUIC at Google. There we
definitely multiplex many connections across essentially a socket per
CPU, IIRC.

> which does not put much pressure on the rhashtable
> lookup, although it does take time to hash the key for the seek.
> The performance measurement ran against the runtime and did not flag
> this path as underperforming either; there were other parts that
> substantially add to the runtime, not the key lookup.
>
>
> >>> The patch is to reduce the copying operations between user space and
> >>> the kernel. I might miss something in your user space code, but the
> >>> msg to send is *already packed* into the Stream Frame in user space,
> >>> what's the difference if you encrypt it in userspace and then
> >>> sendmsg(udp_sk) with zero-copy to the kernel.
> >> It is possible to do it this way. Zero-copy works best with packet sizes
> >> starting at 32K and larger. For anything less than that, the gains of
> >> zero-copy are consumed by the zero-copy pre/post operations and the need
> >> to align memory.
> > Part of the cost of MSG_ZEROCOPY is in mapping and unmapping user
> > pages. This series re-implements that with its own get_user_pages.
> > That is duplicative non-trivial code. And it will incur the same cost.
> > What this implementation saves is the (indeed non-trivial)
> > asynchronous completion notification over the error queue.
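For reference, the flow being contrasted here is roughly the sketch below:
a minimal userspace MSG_ZEROCOPY example, not code from this series, with
polling and error handling trimmed. The page pinning cost is the same
either way; what the series skips is the completion notification step at
the end.

/* Minimal sketch of the MSG_ZEROCOPY flow: opt in, send with the flag,
 * then reap the completion from the socket error queue. See
 * Documentation/networking/msg_zerocopy.rst for a complete example.
 */
#include <linux/errqueue.h>
#include <stdio.h>
#include <sys/socket.h>

#ifndef SO_ZEROCOPY
#define SO_ZEROCOPY 60
#endif

static void zerocopy_send_and_reap(int fd, const void *buf, size_t len)
{
	int one = 1;
	char control[100];
	struct msghdr msg = {
		.msg_control = control,
		.msg_controllen = sizeof(control),
	};
	struct cmsghdr *cm;

	/* Opt in once per socket. */
	setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one));

	/* User pages stay pinned until the completion below is received. */
	send(fd, buf, len, MSG_ZEROCOPY);

	/* Completions are delivered on the error queue; real code would
	 * poll for POLLERR instead of reading only once.
	 */
	if (recvmsg(fd, &msg, MSG_ERRQUEUE) < 0)
		return;

	for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
		struct sock_extended_err *serr = (void *)CMSG_DATA(cm);

		if (serr->ee_origin == SO_EE_ORIGIN_ZEROCOPY)
			printf("tx %u-%u completed\n",
			       serr->ee_info, serr->ee_data);
	}
}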
> >
> > The cover letter gives some performance numbers against a userspace
> > implementation that has to copy from user to kernel. It might be more
> > even to compare against an implementation using MSG_ZEROCOPY and
> > UDP_SEGMENT. A userspace crypto implementation may have other benefits
> > compared to a kernel implementation, such as not having to convert to
> > crypto API scatter-gather arrays and back to network structures.
> >
> > A few related points:
> >
> > - The implementation supports multiplexed connections, but only one
> > crypto sendmsg can be outstanding at any time:
> >
> > + /**
> > +  * To synchronize concurrent sendmsg() requests through the same socket
> > +  * and protect preallocated per-context memory.
> > +  **/
> > + struct mutex sendmsg_mux;
> >
> > That is quite limiting for production workloads.
>
> The use case that we have with the MVFST library currently runs a single
> worker per connection and has a single socket attached to it. QUIC
> allows simultaneous use of multiple connection IDs to swap them at
> runtime, and an implementation would request only a handful of these.
> MVFST batches writes into a block of about 8KB and then uses GSO to send
> them all at once.
> > - Crypto operations are also executed synchronously, using
> > crypto_wait_req after each operation. This limits throughput by using
> > at most one core per UDP socket. And adds sendmsg latency (which may
> > or may not be important to the application). WireGuard shows an
> > example of how to parallelize software crypto across cores.
> >
> > - The implementation avoids dynamic allocation of cipher text pages by
> > using a single ctx->cipher_page. This is protected by sendmsg_mux (see
> > above). Is that safe when packets leave the protocol stack and are
> > then held in a qdisc or when being processed by the NIC?
> > quic_sendmsg_locked will return, but the cipher page is not free to
> > reuse yet.
> There is currently no use case that we have in hand that requires
> parallel transmission of data for the same connection. Multiple
> connections would have no issue running in parallel, as each of them will
> have its own preallocated cipher_page in the context.

This still leaves the point that sendmsg may return and the mutex be
released while the cipher_page is still associated with an skb in the
transmit path.

> There is fragmentation further down the stack with
> ip_generic_getfrag(), which eventually does copy_from_iter() and makes a
> copy of the data. This is executed as part of __ip_append_data(), called
> from udp_sendmsg() in ipv4/udp.c. The assumption was that this is
> executed synchronously, so the queues and NIC will see a mapping of a
> different memory area than the ciphertext in the pre-allocated page.
>
>
> > - The real benefit of kernel QUIC will come from HW offload. Would it
> > be better to avoid the complexity of an in-kernel software
> > implementation and only focus on HW offload? Basically, pass the
> > plaintext QUIC packets over a standard UDP socket and alongside in a
> > cmsg pass either an index into a HW security association database or
> > the immediate { key, iv } connection_info (for stateless sockets), to
> > be encoded into the descriptor by the device driver.
> Hardware usually targets a single ciphersuite such as AES-GCM-128/256,
> while QUIC also supports ChaCha20-Poly1305 and AES-CCM. The generalized
> support for offload prompted implementation of these ciphers in kernel
> code. All userspace libraries also support all protocols as a fall-back.
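To make the cmsg idea quoted above a bit more concrete, a purely
hypothetical userspace view could look like the sketch below.
UDP_QUIC_CRYPTO and struct quic_crypto_info are invented names used only
for illustration; nothing like this exists today, and a real interface
would still have to settle key lifetime and per-ciphersuite IV/tag sizes.

/* Hypothetical only: pass per-packet crypto state for HW offload as a
 * cmsg on a plain UDP socket. The constants and struct below are made up.
 */
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

#ifndef SOL_UDP
#define SOL_UDP 17
#endif
#define UDP_QUIC_CRYPTO 200		/* invented cmsg type */

struct quic_crypto_info {		/* invented layout */
	uint32_t sa_index;		/* HW security association slot, or */
	uint8_t  key[32];		/* immediate key ...                */
	uint8_t  iv[12];		/* ... and IV for stateless sockets */
};

static ssize_t quic_send_hw_offload(int fd, const void *pkt, size_t len,
				    const struct quic_crypto_info *info)
{
	char control[CMSG_SPACE(sizeof(*info))] = {0};
	struct iovec iov = { .iov_base = (void *)pkt, .iov_len = len };
	struct msghdr msg = {
		.msg_iov = &iov,
		.msg_iovlen = 1,
		.msg_control = control,
		.msg_controllen = sizeof(control),
	};
	struct cmsghdr *cm = CMSG_FIRSTHDR(&msg);

	cm->cmsg_level = SOL_UDP;
	cm->cmsg_type = UDP_QUIC_CRYPTO;
	cm->cmsg_len = CMSG_LEN(sizeof(*info));
	memcpy(CMSG_DATA(cm), info, sizeof(*info));

	/* The plaintext QUIC packet goes down a standard UDP socket; the
	 * device driver would use the cmsg to fill the NIC crypto
	 * descriptor.
	 */
	return sendmsg(fd, &msg, 0);
}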
No need for two fall-backs if HW support is missing?

> The kernel code could also engage if future hardware has
> capacity caps preventing it from handling all requests in the hardware.
> > - With such a simpler path, could we avoid introducing a ULP and just
> > have udp [gs]etsockopt CRYPTO_STATE, where QUIC is the only defined
> > state type yet?
> >
> > - Small aside: as the series introduces new APIs with non-trivial
> > parsing in the kernel, it's good to run a fuzzer like syzkaller on it
> > (if not having done so yet).
> Agreed.
> >> The other possible obstacle would be that eventual support
> >> of QUIC encryption and decryption in hardware would integrate well with
> >> this current approach.
> >>> Didn't really understand the "GSO" you mentioned, as I don't see any
> >>> code about kernel GSO, I guess it's just "Fragment size", right?
> >>> BTW, it's not common to use "//" for kernel comments.
> > minor point: fragment has meaning in IPv4. For GSO, prefer gso_size.
> Sure, will change it to gso_size.
>
>
> >> Once the payload arrives in the kernel, the GSO setting on the interface
> >> would instruct the L3/L4 stack on fragmentation. In this case, the
> >> plaintext QUIC packets should be aligned on the GSO marks less the tag
> >> size that would be added by encryption. For GSO size 1000, the QUIC
> >> packets in the batch for transmission should all be 984 bytes long,
> >> except maybe the last one. Once the tag is attached, the new size of
> >> 1000 will correctly split the QUIC packets further down the stack for
> >> transmission in individual IP/UDP packets. The code also saves
> >> processing time by sending all packets at once to UDP in a single call,
> >> when GSO is enabled.
> >>> I'm not sure if it's worth adding a ULP layer over UDP for this QUIC
> >>> TX only. Honestly, I'm more supportive of doing a full QUIC stack in
> >>> the kernel independently, with socket APIs to use it:
> >>> https://github.com/lxin/tls_hs.
> >>>
> >>> Thanks.
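One more note on the GSO sizing described above, since the arithmetic
matters: with a 16-byte AEAD tag, gso_size 1000 means every plaintext QUIC
packet in the batch is built at 984 bytes (the last one may be shorter),
so each packet lands exactly on a segment boundary after encryption. A
minimal sketch of handing such a batch to UDP in one call with the
standard UDP_SEGMENT cmsg, which is also the baseline suggested for
comparison above; illustrative only, not code from the series:

/* Send a batch of already-encrypted, back-to-back QUIC packets as one
 * buffer; the stack splits it into gso_size-byte UDP datagrams.
 */
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

#ifndef SOL_UDP
#define SOL_UDP 17
#endif
#ifndef UDP_SEGMENT
#define UDP_SEGMENT 103			/* since Linux 4.18 */
#endif

#define QUIC_TAG_LEN 16			/* AES-GCM/CCM, ChaCha20-Poly1305 */

static ssize_t send_quic_batch(int fd, const void *batch, size_t len,
			       uint16_t gso_size /* e.g. 1000 */)
{
	/* Each encrypted packet in 'batch' must be exactly gso_size bytes,
	 * i.e. built from gso_size - QUIC_TAG_LEN (984) bytes of plaintext,
	 * except possibly the trailing one.
	 */
	char control[CMSG_SPACE(sizeof(uint16_t))] = {0};
	struct iovec iov = { .iov_base = (void *)batch, .iov_len = len };
	struct msghdr msg = {
		.msg_iov = &iov,
		.msg_iovlen = 1,
		.msg_control = control,
		.msg_controllen = sizeof(control),
	};
	struct cmsghdr *cm = CMSG_FIRSTHDR(&msg);

	cm->cmsg_level = SOL_UDP;
	cm->cmsg_type = UDP_SEGMENT;
	cm->cmsg_len = CMSG_LEN(sizeof(uint16_t));
	memcpy(CMSG_DATA(cm), &gso_size, sizeof(gso_size));

	return sendmsg(fd, &msg, 0);
}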