This patch series introduces zero-copy capability to the 9P transport
layer.

The 9P Linux client makes an additional copy of the read/write buffer
into the kernel before sending it down to the transport layer. There is
no functional need for this additional copy, so it is eliminated by
sending the payload buffer directly to the transport layer. While this is
advantageous to all transports, virtualized transport layers like VirtIO
can exploit it further by sending the user buffer directly to the server,
thereby achieving real zero copy.

Design Goals

- Have minimal changes to the net layer so that common code is not
  polluted by transport specifics.
- Create a common transport library which can be used by other
  transports.
- Avoid additional optimizations in the initial attempt (more details
  below) and focus on achieving basic functionality.

Design

This patch adds infrastructure to send the payload buffers directly to
the transport layer if the latter prefers it. To accomplish this, a
preference property is added to the transport layer and additional
elements are added to the PDU structure (struct 9p_fcall).

The transport layer specifies its preference through a newly introduced
field in the transport module (clnt->trans_mod->pref), and the net layer
sends the payload through the pubuf/pkbuf elements of struct 9p_fcall.

This method has a few advantages:

- It keeps the net layer clean and lets the transport layer deal with
  the specifics.
- Mapping user addresses into kernel pages pins the memory, which could
  make the system vulnerable to denial-of-service attacks. This change
  gives the transport layer more control to implement effective flow
  control. Expect flow control patches shortly.
- If a transport layer doesn't see the need to handle the payload
  separately, it can set the preference accordingly so that the current
  code works with no changes. This is very useful for transports which
  have no plans of converting/pinning user pages.
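
The preference-driven dispatch described above can be illustrated with a
small user-space sketch. It is not the code from this series: the
sketch_* names, the flag values, and the fixed 64-byte buffer are
assumptions made purely for illustration. Only the idea is taken from the
description: a transport that asks for a separate payload gets a pointer
(pubuf), and any other transport gets the payload copied into the PDU
buffer as before.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical preference values; the real flags in the series may differ. */
enum sketch_trans_pref {
    SKETCH_PREF_COPY = 0,        /* payload is copied into the PDU buffer */
    SKETCH_PREF_PAYLOAD_SEP = 1  /* payload is handed over separately */
};

/* Hypothetical stand-in for the transport module. */
struct sketch_trans_mod {
    int pref;
};

/* Hypothetical stand-in for the PDU structure. */
struct sketch_fcall {
    char sdata[64];      /* header, plus payload when it is copied */
    size_t size;         /* bytes used in sdata */
    const char *pubuf;   /* separate payload buffer, or NULL */
    size_t pbuf_size;    /* bytes referenced by pubuf */
};

/*
 * Attach a payload to a PDU whose header already occupies fc->size bytes.
 * Returns the number of payload bytes copied into fc->sdata (0 when the
 * transport asked for the payload to be passed separately).
 */
static size_t sketch_attach_payload(const struct sketch_trans_mod *t,
                                    struct sketch_fcall *fc,
                                    const char *payload, size_t len)
{
    assert(t != NULL && fc != NULL && payload != NULL);
    assert(fc->size <= sizeof(fc->sdata));

    if (t->pref == SKETCH_PREF_PAYLOAD_SEP) {
        /* No copy: the transport sees the caller's buffer directly. */
        fc->pubuf = payload;
        fc->pbuf_size = len;
        return 0;
    }

    /* Copy path: clamp to the space left in the PDU buffer. */
    if (len > sizeof(fc->sdata) - fc->size)
        len = sizeof(fc->sdata) - fc->size;
    memcpy(fc->sdata + fc->size, payload, len);
    fc->size += len;
    fc->pubuf = NULL;
    fc->pbuf_size = 0;
    return len;
}
```

With a 23-byte header already in place, a transport whose pref is
SKETCH_PREF_PAYLOAD_SEP leaves fc->size at 23 and only records the
pointer, while one with SKETCH_PREF_COPY grows fc->size by the payload
length. The net layer needs a single branch, and everything else about
the payload is left to the transport.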

There is a rather sticky issue with the TREAD/RERROR scenario in
non-9P2000.L protocols (Legacy, 9P2000.u). If the server has to fail the
READ request, it can send an error string of up to ERRMAX (256). As this
is not a fixed size, it is hard to allocate a fixed amount of buffer from
the transport layer's perspective.

In 9P2000.L the error is a fixed size (errno), so this is not an issue.
On success the received packet is PDU header + read size + payload. On
error it is PDU header + errno. Hence the non-payload size is constant
(11) irrespective of success or failure. This is not the case in
non-9P2000.L protocols, where the header size is different in the failure
(TREAD/RERROR) case.

To take care of this, the patch makes sure that the read buffer is big
enough to accommodate an ERRMAX string. It also means that there is a
chance of scribbling on the payload/user buffer in the error case for
those non-POSIX-compliant protocols. The added trans_mod->pref gives a
transport the option of not participating in zero copy.

This series also creates trans_common.[ch] to house common functions so
that other transport layers can take advantage of them.

Testing/Performance

Setup: HS21 blade, a two-socket quad-core Xeon with 4 GB memory, IO to
the local disk.

WRITE

dd if=/dev/zero of=/pmnt/file1 bs=4096 count=1MB (variable bs = IO SIZE)

IO SIZE (bytes)   TOTAL SIZE   No ZC        ZC
1                 1MB          22.4 kB/s    19.8 kB/s
32                32MB         711 kB/s     633 kB/s
64                64MB         1.4 MB/s     1.3 MB/s
128               128MB        2.8 MB/s     2.6 MB/s
256               256MB        5.6 MB/s     5.1 MB/s
512               512MB        10.4 MB/s    10.2 MB/s
1024              1GB          19.7 MB/s    20.4 MB/s
2048              2GB          40.1 MB/s    43.7 MB/s
4096              4GB          71.4 MB/s    73.1 MB/s

READ

dd of=/dev/null if=/pmnt/file1 bs=4096 count=1MB (variable bs = IO SIZE)

IO SIZE (bytes)   TOTAL SIZE   No ZC        ZC
1                 1MB          26.6 kB/s    23.1 kB/s
32                32MB         783 kB/s     734 kB/s
64                64MB         1.7 MB/s     1.5 MB/s
128               128MB        3.4 MB/s     3.0 MB/s
256               256MB        4.2 MB/s     5.9 MB/s
512               512MB        6.9 MB/s     11.6 MB/s
1024              1GB          23.3 MB/s    23.4 MB/s
2048              2GB          42.5 MB/s    45.4 MB/s
4096              4GB          67.4 MB/s    73.9 MB/s

ZC benefits are seen beyond a 1k buffer. Hence the patch makes sure that
zero copy is not enforced for smaller IO (< 1024).

My setup/box could be a bottleneck, as it gave similar numbers even on
the host, but I observed better numbers with zero copy on a bigger setup.

What is following this patch series (Future work)

1. One of the major advantages of this patch series is that it allows a
   bigger msize, to pull off bigger reads/writes from the server. Simply
   increasing the msize is not really a solution, as the majority of other
   transactions are extremely small, which could result in wasted kernel
   heap. To address this problem we need to have two sizes of PDUs.
2. Add flow-control capability to the transport layer.
3. Add a mount option to disable zero copy if the user so prefers.

Signed-off-by: Venkateswararao Jujjuri <jvrao@xxxxxxxxxxxxxxxxxx>