On Wed, 2013-01-23 at 15:22 +0000, Alex Bligh wrote:
> Trond,
>
> --On 21 January 2013 17:20:36 +0000 "Myklebust, Trond"
> <Trond.Myklebust@xxxxxxxxxx> wrote:
>
> >> So, just to be clear, if a process is using NFS and AIO with O_DSYNC
> >> (but not O_DIRECT) - which is I think what QEMU is meant to be doing -
> >> then it should *never* be zero copy (even if writes happen to be
> >> appropriately aligned). Is that correct? If so, I can strace the
> >> process and see exactly what flags it is using.
> >>
> >
> > That is correct. If you want zero-copy, then O_DIRECT is your thing
> > (with or without aio). Otherwise, the kernel will always write to disk
> > by copying through the page cache.
>
> Just to follow up on this, QEMU (specifically hw/xen_disk.c) was using
> O_DIRECT. If O_DIRECT is turned off, we get an additional page copy
> but the bug does not appear.
>
> It thus appears that the root of the problem is that if an AIO NFS
> request is made with O_DIRECT, AIO can report the request is completed
> even when the segment may need to be retransmitted, and whilst the
> TCP stack correctly holds a reference to the page concerned, this
> is not currently preventing Xen unmapping it as Xen thinks the IO
> has completed.

It is not limited to aio/dio. It can happen with ordinary synchronous
O_DIRECT too. As I said, it is a known problem and is one of the reasons
why we want to set retransmission timeouts to a high value.

The real fix would be to implement something along the lines of Ian's
patchset.

> I believe this problem may apply to iSCSI and for that matter (e.g.)
> DRDB too.

I've no idea if they do zero copy to the socket in these situations. If
they do, then they probably have similar issues.
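
For anyone who wants to check the flags discussed above from inside a
process rather than with strace, here is a minimal sketch, assuming
Linux with glibc. The names fd_io_flags, decode_open_flags and
query_fd_flags are hypothetical helpers made up for illustration; they
are not part of QEMU, Xen or the kernel. The sketch only reports
whether O_DIRECT (the zero-copy path) or O_DSYNC is set on a
descriptor; it does not reproduce the retransmission problem.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* glibc only exposes O_DIRECT with this */
#endif
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical summary of the two flags relevant to this thread. */
struct fd_io_flags {
    int direct;  /* 1 if O_DIRECT is set: I/O bypasses the page cache */
    int dsync;   /* 1 if O_DSYNC is set: writes wait for data integrity */
};

/* Decode a raw flag word, e.g. one copied from strace output. */
static struct fd_io_flags decode_open_flags(int flags)
{
    struct fd_io_flags f;
    f.direct = (flags & O_DIRECT) != 0;
    f.dsync  = (flags & O_DSYNC) != 0;
    return f;
}

/* Query an open descriptor with fcntl(F_GETFL).
 * Returns 0 on success, -1 if fcntl fails (errno is left set). */
static int query_fd_flags(int fd, struct fd_io_flags *out)
{
    int flags;

    assert(out != NULL);
    flags = fcntl(fd, F_GETFL);
    if (flags == -1)
        return -1;
    *out = decode_open_flags(flags);
    return 0;
}
```

If direct comes back as 0, writes on that descriptor go through the
page cache, which per the discussion above costs an extra page copy but
avoids the problem.
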
The problem can be mitigated by breaking the connection on
retransmission; we can't do that in NFS < NFSv4.1, since the duplicate
replay cache is typically indexed to the port number (and port number
reuse is difficult with TCP due to the existence of the TIME_WAIT
state).

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@xxxxxxxxxx
www.netapp.com