Re: Fatal crash with NFS, AIO & tcp retransmit

"Myklebust, Trond" <Trond.Myklebust@xxxxxxxxxx> · Mon, 21 Jan 2013 14:38:20 +0000

On Mon, 2013-01-21 at 13:06 +0000, Alex Bligh wrote:
> I'm trying to resolve a fatal bug that happens with Linux 3.2.0-32-generic
> (Ubuntu variant of 3.2), and the magic combination of
> 1. NFSv4
> 2. AIO from Qemu
> 3. Xen with upstream qemu DM
> 4. QCOW plus backing file.
> 
> The background is here:
>   http://lists.xen.org/archives/html/xen-devel/2012-12/msg01154.html
> It is completely replicable on different NFS client hardware. We've
> tried other kernels to no avail.
> 
> The bug is quite nasty in that dom0 crashes fatally due to a VM action.
> 
> Within the link, you'll see references to an issue found by Ian Campbell
> a while ago, which turned out to be an NFS issue independent of Xen but
> apparently not in NFS4. The links are:
>  http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=640941
>  http://marc.info/?l=linux-nfs&m=122424132729720&w=2
> 
> In essence, my understanding of what appears to be happening (which
> may be entirely wrong) is:
> 1. Xen 4.2 HVM domU VM has a PV disk driver
> 2. domU writes a page
> 3. Xen maps domU's page into dom0's VM space
> 4. Xen asks Qemu (userspace) to write a page
> 5. Qemu's disk (and backing file for the disk) are on NFSv4
> 6. Qemu uses AIO to write the page to NFS
> 7. AIO claims the page write is complete
> 8. Qemu marks the write as complete
> 9. Xen unmaps the page from dom0's VM space
> 10. Apparently, the write is not actually complete at this
>     point
> 11. TCP retransmit is triggered (not quite sure why, possibly
>     due to slow filer)
> 12. TCP goes to resend the page, and finds it's not in dom0
>     memory.
> 13. Bang
> 
> The Xen folks think this is nothing to do with either Xen or QEMU, and
> believe the problem is AIO on NFS. The links to earlier investigations
> suggest this is/was true, but not for NFSv4, and was fixed. An NFSv4 case
> may have been missed.
> 
> Against this explanation:
> a) it does not happen in KVM (again with QEMU doing AIO to
>    NFS) - though here the page mapping fanciness doesn't
>    happen as KVM VMs share the same memory space as the kernel
>    as I understand it.
> b) it does not happen on Xen without a QEMU backing file (though
>    that may be just what's necessary timing wise to trigger
>    the race condition).
> 
> Any insight you have would be appreciated.
> 
> Specifically, the question I'd ask is as follows. Is it correct behaviour
> that Linux+NFSv4 marks an AIO request completed when all the relevant data
> may have been sent by TCP but not yet ACK'd? If so, how is Linux meant to
> deal with retransmits? Are the pages referenced by the TCP stack meant to
> be marked COW or something? What is meant to happen if those pages get
> removed from the memory map entirely?
> 
> As an aside, we're looking for someone to fix this (and things like it) on
> a contract basis. Contact me off list if interested.
> 

The Oops would be due to a bug in the socket layer: the socket is
supposed to take a reference count on the page in order to ensure that
it can copy the contents.
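
As a toy illustration of the reference-count point (plain Python, not
kernel code): the sketch below assumes a made-up Page class with get()/put()
and a made-up ToySocket that keeps unacknowledged pages queued. None of
these names come from the kernel. It only models the refcount; it does not
model Xen tearing down the mapping underneath the reference.

```python
class Page:
    """Toy stand-in for a kernel page with a reference count (hypothetical)."""

    def __init__(self, data: bytes):
        self.data = bytearray(data)
        self.refs = 1  # reference held by whoever submitted the write
        self.freed = False

    def get(self) -> "Page":
        # Take an extra reference; refuse if the page is already gone.
        if self.freed:
            raise RuntimeError("page already freed")
        self.refs += 1
        return self

    def put(self) -> None:
        # Drop one reference; the page is released when the last one goes.
        self.refs -= 1
        if self.refs == 0:
            self.freed = True
            self.data = None


class ToySocket:
    """Toy socket that pins every page it may still need to resend."""

    def __init__(self):
        self.unacked = []

    def send(self, page: Page) -> None:
        # Hold our own reference until the data has been acknowledged.
        self.unacked.append(page.get())

    def retransmit(self) -> list:
        # Safe only because each queued page is still pinned by send().
        return [bytes(p.data) for p in self.unacked]

    def ack_all(self) -> None:
        # The peer has acknowledged everything: drop the socket's references.
        for p in self.unacked:
            p.put()
        self.unacked.clear()
```

In this model the submitter can drop its reference as soon as it believes
the write is finished, and a later retransmit() still finds the data,
because the page is not released until ack_all() drops the last reference.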

As for the O_DIRECT bug, the problem there is that we have no way of
knowing when the socket is done writing the page. Just because we got an
answer from the server doesn't mean that the socket is done
retransmitting the data. It is quite possible that the server is just
replying to the first transmission.

I thought that Ian was working on a fix for this issue. At one point, he
had a bunch of patches to allow sendpage() to call you back when the
transmission was done. What happened to those patches?
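
To make the idea of a "transmission done" callback concrete, here is a
rough Python model. It is only a sketch under assumptions: PendingWrite and
its method names are invented for illustration and are not the interface of
the patches mentioned above. The write is reported complete only once both
the server's reply has arrived and the socket has said it no longer needs
the page.

```python
class PendingWrite:
    """Toy model of a write that completes on two independent events."""

    def __init__(self, on_complete):
        self.reply_seen = False    # server answered the WRITE
        self.socket_done = False   # socket will not retransmit the page again
        self.completed = False
        self.on_complete = on_complete

    def _maybe_complete(self) -> None:
        # Fire the completion exactly once, and only when both events happened.
        if self.reply_seen and self.socket_done and not self.completed:
            self.completed = True
            self.on_complete()

    def server_replied(self) -> None:
        self.reply_seen = True
        self._maybe_complete()

    def transmission_done(self) -> None:
        # Hypothetical callback from the socket layer.
        self.socket_done = True
        self._maybe_complete()
```

With this shape, a reply that answers the first transmission does not
release the page while a retransmission may still be queued; the caller is
told the write is complete only after transmission_done() has also run.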

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@xxxxxxxxxx
www.netapp.com