On Fri, 2008-10-17 at 12:01 +0100, Ian Campbell wrote: > (please CC me, I am not currently subscribed to linux-nfs) > > Hi, > > For some time now I have been tracking down a mysterious host crash when > using Xen with blktap (userspace paravirtual disk implementation) under > stress conditions. I have now managed to create a simple test case with > no Xen components which appears to show RPC retransmissions of write > requests containing invalid data. I'll start at the beginning and > explain the original bug although the test case is much simpler. > > blktap is a userspace daemon which provides paravirtualised disk support > to Xen guests. The pages of data for a block write are mapped from the > Xen guest into a domain 0 driver called blktap which has a userspace > component which implements qcow/vhd/etc and writes the data to the > backing file using O_DIRECT and aio in a zero copy manner. Once the the > aio is completed the pages are returned to the guest and unmapped from > domain 0. When a page is unmapped in this way the pte is set not present > and the PFN is mapped to an invalid MFN. > > We have been seeing a crash in the domain 0 network driver's start_xmit > routine when it attempts to access data from a page where the PFN maps > to an invalid MFN. I added some tracing to the kernel and observed this > sequence of events on an individual page: > tap: 0/41 at 4fac7a165ad6 "rpc_init_task" c8792844 0 > tap: 1/41 at 4fac7a1663e2 "nfs_direct_write_schedule" ece19850 2000 > tap: 2/41 at 4fac7a166cca "call_start" c8792844 0 > tap: 3/41 at 4fac7a167540 "call_reserve" c8792844 0 > tap: 4/41 at 4fac7a167de6 "call_reserveresult" c8792844 0 > tap: 5/41 at 4fac7a168620 "call_allocate" c8792844 0 > tap: 6/41 at 4fac7a168f08 "call_bind" c8792844 0 > tap: 7/41 at 4fac7a169712 "call_connect" c8792844 0 > tap: 8/41 at 4fac7a169f28 "call_transmit" c8792844 0 > tap: 9/41 at 4fac7a16a7f2 "call_encode" c8792844 0 > tap: 10/41 at 4fac7a16afd8 "call_header" c8792844 0 > tap: 11/41 at 4fac7a16bc6e "xs_tcp_send_request" c8792844 0 > tap: 12/41 at 4fac7a16c9d0 "tcp_sendpage" 0 0 > tap: 13/41 at 4fac7a16cec2 "do_tcp_sendpages (adding data to skb)" cef08b00 0 > tap: 14/41 at 4fac7a16e068 "call_transmit_status" c8792844 0 > tap: 15/41 at 4fac7a2ed8f4 "tcp_transmit_skb, skb_clone" c9dca500 c9dca5a8 > tap: 16/41 at 4faeeeb9f566 "xprt_timer" c8792844 0 > tap: 17/41 at 4faeeeb9ff0e "xprt_timer: !req->rq_received" c8792844 0 > tap: 18/41 at 4faeeeba08ec "rpc_make_runnable" c8792844 0 > tap: 19/41 at 4faeeec117b8 "call_status" c8792844 ffffff92 > tap: 20/41 at 4faeeec11faa "timeout (minor)" c8792844 0 > tap: 21/41 at 4faeeec12778 "call_bind" c8792844 0 > tap: 22/41 at 4faeeec12ef8 "call_connect" c8792844 0 > tap: 23/41 at 4faeeec13678 "call_transmit" c8792844 0 > tap: 24/41 at 4faeeec13e46 "call_encode" c8792844 0 > tap: 25/41 at 4faeeec145ae "call_header" c8792844 0 > tap: 26/41 at 4faeeec15082 "xs_tcp_send_request" c8792844 0 > tap: 27/41 at 4faeeec15d1e "tcp_sendpage" 0 0 > tap: 28/41 at 4faeeec161ce "do_tcp_sendpages (adding data to skb)" d06afe40 0 > tap: 29/41 at 4faeeec172ea "call_transmit_status" c8792844 0 > tap: 30/41 at 4faf2e3280d4 "rpc_make_runnable" c8792844 0 > tap: 31/41 at 4faf2e3440c4 "call_status" c8792844 88 > tap: 32/41 at 4faf2e3449d6 "call_decode" c8792844 0 > tap: 33/41 at 4faf2e345240 "call_verify" c8792844 0 > tap: 34/41 at 4faf2e345a9e "rpc_exit_task" c8792844 0 > tap: 35/41 at 4faf2e34652a "nfs_direct_write_result" ece19850 2000 > tap: 36/41 at 4faf2e351000 "nfs_direct_write_release (completing)" ece19850 0 > tap: 37/41 at 4faf2e3517ec "nfs_direct_write_complete,DEFAULT w/ iocb" ece19850 0 > tap: 38/41 at 4faf2e35205c "nfs_direct_free_writedata" ece19850 0 > tap: 39/41 at 4faf2e51026a "fast_flush_area" 0 0 > tap: 40/41 at 4faf33e1813a "tcp_transmit_skb, skb_clone" d06afe40 d06afee8 > (nb: fast_flush_area is the blktap function which returns the pages to > the guest and unmaps them from domain 0, it is called via ioctl from the > userspace process once the aio write returns successfully. 4fac7.... is > the tsc, processor is 2.33GHz) > > So what we see is the initial request being constructed and transmitted > (around 11/41-15/41) followed by a timeout ~60s later (16/41-20/41) > which causes us to queue a retransmit (26/41-29/41) but, critically, not > yet actually transmit it. At this point we get the reply to the original > request and complete the NFS write (35/41-38/41) returning success to > userspace which causes it to unmap the pages and return them to the > guest (39/41). At this point (40/41) we attempt to transmit the > duplicate request and crash because the pages are not present any more. > > By using libnetfilter_queue I was then able to reproduce this in > non-stress situations by introducing delays into the network. Stalling > all network traffic with the NFS server for 65s every 1000 packets seems > to do the trick (probably on the aggressive side, I didn't attempt to > minimise the interruption required to reproduce and I realise that this > represents a pretty crappy network and/or server). > > Given the observed sequence of events I then constructed a fairly simple > test program (blktest2.c, attached) using O_DIRECT (but not aio, I don't > think it matters) which shows the same issue without involving any Xen > parts. The test writes a buffer filed with a counter to a file and > immediately after the write() returns fills the buffer with 0xdeadbeef. > By using tcpdump I capture and observe duplicated requests on the wire > containing 0xdeadbeef in the payload and not the expected counter > values. I usually see this in a matter of minutes. I've attached a pcap > of a single request/reply pair which was corrupted. > > Presumably in the case of a decent NFS server the XID request cache > would prevent the bogus data actually reaching the disk but on a > non-decent server I suspect it might actually lead to corruption (AIUI > the request cache is not a hard requirement of the NFS protocol?). > Perhaps even a decent server might have timed out the entry in the cache > after such a delay? > > The Xen case and the blktest2 repro was on 2.6.18. I have also > reproduced the blktest2 case on 2.6.27 native but not with Xen since no > domain 0 support exists just yet. > > I can think of several possible option to fix this issue: > * Do not return success to userspace until all duplicated requests > have actually hit the wire, even if the response comes back > earlier than that. > * Cancel queued duplicates if a response comes in late. > * Copy the data pages on retransmission. > > I think I prefer the 3rd option since the first two are a bit tricky > since request has been merged into a tcp stream and may have been > fragmented/segmented already etc. I don't think doing the copy on > retransmits has a massive performance impact -- you must already be > suffering from pretty severe network or server issues! > > I have CC'd the maintainer of the bnx2 Ethernet driver (Michael Chan) > because I have so far only been able to reproduce this with the broadcom > NIC. Even in the same server if I switch to e1000 I cannot reproduce. > However given the analysis above I'm not convinced it is likely to be a > driver bug. > > Ian. I don't see how this could be an RPC bug. The networking layer is supposed to either copy the data sent to the socket, or take a reference to any pages that are pushed via the ->sendpage() abi. IOW: the pages are supposed to be still referenced by the networking layer even if the NFS layer and page cache have dropped their references. Cheers Trond -- Trond Myklebust Linux NFS client maintainer NetApp Trond.Myklebust@xxxxxxxxxx www.netapp.com -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html