Re: [RFC][PATCH] Vector read/write support for NFS (DIO) client

On Wed, 13 Apr 2011 06:43:53 -0700
Badari Pulavarty <pbadari@xxxxxxxxxx> wrote:

> On 4/13/2011 5:36 AM, Jeff Layton wrote:
> > On Tue, 12 Apr 2011 10:46:00 -0700
> > Badari Pulavarty<pbadari@xxxxxxxxxx>  wrote:
> >
> >    
> >> On Tue, 2011-04-12 at 12:42 -0400, Chuck Lever wrote:
> >>      
> >>> On Apr 12, 2011, at 12:15 PM, Badari Pulavarty wrote:
> >>>
> >>>        
> >>>> On Tue, 2011-04-12 at 11:36 -0400, Chuck Lever wrote:
> >>>>          
> >>>>> On Apr 12, 2011, at 11:32 AM, Badari Pulavarty wrote:
> >>>>>
> >>>>>            
> >>>>>> Hi,
> >>>>>>
> >>>>>> We recently ran into a serious performance issue with the NFS
> >>>>>> client. It turned out to be due to the lack of readv/writev
> >>>>>> support in the NFS (O_DIRECT) client.
> >>>>>>
> >>>>>> Here is our use-case:
> >>>>>>
> >>>>>> In our cloud environment, our storage is over NFS. Files on
> >>>>>> NFS are passed as block devices to the guest (using O_DIRECT).
> >>>>>> When the guest does IO on these block devices, the requests end
> >>>>>> up as O_DIRECT writes to NFS (on the KVM host).
> >>>>>>
> >>>>>> QEMU (on the host) gets a vector from the virtio ring and
> >>>>>> submits it. Older versions of QEMU linearized the vector they
> >>>>>> got from KVM (copied it into a single buffer) and submitted the
> >>>>>> buffer, so the NFS client always received a single buffer.
> >>>>>>
> >>>>>> Later versions of QEMU eliminated this copy and submit the
> >>>>>> vector directly using preadv()/pwritev().
> >>>>>>
> >>>>>> The NFS client loops through the vector and submits each
> >>>>>> segment as a separate request whenever the IO is <  wsize. In
> >>>>>> our case (negotiated wsize=1MB), a 256K IO arrives as 64
> >>>>>> segments of 4K each, so we end up submitting 64 4K FILE_SYNC
> >>>>>> WRITEs and the server handles each 4K synchronously. This
> >>>>>> causes serious performance degradation. We are trying to see
> >>>>>> whether the performance improves if we convert the IOs to
> >>>>>> ASYNC - but our initial results don't look good.
> >>>>>>
> >>>>>> Full readv/writev support in the NFS client for all possible
> >>>>>> cases is hard. However, if all segments are page-aligned and
> >>>>>> their sizes are page multiples, the case fits the current code
> >>>>>> easily. Luckily, the QEMU use-case meets these requirements.
> >>>>>>
> >>>>>> Here is the patch to add this support. Comments ?
> >>>>>>              
> >>>>> Restricting buffer alignment requirements would be an onerous API change, IMO.
> >>>>>            
> >>>> I am not suggesting an API change at all. All I am doing is: if
> >>>> all the IOs are aligned, we can take a fast path and do them in a
> >>>> single IO request (as if we got a single buffer). Otherwise, we do
> >>>> it the hard way as today - loop through each segment and handle it
> >>>> individually.
> >>>>          
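[For illustration, a minimal sketch of the alignment test such a fast
path would need; the helper name is hypothetical and not taken from the
posted patch. Every segment must start on a page boundary and be a
whole number of pages, so the vector can be treated as one flat buffer.

#include <linux/types.h>
#include <linux/uio.h>
#include <linux/mm.h>

/* Hypothetical helper: true only if the whole iovec qualifies for the
 * single-request fast path described above. */
static bool nfs_iov_is_page_aligned(const struct iovec *iov,
				    unsigned long nr_segs)
{
	unsigned long i;

	for (i = 0; i < nr_segs; i++) {
		/* segment must start on a page boundary */
		if ((unsigned long)iov[i].iov_base & ~PAGE_MASK)
			return false;
		/* segment length must be a whole number of pages */
		if (iov[i].iov_len & ~PAGE_MASK)
			return false;
	}
	return true;
}

With a check like this, the aligned case can be coalesced into one WRITE
per wsize, and everything else falls back to today's per-segment loop.]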
> >>> Thanks for the clarification.  That means you don't also address the problem of doing multiple small segments with FILE_SYNC writes.
> >>>
> >>>        
> >>>>> If the NFS write path is smart enough not to set FILE_SYNC when there are multiple segments to write, then the problem should be mostly fixed.  I think Jeff Layton already has a patch that does this.
> >>>>>            
> >>>> We are trying that patch. It does improve the performance a
> >>>> little, but nowhere near what we get from submitting a single
> >>>> vector/buffer.
> >>>>
> >>>> Khoa, can you share your performance data for all the
> >>>> suggestions/patches you tried so far ?
> >>>>          
> >>> The individual WRITEs should be submitted in parallel.  If there is additional performance overhead, it is probably due to the small RPC slot table size.  Have you tried increasing it?
> >>>        
> >> We haven't tried both fixes together (RPC slot table increase plus
> >> turning the IOs into ASYNC). Individually, neither helped much. We
> >> will try them together.
> >>
> >>      
> > I must have missed some of these emails, but here's the patch that
> > Chuck mentioned, based on his suggestion. It may be reasonable as a
> > stopgap solution until Trond's overhaul of the DIO code is complete.
> >
> > With this + a larger slot table size, I would think you'd have a
> > substantial write performance improvement. Probably not as good as if
> > the writes were coalesced, but it should help.
> >
> > Badari, if you can let us know whether this plus increasing the slot
> > table size helps, then I'll plan to "officially" submit it. This one is
> > against RHEL6 but it should apply to mainline kernels with a small
> > offset.
> >
> > -----------------[snip]-----------------
> >
> > BZ#694309: nfs: use unstable writes for bigger groups of DIO writes
> >
> > Currently, the client uses FILE_SYNC whenever the amount of data
> > being written is less than or equal to the wsize. This is a problem,
> > though, when a bunch of small iovecs are batched up in a single
> > writev call: the client will iterate over them and issue a separate
> > FILE_SYNC WRITE for each one.
> >
> > Instead, change the code to do unstable writes when multiple WRITE
> > RPCs will be needed to satisfy the request. While we're at it,
> > optimize away the allocation of commit_data when we aren't going
> > to use it anyway.
> >
> > Signed-off-by: Jeff Layton<jlayton@xxxxxxxxxx>
> >    
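[As a rough illustration of the decision described in the patch blurb
above - the function name is made up here, not the actual RHEL6 or
mainline symbol - the stable-flag choice boils down to whether the
request fits in a single WRITE RPC:

#include <linux/types.h>
#include <linux/nfs.h>

/* Illustrative only: FILE_SYNC when one WRITE RPC covers the whole
 * request, otherwise unstable writes followed by a single COMMIT. */
static int nfs_direct_choose_stable(size_t total_bytes, size_t wsize)
{
	if (total_bytes <= wsize)
		return NFS_FILE_SYNC;	/* single RPC, server commits it */
	return NFS_UNSTABLE;		/* many RPCs, one COMMIT at the end */
}]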
> 
> Khoa is running tests with this + the RPC slot table change.
> Single-thread performance improved, but the multi-thread tests don't
> scale (I guess we are running out of RPC slot table entries even with
> 128?). He will share the results shortly.
> 
> Thanks,
> Badari
> 

Ok, good to know.

I tend to think that the whole slot table concept needs some
reconsideration anyway. Off the top of my head...

We could put the rpc_rqst's into a slab cache and give each rpc_xprt a
mempool with a minimum number of slots. Have them all be allocated with
GFP_NOWAIT; if that returns a NULL pointer, the task can sleep on the
wait queue like it does today. That way, clients can allocate
rpc_rqst's as they need them, as long as memory holds out.
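[Something like the following, to make the idea concrete. The cache and
function names are invented here; this is only a sketch of the
allocation strategy, not a real patch:

#include <linux/slab.h>
#include <linux/mempool.h>
#include <linux/sunrpc/xprt.h>

#define RPC_MIN_SLOTS	2		/* guaranteed minimum per transport */

static struct kmem_cache *rpc_rqst_slab;	/* invented name */

static int rpc_create_slot_pool(mempool_t **poolp)
{
	rpc_rqst_slab = kmem_cache_create("rpc_rqst",
					  sizeof(struct rpc_rqst),
					  0, 0, NULL);
	if (!rpc_rqst_slab)
		return -ENOMEM;

	*poolp = mempool_create_slab_pool(RPC_MIN_SLOTS, rpc_rqst_slab);
	if (!*poolp) {
		kmem_cache_destroy(rpc_rqst_slab);
		return -ENOMEM;
	}
	return 0;
}

static struct rpc_rqst *rpc_alloc_slot(mempool_t *pool)
{
	/* Opportunistic: NULL here means "sleep on the backlog queue
	 * and retry", exactly as tasks do today when no slot is free. */
	return mempool_alloc(pool, GFP_NOWAIT);
}]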

We have the reserve_xprt stuff to handle congestion control anyway so I
don't really see the value in the artificial limits that the slot table
provides.

Maybe I should hack up a patchset for this...

-- 
Jeff Layton <jlayton@xxxxxxxxxx>

