RE: [PATCH 2/3] pnfs: introduce pnfs private workqueue

<tao.peng@xxxxxxx> · Wed, 21 Sep 2011 23:30:14 -0400

Hi, Trond,

> -----Original Message-----
> From: linux-nfs-owner@xxxxxxxxxxxxxxx [mailto:linux-nfs-owner@xxxxxxxxxxxxxxx]
> On Behalf Of Trond Myklebust
> Sent: Thursday, September 22, 2011 12:04 AM
> To: Peng Tao
> Cc: Boaz Harrosh; Benny Halevy; Peng, Tao; linux-nfs@xxxxxxxxxxxxxxx;
> honey@xxxxxxxxxxxxxx; rees@xxxxxxxxx
> Subject: Re: [PATCH 2/3] pnfs: introduce pnfs private workqueue
> 
> On Wed, 2011-09-21 at 23:45 +0800, Peng Tao wrote:
> > On Wed, Sep 21, 2011 at 9:56 PM, Boaz Harrosh <bharrosh@xxxxxxxxxxx> wrote:
> > > On 09/21/2011 02:50 PM, Benny Halevy wrote:
> > >> On 2011-09-21 14:42, Boaz Harrosh wrote:
> > >>> On 09/21/2011 02:27 PM, Benny Halevy wrote:
> > >>>>> Unless we do following:
> > >>>>> 1. preallocate memory for extent state convertion
> > >>>>> 2. use nfsiod/rpciod to handle bl_write_cleanup
> > >>>>> 3. for pnfs error case, create a kthread to recollapse and resend to MDS
> > > >>>>> I don't quite understand. How do you use nfs state manager to do other tasks?
> > >>>>
> > >>>> You need to keep a list of things to do hanging off of the nfs client structure
> > >>>> and set a bit in cl_state telling the state manager it has work to do
> > >>>> and wake it up.  It then needs to go over the list of, say nfs_inodes
> > >>>> and call into the layout driver to handle the errors.
> > >>>>
> > >>>> Benny
> > >>>
> > >>> Good god, Is it not already too complicated?
> > >>>
> > >>> The LD is out of the picture. You all seemed to agree that
> > >>> the LD has reported an io_done on the nfsiod/rpciod, and in the error case
> > >>> Generic layer needs to do it's coalescing on some other thread. So
> > >>> your description above is not correct, the LD is out of the picture.
> > >>>
> > >>
> > >> True, if the ld cleanup on io_done is sufficient.
> > >>
> > >>> It all looks too complicated for me. A pnfs workqueue for both 2 and 3
> > >>> above is very good. Specially since the workqueue also shares global
> > >>> pool threads, No? I like it that there is a preallocated thread for
> > >>> the error-case, think about it.
> > >>
> > >> I'm fine too with using a workqueue for the error case.
> > >> But I'd rather have the common case done path do only lightweight,
> > >> wait free processing.
> > >>
> > >> Benny
> > >>
> > >
> > > If by "common case done path do only lightweight" you mean
> > > "preallocate memory for extent state conversion". Then I absolutely
> > > agree. But as far as workqueue/kthread then nfsiod/rpciod-wq or
> > > pnfs-wq is exactly the same for the "common case". Unless I'm
> > > totally missing the point. What are you saying?
> > >
> > > These are the options so far:
> > >
> > > [Toe's option which he rather not]
> > > 1. preallocate memory for extent state conversion
> > > 2. use nfsiod/rpciod to handle bl_write_cleanup
> > > 3. for pnfs error case, create a kthread to recollapse and resend to MDS
> > >
> > > [My option which I think Toe agrees with]
> > > 1. preallocate memory for extent state conversion
> > > 2. use pnfs-wq to handle bl_write_cleanup
> > > 3. pnfs error case, just like Toe's patches as part of io_done
> > >   on pnfs-wq
> > Yeah, I would vote for this one because of its simplicity. ;-)
> 
> Sigh... The problem is that it completely fails to address the problem.
> 
> What's the difference between having pNFS completions run on nfsiod or
> their own work queue? You'd be running i/o and allocations on the same
> queue in both cases.
OK, I see your point. I was under the impression that you didn't want pnfs I/O completion to run on nfsiod because nfsiod is already a heavily contended workqueue. Now I see that the real concern is that the work item may block on memory allocation while running on the workqueue.
Sorry for my ignorance, but I thought workqueues were invented precisely so that work which cannot block in interrupt context can be deferred to process context, where blocking is allowed. Could you please shed some light on why the nfs workqueues (nfsiod/pnfsiod) must not block?
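
To check that I understand the "preallocate memory for extent state conversion" item, here is a minimal userspace sketch of the idea as I read it. It is not kernel code and not taken from the patches: struct ext_state, struct io_ctx, io_prepare() and io_done() are all hypothetical names, and malloc() only stands in for whatever allocator the real code would use. The point is just that the allocation happens at submission time, so the completion side never has to allocate.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical record for one extent whose state must be
 * converted once the write has completed. */
struct ext_state {
	unsigned long long offset;
	unsigned long long length;
	int written;			/* 0 = pending, 1 = converted */
};

/* Hypothetical per-I/O context carried from submission to completion. */
struct io_ctx {
	struct ext_state *prealloc;	/* reserved at submission time */
	int error;			/* nonzero if the I/O failed */
};

/* Submission side: allowed to sleep, so the allocation is done here.
 * Returns 0 on success, -1 if the record could not be allocated. */
int io_prepare(struct io_ctx *io, unsigned long long off,
	       unsigned long long len)
{
	io->error = 0;
	io->prealloc = malloc(sizeof(*io->prealloc));
	if (!io->prealloc)
		return -1;
	io->prealloc->offset = off;
	io->prealloc->length = len;
	io->prealloc->written = 0;
	return 0;
}

/* Completion side: stand-in for the work that would run on the
 * workqueue. It performs no allocation. On success it marks the
 * extent as written and hands the record to the caller, who owns it.
 * On error it only releases the record; a real error path would
 * hand the I/O off for a resend instead. */
struct ext_state *io_done(struct io_ctx *io)
{
	struct ext_state *es = io->prealloc;

	assert(es != NULL);	/* io_prepare() must have succeeded */
	io->prealloc = NULL;
	if (io->error) {
		free(es);
		return NULL;
	}
	es->written = 1;
	return es;
}
```

With this split, a caller would run io_prepare() before issuing the write and io_done() from the completion handler. Whether the real extent conversion can always be covered by a single record reserved up front is an assumption of the sketch, not something I have verified.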

Looking at some users of nfsiod, it seems they may already block on memory allocation in some cases, for example in this code path: nfs4_open_confirm_release -> nfs4_opendata_to_nfs4_state -> nfs_fhget -> iget5_locked -> alloc_inode -> kmem_cache_alloc

Other filesystems also wrap up I/O in a workqueue (e.g., ext4's ext4_end_io_work and reiserfs's flush_async_commits), and they may sleep there in memory allocation or wait_on_buffer().

Best Regards,
Tao
