> -----Original Message-----
> From: Benny Halevy [mailto:bhalevy@xxxxxxxxxx]
> Sent: Wednesday, September 21, 2011 6:57 PM
> To: Peng, Tao
> Cc: Trond.Myklebust@xxxxxxxxxx; linux-nfs@xxxxxxxxxxxxxxx; honey@xxxxxxxxxxxxxx; rees@xxxxxxxxx
> Subject: Re: [PATCH 2/3] pnfs: introduce pnfs private workqueue
>
> On 2011-09-21 13:23, tao.peng@xxxxxxx wrote:
> > Hi, Benny,
> >
> >> -----Original Message-----
> >> From: linux-nfs-owner@xxxxxxxxxxxxxxx [mailto:linux-nfs-owner@xxxxxxxxxxxxxxx]
> >> On Behalf Of Benny Halevy
> >> Sent: Wednesday, September 21, 2011 3:05 PM
> >> To: Peng, Tao
> >> Cc: Trond.Myklebust@xxxxxxxxxx; linux-nfs@xxxxxxxxxxxxxxx; honey@xxxxxxxxxxxxxx; rees@xxxxxxxxx
> >> Subject: Re: [PATCH 2/3] pnfs: introduce pnfs private workqueue
> >>
> >> On 2011-09-21 08:16, tao.peng@xxxxxxx wrote:
> >>>> -----Original Message-----
> >>>> From: Myklebust, Trond [mailto:Trond.Myklebust@xxxxxxxxxx]
> >>>> Sent: Wednesday, September 21, 2011 12:20 PM
> >>>> To: Peng, Tao
> >>>> Cc: bhalevy@xxxxxxxxxx; linux-nfs@xxxxxxxxxxxxxxx; honey@xxxxxxxxxxxxxx; rees@xxxxxxxxx
> >>>> Subject: RE: [PATCH 2/3] pnfs: introduce pnfs private workqueue
> >>>>
> >>>>> -----Original Message-----
> >>>>> From: tao.peng@xxxxxxx [mailto:tao.peng@xxxxxxx]
> >>>>> Sent: Tuesday, September 20, 2011 10:44 PM
> >>>>> To: Myklebust, Trond
> >>>>> Cc: bhalevy@xxxxxxxxxx; linux-nfs@xxxxxxxxxxxxxxx; honey@xxxxxxxxxxxxxx; rees@xxxxxxxxx
> >>>>> Subject: RE: [PATCH 2/3] pnfs: introduce pnfs private workqueue
> >>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: linux-nfs-owner@xxxxxxxxxxxxxxx [mailto:linux-nfs-owner@xxxxxxxxxxxxxxx]
> >>>>>> On Behalf Of Jim Rees
> >>>>>> Sent: Wednesday, September 21, 2011 8:29 AM
> >>>>>> To: Trond Myklebust
> >>>>>> Cc: Benny Halevy; linux-nfs@xxxxxxxxxxxxxxx; peter honeyman
> >>>>>> Subject: Re: [PATCH 2/3] pnfs: introduce pnfs private workqueue
> >>>>>>
> >>>>>> Trond Myklebust wrote:
> >>>>>>
> >>>>>> On Mon, 2011-09-19 at 23:18 -0400, Jim Rees wrote:
> >>>>>> > From: Peng Tao <bergwolf@xxxxxxxxx>
> >>>>>> >
> >>>>>> > For layoutdriver io done functions, the default workqueue is not a good
> >>>>>> > place, as the code is executed in the IO path. So add a pnfs private
> >>>>>> > workqueue to handle them.
> >>>>>> >
> >>>>>> > Also change the block and object layout code to make use of this private
> >>>>>> > workqueue.
> >>>>>> >
> >>>>>>
> >>>>>> Wait, what????
> >>>>>>
> >>>>>> Why is the I/O path (i.e. the nfsiod queue) inappropriate for
> >>>>>> layoutdriver io_done functions?
> >>>>> The current code uses the system_wq to handle io_done functions. If any
> >>>>> function allocates memory in the system_wq and causes dirty writeback in the
> >>>>> nfs path, the io_done function can never be called, because it is queued after
> >>>>> the original function in the system_wq. And you said that rpciod/nfsiod is not
> >>>>> the ideal place because of memory allocation. So I suggested a pnfs private
> >>>>> workqueue and Benny agreed to it.
> >>>>
> >>>> You appear to be optimizing for a corner case. Why?
> >>> The current code uses system_wq for io_done and it can cause a deadlock, so
> >>> this is more than just an optimization.
> >>>
> >>>>
> >>>> IOW: why do we need a whole new workqueue for something which is
> >>>> supposed to be extremely rare? Why can't we just use standard keventd or
> >>>> allocate a perfectly normal thread (e.g. the state recovery thread)?
> >>> Unlike nfsiod and rpciod, keventd (aka system_wq) is not created with
> >>> WQ_MEM_RECLAIM, indicating that it should not be used in the IO path, which
> >>> can be invoked during memory reclaim. And I agree that a normal kthread could
> >>> solve that part.
> >>>
> >>> However, the blocklayout write end_io function still needs a context where it
> >>> can allocate memory to convert extent state (for every IO that touches new
> >>> extents). If you look into the patch, we are not just using the pnfs private
> >>> workqueue to handle pnfs_ld_read/write_done, but also calling
> >>> mark_extents_written() inside the workqueue.
> >>
> >> Can this memory be preallocated before the I/O for the common case?
> > It is possible to preallocate memory for the common case, but it doesn't remove
> > the need for a workqueue. Without a workqueue, we would need to put a bunch of
> > extent manipulation code into bio->end_io, which is executed in the bottom half,
> > where execution time must always be minimized.
> >
> > Unless we do the following:
> > 1. preallocate memory for extent state conversion
> > 2. use nfsiod/rpciod to handle bl_write_cleanup
> > 3. for the pnfs error case, create a kthread to recollapse and resend to the MDS
> >
> > Not sure if it's worth the complexity though...
>
> We do want to avoid memory allocation on the completion path, to ensure
> completion under memory pressure and graceful handling of low-memory
> conditions, so I think this approach is worthwhile.
>
> Having a lightweight handler on the bottom-half done path is ok too.
>
> Doing the heavy lifting for the (presumably rare) error case in a
> workqueue/kthread makes sense, it just mustn't block. Could you use the nfs
> state manager for that?

I don't quite understand. How would you use the nfs state manager to do other
tasks?

Thanks,
Tao

>
> Benny
>
> >
> > Cheers,
> > Tao
> >
> >>
> >> Benny
> >>
> >>> If we don't create a pnfs private workqueue for pnfs_ld_read/write_done
> >>> (i.e., by just creating a normal kthread for the error case), we would still
> >>> have to create a workqueue inside the blocklayout driver to handle write
> >>> end_io. And if blocklayout has a private workqueue, it no longer needs the
> >>> normal kthread for read/write_done error handling, which leaves the kthread
> >>> specific only to the object layout (if it doesn't need a workqueue the way
> >>> blocklayout does).
> >>>
> >>> So I think a generic pnfs workqueue is better because it simplifies things and
> >>> solves both problems.
> >>>
> >>> Best,
> >>> Tao
> >>>
> >>>>
> >>>> Trond
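
For readers who want the shape of what is being discussed, below is a minimal
sketch of the idea: a dedicated workqueue allocated with WQ_MEM_RECLAIM (so a
rescuer thread guarantees forward progress during memory reclaim), a bio end_io
handler that stays lightweight by only queueing work, and a process-context
handler where the extent-state conversion and error handling can run. The
identifiers (pnfsiod_workqueue, bl_write_done_work, bl_write_cleanup,
pnfsiod_start/stop) are illustrative assumptions for this sketch, not the names
used in the actual patch series, and the end_io signature is the 2011-era one.

/*
 * Illustrative sketch only -- not the code from the patch series.
 */
#include <linux/workqueue.h>
#include <linux/bio.h>
#include <linux/slab.h>

/* Private workqueue: WQ_MEM_RECLAIM gives it a rescuer thread so queued
 * work can still make progress while the system is reclaiming memory. */
static struct workqueue_struct *pnfsiod_workqueue;

struct bl_write_done_work {
	struct work_struct work;
	struct bio *bio;
	int error;
};

/* Runs in process context on the private workqueue: this is where the
 * heavier work (extent state conversion, pnfs_ld_write_done(), resending
 * through the MDS on error) would live. */
static void bl_write_cleanup(struct work_struct *work)
{
	struct bl_write_done_work *w =
		container_of(work, struct bl_write_done_work, work);

	/* ... mark extents written / handle w->error here ... */

	bio_put(w->bio);
	kfree(w);
}

/* bio completion runs in bottom-half context: do nothing heavy here,
 * just hand the completed bio off to the private workqueue.  The
 * bl_write_done_work structure is assumed to have been preallocated
 * before the I/O was issued and stashed in bio->bi_private. */
static void bl_end_io_write(struct bio *bio, int err)
{
	struct bl_write_done_work *w = bio->bi_private;

	w->bio = bio;
	w->error = err;
	INIT_WORK(&w->work, bl_write_cleanup);
	queue_work(pnfsiod_workqueue, &w->work);
}

static int pnfsiod_start(void)
{
	pnfsiod_workqueue = alloc_workqueue("pnfsiod", WQ_MEM_RECLAIM, 0);
	return pnfsiod_workqueue ? 0 : -ENOMEM;
}

static void pnfsiod_stop(void)
{
	destroy_workqueue(pnfsiod_workqueue);
}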