On 2011-09-21 14:10, tao.peng@xxxxxxx wrote: >> -----Original Message----- >> From: Benny Halevy [mailto:bhalevy@xxxxxxxxxx] >> Sent: Wednesday, September 21, 2011 6:57 PM >> To: Peng, Tao >> Cc: Trond.Myklebust@xxxxxxxxxx; linux-nfs@xxxxxxxxxxxxxxx; >> honey@xxxxxxxxxxxxxx; rees@xxxxxxxxx >> Subject: Re: [PATCH 2/3] pnfs: introduce pnfs private workqueue >> >> On 2011-09-21 13:23, tao.peng@xxxxxxx wrote: >>> Hi, Benny, >>> >>>> -----Original Message----- >>>> From: linux-nfs-owner@xxxxxxxxxxxxxxx >> [mailto:linux-nfs-owner@xxxxxxxxxxxxxxx] >>>> On Behalf Of Benny Halevy >>>> Sent: Wednesday, September 21, 2011 3:05 PM >>>> To: Peng, Tao >>>> Cc: Trond.Myklebust@xxxxxxxxxx; linux-nfs@xxxxxxxxxxxxxxx; >>>> honey@xxxxxxxxxxxxxx; rees@xxxxxxxxx >>>> Subject: Re: [PATCH 2/3] pnfs: introduce pnfs private workqueue >>>> >>>> On 2011-09-21 08:16, tao.peng@xxxxxxx wrote: >>>>>> -----Original Message----- >>>>>> From: Myklebust, Trond [mailto:Trond.Myklebust@xxxxxxxxxx] >>>>>> Sent: Wednesday, September 21, 2011 12:20 PM >>>>>> To: Peng, Tao >>>>>> Cc: bhalevy@xxxxxxxxxx; linux-nfs@xxxxxxxxxxxxxxx; honey@xxxxxxxxxxxxxx; >>>>>> rees@xxxxxxxxx >>>>>> Subject: RE: [PATCH 2/3] pnfs: introduce pnfs private workqueue >>>>>> >>>>>>> -----Original Message----- >>>>>>> From: tao.peng@xxxxxxx [mailto:tao.peng@xxxxxxx] >>>>>>> Sent: Tuesday, September 20, 2011 10:44 PM >>>>>>> To: Myklebust, Trond >>>>>>> Cc: bhalevy@xxxxxxxxxx; linux-nfs@xxxxxxxxxxxxxxx; >>>>>> honey@xxxxxxxxxxxxxx; >>>>>>> rees@xxxxxxxxx >>>>>>> Subject: RE: [PATCH 2/3] pnfs: introduce pnfs private workqueue >>>>>>> >>>>>>> >>>>>>>> -----Original Message----- >>>>>>>> From: linux-nfs-owner@xxxxxxxxxxxxxxx >>>>>>>> [mailto:linux-nfs-owner@xxxxxxxxxxxxxxx] >>>>>>>> On Behalf Of Jim Rees >>>>>>>> Sent: Wednesday, September 21, 2011 8:29 AM >>>>>>>> To: Trond Myklebust >>>>>>>> Cc: Benny Halevy; linux-nfs@xxxxxxxxxxxxxxx; peter honeyman >>>>>>>> Subject: Re: [PATCH 2/3] pnfs: introduce pnfs private workqueue >>>>>>>> >>>>>>>> Trond Myklebust wrote: >>>>>>>> >>>>>>>> On Mon, 2011-09-19 at 23:18 -0400, Jim Rees wrote: >>>>>>>> > From: Peng Tao <bergwolf@xxxxxxxxx> >>>>>>>> > >>>>>>>> > For layoutdriver io done functions, default workqueue is not a >>>>>> good >>>>>>> place as >>>>>>>> > the code is executed in IO path. So add a pnfs private workqueue >>>>>> to >>>>>>> handle >>>>>>>> > them. >>>>>>>> > >>>>>>>> > Also change block and object layout code to make use of this >>>>>> private >>>>>>>> > workqueue. >>>>>>>> > >>>>>>>> >>>>>>>> Wait, what???? >>>>>>>> >>>>>>>> Why is the I/O path (i.e. the nfsiod queue) inappropriate for >>>>>>>> layoutdriver io_done functions? >>>>>>> The current code uses the system_wq to handle io_done functions. If >>>>>> any >>>>>>> function allocates memory in system_wq and causes dirty writeback in >>>>>> nfs >>>>>>> path, the io_done function can never be called because it is after the >>>>>> original >>>>>>> function in the system_wq. And you said that the rpciod/nfsiod is not >>>>>> the >>>>>>> ideal place because of memory allocation. So I suggested pnfs private >>>>>>> workqueue and Benny agreed on it. >>>>>> >>>>>> You appear to be optimizing for a corner case. Why? >>>>> Current code uses system_wq for io_done and it can cause deadlock. So this is >>>> more than just an optimization. >>>>> >>>>>> >>>>>> IOW: why do we need a whole new workqueue for something which is >>>>>> supposed to be extremely rare? Why can't we just use standard keventd or >>>>>> allocate a perfectly normal thread (e.g. the state recovery thread)? >>>>> Unlike nfsiod and rpciod, keventd (aka, system_wq) is not created with >>>> WQ_MEM_RECLAIM, indicating that it should not be used for IO path which can >> be >>>> invoked during memory reclaim. And I agree that a normal kthread can solve it. >>>>> >>>>> However, the blocklayout write end_io function still needs a context where it >> can >>>> allocate memory to convert extent state (for every IO that touches new extents). >> If >>>> you look into the patch, we are not just using the pnfs private workqueue to >> handle >>>> pnfs_ld_read/write_done, but also calling mark_extents_written() inside the >>>> workqueue. >>>> >>>> Can this memory be preallocated before the I/O for the common case? >>> It is possible to preallocate memory for the common case. But it doesn't remove >> the need for workqueue. Without workqueue, we will need to put a bunch of extent >> manipulation code into bio->end_io, which is executed in bottom half and we >> always need minimize its execution. >>> >>> Unless we do following: >>> 1. preallocate memory for extent state convertion >>> 2. use nfsiod/rpciod to handle bl_write_cleanup >>> 3. for pnfs error case, create a kthread to recollapse and resend to MDS >>> >>> not sure if it worth the complexity though... >> >> We do want to avoid memory allocation on the completion path to ensure >> completion under memory pressure and graceful handling of low memory >> conditions >> so I think this approach is worth while. >> >> Having a lightweight handler on the bottom half done path is ok too. >> >> Doing the heavy lifting for the (assumingly rare) error case in a workqueue/ >> kthread makes a sense, just it mustn't block. Could you use the nfs state manager >> for that? > I don't quite understand. How do you use nfs state manager to do other tasks? You need to keep a list of things to do hanging off of the nfs client structure and set a bit in cl_state telling the state manager it has work to do and wake it up. It then needs to go over the list of, say nfs_inodes and call into the layout driver to handle the errors. Benny > > Thanks, > Tao > >> >> Benny >> >>> >>> Cheers, >>> Tao >>> >>>> >>>> Benny >>>> >>>>> If we don't create a pnfs private workqueue for pnfs_ld_read/write_done (i.e., >> by >>>> just creating a normal kthread in error case), we would still have to create a >>>> workqueue inside blocklayout driver to handle write end_io. And if blocklayout >> has a >>>> private workqueue, it no longer needs the normal kthread for read/write_done >> error >>>> handling, which leaves the kthread only specific to object layout (if it doesn't >> need a >>>> workqueue like blocklayout does). >>>>> >>>>> So I think a generic pnfs workqueue is better because it simplifies things and >> solve >>>> both problems. >>>>> >>>>> Best, >>>>> Tao >>>>> >>>>>> >>>>>> Trond >>>>> -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html