Re: SQUASHME: missing from FIXME: async layout return

Boaz Harrosh <bharrosh@xxxxxxxxxxx> · Thu, 13 May 2010 19:23:28 +0300

On 05/13/2010 07:04 PM, William A. (Andy) Adamson wrote:
> On Thu, May 13, 2010 at 11:31 AM, Boaz Harrosh <bharrosh@xxxxxxxxxxx> wrote:
>> On 05/13/2010 06:17 PM, William A. (Andy) Adamson wrote:
>>> On Thu, May 13, 2010 at 10:40 AM, Benny Halevy <bhalevy@xxxxxxxxxxx> wrote:
>>>> On May. 13, 2010, 17:19 +0300, "William A. (Andy) Adamson" <androsadamson@xxxxxxxxx> wrote:
>>>>> On Thu, May 13, 2010 at 4:40 AM, Boaz Harrosh <bharrosh@xxxxxxxxxxx> wrote:
>>>>>> I've tested the patch:
>>>>>>        FIXME: async layout return
>>>>>>
>>>>>> And there is a missing small hunk
>>>>>>
>>>>>> I have tested with this patch and it is a very good patch
>>>>>> that should also go into 2.6.33. It is necessary in the rare
>>>>>> case when one inode have more then one open_context.
>>>>>
>>>>> Do you mean more than one open context per open owner?
>>>>
>>>> What we see is one "regular" open context and one which is the layout_commit_ctx
>>>
>>> Isn't that a BUG?
>>>
>>> Here is what we're seeing:
>>>
>>> nfs_file_release->nfs_release->nfs_file_clear_open_context->__put_nfs_open_context->
>>> NFS_PROTO(inode)->close_context->nfs4_close_sync->__nfs4_close->pnfs_layoutcommit_inode,
>>>
>>> This is the same code that Boaz's 'missing small hunk' adds the wait
>>> to the pnfs_layout_commit_inode to. This is good, because when
>>> __nfs4_close is called with sync, every thing must be sent/returned
>>> prior to the nfs_do_close call.
>>>
>>> But there is still a problem. Here is the __nfs4_close call (with out
>>> the wait that Boaz added)
>>>
>>>                 if (nfsi->layoutcommit_ctx)
>>>                         pnfs_layoutcommit_inode(state->inode, 0);
>>>                 if (has_layout(nfsi) && nfsi->layout.roc_iomode) {
>>>                         struct nfs4_pnfs_layout_segment range;
>>>
>>>                         range.iomode = nfsi->layout.roc_iomode;
>>>                         range.offset = 0;
>>>                         range.length = NFS4_MAX_UINT64;
>>>                         pnfs_return_layout(state->inode, &range, NULL,
>>>                                            RETURN_FILE);
>>>
>>> Note that a pnfs_return_layout which is always 'sync' (async with wait
>>> for completion). So the (currently) async LAYOUTCOMMIT call returns,
>>> and a LAYOUTRETURN is put on the wire
>>>
>>> Then the LAYOUTCOMMIT rpc_call_done routine calls
>>> pnfs_layoutcommit_done which calls put_nfs_open_context. If the open
>>> context is different from the open context that was put by
>>> nfs_file_release, then
>>>
>>> pnfs_layoutcommit_done->put_nfs_open_context-> ..... ->__nfs4_close
>>> and the return on close LAYOUTRETURN is sent again!
>>>
>>> Of couse this second LAYOUTRETURN either gets a zero stateid, or uses
>>> the same stateid as the first LAYOUTRETURN, and the reply to the
>>> second LAYOUTRETURN will result in a NFS4ERR_BAD_STATEID .....
>>>
>>> This will occur whether the LAYOUTCOMMIT is async or sync as they both
>>> call pnfs_layoutcommit_done.
>>>
>>> I need to understand why there are two open contexts. On the face of
>>> it, it seems wrong.
>>>
>>> We also add a pointer to the open context in the nfs_write_data, and
>>> in the pnfs_layoutcommit_data. Do we take a reference on the open data
>>> in theses cases? .
>>>
>>> -->Andy
>>>
>>>
>>
>> Yes on all acounts. This is what Benny's patch is fixing please try it out
>> it is most important (Add my small fix)
> 
> Is this what you mean?
> 
> With Benny's patch, with return-on-close set and the layoutcommit_ctx
> different from the open context that caused the __nfs4_close:
> 
> The (now) sync LAYOUTCOMMIT triggered from the nfs_file_release
> __nfs4_close() will complete prior to calling the return-on-close
> LAYOUTRETURN call. Since pnfs_layoutcommit_done is still called,
> put_nfs_open_context will be call on the layoutconmit_ctx (saved in
> pnfs_layoutcommit_data->ctx) and __nfs4_close will be called.
> 
> In _this_ instance of calling __nfs4_close (from
> pnfs_layoutcommit_done), the nfsi->layoutcommit_ctx pointer is null so
> the pnfs_layoutcommit_inode call is skipped, and  pnfs_return_layout
> is called. Since it is also a sync call, pnfs_layoutcommit_done does
> not return until it is complete. So the layout is returned, the layout
> is freed (all layout segments 'cause the range covers the whole file)
> and pnfs_layoutcommit_inode returns to the FIRST __nfs4_close (from
> nfs_file_release)
> 
> Now since return-on-close is set, pnfs_layout_return is called (in the
> nfs_file_release __nfs4_close). This LAYOUTRETURN call will never go
> on the wire because the first pnfs_layout_return (from the
> pnfs_layoutcommit_done __nfs4_close) has removed the layout segment,
> and all is well.
> 
> Sheese.
> 
> --->Andy
> 

Yes I have seen this as well. I went even farther then Benny's patch
and added the following hunk which works very well BTW, but I now understand
the it is not necessary, exactly as you described above.

After all this will settle, I have a pending patch that cleans this code up 
by moving the hunk from __nfs4_close() into a pnfs_close_context() function in pnfs.c
(And empty if not 4.1) this way lots of duplicate code gets thrown away
for example the double lo_commit codes

---

diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
index 06f20b9..1740648 100644
--- a/fs/nfs/pnfs.c
+++ b/fs/nfs/pnfs.c
@@ -739,6 +754,11 @@ _pnfs_return_layout(struct inode *ino, struct nfs4_pnfs_layout_segment *range,
 		}
 
 		lo = get_lock_current_layout(nfsi);
+		if (wait && atomic_read(&nfsi->layout.lretcount)) {
+			printk(KERN_ERR "%s(%lx): wait lretcount=%d\n", __func__, ino->i_ino,
+				atomic_read(&nfsi->layout.lretcount));
+ 			drain_layoutreturns(&nfsi->layout);
+		}
 		if (lo && !has_layout_to_return(lo, &arg)) {
 			put_unlock_current_layout(lo);
 			lo = NULL;
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html