Re: [PATCH] pnfs: Kick a pnfs_layoutcommit_inode on recall

Boaz Harrosh <boaz@xxxxxxxxxxxxx> · Tue, 26 Aug 2014 21:19:41 +0300

On 08/26/2014 08:54 PM, Trond Myklebust wrote:
> On Tue, Aug 26, 2014 at 1:06 PM, Boaz Harrosh <boaz@xxxxxxxxxxxxx> wrote:

> 
> The deadlock occurs _if_ the above layout commit is unable to get a
> slot. You can't guarantee that it will, because the slot table is a
> finite resource and it can be exhausted 

Yes, all I have ever seen is 1 slot in any of the clients/servers I've
looked at, so I assume there is only ever 1 slot.

> if you allow fore channel
> calls to trigger synchronous recalls on the back channel 

Beep! But this is exactly what I'm trying to say. The STD specifically
forbids that. The server is not allowed to wait here; it must return
immediately, with an error that frees the slot, and only later issue the
RECALL.

This is what I said exactly three times in my mail, and what I have
depicted in my flow:
	Server async operation (mandated by the STD)
	Client back-channel can be sync with fore channel (not mentioned by the STD)

> that again trigger synchronous calls on the fore channel. 

> You're basically saying
> that the client needs to guarantee that it can allocate 2 slots before
> it is allowed to send a layoutget just in case the server needs to
> recall a layout.
> 

No, I am not saying that; please count. Since the server is not allowed
a sync operation, one slot is enough, and the client can do a sync
lo_commit while in recall.

> If, OTOH, the layoutcommit is asynchronous, then there is no
> serialisation and the back channel thread can happily reply to the
> layout recall even if there are no free slots in the fore channel.
> 

Sure, that will work as well, but not optimally, and for no good reason.

Please go back to my flow with the 3 cases. See how the server never waits
for anything and will always reply immediately to the layout_get.
Since the server is not allowed a sync operation and is mandated by the
RFC text not to wait, the client is allowed to, and can, do sync operations,
because it is enough that only one side is async.
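
To make the slot counting concrete, here is a toy model of that flow. It is
only a sketch, not kernel code: SlotTable, run_flow and the
server_replies_immediately flag are names made up for illustration, and the
fore-channel slot table is reduced to a plain counter.

```python
class SlotTable:
    """Hypothetical fore-channel slot table with a fixed number of slots."""

    def __init__(self, size=1):
        self.free = size

    def acquire(self):
        # Returns False instead of blocking, so a would-be hang is visible.
        if self.free == 0:
            return False
        self.free -= 1
        return True

    def release(self):
        self.free += 1


def run_flow(server_replies_immediately, slots=1):
    """Model: LAYOUTGET, then a recall, then a sync layoutcommit in the recall."""
    table = SlotTable(slots)
    log = []

    # Client sends LAYOUTGET and holds a fore-channel slot for it.
    assert table.acquire()
    log.append("LAYOUTGET sent")

    if server_replies_immediately:
        # Server does not wait: it replies with an error, which frees the slot.
        table.release()
        log.append("LAYOUTGET replied with error, slot freed")

    # Server issues the recall on the back channel.
    log.append("recall received")

    # Client performs a synchronous layoutcommit from the recall handler.
    if not table.acquire():
        log.append("layoutcommit cannot get a slot")
        return "deadlock", log
    log.append("layoutcommit sent")
    table.release()
    log.append("recall replied")
    return "ok", log


if __name__ == "__main__":
    for immediate in (True, False):
        result, steps = run_flow(immediate)
        print(immediate, result, steps)
```

In this model a single slot is enough as long as the server replies before
recalling; the commit only fails to get a slot if the server holds the
layout_get open while it recalls.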

BTW: If what you are saying is true, then there is a bug in the slot code,
because this patch does work, and everything flows past this situation.
I have a reproducer test that fails 100% of the time without this patch.
With this patch applied it only fails much later, at some other place,
and not at this deadlock.

Cheers
Boaz
