Re: RFC on fix to bug #802414

Anand Avati <aavati@xxxxxxxxxx> · Tue, 22 May 2012 00:11:36 -0700

<in continuation from our chat>

The PARENT_DOWN_HANDLED approach will take us backwards from the current 
state where we are resiliant to frame losses and other class of bugs 
(i.e, if a frame loss happens on either server or client, it only 
results in prevented graph cleanup but the graph switch still happens).

The root "cause" here is that we are giving up on a very important and 
fundamental principle of immutability on the fd object. The real 
solution here is to never modify fd->inode. Instead we must bring about 
a more native fd "migration" than just re-opening an existing fd on the 
new graph.

Think of the inode migration analogy. The handle coming from FUSE (the 
address of the object) is a "hint". Usually the hint is right, if the 
object in the address belongs to the latest graph. If not, using the 
GFID we resolve a new inode on the latest graph and use it.

In case of FD we can do something similar, except there are not GFIDs 
(which should not be a problem). We need to make the handle coming from 
FUSE (the address of fd_t) just a hint. If the 
fd->inode->table->xl->graph is the latest, then the hint was a HIT. If 
the graph was not the latest, we look for a previous migration 
attempt+result in the "base" (original) fd's context. If that does not 
exist or is not fresh (on the latest graph) then we do a new fd 
creation, open on new graph, fd_unref the old cached result in the fd 
context of the "base fd" and keep ref to this new result. All this must 
happen from fuse_resolve_fd(). The setting of the latest fd and updation 
of the latest fd pointer happens under the scope of the base_fd->lock() 
which gives it a very clear and unambiguous scope which was missing with 
the old scheme.

[The next step will be to nuke the fd->inode swapping in fuse_create_cbk]

Avati

On 05/21/2012 10:26 PM, Raghavendra Gowdappa wrote:

----- Original Message -----
From: "Pranith Kumar Karampuri"<pkarampu@xxxxxxxxxx>
To: "Anand Avati"<aavati@xxxxxxxxxx>
Cc: "Vijay Bellur"<vbellur@xxxxxxxxxx>, "Amar Tumballi"<atumball@xxxxxxxxxx>, "Krishnan Parthasarathi"
<kparthas@xxxxxxxxxx>, "Raghavendra Gowdappa"<rgowdapp@xxxxxxxxxx>
Sent: Tuesday, May 22, 2012 8:42:58 AM
Subject: Re: RFC on fix to bug #802414

Dude,
     We have already put logs yesterday in LOCK and UNLOCK and saw
     that the&fd->inode->lock address changed from LOCK to UNLOCK.

Yes, even I too believe that the hang is because of fd->inode swap in fuse_migrate_fd and not the one in fuse_create_cbk. We could clearly see in the log files following race:
fuse-mig-thr: acquires fd->inode->lock for swapping fd->inode (this was a naive fix - hold lock on inode in old graph - to the race-condition caused by swapping fd->inode, which didn't work)

poll-thr: tries to acquire fd->inode->lock (inode is old_inode present in old-graph) in afr_local_cleanup
fuse-mig-thr: swaps fd->inode and releases lock on old_inode->lock
poll-thr: gets woken up from lock call on old_inode->lock.
poll-thr: does its work, but while unlocking, uses fd->inode where inode belongs to new graph.

we had logs printing lock address before and after acquisition of lock and we could clearly see that lock address changed after acquiring lock in afr_local_cleanup.

"The hang in fuse_migrate_fd is _before_ the inode swap performed
there."
All the fds are opened on the same file. So all fds in the fd
migration point to same inode. The race is hit by nth fd, (n+1)th fd
hangs. We have seen that afr_local_cleanup was doing fd_unref, and
LOCK(fd->inode->lock) was done with one address then by the time
UNLOCK(fd->inode->lock) is done the address changed. So the next fd
that has to migrate hung because the prev inode lock is not
unlocked.

If after nth fd introduces the race a _cbk comes in epoll thread on
(n+1)th fd which tries to LOCK(fd->inode->lock) epoll thread will
hang.
Which is my theory for the hang we observed on Saturday.

Pranith.
----- Original Message -----
From: "Anand Avati"<aavati@xxxxxxxxxx>
To: "Raghavendra Gowdappa"<rgowdapp@xxxxxxxxxx>
Cc: "Vijay Bellur"<vbellur@xxxxxxxxxx>, "Amar Tumballi"
<atumball@xxxxxxxxxx>, "Krishnan Parthasarathi"
<kparthas@xxxxxxxxxx>, "Pranith Kumar Karampuri"
<pkarampu@xxxxxxxxxx>
Sent: Tuesday, May 22, 2012 2:09:33 AM
Subject: Re: RFC on fix to bug #802414

On 05/21/2012 11:11 AM, Raghavendra Gowdappa wrote:
Avati,

fuse_migrate_fd (running in reader thread - rdthr) assigns new
inode to fd, once it looks up inode in new graph. But this
assignment can race with code that accesses fd->inode->lock
executing in poll-thread (pthr) as follows

pthr: LOCK (fd->inode->lock); (inode in old graph)
rdthr: fd->inode = inode (resolved in new graph)
pthr: UNLOCK (fd->inode->lock); (inode in new graph)

The way I see it (the backtrace output in the other mail), the swap
happening in fuse_create_cbk() must be the one causing lock/unlock to
land on different inode objects. The hang in fuse_migrate_fd is
_before_
the inode swap performed there. Can you put some logs in
fuse_create_cbk()'s inode swap code and confirm this?

Now, any lock operations on inode in old graph will block. Thanks
to pranith for pointing to this race-condition.

The problem here is we don't have a single lock that can
synchronize assignment "fd->inode = inode" and other locking
attempts on fd->inode->lock. So, we are thinking that instead of
trying to synchronize, eliminate the parallel accesses altogether.
This can be done by splitting fd migration into two tasks.

1. Actions on old graph (like fsync to flush writes to disk)
2. Actions in new graph (lookup, open)

We can send PARENT_DOWN when,
1. Task 1 is complete.
2. No fop sent by fuse is pending.

on receiving PARENT_DOWN, protocol/client will shutdown transports.
As part of transport cleanup, all pending frames are unwound and
protocol/client will notify its parents with PARENT_DOWN_HANDLED
event. Each of the translator will pass this event to its parents
once it is convinced that there are no pending fops started by it
(like background self-heal, reads as part of read-ahead etc). Once
fuse receives PARENT_DOWN_HANDLED, it is guaranteed that there
will be no replies that will be racing with migration (note that
migration is done using syncops). At this point in time, it is
safe to start Task 2 (which associates fd with an inode in new
graph).

Also note that reader thread will not do other operations till it
completes both tasks.

As far as the implementation of this patch goes, major work is in
translators like read-ahead, afr, dht to provide the guarantee
required to send PARENT_DOWN_HANDLED event to their parents.

Please let me know your thoughts on this.

All the above steps might not apply if it is caused by the swap in
fuse_create_cbk(). Let's confirm that first.

Avati