Re: RFC on fix to bug #802414

Raghavendra Gowdappa <rgowdapp@xxxxxxxxxx> · Tue, 22 May 2012 04:44:00 -0400 (EDT)

----- Original Message -----
> From: "Anand Avati" <aavati@xxxxxxxxxx>
> To: "Raghavendra Gowdappa" <rgowdapp@xxxxxxxxxx>
> Cc: "Pranith Kumar Karampuri" <pkarampu@xxxxxxxxxx>, "Vijay Bellur" <vbellur@xxxxxxxxxx>, "Amar Tumballi"
> <atumball@xxxxxxxxxx>, "Krishnan Parthasarathi" <kparthas@xxxxxxxxxx>, gluster-devel@xxxxxxxxxx
> Sent: Tuesday, May 22, 2012 12:41:36 PM
> Subject: Re: RFC on fix to bug #802414
> 
> <in continuation from our chat>
> 
> The PARENT_DOWN_HANDLED approach will take us backwards from the
> current state, where we are resilient to frame losses and other
> classes of bugs (i.e., if a frame loss happens on either server or
> client, it only results in prevented graph cleanup but the graph
> switch still happens).
> 
> The root "cause" here is that we are giving up on a very important
> and
> fundamental principle of immutability on the fd object. The real
> solution here is to never modify fd->inode. Instead we must bring
> about
> a more native fd "migration" than just re-opening an existing fd on
> the
> new graph.
> 
> Think of the inode migration analogy. The handle coming from FUSE
> (the
> address of the object) is a "hint". Usually the hint is right, if the
> object in the address belongs to the latest graph. If not, using the
> GFID we resolve a new inode on the latest graph and use it.
> 
> In case of FD we can do something similar, except there are no GFIDs
> (which should not be a problem). We need to make the handle coming
> from
> FUSE (the address of fd_t) just a hint. If the
> fd->inode->table->xl->graph is the latest, then the hint was a HIT.
> If
> the graph was not the latest, we look for a previous migration
> attempt+result in the "base" (original) fd's context. If that does
> not
> exist or is not fresh (on the latest graph) then we do a new fd
> creation, open on new graph, fd_unref the old cached result in the fd
> context of the "base fd" and keep ref to this new result. All this
> must
> happen from fuse_resolve_fd(). The setting of the latest fd and the
> update of the latest fd pointer happen under the scope of
> base_fd->lock, which gives it a very clear and unambiguous scope
> that was missing with the old scheme.

I remember discussing this solution during the initial design, but I am not sure why we dropped it. So, can I go ahead with the implementation? Is this fix required post 3.3?
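
A minimal sketch of the "fd as a hint" scheme described in the quoted text above, under stated assumptions. The names sketch_fd_t, graph_t, sketch_fd_new, sketch_fd_ref, sketch_fd_unref and sketch_resolve_fd are hypothetical stand-ins and not the real GlusterFS types or APIs: the graph pointer stands in for fd->inode->table->xl->graph, the migrated field stands in for the migration result cached in the base fd's context, and creating a new object stands in for the open on the new graph. It only illustrates that the base fd is never modified apart from its cached result, and that the cache is updated under the base fd's lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-ins for illustration; NOT the real GlusterFS types. */
typedef struct graph { int id; } graph_t;

typedef struct sketch_fd {
    pthread_mutex_t lock;       /* plays the role of base_fd->lock */
    int refcount;
    graph_t *graph;             /* graph this fd was opened on; never reassigned */
    struct sketch_fd *migrated; /* cached migration result (base fd's context) */
} sketch_fd_t;

static sketch_fd_t *sketch_fd_new(graph_t *g)
{
    sketch_fd_t *fd = calloc(1, sizeof(*fd));
    if (!fd)
        return NULL;
    pthread_mutex_init(&fd->lock, NULL);
    fd->refcount = 1;           /* creation reference */
    fd->graph = g;
    return fd;
}

static sketch_fd_t *sketch_fd_ref(sketch_fd_t *fd)
{
    pthread_mutex_lock(&fd->lock);
    fd->refcount++;
    pthread_mutex_unlock(&fd->lock);
    return fd;
}

static void sketch_fd_unref(sketch_fd_t *fd)
{
    int refs;

    pthread_mutex_lock(&fd->lock);
    assert(fd->refcount > 0);
    refs = --fd->refcount;
    pthread_mutex_unlock(&fd->lock);
    if (refs == 0) {
        pthread_mutex_destroy(&fd->lock);
        free(fd);
    }
}

/* Treat `base` as a hint and return a referenced fd usable on `latest`,
 * or NULL if a new fd could not be allocated. */
static sketch_fd_t *sketch_resolve_fd(sketch_fd_t *base, graph_t *latest)
{
    sketch_fd_t *result = NULL;
    sketch_fd_t *stale = NULL;

    if (base->graph == latest)
        return sketch_fd_ref(base);             /* the hint was a HIT */

    pthread_mutex_lock(&base->lock);
    if (base->migrated && base->migrated->graph == latest) {
        result = sketch_fd_ref(base->migrated); /* fresh cached result */
    } else {
        /* Assumption: the real code would create and open an fd on the
         * new graph here; the sketch only allocates one. */
        sketch_fd_t *fresh = sketch_fd_new(latest);
        if (fresh) {
            stale = base->migrated;
            base->migrated = fresh;             /* base keeps the creation ref */
            result = sketch_fd_ref(fresh);
        }
    }
    pthread_mutex_unlock(&base->lock);

    if (stale)
        sketch_fd_unref(stale);                 /* drop the old cached result */
    return result;
}
```

In this sketch a caller would resolve once per request and drop the returned reference with sketch_fd_unref() when done. Calling sketch_resolve_fd() twice with the same latest graph returns the same cached fd, and base->graph is never reassigned, so no lock/unlock pair taken through the base fd can land on different objects.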

> 
> [The next step will be to nuke the fd->inode swapping in
> fuse_create_cbk]
> 
> Avati
> 
> On 05/21/2012 10:26 PM, Raghavendra Gowdappa wrote:
> >
> >
> > ----- Original Message -----
> >> From: "Pranith Kumar Karampuri"<pkarampu@xxxxxxxxxx>
> >> To: "Anand Avati"<aavati@xxxxxxxxxx>
> >> Cc: "Vijay Bellur"<vbellur@xxxxxxxxxx>, "Amar
> >> Tumballi"<atumball@xxxxxxxxxx>, "Krishnan Parthasarathi"
> >> <kparthas@xxxxxxxxxx>, "Raghavendra Gowdappa"<rgowdapp@xxxxxxxxxx>
> >> Sent: Tuesday, May 22, 2012 8:42:58 AM
> >> Subject: Re: RFC on fix to bug #802414
> >>
> >> Dude,
> >>      We have already put logs yesterday in LOCK and UNLOCK and saw
> >>      that the &fd->inode->lock address changed from LOCK to UNLOCK.
> >
> > Yes, I too believe that the hang is because of the fd->inode swap
> > in fuse_migrate_fd and not the one in fuse_create_cbk. We could
> > clearly see the following race in the log files:
> > fuse-mig-thr: acquires fd->inode->lock for swapping fd->inode (this
> > was a naive fix - hold lock on inode in old graph - to the
> > race-condition caused by swapping fd->inode, which didn't work)
> >
> > poll-thr: tries to acquire fd->inode->lock (inode is old_inode
> > present in old-graph) in afr_local_cleanup
> > fuse-mig-thr: swaps fd->inode and releases lock on old_inode->lock
> > poll-thr: gets woken up from lock call on old_inode->lock.
> > poll-thr: does its work, but while unlocking, uses fd->inode where
> > inode belongs to new graph.
> >
> > we had logs printing lock address before and after acquisition of
> > lock and we could clearly see that lock address changed after
> > acquiring lock in afr_local_cleanup.
> >
> >>
> >>>> "The hang in fuse_migrate_fd is _before_ the inode swap
> >>>> performed
> >>>> there."
> >> All the fds are opened on the same file. So all fds in the fd
> >> migration point to same inode. The race is hit by nth fd, (n+1)th
> >> fd
> >> hangs. We have seen that afr_local_cleanup was doing fd_unref, and
> >> LOCK(fd->inode->lock) was done with one address then by the time
> >> UNLOCK(fd->inode->lock) is done the address changed. So the next
> >> fd
> >> that has to migrate hung because the prev inode lock is not
> >> unlocked.
> >>
> >> If, after the nth fd introduces the race, a _cbk comes in the
> >> epoll thread on the (n+1)th fd and tries to
> >> LOCK(fd->inode->lock), the epoll thread will hang.
> >> That is my theory for the hang we observed on Saturday.
> >>
> >> Pranith.
> >> ----- Original Message -----
> >> From: "Anand Avati"<aavati@xxxxxxxxxx>
> >> To: "Raghavendra Gowdappa"<rgowdapp@xxxxxxxxxx>
> >> Cc: "Vijay Bellur"<vbellur@xxxxxxxxxx>, "Amar Tumballi"
> >> <atumball@xxxxxxxxxx>, "Krishnan Parthasarathi"
> >> <kparthas@xxxxxxxxxx>, "Pranith Kumar Karampuri"
> >> <pkarampu@xxxxxxxxxx>
> >> Sent: Tuesday, May 22, 2012 2:09:33 AM
> >> Subject: Re: RFC on fix to bug #802414
> >>
> >> On 05/21/2012 11:11 AM, Raghavendra Gowdappa wrote:
> >>> Avati,
> >>>
> >>> fuse_migrate_fd (running in reader thread - rdthr) assigns new
> >>> inode to fd, once it looks up inode in new graph. But this
> >>> assignment can race with code that accesses fd->inode->lock
> >>> executing in poll-thread (pthr) as follows
> >>>
> >>> pthr: LOCK (fd->inode->lock); (inode in old graph)
> >>> rdthr: fd->inode = inode (resolved in new graph)
> >>> pthr: UNLOCK (fd->inode->lock); (inode in new graph)
> >>>
> >>
> >> The way I see it (the backtrace output in the other mail), the
> >> swap
> >> happening in fuse_create_cbk() must be the one causing lock/unlock
> >> to
> >> land on different inode objects. The hang in fuse_migrate_fd is
> >> _before_
> >> the inode swap performed there. Can you put some logs in
> >> fuse_create_cbk()'s inode swap code and confirm this?
> >>
> >>
> >>> Now, any lock operations on inode in old graph will block. Thanks
> >>> to Pranith for pointing out this race condition.
> >>>
> >>> The problem here is we don't have a single lock that can
> >>> synchronize assignment "fd->inode = inode" and other locking
> >>> attempts on fd->inode->lock. So, we are thinking that instead of
> >>> trying to synchronize, eliminate the parallel accesses
> >>> altogether.
> >>> This can be done by splitting fd migration into two tasks.
> >>>
> >>> 1. Actions on old graph (like fsync to flush writes to disk)
> >>> 2. Actions in new graph (lookup, open)
> >>>
> >>> We can send PARENT_DOWN when,
> >>> 1. Task 1 is complete.
> >>> 2. No fop sent by fuse is pending.
> >>>
> >>> On receiving PARENT_DOWN, protocol/client will shut down
> >>> transports.
> >>> As part of transport cleanup, all pending frames are unwound and
> >>> protocol/client will notify its parents with PARENT_DOWN_HANDLED
> >>> event. Each translator will pass this event to its parents
> >>> once it is convinced that there are no pending fops started by it
> >>> (like background self-heal, reads as part of read-ahead etc).
> >>> Once
> >>> fuse receives PARENT_DOWN_HANDLED, it is guaranteed that there
> >>> will be no replies that will be racing with migration (note that
> >>> migration is done using syncops). At this point in time, it is
> >>> safe to start Task 2 (which associates fd with an inode in new
> >>> graph).
> >>>
> >>> Also note that reader thread will not do other operations till it
> >>> completes both tasks.
> >>>
> >>> As far as the implementation of this patch goes, major work is in
> >>> translators like read-ahead, afr, dht to provide the guarantee
> >>> required to send PARENT_DOWN_HANDLED event to their parents.
> >>>
> >>> Please let me know your thoughts on this.
> >>>
> >>
> >> All the above steps might not apply if it is caused by the swap in
> >> fuse_create_cbk(). Let's confirm that first.
> >>
> >> Avati
> >>
> 
>
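
For reference, the LOCK / swap / UNLOCK interleaving described in the quoted messages can be replayed deterministically in a single thread. This is a toy sketch only: toy_inode_t, toy_fd_t, toy_lock, toy_unlock, replay_with_swap and replay_immutable are hypothetical names, and a boolean flag stands in for the real inode lock so that the ordering of steps is fixed rather than left to the scheduler. It shows why the old inode stays locked when fd->inode is reassigned between LOCK and UNLOCK, and why the problem does not arise when fd->inode is never modified.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy types: a flag replaces the real lock so the
 * interleaving can be replayed step by step in one thread. */
typedef struct toy_inode { bool locked; } toy_inode_t;
typedef struct toy_fd { toy_inode_t *inode; } toy_fd_t;

static void toy_lock(toy_inode_t *inode)
{
    assert(!inode->locked);   /* a real thread would block here instead */
    inode->locked = true;
}

static void toy_unlock(toy_inode_t *inode)
{
    inode->locked = false;
}

/* Replays the reported order:
 *   pthr:  LOCK (fd->inode->lock)    (inode in old graph)
 *   rdthr: fd->inode = new_inode     (resolved in new graph)
 *   pthr:  UNLOCK (fd->inode->lock)  (now the inode in new graph)
 * Returns true if the old inode is left locked afterwards. */
static bool replay_with_swap(toy_fd_t *fd, toy_inode_t *new_inode)
{
    toy_inode_t *old_inode = fd->inode;

    toy_lock(fd->inode);      /* pthr */
    fd->inode = new_inode;    /* rdthr */
    toy_unlock(fd->inode);    /* pthr: lands on the new inode */
    return old_inode->locked;
}

/* Same lock/unlock pair, but fd->inode is never reassigned; migration
 * is assumed to produce a separate fd on the new graph instead.
 * Returns true if the inode is left locked afterwards. */
static bool replay_immutable(toy_fd_t *fd)
{
    toy_inode_t *inode = fd->inode;

    toy_lock(fd->inode);      /* pthr */
    /* rdthr: no assignment to fd->inode */
    toy_unlock(fd->inode);    /* pthr: same inode as the lock call */
    return inode->locked;
}
```

With the swap, the old inode remains locked forever, so the next fd that migrates (all fds point to the same inode) hangs on it. Without the swap, the lock is released on the same object it was taken on.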