----- Original Message -----
> From: "Raghavendra Gowdappa" <rgowdapp@xxxxxxxxxx>
> To: "Anand Avati" <aavati@xxxxxxxxxx>
> Cc: "Pranith Kumar Karampuri" <pkarampu@xxxxxxxxxx>, "Vijay Bellur" <vbellur@xxxxxxxxxx>, "Amar Tumballi"
> <atumball@xxxxxxxxxx>, "Krishnan Parthasarathi" <kparthas@xxxxxxxxxx>, gluster-devel@xxxxxxxxxx
> Sent: Tuesday, June 5, 2012 5:14:50 PM
> Subject: Re: RFC on fix to bug #802414
>
> Avati,
>
> ----- Original Message -----
> > From: "Anand Avati" <aavati@xxxxxxxxxx>
> > To: "Raghavendra Gowdappa" <rgowdapp@xxxxxxxxxx>
> > Cc: "Pranith Kumar Karampuri" <pkarampu@xxxxxxxxxx>, "Vijay Bellur" <vbellur@xxxxxxxxxx>, "Amar Tumballi"
> > <atumball@xxxxxxxxxx>, "Krishnan Parthasarathi" <kparthas@xxxxxxxxxx>, gluster-devel@xxxxxxxxxx
> > Sent: Tuesday, May 22, 2012 12:41:36 PM
> > Subject: Re: RFC on fix to bug #802414
> >
> > <in continuation from our chat>
> >
> > The PARENT_DOWN_HANDLED approach will take us backwards from the current
> > state, where we are resilient to frame losses and other classes of bugs
> > (i.e., if a frame loss happens on either the server or the client, it
> > only results in prevented graph cleanup; the graph switch still happens).
> >
> > The root "cause" here is that we are giving up on a very important and
> > fundamental principle: immutability of the fd object. The real solution
> > here is to never modify fd->inode. Instead we must bring about a more
> > native fd "migration" than just re-opening an existing fd on the new
> > graph.
> >
> > Think of the inode migration analogy. The handle coming from FUSE (the
> > address of the object) is a "hint". Usually the hint is right, if the
> > object at that address belongs to the latest graph. If not, we use the
> > GFID to resolve a new inode on the latest graph and use it.
> >
> > In the case of fds we can do something similar, except there are no
> > GFIDs (which should not be a problem). We need to make the handle coming
> > from FUSE (the address of the fd_t) just a hint. If
> > fd->inode->table->xl->graph is the latest, the hint was a HIT. If the
> > graph is not the latest, we look for a previous migration attempt+result
> > in the "base" (original) fd's context. If that does not exist, or is not
> > fresh (on the latest graph), we create a new fd, open it on the new
> > graph, fd_unref the old cached result in the context of the "base fd",
> > and keep a ref to this new result. All this must happen from
> > fuse_resolve_fd().
>
> Should we make this migration "on demand" (the way inode migration
> happens), or can we retain the current approach of migrating all opened
> fds en masse and attempt on-demand migration in fuse_resolve_fd only for
> those fds on which migration was never attempted
> (7503c63ee141931556cf066b)? On a related note, if we are creating a new
> fd, we would be losing all the context in the old fd, so the automagic
> lock migration (to the new graph) in protocol/client won't happen. We
> should be migrating fd contexts explicitly. If so, we need to discuss the
> specifics of that.
>
> > The setting of the latest fd and the update of the latest-fd pointer
> > happen under the scope of base_fd->lock(), which gives it a very clear
> > and unambiguous scope that was missing in the old scheme.
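[A minimal sketch of the hint-based resolution described above.
fd_ctx_get_migrated(), fd_ctx_set_migrated() and fd_open_on_graph() are
hypothetical helper names, not actual GlusterFS API; races between two
concurrent misses are ignored for brevity:]

    /* Treat the fd address coming from FUSE as a hint; on a miss,
     * migrate to the latest graph and cache the result in the base
     * fd's context, under base_fd->lock. */
    static fd_t *
    fuse_resolve_fd_hint (xlator_t *this, fd_t *base_fd)
    {
            glusterfs_graph_t *latest = this->ctx->active;
            fd_t              *active = NULL, *new_fd = NULL;

            /* HIT: the fd already belongs to the latest graph. */
            if (base_fd->inode->table->xl->graph == latest)
                    return fd_ref (base_fd);

            LOCK (&base_fd->lock);
            {
                    active = fd_ctx_get_migrated (base_fd);
                    if (active && active->inode->table->xl->graph == latest)
                            active = fd_ref (active); /* fresh cached result */
                    else
                            active = NULL;            /* stale or absent */
            }
            UNLOCK (&base_fd->lock);

            if (active)
                    return active;

            /* MISS: create and open a new fd on the latest graph
             * (lookup + open via syncop, outside the spinlock). */
            new_fd = fd_open_on_graph (latest, base_fd);

            LOCK (&base_fd->lock);
            {
                    active = fd_ctx_get_migrated (base_fd);
                    if (active)
                            fd_unref (active);        /* drop old cached result */
                    fd_ctx_set_migrated (base_fd, fd_ref (new_fd));
            }
            UNLOCK (&base_fd->lock);

            return new_fd;
    }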
> >
> > [The next step will be to nuke the fd->inode swapping in
> > fuse_create_cbk.]
> >
> > Avati
> >
> > On 05/21/2012 10:26 PM, Raghavendra Gowdappa wrote:
> > >
> > > ----- Original Message -----
> > >> From: "Pranith Kumar Karampuri" <pkarampu@xxxxxxxxxx>
> > >> To: "Anand Avati" <aavati@xxxxxxxxxx>
> > >> Cc: "Vijay Bellur" <vbellur@xxxxxxxxxx>, "Amar Tumballi" <atumball@xxxxxxxxxx>,
> > >> "Krishnan Parthasarathi" <kparthas@xxxxxxxxxx>, "Raghavendra Gowdappa" <rgowdapp@xxxxxxxxxx>
> > >> Sent: Tuesday, May 22, 2012 8:42:58 AM
> > >> Subject: Re: RFC on fix to bug #802414
> > >>
> > >> Dude,
> > >>      We have already put logs yesterday in LOCK and UNLOCK and saw
> > >> that the &fd->inode->lock address changed between LOCK and UNLOCK.
> > >
> > > Yes, I too believe that the hang is because of the fd->inode swap in
> > > fuse_migrate_fd, and not the one in fuse_create_cbk. We could clearly
> > > see the following race in the log files:
> > >
> > > fuse-mig-thr: acquires fd->inode->lock for swapping fd->inode (this
> > > was a naive fix for the race condition caused by swapping fd->inode:
> > > hold the lock on the inode in the old graph; it didn't work)
> > > poll-thr: tries to acquire fd->inode->lock (the inode is old_inode,
> > > present in the old graph) in afr_local_cleanup
> > > fuse-mig-thr: swaps fd->inode and releases old_inode->lock
> > > poll-thr: gets woken up from the lock call on old_inode->lock
> > > poll-thr: does its work, but while unlocking, uses fd->inode, where
> > > the inode now belongs to the new graph
> > >
> > > We had logs printing the lock address before and after acquisition of
> > > the lock, and we could clearly see that the lock address changed after
> > > acquiring the lock in afr_local_cleanup.
> > >
> > >>>> "The hang in fuse_migrate_fd is _before_ the inode swap performed
> > >>>> there."
> > >>
> > >> All the fds are opened on the same file, so all the fds in the fd
> > >> migration point to the same inode. The race is hit by the nth fd; the
> > >> (n+1)th fd hangs. We have seen that afr_local_cleanup was doing an
> > >> fd_unref, and LOCK(fd->inode->lock) was done on one address, but by
> > >> the time UNLOCK(fd->inode->lock) was done the address had changed. So
> > >> the next fd that had to migrate hung, because the previous inode's
> > >> lock was never unlocked.
> > >>
> > >> If, after the nth fd introduces the race, a _cbk comes in the epoll
> > >> thread on the (n+1)th fd and tries to LOCK(fd->inode->lock), the
> > >> epoll thread will hang. That is my theory for the hang we observed on
> > >> Saturday.
> > >>
> > >> Pranith.
> > >>
> > >> ----- Original Message -----
> > >> From: "Anand Avati" <aavati@xxxxxxxxxx>
> > >> To: "Raghavendra Gowdappa" <rgowdapp@xxxxxxxxxx>
> > >> Cc: "Vijay Bellur" <vbellur@xxxxxxxxxx>, "Amar Tumballi" <atumball@xxxxxxxxxx>,
> > >> "Krishnan Parthasarathi" <kparthas@xxxxxxxxxx>, "Pranith Kumar Karampuri" <pkarampu@xxxxxxxxxx>
> > >> Sent: Tuesday, May 22, 2012 2:09:33 AM
> > >> Subject: Re: RFC on fix to bug #802414
> > >>
> > >> On 05/21/2012 11:11 AM, Raghavendra Gowdappa wrote:
> > >>> Avati,
> > >>>
> > >>> fuse_migrate_fd (running in the reader thread, rdthr) assigns a new
> > >>> inode to the fd once it looks up the inode in the new graph. But
> > >>> this assignment can race with code that accesses fd->inode->lock in
> > >>> the poll thread (pthr), as follows:
> > >>>
> > >>> pthr:  LOCK (fd->inode->lock);   (inode in old graph)
> > >>> rdthr: fd->inode = inode;        (resolved in new graph)
> > >>> pthr:  UNLOCK (fd->inode->lock); (inode in new graph)
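[A self-contained illustration of this interleaving (plain pthreads with
made-up struct names, not GlusterFS code). The unlock re-reads fd->inode,
so after the swap it releases the wrong mutex and old_inode stays locked
forever, which is why the next fd to migrate hangs:]

    #include <pthread.h>

    struct inode { pthread_mutex_t lock; };
    struct fd    { struct inode *inode; };

    static struct inode old_inode = { PTHREAD_MUTEX_INITIALIZER };
    static struct inode new_inode = { PTHREAD_MUTEX_INITIALIZER };
    static struct fd    fd        = { &old_inode };

    static void *
    poll_thr (void *arg)
    {
            pthread_mutex_lock (&fd.inode->lock);    /* locks old_inode */
            /* ... afr_local_cleanup work; fuse-mig-thr may swap here ... */
            pthread_mutex_unlock (&fd.inode->lock);  /* may unlock new_inode */
            return NULL;                             /* old_inode still held */
    }

    static void *
    fuse_mig_thr (void *arg)
    {
            fd.inode = &new_inode;   /* the swap in fuse_migrate_fd */
            return NULL;
    }

    int
    main (void)
    {
            pthread_t p, m;
            pthread_create (&p, NULL, poll_thr, NULL);
            pthread_create (&m, NULL, fuse_mig_thr, NULL);
            pthread_join (p, NULL);
            pthread_join (m, NULL);
            return 0;
    }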
> > >>
> > >> The way I see it (the backtrace output in the other mail), the swap
> > >> happening in fuse_create_cbk() must be the one causing lock/unlock to
> > >> land on different inode objects. The hang in fuse_migrate_fd is
> > >> _before_ the inode swap performed there. Can you put some logs in
> > >> fuse_create_cbk()'s inode swap code and confirm this?
> > >>
> > >>> Now, any lock operations on the inode in the old graph will block.
> > >>> Thanks to Pranith for pointing out this race condition.
> > >>>
> > >>> The problem here is that we don't have a single lock that can
> > >>> synchronize the assignment "fd->inode = inode" with other locking
> > >>> attempts on fd->inode->lock. So, we are thinking that instead of
> > >>> trying to synchronize, we eliminate the parallel accesses
> > >>> altogether. This can be done by splitting fd migration into two
> > >>> tasks:
> > >>>
> > >>> 1. Actions on the old graph (like fsync to flush writes to disk)
> > >>> 2. Actions on the new graph (lookup, open)
> > >>>
> > >>> We can send PARENT_DOWN when:
> > >>> 1. Task 1 is complete.
> > >>> 2. No fop sent by fuse is pending.
> > >>>
> > >>> On receiving PARENT_DOWN, protocol/client will shut down its
> > >>> transports. As part of transport cleanup, all pending frames are
> > >>> unwound, and protocol/client will notify its parents with a
> > >>> PARENT_DOWN_HANDLED event. Each translator will pass this event on
> > >>> to its parents once it is convinced that there are no pending fops
> > >>> started by itself (like background self-heal, or reads issued as
> > >>> part of read-ahead). Once fuse receives PARENT_DOWN_HANDLED, it is
> > >>> guaranteed that there will be no replies racing with migration
> > >>> (note that migration is done using syncops). At that point it is
> > >>> safe to start Task 2 (which associates the fd with an inode in the
> > >>> new graph).
> > >>>
> > >>> Also note that the reader thread will not do other operations until
> > >>> it completes both tasks.
> > >>>
> > >>> As far as the implementation of this patch goes, the major work is
> > >>> in translators like read-ahead, afr, and dht, to provide the
> > >>> guarantee required to send the PARENT_DOWN_HANDLED event to their
> > >>> parents.
> > >>>
> > >>> Please let me know your thoughts on this.
> > >>
> > >> All the above steps might not apply if the hang is caused by the
> > >> swap in fuse_create_cbk(). Let's confirm that first.
> > >>
> > >> Avati
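[For illustration, a sketch of the guarantee the RFC asks of translators
such as read-ahead or afr: hold back the proposed PARENT_DOWN_HANDLED
event until every fop the translator itself started has drained.
GF_EVENT_PARENT_DOWN is a real GlusterFS event; GF_EVENT_PARENT_DOWN_HANDLED
is only proposed in this thread, and drain_ctx_t and its fields are made up
for this sketch:]

    /* Per-xlator bookkeeping for draining self-initiated fops. */
    typedef struct {
            gf_lock_t lock;
            int       inflight;    /* fops this xlator started itself */
            int       pd_pending;  /* PARENT_DOWN_HANDLED seen, not yet sent up */
    } drain_ctx_t;

    int32_t
    notify (xlator_t *this, int32_t event, void *data, ...)
    {
            drain_ctx_t *ctx   = this->private;
            int          ready = 0;

            if (event != GF_EVENT_PARENT_DOWN_HANDLED)  /* hypothetical event */
                    return default_notify (this, event, data);

            LOCK (&ctx->lock);
            {
                    ctx->pd_pending = 1;
                    ready = (ctx->inflight == 0);
            }
            UNLOCK (&ctx->lock);

            /* default_notify passes events other than PARENT_UP/PARENT_DOWN
             * up to the parent translators. */
            if (ready)
                    default_notify (this, event, data);
            return 0;
    }

    /* Called from the callback of every fop this xlator initiated
     * (background self-heal, read-ahead reads, ...); forwards the
     * deferred event once the last in-flight fop completes. */
    static void
    drain_fop_done (xlator_t *this)
    {
            drain_ctx_t *ctx   = this->private;
            int          ready = 0;

            LOCK (&ctx->lock);
            {
                    ready = (--ctx->inflight == 0 && ctx->pd_pending);
            }
            UNLOCK (&ctx->lock);

            if (ready)
                    default_notify (this, GF_EVENT_PARENT_DOWN_HANDLED, NULL);
    }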