----- Original Message ----- > From: "Anand Avati" <aavati@xxxxxxxxxx> > To: "Raghavendra Gowdappa" <rgowdapp@xxxxxxxxxx> > Cc: "Pranith Kumar Karampuri" <pkarampu@xxxxxxxxxx>, "Vijay Bellur" <vbellur@xxxxxxxxxx>, "Amar Tumballi" > <atumball@xxxxxxxxxx>, "Krishnan Parthasarathi" <kparthas@xxxxxxxxxx>, gluster-devel@xxxxxxxxxx > Sent: Tuesday, May 22, 2012 12:41:36 PM > Subject: Re: RFC on fix to bug #802414 > > <in continuation from our chat> > > The PARENT_DOWN_HANDLED approach will take us backwards from the > current > state where we are resiliant to frame losses and other class of bugs > (i.e, if a frame loss happens on either server or client, it only > results in prevented graph cleanup but the graph switch still > happens). > > The root "cause" here is that we are giving up on a very important > and > fundamental principle of immutability on the fd object. The real > solution here is to never modify fd->inode. Instead we must bring > about > a more native fd "migration" than just re-opening an existing fd on > the > new graph. > > Think of the inode migration analogy. The handle coming from FUSE > (the > address of the object) is a "hint". Usually the hint is right, if the > object in the address belongs to the latest graph. If not, using the > GFID we resolve a new inode on the latest graph and use it. > > In case of FD we can do something similar, except there are not GFIDs > (which should not be a problem). We need to make the handle coming > from > FUSE (the address of fd_t) just a hint. If the > fd->inode->table->xl->graph is the latest, then the hint was a HIT. > If > the graph was not the latest, we look for a previous migration > attempt+result in the "base" (original) fd's context. If that does > not > exist or is not fresh (on the latest graph) then we do a new fd > creation, open on new graph, fd_unref the old cached result in the fd > context of the "base fd" and keep ref to this new result. All this > must > happen from fuse_resolve_fd(). The setting of the latest fd and > updation > of the latest fd pointer happens under the scope of the > base_fd->lock() > which gives it a very clear and unambiguous scope which was missing > with > the old scheme. I remember discussing this solution during initial design. But, not sure why we dropped it. So, Can I go ahead with the implementation? Is this fix required post 3.3? > > [The next step will be to nuke the fd->inode swapping in > fuse_create_cbk] > > Avati > > On 05/21/2012 10:26 PM, Raghavendra Gowdappa wrote: > > > > > > ----- Original Message ----- > >> From: "Pranith Kumar Karampuri"<pkarampu@xxxxxxxxxx> > >> To: "Anand Avati"<aavati@xxxxxxxxxx> > >> Cc: "Vijay Bellur"<vbellur@xxxxxxxxxx>, "Amar > >> Tumballi"<atumball@xxxxxxxxxx>, "Krishnan Parthasarathi" > >> <kparthas@xxxxxxxxxx>, "Raghavendra Gowdappa"<rgowdapp@xxxxxxxxxx> > >> Sent: Tuesday, May 22, 2012 8:42:58 AM > >> Subject: Re: RFC on fix to bug #802414 > >> > >> Dude, > >> We have already put logs yesterday in LOCK and UNLOCK and saw > >> that the&fd->inode->lock address changed from LOCK to UNLOCK. > > > > Yes, even I too believe that the hang is because of fd->inode swap > > in fuse_migrate_fd and not the one in fuse_create_cbk. We could > > clearly see in the log files following race: > > fuse-mig-thr: acquires fd->inode->lock for swapping fd->inode (this > > was a naive fix - hold lock on inode in old graph - to the > > race-condition caused by swapping fd->inode, which didn't work) > > > > poll-thr: tries to acquire fd->inode->lock (inode is old_inode > > present in old-graph) in afr_local_cleanup > > fuse-mig-thr: swaps fd->inode and releases lock on old_inode->lock > > poll-thr: gets woken up from lock call on old_inode->lock. > > poll-thr: does its work, but while unlocking, uses fd->inode where > > inode belongs to new graph. > > > > we had logs printing lock address before and after acquisition of > > lock and we could clearly see that lock address changed after > > acquiring lock in afr_local_cleanup. > > > >> > >>>> "The hang in fuse_migrate_fd is _before_ the inode swap > >>>> performed > >>>> there." > >> All the fds are opened on the same file. So all fds in the fd > >> migration point to same inode. The race is hit by nth fd, (n+1)th > >> fd > >> hangs. We have seen that afr_local_cleanup was doing fd_unref, and > >> LOCK(fd->inode->lock) was done with one address then by the time > >> UNLOCK(fd->inode->lock) is done the address changed. So the next > >> fd > >> that has to migrate hung because the prev inode lock is not > >> unlocked. > >> > >> If after nth fd introduces the race a _cbk comes in epoll thread > >> on > >> (n+1)th fd which tries to LOCK(fd->inode->lock) epoll thread will > >> hang. > >> Which is my theory for the hang we observed on Saturday. > >> > >> Pranith. > >> ----- Original Message ----- > >> From: "Anand Avati"<aavati@xxxxxxxxxx> > >> To: "Raghavendra Gowdappa"<rgowdapp@xxxxxxxxxx> > >> Cc: "Vijay Bellur"<vbellur@xxxxxxxxxx>, "Amar Tumballi" > >> <atumball@xxxxxxxxxx>, "Krishnan Parthasarathi" > >> <kparthas@xxxxxxxxxx>, "Pranith Kumar Karampuri" > >> <pkarampu@xxxxxxxxxx> > >> Sent: Tuesday, May 22, 2012 2:09:33 AM > >> Subject: Re: RFC on fix to bug #802414 > >> > >> On 05/21/2012 11:11 AM, Raghavendra Gowdappa wrote: > >>> Avati, > >>> > >>> fuse_migrate_fd (running in reader thread - rdthr) assigns new > >>> inode to fd, once it looks up inode in new graph. But this > >>> assignment can race with code that accesses fd->inode->lock > >>> executing in poll-thread (pthr) as follows > >>> > >>> pthr: LOCK (fd->inode->lock); (inode in old graph) > >>> rdthr: fd->inode = inode (resolved in new graph) > >>> pthr: UNLOCK (fd->inode->lock); (inode in new graph) > >>> > >> > >> The way I see it (the backtrace output in the other mail), the > >> swap > >> happening in fuse_create_cbk() must be the one causing lock/unlock > >> to > >> land on different inode objects. The hang in fuse_migrate_fd is > >> _before_ > >> the inode swap performed there. Can you put some logs in > >> fuse_create_cbk()'s inode swap code and confirm this? > >> > >> > >>> Now, any lock operations on inode in old graph will block. Thanks > >>> to pranith for pointing to this race-condition. > >>> > >>> The problem here is we don't have a single lock that can > >>> synchronize assignment "fd->inode = inode" and other locking > >>> attempts on fd->inode->lock. So, we are thinking that instead of > >>> trying to synchronize, eliminate the parallel accesses > >>> altogether. > >>> This can be done by splitting fd migration into two tasks. > >>> > >>> 1. Actions on old graph (like fsync to flush writes to disk) > >>> 2. Actions in new graph (lookup, open) > >>> > >>> We can send PARENT_DOWN when, > >>> 1. Task 1 is complete. > >>> 2. No fop sent by fuse is pending. > >>> > >>> on receiving PARENT_DOWN, protocol/client will shutdown > >>> transports. > >>> As part of transport cleanup, all pending frames are unwound and > >>> protocol/client will notify its parents with PARENT_DOWN_HANDLED > >>> event. Each of the translator will pass this event to its parents > >>> once it is convinced that there are no pending fops started by it > >>> (like background self-heal, reads as part of read-ahead etc). > >>> Once > >>> fuse receives PARENT_DOWN_HANDLED, it is guaranteed that there > >>> will be no replies that will be racing with migration (note that > >>> migration is done using syncops). At this point in time, it is > >>> safe to start Task 2 (which associates fd with an inode in new > >>> graph). > >>> > >>> Also note that reader thread will not do other operations till it > >>> completes both tasks. > >>> > >>> As far as the implementation of this patch goes, major work is in > >>> translators like read-ahead, afr, dht to provide the guarantee > >>> required to send PARENT_DOWN_HANDLED event to their parents. > >>> > >>> Please let me know your thoughts on this. > >>> > >> > >> All the above steps might not apply if it is caused by the swap in > >> fuse_create_cbk(). Let's confirm that first. > >> > >> Avati > >> > >