----- Original Message -----
> From: "Raghavendra Gowdappa" <rgowdapp@xxxxxxxxxx>
> To: "Anand Avati" <aavati@xxxxxxxxxx>
> Cc: "Pranith Kumar Karampuri" <pkarampu@xxxxxxxxxx>, "Vijay Bellur" <vbellur@xxxxxxxxxx>, "Amar Tumballi"
> <atumball@xxxxxxxxxx>, "Krishnan Parthasarathi" <kparthas@xxxxxxxxxx>, gluster-devel@xxxxxxxxxx
> Sent: Tuesday, June 5, 2012 5:14:50 PM
> Subject: Re: RFC on fix to bug #802414
>
> Avati,
>
> ----- Original Message -----
> > From: "Anand Avati" <aavati@xxxxxxxxxx>
> > To: "Raghavendra Gowdappa" <rgowdapp@xxxxxxxxxx>
> > Cc: "Pranith Kumar Karampuri" <pkarampu@xxxxxxxxxx>, "Vijay Bellur" <vbellur@xxxxxxxxxx>, "Amar Tumballi"
> > <atumball@xxxxxxxxxx>, "Krishnan Parthasarathi" <kparthas@xxxxxxxxxx>, gluster-devel@xxxxxxxxxx
> > Sent: Tuesday, May 22, 2012 12:41:36 PM
> > Subject: Re: RFC on fix to bug #802414
> >
> > <in continuation from our chat>
> >
> > The PARENT_DOWN_HANDLED approach will take us backwards from the current
> > state, where we are resilient to frame losses and other classes of bugs
> > (i.e., if a frame loss happens on either the server or the client, it
> > only results in prevented graph cleanup; the graph switch still happens).
> >
> > The root "cause" here is that we are giving up on a very important and
> > fundamental principle: immutability of the fd object. The real solution
> > here is to never modify fd->inode. Instead we must bring about a more
> > native fd "migration" than just re-opening an existing fd on the new
> > graph.
> >
> > Think of the inode migration analogy. The handle coming from FUSE (the
> > address of the object) is a "hint". Usually the hint is right, if the
> > object at that address belongs to the latest graph. If not, we use the
> > GFID to resolve a new inode on the latest graph and use it.
> >
> > In the case of fds we can do something similar, except there are no
> > GFIDs (which should not be a problem). We need to make the handle coming
> > from FUSE (the address of the fd_t) just a hint. If
> > fd->inode->table->xl->graph is the latest, the hint was a HIT. If the
> > graph is not the latest, we look for a previous migration attempt+result
> > in the "base" (original) fd's context. If that does not exist, or is not
> > fresh (on the latest graph), we create a new fd, open it on the new
> > graph, fd_unref the old cached result in the context of the "base fd",
> > and keep a ref to this new result. All this must happen from
> > fuse_resolve_fd().
>
> Should we make this migration "on demand" (the way inode migration
> happens), or can we retain the current approach of migrating all opened
> fds en masse and attempt on-demand migration in fuse_resolve_fd only for
> those fds on which migration was never attempted
> (7503c63ee141931556cf066b)? On a related note, if we are creating a new
> fd, we would be losing all the context in the old fd, so the automagic
> lock migration (to the new graph) in protocol/client won't happen. We
> should be migrating fd contexts explicitly. If so, we need to discuss the
> specifics of that.
>
> > The setting of the latest fd and the update of the latest-fd pointer
> > happen under the scope of base_fd->lock(), which gives it a very clear
> > and unambiguous scope that was missing in the old scheme.
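[A minimal sketch of the hint-based resolution described above.
fd_ctx_get_migrated(), fd_ctx_set_migrated() and fd_open_on_graph() are
hypothetical helper names, not actual GlusterFS API; races between two
concurrent misses are ignored for brevity:]

    /* Treat the fd address coming from FUSE as a hint; on a miss,
     * migrate to the latest graph and cache the result in the base
     * fd's context, under base_fd->lock. */
    static fd_t *
    fuse_resolve_fd_hint (xlator_t *this, fd_t *base_fd)
    {
            glusterfs_graph_t *latest = this->ctx->active;
            fd_t              *active = NULL, *new_fd = NULL;

            /* HIT: the fd already belongs to the latest graph. */
            if (base_fd->inode->table->xl->graph == latest)
                    return fd_ref (base_fd);

            LOCK (&base_fd->lock);
            {
                    active = fd_ctx_get_migrated (base_fd);
                    if (active && active->inode->table->xl->graph == latest)
                            active = fd_ref (active); /* fresh cached result */
                    else
                            active = NULL;            /* stale or absent */
            }
            UNLOCK (&base_fd->lock);

            if (active)
                    return active;

            /* MISS: create and open a new fd on the latest graph
             * (lookup + open via syncop, outside the spinlock). */
            new_fd = fd_open_on_graph (latest, base_fd);

            LOCK (&base_fd->lock);
            {
                    active = fd_ctx_get_migrated (base_fd);
                    if (active)
                            fd_unref (active);        /* drop old cached result */
                    fd_ctx_set_migrated (base_fd, fd_ref (new_fd));
            }
            UNLOCK (&base_fd->lock);

            return new_fd;
    }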
> >
> > [The next step will be to nuke the fd->inode swapping in
> > fuse_create_cbk.]
> >
> > Avati
> >
> > On 05/21/2012 10:26 PM, Raghavendra Gowdappa wrote:
> > >
> > > ----- Original Message -----
> > >> From: "Pranith Kumar Karampuri" <pkarampu@xxxxxxxxxx>
> > >> To: "Anand Avati" <aavati@xxxxxxxxxx>
> > >> Cc: "Vijay Bellur" <vbellur@xxxxxxxxxx>, "Amar Tumballi" <atumball@xxxxxxxxxx>,
> > >> "Krishnan Parthasarathi" <kparthas@xxxxxxxxxx>, "Raghavendra Gowdappa" <rgowdapp@xxxxxxxxxx>
> > >> Sent: Tuesday, May 22, 2012 8:42:58 AM
> > >> Subject: Re: RFC on fix to bug #802414
> > >>
> > >> Dude,
> > >>      We have already put logs yesterday in LOCK and UNLOCK and saw
> > >> that the &fd->inode->lock address changed between LOCK and UNLOCK.
> > >
> > > Yes, I too believe that the hang is because of the fd->inode swap in
> > > fuse_migrate_fd, and not the one in fuse_create_cbk. We could clearly
> > > see the following race in the log files:
> > >
> > > fuse-mig-thr: acquires fd->inode->lock for swapping fd->inode (this
> > > was a naive fix for the race condition caused by swapping fd->inode:
> > > hold the lock on the inode in the old graph; it didn't work)
> > > poll-thr: tries to acquire fd->inode->lock (the inode is old_inode,
> > > present in the old graph) in afr_local_cleanup
> > > fuse-mig-thr: swaps fd->inode and releases old_inode->lock
> > > poll-thr: gets woken up from the lock call on old_inode->lock
> > > poll-thr: does its work, but while unlocking, uses fd->inode, where
> > > the inode now belongs to the new graph
> > >
> > > We had logs printing the lock address before and after acquisition of
> > > the lock, and we could clearly see that the lock address changed after
> > > acquiring the lock in afr_local_cleanup.
> > >
> > >>>> "The hang in fuse_migrate_fd is _before_ the inode swap performed
> > >>>> there."
> > >>
> > >> All the fds are opened on the same file, so all the fds in the fd
> > >> migration point to the same inode. The race is hit by the nth fd; the
> > >> (n+1)th fd hangs. We have seen that afr_local_cleanup was doing an
> > >> fd_unref, and LOCK(fd->inode->lock) was done on one address, but by
> > >> the time UNLOCK(fd->inode->lock) was done the address had changed. So
> > >> the next fd that had to migrate hung, because the previous inode's
> > >> lock was never unlocked.
> > >>
> > >> If, after the nth fd introduces the race, a _cbk comes in the epoll
> > >> thread on the (n+1)th fd and tries to LOCK(fd->inode->lock), the
> > >> epoll thread will hang. That is my theory for the hang we observed on
> > >> Saturday.
> > >>
> > >> Pranith.
> > >>
> > >> ----- Original Message -----
> > >> From: "Anand Avati" <aavati@xxxxxxxxxx>
> > >> To: "Raghavendra Gowdappa" <rgowdapp@xxxxxxxxxx>
> > >> Cc: "Vijay Bellur" <vbellur@xxxxxxxxxx>, "Amar Tumballi" <atumball@xxxxxxxxxx>,
> > >> "Krishnan Parthasarathi" <kparthas@xxxxxxxxxx>, "Pranith Kumar Karampuri" <pkarampu@xxxxxxxxxx>
> > >> Sent: Tuesday, May 22, 2012 2:09:33 AM
> > >> Subject: Re: RFC on fix to bug #802414
> > >>
> > >> On 05/21/2012 11:11 AM, Raghavendra Gowdappa wrote:
> > >>> Avati,
> > >>>
> > >>> fuse_migrate_fd (running in the reader thread, rdthr) assigns a new
> > >>> inode to the fd once it looks up the inode in the new graph. But
> > >>> this assignment can race with code that accesses fd->inode->lock in
> > >>> the poll thread (pthr), as follows:
> > >>>
> > >>> pthr:  LOCK (fd->inode->lock);   (inode in old graph)
> > >>> rdthr: fd->inode = inode;        (resolved in new graph)
> > >>> pthr:  UNLOCK (fd->inode->lock); (inode in new graph)
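[A self-contained illustration of this interleaving (plain pthreads with
made-up struct names, not GlusterFS code). The unlock re-reads fd->inode,
so after the swap it releases the wrong mutex and old_inode stays locked
forever, which is why the next fd to migrate hangs:]

    #include <pthread.h>

    struct inode { pthread_mutex_t lock; };
    struct fd    { struct inode *inode; };

    static struct inode old_inode = { PTHREAD_MUTEX_INITIALIZER };
    static struct inode new_inode = { PTHREAD_MUTEX_INITIALIZER };
    static struct fd    fd        = { &old_inode };

    static void *
    poll_thr (void *arg)
    {
            pthread_mutex_lock (&fd.inode->lock);    /* locks old_inode */
            /* ... afr_local_cleanup work; fuse-mig-thr may swap here ... */
            pthread_mutex_unlock (&fd.inode->lock);  /* may unlock new_inode */
            return NULL;                             /* old_inode still held */
    }

    static void *
    fuse_mig_thr (void *arg)
    {
            fd.inode = &new_inode;   /* the swap in fuse_migrate_fd */
            return NULL;
    }

    int
    main (void)
    {
            pthread_t p, m;
            pthread_create (&p, NULL, poll_thr, NULL);
            pthread_create (&m, NULL, fuse_mig_thr, NULL);
            pthread_join (p, NULL);
            pthread_join (m, NULL);
            return 0;
    }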
> > >>
> > >> The way I see it (the backtrace output in the other mail), the swap
> > >> happening in fuse_create_cbk() must be the one causing lock/unlock to
> > >> land on different inode objects. The hang in fuse_migrate_fd is
> > >> _before_ the inode swap performed there. Can you put some logs in
> > >> fuse_create_cbk()'s inode swap code and confirm this?
> > >>
> > >>> Now, any lock operations on the inode in the old graph will block.
> > >>> Thanks to Pranith for pointing out this race condition.
> > >>>
> > >>> The problem here is that we don't have a single lock that can
> > >>> synchronize the assignment "fd->inode = inode" with other locking
> > >>> attempts on fd->inode->lock. So, we are thinking that instead of
> > >>> trying to synchronize, we eliminate the parallel accesses
> > >>> altogether. This can be done by splitting fd migration into two
> > >>> tasks:
> > >>>
> > >>> 1. Actions on the old graph (like fsync to flush writes to disk)
> > >>> 2. Actions on the new graph (lookup, open)
> > >>>
> > >>> We can send PARENT_DOWN when:
> > >>> 1. Task 1 is complete.
> > >>> 2. No fop sent by fuse is pending.
> > >>>
> > >>> On receiving PARENT_DOWN, protocol/client will shut down its
> > >>> transports. As part of transport cleanup, all pending frames are
> > >>> unwound, and protocol/client will notify its parents with a
> > >>> PARENT_DOWN_HANDLED event. Each translator will pass this event on
> > >>> to its parents once it is convinced that there are no pending fops
> > >>> started by itself (like background self-heal, or reads issued as
> > >>> part of read-ahead). Once fuse receives PARENT_DOWN_HANDLED, it is
> > >>> guaranteed that there will be no replies racing with migration
> > >>> (note that migration is done using syncops). At that point it is
> > >>> safe to start Task 2 (which associates the fd with an inode in the
> > >>> new graph).
> > >>>
> > >>> Also note that the reader thread will not do other operations until
> > >>> it completes both tasks.
> > >>>
> > >>> As far as the implementation of this patch goes, the major work is
> > >>> in translators like read-ahead, afr, and dht, to provide the
> > >>> guarantee required to send the PARENT_DOWN_HANDLED event to their
> > >>> parents.
> > >>>
> > >>> Please let me know your thoughts on this.
> > >>
> > >> All the above steps might not apply if the hang is caused by the
> > >> swap in fuse_create_cbk(). Let's confirm that first.
> > >>
> > >> Avati
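[For illustration, a sketch of the guarantee the RFC asks of translators
such as read-ahead or afr: hold back the proposed PARENT_DOWN_HANDLED
event until every fop the translator itself started has drained.
GF_EVENT_PARENT_DOWN is a real GlusterFS event; GF_EVENT_PARENT_DOWN_HANDLED
is only proposed in this thread, and drain_ctx_t and its fields are made up
for this sketch:]

    /* Per-xlator bookkeeping for draining self-initiated fops. */
    typedef struct {
            gf_lock_t lock;
            int       inflight;    /* fops this xlator started itself */
            int       pd_pending;  /* PARENT_DOWN_HANDLED seen, not yet sent up */
    } drain_ctx_t;

    int32_t
    notify (xlator_t *this, int32_t event, void *data, ...)
    {
            drain_ctx_t *ctx   = this->private;
            int          ready = 0;

            if (event != GF_EVENT_PARENT_DOWN_HANDLED)  /* hypothetical event */
                    return default_notify (this, event, data);

            LOCK (&ctx->lock);
            {
                    ctx->pd_pending = 1;
                    ready = (ctx->inflight == 0);
            }
            UNLOCK (&ctx->lock);

            /* default_notify passes events other than PARENT_UP/PARENT_DOWN
             * up to the parent translators. */
            if (ready)
                    default_notify (this, event, data);
            return 0;
    }

    /* Called from the callback of every fop this xlator initiated
     * (background self-heal, read-ahead reads, ...); forwards the
     * deferred event once the last in-flight fop completes. */
    static void
    drain_fop_done (xlator_t *this)
    {
            drain_ctx_t *ctx   = this->private;
            int          ready = 0;

            LOCK (&ctx->lock);
            {
                    ready = (--ctx->inflight == 0 && ctx->pd_pending);
            }
            UNLOCK (&ctx->lock);

            if (ready)
                    default_notify (this, GF_EVENT_PARENT_DOWN_HANDLED, NULL);
    }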