Hi All, The following patch is sent to address changelog rpc mem-leak issues. The fix is intricate and needs lot of testing before taking in. Please review the same. http://review.gluster.org/#/c/13658/1 Thanks and Regards, Kotresh H R ----- Original Message ----- > From: "Venky Shankar" <vshankar@xxxxxxxxxx> > To: "Raghavendra G" <raghavendra@xxxxxxxxxxx> > Cc: "Kotresh Hiremath Ravishankar" <khiremat@xxxxxxxxxx>, "Gluster Devel" <gluster-devel@xxxxxxxxxxx> > Sent: Monday, March 7, 2016 3:52:13 PM > Subject: Re: Cores generated with ./tests/geo-rep/georep-basic-dr-tarssh.t > > On Fri, Mar 04, 2016 at 02:02:32PM +0530, Raghavendra G wrote: > > On Thu, Mar 3, 2016 at 6:26 PM, Kotresh Hiremath Ravishankar < > > khiremat@xxxxxxxxxx> wrote: > > > > > Hi, > > > > > > Yes, with this patch we need not set conn->trans to NULL in > > > rpc_clnt_disable > > > > > > > While [1] fixes the crash, things can be improved in the way how changelog > > is using rpc. > > > > 1. In the current code, there is an rpc_clnt object leak during disconnect > > event. > > Just for the record (as I couldn't find this information while building up > rpc infrastructure in changelog): > > Unref rpc-clnt object after calling rpc_clnt_disable() in notify() upon > RPC_CLNT_DISCONNECT. Free up 'mydata' in notify() upon RPC_CLNT_DESTROY. > > > 2. Also, freed "mydata" of changelog is still associated with rpc_clnt > > object (corollary of 1), though change log might not get any events with > > "mydata" (as connection is dead). > > > > I've discussed with Kotresh about changes needed, offline. So, following > > are the action items. > > 1. Soumya's patch [2] is valid and is needed for 3.7 branch too. > > 2. [2] can be accepted. However, someone might want to re-use an rpc object > > after disabling it, like introducing a new api rpc_clnt_enable_again > > (though no of such use-cases is very less). But [2] doesn't allow it. The > > point is as long as rpc-clnt object is alive, transport object is alive > > (though disconnected) and we can re-use it. So, I would prefer not to > > accept it. > > 3. Kotresh will work on new changes to make sure changelog makes correct > > use of rpc-clnt. > > > > [1] http://review.gluster.org/#/c/13592 > > [2] http://review.gluster.org/#/c/1359 > > > > regards, > > Raghavendra. > > > > > > > Thanks and Regards, > > > Kotresh H R > > > > > > ----- Original Message ----- > > > > From: "Soumya Koduri" <skoduri@xxxxxxxxxx> > > > > To: "Kotresh Hiremath Ravishankar" <khiremat@xxxxxxxxxx>, "Raghavendra > > > G" <raghavendra@xxxxxxxxxxx> > > > > Cc: "Gluster Devel" <gluster-devel@xxxxxxxxxxx> > > > > Sent: Thursday, March 3, 2016 5:06:00 PM > > > > Subject: Re: Cores generated with > > > ./tests/geo-rep/georep-basic-dr-tarssh.t > > > > > > > > > > > > > > > > On 03/03/2016 04:58 PM, Kotresh Hiremath Ravishankar wrote: > > > > > [Replying on top of my own reply] > > > > > > > > > > Hi, > > > > > > > > > > I have submitted the below patch [1] to avoid the issue of > > > > > 'rpc_clnt_submit' > > > > > getting reconnected. But it won't take care of memory leak problem > > > > > you > > > were > > > > > trying to fix. That we have to carefully go through all cases and fix > > > it. > > > > > Please have a look at it. > > > > > > > > > Looks good. IIUC, with this patch, we need not set conn->trans to NULL > > > > in 'rpc_clnt_disable()'. Right? If yes, then it takes care of memleak > > > > as > > > > the transport object shall then get freed as part of > > > > 'rpc_clnt_trigger_destroy'. > > > > > > > > > > > > > http://review.gluster.org/#/c/13592/ > > > > > > > > > > Thanks and Regards, > > > > > Kotresh H R > > > > > > > > > > ----- Original Message ----- > > > > >> From: "Kotresh Hiremath Ravishankar" <khiremat@xxxxxxxxxx> > > > > >> To: "Soumya Koduri" <skoduri@xxxxxxxxxx> > > > > >> Cc: "Raghavendra G" <raghavendra@xxxxxxxxxxx>, "Gluster Devel" > > > > >> <gluster-devel@xxxxxxxxxxx> > > > > >> Sent: Thursday, March 3, 2016 3:39:11 PM > > > > >> Subject: Re: Cores generated with > > > > >> ./tests/geo-rep/georep-basic-dr-tarssh.t > > > > >> > > > > >> Hi Soumya, > > > > >> > > > > >> I tested the lastes patch [2] on master where your previous patch > > > > >> [1] > > > in > > > > >> merged. > > > > >> I see crashes at different places. > > > > >> > > > > >> 1. If there are code paths that are holding rpc object without > > > > >> taking > > > ref > > > > >> on > > > > >> it, all those > > > > >> code path will crash on invoking rpc submit on that object as > > > > >> rpc > > > > >> object > > > > >> would have freed > > > > >> by last unref on DISCONNECT event. I see this kind of use-case > > > > >> in > > > > >> chagnelog rpc code. > > > > >> Need to check on other users of rpc. > > > > Agree. We should fix all such code-paths. Since this seem to be an > > > > intricate fix, shall we take these patches only in master branch and > > > > not > > > > in 3.7 release for now till we fix all such paths as we encounter? > > > > > > > > >> > > > > >> 2. And also we need to take care of reconnect timers that are being > > > set > > > > >> and > > > > >> are re-tried to > > > > >> connect back on expiration. In those cases also, we might crash > > > as rpc > > > > >> object would have freed. > > > > Your patch addresses this..right? > > > > > > > > Thanks, > > > > Soumya > > > > > > > > >> > > > > >> > > > > >> [1] http://review.gluster.org/#/c/13507/ > > > > >> [2] http://review.gluster.org/#/c/13587/ > > > > >> > > > > >> Thanks and Regards, > > > > >> Kotresh H R > > > > >> > > > > >> ----- Original Message ----- > > > > >>> From: "Soumya Koduri" <skoduri@xxxxxxxxxx> > > > > >>> To: "Raghavendra G" <raghavendra@xxxxxxxxxxx>, "Kotresh Hiremath > > > > >>> Ravishankar" <khiremat@xxxxxxxxxx> > > > > >>> Cc: "Gluster Devel" <gluster-devel@xxxxxxxxxxx> > > > > >>> Sent: Thursday, March 3, 2016 12:24:00 PM > > > > >>> Subject: Re: Cores generated with > > > > >>> ./tests/geo-rep/georep-basic-dr-tarssh.t > > > > >>> > > > > >>> Thanks a lot Kotresh. > > > > >>> > > > > >>> On 03/03/2016 08:47 AM, Raghavendra G wrote: > > > > >>>> Hi Soumya, > > > > >>>> > > > > >>>> Can you send a fix to this regression on upstream master too? This > > > patch > > > > >>>> is merged there. > > > > >>>> > > > > >>> I have submitted below patch. > > > > >>> http://review.gluster.org/#/c/13587/ > > > > >>> > > > > >>> Kindly review the same. > > > > >>> > > > > >>> Thanks, > > > > >>> Soumya > > > > >>> > > > > >>>> regards, > > > > >>>> Raghavendra > > > > >>>> > > > > >>>> On Tue, Mar 1, 2016 at 10:34 PM, Kotresh Hiremath Ravishankar > > > > >>>> <khiremat@xxxxxxxxxx <mailto:khiremat@xxxxxxxxxx>> wrote: > > > > >>>> > > > > >>>> Hi Soumya, > > > > >>>> > > > > >>>> I analysed the issue and found out that crash has happened > > > because > > > > >>>> of the patch [1]. > > > > >>>> > > > > >>>> The patch doesn't set transport object to NULL in > > > > >>>> 'rpc_clnt_disable' > > > > >>>> but instead does it on > > > > >>>> 'rpc_clnt_trigger_destroy'. So if there are pending rpc > > > invocations > > > > >>>> on the rpc object that > > > > >>>> is disabled (those instances are possible as happening now in > > > > >>>> changelog), it will trigger a > > > > >>>> CONNECT notify again with 'mydata' that is freed causing a > > > crash. > > > > >>>> This happens because > > > > >>>> 'rpc_clnt_submit' reconnects if rpc is not connected. > > > > >>>> > > > > >>>> rpc_clnt_submit (...) { > > > > >>>> ... > > > > >>>> if (conn->connected == 0) { > > > > >>>> ret = rpc_transport_connect > > > (conn->trans, > > > > >>>> > > > > >>>> conn->config.remote_port); > > > > >>>> } > > > > >>>> ... > > > > >>>> } > > > > >>>> > > > > >>>> Without your patch, conn->trans was set NULL and hence > > > > >>>> CONNECT > > > > >>>> fails > > > > >>>> not resulting with > > > > >>>> CONNECT notify call. And also the cleanup happens in failure > > > path. > > > > >>>> > > > > >>>> So the memory leak can happen, if there is no try for rpc > > > > >>>> invocation > > > > >>>> after DISCONNECT. > > > > >>>> It will be cleaned up otherwise. > > > > >>>> > > > > >>>> > > > > >>>> [1] http://review.gluster.org/#/c/13507/ > > > > >>>> > > > > >>>> Thanks and Regards, > > > > >>>> Kotresh H R > > > > >>>> > > > > >>>> ----- Original Message ----- > > > > >>>> > From: "Kotresh Hiremath Ravishankar" <khiremat@xxxxxxxxxx > > > > >>>> <mailto:khiremat@xxxxxxxxxx>> > > > > >>>> > To: "Soumya Koduri" <skoduri@xxxxxxxxxx > > > > >>>> > <mailto:skoduri@xxxxxxxxxx>> > > > > >>>> > Cc: avishwan@xxxxxxxxxx <mailto:avishwan@xxxxxxxxxx>, > > > "Gluster > > > > >>>> Devel" <gluster-devel@xxxxxxxxxxx > > > > >>>> <mailto:gluster-devel@xxxxxxxxxxx>> > > > > >>>> > Sent: Monday, February 29, 2016 4:15:22 PM > > > > >>>> > Subject: Re: Cores generated with > > > > >>>> ./tests/geo-rep/georep-basic-dr-tarssh.t > > > > >>>> > > > > > >>>> > Hi Soumya, > > > > >>>> > > > > > >>>> > I just tested that it is reproducible only with your patch > > > both > > > > >>>> in master and > > > > >>>> > 3.76 branch. > > > > >>>> > The geo-rep test cases are marked bad in master. So it's > > > not hit > > > > >>>> in master. > > > > >>>> > rpc is introduced > > > > >>>> > in changelog xlator to communicate to applications via > > > > >>>> libgfchangelog. > > > > >>>> > Venky/Me will check > > > > >>>> > why is the crash happening and will update. > > > > >>>> > > > > > >>>> > > > > > >>>> > Thanks and Regards, > > > > >>>> > Kotresh H R > > > > >>>> > > > > > >>>> > ----- Original Message ----- > > > > >>>> > > From: "Soumya Koduri" <skoduri@xxxxxxxxxx > > > > >>>> <mailto:skoduri@xxxxxxxxxx>> > > > > >>>> > > To: avishwan@xxxxxxxxxx <mailto:avishwan@xxxxxxxxxx>, > > > > >>>> > > "kotresh" > > > > >>>> <khiremat@xxxxxxxxxx <mailto:khiremat@xxxxxxxxxx>> > > > > >>>> > > Cc: "Gluster Devel" <gluster-devel@xxxxxxxxxxx > > > > >>>> <mailto:gluster-devel@xxxxxxxxxxx>> > > > > >>>> > > Sent: Monday, February 29, 2016 2:10:51 PM > > > > >>>> > > Subject: Cores generated with > > > > >>>> ./tests/geo-rep/georep-basic-dr-tarssh.t > > > > >>>> > > > > > > >>>> > > Hi Aravinda/Kotresh, > > > > >>>> > > > > > > >>>> > > With [1], I consistently see cores generated with the > > > > >>>> > > test > > > > >>>> > > './tests/geo-rep/georep-basic-dr-tarssh.t' in > > > > >>>> > > release-3.7 > > > > >>>> branch. From > > > > >>>> > > the cores, looks like we are trying to dereference a > > > > >>>> > > freed > > > > >>>> > > changelog_rpc_clnt_t(crpc) object in > > > changelog_rpc_notify(). > > > > >>>> Strangely > > > > >>>> > > this was not reported in master branch. > > > > >>>> > > > > > > >>>> > > I tried debugging but couldn't find any possible > > > suspects. I > > > > >>>> request you > > > > >>>> > > to take a look and let me know if [1] caused any > > > regression. > > > > >>>> > > > > > > >>>> > > Thanks, > > > > >>>> > > Soumya > > > > >>>> > > > > > > >>>> > > [1] http://review.gluster.org/#/c/13507/ > > > > >>>> > > > > > > >>>> > > > > > >>>> _______________________________________________ > > > > >>>> Gluster-devel mailing list > > > > >>>> Gluster-devel@xxxxxxxxxxx <mailto:Gluster-devel@xxxxxxxxxxx> > > > > >>>> http://www.gluster.org/mailman/listinfo/gluster-devel > > > > >>>> > > > > >>>> > > > > >>>> > > > > >>>> > > > > >>>> -- > > > > >>>> Raghavendra G > > > > >>> > > > > >> _______________________________________________ > > > > >> Gluster-devel mailing list > > > > >> Gluster-devel@xxxxxxxxxxx > > > > >> http://www.gluster.org/mailman/listinfo/gluster-devel > > > > >> > > > > > > > _______________________________________________ > > > Gluster-devel mailing list > > > Gluster-devel@xxxxxxxxxxx > > > http://www.gluster.org/mailman/listinfo/gluster-devel > > > > > > > > > > > -- > > Raghavendra G > > > _______________________________________________ > > Gluster-devel mailing list > > Gluster-devel@xxxxxxxxxxx > > http://www.gluster.org/mailman/listinfo/gluster-devel > > _______________________________________________ Gluster-devel mailing list Gluster-devel@xxxxxxxxxxx http://www.gluster.org/mailman/listinfo/gluster-devel