Re: bad reference counting for module (was Re: BUG in sctp crashes sles10sp2 kernel)

Michal Hocko <mhocko@xxxxxxx> · Thu, 5 Feb 2009 10:18:43 +0100

On Fri 09-01-09 16:05:19, Vlad Yasevich wrote:
> Michal Hocko wrote:
> > Hi Vlad,
> > 
> > On Thu 08-01-09 11:56:28, Vlad Yasevich wrote:
> >> Michal Hocko wrote:
> > [...]
> >>> However we are currently seeing another issue. It is not a crash (only
> >>> the process is killed, with a BUG message in the log - see attached) but
> >>> a problem with module reference counting (the
> >>> BUG_ON(module_refcount(module)==0) in __module_get is triggered).
> >>>
> >>> I am not sure whether this is a real problem, because we were able to
> >>> trigger this only on _one_ testing configuration while other one is OK.
> >>>
> >>> I have checked all places where sctp decreases module reference count
> >>> (sock_put) and it seems that all places are correctly balanced with
> >>> sock_hold resp. __module_get:
> >>> - sctp_association_init vs. sctp_association_destroy
> >>> - sctp_association_migrate - put for old and hold for new
> >>> - sctp_endpoint_init vs. sctp_endpoint_destroy
> >>> - sctp_close - one artificial hold because of sk_common_release (which calls
> >>>                put)
> >>>              - one put balanced with sys_accept which calls __module_get
> >>>
> >>> And all sock_put calls correspond to the current upstream.
> >>>
> >>> Do you have any idea or remember any problem in this area which could
> >>> trigger this? 
> >>>
> >>> It smells like either some misconfiguration of the testing system,
> >>> another race condition, or I am just overlooking something.
> >>>
> >>>
> >> Try this commit: 027f6e1ad32de32f9fe1c61d0f744e329e8acfd9
> >> SCTP: Fix a potential race between timers and receive path.
> > 
> > Thanks for this tip, but the same result on the same testing machine. 
> > 
> >> Also, what does lsmod tell you about the reference count on sctp module?
> > 
> > It showed something like 48 when I asked the tester about that. Do you
> > want some finer-grained values (at 1s intervals)? I understand that this
> > value can change rapidly so it is very imprecise.
> > I would like to try another test HW configuration on Monday
> > (to see whether we are able to reproduce it at all).
> 
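For the finer-grained sampling mentioned above, a minimal sketch follows. It assumes the usual /proc/modules layout (name, size, use count, users, state, address); the function names, the 1s default interval and the sample count are only illustrative.

```python
import time


def module_refcount(proc_modules_text, name="sctp"):
    """Return the use count of `name` from /proc/modules content, or None."""
    for line in proc_modules_text.splitlines():
        fields = line.split()
        # Assumed layout: name size use_count users state address
        if len(fields) >= 3 and fields[0] == name:
            # The use count is "-" when the kernel cannot unload modules.
            return int(fields[2]) if fields[2].isdigit() else None
    return None


def sample(name="sctp", interval=1.0, count=10, path="/proc/modules"):
    """Print `count` timestamped samples taken `interval` seconds apart."""
    for _ in range(count):
        with open(path) as f:
            refs = module_refcount(f.read(), name)
        print(f"{time.strftime('%H:%M:%S')} {name} refcount={refs}")
        time.sleep(interval)
```

Calling sample() on the test machine while the load is running would give one reading per second. The value is still only a snapshot, so short dips between samples can be missed.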
> If you are still running the same test, then 48 makes sense.  It
> should start at 22 with just the listening sockets (10 servers, 1 socket
> each, 2 refs per socket + sctp control socket).  Then it should go
> to 42 as the accepted sockets are created.  It should then fluctuate
> around 40 refs on the module.  I guess it could be possible for it
> to dip back down to 22, but that would be extremely unlikely, as it
> would require every application to close its accepted socket and not
> accept any more.
> 
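The figures above (22 with listening sockets only, 42 once accepted sockets exist) can be checked with a small sketch. It assumes that every socket, including the sctp control socket, holds two module references, and that each of the 10 servers has one accepted socket. Both assumptions are inferred from the numbers in the mail, not from the sctp source, and the function name is hypothetical.

```python
# Assumption inferred from the figures in the mail: each socket,
# including the sctp control socket, pins the module twice.
REFS_PER_SOCKET = 2


def expected_sctp_refs(listening, accepted, control_sockets=1):
    """Return the module refcount expected for the given socket counts."""
    sockets = listening + accepted + control_sockets
    return sockets * REFS_PER_SOCKET
```

Under these assumptions expected_sctp_refs(10, 0) gives 22 and expected_sctp_refs(10, 10) gives 42, matching the values quoted above. A reading of 48 would then correspond to 13 accepted sockets.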
> The idea that the module refcount could dip to 0 with so much traffic and
> so many associations is a little frightening, and I've never seen it before.
> 
> You could try to catch it, either by instrumenting the sctp module or
> elsewhere.

See the discussion at http://lkml.org/lkml/2009/2/3/203.

> 
> You could also try this on my patched upstream kernel:
> 
> 	git://git.kernel.org/pub/scm/linux/kernel/git/vxy/lksctp-dev.git#pending
> 
> That 'pending' branch has all the patches that fix races and hung connections.
> 
> It would be interesting to see if it triggers on that one piece of hardware.
> 
> Thanks
> -vlad
> 
> > 
> >> -vlad
> > 
> > Thanks for your help
> > 
> > Best regards
> 

-- 
Michal Hocko
L3 team 
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic