Michal Hocko wrote:
> On Fri 09-01-09 16:05:19, Vlad Yasevich wrote:
>> Michal Hocko wrote:
>>> Hi Vlad,
>>>
>>> On Thu 08-01-09 11:56:28, Vlad Yasevich wrote:
>>>> Michal Hocko wrote:
>>> [...]
>>>>> However we are currently seeing another issue. It is not a crash (only
>>>>> the process is killed with a BUG message in the log - see attached) but it is
>>>>> a problem with module reference counting
>>>>> (BUG_ON(module_refcount(module)==0) in __module_get is hit).
>>>>>
>>>>> I am not sure whether this is a real problem, because we were able to
>>>>> trigger this only on _one_ testing configuration while the other one is OK.
>>>>>
>>>>> I have checked all places where sctp decreases the module reference count
>>>>> (sock_put) and it seems that all places are correctly balanced with
>>>>> sock_hold resp. __module_get:
>>>>> - sctp_association_init vs. sctp_association_destroy
>>>>> - sctp_association_migrate - put for old and hold for new
>>>>> - sctp_endpoint_init vs. sctp_endpoint_destroy
>>>>> - sctp_close - one artificial hold because of sk_common_release (which calls
>>>>>   put)
>>>>> - one put balanced with sys_accept which calls __module_get
>>>>>
>>>>> And all sock_put calls correspond to the current upstream.
>>>>>
>>>>> Do you have any idea or remember any problem in this area which could
>>>>> trigger this?
>>>>>
>>>>> It smells either like some misconfiguration of the testing system or
>>>>> another race condition, or I am just overlooking something.
>>>>>
>>>>>
>>>> Try this commit: 027f6e1ad32de32f9fe1c61d0f744e329e8acfd9
>>>> SCTP: Fix a potential race between timers and receive path.
>>> Thanks for this tip, but the same result on the same testing machine.
>>>
>>>> Also, what does lsmod tell you about the reference count on sctp module?
>>> It showed something like 48 when I asked the tester about that. Do you want
>>> some finer grained values (with 1s interval)? I understand that this
>>> value can change rapidly so it is very imprecise.
>>>
>>> I would like to try another test HW configuration on Monday
>>> (to see whether we are able to reproduce it at all).
>> If you are still running the same test, then 48 makes sense. It
>> should start at 22 with just the listening sockets (10 servers, 1 socket
>> each, 2 refs per socket + sctp control socket). Then it should go
>> to 42 as the accepted sockets are created. It should then fluctuate
>> around 40 refs on the module. I guess it could be possible for it
>> to dip back down to 22, but that would be extremely unlikely, as it
>> would require every application to close its accepted socket and not
>> accept any more.
>>
>> The idea that the module refcount could dip to 0 with so much traffic and
>> so many associations is a little frightening and I've never seen it before.
>>
>> You could try to catch it, either by instrumenting the sctp module or
>> elsewhere.
>
> Check the http://lkml.org/lkml/2009/2/3/203 discussion.
>

Thanks for the investigation. Good work.

-vlad
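
For reference, the refcount arithmetic quoted above (22 with only listening sockets, 42 once sockets are accepted) can be written out as a small sketch. It assumes that the sctp control socket also accounts for 2 refs, which is what makes 10 listeners come to 22, and that each server holds one accepted socket at a time. The function names are hypothetical and are not part of the kernel or of any tool. The second helper reads the use count from text in /proc/modules format (third field), which is the number lsmod reports, so a 1s sampling loop could compare observed and expected values.

```python
# Rough model of the expected sctp module refcount in the test described above.
# Assumption (not stated outright in the thread): the control socket costs
# the same 2 refs as any other socket.
REFS_PER_SOCKET = 2  # "2 refs per socket" from the message


def expected_module_refs(servers, listeners_per_server=1,
                         accepted_per_server=0, control_sockets=1):
    """Return the modelled module refcount for a given number of sockets."""
    sockets = servers * (listeners_per_server + accepted_per_server)
    sockets += control_sockets
    return sockets * REFS_PER_SOCKET


def module_refcount(proc_modules_text, name="sctp"):
    """Extract the use count of `name` from /proc/modules-style text.

    Each line has the form: name size refcount deps state address.
    Returns None if the module is not listed.
    """
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[0] == name:
            return int(fields[2])
    return None


if __name__ == "__main__":
    # Listening sockets only, then one accepted socket per server.
    print(expected_module_refs(10))
    print(expected_module_refs(10, accepted_per_server=1))
```

On a live system the text would come from reading /proc/modules once per second. Since the count changes quickly under load, a single sample only shows whether it stays in the expected range, not whether it ever briefly reached 0.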