Re: bad reference counting for module (was Re: BUG in sctp crashes sles10sp2 kernel)

Michal Hocko wrote:
> On Fri 09-01-09 16:05:19, Vlad Yasevich wrote:
>> Michal Hocko wrote:
>>> Hi Vlad,
>>>
>>> On Thu 08-01-09 11:56:28, Vlad Yasevich wrote:
>>>> Michal Hocko wrote:
>>> [...]
>>>>> However we are currently seeing another issue. It is not a crash (only
>>>>> the process is killed, with a BUG message in the log - see attached), but
>>>>> a problem with module reference counting: the
>>>>> BUG_ON(module_refcount(module)==0) in __module_get is triggered.
>>>>>
>>>>> I am not sure whether this is a real problem, because we were able to
>>>>> trigger it on only _one_ testing configuration, while the other one is OK.
>>>>>
>>>>> I have checked all places where sctp decreases the module reference count
>>>>> (sock_put) and it seems that all of them are correctly balanced with
>>>>> sock_hold or __module_get, respectively:
>>>>> - sctp_association_init vs. sctp_association_destroy
>>>>> - sctp_association_migrate - put for old and hold for new
>>>>> - sctp_endpoint_init vs. sctp_endpoint_destroy
>>>>> - sctp_close - one artificial hold because of sk_common_release (which calls
>>>>>                put)
>>>>>              - one put balanced with sys_accept which calls __module_get
>>>>>
>>>>> And all sock_put calls correspond to the current upstream.
>>>>>
>>>>> Do you have any idea or remember any problem in this area which could
>>>>> trigger this? 
>>>>>
>>>>> It smells like either a misconfiguration of the testing system, another
>>>>> race condition, or I am just overlooking something.
>>>>>
>>>>>
>>>> Try this commit: 027f6e1ad32de32f9fe1c61d0f744e329e8acfd9
>>>> SCTP: Fix a potential race between timers and receive path.
>>> Thanks for this tip, but we get the same result on the same testing machine.
>>>
>>>> Also, what does lsmod tell you about the reference count on sctp module?
>>> It showed something like 48 when I asked the tester about that. Do you want
>>> some finer-grained values (with a 1s interval)? I understand that this
>>> value can change rapidly, so it is very imprecise.
>>> I would like to try another test HW configuration on Monday
>>> (to see whether we are able to reproduce it at all).
>> If you are still running the same test, then 48 makes sense.  It
>> should start at 22 with just the listening sockets (10 servers, 1 socket
>> each, 2 refs per socket + sctp control socket).  Then it should go
>> to 42 as the accepted sockets are created.  It should then fluctuate
>> around 40 refs on the module.  I guess it could be possible for it
>> to dip back down to 22, but that would be extremely unlikely, as it
>> would require every application to close its accepted socket and not
>> accept any more.
>>
>> The idea that the module refcount could dip to 0 with so much traffic and
>> so many associations is a little frightening and I've never seen it before.
>>
>> You could try to catch it, either by instrumenting the sctp module or
>> elsewhere.
> 
> Check the http://lkml.org/lkml/2009/2/3/203 discussion.
> 

Thanks for the investigation.  Good work.

-vlad
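
For readers following the refcount arithmetic quoted above, here is a
minimal user-space sketch of it, together with a model of the check that
fired. It is not kernel code: struct model_module, model_module_get,
model_module_put, expected_refs and model_demo are hypothetical names
made up for illustration. It also assumes the sctp control socket holds
2 refs like any other socket, which is inferred from the quoted total of
22 rather than stated in the thread.

```c
#include <assert.h>
#include <stdbool.h>

/* "2 refs per socket", as quoted in the thread. */
#define REFS_PER_SOCKET 2u

/* Hypothetical stand-in for a module's reference counter. */
struct model_module {
    unsigned int refcnt;
};

/*
 * Models a get on an already-referenced module. Returns false where the
 * BUG_ON(module_refcount(module)==0) from the thread would fire.
 */
static bool model_module_get(struct model_module *m)
{
    if (m->refcnt == 0)
        return false;
    m->refcnt++;
    return true;
}

/* Models a put. Returns false on an unbalanced put (counter already 0). */
static bool model_module_put(struct model_module *m)
{
    if (m->refcnt == 0)
        return false;
    m->refcnt--;
    return true;
}

/*
 * Expected module refcount for the quoted test setup.
 * Assumption: the control socket counts as one more socket with 2 refs.
 */
static unsigned int expected_refs(unsigned int listeners, unsigned int accepted)
{
    return (listeners + accepted + 1u) * REFS_PER_SOCKET;
}

/* Walks through the numbers from the thread and one unbalanced put. */
static void model_demo(void)
{
    struct model_module m = { .refcnt = 1 };

    assert(expected_refs(10, 0) == 22);   /* listening sockets only */
    assert(expected_refs(10, 10) == 42);  /* plus 10 accepted sockets */

    assert(model_module_get(&m));         /* 1 -> 2 */
    assert(model_module_put(&m));         /* 2 -> 1 */
    assert(model_module_put(&m));         /* 1 -> 0: one put too many */
    assert(!model_module_get(&m));        /* the next get would hit the BUG */
}
```

Calling model_demo() from a main() exercises the whole scenario. The
point of the model is only that a single extra put per socket lifetime
is enough to drain the counter over time, after which the next get trips
the check.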