Re: osd become unusable, blocked by xfsaild (?) and load > 5000

We have been seeing this same behavior on a cluster that had been perfectly happy until we upgraded to the Ubuntu Vivid 3.19 kernel. We are in the process of "upgrading" back to the 3.16 kernel across our cluster, as we have not seen this behavior on that kernel for over six months and we are fairly strongly of the opinion that this is a kernel regression. Please let the list know whether upping your thread limits fixes your issue (though I'm not optimistic): we have our max threads set to the value recommended here (4194303) and still see this issue regularly on the 3.19 Ubuntu kernel. We tried both 3.19.0-25 and 3.19.0-33 before giving up and reverting to 3.16.
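
For reference, raising those limits on a running box would look something like this (the sysctl.d file name below is just an example, adjust to whatever your config management expects):

sysctl -w kernel.pid_max=4194303
sysctl -w kernel.threads-max=4194303
echo 'kernel.pid_max = 4194303' > /etc/sysctl.d/90-thread-limits.conf
echo 'kernel.threads-max = 4194303' >> /etc/sysctl.d/90-thread-limits.conf
sysctl --system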


On Tue, Dec 8, 2015 at 1:03 AM, Jan Schermer <jan@xxxxxxxxxxx> wrote:

> On 08 Dec 2015, at 08:57, Benedikt Fraunhofer <fraunhofer@xxxxxxxxxx> wrote:
>
> Hi Jan,
>
>> Doesn't look near the limit currently (but I suppose you rebooted it in the meantime?).
>
> the box these numbers came from has an uptime of 13 days,
> so it's one of the boxes that did survive yesterday's half-cluster-wide reboot.
>

So this box had no issues? Keep an eye on the number of threads, but maybe others will have a better idea; this is just where I'd start. I have seen close to a million threads from OSDs on my boxes, not sure what the numbers are now.
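
A quick way to watch that (assuming the OSD processes are named ceph-osd) is the NLWP column from ps, e.g.:

ps -o pid,nlwp,comm -C ceph-osd

plus "ps axH | wc -l" for the box-wide total.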

>> Did iostat say anything about the drives? (btw dm-1 and dm-6 are what? Is that your data drives?) - were they overloaded really?
>
> No, they didn't have any load or IOPS.
> Basically the whole box had nothing to do.
>
> If I understand the load correctly, this just reports threads
> that are ready and willing to work but - in this case -
> don't get any data to work with.

Different Unixes calculate this differently :-) By itself, "load" is meaningless.
It is roughly the average number of processes that want to run at any given time but can't (because they are waiting for whatever they need: disks, CPU, blocking sockets...).
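
On Linux, threads stuck in uninterruptible sleep (state D, e.g. waiting behind xfsaild) count toward the load average as well, so a load over 5000 with idle disks usually means thousands of blocked threads rather than real work. Something along these lines should show what they are blocked on:

cat /proc/loadavg
ps -eLo state,wchan:32,comm | awk '$1 == "D"' | sort | uniq -c | sort -rn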

Jan


>
> Thx
>
> Benedikt
>
>
> 2015-12-08 8:44 GMT+01:00 Jan Schermer <jan@xxxxxxxxxxx>:
>>
>> Jan
>>
>>
>>> On 08 Dec 2015, at 08:41, Benedikt Fraunhofer <fraunhofer@xxxxxxxxxx> wrote:
>>>
>>> Hi Jan,
>>>
>>> we had 65k for pid_max, which made
>>> kernel.threads-max = 1030520.
>>> or
>>> kernel.threads-max = 256832
>>> (looks like it depends on the number of cpus?)
>>>
>>> currently we've
>>>
>>> root@ceph1-store209:~# sysctl -a | grep -e thread -e pid
>>> kernel.cad_pid = 1
>>> kernel.core_uses_pid = 0
>>> kernel.ns_last_pid = 60298
>>> kernel.pid_max = 65535
>>> kernel.threads-max = 256832
>>> vm.nr_pdflush_threads = 0
>>> root@ceph1-store209:~# ps axH |wc -l
>>> 17548
>>>
>>> we'll see how it behaves once puppet has come by and adjusted it.
>>>
>>> Thx!
>>>
>>> Benedikt
>>

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

