We have been seeing this same behavior on a cluster that had been perfectly happy until we upgraded to the Ubuntu Vivid 3.19 kernel. We are in the process of "upgrading" back to the 3.16 kernel across our cluster, as we have not seen this behavior on that kernel in over six months, and we're fairly strongly of the opinion that this is a kernel regression. Please let the list know if raising your thread limits fixes your issue (though I'm not optimistic): we have our max threads set to the value recommended here (4194303), but we still saw this issue regularly on the 3.19 Ubuntu kernel. We tried both 3.19.0-25 and 3.19.0-33 before giving up and reverting to 3.16.
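For anyone who wants to try the higher limits anyway, something along these lines should do it (the sysctl.d filename below is just an example, adjust to your setup):

  # raise the limits at runtime
  sysctl -w kernel.pid_max=4194303
  sysctl -w kernel.threads-max=4194303

  # persist them across reboots
  printf 'kernel.pid_max = 4194303\nkernel.threads-max = 4194303\n' > /etc/sysctl.d/60-ceph-threads.conf
  sysctl --system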
On Tue, Dec 8, 2015 at 1:03 AM, Jan Schermer <jan@xxxxxxxxxxx> wrote:
> On 08 Dec 2015, at 08:57, Benedikt Fraunhofer <fraunhofer@xxxxxxxxxx> wrote:
>
> Hi Jan,
>
>> Doesn't look near the limit currently (but I suppose you rebooted it in the meantime?).
>
> the box these numbers came from has an uptime of 13 days,
> so it's one of the boxes that survived yesterday's half-cluster-wide reboot.
>
So this box had no issues? Keep an eye on the number of threads; maybe others will have a better idea, but this is just where I'd start. I have seen close to a million threads from OSDs on my boxes, not sure what the numbers are now.
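Something like this (untested, just a sketch) will show how many threads each OSD is running, so you can watch whether it keeps growing:

  # NLWP = number of threads per ceph-osd process
  ps -o pid,nlwp,cmd -p "$(pgrep -d, ceph-osd)"
  # total threads on the whole box
  ps axH | wc -l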
>> Did iostat say anything about the drives? (btw, what are dm-1 and dm-6? Are those your data drives?) - were they really overloaded?
>
> No, they didn't have any load or IOPS.
> Basically the whole box had nothing to do.
>
> If I understand the load correctly, this just reports threads
> that are ready and willing to work but - in this case -
> don't get any data to work with.
Different Unixes calculate this differently :-) By itself, "load" is meaningless.
It is roughly the average number of processes that want to run at any given time but can't, because they are waiting for whatever they need (disks, CPU, blocking sockets...).
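On Linux you can look at the raw inputs yourself, e.g.:

  # 1/5/15-minute load averages plus runnable/total task counts
  cat /proc/loadavg
  # 'r' = runnable, 'b' = uninterruptible sleep; Linux counts both toward load
  vmstat 1 5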
Jan
>
> Thx
>
> Benedikt
>
>
> 2015-12-08 8:44 GMT+01:00 Jan Schermer <jan@xxxxxxxxxxx>:
>>
>> Jan
>>
>>
>>> On 08 Dec 2015, at 08:41, Benedikt Fraunhofer <fraunhofer@xxxxxxxxxx> wrote:
>>>
>>> Hi Jan,
>>>
>>> we had 65k for pid_max, which made
>>> kernel.threads-max = 1030520.
>>> or
>>> kernel.threads-max = 256832
>>> (looks like it depends on the number of cpus?)
>>>
>>> currently we've
>>>
>>> root@ceph1-store209:~# sysctl -a | grep -e thread -e pid
>>> kernel.cad_pid = 1
>>> kernel.core_uses_pid = 0
>>> kernel.ns_last_pid = 60298
>>> kernel.pid_max = 65535
>>> kernel.threads-max = 256832
>>> vm.nr_pdflush_threads = 0
>>> root@ceph1-store209:~# ps axH |wc -l
>>> 17548
>>>
>>> we'll see how it behaves once Puppet has come by and adjusted it.
>>>
>>> Thx!
>>>
>>> Benedikt
>>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com