On 09/18/2015 12:17 PM, Olivier Bonvalet wrote:
> On Friday, 18 September 2015 at 12:04 +0200, Jan Schermer wrote:
>>> On 18 Sep 2015, at 11:28, Christian Balzer <chibi@xxxxxxx> wrote:
>>>
>>> On Fri, 18 Sep 2015 11:07:49 +0200 Olivier Bonvalet wrote:
>>>
>>>> On Friday, 18 September 2015 at 10:59 +0200, Jan Schermer wrote:
>>>>> In that case it can either be slow monitors (slow network, slow
>>>>> disks(!!!)) or a CPU or memory problem.
>>>>> But it can also still be on the OSD side, in the form of either CPU
>>>>> usage or memory pressure - in my case there was a lot of memory used
>>>>> for pagecache (so for all intents and purposes considered "free"),
>>>>> but when peering the OSD had trouble allocating any memory from it,
>>>>> which caused lots of slow ops and left peering hanging there for a
>>>>> while. This also doesn't show up as high CPU usage; only kswapd
>>>>> spins up a bit (don't be fooled by its name, it has nothing to do
>>>>> with swap in this case).
>>>> My nodes have 256GB of RAM (the 12x300GB ones) or 128GB of RAM (the
>>>> 4x800GB ones), so I will try to track this too. Thanks!
>>>>
>>> I haven't seen this (known problem) with 64GB or 128GB nodes, probably
>>> because I set /proc/sys/vm/min_free_kbytes to 512MB or 1GB
>>> respectively.
>>>
>> I had this set to 6G and that doesn't help. This "buffer" is probably
>> only useful for some atomic allocations that can use it, not for
>> userland processes and their memory. Or maybe they do get memory from
>> this pool but it gets replenished immediately.
>> QEMU has no problem allocating 64G on the same host; the OSD struggles
>> to allocate memory during startup or when PGs are added during
>> rebalancing - probably because it does a lot of smaller allocations
>> instead of one big one.
>>
> For now I dropped the caches *and* set min_free_kbytes to 1GB. I haven't
> triggered any rebalance yet, but I can already see a reduced
> filestore.commitcycle_latency.

It might be worth checking how many threads you have on the system
(ps -eL | wc -l). By default there is a limit of 32k (sysctl -q
kernel.pid_max). There is/was a bug in fork()
(https://lkml.org/lkml/2015/2/3/345) that reports ENOMEM when the PID
limit is reached. We hit a situation where an OSD trying to create a new
thread was killed and reported 'Cannot allocate memory' (12 OSDs per node
had created more than 32k threads). A rough sketch of the checks follows
below.

-- 
PS
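
For reference, a minimal sketch of the checks mentioned above, assuming
procps and sysctl are available on the node; the pid_max value shown is
only an illustrative example, not a recommendation:

    # Compare the current thread count against the kernel-wide task limit,
    # and show the min_free_kbytes reserve discussed earlier in the thread.
    echo "threads in use:  $(ps -eL --no-headers | wc -l)"
    echo "kernel.pid_max:  $(sysctl -n kernel.pid_max)"
    echo "min_free_kbytes: $(cat /proc/sys/vm/min_free_kbytes)"

    # If the thread count is anywhere near pid_max, the limit can be raised,
    # e.g. sysctl -w kernel.pid_max=4194303, and persisted via
    # /etc/sysctl.conf or a drop-in file under /etc/sysctl.d/.

With many OSDs per host this adds up quickly - in our case 12 OSDs were
enough to cross the 32k default.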