On 09/18/2015 12:17 PM, Olivier Bonvalet wrote:
> On Friday, 18 September 2015 at 12:04 +0200, Jan Schermer wrote:
>>> On 18 Sep 2015, at 11:28, Christian Balzer <chibi@xxxxxxx> wrote:
>>>
>>> On Fri, 18 Sep 2015 11:07:49 +0200 Olivier Bonvalet wrote:
>>>
>>>> On Friday, 18 September 2015 at 10:59 +0200, Jan Schermer wrote:
>>>>> In that case it can either be slow monitors (slow network, slow
>>>>> disks(!!!)) or a CPU or memory problem.
>>>>> But it can also still be on the OSD side, in the form of either CPU
>>>>> usage or memory pressure - in my case there was a lot of memory used
>>>>> for pagecache (so for all intents and purposes considered "free"),
>>>>> but when peering the OSD had trouble allocating any memory from it,
>>>>> which caused lots of slow ops and left peering hanging there for a
>>>>> while. This also doesn't show up as high CPU usage; only kswapd
>>>>> spins up a bit (don't be fooled by its name, it has nothing to do
>>>>> with swap in this case).
>>>> My nodes have 256GB of RAM (the 12x300GB ones) or 128GB of RAM (the
>>>> 4x800GB ones), so I will try to track this too. Thanks!
>>>>
>>> I haven't seen this (known problem) with 64GB or 128GB nodes, probably
>>> because I set /proc/sys/vm/min_free_kbytes to 512MB or 1GB
>>> respectively.
>>>
>> I had this set to 6G and that doesn't help. This "buffer" is probably
>> only useful for some atomic allocations that can use it, not for
>> userland processes and their memory. Or maybe they do get memory from
>> this pool but it gets replenished immediately.
>> QEMU has no problem allocating 64G on the same host; the OSD struggles
>> to allocate memory during startup or when PGs are added during
>> rebalancing - probably because it does a lot of smaller allocations
>> instead of one big one.
>>
> For now I dropped the caches *and* set min_free_kbytes to 1GB. I haven't
> triggered any rebalance yet, but I can already see a reduced
> filestore.commitcycle_latency.

It might be worth checking how many threads you have on the system
(ps -eL | wc -l). By default there is a limit of 32k (sysctl -q
kernel.pid_max). There is/was a bug in fork()
(https://lkml.org/lkml/2015/2/3/345) that reports ENOMEM when the PID
limit is reached. We hit a situation where an OSD trying to create a new
thread was killed and reported 'Cannot allocate memory' (12 OSDs per node
had created more than 32k threads). A rough sketch of the checks follows
below.

-- 
PS
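
For reference, a minimal sketch of the checks mentioned above, assuming
procps and sysctl are available on the node; the pid_max value shown is
only an illustrative example, not a recommendation:

    # Compare the current thread count against the kernel-wide task limit,
    # and show the min_free_kbytes reserve discussed earlier in the thread.
    echo "threads in use:  $(ps -eL --no-headers | wc -l)"
    echo "kernel.pid_max:  $(sysctl -n kernel.pid_max)"
    echo "min_free_kbytes: $(cat /proc/sys/vm/min_free_kbytes)"

    # If the thread count is anywhere near pid_max, the limit can be raised,
    # e.g. sysctl -w kernel.pid_max=4194303, and persisted via
    # /etc/sysctl.conf or a drop-in file under /etc/sysctl.d/.

With many OSDs per host this adds up quickly - in our case 12 OSDs were
enough to cross the 32k default.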