> On 31 Jul 2015, at 17:28, Haomai Wang <haomaiwang@xxxxxxxxx> wrote:
>
> On Fri, Jul 31, 2015 at 5:47 PM, Jan Schermer <jan@xxxxxxxxxxx> wrote:
>> I know a few other people here were battling with the occasional issue of an OSD being extremely slow when starting.
>>
>> I personally run OSDs mixed with KVM guests on the same nodes, and was baffled by this issue occurring mostly on the most idle (empty) machines.
>> I thought it was some kind of race condition where the OSD started too fast and the disks couldn't catch up, was investigating latency of CPUs and cards on mostly idle hardware etc. - with no improvement.
>>
>> But in the end, most of my issues were caused by the page cache using too much memory. This doesn't cause any problems while the OSDs have their memory allocated and are running, but when an OSD is (re)started, the OS struggles to allocate contiguous blocks of memory for it and its buffers.
>> This could also be why I'm seeing such an improvement with my NUMA pinning script - cleaning memory on one node is probably easier and doesn't block allocations on the other nodes.
>>
>
> Although this makes sense to me, I'm still shocked by the fact that freeing the page cache or memory fragmentation can cause slow requests!

"Fragmentation" may be an inaccurate description. I know it is an issue for atomic kernel allocations (DMA, driver buffers…), where it can lead to memory starvation even though "free" shows tens of gigabytes free. This manifests in the same way, except there's no "page allocation failure" in dmesg when it happens - probably because there are no strict deadlines for satisfying userland requests.
And although I've looked into it, I can't say I am 100% sure my explanation is correct.

Anyway, if you start a process that needs ~2 GB of resident memory to work, you need to free 2 GB of page cache - either by dropping clean pages or by writing out dirty pages. And writing those pages out under pressure, while being bombarded by new allocations, is not that fast.

RH kernels are quite twisted, with backported features that shouldn't be there, some features that no longer exist anywhere else, and interactions whose behaviour is often unclear, so that might be another issue in my case.
As a side note, did you know that barriers don't actually exist on RH (6) kernels? They replaced them with a FUA backport… So does it actually behave the same way newer kernels do? I can take a guess from what I've seen… :-)

Have a nice weekend
Jan

>
>> How can you tell if this is your case? When restarting an OSD that has this issue, look at the CPU usage of the "kswapd" processes. If it is >0 then you have this issue and would benefit from setting this:
>>
>> for i in $(mount |grep "ceph/osd" |cut -d' ' -f1 |cut -d'/' -f3 |tr -d '[0-9]') ; do echo 1 >/sys/block/$i/bdi/max_ratio ; done
>>
>> (another option is echo 1 > drop_caches before starting the OSD, but that's a bit brutal)
>>
>> What this does is limit the page cache size for each block device to 1% of physical memory. I'd like to limit it even further, but it doesn't accept "0.3"...
>>
>> Let me know if it helps - I've not been able to test whether it cures the problem completely, but there was no regression after setting it.
>>
>> Jan
>>
>> P.S. This is for the ancient 2.6.32 kernel on RHEL 6 / CentOS 6; newer kernels have tunables to limit the overall page cache size. You can also set limits in cgroups, but I'm afraid that won't help in this case, as you can only limit the whole memory footprint, within which the allocations will battle each other anyway.
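
To make the diagnosis above a bit more concrete, this is roughly what I look at when an OSD restart stalls. Untested one-liners, and sdb is just an example device - substitute your own OSD data disks:

# kswapd eating CPU while the OSD starts is the tell-tale sign
top -b -d 5 -n 3 | grep -i kswapd

# how much dirty/writeback data the kernel still has to flush before memory can be reclaimed
grep -E 'Dirty|Writeback|MemFree|Cached' /proc/meminfo

# the per-device workaround from above, applied to one disk
echo 1 > /sys/block/sdb/bdi/max_ratio

# the brutal variant: drops clean page cache only, dirty pages still have to be written out first
sync && echo 1 > /proc/sys/vm/drop_caches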