(apologies if you receive this more than once... apparently I cannot reply to a 1-year-old message on the list).

Dear all,

I'd like to +10 this old proposal of Kyle's. Let me explain why...

A couple of months ago we started testing a new use-case with radosgw -- this new user is writing millions of small files and has been causing us some headaches. Since starting these tests, the relevant OSDs have been randomly freezing for up to ~60s at a time. We have dedicated servers for this use-case, so it doesn't affect our important RBD users, and the OSDs always came back anyway ("wrongly marked me down..."). So I didn't give this problem much attention, though I guessed that we must be suffering from some network connectivity problem.

But last week I started looking into this problem in more detail. With increased debug_osd logs I saw that when these OSDs get marked down, even the osd tick message is not printed for >30s. I also correlated these outages with massive drops in cached memory -- it looked as if an admin were running drop_caches on our live machines. Here is what we saw:

https://www.dropbox.com/s/418ve09b6m98tyc/Screenshot%202015-01-12%2010.04.16.png?dl=0

Notice the sawtooth cached pages. That server has 20 OSDs, and each OSD has ~1 million files totalling around 40GB (~40kB objects). Compare that with a different OSD host, one that's used for Cinder RBD volumes (and doesn't suffer from the freezing OSD problem):

https://www.dropbox.com/s/1lmra5wz7e7qxjy/Screenshot%202015-01-12%2010.11.37.png?dl=0

These RBD servers have identical hardware, but in this case the 20 OSDs each hold around 100k files totalling ~400GB (~4MB objects). Clearly the 10x increase in the number of files on the radosgw OSDs appears to be causing a problem. In fact, since the servers are pretty idle most of the time, it appears that the _scrubbing_ of these 20 million files per server is causing the problem.

It seems that scrubbing creates quite some memory pressure (via the inode cache, especially), so I started testing different vfs_cache_pressure values (1, 10, 1000, 10000). The only value that sort of helped was vfs_cache_pressure = 1, but keeping all the inodes cached is a pretty extreme measure, and it won't scale up when these OSDs are more full (they're only around 1% full now!!)

Then I discovered the infamous behaviour of zone_reclaim_mode = 1, and this old thread. And I read a bit more, e.g.

http://engineering.linkedin.com/performance/optimizing-linux-memory-management-low-latency-high-throughput-databases
http://rhaas.blogspot.ch/2014/06/linux-disables-vmzonereclaimmode-by.html

Indeed all our servers have zone_reclaim_mode = 1. Numerous DB communities regard this option as very bad for servers -- MongoDB even prints a warning message at startup if zone_reclaim_mode is enabled. And finally, in recent kernels (since ~June 2014) zone_reclaim_mode is disabled by default. The vm documentation now says:

    zone_reclaim_mode is disabled by default. For file servers or workloads
    that benefit from having their data cached, zone_reclaim_mode should be
    left disabled as the caching effect is likely to be more important than
    data locality.

I've set zone_reclaim_mode = 0 on these radosgw OSD servers, and the freezing OSD problem has gone away.
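For reference, this is roughly what we ran on each radosgw OSD host (the sysctl.d filename below is just an example I picked; adjust for your distro's conventions):

    # check the current setting (non-zero means zone reclaim is enabled)
    cat /proc/sys/vm/zone_reclaim_mode

    # disable it on the running system
    sysctl -w vm.zone_reclaim_mode=0

    # and make it persistent across reboots
    echo "vm.zone_reclaim_mode = 0" > /etc/sysctl.d/99-zone-reclaim.conf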
Here's a plot of a server that had zone_reclaim_mode set to zero late on Jan 9th:

https://www.dropbox.com/s/x5qyn1e1r6fasl5/Screenshot%202015-01-12%2011.47.27.png?dl=0

I also used numactl --interleave=all <ceph command> on one host, but it doesn't appear to make a huge difference beyond disabling NUMA zone reclaim.

Moving forward, I think it would be good for Ceph to at least document this behaviour, but better would be to also detect when zone_reclaim_mode != 0 and warn the admin, like MongoDB does (a rough sketch is at the bottom of this mail). This line from the commit which disables it in the kernel is pretty wise, IMHO:

    "On current machines and workloads it is often the case that
    zone_reclaim_mode destroys performance but not all users know how to
    detect this. Favour the common case and disable it by default."

Cheers, Dan

On Thu, Dec 12, 2013 at 11:30 PM, Kyle Bader <kyle.bader@xxxxxxxxx> wrote:
> It seems that NUMA can be problematic for ceph-osd daemons in certain
> circumstances. Namely, it seems that if a NUMA zone is running out of
> memory due to uneven allocation, it is possible for a NUMA zone to
> enter reclaim mode when threads/processes are scheduled on a core in
> that zone and those processes request memory allocations greater
> than the zone's remaining memory. In order for the kernel to satisfy
> the memory allocation for those processes it needs to page out some of
> the contents of the contentious zone, which can have dramatic
> performance implications due to cache misses, etc. I see two ways an
> operator could alleviate these issues:
>
> Set the vm.zone_reclaim_mode sysctl setting to 0, along with prefixing
> ceph-osd daemons with "numactl --interleave=all". This should probably
> be activated by a flag in /etc/default/ceph and modifying the
> ceph-osd.conf upstart script, along with adding a dependency to the ceph
> package's debian/rules file on the "numactl" package.
>
> The alternative is to use a cgroup for each ceph-osd daemon, pinning
> each one to cores in the same NUMA zone using cpuset.cpus and
> cpuset.mems. This would probably also live in /etc/default/ceph and
> the upstart scripts.
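For anyone who wants to experiment with Kyle's second option above, a very rough sketch of doing it by hand with a cpuset cgroup would look something like the following -- the cgroup name and OSD id are just placeholders, it assumes the cpuset controller is mounted at the usual /sys/fs/cgroup/cpuset, and in practice this belongs in the init/upstart scripts rather than in a shell:

    # create a cpuset cgroup for one OSD and restrict it to NUMA node 0
    mkdir /sys/fs/cgroup/cpuset/ceph-osd-12
    cat /sys/devices/system/node/node0/cpulist \
        > /sys/fs/cgroup/cpuset/ceph-osd-12/cpuset.cpus
    echo 0 > /sys/fs/cgroup/cpuset/ceph-osd-12/cpuset.mems

    # then move the (already running) daemon into the cgroup
    echo <pid of the osd.12 daemon> > /sys/fs/cgroup/cpuset/ceph-osd-12/tasks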
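And regarding the "warn the admin" idea above: the check itself is trivial. Something along these lines (where exactly it should live in Ceph is an open question) would already help a lot:

    # warn at startup if NUMA zone reclaim is enabled, like MongoDB does
    if [ -r /proc/sys/vm/zone_reclaim_mode ] &&
       [ "$(cat /proc/sys/vm/zone_reclaim_mode)" -ne 0 ]; then
        echo "WARNING: vm.zone_reclaim_mode is enabled; this is known to cause" >&2
        echo "         long OSD stalls under memory pressure; consider setting it to 0" >&2
    fi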