On 01/12/2015 07:47 AM, Dan van der Ster wrote:
(resending to list) Hi Kyle, I'd like to +10 this old proposal of yours. Let me explain why...

A couple of months ago we started testing a new use-case with radosgw -- this new user is writing millions of small files and has been causing us some headaches. Since starting these tests, the relevant OSDs have been randomly freezing for up to ~60s at a time. We have dedicated servers for this use-case, so it doesn't affect our important RBD users, and the OSDs always came back anyway ("wrongly marked me down"...). So I didn't give this problem much attention, though I guessed that we must be suffering from some network connectivity problem.

But last week I started looking into this problem in more detail. With increased debug_osd logs I saw that when these OSDs are getting marked down, even the osd tick message is not printed for >30s. I also correlated these outages with massive drops in cached memory -- it looked as if an admin were running drop_caches on our live machines. Here is what we saw:

https://www.dropbox.com/s/418ve09b6m98tyc/Screenshot%202015-01-12%2010.04.16.png?dl=0

Notice the sawtooth cached pages. That server has 20 OSDs, and each OSD has ~1 million files totalling around 40GB (~40kB objects). Compare that with a different OSD host, one that's used for Cinder RBD volumes (and doesn't suffer from the freezing OSD problem):

https://www.dropbox.com/s/1lmra5wz7e7qxjy/Screenshot%202015-01-12%2010.11.37.png?dl=0

These RBD servers have identical hardware, but in this case the 20 OSDs each hold around 100k files totalling ~400GB (~4MB objects). Clearly the 10x increase in the number of files on the radosgw OSDs appears to be causing a problem.

In fact, since the servers are pretty idle most of the time, it appears that the _scrubbing_ of these 20 million files per server is causing the problem. It seems that scrubbing creates quite some memory pressure (especially via the inode cache), so I started testing different vfs_cache_pressure values (1, 10, 1000, 10000). The only value that sort of helped was vfs_cache_pressure = 1, but keeping all the inodes cached is a pretty extreme measure, and it won't scale up when these OSDs are more full (they're only around 1% full now!!).

Then I discovered the infamous behaviour of zone_reclaim_mode = 1, and this old thread. And I read a bit more, e.g.

http://engineering.linkedin.com/performance/optimizing-linux-memory-management-low-latency-high-throughput-databases
http://rhaas.blogspot.ch/2014/06/linux-disables-vmzonereclaimmode-by.html

Indeed all our servers have zone_reclaim_mode = 1. Numerous DB communities regard this option as very bad for servers -- MongoDB even prints a warning message at startup if zone_reclaim_mode is enabled. And finally, in recent kernels (since ~June 2014) zone_reclaim_mode is disabled by default. The vm doc now says:

    zone_reclaim_mode is disabled by default. For file servers or workloads
    that benefit from having their data cached, zone_reclaim_mode should be
    left disabled as the caching effect is likely to be more important than
    data locality.

I've set zone_reclaim_mode = 0 on these radosgw OSD servers, and the freezing OSD problem has gone away. Here's a plot of a server that had zone_reclaim_mode set to zero late on Jan 9th:

https://www.dropbox.com/s/x5qyn1e1r6fasl5/Screenshot%202015-01-12%2011.47.27.png?dl=0

I also used numactl --interleave=all <ceph command> on one host, but it doesn't appear to make a huge difference beyond disabling NUMA zone reclaim.
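If anyone wants to check the same settings on their own OSD hosts, something like this quick sketch (just reading the standard procfs paths; adjust for your distro/kernel as needed) will print them:

    #!/usr/bin/env python
    # Print the VM settings discussed above. A quick sketch: the procfs
    # paths are the standard ones, but treat this as illustrative only.

    SETTINGS = [
        "/proc/sys/vm/zone_reclaim_mode",   # 0 recommended for OSD hosts
        "/proc/sys/vm/vfs_cache_pressure",  # kernel default is 100
    ]

    for path in SETTINGS:
        try:
            with open(path) as f:
                print("%s = %s" % (path, f.read().strip()))
        except IOError as e:
            print("%s: unreadable (%s)" % (path, e))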
Moving forward, I think it would be good for Ceph to at least document this behaviour, but better still would be to detect when zone_reclaim_mode != 0 and warn the admin (like MongoDB does). This line from the commit which disabled it in the kernel is pretty wise, IMHO: "On current machines and workloads it is often the case that zone_reclaim_mode destroys performance but not all users know how to detect this. Favour the common case and disable it by default."
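Something along these lines (a rough illustrative sketch, not actual Ceph code) would be all that such a startup check needs:

    #!/usr/bin/env python
    # Rough sketch of the MongoDB-style startup warning proposed above.
    # Not Ceph code; just illustrates the check.
    import sys

    ZONE_RECLAIM = "/proc/sys/vm/zone_reclaim_mode"

    def warn_if_zone_reclaim_enabled():
        try:
            with open(ZONE_RECLAIM) as f:
                mode = int(f.read().strip())
        except (IOError, ValueError):
            return  # sysctl absent or unreadable: nothing to warn about
        if mode != 0:
            sys.stderr.write(
                "WARNING: %s is %d; this can cause long stalls under memory "
                "pressure. 0 is recommended for OSD hosts.\n"
                % (ZONE_RECLAIM, mode))

    if __name__ == "__main__":
        warn_if_zone_reclaim_enabled()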
Interestingly, I was seeing behavior that looked like this as well, though it manifested as OSDs going down with internal heartbeat timeouts during heavy 4K read/write benchmarks. I was able to observe major page faults on the OSD nodes correlated with the "frozen" periods, despite significant amounts of memory being used for buffer cache. I also went down the vfs_cache_pressure path, and changing it to 1 fixed the issue. I didn't think to go back and look at zone_reclaim_mode though!

Since then the system has been upgraded to Fedora 21 with a 3.17 kernel and the issue no longer manifests; I suspect now that this is due to the new default behavior. Perhaps this solves the puzzle. What is interesting is that, at least on our test node, this behavior didn't occur prior to firefly. It may be that some change we made exacerbated the problem. Anyway, thank you for the excellent analysis Dan!
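If anyone wants to watch for the same symptom, something like the following (a quick sketch that samples majflt out of /proc/<pid>/stat for running ceph-osd processes; illustrative only, not a polished tool) should show whether the OSDs are taking major faults during a freeze:

    #!/usr/bin/env python
    # Sample major page faults (majflt, field 12 of /proc/<pid>/stat per
    # proc(5)) for ceph-osd processes over a 10s window. Sketch only.
    import glob, time

    def osd_majflt():
        counts = {}
        for stat in glob.glob("/proc/[0-9]*/stat"):
            try:
                with open(stat) as f:
                    data = f.read()
            except IOError:
                continue  # process exited between glob and open
            # comm is wrapped in parentheses; split after the closing one
            comm = data[data.index("(") + 1:data.rindex(")")]
            if comm != "ceph-osd":
                continue
            fields = data[data.rindex(")") + 2:].split()
            pid = stat.split("/")[2]
            counts[pid] = int(fields[9])  # majflt
        return counts

    if __name__ == "__main__":
        before = osd_majflt()
        time.sleep(10)
        after = osd_majflt()
        for pid in sorted(after):
            print("osd pid %s: %d major faults in 10s"
                  % (pid, after[pid] - before.get(pid, 0)))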
Mark