On Wed, Feb 18, 2015 at 9:19 PM, Florian Haas wrote:
>> Hey everyone,
>>
>> I must confess I'm still not fully understanding this problem and don't
>> exactly know where to start digging deeper, but perhaps other users have
>> seen this and/or it rings a bell.
>>
>> System info: Ceph giant on CentOS 7; approx. 240 OSDs, 6 pools using 2
>> different rulesets, where the problem applies to hosts and PGs using a
>> bog-standard default crushmap.
>>
>> Symptom: out of the blue, ceph-osd processes on a single OSD node start
>> going to 100% CPU utilization. The problem turns so bad that the machine
>> effectively becomes CPU bound and can't cope with any client requests
>> anymore. Stopping and restarting all OSDs brings the problem right back,
>> as does rebooting the machine: right after the ceph-osd processes start,
>> CPU utilization shoots up again. Stopping and marking out several OSDs on
>> the machine makes the problem go away, but obviously causes massive
>> backfilling. All the logs show, while CPU utilization is implausibly high,
>> are slow requests (which would be expected in a system that can barely do
>> anything).
>>
>> Now I've seen issues like this before on dumpling and firefly, but besides
>> the fact that they have all been addressed and should now be fixed, they
>> always involved the prior mass removal of RBD snapshots. This system only
>> used a handful of snapshots in testing, and is presently not using any
>> snapshots at all.
>>
>> I'll be spending some time looking for clues in the log files of the OSDs
>> whose shutdown made the problem go away, but if this sounds familiar to
>> anyone willing to offer clues, I'd be more than interested. :) Thanks!
>>
>> Cheers,
>> Florian
>
> Dan vd Ster was kind enough to pitch in an incredibly helpful off-list
> reply, which I am taking the liberty to paraphrase here:
>
> That "mysterious" OSD madness seems to be caused by NUMA zone reclaim,
> which is enabled by default on Intel machines with recent kernels. It can
> be disabled as follows:
>
> echo 0 > /proc/sys/vm/zone_reclaim_mode
>
> or of course with "sysctl -w vm.zone_reclaim_mode=0" or the corresponding
> sysctl.conf entry.
>
> On the machines affected, that seems to have removed the CPU pegging
> issue; at least it has not reappeared for several days now.
>
> Dan and Sage discussed the issue recently in this thread:
> http://www.spinics.net/lists/ceph-users/msg14914.html
>
> Thanks a million to Dan.

I'm looking into the original issue Florian describes above. It seems that
unsetting zone_reclaim_mode wasn't the magical fix we hoped for.
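(For anyone who wants to rule out zone reclaim on their own nodes first,
checking and persisting the setting looks roughly like the following; treat
it as a sketch, and note the sysctl.d file name is just an example:)

# current value; 0 means zone reclaim is disabled
cat /proc/sys/vm/zone_reclaim_mode

# disable it now, and keep it disabled across reboots
sysctl -w vm.zone_reclaim_mode=0
echo "vm.zone_reclaim_mode = 0" > /etc/sysctl.d/99-zone-reclaim.conf

# confirm the box actually has more than one NUMA node
numactl --hardware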
After a couple of weeks, we're seeing pegged CPUs again, but this time we
managed to get a perf top snapshot of it happening. These are the topmost
(ahem) lines:

  8.33%  [kernel]               [k] _raw_spin_lock
  3.14%  perf                   [.] 0x00000000000da124
  2.58%  [unknown]              [.] 0x00007f8a2901042d
  1.85%  libpython2.7.so.1.0    [.] 0x000000000006dac2
  1.61%  libc-2.17.so           [.] __memcpy_ssse3_back
  1.54%  perf                   [.] dso__find_symbol
  1.44%  libc-2.17.so           [.] __strcmp_sse42
  1.41%  libpython2.7.so.1.0    [.] PyEval_EvalFrameEx
  1.25%  [kernel]               [k] native_write_msr_safe
  1.24%  perf                   [.] hists__output_resort
  1.11%  libleveldb.so.1.0.7    [.] 0x000000000003cde8
  0.86%  perf                   [.] perf_evsel__parse_sample
  0.81%  libtcmalloc.so.4.1.2   [.] operator new(unsigned long)
  0.76%  libpython2.7.so.1.0    [.] PyEval_EvalFrameEx
  0.73%  [kernel]               [k] apic_timer_interrupt
  0.71%  [kernel]               [k] page_fault
  0.71%  [kernel]               [k] _raw_spin_lock_irqsave
  0.62%  libpthread-2.17.so     [.] pthread_mutex_unlock
  0.62%  libc-2.17.so           [.] __memcmp_sse4_1
  0.61%  libc-2.17.so           [.] _int_malloc
  0.60%  perf                   [.] rb_next
  0.58%  [kernel]               [k] clear_page_c_e
  0.56%  [kernel]               [k] tg_load_down

The server in question was booted without any OSDs. A few were started after
invoking 'perf top', and during that run the CPUs were saturated.

Any ideas?

Cheers!
Adolfo
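P.S. In case it helps anyone reproduce or dig further, something along these
lines should capture a call-graph profile of just the ceph-osd processes (a
rough sketch; the 30-second window is arbitrary, and installing the relevant
debuginfo packages should help resolve the raw hex addresses above):

# record ~30 seconds of call-graph samples from all running ceph-osd processes
perf record -g -p "$(pgrep -d, ceph-osd)" -- sleep 30

# summarize by process, library and symbol
perf report --sort comm,dso,symbol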