On Wed, Feb 18, 2015 at 9:32 PM, Mark Nelson <mnelson@xxxxxxxxxx> wrote:
> On 02/18/2015 02:19 PM, Florian Haas wrote:
>>
>> Hey everyone,
>>
>> I must confess I'm still not fully understanding this problem and
>> don't exactly know where to start digging deeper, but perhaps other
>> users have seen this and/or it rings a bell.
>>
>> System info: Ceph giant on CentOS 7; approx. 240 OSDs, 6 pools using 2
>> different rulesets, where the problem applies to hosts and PGs using a
>> bog-standard default crushmap.
>>
>> Symptom: out of the blue, ceph-osd processes on a single OSD node
>> start going to 100% CPU utilization. The problem turns so bad that
>> the machine effectively becomes CPU bound and can't cope with any
>> client requests anymore. Stopping and restarting all OSDs brings the
>> problem right back, as does rebooting the machine — right after
>> the ceph-osd processes start, CPU utilization shoots up again. Stopping
>> and marking out several OSDs on the machine makes the problem go away,
>> but obviously causes massive backfilling. All the logs show while CPU
>> utilization is implausibly high is slow requests (which would be
>> expected on a system that can barely do anything).
>>
>> Now I've seen issues like this before on dumpling and firefly, but
>> besides the fact that they have all been addressed and should now be
>> fixed, they always involved the prior mass removal of RBD snapshots.
>> This system only used a handful of snapshots in testing, and is
>> presently not using any snapshots at all.
>>
>> I'll be spending some time looking for clues in the log files of the
>> OSDs that were shut down (which made the problem go away), but if
>> this sounds familiar to anyone willing to offer clues, I'd be more
>> than interested. :) Thanks!
>
> Hi Florian,
>
> Does a quick perf top tell you anything useful?

Hi Mark,

Unfortunately, quite the contrary -- but this might actually provide a clue to the underlying issue.
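(For reference, a profiling session of the kind Mark suggests might look like the sketch below. The PID lookup via pgrep is an assumption for illustration, not something from this thread; on a many-OSD node you would pick the specific busy PID instead.)

```shell
# Sketch: profile one ceph-osd process (PID lookup is illustrative only).
# Live view of the hottest functions in that process:
perf top -p "$(pgrep -o ceph-osd)"

# Or record 30 seconds with call graphs for offline analysis:
perf record -g -p "$(pgrep -o ceph-osd)" -- sleep 30
perf report --stdio | head -40
```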
The CPU pegging issue isn't currently present, so perf top data wouldn't be conclusive until the issue is reproduced. But: merely running perf top on this host, which currently only has 2 active OSDs, renders the host unresponsive. Corresponding dmesg snippet:

[Wed Feb 18 20:53:42 2015] hrtimer: interrupt took 2243820 ns
[Wed Feb 18 20:53:49 2015] ------------[ cut here ]------------
[Wed Feb 18 20:53:49 2015] WARNING: at arch/x86/kernel/cpu/perf_event.c:1074 x86_pmu_start+0xc6/0x100()
[Wed Feb 18 20:53:49 2015] Modules linked in: ipmi_si binfmt_misc mpt3sas mptctl mptbase dell_rbu 8021q garp stp mrp llc sg ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack iptable_filter ip_tables xfs vfat fat iTCO_wdt iTCO_vendor_support dcdbas coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr sb_edac edac_core lpc_ich mfd_core mei_me mei ipmi_devintf shpchp wmi ipmi_msghandler acpi_power_meter acpi_cpufreq mperf nfsd auth_rpcgss nfs_acl lockd sunrpc ext4 mbcache jbd2 raid1 sd_mod crc_t10dif crct10dif_common mgag200 syscopyarea sysfillrect sysimgblt i2c_algo_bit drm_kms_helper ttm bnx2x drm mpt2sas i2c_core raid_class mdio scsi_transport_sas libcrc32c dm_mirror dm_region_hash dm_log dm_mod [last unloaded: ipmi_si]
[Wed Feb 18 20:53:49 2015] CPU: 0 PID: 12381 Comm: dsm_sa_datamgrd Not tainted 3.10.0-123.20.1.el7.x86_64 #1
[Wed Feb 18 20:53:49 2015] Hardware name: Dell Inc. PowerEdge R720xd/0020HJ, BIOS 2.2.2 01/16/2014
[Wed Feb 18 20:53:49 2015] 0000000000000000 0000000050de8931 ffff880fef003d40 ffffffff815e2b0c
[Wed Feb 18 20:53:49 2015] ffff880fef003d78 ffffffff8105dee1 ffff880c316a7400 ffff880fef00b9e0
[Wed Feb 18 20:53:49 2015] 0000000000000000 ffff880fef016db0 ffff880dbaa896c0 ffff880fef003d88
[Wed Feb 18 20:53:49 2015] Call Trace:
[Wed Feb 18 20:53:49 2015] <IRQ> [<ffffffff815e2b0c>] dump_stack+0x19/0x1b
[Wed Feb 18 20:53:49 2015] [<ffffffff8105dee1>] warn_slowpath_common+0x61/0x80
[Wed Feb 18 20:53:49 2015] [<ffffffff8105e00a>] warn_slowpath_null+0x1a/0x20
[Wed Feb 18 20:53:49 2015] [<ffffffff81023706>] x86_pmu_start+0xc6/0x100
[Wed Feb 18 20:53:49 2015] [<ffffffff81136128>] perf_adjust_freq_unthr_context.part.79+0x198/0x1b0
[Wed Feb 18 20:53:49 2015] [<ffffffff811363d6>] perf_event_task_tick+0xb6/0xf0
[Wed Feb 18 20:53:49 2015] [<ffffffff810967e5>] scheduler_tick+0xd5/0x150
[Wed Feb 18 20:53:49 2015] [<ffffffff8106fe86>] update_process_times+0x66/0x80
[Wed Feb 18 20:53:49 2015] [<ffffffff810be055>] tick_sched_handle.isra.16+0x25/0x60
[Wed Feb 18 20:53:49 2015] [<ffffffff810be0d1>] tick_sched_timer+0x41/0x60
[Wed Feb 18 20:53:49 2015] [<ffffffff81089a57>] __run_hrtimer+0x77/0x1d0
[Wed Feb 18 20:53:49 2015] [<ffffffff810be090>] ? tick_sched_handle.isra.16+0x60/0x60
[Wed Feb 18 20:53:49 2015] [<ffffffff8108a297>] hrtimer_interrupt+0xf7/0x240
[Wed Feb 18 20:53:49 2015] [<ffffffff81039717>] local_apic_timer_interrupt+0x37/0x60
[Wed Feb 18 20:53:49 2015] [<ffffffff815f552f>] smp_apic_timer_interrupt+0x3f/0x60
[Wed Feb 18 20:53:49 2015] [<ffffffff815f3e9d>] apic_timer_interrupt+0x6d/0x80
[Wed Feb 18 20:53:49 2015] <EOI> [<ffffffff81092d46>] ? finish_task_switch+0x56/0x170
[Wed Feb 18 20:53:49 2015] [<ffffffff815e7db5>] __schedule+0x2c5/0x790
[Wed Feb 18 20:53:49 2015] [<ffffffff815e82a9>] schedule+0x29/0x70
[Wed Feb 18 20:53:49 2015] [<ffffffff815e73b5>] do_nanosleep+0xc5/0x130
[Wed Feb 18 20:53:49 2015] [<ffffffff8108a78d>] hrtimer_nanosleep+0xad/0x170
[Wed Feb 18 20:53:49 2015] [<ffffffff81089720>] ? hrtimer_get_res+0x50/0x50
[Wed Feb 18 20:53:49 2015] [<ffffffff815e7361>] ? do_nanosleep+0x71/0x130
[Wed Feb 18 20:53:49 2015] [<ffffffff8108a8ce>] SyS_nanosleep+0x7e/0x90
[Wed Feb 18 20:53:49 2015] [<ffffffff815f3219>] system_call_fastpath+0x16/0x1b
[Wed Feb 18 20:53:49 2015] ---[ end trace bbcc155358ec2b47 ]---
[Wed Feb 18 21:00:35 2015] ------------[ cut here ]------------
[Wed Feb 18 21:00:35 2015] WARNING: at arch/x86/kernel/cpu/perf_event_intel.c:1205 intel_pmu_handle_irq+0x3f7/0x410()
[Wed Feb 18 21:00:35 2015] perfevents: irq loop stuck!
[Wed Feb 18 21:00:35 2015] Modules linked in: ipmi_si binfmt_misc mpt3sas mptctl mptbase dell_rbu 8021q garp stp mrp llc sg ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack iptable_filter ip_tables xfs vfat fat iTCO_wdt iTCO_vendor_support dcdbas coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr sb_edac edac_core lpc_ich mfd_core mei_me mei ipmi_devintf shpchp wmi ipmi_msghandler acpi_power_meter acpi_cpufreq mperf nfsd auth_rpcgss nfs_acl lockd sunrpc ext4 mbcache jbd2 raid1 sd_mod crc_t10dif crct10dif_common mgag200 syscopyarea sysfillrect sysimgblt i2c_algo_bit drm_kms_helper ttm bnx2x drm mpt2sas i2c_core raid_class mdio scsi_transport_sas libcrc32c dm_mirror dm_region_hash dm_log dm_mod [last unloaded: ipmi_si]
[Wed Feb 18 21:00:35 2015] CPU: 9 PID: 4608 Comm: ceph-osd Tainted: G        W    --------------   3.10.0-123.20.1.el7.x86_64 #1
[Wed Feb 18 21:00:35 2015] Hardware name: Dell Inc. PowerEdge R720xd/0020HJ, BIOS 2.2.2 01/16/2014
[Wed Feb 18 21:00:35 2015] ffff880fef125d10 000000000bc39f12 ffff880fef125cc8 ffffffff815e2b0c
[Wed Feb 18 21:00:35 2015] ffff880fef125d00 ffffffff8105dee1 0000000000000064 ffff880fef12b9e0
[Wed Feb 18 21:00:35 2015] ffff880c316a4400 0000000000000040 ffff880fef12bbe0 ffff880fef125d68
[Wed Feb 18 21:00:35 2015] Call Trace:
[Wed Feb 18 21:00:35 2015] <NMI> [<ffffffff815e2b0c>] dump_stack+0x19/0x1b
[Wed Feb 18 21:00:35 2015] [<ffffffff8105dee1>] warn_slowpath_common+0x61/0x80
[Wed Feb 18 21:00:35 2015] [<ffffffff8105df5c>] warn_slowpath_fmt+0x5c/0x80
[Wed Feb 18 21:00:35 2015] [<ffffffff810294c7>] intel_pmu_handle_irq+0x3f7/0x410
[Wed Feb 18 21:00:35 2015] [<ffffffff81356ff4>] ? ghes_copy_tofrom_phys+0x124/0x210
[Wed Feb 18 21:00:35 2015] [<ffffffff8101a603>] ? native_sched_clock+0x13/0x80
[Wed Feb 18 21:00:35 2015] [<ffffffff8101a679>] ? sched_clock+0x9/0x10
[Wed Feb 18 21:00:35 2015] [<ffffffff8109a045>] ? sched_clock_cpu+0xb5/0x100
[Wed Feb 18 21:00:35 2015] [<ffffffff815ec08b>] perf_event_nmi_handler+0x2b/0x50
[Wed Feb 18 21:00:35 2015] [<ffffffff815eb7d9>] nmi_handle.isra.0+0x69/0xb0
[Wed Feb 18 21:00:35 2015] [<ffffffff815eb989>] do_nmi+0x169/0x340
[Wed Feb 18 21:00:35 2015] [<ffffffff815eac31>] end_repeat_nmi+0x1e/0x2e
[Wed Feb 18 21:00:35 2015] [<ffffffff8104620a>] ? native_write_msr_safe+0xa/0x10
[Wed Feb 18 21:00:35 2015] [<ffffffff8104620a>] ? native_write_msr_safe+0xa/0x10
[Wed Feb 18 21:00:35 2015] [<ffffffff8104620a>] ? native_write_msr_safe+0xa/0x10
[Wed Feb 18 21:00:35 2015] <<EOE>> <IRQ> [<ffffffff81029bc1>] intel_pmu_disable_event+0xa1/0x160
[Wed Feb 18 21:00:35 2015] [<ffffffff81022956>] x86_pmu_stop+0x76/0xd0
[Wed Feb 18 21:00:35 2015] [<ffffffff81136094>] perf_adjust_freq_unthr_context.part.79+0x104/0x1b0
[Wed Feb 18 21:00:35 2015] [<ffffffff811363d6>] perf_event_task_tick+0xb6/0xf0
[Wed Feb 18 21:00:35 2015] [<ffffffff810967e5>] scheduler_tick+0xd5/0x150
[Wed Feb 18 21:00:35 2015] [<ffffffff8106fe86>] update_process_times+0x66/0x80
[Wed Feb 18 21:00:35 2015] [<ffffffff810be055>] tick_sched_handle.isra.16+0x25/0x60
[Wed Feb 18 21:00:35 2015] [<ffffffff810be0d1>] tick_sched_timer+0x41/0x60
[Wed Feb 18 21:00:35 2015] [<ffffffff81089a57>] __run_hrtimer+0x77/0x1d0
[Wed Feb 18 21:00:35 2015] [<ffffffff810be090>] ? tick_sched_handle.isra.16+0x60/0x60
[Wed Feb 18 21:00:35 2015] [<ffffffff8108a297>] hrtimer_interrupt+0xf7/0x240
[Wed Feb 18 21:00:35 2015] [<ffffffff81039717>] local_apic_timer_interrupt+0x37/0x60
[Wed Feb 18 21:00:35 2015] [<ffffffff815f552f>] smp_apic_timer_interrupt+0x3f/0x60
[Wed Feb 18 21:00:35 2015] [<ffffffff815f3e9d>] apic_timer_interrupt+0x6d/0x80
[Wed Feb 18 21:00:35 2015] <EOI> [<ffffffff811ca796>] ? file_update_time+0xe6/0xf0
[Wed Feb 18 21:00:35 2015] [<ffffffff811ca70f>] ? file_update_time+0x5f/0xf0
[Wed Feb 18 21:00:35 2015] [<ffffffff811437a8>] __generic_file_aio_write+0x198/0x3e0
[Wed Feb 18 21:00:35 2015] [<ffffffff81143a4d>] generic_file_aio_write+0x5d/0xc0
[Wed Feb 18 21:00:35 2015] [<ffffffffa02a51a9>] ext4_file_write+0xa9/0x450 [ext4]
[Wed Feb 18 21:00:35 2015] [<ffffffff810976b6>] ? try_to_wake_up+0x1b6/0x280
[Wed Feb 18 21:00:35 2015] [<ffffffff810bfd68>] ? get_futex_key+0x1c8/0x2a0
[Wed Feb 18 21:00:35 2015] [<ffffffff811af0cd>] do_sync_write+0x8d/0xd0
[Wed Feb 18 21:00:35 2015] [<ffffffff811af86d>] vfs_write+0xbd/0x1e0
[Wed Feb 18 21:00:35 2015] [<ffffffff811b02b8>] SyS_write+0x58/0xb0
[Wed Feb 18 21:00:35 2015] [<ffffffff815f3219>] system_call_fastpath+0x16/0x1b
[Wed Feb 18 21:00:35 2015] ---[ end trace bbcc155358ec2b48 ]---
[Wed Feb 18 21:00:35 2015] CPU#9: ctrl:       0000000000000000
[Wed Feb 18 21:00:35 2015] CPU#9: status:     0000000000000001
[Wed Feb 18 21:00:35 2015] CPU#9: overflow:   0000000000000000
[Wed Feb 18 21:00:35 2015] CPU#9: fixed:      00000000000000b0
[Wed Feb 18 21:00:35 2015] CPU#9: pebs:       0000000000000000
[Wed Feb 18 21:00:35 2015] CPU#9: active:     0000000200000000
[Wed Feb 18 21:00:35 2015] CPU#9: gen-PMC0 ctrl:  000000000013003c
[Wed Feb 18 21:00:35 2015] CPU#9: gen-PMC0 count: 0000000000000000
[Wed Feb 18 21:00:35 2015] CPU#9: gen-PMC0 left:  0000000000007a1c
[Wed Feb 18 21:00:35 2015] CPU#9: gen-PMC1 ctrl:  0000000000000000
[Wed Feb 18 21:00:35 2015] CPU#9: gen-PMC1 count: 0000000000000000
[Wed Feb 18 21:00:35 2015] CPU#9: gen-PMC1 left:  0000000000000000
[Wed Feb 18 21:00:35 2015] CPU#9: gen-PMC2 ctrl:  0000000000000000
[Wed Feb 18 21:00:35 2015] CPU#9: gen-PMC2 count: 0000000000000000
[Wed Feb 18 21:00:35 2015] CPU#9: gen-PMC2 left:  0000000000000000
[Wed Feb 18 21:00:35 2015] CPU#9: gen-PMC3 ctrl:  0000000000000000
[Wed Feb 18 21:00:35 2015] CPU#9: gen-PMC3 count: 0000000000000000
[Wed Feb 18 21:00:35 2015] CPU#9: gen-PMC3 left:  0000000000000000
[Wed Feb 18 21:00:35 2015] CPU#9: fixed-PMC0 count: 0000000000000000
[Wed Feb 18 21:00:35 2015] CPU#9: fixed-PMC1 count: 0000fffbe795d19b
[Wed Feb 18 21:00:35 2015] CPU#9: fixed-PMC2 count: 0000000000000000
[Wed Feb 18 21:00:35 2015] perf_event_intel: clearing PMU state on CPU#9
[Wed Feb 18 21:00:35 2015] perf samples too long (64102 > 62500), lowering kernel.perf_event_max_sample_rate to 2000

In case you're wondering: no, the ext4 file write does not come from an OSD.
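(As an aside: since the kernel itself ended up throttling kernel.perf_event_max_sample_rate in that last line, one possible mitigation sketch, untested on this hardware and the specific values being illustrative only, is to lower the sampling rate up front before profiling:)

```shell
# Lower the perf sampling rate before profiling, so the PMU interrupt
# handler is less likely to fall behind (2000 mirrors the value the
# kernel auto-throttled to above; it is not a tuned recommendation):
sysctl kernel.perf_event_max_sample_rate=2000

# Or pass an explicitly low sampling frequency to perf itself:
perf top -F 99
```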
All OSDs use XFS, and the node does not host a mon either. The kernel build is the latest currently available for CentOS 7 (3.10.0-123.20.1.el7), with a matching perf build, obviously.

Does that help at all?

Cheers,
Florian
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com