Re: ceph-osd pegging CPU on giant, no snapshots involved this time

On Wed, Feb 18, 2015 at 10:27 PM, Florian Haas <florian@xxxxxxxxxxx> wrote:
> On Wed, Feb 18, 2015 at 9:32 PM, Mark Nelson <mnelson@xxxxxxxxxx> wrote:
>> On 02/18/2015 02:19 PM, Florian Haas wrote:
>>>
>>> Hey everyone,
>>>
>>> I must confess I'm still not fully understanding this problem and
>>> don't exactly know where to start digging deeper, but perhaps other
>>> users have seen this and/or it rings a bell.
>>>
>>> System info: Ceph giant on CentOS 7; approx. 240 OSDs, 6 pools using 2
>>> different rulesets where the problem applies to hosts and PGs using a
>>> bog-standard default crushmap.
>>>
>>> Symptom: out of the blue, ceph-osd processes on a single OSD node
>>> start going to 100% CPU utilization. The problem gets so bad that
>>> the machine effectively becomes CPU-bound and can no longer cope
>>> with any client requests. Stopping and restarting all OSDs brings
>>> the problem right back, as does rebooting the machine — right after
>>> the ceph-osd processes start, CPU utilization shoots up again.
>>> Stopping and marking out several OSDs on the machine makes the
>>> problem go away, but obviously causes massive backfilling. The only
>>> thing the logs show while CPU utilization is implausibly high is
>>> slow requests (which would be expected in a system that can barely
>>> do anything).
>>>
>>> Now I've seen issues like this before on dumpling and firefly, but
>>> besides the fact that they have all been addressed and should now be
>>> fixed, they always involved the prior mass removal of RBD snapshots.
>>> This system only used a handful of snapshots in testing, and is
>>> presently not using any snapshots at all.
>>>
>>> I'll be spending some time looking for clues in the log files of
>>> the OSDs whose shutdown made the problem go away, but if this
>>> sounds familiar to anyone willing to offer clues, I'd be more than
>>> interested. :) Thanks!
>>
>>
>> Hi Florian,
>>
>> Does a quick perf top tell you anything useful?
>
> Hi Mark,
>
> Unfortunately, quite the contrary -- but this might actually provide a
> clue to the underlying issue.
>
> The CPU pegging issue isn't currently present, so any perf top data
> wouldn't be conclusive until the issue is reproduced. But: merely
> running perf top on this host, which currently has only 2 active OSDs,
> renders the host unresponsive.
>
> Corresponding dmesg snippet:
>
> [Wed Feb 18 20:53:42 2015] hrtimer: interrupt took 2243820 ns
> [Wed Feb 18 20:53:49 2015] ------------[ cut here ]------------
> [Wed Feb 18 20:53:49 2015] WARNING: at
> arch/x86/kernel/cpu/perf_event.c:1074 x86_pmu_start+0xc6/0x100()
> [Wed Feb 18 20:53:49 2015] Modules linked in: ipmi_si binfmt_misc
> mpt3sas mptctl mptbase dell_rbu 8021q garp stp mrp llc sg ipt_REJECT
> nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack
> iptable_filter ip_tables xfs vfat fat iTCO_wdt
> iTCO_vendor_support dcdbas coretemp kvm_intel kvm crct10dif_pclmul
> crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel lrw gf128mul
> glue_helper ablk_helper cryptd pcspkr sb_edac edac_core lpc_ich
> mfd_core mei_me mei ipmi_devintf
> shpchp wmi ipmi_msghandler acpi_power_meter acpi_cpufreq mperf nfsd
> auth_rpcgss nfs_acl lockd sunrpc ext4 mbcache jbd2 raid1 sd_mod
> crc_t10dif crct10dif_common mgag200 syscopyarea sysfillrect sysimgblt
> i2c_algo_bit drm_kms_helper ttm bnx2x
> drm mpt2sas i2c_core raid_class mdio scsi_transport_sas libcrc32c
> dm_mirror dm_region_hash dm_log dm_mod [last unloaded: ipmi_si]
>
> [Wed Feb 18 20:53:49 2015] CPU: 0 PID: 12381 Comm: dsm_sa_datamgrd Not
> tainted 3.10.0-123.20.1.el7.x86_64 #1
> [Wed Feb 18 20:53:49 2015] Hardware name: Dell Inc. PowerEdge
> R720xd/0020HJ, BIOS 2.2.2 01/16/2014
> [Wed Feb 18 20:53:49 2015] 0000000000000000 0000000050de8931
> ffff880fef003d40 ffffffff815e2b0c
> [Wed Feb 18 20:53:49 2015] ffff880fef003d78 ffffffff8105dee1
> ffff880c316a7400 ffff880fef00b9e0
> [Wed Feb 18 20:53:49 2015] 0000000000000000 ffff880fef016db0
> ffff880dbaa896c0 ffff880fef003d88
> [Wed Feb 18 20:53:49 2015] Call Trace:
> [Wed Feb 18 20:53:49 2015]  <IRQ>  [<ffffffff815e2b0c>] dump_stack+0x19/0x1b
> [Wed Feb 18 20:53:49 2015] [<ffffffff8105dee1>] warn_slowpath_common+0x61/0x80
> [Wed Feb 18 20:53:49 2015] [<ffffffff8105e00a>] warn_slowpath_null+0x1a/0x20
> [Wed Feb 18 20:53:49 2015] [<ffffffff81023706>] x86_pmu_start+0xc6/0x100
> [Wed Feb 18 20:53:49 2015] [<ffffffff81136128>]
> perf_adjust_freq_unthr_context.part.79+0x198/0x1b0
> [Wed Feb 18 20:53:49 2015] [<ffffffff811363d6>] perf_event_task_tick+0xb6/0xf0
> [Wed Feb 18 20:53:49 2015] [<ffffffff810967e5>] scheduler_tick+0xd5/0x150
> [Wed Feb 18 20:53:49 2015] [<ffffffff8106fe86>] update_process_times+0x66/0x80
> [Wed Feb 18 20:53:49 2015] [<ffffffff810be055>]
> tick_sched_handle.isra.16+0x25/0x60
> [Wed Feb 18 20:53:49 2015] [<ffffffff810be0d1>] tick_sched_timer+0x41/0x60
> [Wed Feb 18 20:53:49 2015] [<ffffffff81089a57>] __run_hrtimer+0x77/0x1d0
> [Wed Feb 18 20:53:49 2015] [<ffffffff810be090>] ?
> tick_sched_handle.isra.16+0x60/0x60
> [Wed Feb 18 20:53:49 2015] [<ffffffff8108a297>] hrtimer_interrupt+0xf7/0x240
> [Wed Feb 18 20:53:49 2015] [<ffffffff81039717>]
> local_apic_timer_interrupt+0x37/0x60
> [Wed Feb 18 20:53:49 2015] [<ffffffff815f552f>]
> smp_apic_timer_interrupt+0x3f/0x60
> [Wed Feb 18 20:53:49 2015] [<ffffffff815f3e9d>] apic_timer_interrupt+0x6d/0x80
> [Wed Feb 18 20:53:49 2015]  <EOI>  [<ffffffff81092d46>] ?
> finish_task_switch+0x56/0x170
> [Wed Feb 18 20:53:49 2015] [<ffffffff815e7db5>] __schedule+0x2c5/0x790
> [Wed Feb 18 20:53:49 2015] [<ffffffff815e82a9>] schedule+0x29/0x70
> [Wed Feb 18 20:53:49 2015] [<ffffffff815e73b5>] do_nanosleep+0xc5/0x130
> [Wed Feb 18 20:53:49 2015] [<ffffffff8108a78d>] hrtimer_nanosleep+0xad/0x170
> [Wed Feb 18 20:53:49 2015] [<ffffffff81089720>] ? hrtimer_get_res+0x50/0x50
> [Wed Feb 18 20:53:49 2015] [<ffffffff815e7361>] ? do_nanosleep+0x71/0x130
> [Wed Feb 18 20:53:49 2015] [<ffffffff8108a8ce>] SyS_nanosleep+0x7e/0x90
> [Wed Feb 18 20:53:49 2015] [<ffffffff815f3219>] system_call_fastpath+0x16/0x1b
> [Wed Feb 18 20:53:49 2015] ---[ end trace bbcc155358ec2b47 ]---
> [Wed Feb 18 21:00:35 2015] ------------[ cut here ]------------
> [Wed Feb 18 21:00:35 2015] WARNING: at
> arch/x86/kernel/cpu/perf_event_intel.c:1205
> intel_pmu_handle_irq+0x3f7/0x410()
> [Wed Feb 18 21:00:35 2015] perfevents: irq loop stuck!
> [Wed Feb 18 21:00:35 2015] Modules linked in: ipmi_si binfmt_misc
> mpt3sas mptctl mptbase dell_rbu 8021q garp stp mrp llc sg ipt_REJECT
> nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack
> iptable_filter ip_tables xfs vfat fat iTCO_wdt
> iTCO_vendor_support dcdbas coretemp kvm_intel kvm crct10dif_pclmul
> crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel lrw gf128mul
> glue_helper ablk_helper cryptd pcspkr sb_edac edac_core lpc_ich
> mfd_core mei_me mei ipmi_devintf
> shpchp wmi ipmi_msghandler acpi_power_meter acpi_cpufreq mperf nfsd
> auth_rpcgss nfs_acl lockd sunrpc ext4 mbcache jbd2 raid1 sd_mod
> crc_t10dif crct10dif_common mgag200 syscopyarea sysfillrect sysimgblt
> i2c_algo_bit drm_kms_helper ttm bnx2x
> drm mpt2sas i2c_core raid_class mdio scsi_transport_sas libcrc32c
> dm_mirror dm_region_hash dm_log dm_mod [last unloaded: ipmi_si]
>
> [Wed Feb 18 21:00:35 2015] CPU: 9 PID: 4608 Comm: ceph-osd Tainted: G
>       W   --------------   3.10.0-123.20.1.el7.x86_64 #1
> [Wed Feb 18 21:00:35 2015] Hardware name: Dell Inc. PowerEdge
> R720xd/0020HJ, BIOS 2.2.2 01/16/2014
> [Wed Feb 18 21:00:35 2015] ffff880fef125d10 000000000bc39f12
> ffff880fef125cc8 ffffffff815e2b0c
> [Wed Feb 18 21:00:35 2015] ffff880fef125d00 ffffffff8105dee1
> 0000000000000064 ffff880fef12b9e0
> [Wed Feb 18 21:00:35 2015] ffff880c316a4400 0000000000000040
> ffff880fef12bbe0 ffff880fef125d68
> [Wed Feb 18 21:00:35 2015] Call Trace:
> [Wed Feb 18 21:00:35 2015]  <NMI>  [<ffffffff815e2b0c>] dump_stack+0x19/0x1b
> [Wed Feb 18 21:00:35 2015] [<ffffffff8105dee1>] warn_slowpath_common+0x61/0x80
> [Wed Feb 18 21:00:35 2015] [<ffffffff8105df5c>] warn_slowpath_fmt+0x5c/0x80
> [Wed Feb 18 21:00:35 2015] [<ffffffff810294c7>] intel_pmu_handle_irq+0x3f7/0x410
> [Wed Feb 18 21:00:35 2015] [<ffffffff81356ff4>] ?
> ghes_copy_tofrom_phys+0x124/0x210
> [Wed Feb 18 21:00:35 2015] [<ffffffff8101a603>] ? native_sched_clock+0x13/0x80
> [Wed Feb 18 21:00:35 2015] [<ffffffff8101a679>] ? sched_clock+0x9/0x10
> [Wed Feb 18 21:00:35 2015] [<ffffffff8109a045>] ? sched_clock_cpu+0xb5/0x100
> [Wed Feb 18 21:00:35 2015] [<ffffffff815ec08b>] perf_event_nmi_handler+0x2b/0x50
> [Wed Feb 18 21:00:35 2015] [<ffffffff815eb7d9>] nmi_handle.isra.0+0x69/0xb0
> [Wed Feb 18 21:00:35 2015] [<ffffffff815eb989>] do_nmi+0x169/0x340
> [Wed Feb 18 21:00:35 2015] [<ffffffff815eac31>] end_repeat_nmi+0x1e/0x2e
> [Wed Feb 18 21:00:35 2015] [<ffffffff8104620a>] ? native_write_msr_safe+0xa/0x10
> [Wed Feb 18 21:00:35 2015] [<ffffffff8104620a>] ? native_write_msr_safe+0xa/0x10
> [Wed Feb 18 21:00:35 2015] [<ffffffff8104620a>] ? native_write_msr_safe+0xa/0x10
> [Wed Feb 18 21:00:35 2015]  <<EOE>>  <IRQ>  [<ffffffff81029bc1>]
> intel_pmu_disable_event+0xa1/0x160
> [Wed Feb 18 21:00:35 2015] [<ffffffff81022956>] x86_pmu_stop+0x76/0xd0
> [Wed Feb 18 21:00:35 2015] [<ffffffff81136094>]
> perf_adjust_freq_unthr_context.part.79+0x104/0x1b0
> [Wed Feb 18 21:00:35 2015] [<ffffffff811363d6>] perf_event_task_tick+0xb6/0xf0
> [Wed Feb 18 21:00:35 2015] [<ffffffff810967e5>] scheduler_tick+0xd5/0x150
> [Wed Feb 18 21:00:35 2015] [<ffffffff8106fe86>] update_process_times+0x66/0x80
> [Wed Feb 18 21:00:35 2015] [<ffffffff810be055>]
> tick_sched_handle.isra.16+0x25/0x60
> [Wed Feb 18 21:00:35 2015] [<ffffffff810be0d1>] tick_sched_timer+0x41/0x60
> [Wed Feb 18 21:00:35 2015] [<ffffffff81089a57>] __run_hrtimer+0x77/0x1d0
> [Wed Feb 18 21:00:35 2015] [<ffffffff810be090>] ?
> tick_sched_handle.isra.16+0x60/0x60
> [Wed Feb 18 21:00:35 2015] [<ffffffff8108a297>] hrtimer_interrupt+0xf7/0x240
> [Wed Feb 18 21:00:35 2015] [<ffffffff81039717>]
> local_apic_timer_interrupt+0x37/0x60
> [Wed Feb 18 21:00:35 2015] [<ffffffff815f552f>]
> smp_apic_timer_interrupt+0x3f/0x60
> [Wed Feb 18 21:00:35 2015] [<ffffffff815f3e9d>] apic_timer_interrupt+0x6d/0x80
> [Wed Feb 18 21:00:35 2015]  <EOI>  [<ffffffff811ca796>] ?
> file_update_time+0xe6/0xf0
> [Wed Feb 18 21:00:35 2015] [<ffffffff811ca70f>] ? file_update_time+0x5f/0xf0
> [Wed Feb 18 21:00:35 2015] [<ffffffff811437a8>]
> __generic_file_aio_write+0x198/0x3e0
> [Wed Feb 18 21:00:35 2015] [<ffffffff81143a4d>] generic_file_aio_write+0x5d/0xc0
> [Wed Feb 18 21:00:35 2015] [<ffffffffa02a51a9>]
> ext4_file_write+0xa9/0x450 [ext4]
> [Wed Feb 18 21:00:35 2015] [<ffffffff810976b6>] ? try_to_wake_up+0x1b6/0x280
> [Wed Feb 18 21:00:35 2015] [<ffffffff810bfd68>] ? get_futex_key+0x1c8/0x2a0
> [Wed Feb 18 21:00:35 2015] [<ffffffff811af0cd>] do_sync_write+0x8d/0xd0
> [Wed Feb 18 21:00:35 2015] [<ffffffff811af86d>] vfs_write+0xbd/0x1e0
> [Wed Feb 18 21:00:35 2015] [<ffffffff811b02b8>] SyS_write+0x58/0xb0
> [Wed Feb 18 21:00:35 2015] [<ffffffff815f3219>] system_call_fastpath+0x16/0x1b
> [Wed Feb 18 21:00:35 2015] ---[ end trace bbcc155358ec2b48 ]---
>
> [Wed Feb 18 21:00:35 2015] CPU#9: ctrl:  0000000000000000
> [Wed Feb 18 21:00:35 2015] CPU#9: status:  0000000000000001
> [Wed Feb 18 21:00:35 2015] CPU#9: overflow:  0000000000000000
> [Wed Feb 18 21:00:35 2015] CPU#9: fixed: 00000000000000b0
> [Wed Feb 18 21:00:35 2015] CPU#9: pebs:  0000000000000000
> [Wed Feb 18 21:00:35 2015] CPU#9: active:  0000000200000000
> [Wed Feb 18 21:00:35 2015] CPU#9:   gen-PMC0 ctrl:  000000000013003c
> [Wed Feb 18 21:00:35 2015] CPU#9:  gen-PMC0 count: 0000000000000000
> [Wed Feb 18 21:00:35 2015] CPU#9:   gen-PMC0 left:  0000000000007a1c
> [Wed Feb 18 21:00:35 2015] CPU#9:   gen-PMC1 ctrl:  0000000000000000
> [Wed Feb 18 21:00:35 2015] CPU#9:  gen-PMC1 count: 0000000000000000
> [Wed Feb 18 21:00:35 2015] CPU#9:   gen-PMC1 left:  0000000000000000
> [Wed Feb 18 21:00:35 2015] CPU#9:   gen-PMC2 ctrl:  0000000000000000
> [Wed Feb 18 21:00:35 2015] CPU#9:  gen-PMC2 count: 0000000000000000
> [Wed Feb 18 21:00:35 2015] CPU#9:   gen-PMC2 left:  0000000000000000
> [Wed Feb 18 21:00:35 2015] CPU#9:   gen-PMC3 ctrl:  0000000000000000
> [Wed Feb 18 21:00:35 2015] CPU#9:  gen-PMC3 count: 0000000000000000
> [Wed Feb 18 21:00:35 2015] CPU#9:   gen-PMC3 left:  0000000000000000
> [Wed Feb 18 21:00:35 2015] CPU#9: fixed-PMC0 count: 0000000000000000
> [Wed Feb 18 21:00:35 2015] CPU#9: fixed-PMC1 count: 0000fffbe795d19b
> [Wed Feb 18 21:00:35 2015] CPU#9: fixed-PMC2 count: 0000000000000000
> [Wed Feb 18 21:00:35 2015] perf_event_intel: clearing PMU state on CPU#9
> [Wed Feb 18 21:00:35 2015] perf samples too long (64102 > 62500),
> lowering kernel.perf_event_max_sample_rate to 2000
>
> In case you're wondering, no the ext4 file write does not come from an
> OSD. All OSDs use XFS, and the node does not host a mon either.
>
> The kernel build is the latest currently available for CentOS 7, with
> matching perf obviously.

I wonder, has anyone else seen this? I'm trying to determine whether:

- this is just a generally fishy system;
- something in the kernel or its matching perf tools could be causing
this; or
- the root cause lies somewhere in Ceph.

Either way, a simple "perf top" basically hosing the system is hardly
what anyone would consider expected behavior.
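In case it helps anyone trying to reproduce this: the trace above shows
the kernel throttling kernel.perf_event_max_sample_rate to 2000 on its
own. A gentler way to profile might be to cap the sample rate up front
and record one process at a fixed low frequency instead of running perf
top interactively. The commands below are standard perf/sysctl usage,
but the specific values (1000, -F 99, 30 s) are guesses, not tuned
numbers:

```shell
# Pre-emptively cap the PMU sample rate so the kernel doesn't have to
# throttle it mid-profile (1000 is an assumption, not a tuned value)
sysctl -w kernel.perf_event_max_sample_rate=1000

# Record a single ceph-osd at a fixed low frequency rather than running
# perf top interactively, to reduce the chance of a PMU interrupt storm
# taking out the whole box
perf record -F 99 -g -p "$(pidof ceph-osd | awk '{print $1}')" -- sleep 30

# Inspect the hottest functions afterwards, offline
perf report --stdio | head -40
```

(Needs root and a working PMU, obviously; on this particular box even
that might misbehave, given the NMI warnings above.)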

Cheers,
Florian
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




