Re: ceph-osd pegging CPU on giant, no snapshots involved this time

On 02/19/2015 10:56 AM, Florian Haas wrote:
On Wed, Feb 18, 2015 at 10:27 PM, Florian Haas <florian@xxxxxxxxxxx> wrote:
On Wed, Feb 18, 2015 at 9:32 PM, Mark Nelson <mnelson@xxxxxxxxxx> wrote:
On 02/18/2015 02:19 PM, Florian Haas wrote:

Hey everyone,

I must confess I still don't fully understand this problem and don't
know exactly where to start digging deeper, but perhaps other users
have seen this and/or it rings a bell.

System info: Ceph giant on CentOS 7; approx. 240 OSDs, 6 pools using 2
different rulesets. The problem applies to hosts and PGs governed by a
bog-standard default crushmap.

Symptom: out of the blue, ceph-osd processes on a single OSD node
start going to 100% CPU utilization. The problem gets so bad that the
machine effectively becomes CPU-bound and can't cope with any client
requests anymore. Stopping and restarting all OSDs brings the problem
right back, as does rebooting the machine -- right after the ceph-osd
processes start, CPU utilization shoots up again. Stopping and marking
out several OSDs on the machine makes the problem go away, but
obviously causes massive backfilling. While CPU utilization is
implausibly high, all the logs show are slow requests, which would be
expected on a system that can barely do anything.
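
(By "stopping and marking out" I mean something along these lines --
the OSD ID is an example only, and this assumes the sysvinit-style
init script that giant ships with on CentOS 7:

  service ceph stop osd.12
  ceph osd out 12

Marking the OSD out is what triggers the backfilling.)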

Now I've seen issues like this before on dumpling and firefly, but
besides the fact that those have all been addressed and should be
fixed by now, they always involved the prior mass removal of RBD
snapshots. This system only used a handful of snapshots in testing,
and is presently not using any snapshots at all.

I'll be spending some time looking for clues in the log files of the
OSDs whose shutdown made the problem go away, but if this sounds
familiar to anyone willing to offer clues, I'd be more than
interested. :) Thanks!
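
In case it helps anyone following along: that log sifting is initially
little more than grepping the OSD logs for slow request warnings
around the time CPU utilization spiked (log paths here are the stock
defaults):

  grep -i 'slow request' /var/log/ceph/ceph-osd.*.log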


Hi Florian,

Does a quick perf top tell you anything useful?
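
Something like this is what I have in mind -- substitute the PID of
one of the spinning ceph-osd processes; -g is optional but gives
call-graph context:

  perf top -g -p <ceph-osd-pid>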

Hi Mark,

Unfortunately, quite the contrary -- but this might actually provide a
clue to the underlying issue.

The CPU pegging issue isn't currently present, so any perf top data
wouldn't be conclusive until the issue reproduces. But: merely running
perf top on this host, which currently has only 2 active OSDs, renders
the host unresponsive.

Corresponding dmesg snippet:

[Wed Feb 18 20:53:42 2015] hrtimer: interrupt took 2243820 ns
[Wed Feb 18 20:53:49 2015] ------------[ cut here ]------------
[Wed Feb 18 20:53:49 2015] WARNING: at arch/x86/kernel/cpu/perf_event.c:1074 x86_pmu_start+0xc6/0x100()
[Wed Feb 18 20:53:49 2015] Modules linked in: ipmi_si binfmt_misc mpt3sas mptctl mptbase dell_rbu 8021q garp stp mrp llc sg ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack iptable_filter ip_tables xfs vfat fat iTCO_wdt iTCO_vendor_support dcdbas coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr sb_edac edac_core lpc_ich mfd_core mei_me mei ipmi_devintf shpchp wmi ipmi_msghandler acpi_power_meter acpi_cpufreq mperf nfsd auth_rpcgss nfs_acl lockd sunrpc ext4 mbcache jbd2 raid1 sd_mod crc_t10dif crct10dif_common mgag200 syscopyarea sysfillrect sysimgblt i2c_algo_bit drm_kms_helper ttm bnx2x drm mpt2sas i2c_core raid_class mdio scsi_transport_sas libcrc32c dm_mirror dm_region_hash dm_log dm_mod [last unloaded: ipmi_si]
[Wed Feb 18 20:53:49 2015] CPU: 0 PID: 12381 Comm: dsm_sa_datamgrd Not tainted 3.10.0-123.20.1.el7.x86_64 #1
[Wed Feb 18 20:53:49 2015] Hardware name: Dell Inc. PowerEdge R720xd/0020HJ, BIOS 2.2.2 01/16/2014
[Wed Feb 18 20:53:49 2015] 0000000000000000 0000000050de8931 ffff880fef003d40 ffffffff815e2b0c
[Wed Feb 18 20:53:49 2015] ffff880fef003d78 ffffffff8105dee1 ffff880c316a7400 ffff880fef00b9e0
[Wed Feb 18 20:53:49 2015] 0000000000000000 ffff880fef016db0 ffff880dbaa896c0 ffff880fef003d88
[Wed Feb 18 20:53:49 2015] Call Trace:
[Wed Feb 18 20:53:49 2015]  <IRQ>  [<ffffffff815e2b0c>] dump_stack+0x19/0x1b
[Wed Feb 18 20:53:49 2015] [<ffffffff8105dee1>] warn_slowpath_common+0x61/0x80
[Wed Feb 18 20:53:49 2015] [<ffffffff8105e00a>] warn_slowpath_null+0x1a/0x20
[Wed Feb 18 20:53:49 2015] [<ffffffff81023706>] x86_pmu_start+0xc6/0x100
[Wed Feb 18 20:53:49 2015] [<ffffffff81136128>] perf_adjust_freq_unthr_context.part.79+0x198/0x1b0
[Wed Feb 18 20:53:49 2015] [<ffffffff811363d6>] perf_event_task_tick+0xb6/0xf0
[Wed Feb 18 20:53:49 2015] [<ffffffff810967e5>] scheduler_tick+0xd5/0x150
[Wed Feb 18 20:53:49 2015] [<ffffffff8106fe86>] update_process_times+0x66/0x80
[Wed Feb 18 20:53:49 2015] [<ffffffff810be055>] tick_sched_handle.isra.16+0x25/0x60
[Wed Feb 18 20:53:49 2015] [<ffffffff810be0d1>] tick_sched_timer+0x41/0x60
[Wed Feb 18 20:53:49 2015] [<ffffffff81089a57>] __run_hrtimer+0x77/0x1d0
[Wed Feb 18 20:53:49 2015] [<ffffffff810be090>] ? tick_sched_handle.isra.16+0x60/0x60
[Wed Feb 18 20:53:49 2015] [<ffffffff8108a297>] hrtimer_interrupt+0xf7/0x240
[Wed Feb 18 20:53:49 2015] [<ffffffff81039717>] local_apic_timer_interrupt+0x37/0x60
[Wed Feb 18 20:53:49 2015] [<ffffffff815f552f>] smp_apic_timer_interrupt+0x3f/0x60
[Wed Feb 18 20:53:49 2015] [<ffffffff815f3e9d>] apic_timer_interrupt+0x6d/0x80
[Wed Feb 18 20:53:49 2015]  <EOI>  [<ffffffff81092d46>] ? finish_task_switch+0x56/0x170
[Wed Feb 18 20:53:49 2015] [<ffffffff815e7db5>] __schedule+0x2c5/0x790
[Wed Feb 18 20:53:49 2015] [<ffffffff815e82a9>] schedule+0x29/0x70
[Wed Feb 18 20:53:49 2015] [<ffffffff815e73b5>] do_nanosleep+0xc5/0x130
[Wed Feb 18 20:53:49 2015] [<ffffffff8108a78d>] hrtimer_nanosleep+0xad/0x170
[Wed Feb 18 20:53:49 2015] [<ffffffff81089720>] ? hrtimer_get_res+0x50/0x50
[Wed Feb 18 20:53:49 2015] [<ffffffff815e7361>] ? do_nanosleep+0x71/0x130
[Wed Feb 18 20:53:49 2015] [<ffffffff8108a8ce>] SyS_nanosleep+0x7e/0x90
[Wed Feb 18 20:53:49 2015] [<ffffffff815f3219>] system_call_fastpath+0x16/0x1b
[Wed Feb 18 20:53:49 2015] ---[ end trace bbcc155358ec2b47 ]---
[Wed Feb 18 21:00:35 2015] ------------[ cut here ]------------
[Wed Feb 18 21:00:35 2015] WARNING: at arch/x86/kernel/cpu/perf_event_intel.c:1205 intel_pmu_handle_irq+0x3f7/0x410()
[Wed Feb 18 21:00:35 2015] perfevents: irq loop stuck!
[Wed Feb 18 21:00:35 2015] Modules linked in: ipmi_si binfmt_misc mpt3sas mptctl mptbase dell_rbu 8021q garp stp mrp llc sg ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack iptable_filter ip_tables xfs vfat fat iTCO_wdt iTCO_vendor_support dcdbas coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr sb_edac edac_core lpc_ich mfd_core mei_me mei ipmi_devintf shpchp wmi ipmi_msghandler acpi_power_meter acpi_cpufreq mperf nfsd auth_rpcgss nfs_acl lockd sunrpc ext4 mbcache jbd2 raid1 sd_mod crc_t10dif crct10dif_common mgag200 syscopyarea sysfillrect sysimgblt i2c_algo_bit drm_kms_helper ttm bnx2x drm mpt2sas i2c_core raid_class mdio scsi_transport_sas libcrc32c dm_mirror dm_region_hash dm_log dm_mod [last unloaded: ipmi_si]
[Wed Feb 18 21:00:35 2015] CPU: 9 PID: 4608 Comm: ceph-osd Tainted: G        W   --------------   3.10.0-123.20.1.el7.x86_64 #1
[Wed Feb 18 21:00:35 2015] Hardware name: Dell Inc. PowerEdge R720xd/0020HJ, BIOS 2.2.2 01/16/2014
[Wed Feb 18 21:00:35 2015] ffff880fef125d10 000000000bc39f12 ffff880fef125cc8 ffffffff815e2b0c
[Wed Feb 18 21:00:35 2015] ffff880fef125d00 ffffffff8105dee1 0000000000000064 ffff880fef12b9e0
[Wed Feb 18 21:00:35 2015] ffff880c316a4400 0000000000000040 ffff880fef12bbe0 ffff880fef125d68
[Wed Feb 18 21:00:35 2015] Call Trace:
[Wed Feb 18 21:00:35 2015]  <NMI>  [<ffffffff815e2b0c>] dump_stack+0x19/0x1b
[Wed Feb 18 21:00:35 2015] [<ffffffff8105dee1>] warn_slowpath_common+0x61/0x80
[Wed Feb 18 21:00:35 2015] [<ffffffff8105df5c>] warn_slowpath_fmt+0x5c/0x80
[Wed Feb 18 21:00:35 2015] [<ffffffff810294c7>] intel_pmu_handle_irq+0x3f7/0x410
[Wed Feb 18 21:00:35 2015] [<ffffffff81356ff4>] ? ghes_copy_tofrom_phys+0x124/0x210
[Wed Feb 18 21:00:35 2015] [<ffffffff8101a603>] ? native_sched_clock+0x13/0x80
[Wed Feb 18 21:00:35 2015] [<ffffffff8101a679>] ? sched_clock+0x9/0x10
[Wed Feb 18 21:00:35 2015] [<ffffffff8109a045>] ? sched_clock_cpu+0xb5/0x100
[Wed Feb 18 21:00:35 2015] [<ffffffff815ec08b>] perf_event_nmi_handler+0x2b/0x50
[Wed Feb 18 21:00:35 2015] [<ffffffff815eb7d9>] nmi_handle.isra.0+0x69/0xb0
[Wed Feb 18 21:00:35 2015] [<ffffffff815eb989>] do_nmi+0x169/0x340
[Wed Feb 18 21:00:35 2015] [<ffffffff815eac31>] end_repeat_nmi+0x1e/0x2e
[Wed Feb 18 21:00:35 2015] [<ffffffff8104620a>] ? native_write_msr_safe+0xa/0x10
[Wed Feb 18 21:00:35 2015] [<ffffffff8104620a>] ? native_write_msr_safe+0xa/0x10
[Wed Feb 18 21:00:35 2015] [<ffffffff8104620a>] ? native_write_msr_safe+0xa/0x10
[Wed Feb 18 21:00:35 2015]  <<EOE>>  <IRQ>  [<ffffffff81029bc1>] intel_pmu_disable_event+0xa1/0x160
[Wed Feb 18 21:00:35 2015] [<ffffffff81022956>] x86_pmu_stop+0x76/0xd0
[Wed Feb 18 21:00:35 2015] [<ffffffff81136094>] perf_adjust_freq_unthr_context.part.79+0x104/0x1b0
[Wed Feb 18 21:00:35 2015] [<ffffffff811363d6>] perf_event_task_tick+0xb6/0xf0
[Wed Feb 18 21:00:35 2015] [<ffffffff810967e5>] scheduler_tick+0xd5/0x150
[Wed Feb 18 21:00:35 2015] [<ffffffff8106fe86>] update_process_times+0x66/0x80
[Wed Feb 18 21:00:35 2015] [<ffffffff810be055>] tick_sched_handle.isra.16+0x25/0x60
[Wed Feb 18 21:00:35 2015] [<ffffffff810be0d1>] tick_sched_timer+0x41/0x60
[Wed Feb 18 21:00:35 2015] [<ffffffff81089a57>] __run_hrtimer+0x77/0x1d0
[Wed Feb 18 21:00:35 2015] [<ffffffff810be090>] ? tick_sched_handle.isra.16+0x60/0x60
[Wed Feb 18 21:00:35 2015] [<ffffffff8108a297>] hrtimer_interrupt+0xf7/0x240
[Wed Feb 18 21:00:35 2015] [<ffffffff81039717>] local_apic_timer_interrupt+0x37/0x60
[Wed Feb 18 21:00:35 2015] [<ffffffff815f552f>] smp_apic_timer_interrupt+0x3f/0x60
[Wed Feb 18 21:00:35 2015] [<ffffffff815f3e9d>] apic_timer_interrupt+0x6d/0x80
[Wed Feb 18 21:00:35 2015]  <EOI>  [<ffffffff811ca796>] ? file_update_time+0xe6/0xf0
[Wed Feb 18 21:00:35 2015] [<ffffffff811ca70f>] ? file_update_time+0x5f/0xf0
[Wed Feb 18 21:00:35 2015] [<ffffffff811437a8>] __generic_file_aio_write+0x198/0x3e0
[Wed Feb 18 21:00:35 2015] [<ffffffff81143a4d>] generic_file_aio_write+0x5d/0xc0
[Wed Feb 18 21:00:35 2015] [<ffffffffa02a51a9>] ext4_file_write+0xa9/0x450 [ext4]
[Wed Feb 18 21:00:35 2015] [<ffffffff810976b6>] ? try_to_wake_up+0x1b6/0x280
[Wed Feb 18 21:00:35 2015] [<ffffffff810bfd68>] ? get_futex_key+0x1c8/0x2a0
[Wed Feb 18 21:00:35 2015] [<ffffffff811af0cd>] do_sync_write+0x8d/0xd0
[Wed Feb 18 21:00:35 2015] [<ffffffff811af86d>] vfs_write+0xbd/0x1e0
[Wed Feb 18 21:00:35 2015] [<ffffffff811b02b8>] SyS_write+0x58/0xb0
[Wed Feb 18 21:00:35 2015] [<ffffffff815f3219>] system_call_fastpath+0x16/0x1b
[Wed Feb 18 21:00:35 2015] ---[ end trace bbcc155358ec2b48 ]---

[Wed Feb 18 21:00:35 2015] CPU#9: ctrl:       0000000000000000
[Wed Feb 18 21:00:35 2015] CPU#9: status:     0000000000000001
[Wed Feb 18 21:00:35 2015] CPU#9: overflow:   0000000000000000
[Wed Feb 18 21:00:35 2015] CPU#9: fixed:      00000000000000b0
[Wed Feb 18 21:00:35 2015] CPU#9: pebs:       0000000000000000
[Wed Feb 18 21:00:35 2015] CPU#9: active:     0000000200000000
[Wed Feb 18 21:00:35 2015] CPU#9:   gen-PMC0 ctrl:  000000000013003c
[Wed Feb 18 21:00:35 2015] CPU#9:   gen-PMC0 count: 0000000000000000
[Wed Feb 18 21:00:35 2015] CPU#9:   gen-PMC0 left:  0000000000007a1c
[Wed Feb 18 21:00:35 2015] CPU#9:   gen-PMC1 ctrl:  0000000000000000
[Wed Feb 18 21:00:35 2015] CPU#9:   gen-PMC1 count: 0000000000000000
[Wed Feb 18 21:00:35 2015] CPU#9:   gen-PMC1 left:  0000000000000000
[Wed Feb 18 21:00:35 2015] CPU#9:   gen-PMC2 ctrl:  0000000000000000
[Wed Feb 18 21:00:35 2015] CPU#9:   gen-PMC2 count: 0000000000000000
[Wed Feb 18 21:00:35 2015] CPU#9:   gen-PMC2 left:  0000000000000000
[Wed Feb 18 21:00:35 2015] CPU#9:   gen-PMC3 ctrl:  0000000000000000
[Wed Feb 18 21:00:35 2015] CPU#9:   gen-PMC3 count: 0000000000000000
[Wed Feb 18 21:00:35 2015] CPU#9:   gen-PMC3 left:  0000000000000000
[Wed Feb 18 21:00:35 2015] CPU#9: fixed-PMC0 count: 0000000000000000
[Wed Feb 18 21:00:35 2015] CPU#9: fixed-PMC1 count: 0000fffbe795d19b
[Wed Feb 18 21:00:35 2015] CPU#9: fixed-PMC2 count: 0000000000000000
[Wed Feb 18 21:00:35 2015] perf_event_intel: clearing PMU state on CPU#9
[Wed Feb 18 21:00:35 2015] perf samples too long (64102 > 62500), lowering kernel.perf_event_max_sample_rate to 2000
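
(That last line also suggests something to try before the next
attempt: pre-emptively lowering the sampling rate to the value the
kernel throttled itself to, e.g.

  sysctl -w kernel.perf_event_max_sample_rate=2000

The value 2000 is simply what the kernel chose above.)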

In case you're wondering: no, the ext4 file write does not come from
an OSD. All OSDs use XFS, and the node does not host a mon either.
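
(A quick way to double-check, assuming the stock OSD data path:

  df -T /var/lib/ceph/osd/*

should report xfs for every OSD data directory on this node.)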

The kernel build is the latest currently available for CentOS 7, with
the matching perf package, obviously.

I wonder, has anyone else seen this? I'm trying to determine whether:

- this is just a generally fishy system;
- something in either the kernel or its matching perf tools could be
causing this; or
- the root cause is somewhere in Ceph.

As it is, a simple "perf top" basically hosing the system is not
something that would generally be considered expected behavior.

Sorry, this was originally sent off-list by mistake. Sending to the list to preserve the conversation.

I've seen perf top go a bit nuts with the default sampling frequency
and DWARF symbols (mostly because of memory consumption), but not
otherwise. You might try using the -F flag to reduce the sampling
frequency and see if that helps at all. I don't have any CentOS
machines, but I do have some RHEL 7 and Fedora boxes and haven't seen
anything similar to this. Definitely sounds strange!
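
For example, capping perf top at a conservative rate:

  perf top -F 99

99 Hz is just a common choice that avoids sampling in lockstep with
the timer tick; any low value should do for a first test.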


Cheers,
Florian

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




