Problem with internals and mgr/ out-of-memory, unresponsive, high-CPU


 



I'm attempting to install an OpenStack cluster with Ceph. It's a cephadm install (bootstrap on overcloud-controller-0, then deploy from there to the other two nodes).

This is a containerized install:

parameter_defaults:
  ContainerImagePrepare:
  - set:
      ceph_alertmanager_image: alertmanager
      ceph_alertmanager_namespace: quay.ceph.io/prometheus
      ceph_alertmanager_tag: v0.16.2
      ceph_grafana_image: grafana
      ceph_grafana_namespace: quay.ceph.io/app-sre
      ceph_grafana_tag: 6.7.4
      ceph_image: daemon
      ceph_namespace: quay.io/ceph
      ceph_node_exporter_image: node-exporter
      ceph_node_exporter_namespace: quay.ceph.io/prometheus
      ceph_node_exporter_tag: v0.17.0
      ceph_prometheus_image: prometheus
      ceph_prometheus_namespace: quay.ceph.io/prometheus
      ceph_prometheus_tag: v2.7.2
      ceph_tag: v6.0.4-stable-6.0-pacific-centos-8-x86_64
      name_prefix: openstack-
      name_suffix: ''
      namespace: quay.io/tripleomaster
      neutron_driver: ovn
      rhel_containers: false
      tag: current-tripleo
    tag_from_label: rdo_version

When a controller node becomes active, a call is made to (I believe) ActivePyModules::set_store(...) with a corrupt json payload, corrupt in a Huge Way. The mgr attempts to allocate virtual memory for it and runs out of VM around 120 GiB, all the while the node is unresponsive because the CPU is pinned near 100% I/O wait. Eventually control is transferred to another node, which then begins the same behavior, and it bounces from one node to the next indefinitely.

I don't know where the payload is coming from; I don't yet know enough about Ceph internals. I believe it's delivered by a message, but I don't know who sent it.

The set_store call is trying to do a "config-key set mgr/cephadm/host.overcloud-controller-0". The problem with the payload appears to be a repeated, seemingly unbounded duplication of a single IP address. The following is an excerpt from just before it loses its mind:

..."networks_and_interfaces": {"10.100.4.0/24": {"br-ex": ["10.100.4.71"]}, "10.100.5.0/24": {"vlan1105": ["10.100.5.154"]}, "10.100.6.0/24": {"vlan1106": ["10.100.6.163"]}, "10.100.7.0/24": {"vlan1107": ["10.100.7.163", "10.100.7.163", ... (the vlan1107 IP is repeated more times than the log can hold).
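For what it's worth, once a payload like this is dumped to a file, the duplication is easy to confirm mechanically. A minimal sketch (the sample document below mirrors the excerpt above; the repetition count is illustrative, since the real payload is truncated in the log):

```python
import json
from collections import Counter

# Sample payload shaped like the corrupt excerpt above; the real
# payload repeats the vlan1107 address far more times than this.
payload = json.loads('''
{"networks_and_interfaces": {
  "10.100.4.0/24": {"br-ex": ["10.100.4.71"]},
  "10.100.7.0/24": {"vlan1107": ["10.100.7.163", "10.100.7.163", "10.100.7.163"]}
}}
''')

# Flag any interface whose address list contains repeated entries
duplicated = {}
for net, ifaces in payload["networks_and_interfaces"].items():
    for iface, addrs in ifaces.items():
        repeats = {ip: n for ip, n in Counter(addrs).items() if n > 1}
        if repeats:
            duplicated[(net, iface)] = repeats

print(duplicated)
# → {('10.100.7.0/24', 'vlan1107'): {'10.100.7.163': 3}}
```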

I've truncated these lines for length, but they give an idea of the 7 minutes it spends in I/O-wait hell while it's sucking up all available VM.

Feb 27 22:43:59 overcloud-controller-0 conmon[4885]: tcmalloc: large alloc 1207967744 bytes == 0x5606e03a0000 @
Feb 27 22:44:00 overcloud-controller-0 conmon[4885]: tcmalloc: large alloc 1207967744 bytes == 0x560662376000 @
Feb 27 22:44:10 overcloud-controller-0 conmon[4885]: tcmalloc: large alloc 2415927296 bytes == 0x5607b8ba6000 @
Feb 27 22:44:12 overcloud-controller-0 conmon[4885]: tcmalloc: large alloc 2415927296 bytes == 0x560848ba8000 @
Feb 27 22:44:18 overcloud-controller-0 conmon[4885]: tcmalloc: large alloc 2415927296 bytes == 0x560848ba8000 @
Feb 27 22:44:19 overcloud-controller-0 conmon[4885]: tcmalloc: large alloc 2415927296 bytes == 0x56063e374000 @
Feb 27 22:44:25 overcloud-controller-0 conmon[4885]: tcmalloc: large alloc 2415927296 bytes == 0x56063e374000 @
Feb 27 22:44:32 overcloud-controller-0 conmon[4885]: tcmalloc: large alloc 4831846400 bytes == 0x5608d8baa000 @
Feb 27 22:44:35 overcloud-controller-0 conmon[4885]: tcmalloc: large alloc 4831846400 bytes == 0x5609f93ac000 @
Feb 27 22:44:46 overcloud-controller-0 conmon[4885]: tcmalloc: large alloc 4831846400 bytes == 0x5609f93ac000 @
Feb 27 22:44:48 overcloud-controller-0 conmon[4885]: tcmalloc: large alloc 4831846400 bytes == 0x5607b8ba6000 @
Feb 27 22:45:00 overcloud-controller-0 conmon[4885]: tcmalloc: large alloc 4831846400 bytes == 0x5607b8ba6000 @
Feb 27 22:45:02 overcloud-controller-0 conmon[4885]: tcmalloc: large alloc 4831846400 bytes == 0x56063e374000 @
Feb 27 22:45:13 overcloud-controller-0 conmon[4885]: tcmalloc: large alloc 9663684608 bytes == 0x560b193ae000 @
Feb 27 22:45:21 overcloud-controller-0 conmon[4885]: tcmalloc: large alloc 9663684608 bytes == 0x560d59bb0000 @
Feb 27 22:45:45 overcloud-controller-0 conmon[4885]: tcmalloc: large alloc 9663684608 bytes == 0x560d59bb0000 @
Feb 27 22:45:52 overcloud-controller-0 conmon[4885]: tcmalloc: large alloc 9663684608 bytes == 0x560f9abb2000 @
Feb 27 22:46:14 overcloud-controller-0 conmon[4885]: tcmalloc: large alloc 9663684608 bytes == 0x560f9abb2000 @
Feb 27 22:46:18 overcloud-controller-0 conmon[4885]: tcmalloc: large alloc 9663684608 bytes == 0x5611db3b4000 @
Feb 27 22:47:42 overcloud-controller-0 conmon[4885]: tcmalloc: large alloc 19327361024 bytes == 0x56141bbb6000
Feb 27 22:47:58 overcloud-controller-0 conmon[4885]: tcmalloc: large alloc 19327361024 bytes == 0x56189cbb8000
Feb 27 22:48:45 overcloud-controller-0 conmon[4885]: tcmalloc: large alloc 19327361024 bytes == 0x56189cbb8000
Feb 27 22:48:55 overcloud-controller-0 conmon[4885]: tcmalloc: large alloc 19327361024 bytes == 0x561d1dbba000
Feb 27 22:49:54 overcloud-controller-0 conmon[4885]: tcmalloc: large alloc 19327361024 bytes == 0x561d1dbba000
Feb 27 22:50:13 overcloud-controller-0 conmon[4885]: tcmalloc: large alloc 19327361024 bytes == (nil) @  0x7fdb
Feb 27 22:50:13 overcloud-controller-0 conmon[4885]: tcmalloc: large alloc 38654713856 bytes == (nil) @  0x7fdb
Feb 27 22:50:13 overcloud-controller-0 conmon[4885]: tcmalloc: large alloc 38654713856 bytes == (nil) @  0x7fdb
Feb 27 22:50:13 overcloud-controller-0 conmon[4885]: tcmalloc: large alloc 38654713856 bytes == (nil) @  0x7fdb
Feb 27 22:50:13 overcloud-controller-0 conmon[4885]: tcmalloc: large alloc 38654713856 bytes == (nil) @  0x7fdb
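Incidentally, the allocation sizes in that log roughly double at each step (1.2G, 2.4G, 4.8G, 9.6G, 19.3G, 38.6G), which is the classic signature of a growable buffer reallocating while something appends to it without bound. A toy illustration of that growth pattern (this is a generic model, not Ceph's actual buffer code):

```python
# Toy model of amortized-doubling buffer growth: appending elements one
# at a time triggers capacity doublings, mirroring how the tcmalloc
# "large alloc" sizes in the log double as the payload keeps growing.
reallocs = []
capacity = 1
size = 0
for _ in range(100):  # append 100 elements
    size += 1
    if size > capacity:
        capacity *= 2
        reallocs.append(capacity)

print(reallocs)  # → [2, 4, 8, 16, 32, 64, 128]
```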

vlan1107 is the "Storage" VLAN, which I believe is considered the "External" network by Ceph, and 10.100.7.163 belongs to the active node:

14: vlan1107: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether c6:48:79:80:d2:69 brd ff:ff:ff:ff:ff:ff
    inet 10.100.7.163/24 brd 10.100.7.255 scope global vlan1107
       valid_lft forever preferred_lft forever
    inet 10.100.7.152/32 brd 10.100.7.255 scope global vlan1107
       valid_lft forever preferred_lft forever
(10.100.7.152/32 is the VIP)

This is the bootstrap conf:

[root@overcloud-controller-0 ceph-admin]# cat bootstrap_ceph.conf
[global]
fsid = d0bcb278-f5ee-4f7a-87b7-911bf60620ae
mon host = 10.100.7.151
public network = 10.100.7.0/24
cluster network = 10.100.8.0/24
osd_pool_default_pg_num = 32
osd_pool_default_pgp_num = 32
osd_pool_default_size = 3
rgw_keystone_accepted_admin_roles = ResellerAdmin, swiftoperator
rgw_keystone_accepted_roles = member, Member, admin
rgw_keystone_admin_domain = default
rgw_keystone_admin_password = ***
rgw_keystone_admin_project = service
rgw_keystone_admin_user = swift
rgw_keystone_api_version = 3
rgw_keystone_implicit_tenants = true
rgw_keystone_revocation_interval = 0
rgw_keystone_url = http://10.100.5.168:5000
rgw_s3_auth_use_keystone = true
rgw_swift_account_in_url = true
rgw_swift_versioning_enabled = true
rgw_trust_forwarded_https = true

[mgr]
mgr/cephadm/autotune_memory_target_ratio = 0.2

[osd]
osd_memory_target_autotune = True
osd_numa_auto_affinity = True

This is all the basic information, but due to the nature of this problem and the fact that it's a TripleO/Ceph install, there are more artifacts than can practically be attached to an email.

Does anyone know how to track down the caller of set_store, the one with the corrupt payload that's trying to do a "config-key set mgr/cephadm/host.overcloud-controller-0"?
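In case it helps anyone reproducing this: assuming the stored key can be dumped to a file while a mgr is briefly up (e.g. via `ceph config-key get mgr/cephadm/host.overcloud-controller-0`, redirected to a file), a blunt interim cleanup is to deduplicate the address lists before setting it back. A hedged sketch; `dedupe_addresses` is a hypothetical helper of mine, not anything from cephadm:

```python
import json

def dedupe_addresses(doc):
    """Collapse repeated IPs in each interface's address list,
    preserving first-seen order (dict.fromkeys keeps insertion order)."""
    for ifaces in doc.get("networks_and_interfaces", {}).values():
        for iface, addrs in ifaces.items():
            ifaces[iface] = list(dict.fromkeys(addrs))
    return doc

# Demonstration on a small sample shaped like the corrupt payload
sample = {"networks_and_interfaces":
          {"10.100.7.0/24": {"vlan1107": ["10.100.7.163", "10.100.7.163"]}}}
fixed = dedupe_addresses(sample)
print(json.dumps(fixed))
```

This doesn't address the root cause (whatever keeps appending the address), so the key would presumably get corrupted again, but it might keep the mgr alive long enough to gather logs.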

TIA

Ted


_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



