High CPU utilization and inexplicably slow I/O requests

We have been having similar performance issues across several Ceph clusters. When all the OSDs are up, a cluster can stay HEALTH_OK for a while, but eventually performance worsens and it goes HEALTH_WARN (at first intermittently, then continuously) with slow I/O requests blocked for longer than 32 sec. The slow requests are accompanied by "currently waiting for rw locks", but we have not found any of the network issues that are normally responsible for that warning. Examining the individual slow OSDs reported by `ceph health detail` has been unproductive: there don't seem to be any slow disks, and if we stop an affected OSD the problem just moves somewhere else. We also think this trends with the number of RBDs on the clusters rather than with the amount of Ceph I/O.

At the same time, user %CPU spikes to 95-100% simultaneously across all cores, at first frequently and then consistently. We run 12 OSDs per node on a 6-core 2.2 GHz CPU with 64 GiB RAM.

ceph1 ~ $ sudo ceph status
    cluster XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX
     health HEALTH_WARN
            547 requests are blocked > 32 sec
     monmap e1: 3 mons at {cephmon1.XXXXXXXXXXXXXXXXXXXXXXX=XXX.XXX.XXX.XXX:XXXX/0,cephmon1.XXXXXXXXXXXXXXXXXXXXXXX=XXX.XXX.XXX.XX:XXXX/0,cephmon1.XXXXXXXXXXXXXXXXXXXXXXX=XXX.XXX.XXX.XXX:XXXX/0}
            election epoch 16, quorum 0,1,2 cephmon1.XXXXXXXXXXXXXXXXXXXXXXX,cephmon1.XXXXXXXXXXXXXXXXXXXXXXX,cephmon1.XXXXXXXXXXXXXXXXXXXXXXX
     osdmap e577122: 72 osds: 68 up, 68 in
            flags sortbitwise,require_jewel_osds
      pgmap v6799002: 4096 pgs, 4 pools, 13266 GB data, 11091 kobjects
            126 TB used, 368 TB / 494 TB avail
                4084 active+clean
                  12 active+clean+scrubbing+deep
  client io 113 kB/s rd, 11486 B/s wr, 135 op/s rd, 7 op/s wr

ceph1 ~ $ vmstat 5 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd    free   buff    cache   si   so    bi    bo    in     cs us sy id wa st
27  1      0 3112660 165544 36261692    0    0   472  1274     0      1 22  1 76  1  0
25  0      0 3126176 165544 36246508    0    0   858 12692 12122 110478 97  2  1  0  0
22  0      0 3114284 165544 36258136    0    0     1  6118  9586 118625 97  2  1  0  0
11  0      0 3096508 165544 36276244    0    0     8  6762 10047 188618 89  3  8  0  0
18  0      0 2990452 165544 36384048    0    0  1209 21170 11179 179878 85  4 11  0  0

There is no apparent memory shortage, and none of the HDDs or SSDs show consistently high utilization, slow service times, or any other sign of hardware saturation; the only resource that looks saturated is user CPU. Can CPU starvation be responsible for "waiting for rw locks"?

Our main pool (the one with all the data) currently has 1024 PGs, which leaves us room to add more PGs if needed, but we're concerned that doing so would consume even more CPU. We have switched the OSDs from tcmalloc to jemalloc, and that has helped with CPU utilization somewhat, but we still see occurrences of 95-100% CPU with a not terribly high Ceph workload. Any suggestions of what else to look at?

We have a peculiar use case where we have many RBDs but only about 1-5% of them are active at the same time, and we are constantly creating and expiring RBD snapshots. Could this lead to aberrant performance? For instance, is it normal to have ~40k snaps still in cached_removed_snaps?
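In case it is useful, this is roughly how we have been poking at an individual blocked OSD. osd.12 below is just a placeholder id, and we grep the op tracker output on the assumption that the blocked ops record a "waiting for rw locks" event:

ceph1 ~ $ sudo ceph daemon osd.12 dump_ops_in_flight | grep -B5 'waiting for rw locks'
ceph1 ~ $ sudo ceph daemon osd.12 dump_historic_ops | less
ceph1 ~ $ pgrep -af ceph-osd       # find the pid of that OSD's process
ceph1 ~ $ top -H -p <pid>          # per-thread CPU for that ceph-osd
ceph1 ~ $ sudo ceph osd dump | grep removed_snaps   # rough view of how large the removed_snaps interval set has grown (we assume this is what cached_removed_snaps mirrors)

So far none of that has pointed at any one disk or thread, which matches what we see from `ceph health detail`.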
For completeness, our ceph.conf (addresses redacted):

[global]
cluster = XXXXXXXX
fsid = XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX
keyring = /etc/ceph/ceph.keyring
auth_cluster_required = none
auth_service_required = none
auth_client_required = none
mon_host = cephmon1.XXXXXXXXXXXXXXXXXXXXXXX,cephmon1.XXXXXXXXXXXXXXXXXXXXXXX,cephmon1.XXXXXXXXXXXXXXXXXXXXXXX
mon_addr = XXX.XXX.XXX.XXX:XXXX,XXX.XXX.XXX.XXX:XXX,XXX.XXX.XXX.XXX:XXXX
mon_initial_members = cephmon1.XXXXXXXXXXXXXXXXXXXXXXX,cephmon1.XXXXXXXXXXXXXXXXXXXXXXX,cephmon1.XXXXXXXXXXXXXXXXXXXXXXX
cluster_network = 172.20.0.0/18
public_network = XXX.XXX.XXX.XXX/20
mon osd full ratio = .80
mon osd nearfull ratio = .60
rbd default format = 2
rbd default order = 25
rbd_default_features = 1
osd pool default size = 3
osd pool default min size = 1
osd pool default pg num = 1024
osd pool default pgp num = 1024
osd_recovery_op_priority = 1
osd_max_backfills = 1
osd_recovery_threads = 1
osd_recovery_max_active = 1
osd_recovery_max_single_start = 1
osd_scrub_thread_suicide_timeout = 300
osd scrub during recovery = false
osd scrub sleep = 60
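For what it's worth, this is how we convinced ourselves the OSDs really are running with jemalloc rather than tcmalloc (just checking the first ceph-osd pid on a node; on our nodes this shows libjemalloc in the mapped libraries):

ceph1 ~ $ sudo grep -E -m1 -o 'lib(jemalloc|tcmalloc)[^ ]*' /proc/$(pgrep ceph-osd | head -1)/maps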