Hi all,

I've got a 3-node cluster in a lab environment running ceph version 14.2.9 (containerized). Each node is running an OSD, a MON, a MGR, an MDS and 2 x RGW (we're using the second RGW instance to host a pubsub endpoint).

I've been monitoring my nodes with the ceph dashboard, and noticed that at some random point the CPU profile of my hosts completely changes, from [iowait: 20%, user: 10%, system: 5%, others: <5%] to [user: 40%, system: 20%, iowait: 5%, others: <5%]. This isn't a spike; the CPU profile is still like this after several days, and all hosts are affected at the same moment.

Running `top` on the hosts shows it's the OSD and one of the RGW processes (it seems to be the pubsub instance) that are using all the CPU:

$ top
top - 14:57:23 up 11 days, 22:26,  1 user,  load average: 1.65, 1.81, 2.04
Tasks: 140 total,   2 running, 138 sleeping,   0 stopped,   0 zombie
%Cpu(s): 40.9 us, 21.4 sy,  0.0 ni, 27.9 id,  5.9 wa,  0.0 hi,  4.0 si,  0.0 st
KiB Mem : 12304036 total,   183596 free, 10848600 used,  1271840 buff/cache
KiB Swap:        0 total,        0 free,        0 used.  1184676 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 6541 167       20   0 5505324 205564   1380 S  67.8  1.7   4651:41 radosgw
 3826 167       20   0 9563536   7.9g   5496 S  64.8 67.6   4609:05 ceph-osd
 6621 167       20   0 5571764 276804      0 S   6.6  2.2   4985:30 radosgw
 5524 167       20   0 1365368 322240   8516 S   3.7  2.6 171:37.23 ceph-mgr
 1413 root      20   0  558728  28372  24968 S   0.3  0.2  18:56.05 rsyslogd
....

If I restart the RGW instances (I think it's the pubsub instance specifically), CPU usage returns to a normal level. I don't see any unusual logs (from either RGW process, or from the OSD or MON processes) to indicate that anything materially changed at the point the CPU elevated.

Has anyone experienced anything like this? Beyond turning up RGW logging and trying to make sense of the output, I'm not sure where to dig next.

Cheers,
Dave
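P.S. For what it's worth, my rough plan next time it happens is to look at the hot threads on the busy radosgw and bump its debug level via the admin socket, roughly as sketched below. This is just my intended approach, not something I've captured yet; the container name and asok path are assumptions based on our containerized deployment, so adjust to your setup.

# show per-thread CPU usage for the busy radosgw (PID from top above)
top -H -p 6541

# raise RGW logging on the pubsub instance via its admin socket
# (container name and .asok path are placeholders for our deployment)
docker exec <rgw-pubsub-container> ceph daemon \
    /var/run/ceph/ceph-client.rgw.<instance-name>.asok \
    config set debug_rgw 20

# dump the daemon's perf counters for comparison against a healthy node
docker exec <rgw-pubsub-container> ceph daemon \
    /var/run/ceph/ceph-client.rgw.<instance-name>.asok \
    perf dump

# drop logging back to the default once captured
docker exec <rgw-pubsub-container> ceph daemon \
    /var/run/ceph/ceph-client.rgw.<instance-name>.asok \
    config set debug_rgw 1/5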