Re: Quincy recovery load

Chris Palmer <chris.palmer@xxxxxxxxx> · Mon, 11 Jul 2022 17:41:02 +0100

Correction: it is the acting OSDs that are consuming CPU, not the up ones.

On 11/07/2022 16:17, Chris Palmer wrote:
I'm seeing a similar problem on a small cluster just upgraded from
Pacific 16.2.9 to Quincy 17.2.1 (non-cephadm). The cluster was only
very lightly loaded during and after the upgrade. The affected OSDs
are all BlueStore, on HDDs with a shared NVMe DB/WAL, and were all
created on Pacific (I think).

The upgrade was uneventful, except that some rebalancing started after
the OSDs on the first node were restarted. I don't know why, but I set
norebalance for the remainder of the upgrade.

When the entire upgrade was complete I unset norebalance. A number of 
things then became apparent:

 * Startup appeared normal, with health returning to OK in the expected
   time. Apart from the points below, things seem normal.
 * A few OSDs are consuming around 95% CPU, almost all of which is User
   mode. System, iowait and interrupt are all minimal.
 * Rebalancing is occurring, but only at about 1 object/second.
 * Disk IO rates are minimal.
 * We originally had osd_compact_on_start set to true, but then
   disabled it and restarted the CPU-bound OSDs. They all restarted
   OK and then resumed their high CPU load. Rebooting an OSD node
   didn't change anything either.
 * There are currently 3 PGs that are remapping (all cephfs_data). The
   acting primary OSDs of those PGs (5, 7 and 8) are the ones consuming
   the CPU.
 * Setting nobackfill causes the CPU consumption to drop to normal idle
   levels; unsetting it returns to the high levels.
 * We have not changed any other settings during the upgrade. All of
   the OSDs are using the mclock high_client_ops profile.

On Pacific, rebalancing often ran at up to 100 objects per second
with fairly minimal CPU load.

This CPU load and backfill rate are not sustainable, and I have put the
upgrade of our larger production cluster on hold.

Any ideas, please?!

Thanks, Chris

# ceph pg dump pgs_brief|grep remap
dumped pgs_brief
8.38     active+remapped+backfilling     [7,0,4]           7  [7,0,3]               7
8.5f     active+remapped+backfilling     [8,3,2]           8  [8,3,0]               8
8.63     active+remapped+backfilling     [0,6,5]           0  [5,6,2]               5
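To see which OSDs sit in which role, each pgs_brief row can be split into its up and acting sets. This is a minimal Python sketch, assuming the usual pgs_brief column order (PG_STAT, STATE, UP, UP_PRIMARY, ACTING, ACTING_PRIMARY) and each row on a single line; parse_osd_set, parse_row and the other names are hypothetical helpers, not part of any Ceph tool.

```python
# pgs_brief rows from the thread, one row per line.
# Assumed column order: PG_STAT STATE UP UP_PRIMARY ACTING ACTING_PRIMARY
ROWS = """\
8.38     active+remapped+backfilling     [7,0,4]           7  [7,0,3]               7
8.5f     active+remapped+backfilling     [8,3,2]           8  [8,3,0]               8
8.63     active+remapped+backfilling     [0,6,5]           0  [5,6,2]               5
"""


def parse_osd_set(text):
    """Turn a bracketed OSD list such as '[7,0,4]' into a list of ints."""
    return [int(x) for x in text.strip("[]").split(",") if x]


def parse_row(line):
    """Split one whitespace-separated pgs_brief row into named fields."""
    pgid, state, up, up_primary, acting, acting_primary = line.split()
    return {
        "pgid": pgid,
        "state": state,
        "up": parse_osd_set(up),
        "up_primary": int(up_primary),
        "acting": parse_osd_set(acting),
        "acting_primary": int(acting_primary),
    }


pgs = [parse_row(line) for line in ROWS.splitlines() if "remapped" in line]

# The acting primary serves the PG (and drives backfill) while it is remapped.
acting_primaries = sorted({pg["acting_primary"] for pg in pgs})

# OSDs that are in UP but not yet in ACTING are the ones being backfilled into.
backfill_targets = {
    pg["pgid"]: sorted(set(pg["up"]) - set(pg["acting"])) for pg in pgs
}

print("acting primaries:", acting_primaries)
print("backfill targets:", backfill_targets)
```

For these three rows the acting primaries are OSDs 5, 7 and 8, matching the CPU-bound OSDs reported above, and the backfill targets are OSD 4 (8.38), OSD 2 (8.5f) and OSD 0 (8.63).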

# ceph -s
  cluster:
    id:     9208361c-5b68-41ed-8155-cc246a3fe538
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum pve1,pve2,pve3 (age 46m)
    mgr: pve1(active, since 2h), standbys: pve2, pve3
    mds: 1/1 daemons up, 2 standby
    osd: 18 osds: 18 up (since 45m), 18 in (since 2w); 3 remapped pgs
    rgw: 3 daemons active (3 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   14 pools, 617 pgs
    objects: 2.49M objects, 5.9 TiB
    usage:   18 TiB used, 25 TiB / 43 TiB avail
    pgs:     16826/7484589 objects misplaced (0.225%)
             614 active+clean
             3   active+remapped+backfilling

  io:
    client:   341 B/s rd, 256 KiB/s wr, 0 op/s rd, 20 op/s wr
    recovery: 4.8 MiB/s, 1 objects/s
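For scale, the figures in this `ceph -s` output give a rough time to clear the misplaced objects. This is a back-of-the-envelope Python sketch that assumes a constant recovery rate, which real backfill will not hold exactly; drain_time_hours is a hypothetical helper.

```python
# Figures copied from the `ceph -s` output above.
misplaced = 16826
total = 7484589  # denominator of "objects misplaced" as printed by ceph -s


def drain_time_hours(objects, objects_per_second):
    """Rough hours to clear `objects` at a constant rate (an assumption)."""
    return objects / objects_per_second / 3600


print(f"misplaced fraction: {misplaced / total:.3%}")
print(f"at 1 obj/s (observed on Quincy): {drain_time_hours(misplaced, 1):.1f} h")
print(f"at 100 obj/s (Pacific, upper figure): "
      f"{drain_time_hours(misplaced, 100) * 60:.1f} min")
```

At the observed 1 object/s, the 16826 misplaced objects take roughly 4.7 hours to clear. At the 100 objects/s mentioned for Pacific, they would take under 3 minutes.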

On 06/07/2022 16:19, Jimmy Spets wrote:
Do you mean load average as reported by `top` or `uptime`?
Yes.

That figure can be misleading on multi-core systems.  What CPU are you
using?
It's a low-end 4C/4T CPU.

/Jimmy

On Wed, Jul 6, 2022 at 4:52 PM Anthony D'Atri <anthony.datri@xxxxxxxxx> wrote:

Do you mean load average as reported by `top` or `uptime`?

That figure can be misleading on multi-core systems.  What CPU are you
using?

For context, when I ran systems with 32C/64T and 24x SATA SSD, the load
average could easily hit 40-60 without anything being wrong.
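To make the comparison across machines fairer, the load average can be divided by the number of logical CPUs. On Linux the load average also counts tasks blocked in uninterruptible I/O, so HDD-heavy nodes can show a high figure even when the CPUs are not busy. This is a minimal Python sketch; load_per_cpu is a hypothetical helper, and `os.getloadavg()` exists only on Unix-like systems.

```python
import os


def load_per_cpu(load_avg, logical_cpus):
    """Load average divided by the number of logical CPUs (threads)."""
    return load_avg / logical_cpus


# Anthony's example from the thread: 32C/64T, load average of 40-60.
for load in (40, 60):
    print(f"load {load} on 64 threads: {load_per_cpu(load, 64):.2f} per thread")

# The same figure for the local machine; os.getloadavg() is Unix-only.
if hasattr(os, "getloadavg"):
    one_min, _, _ = os.getloadavg()
    cpus = os.cpu_count() or 1  # cpu_count() can return None
    print(f"local 1-min load per CPU: {load_per_cpu(one_min, cpus):.2f}")
```

A load of 40-60 on 64 threads is roughly 0.6-0.9 per thread. On a 4-thread CPU, the same per-thread figure corresponds to a load average of only about 2.5-3.75.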

What CPU percentages do you see for user, system, idle and iowait?
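On Linux, these percentages can be derived from two samples of the aggregate `cpu` line in /proc/stat, similar to how top reports them. This is a minimal Python sketch, assuming a Linux kernel that exposes at least the eight fields listed. read_cpu_line and cpu_percentages are hypothetical helpers.

```python
import os
import time

# First eight fields of the aggregate "cpu" line in /proc/stat (clock ticks).
# guest and guest_nice are left out: the kernel already counts them in user/nice.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def read_cpu_line(path="/proc/stat"):
    """Return the aggregate cpu counters from /proc/stat as a dict."""
    with open(path) as f:
        for line in f:
            if line.startswith("cpu "):
                values = [int(v) for v in line.split()[1:]]
                return dict(zip(FIELDS, values))
    raise RuntimeError("no aggregate cpu line found")


def cpu_percentages(before, after):
    """Percentage of elapsed ticks spent in each state between two samples."""
    delta = {k: after[k] - before[k] for k in FIELDS}
    total = sum(delta.values())
    return {k: 100.0 * v / total for k, v in delta.items()} if total else {}


if __name__ == "__main__" and os.path.exists("/proc/stat"):
    first = read_cpu_line()
    time.sleep(1)  # sample interval
    second = read_cpu_line()
    for state, pct in cpu_percentages(first, second).items():
        print(f"{state:>8}: {pct:5.1f}%")
```

Chris's description of around 95% user time with minimal system, iowait and interrupt would show up here as a large `user` share and small `system`, `iowait`, `irq` and `softirq` shares.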

On Jul 6, 2022, at 5:32 AM, Jimmy Spets <jimmy@xxxxxxxxx> wrote:

Hi all

I have a 10-node cluster with fairly modest hardware (6 HDDs and 1 shared
NVMe for DB per node) that I use for archival.
After upgrading to Quincy, I noticed that the load average on my servers
is very high during recovery or rebalance.
Changing the OSD recovery priority does not work, I assume because of
the change to mClock.
Is the high load average expected behaviour?

Should I adjust some limits so that the scheduler does not overwhelm
the server?

/Jimmy

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx