Re: Quincy recovery load

Thanks for your reply.

What I meant by high load was load as seen by the top command; all the
servers have a load average over 10.

I added one more node to get more space.

This is what I get from ceph status:

  cluster:
    id:     <redacted>
    health: HEALTH_WARN
            2 failed cephadm daemon(s)
            48 nearfull osd(s)
            Low space hindering backfill (add storage if this doesn't
resolve itself): 24 pgs backfill_toofull
            4 pool(s) nearfull

  services:
    mon: 5 daemons, quorum ceph03,ceph02,ceph05,ceph01,ceph04 (age 4h)
    mgr: ceph03.xmbwxh(active, since 2d), standbys: ceph01.ecfgwz,
ceph10.rcvwmp
    mds: 1/1 daemons up, 1 standby, 1 hot standby
    osd: 61 osds: 61 up (since 4h), 61 in (since 4h); 1264 remapped pgs
    rgw: 3 daemons active (3 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   15 pools, 4465 pgs
    objects: 26.53M objects, 91 TiB
    usage:   284 TiB used, 75 TiB / 359 TiB avail
    pgs:     8613187/79613362 objects misplaced (10.819%)
             3201 active+clean
             1240 active+remapped+backfilling
             22   active+remapped+backfill_toofull
             2    active+remapped+backfill_wait+backfill_toofull

  io:
    client:   624 MiB/s rd, 1.6 KiB/s wr, 263 op/s rd, 17 op/s wr
    recovery: 164 MiB/s, 45 objects/s

The performance balances as I expected, giving priority to client traffic.
I get a lot of health warnings about osd_slow_ping_time_back,
osd_slow_ping_time_front and slow_ops.
I noticed that there are 1240 PGs backfilling in parallel. Is that expected?
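
As a sanity check on that parallelism, one thing I can look at (a sketch;
osd.0 is just an example daemon id) is the effective per-OSD backfill
reservation limit:

    # Show the running value of osd_max_backfills on one OSD
    ceph config show osd.0 osd_max_backfills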

/Jimmy

On Wed, Jul 6, 2022 at 3:28 PM Sridhar Seshasayee <sseshasa@xxxxxxxxxx>
wrote:

> Hi Jimmy,
>
> As you rightly pointed out, the OSD recovery priority does not work
> because of the
> change to mClock. By default, the "high_client_ops" profile is enabled and
> this
> optimizes client ops when compared to recovery ops. Recovery ops will take
> the
> longest time to complete with this profile and this is expected.
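
Just to confirm I am on that default, I can read the profile back (a sketch;
osd.0 is only an example daemon id, and the second command assumes that
daemon is running):

    # Value stored in the cluster configuration database for the osd section
    ceph config get osd osd_mclock_profile

    # Value actually in effect on a running OSD daemon
    ceph config show osd.0 osd_mclock_profile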
>
> When you say "load avg on my servers is high", I am assuming it's the
> recovery load.
> If you want recovery ops to complete faster, then you can first try
> changing the mClock
> profile to the "balanced" profile on all OSDs and see if it improves the
> situation. The
> "high_recovery_ops" profile would be the next option as it will provide
> the best recovery
> performance. But with both the "balanced" and the "high_recovery_ops"
> profiles,
> improved recovery performance will be at the expense of client ops which
> will
> experience slightly higher latencies.
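
For reference, my understanding of the switch from the linked mClock docs is
roughly the following (a sketch, not verified here):

    # Switch all OSDs to the balanced profile
    ceph config set osd osd_mclock_profile balanced

    # Or prioritize recovery at the cost of higher client latency
    ceph config set osd osd_mclock_profile high_recovery_ops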
>
> For more details on the mClock profiles, see mClock Config Reference:
> https://docs.ceph.com/en/quincy/rados/configuration/mclock-config-ref/
>
> To switch profiles, see:
>
> https://docs.ceph.com/en/quincy/rados/configuration/mclock-config-ref/#steps-to-enable-mclock-profile
>
> The recommendation would be to change the profile on all OSDs to get the
> best performance for the operation you are interested in.
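
Once backfill finishes I would verify and then revert the profile roughly
like this (a sketch; osd.0 is just an example id, and removing the override
falls back to the high_client_ops default):

    # Confirm the running value on one OSD
    ceph config show osd.0 osd_mclock_profile

    # Drop the override so the OSDs return to the default profile
    ceph config rm osd osd_mclock_profile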
>
> -Sridhar
>
>
>