Re: Ceph recovery network speed

I think this is just how ceph is. Maybe you should post the output of "ceph status", "ceph osd pool stats" and "ceph df" so that we can get an idea of whether what you are seeing is expected or not. As I wrote before, object recovery is throttled and the recovery bandwidth depends heavily on object size. The interesting question is how many objects per second are recovered/rebalanced.

Maybe provide the output of the first two commands for osd_recovery_sleep_hdd=0.05 and osd_recovery_sleep_hdd=0.1 each (wait a bit after setting these and then collect the output). Include the applied values for osd_max_backfills* and osd_recovery_max_active* for one of the OSDs in the pool (ceph config show osd.ID | grep -e osd_max_backfills -e osd_recovery_max_active).
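For example (osd.0 here is just a placeholder for any OSD id in the affected pool):

  ceph config set osd osd_recovery_sleep_hdd 0.05
  # wait a few minutes, then collect
  ceph status
  ceph osd pool stats
  ceph config set osd osd_recovery_sleep_hdd 0.1
  # wait again, then collect the same two outputs
  ceph status
  ceph osd pool stats
  # applied throttle values on one OSD in the pool
  ceph config show osd.0 | grep -e osd_max_backfills -e osd_recovery_max_active

The io/recovery line in the ceph status output should also show the objects per second rate I asked about above.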

I don't really know whether one can expect more than what you see on such a small cluster. It has nothing to do with network speed if you have a 10G line; recovery is something completely different from a full link-speed copy.

I can tell you that boatloads of tiny objects are a huge pain for recovery, even on SSD. Ceph doesn't raid sections of disks against each other, but object by object. This might be a feature request: PG space allocation and recovery should follow the model of LVM extents (ideally match LVM extents) to allow recovering/rebalancing larger chunks of storage in one go, containing parts of one large or many small objects.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Curt <lightspd@xxxxxxxxx>
Sent: 27 June 2022 17:35:19
To: Frank Schilder
Cc: ceph-users@xxxxxxx
Subject: Re: Re: Ceph recovery network speed

Hello,

I had already increased/changed those variables previously. I increased pg_num to 128, which increased the number of PGs backfilling, but speed is still only around 30 MiB/s on average, and it has been backfilling 23 PGs for the last several hours. Should I increase it higher than 128?

I'm still trying to figure out if this is just how ceph is or if there is a bottleneck somewhere. For comparison, if I sftp a 10G file between servers it's done in a couple of minutes or less. Am I thinking about this wrong?

Thanks,
Curt

On Mon, Jun 27, 2022 at 12:33 PM Frank Schilder <frans@xxxxxx> wrote:
Hi Curt,

As far as I understood, a 2+2 EC pool is recovering, which keeps 1 OSD per host busy. My experience is that the algorithm for selecting PGs to backfill/recover is not very smart. It could simply be that it doesn't find more PGs without violating one of these settings:

osd_max_backfills
osd_recovery_max_active

I have never observed the second parameter to change anything (try anyway). However, the first one has a large impact. You could try increasing it slowly until recovery moves faster. Another parameter you might want to try is

osd_recovery_sleep_[hdd|ssd]

Be careful as this will impact client IO. I could reduce the sleep for my HDDs to 0.05. With your workload pattern, this might be something you can tune as well.
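For example, as runtime changes via the MON config store (the values are just a starting point; watch client IO between steps):

  # check what an OSD currently runs with (osd.34 is just an example id)
  ceph config show osd.34 | grep -e osd_max_backfills -e osd_recovery_max_active -e osd_recovery_sleep
  # raise backfills in small steps
  ceph config set osd osd_max_backfills 2
  # lower the HDD recovery sleep (0.05 worked for my HDDs)
  ceph config set osd osd_recovery_sleep_hdd 0.05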

Having said that, I think you should increase the PG count on the EC pool as soon as the cluster is healthy. You have only about 20 PGs per OSD, and large PGs take unnecessarily long to recover. A higher PG count also makes it easier for the scheduler to find PGs for recovery/backfill. Aim for a number between 100 and 200 PGs per OSD. Give the pool(s) with the most data (#objects) the most PGs.
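For example, assuming the EC pool is called ec-data (substitute your pool name): with 44 OSDs and 4 shards per PG in a 2+2 profile, pg_num=1024 gives roughly 1024*4/44 ≈ 93 PG shards per OSD from this pool alone, which together with your other pools lands in the 100-200 range:

  ceph osd pool set ec-data pg_num 1024

On recent releases pgp_num follows pg_num automatically; the split itself causes data movement, so do it while the cluster is otherwise healthy.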

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Curt <lightspd@xxxxxxxxx>
Sent: 24 June 2022 19:04
To: Anthony D'Atri; ceph-users@xxxxxxx
Subject: Re: Ceph recovery network speed

2 PGs shouldn't take hours to backfill, in my opinion. They're just 2TB enterprise HDDs.

Take the log entry below: 72 minutes and still backfilling undersized? Should it be that slow?

pg 12.15 is stuck undersized for 72m, current state
active+undersized+degraded+remapped+backfilling, last acting [34,10,29,NONE]
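
For the record, the full list of PGs stuck in that state comes from:

  ceph pg dump_stuck undersized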

Thanks,
Curt


On Fri, Jun 24, 2022 at 8:53 PM Anthony D'Atri <anthony.datri@xxxxxxxxx> wrote:

> Your recovery is slow *because* there are only 2 PGs backfilling.
>
> What kind of OSD media are you using?
>
> > On Jun 24, 2022, at 09:46, Curt <lightspd@xxxxxxxxx> wrote:
> >
> > Hello,
> >
> > I'm trying to understand why my recovery is so slow with only 2 PGs
> > backfilling. I'm only getting speeds of 3-4 MiB/s on a 10G network. I
> > have tested the speed between machines with a few tools and all confirm
> > 10G speed. I've tried changing various settings of priority and recovery
> > sleep hdd, but still the same. Is this a configuration issue or something
> > else?
> >
> > It's just a small cluster right now with 4 hosts, 11 OSDs per host. Please
> > let me know if you need more information.
> >
> > Thanks,
> > Curt
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



