Hello,

I had already increased/changed those variables previously. I increased pg_num to 128, which increased the number of PGs backfilling, but the speed is still only around 30 MiB/s on average and it has been backfilling 23 PGs for the last several hours. Should I increase it higher than 128? I'm still trying to figure out whether this is just how Ceph is or whether there is a bottleneck somewhere. For example, if I sftp a 10G file between servers it's done in a couple of minutes or less. Am I thinking about this wrong?
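In case it's useful to anyone following the thread, here is a rough sketch of how the settings Frank mentions below can be adjusted at runtime. The values are examples only, "<ec_pool>" is a placeholder for the actual pool name, and osd_max_backfills in particular should be raised one step at a time while watching client IO:

    # allow more concurrent backfills per OSD (raise in small steps)
    ceph config set osd osd_max_backfills 3
    ceph config set osd osd_recovery_max_active 5

    # shorten the per-op recovery sleep on HDD OSDs (lower = faster recovery, more client impact)
    ceph config set osd osd_recovery_sleep_hdd 0.05

    # once the cluster is healthy again, raise the PG count on the EC pool
    # (256 is just an example; pick a power of two that gives ~100-200 PGs per OSD)
    ceph osd pool set <ec_pool> pg_num 256

    # watch progress
    ceph -s
    ceph pg dump pgs_brief | grep -c backfill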
Thanks,
Curt

On Mon, Jun 27, 2022 at 12:33 PM Frank Schilder <frans@xxxxxx> wrote:

> Hi Curt,
>
> As far as I understood, a 2+2 EC pool is recovering, which makes one OSD per
> host busy. My experience is that the algorithm for selecting PGs to
> backfill/recover is not very smart. It could simply be that it doesn't find
> more PGs without violating some of these settings:
>
> osd_max_backfills
> osd_recovery_max_active
>
> I have never observed the second parameter to change anything (try it
> anyway). However, the first one has a large impact. You could try increasing
> it slowly until recovery moves faster. Another parameter you might want
> to try is
>
> osd_recovery_sleep_[hdd|ssd]
>
> Be careful, as this will impact client IO. I could reduce the sleep for my
> HDDs to 0.05. With your workload pattern, this might be something you can
> tune as well.
>
> Having said that, I think you should increase the PG count on the EC pool
> as soon as the cluster is healthy. You have only about 20 PGs per OSD, and
> large PGs take unnecessarily long to recover. A higher PG count will
> also make it easier for the scheduler to find PGs for recovery/backfill.
> Aim for a number between 100 and 200. Give the pool(s) with the most data
> (#objects) the most PGs.
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Curt <lightspd@xxxxxxxxx>
> Sent: 24 June 2022 19:04
> To: Anthony D'Atri; ceph-users@xxxxxxx
> Subject: Re: Ceph recovery network speed
>
> 2 PGs shouldn't take hours to backfill, in my opinion. They're just 2 TB
> enterprise HDDs.
>
> Take this log entry below: 72 minutes and still backfilling undersized?
> Should it be that slow?
>
> pg 12.15 is stuck undersized for 72m, current state
> active+undersized+degraded+remapped+backfilling, last acting
> [34,10,29,NONE]
>
> Thanks,
> Curt
>
>
> On Fri, Jun 24, 2022 at 8:53 PM Anthony D'Atri <anthony.datri@xxxxxxxxx>
> wrote:
>
> > Your recovery is slow *because* there are only 2 PGs backfilling.
> >
> > What kind of OSD media are you using?
> >
> > > On Jun 24, 2022, at 09:46, Curt <lightspd@xxxxxxxxx> wrote:
> > >
> > > Hello,
> > >
> > > I'm trying to understand why my recovery is so slow with only 2 PGs
> > > backfilling. I'm only getting speeds of 3-4 MiB/s on a 10G network. I
> > > have tested the speed between machines with a few tools and all confirm
> > > 10G speed. I've tried changing various settings for priority and
> > > recovery sleep hdd, but it's still the same. Is this a configuration
> > > issue or something else?
> > >
> > > It's just a small cluster right now with 4 hosts, 11 OSDs per host.
> > > Please let me know if you need more information.
> > >
> > > Thanks,
> > > Curt
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx