Re: Very slow backfilling

Forgot to do a reply all.

What do these return?

ceph osd df
ceph osd dump | grep pool
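
For context, the pool lines in the "ceph osd dump" output include pg_num and
pgp_num, and "ceph osd df" has a PGS column showing how many placement groups
each OSD carries, so those two outputs should show the PG distribution
directly. If a per-host view is easier to read, the tree variant should also
work on 17.x:

ceph osd df tree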

Are you using the PG autoscaler? With 289 PGs, 272 TB of data, and 60 OSDs,
that works out to only around 4-5 PGs per OSD (before counting replicas),
each holding almost 1 TB. Unless I'm thinking of this wrong, that seems far
too few.
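
A rough sketch of what I'd check next, assuming your pools are managed by the
autoscaler (both commands exist on 17.x; substitute your own pool name):

ceph osd pool autoscale-status
ceph osd pool get <pool> pg_num
# 272 TiB over 289 PGs is roughly 1 TiB per PG. The usual target is on the
# order of 100 PGs per OSD (mon_target_pg_per_osd defaults to 100), so if
# the autoscaler is off or its targets are capped, pg_num on the big pool
# probably needs to be much higher.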

On Thu, Mar 2, 2023, 17:37 Joffrey <joff.au@xxxxxxxxx> wrote:

> My Ceph version is 17.2.5 and all osd_scrub* settings are at their
> defaults. I tried raising osd-max-backfills, but it made no difference.
> I have many HDDs with NVMe for the DB, all connected over a 25G network.
>
> Yes, it has been the same PG for 4 days now.
>
> An HDD failed, and I went through many days of recovery+backfilling over
> the last 2 weeks. Perhaps the 'not scrubbed in time' warnings are related
> to that.
>
> 'Jof
>
> On Thu, Mar 2, 2023 at 14:25, Anthony D'Atri <aad@xxxxxxxxxxxxxx> wrote:
>
> > Run `ceph health detail`.
> >
> > Is it the same PG backfilling for a long time, or a different one over
> > time?
> >
> > That it’s remapped makes me think that what you’re seeing is the balancer
> > doing its job.
> >
> > As far as scrubbing goes, do you limit the times when scrubbing can
> > happen? Are these HDDs? EC?
> >
> > > On Mar 2, 2023, at 07:20, Joffrey <joff.au@xxxxxxxxx> wrote:
> > >
> > > Hi,
> > >
> > > I have many 'not {deep-}scrubbed in time' warnings and 1 PG in
> > > remapped+backfilling, and I don't understand why this backfilling is
> > > taking so long.
> > >
> > > root@hbgt-ceph1-mon3:/# ceph -s
> > >  cluster:
> > >    id:     c300532c-51fa-11ec-9a41-0050569c3b55
> > >    health: HEALTH_WARN
> > >            15 pgs not deep-scrubbed in time
> > >            13 pgs not scrubbed in time
> > >
> > >  services:
> > >    mon: 3 daemons, quorum hbgt-ceph1-mon1,hbgt-ceph1-mon2,hbgt-ceph1-mon3 (age 36h)
> > >    mgr: hbgt-ceph1-mon2.nteihj(active, since 2d), standbys:
> > > hbgt-ceph1-mon1.thrnnu, hbgt-ceph1-mon3.gmfzqm
> > >    osd: 60 osds: 60 up (since 13h), 60 in (since 13h); 1 remapped pgs
> > >    rgw: 3 daemons active (3 hosts, 2 zones)
> > >
> > >  data:
> > >    pools:   13 pools, 289 pgs
> > >    objects: 67.74M objects, 127 TiB
> > >    usage:   272 TiB used, 769 TiB / 1.0 PiB avail
> > >    pgs:     288 active+clean
> > >             1   active+remapped+backfilling
> > >
> > >  io:
> > >    client:   3.3 KiB/s rd, 1.5 MiB/s wr, 3 op/s rd, 8 op/s wr
> > >    recovery: 790 KiB/s, 0 objects/s
> > >
> > >
> > > What can I do to understand this slow recovery (is it the backfill
> > > operation that is slow?)
> > >
> > > Thank you
> > >
> > > 'Jof
> >
> >
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



