Re: Snaptrimming speed degrades with PG increase

Regarding the reshard topic: that cluster is running Quincy 17.2.7, but I tested the reshard today and the objects were gone.

Istvan
________________________________
From: Frédéric Nass <frederic.nass@xxxxxxxxxxxxxxxx>
Sent: Friday, November 29, 2024 5:17:27 PM
To: Szabo, Istvan (Agoda) <Istvan.Szabo@xxxxxxxxx>
Cc: Ceph Users <ceph-users@xxxxxxx>
Subject: Re: Snaptrimming speed degrades with PG increase

----- On 29 Nov 24, at 11:11, Istvan Szabo, Agoda <Istvan.Szabo@xxxxxxxxx> wrote:
We increased from 9 servers to 11, so let's say 20% more capacity and performance added.

This is a different cluster, purely RBD.
I see, so big objects. You might want to increase osd_max_trimming_pgs and possibly osd_pg_max_concurrent_snap_trims, and see how it goes.
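For reference, a minimal sketch of how these options could be raised at runtime through the centralized config database (available since Mimic); the values below are illustrative starting points, not recommendations, and are best adjusted in small steps while watching client latency:

    ceph config set osd osd_max_trimming_pgs 4
    ceph config set osd osd_pg_max_concurrent_snap_trims 4
    # keep some trim sleep to protect client I/O; lower it gradually
    ceph config set osd osd_snap_trim_sleep 0.05

The running values can then be confirmed on a daemon with, e.g., ceph config show osd.0 (osd.0 being an arbitrary example).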

(As for the other topic: the bucket can't be resharded because, in multisite, resharding makes all the data disappear on the remote site; we would need to create a new bucket with a higher shard count and migrate the data into it first.)
Hum... You have fallen significantly behind on Ceph versions, which must be hindering you in many operational tasks today. Another option would be to catch up on releases and do the reshard on a recent version that supports resharding in multi-site mode.

Frédéric.

Istvan
________________________________
From: Frédéric Nass <frederic.nass@xxxxxxxxxxxxxxxx>
Sent: Friday, November 29, 2024 4:58:52 PM
To: Szabo, Istvan (Agoda) <Istvan.Szabo@xxxxxxxxx>
Cc: Ceph Users <ceph-users@xxxxxxx>
Subject: Re: Snaptrimming speed degrades with PG increase


Hi Istvan,

Did the PG split involve using more OSDs than before? If so, then increasing these values (apart from the sleep) should not have a negative impact on client I/O compared to before the split, and should accelerate the whole process.
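As a rough way to watch whether the trimming actually speeds up after a change, a sketch based on the standard snaptrim / snaptrim_wait PG states (the first output line is a header):

    # PGs currently trimming snapshots
    ceph pg ls snaptrim
    # PGs queued and waiting to trim
    ceph pg ls snaptrim_wait

Fewer and fewer PGs should remain in snaptrim_wait as the run progresses.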

Did you reshard the buckets as discussed in the other thread?

Regards,
Frédéric.

----- On 29 Nov 24, at 3:30, Istvan Szabo, Agoda <Istvan.Szabo@xxxxxxxxx> wrote:

> Hi,
>
> When we scale the placement group count on a pool in a full NVMe cluster, the
> snaptrimming speed degrades a lot.
> Currently we are running with the values below so that client ops are not
> degraded while still making some progress on snaptrimming, but it is terrible.
> (Octopus 15.2.17 on Ubuntu 20.04)
>
> --osd_max_trimming_pgs=2
> --osd_snap_trim_sleep=0.1
> --osd_pg_max_concurrent_snap_trims=2
>
> We had a big pool that used to have 128 PGs, and a snaptrimming run took around
> 45-60 minutes.
> Since it is impossible to do maintenance on the cluster with 600 GB PG sizes
> (they can easily max out the cluster, which happened to us), we increased it to
> 1024 PGs, and the snaptrimming duration increased to 3.5 hours.
>
> Is there any good solution that we are missing to fix this?
>
> At the hardware level I've changed the server profile to tune some NUMA
> settings, but it seems that didn't help either.
>
> Thank you
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



