Re: snaptrim number of objects

On 8/21/23 17:38, Angelo Höngens wrote:

On 21/08/2023 16:47, Manuel Lausch wrote:
Hello,

On my test cluster I played a bit with ceph quincy (17.2.6).
I also see slow ops while deleting snapshots. With the previous major
(pacific) this wasn't an issue.
In my case this is related to the new mclock scheduler, which is the
default with quincy. With "ceph config set global osd_op_queue wpq"
the issue is gone (after restarting the OSDs, of course). wpq was the
previous default scheduler.

Maybe this will help you.

On the other hand, mclock shouldn't bring the cluster down like this,
at least not with the "high_client_ops" profile, which is what I used.
Maybe someone should have a look at this.


Manuel
Hey Manuel,

You made me a happy man (for now!)

In short: wpq indeed seems to do waaaaay better in my setup.

We did a lot of tuning with the mclock scheduler, tuned
osd_mclock_max_capacity_iops_hdd, tried a lot of different settings
for osd_snap_trim_sleep_hdd/ssd, etc., but it did not have the
desired effect. The only thing that prevented my cluster from going
down was setting osd_max_trimming_pgs to 0 on all OSDs and then
raising it to 1 or 2 on a few OSDs at a time. As soon as I enabled
it on too many OSDs, everything would bog down: slow ops everywhere,
hanging cephfs clients, etc. I think I could do a maximum of about
100 objects/sec of snaptrimming.
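
For reference, that throttling workaround looked roughly like this (a
sketch from memory; the OSD ids are just examples):

    # stop snap trimming by allowing zero concurrently-trimming PGs per OSD
    ceph config set osd osd_max_trimming_pgs 0
    # then let a couple of OSDs at a time trim again
    ceph config set osd.1 osd_max_trimming_pgs 1
    ceph config set osd.2 osd_max_trimming_pgs 2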

I also played around with the different mclock profiles to speed up
recovery. I think with high_client_ops we got 400-600 MB/s client I/O
and about 50 MB/s recovery I/O (we had a few degraded objects and some
rebalancing). With the high_recovery_ops profile I think I was able to
get around 400-500 MB/s client writes and 300 MB/s recovery. As soon as
I enabled snaptrimming, everything got quite a bit slower.
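
(Switching between those profiles was just a single config change,
something like the line below; the profile names are the ones from the
quincy docs.)

    # prefer recovery over client traffic; set back to high_client_ops to revert
    ceph config set osd osd_mclock_profile high_recovery_ops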

At your suggestion I just changed osd_op_queue to wpq, removed
almost all other OSD config variables, and restarted all OSDs.
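
Concretely, something like this (osd_op_queue is only picked up when an
OSD starts, hence the restart; the restart command will differ per
deployment, this one is for a package-based install):

    # go back to the pre-quincy default scheduler
    ceph config set global osd_op_queue wpq
    # restart the OSDs on each host so the new op queue takes effect
    systemctl restart ceph-osd.target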

Now I see 400-600 MB/s client I/O (normal) AND recovery at
1500 MB/s(!) AND it's also snaptrimming at 250 objects/sec. And I
haven't seen a single slow op warning yet!

I'm still cautious, but for now, this looks very positive!

This also leads me to agree with you that there's 'something wrong'
with the mclock scheduler. I was almost starting to suspect hardware
issues or something like that; I was at my wit's end.


Angelo.


If you have the inclination, I would also be very curious whether enabling this helps:

"rocksdb_cf_compact_on_deletion"


This is a new feature we added in Reef and backported to Quincy/Pacific (disabled by default) that issues a compaction if too many tombstones are encountered during iteration in RocksDB.  You can control how quickly compactions are issued using:


"rocksdb_cf_compact_on_deletion_trigger" (default is 16384, shrink to increase compaction frequency)

"rocksdb_cf_compact_on_deletion_sliding_window" (default is 32768, grow to increase compaction frequency)


The combination of these two parameters dictates how many tombstones (X) must be encountered over a window of Y keys before a compaction is triggered.  The defaults are pretty conservative, so you may need to play with them if you are hitting too many tombstones.  If compactions are triggered too frequently, you can increase the number of allowed tombstones (X) per Y keys.
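
If anyone wants to experiment, these are ordinary OSD options and can be
set through the mon config db; the values below are only illustrative
(shrinking the trigger and growing the window both make compaction fire
more often), and a restart of the OSDs may be needed for the change to
apply:

    # enable compaction-on-deletion in the OSDs' RocksDB instances
    ceph config set osd rocksdb_cf_compact_on_deletion true
    # more aggressive than the defaults of 16384 tombstones per 32768-key window
    ceph config set osd rocksdb_cf_compact_on_deletion_trigger 8192
    ceph config set osd rocksdb_cf_compact_on_deletion_sliding_window 65536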

Mark



On Mon, Aug 21, 2023 at 4:49 PM Manuel Lausch <manuel.lausch@xxxxxxxx> wrote:
Hello,

On my test cluster I played a bit with ceph quincy (17.2.6).
I also see slow ops while deleting snapshots. With the previous major
(pacific) this wasn't an issue.
In my case this is related to the new mclock scheduler, which is the
default with quincy. With "ceph config set global osd_op_queue wpq"
the issue is gone (after restarting the OSDs, of course). wpq was the
previous default scheduler.

Maybe this will help you.

On the other hand, mclock shouldn't bring the cluster down like this,
at least not with the "high_client_ops" profile, which is what I used.
Maybe someone should have a look at this.


Manuel



On Fri, 4 Aug 2023 17:40:42 -0400
Angelo Höngens <angelo@xxxxxxxxxx> wrote:

Hey guys,

I'm trying to figure out what's happening to my backup cluster, which
often grinds to a halt when cephfs automatically removes snapshots.
Almost all OSDs go to 100% CPU, ceph complains about slow ops, and
CephFS stops doing client I/O.

I'm graphing the cumulative value of snaptrimq_len, and it slowly
decreases over time. One night it takes an hour, but on other days,
like today, my cluster has been down for almost 20 hours and I think
we're only halfway. The funny thing is that in both cases the
snaptrimq_len value initially climbs to about the same number, around
3000, and then slowly decreases, so my guess is that the number of
objects that need to be trimmed varies hugely from day to day.
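
(I'm pulling that number roughly like this -- a sketch that assumes the
quincy JSON layout keeps the per-PG snaptrimq_len field under
pg_map.pg_stats:)

    # sum the snap trim queue length over all PGs
    ceph pg dump --format json 2>/dev/null | jq '[.pg_map.pg_stats[].snaptrimq_len] | add'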

Is there a way to show the size of cephfs snapshots, or get the number
of objects or bytes that need snaptrimming? Perhaps I can graph that
and see where the differences are.

That won't explain why my cluster bogs down, but at least it gives
some visibility. Running 17.2.6 everywhere by the way.

Angelo.

--
Best Regards,
Mark Nelson
Head of R&D (USA)

Clyso GmbH
p: +49 89 21552391 12
a: Loristraße 8 | 80335 München | Germany
w: https://clyso.com | e: mark.nelson@xxxxxxxxx

We are hiring: https://www.clyso.com/jobs/
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



