Another update: Giovanna agreed to switch back to mclock_scheduler and
adjust osd_snap_trim_cost to 400K. It looks very promising; after a
few hours the snaptrim queue had been processed.
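For reference, the change boils down to roughly the following (from memory,
so please double-check the exact syntax; 400K is simply the value that worked
for us, not a general recommendation):

  ceph config set osd osd_op_queue mclock_scheduler
  ceph config set osd osd_snap_trim_cost 400K   # or 409600 if the suffix isn't accepted
  # osd_op_queue only takes effect after restarting the OSDs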
@Sridhar: thanks a lot for your valuable input!
Quoting Eugen Block <eblock@xxxxxx>:
Quick update: we decided to switch to wpq to see if that would
confirm our suspicion, and it did. After a few hours, all PGs in the
snaptrim queue had been processed. We haven't looked into the
average object sizes yet; maybe we'll try that approach next week or
so. If you have any other ideas, let us know.
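In case it's useful for anyone else hitting this, the switch and the check we
used were roughly (again from memory, so verify the syntax):

  ceph config set osd osd_op_queue wpq
  # restart the OSDs for the scheduler change to take effect
  ceph pg dump pgs | grep -c snaptrim   # PGs still in snaptrim/snaptrim_wait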
Quoting Eugen Block <eblock@xxxxxx>:
Hi,
As expected, the issue was not resolved and turned up again a couple
of hours later. Here's the tracker issue:
https://tracker.ceph.com/issues/67702
I also attached a log snippet from one OSD with debug_osd 10 to the
tracker. Let me know if you need anything else; I'll stay in touch
with Giovanna.
Thanks!
Eugen
Quoting Sridhar Seshasayee <sseshasa@xxxxxxxxxx>:
Hi Eugen,
On Fri, Aug 23, 2024 at 1:37 PM Eugen Block <eblock@xxxxxx> wrote:
Hi again,
I have a couple of questions about this.
What exactly happened to the PGs? They were queued for snaptrimming,
but we didn't see any progress. Let's assume the average object size
in that pool was around 2 MB (I don't have the actual numbers). Does
that mean that with osd_snap_trim_cost (1M default) set too low, those
overly large objects weren't trimmed? And once we split the PGs, reducing
the average object size to 1 MB, the objects could then be trimmed?
Does this explanation make sense?
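(For the record, I'd estimate the average per-pool object size with something
like 'ceph df detail' and dividing STORED by OBJECTS for the pool in question;
that's only an approximation, of course, since object sizes can vary a lot
within a pool.)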
If you have the OSD logs, I can take a look and see why the snaptrim ops
did not make progress. The cost is one contributing factor to the position
of the op in the queue. Therefore, even if the cost misrepresents the
actual average size of the objects in the PG, the op should still be
scheduled based on the set cost and the profile allocations.
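If you want to double-check what the scheduler is actually working with, the
active profile and the measured OSD capacity can be inspected along these
lines (option names from recent releases; adjust the OSD id):

  ceph config get osd.0 osd_mclock_profile
  ceph config get osd.0 osd_mclock_max_capacity_iops_ssd
  ceph config show osd.0 | grep mclock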
From the thread, I understand that the OSDs are NVMe-based. Based on
the actions taken to resolve the situation (increasing pg_num to 64),
I think something else was going on in the cluster. For an NVMe-based
cluster, the current cost shouldn't cause stalling of the snaptrim
ops. I'd suggest raising an upstream tracker with your observations and
OSD logs to investigate this further.
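Something like the following should capture enough detail for the tracker
(raise the log level during a window where PGs are stuck in the snaptrim
queue, then reset it afterwards):

  ceph tell osd.<id> config set debug_osd 10
  ceph pg dump pgs | grep snaptrim   # note the affected PGs and their primary OSDs
  ceph tell osd.<id> config set debug_osd 1/5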
I just browsed through the changes; if I understand the fix correctly,
the average object size is now calculated automatically, right? That
makes a lot of sense to me: as an operator I don't want to care too
much about average object sizes, since Ceph should know them better
than I do. ;-)
Yes, that's correct. This fix was part of the effort to incrementally
bring background OSD operations under mClock scheduling.
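To give a simplified example of what that means in practice: a PG holding
10 GiB of data across 5,000 objects works out to an average object size of
roughly 2 MiB, and that derived value, rather than a statically configured
osd_snap_trim_cost, is what ends up being used as the per-op cost.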
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx