Hi Eugen,

On Fri, Aug 23, 2024 at 1:37 PM Eugen Block <eblock@xxxxxx> wrote:
> Hi again,
>
> I have a couple of questions about this.
> What exactly happened to the PGs? They were queued for snaptrimming,
> but we didn't see any progress. Let's assume the average object size
> in that pool was around 2 MB (I don't have the actual numbers). Does
> that mean if osd_snap_trim_cost (1M default) was too low, those too
> large objects weren't trimmed? And then we split the PGs, reducing the
> average object size to 1 MB, these objects could be trimmed then,
> obviously. Does this explanation make sense?
>

If you have the OSD logs, I can take a look and see why the snaptrim ops
did not make progress.

The cost is one factor contributing to the position of the op in the
queue. Therefore, even if the cost doesn't accurately represent the
actual average size of the objects in the PG, the op should still be
scheduled based on the set cost and the mClock profile allocations.

From the thread, I understand the OSDs are NVMe-based. Based on the
actions taken to resolve the situation (increasing pg_num to 64), I
suspect something else was going on in the cluster. On an NVMe-based
cluster, the current cost shouldn't cause snaptrim ops to stall. I'd
suggest raising an upstream tracker with your observations and the OSD
logs so this can be investigated further.

> I just browsed through the changes, if I understand the fix correctly,
> the average object size is now calculated automatically, right? Which
> makes a lot of sense to me, as an operator I don't want to care too
> much about the average object sizes since ceph should know them better
> than me. ;-)
>

Yes, that's correct. This fix was part of the effort to incrementally
bring background OSD operations under mClock scheduling.
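
To illustrate the idea, here is a rough Python sketch of deriving the
cost from the PG's stats rather than a fixed setting. It is not the
actual C++ code in the OSD; the names snap_trim_cost and
OSD_SNAP_TRIM_COST, and the clamping to the 1M default, are just for
illustration:

    # Sketch: derive the per-object snaptrim cost from the PG's own
    # statistics instead of relying only on a static setting.
    OSD_SNAP_TRIM_COST = 1 * 1024 * 1024  # 1M default, used as fallback

    def snap_trim_cost(pg_num_bytes: int, pg_num_objects: int) -> int:
        """Estimate the cost of trimming one object in this PG."""
        if pg_num_objects <= 0:
            return OSD_SNAP_TRIM_COST
        avg_obj_size = pg_num_bytes // pg_num_objects
        # Illustrative floor: don't let tiny objects make the op look free.
        return max(avg_obj_size, OSD_SNAP_TRIM_COST)

    # A PG with ~2 MB average objects gets a ~2 MB cost per trimmed
    # object; a PG with 512 KB average objects falls back to the default.
    print(snap_trim_cost(200 << 20, 100))  # 2097152 (~2 MB)
    print(snap_trim_cost(100 << 20, 200))  # 1048576 (floored to 1M)

With a cost derived this way, pools with larger objects naturally get a
higher per-op cost and the mClock scheduler positions the snaptrim ops
accordingly, without the operator having to tune the average object
size by hand.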