Re: Snaptrim issue after nautilus to octopus upgrade

I have 12 + 12 = 24 servers, with 8 x 4 TB SAS SSDs on each node.
I will use the weekend: I will start compaction on 12 servers on Saturday
and on the other 12 on Sunday, and when the compaction is complete I will
unset nosnaptrim and let the cluster clean up the two weeks of leftover snapshots.
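
Roughly, per batch I have something like the sketch below in mind
(node{01..12} is just a placeholder for Saturday's 12 CRUSH host names;
ceph osd ls-tree lists the OSD ids under a given CRUSH bucket):

   # compact every OSD hosted on the nodes of this batch
   for HOST in node{01..12}; do
       for OSD in $(ceph osd ls-tree "$HOST"); do
           ceph tell osd.$OSD compact
       done
   done

   # once both batches (all 24 servers) are done, allow snaptrim again
   ceph osd unset nosnaptrim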

Thank you for the advice; I will share the results when it's done.

Regards.

On Fri, 23 Aug 2024 at 18:48, Oliver Freyermuth <freyermuth@xxxxxxxxxxxxxxxxxx> wrote:

> Hi Özkan,
>
> in our case, we tried online compaction first, and it resolved the issue
> completely. I first tested with a single OSD daemon (i.e. online compaction
> of only that single OSD) and checked that the load of that daemon went down
> significantly (that was while snaptrims with a high sleep value were still
> going on).
> Then I went in batches of 10 % of the cluster's OSDs, and they finished
> rather fast (a few minutes), so I could actually do it without any downtime.
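>
> For reference, the single-OSD test was essentially just
>
>    ceph tell osd.0 compact
>
> (osd.0 being an arbitrary example id), while keeping an eye on that
> daemon's load before moving on to the larger batches.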
>
> In older threads on this list, snaptrim issues which seemed similar (but
> not clearly related to an upgrade) required heavier operations (either
> offline compaction or OSD recreation).
> Since online compaction is comparatively "cheap", I'd always try it first.
> In my case, each OSD took less than 2-3 minutes, but of course your mileage
> may vary.
>
> Cheers,
>         Oliver
>
> On 23.08.24 at 17:42, Özkan Göksu wrote:
> > Hello Oliver.
> >
> > Thank you so much for the answer!
> >
> > I was thinking of re-creating the OSDs, but if you are sure compaction is
> > the solution here, then it's worth trying.
> > I'm planning to shut down all the VMs, and when the cluster is safe I will
> > try OSD compaction.
> > May I ask whether you did online or offline compaction?
> >
> > I have two sides, so I could shut down one entire rack, do the offline
> > compaction, and then do the same on the other side when it's done.
> > What do you think?
> >
> > Regards.
> >
> >
> >
> >
> >
> > On Fri, 23 Aug 2024 at 18:06, Oliver Freyermuth
> > <freyermuth@xxxxxxxxxxxxxxxxxx> wrote:
> >
> >     Hi Özkan,
> >
> >     FWIW, we observed something similar after upgrading from Mimic =>
> >     Nautilus => Octopus and starting to trim snapshots afterwards.
> >
> >     The size of our cluster was a bit smaller, but the effect was the
> same: When snapshot trimming started, OSDs went into high load and RBD I/O
> was extremely slow.
> >
> >     We tried to use:
> >        ceph tell osd.* injectargs '--osd-snap-trim-sleep 10'
> >     first, which helped, but of course snapshots kept piling up.
> >
> >     Finally, we performed only RocksDB compactions via:
> >
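> >        # compacts osd.0 through osd.5 in parallel; the sed prefixes each
> >        # output line with the respective OSD id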
> >        for A in {0..5}; do ceph tell osd.$A compact | sed 's/^/'$A': /'
> & done
> >
> >     for some batches of OSDs, and their load went down heavily. Finally,
> after we'd churned through all OSDs, I/O load was low again, and we could
> go back to the default:
> >        ceph tell osd.* injectargs '--osd-snap-trim-sleep 0'
> >
> >     After this, the situation has stabilized for us. So my guess would
> be that the RocksDBs grew too much after the OMAP format conversion and the
> compaction shrank them again.
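> >
> >     (If you want to verify the DB bloat on your side: "ceph osd df" has
> >     per-OSD OMAP and META columns, so you can compare one OSD's values
> >     before and after compacting it.)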
> >
> >     Maybe that also helps in your case?
> >
> >     Interestingly, we did not observe this on other clusters (one mainly
> for CephFS, another one with mirrored RBD volumes), which took the same
> upgrade path.
> >
> >     Cheers,
> >              Oliver
> >
> >     On 23.08.24 at 16:46, Özkan Göksu wrote:
> >      > Hello folks.
> >      >
> >      > We have a Ceph cluster with 2000+ RBD drives on 20 nodes.
> >      >
> >      > We upgraded the cluster from 14.2.16 to 15.2.14 and after the
> upgrade we
> >      > started to see snap trim issues.
> >      > Without the "nosnaptrim" flag, the system is not usable right now.
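> >      > (By the flag I mean the cluster-wide OSD flag, i.e. we currently run
> >      > with
> >      >
> >      >    ceph osd set nosnaptrim
> >      >
> >      > and plan to unset it again with "ceph osd unset nosnaptrim" once this
> >      > is resolved.)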
> >      >
> >      > I think the problem is caused by the omap conversion during the
> >      > Octopus upgrade.
> >      >
> >      > Note that the first time each OSD starts, it will do a format
> conversion to
> >      > improve the accounting for “omap” data. This may take a few
> minutes to as
> >      > much as a few hours (for an HDD with lots of omap data). You can
> disable
> >      > this automatic conversion with:
> >      >
> >      > What should I do to solve this problem?
> >      >
> >      > Thanks.
> >      > _______________________________________________
> >      > ceph-users mailing list -- ceph-users@xxxxxxx
> >      > To unsubscribe send an email to ceph-users-leave@xxxxxxx
> >
> >     --
> >     Oliver Freyermuth
> >     Universität Bonn
> >     Physikalisches Institut, Raum 1.047
> >     Nußallee 12
> >     53115 Bonn
> >     --
> >     Tel.: +49 228 73 2367
> >     Fax:  +49 228 73 7869
> >     --
> >
>
> --
> Oliver Freyermuth
> Universität Bonn
> Physikalisches Institut, Raum 1.047
> Nußallee 12
> 53115 Bonn
> --
> Tel.: +49 228 73 2367
> Fax:  +49 228 73 7869
> --
>
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



