Re: Snaptrim issue after nautilus to octopus upgrade

Oliver Freyermuth <freyermuth@xxxxxxxxxxxxxxxxxx> · Mon, 26 Aug 2024 21:12:05 +0200

Hi Özkan,

to my understanding, the upmap balancer should not affect this problem, but since I'm currently also living on Octopus with three clusters, it would of course be nice if someone with a better understanding of the bug confirms ;-).

You can usually jump either one or two major releases in one go (the upgrade instructions provide more details). Personally, I have always gone in steps of a single release, since in most cases OS changes were needed as well.
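
To make the "one or two major releases" rule of thumb concrete, here is a minimal sketch. The helper names release_major and hop_ok are made up for illustration, and the check encodes only the rule of thumb above, not the official upgrade instructions, which remain authoritative for any specific release.

```shell
# Map a Ceph release name to its major version number.
release_major() {
  case "$1" in
    mimic)    echo 13 ;;
    nautilus) echo 14 ;;
    octopus)  echo 15 ;;
    pacific)  echo 16 ;;
    quincy)   echo 17 ;;
    *)        return 1 ;;   # unknown release name
  esac
}

# Hypothetical helper: succeeds if upgrading from $1 to $2 spans
# exactly one or two major releases (rule of thumb only).
hop_ok() {
  local from to diff
  from=$(release_major "$1") || return 2
  to=$(release_major "$2") || return 2
  diff=$((to - from))
  [ "$diff" -ge 1 ] && [ "$diff" -le 2 ]
}
```

With this, "hop_ok octopus quincy" succeeds, while "hop_ok nautilus quincy" fails because it would span three releases.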

FWIW, we are running three Octopus clusters with RockyLinux 8 (one cluster as "Backup system" with focus on RGW and mirrored RBD volumes, one with RBD as virtualization backend, and one huge cluster with CephFS of which we also cannot take backups).
Since our RGW workload is quite simple (backups via Restic, Duplicati and similar), I cannot contribute much experience about "stability across upgrades" (it worked fine for us across several releases, going from CentOS 7 with Mimic in the past to RockyLinux 8 with Octopus for now), but maybe someone else on this list will chime in.

Cheers,
	Oliver

On 26.08.24 at 18:48, Özkan Göksu wrote:
I also don't use the PG autoscaler. I have calculated the PG sizing myself from the beginning, because this feature did not exist back then :)
I wonder whether it can be affected by the upmap balancer?

I started the upgrade in order to reach the Quincy release, and I'm upgrading along the path "Nautilus > Octopus > Pacific > Quincy".
I wonder whether I can upgrade directly from Nautilus to Quincy, or from Octopus to Quincy?

On this cluster I only have an RBD pool, but on other clusters I have RGW S3 on an erasure-coded pool together with CephFS pools.
I'm really terrified about this cluster because of the RGW updates and features. As far as I know, too much has changed on that side, and I'm not sure how smooth the upgrade will be.
The cluster size is 4 petabytes and taking a backup is not an option :)
The worst part is that I have to change the OS and kernel from Arch Linux to a custom-made Ubuntu to be able to apply this change.

On Mon, 26 Aug 2024 at 18:08, Oliver Freyermuth <freyermuth@xxxxxxxxxxxxxxxxxx> wrote:

    Hello Özkan,

    that's great to hear!

    I think the advice from Boris is also critical to keep in mind. We have not encountered the bug he mentioned yet, but to my understanding, the pglog_dup issue should only be triggered if PGs are split or merged (e.g. if the autoscaler is used), which probably explains why we have not seen it yet.
    The issue Boris linked also shows a way to test whether you are affected.

    So it's good to hear that in your case, the issue was more "harmless" and similar to what we observed :-).

    Cheers,
             Oliver

    On 26.08.24 at 14:39, Özkan Göksu wrote:
     > Hello Oliver.
     >
     > I confirm your solution works.
     > Compaction took 2 minutes per SSD, and I spent 8 hours on the whole cluster.
     > While the compaction was running, I kept the nosnaptrim flag set.
     > When the compaction completed, I ran "ceph tell osd.* injectargs '--osd-snap-trim-sleep 10'" and unset nosnaptrim.
     > Snap trimming took 1 day to clear 2 weeks of snaps, and thanks to '--osd-snap-trim-sleep 10' I didn't see any slowdown while the snaps were being trimmed.
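
The sequence described above could be sketched roughly as follows. The function names and the CEPH override variable are assumptions added for illustration (setting CEPH=echo prints the commands instead of running them); the ceph commands themselves are the ones quoted in this thread plus the nosnaptrim flag handling.

```shell
# Command to invoke; override with CEPH=echo for a dry run (illustrative convention).
CEPH=${CEPH:-ceph}

# Compact the RocksDB of each OSD id given as an argument, one after another.
compact_osds() {
  local id
  for id in "$@"; do
    $CEPH tell "osd.$id" compact
  done
}

# Keep snap trimming paused while compacting, then resume it throttled.
compact_then_trim() {
  $CEPH osd set nosnaptrim || return 1
  compact_osds "$@"
  # Throttle trimming before it is allowed to start again.
  $CEPH tell 'osd.*' injectargs '--osd-snap-trim-sleep 10'
  $CEPH osd unset nosnaptrim
}
```

A dry run such as "CEPH=echo compact_then_trim 0 1" shows the order of operations without touching the cluster. The sleep value of 10 is the one used in this thread, not a general recommendation.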
     >
     > Thank you for the advice.
     >
     >
     > On Fri, 23 Aug 2024 at 19:24, Boris <bb@xxxxxxxxx> wrote:
     >
     >     I tried it with offline compaction, and it didn't help a bit.
     >
     >     It took ages per OSD and starting the OSD afterwards wasn't fast either.
     >
     >
     >
     >      > On 23.08.2024 at 18:16, Özkan Göksu <ozkangksu@xxxxxxxxx> wrote:
     >      >
     >      > I have 12+12 = 24 servers with 8 x 4TB SAS SSD on each node.
     >      > I will use the weekend and I will start compaction on 12 servers on
     >      > Saturday and 12 others on Sunday and when the compaction is complete I will
     >      > unset nosnaptrim and let the cluster clean the 2 weeks of snaps leftover.
     >      >
     >      > Thank you for the advice, I will share the results when it's done.
     >      >
     >      > Regards.
     >      >
     >      > On Fri, 23 Aug 2024 at 18:48, Oliver Freyermuth <freyermuth@xxxxxxxxxxxxxxxxxx> wrote:
     >      >
     >      >> Hi Özkan,
     >      >>
     >      >> in our case, we tried online compaction first, and it helped to resolve
     >      >> the issue completely. I first tested with a single OSD daemon (i.e. only
     >      >> online compaction of that single OSD), and checked that the load of that
     >      >> daemon went down significantly
     >      >> (that was while snaptrims with a high sleep value were still going on).
     >      >> Then, I went in batches of 10 % of the cluster's OSDs, and they finished
     >      >> rather fast (a few minutes), so I could actually do it without downtime.
     >      >>
     >      >> In older threads on this list, snaptrim issues which seemed similar (but
     >      >> not clearly related to an upgrade) required more heavy operations (either
     >      >> offline compaction or OSD recreation).
     >      >> Since online compaction is comparatively "cheap", I'd always try this
     >      >> first. In my case, each OSD took less than 2-3 minutes for this, but of
     >      >> course your mileage may vary.
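
One possible way to script the "batches of 10 % of the OSDs" approach is sketched below. The function name compact_in_batches and the CEPH override variable are illustrative assumptions, not something from the original procedure. Each batch runs in parallel and the next one starts only after the previous one has finished.

```shell
# Command to invoke; override with CEPH=echo for a dry run (illustrative convention).
CEPH=${CEPH:-ceph}

# Online-compact the given OSD ids in batches of roughly 10 % of the total.
compact_in_batches() {
  local total=$# batch_size count=0 id
  batch_size=$(( (total + 9) / 10 ))   # ceil(total / 10)
  for id in "$@"; do
    # Prefix each output line with the OSD id, as in the one-liner from this thread.
    $CEPH tell "osd.$id" compact | sed "s/^/$id: /" &
    count=$((count + 1))
    if [ $((count % batch_size)) -eq 0 ]; then
      wait   # let the current batch finish before starting the next
    fi
  done
  wait       # wait for a final, partially filled batch
}
```

On a live cluster this might be called as "compact_in_batches $(ceph osd ls)". Trying a single OSD first and watching its load, as described above, is still the safer starting point.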
     >      >>
     >      >> Cheers,
     >      >>        Oliver
     >      >>
     >      >>> On 23.08.24 at 17:42, Özkan Göksu wrote:
     >      >>> Hello Oliver.
     >      >>>
     >      >>> Thank you so much for the answer!
     >      >>>
     >      >>> I was thinking of re-creating the OSDs, but if you are sure that
     >      >>> compaction is the solution here, then it's worth a try.
     >      >>> I'm planning to shut down all the VMs, and when the cluster is safe
     >      >>> I will try OSD compaction.
     >      >>> May I ask whether you did online or offline compaction?
     >      >>>
     >      >>> Because I have 2 sides, I can shut down 1 entire rack, do the
     >      >>> offline compaction, and do the same thing on the other side when it's done.
     >      >>> What do you think?
     >      >>>
     >      >>> Regards.
     >      >>>
     >      >>>
     >      >>>
     >      >>>
     >      >>>
     >      >>> On Fri, 23 Aug 2024 at 18:06, Oliver Freyermuth <freyermuth@xxxxxxxxxxxxxxxxxx> wrote:
     >      >>>
     >      >>>    Hi Özkan,
     >      >>>
     >      >>>    FWIW, we observed something similar after upgrading from Mimic =>
     >      >>>    Nautilus => Octopus and starting to trim snapshots afterwards.
     >      >>>
     >      >>>    The size of our cluster was a bit smaller, but the effect was the
     >      >>>    same: When snapshot trimming started, OSDs went into high load and RBD I/O
     >      >>>    was extremely slow.
     >      >>>
     >      >>>    We tried to use:
     >      >>>       ceph tell osd.* injectargs '--osd-snap-trim-sleep 10'
     >      >>>    first, which helped, but of course snapshots kept piling up.
     >      >>>
     >      >>>    Finally, we performed only RocksDB compactions via:
     >      >>>
     >      >>>       for A in {0..5}; do ceph tell osd.$A compact | sed 's/^/'$A': /' & done
     >      >>>
     >      >>>    for some batches of OSDs, and their load went down heavily. Finally,
     >      >>>    after we'd churned through all OSDs, I/O load was low again, and we could
     >      >>>    go back to the default:
     >      >>>       ceph tell osd.* injectargs '--osd-snap-trim-sleep 0'
     >      >>>
     >      >>>    After this, the situation has stabilized for us. So my guess would
     >      >>>    be that the RocksDBs grew too much after the OMAP format conversion and the
     >      >>>    compaction shrank them again.
     >      >>>
     >      >>>    Maybe that also helps in your case?
     >      >>>
     >      >>>    Interestingly, we did not observe this on other clusters (one mainly
     >      >>>    for CephFS, another one with mirrored RBD volumes), which took the same
     >      >>>    upgrade path.
     >      >>>
     >      >>>    Cheers,
     >      >>>             Oliver
     >      >>>
     >      >>>    On 23.08.24 at 16:46, Özkan Göksu wrote:
     >      >>>> Hello folks.
     >      >>>>
     >      >>>> We have a ceph cluster and we have 2000+ RBD drives on 20 nodes.
     >      >>>>
     >      >>>> We upgraded the cluster from 14.2.16 to 15.2.14, and after the
     >      >>>> upgrade we started to see snap trim issues.
     >      >>>> Without the "nosnaptrim" flag, the system is not usable right now.
     >      >>>>
     >      >>>> I think the problem is caused by the omap conversion during the Octopus
     >      >>>> upgrade.
     >      >>>>
     >      >>>> Note that the first time each OSD starts, it will do a format
     >      >>>> conversion to improve the accounting for “omap” data. This may take a few
     >      >>>> minutes to as much as a few hours (for an HDD with lots of omap data). You can
     >      >>>> disable this automatic conversion with:
     >      >>>>
     >      >>>> What should I do to solve this problem?
     >      >>>>
     >      >>>> Thanks.
     >      >>>> _______________________________________________
     >      >>>> ceph-users mailing list -- ceph-users@xxxxxxx
     >      >>>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
     >      >>>
     >      >>>    --
     >      >>>    Oliver Freyermuth
     >      >>>    Universität Bonn
     >      >>>    Physikalisches Institut, Raum 1.047
     >      >>>    Nußallee 12
     >      >>>    53115 Bonn
     >      >>>    --
     >      >>>    Tel.: +49 228 73 2367
     >      >>>    Fax:  +49 228 73 7869
     >      >>>    --
     >      >>>
     >      >>
     >

