Re: laggy OSDs and staling krbd IO after upgrade from nautilus to octopus

Hi all,

I can now add another data point as well. We upgraded our production cluster from mimic to octopus with the following procedure (a rough command-level sketch follows the list):

- set bluestore_fsck_quick_fix_on_mount=false in all ceph.conf files and the mon config store
- set nosnaptrim
- upgrade all daemons
- set require-osd-release=octopus
- host by host: set bluestore_fsck_quick_fix_on_mount=true in ceph.conf and restart the OSDs
- unset nosnaptrim
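
Roughly, the corresponding commands are (only a sketch: the per-host package upgrades and the ceph.conf edits are not spelled out, and the systemd target name assumes a non-containerized package install):

  ceph config set osd bluestore_fsck_quick_fix_on_mount false
  ceph osd set nosnaptrim
  # upgrade all daemons, then:
  ceph osd require-osd-release octopus
  # host by host: set bluestore_fsck_quick_fix_on_mount = true in ceph.conf, then:
  systemctl restart ceph-osd.target
  ceph osd unset nosnaptrim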

On our production system the conversion went much faster than on the test system. The process is very CPU-intensive, yet converting 70 OSDs per host with 2x18-core Broadwell CPUs worked without problems. Load reached more than 200%, but everything finished without crashes.

Upgrading the daemons and completing the conversion of all hosts took 3 very long days. After converting in this way we saw no problems with snaptrim. We also enabled ephemeral pinning on our FS with 8 active MDSes and see no change in single-user performance, but at least 2-3 times higher aggregate throughput (the FS provides home directories for a 500-node HPC cluster).
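
In case it helps others, distributed ephemeral pinning can be switched on roughly like this (a sketch only; /mnt/cephfs/home is just an example mount point, and the config step is needed as long as the MDS option defaults to off):

  ceph config set mds mds_export_ephemeral_distributed true
  setfattr -n ceph.dir.pin.distributed -v 1 /mnt/cephfs/home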

We did have a severe hiccup, though. Very small OSDs with a size of ca. 100G crash on octopus when their OMAP reaches a certain size. I don't know yet what a safe minimum size is (see the ongoing thread "OSD crashes during upgrade mimic->octopus"). The 300G OSDs on our test cluster worked fine.
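
To keep an eye on omap growth per OSD, something like this should do (on octopus the OMAP and META usage is reported per OSD):

  ceph osd df    # watch the OMAP and META columns of the small OSDs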

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Tyler Stachecki <stachecki.tyler@xxxxxxxxx>
Sent: 27 September 2022 02:00
To: Marc
Cc: Frank Schilder; ceph-users
Subject: Re:  Re: laggy OSDs and staling krbd IO after upgrade from nautilus to octopus

Just a data point - we upgraded several large Mimic-born clusters straight to 15.2.12 with the quick fsck disabled in ceph.conf, then set require-osd-release, and finally did the omap conversion offline with the bluestore tool while the OSDs were down, after the cluster had been upgraded (all done in batches). The clusters are as zippy as ever.
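
For the offline conversion, the per-OSD steps look roughly like this (a sketch only; paths and unit names assume a default non-containerized package install, and 'repair' can be used instead of 'quick-fix'):

  systemctl stop ceph-osd@<id>
  ceph-bluestore-tool quick-fix --path /var/lib/ceph/osd/ceph-<id>
  systemctl start ceph-osd@<id>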

Maybe on a whim, try doing an offline fsck with the bluestore tool and see if it improves things?
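
The offline fsck itself is roughly just (with the OSD stopped, as above; --deep additionally reads and verifies object data and takes much longer):

  ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-<id>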

To answer an earlier question, if you have no health statuses muted, a 'ceph health detail' should show you at least a subset of OSDs that have not gone through the omap conversion yet.

Cheers,
Tyler

On Mon, Sep 26, 2022, 5:13 PM Marc <Marc@xxxxxxxxxxxxxxxxx> wrote:
Hi Frank,

Thank you very much for this! :)

>
> we just completed a third upgrade test. There are 2 ways to convert the
> OSDs:
>
> A) convert along with the upgrade (quick-fix-on-start=true)
> B) convert after setting require-osd-release=octopus (quick-fix-on-
> start=false until require-osd-release set to octopus, then restart to
> initiate conversion)
>
> There is a variation A' of A: follow A, then initiate manual compaction
> and restart all OSDs.
>
> Our experiments show that paths A and B do *not* yield the same result.
> Following path A leads to a severely performance-degraded cluster. As of
> now, we cannot confirm that A' fixes this. It seems that the only way
> out is to zap and re-deploy all OSDs, basically what Boris is doing
> right now.
>
> We extended now our procedure to adding
>
>   bluestore_fsck_quick_fix_on_mount = false
>
> to every ceph.conf file and executing
>
>   ceph config set osd bluestore_fsck_quick_fix_on_mount false
>
> to catch any accidents. After daemon upgrade, we set
> bluestore_fsck_quick_fix_on_mount = true host by host in the ceph.conf
> and restart OSDs.
>
> This procedure works like a charm.
>
> I don't know what the difference between A and B is. It is possible that
> B executes an extra step that is missing in A. The performance
> degradation only shows up when snaptrim is active, but then it is very
> severe. I suspect that many users who complained about snaptrim in the
> past have at least 1 A-converted OSD in their cluster.
>
> If you have a cluster upgraded with B-converted OSDs, it works like a
> native octopus cluster. There is very little performance reduction
> compared with mimic. In exchange, I have the impression that it operates
> more stably.
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



