Upgrading Ceph 16.2 using rook

Nico Schottelius <nico.schottelius@xxxxxxxxxxx> · Mon, 25 Apr 2022 08:44:53 +0200

Good morning everyone,

Recently we have read a lot of questions and updates around cephadm. Some of
you might remember that we went down the road of rook-ceph instead of
cephadm, and I wanted to give a short overview of how the update from
16.2.6 to 16.2.7 is performed on rook-ceph, with the intention of
spreading information about how containers/rook work.

Long story short, the entire work required to upgrade the whole cluster
was, for this upgrade:

Change the line

    image: quay.io/ceph/ceph:v16.2.6

to

    image: quay.io/ceph/ceph:v16.2.7

in the CephCluster definition (appended below for reference [0]), git commit
& git push it. And from there on, it's only waiting.
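
As a rough sketch of that single change, the edit could be scripted as below. The helper name `bump_ceph_image` and the file name `cephcluster.yaml` are assumptions for illustration, not something from our repository; the commit and push are left as comments because they depend on your git setup.

```shell
#!/bin/sh
# Hypothetical helper: rewrite the Ceph image tag in a CephCluster manifest.
# Usage: bump_ceph_image <manifest> <old-tag> <new-tag>
bump_ceph_image() {
    manifest=$1
    old=$2
    new=$3
    # Only the image line is touched; a temp file avoids relying on
    # the non-portable "sed -i".
    sed "s|image: quay.io/ceph/ceph:${old}\$|image: quay.io/ceph/ceph:${new}|" \
        "$manifest" > "$manifest.tmp" &&
        mv "$manifest.tmp" "$manifest"
}

# Example (file name is an assumption):
#   bump_ceph_image cephcluster.yaml v16.2.6 v16.2.7
#   git commit -m "Upgrade Ceph to v16.2.7" cephcluster.yaml
#   git push
```

In practice editing the line by hand is just as good; the point is only that the whole upgrade request is a one-line diff.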

We are utilising argocd [1], which picks up the cluster state from git
and then updates the kubernetes custom resource "CephCluster" using our
git commit.

From there on, the rook-ceph-operator, basically a process running in
kubernetes, detects that an upgrade is requested and checks the status of
the monitors, upgrades them one after another, then continues with the mgr
and finally upgrades the OSDs (i.e. changes their image to the new
version).
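
One way to follow this rollout from the outside is to count which Ceph version each rook-managed deployment currently reports. The sketch below assumes that the deployments carry the labels `rook_cluster=<namespace>` and `ceph-version`, as described in the Rook upgrade guide as far as I recall; label names may differ between Rook releases, so treat them as assumptions. The function name `ceph_versions` is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: count deployments per reported Ceph version.
# Assumes Rook sets the "rook_cluster" and "ceph-version" labels.
ceph_versions() {
    namespace=${1:-rook-ceph}
    kubectl -n "$namespace" get deployments -l rook_cluster="$namespace" \
        -o jsonpath='{range .items[*]}{.metadata.labels.ceph-version}{"\n"}{end}' |
        sort | uniq -c
}

# Example: re-run every few seconds while the operator is working
#   watch -n 5 'kubectl ...'   or simply call ceph_versions in a loop
```

While the upgrade is in progress two versions should show up, with the count for the old one shrinking as mons, mgr and OSDs are replaced.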

The interesting bit from our side: the behaviour is pretty much
"standard" in terms of how we upgrade our native ceph clusters, just
fully automated and observable:

Using kubernetes log functionality [2], we watched the operator progress
and take actions (waiting for monitors to join the quorum, etc.),
depending on the cluster state.

This comes with the typical two sides of the same coin: the whole
upgrade is fully automated, so if everything works fine, the hands-on
working time is a matter of minutes, not hours, even when upgrading
dozens or hundreds of OSDs. However, if things go wrong, you'll need to
work against the automation (i.e. stopping the operator, deploying
things manually, etc.).
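
For illustration, "stopping the operator" can be as simple as scaling its deployment to zero. This is a minimal sketch, assuming the deployment is called `rook-ceph-operator` and lives in the `rook-ceph` namespace (which matches the pod name in [2]); the function name is hypothetical. If the operator deployment itself is managed by argocd with self-healing enabled, argocd may scale it back up, so automated sync might have to be paused as well.

```shell
#!/bin/sh
# Hypothetical helper: pause or resume the Rook operator by scaling
# its deployment. Deployment name and namespace are assumptions.
set_operator_replicas() {
    replicas=$1
    kubectl -n rook-ceph scale deployment rook-ceph-operator \
        --replicas="$replicas"
}

# Example:
#   set_operator_replicas 0   # stop the operator before manual work
#   set_operator_replicas 1   # hand control back afterwards
```

The running Ceph daemons are separate pods and keep running while the operator is scaled down; only the reconciliation stops.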

For us it is very interesting to see the differences between
Devuan/home-made/Ceph orchestration ("we know/do everything") and
Alpine/Kubernetes/Rook/Ceph ("the operator does everything").

Best regards,

Nico

p.s.: The process for updating rook itself is pretty similar, just a
git commit; however, it comes without restarting the mons/mgr/osds.

--------------------------------------------------------------------------------
[0]

apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  cephVersion:
    image: quay.io/ceph/ceph:v16.2.7
  dataDirHostPath: /var/lib/rook
  mon:
    count: 5
    allowMultiplePerNode: false
  storage:
    useAllNodes: true
    useAllDevices: true
    onlyApplyOSDPlacement: false
  mgr:
    count: 1
    modules:
      - name: pg_autoscaler
        enabled: true
  network:
    ipFamily: "IPv6"
    dualStack: false
  crashCollector:
    disable: false
    # Prune ceph crash entries older than the specified number of days.
    daysToRetain: 30

--------------------------------------------------------------------------------

[1] https://argo-cd.readthedocs.io/en/stable/

[2]

kubectl -n rook-ceph logs -f rook-ceph-operator-85f45d468f-lhwmm

--
Sustainable and modern Infrastructures by ungleich.ch
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx