Re: Proxmox/ceph upgrade and addition of a new node/OSDs

On Fri, Sep 21, 2018 at 09:03:15AM +0200, Hervé Ballans wrote:
> Hi MJ (and all),
> 
> So we upgraded our Proxmox/Ceph cluster, and if we have to summarize the
> operation in a few words : overall, everything went well :)
> The most critical operation of all is the 'osd crush tunables optimal', which I
> talk about in more detail below...
> 
> The Proxmox documentation is really well written and accurate, and normally,
> following it step by step is almost sufficient !

Glad to hear that everything worked well.

> 
> * first step : upgrade Ceph Jewel to Luminous :
> https://pve.proxmox.com/wiki/Ceph_Jewel_to_Luminous
> (Note here : OSDs remain in FileStore backend, no BlueStore migration)
> 
> * second step : upgrade Proxmox version 4 to 5 :
> https://pve.proxmox.com/wiki/Upgrade_from_4.x_to_5.0
> 
> Just some numbers, observations and tips (based on our feedback, I'm not an
> expert !) :
> 
> * Before migration, make sure you are on the latest version of Proxmox 4
> (4.4-24) and Ceph Jewel (10.2.11)
> 
> * We don't use the pve repository for the Ceph packages but the official one
> (download.ceph.com). Thus, during the upgrade of Proxmox PVE, we don't
> replace the ceph.com repository with the proxmox.com Ceph repository...

This is not recommended (and for a reason) - our packages are almost
identical to the upstream/official ones. But we do include the
occasional bug fix much faster than the official packages do, including
reverting breakage. Furthermore, when using our repository, you know
that the packages went through our own testing to ensure compatibility
with our stack (e.g., issues like JSON output changing from one minor
release to the next breaking our integration/GUI). Also, this natural
delay between upstream releases and availability in our repository has
saved our users from lots of "serious bug noticed one day after release"
issues since we switched to providing Ceph via our own repositories.
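
For reference, pointing a node at our Ceph repository roughly looks like the
following (the repository line below assumes PVE 5 on Debian stretch - please
double-check it against the wiki before using it):

    # replace the ceph.com entry with the Proxmox-provided Ceph Luminous repo
    echo "deb http://download.proxmox.com/debian/ceph-luminous stretch main" \
        > /etc/apt/sources.list.d/ceph.list
    apt update
    apt full-upgrade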

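Since the Jewel -> Luminous step was mentioned above: the core of it boils down
to something like this sketch (rough outline only - the wiki page linked above
is the authoritative sequence, so check the exact commands there):

    pveversion -v                           # confirm latest Proxmox 4.4 first
    ceph --version                          # confirm Jewel 10.2.11 everywhere
    ceph osd set noout                      # avoid rebalancing during restarts
    # upgrade the Ceph packages on every node, then, node by node:
    systemctl restart ceph-mon.target       # restart the monitors first
    systemctl restart ceph-osd.target       # then the OSDs
    ceph osd require-osd-release luminous   # once all OSDs run Luminous
    pveceph createmgr                       # one manager per monitor node
    ceph osd unset noout
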
> * When you upgrade Ceph to Luminous (without tunables optimal), there is no
> impact on Proxmox 4. VMs are still running normally.
> The side effect (non-blocking for the functioning of VMs) is located in the
> GUI, on the Ceph menu : it can't report the status of the Ceph cluster, as it
> hits a JSON formatting error (indeed, the output of the command 'ceph -s' is
> completely different, and much more readable, on Luminous)

Yes, this is to be expected. Backporting all of that just for the short
time window of "upgrade in progress" is too much work for too little
gain.
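
In the meantime the cluster state can simply be checked from the shell, for
example:

    ceph -s                          # human-readable cluster status
    ceph -s --format json-pretty     # the JSON the GUI would normally consume
    ceph versions                    # per-daemon versions, handy mid-upgrade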

> 
> * A little step is missing in section 8 "Create Manager instances" of the
> upgrade ceph documentation. As the Ceph manager daemon is new since
> Luminous, the package doesn't exist on Jewel. So you have to install the
> ceph-mgr package on each node first before doing 'pveceph createmgr'

It actually does not ;) ceph-mgr is pulled in by ceph on upgrades from
Jewel to Luminous - unless you manually removed that package at some
point.
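
If in doubt, it is easy to check and fix by hand, something along these lines:

    dpkg -l ceph-mgr       # is the package installed at all?
    apt install ceph-mgr   # only needed if it is actually missing
    pveceph createmgr      # create a manager instance on this node
    ceph -s                # the 'mgr:' line should list active/standby daemons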

> Otherwise :
> - verify that all your VMs have recently been backed up to an external storage
> (in case a disaster recovery plan is needed !)

Good idea in general :D
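
For the record, this can also be done quickly from the command line - a minimal
example, where 'backup-nfs' is just a placeholder for your external storage:

    # full backup of VM 101 to an external storage, using snapshot mode
    vzdump 101 --storage backup-nfs --mode snapshot --compress lzo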

> - if you can, stop all your non-critical VMs (in order to limit client I/O
> operations)
> - if any, wait for the end of the current backups, then disable the datacenter
> backup jobs (in order to limit client I/O operations). !! do not forget to
> re-enable them when everything is over !!
> - if any snapshots are no longer needed, delete them : it removes many useless
> objects !
> - start the tunables operation outside of major activity periods (night,
> weekend, ...) and take into account that it can be very slow...

Scheduling and carefully planning rebalancing operations is always
needed on a production cluster. Note that the upgrade docs state that
switching to "tunables optimal" is recommended, but "will cause a
massive rebalance".
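
If someone wants to soften the impact, the rebalance triggered by the tunables
change can be throttled - a rough sketch, with deliberately conservative values
that should be adapted to your own cluster:

    # keep backfill/recovery slow before triggering the data movement
    ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'
    ceph osd crush tunables optimal    # this starts the massive rebalance
    ceph -s                            # watch the misplaced/degraded objects drain
    # once the cluster is back to HEALTH_OK, the throttle can be relaxed again
    ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 3'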

> There are probably some options to configure in Ceph to avoid the 'pgs stuck'
> states, but on our side, as we had previously moved our critical VMs' disks, we
> didn't care about that !
> 
> * Anyway, the Proxmox PVE upgrade step goes easily and quickly (just
> follow the documentation). Note that you can upgrade Proxmox PVE before
> doing the 'tunables optimal' operation.
> 
> Hoping that you will find this information useful, good luck with your
> upcoming migration !
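
Regarding the 'pgs stuck' remark: besides throttling recovery as sketched
earlier, the cluster flags can be used to pause the data movement temporarily
during busy hours (just a pointer, not required if the rebalance is simply left
to run):

    ceph osd set norebalance      # pause rebalancing
    ceph osd set nobackfill       # pause backfill
    # ... and later, when client load allows it again:
    ceph osd unset norebalance
    ceph osd unset nobackfill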

Thank you for the detailed report and feedback!

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com