Re: Why you might want packages not containers for Ceph deployments

On 18.06.21 at 20:42, Sage Weil wrote:
> Following up with some general comments on the main container
> downsides and on the upsides that led us down this path in the first
> place.
> [...]

Thanks, Sage, for the nice and concise summary of the cephadm benefits, and for the reasoning behind why this path was chosen!
Also thanks for your reply to my question about the modularity of the actual orchestrator. I really appreciate it, and will try to reply to both in one place here.


Given the huge activity in this thread, I took a step back to watch and make up my mind,
trying to condense my main issues with the "containers-only" approach, also taking other replies into account.

I hope this is not seen as a rant, but rather as a collection of arguments for an additional orchestrator module,
or maybe even something different. Unfortunately, it has become a wall of text, but I hope at least some will fight their way through it.


First of all, I fully agree with the positive points you raised: it's surely a gain for devs and many users to ship something tested and "complete"
without having to constantly check and extend a full (and still necessarily incomplete) OS test matrix. It also eases testing, especially when trying out experimental features,
and takes away usage complexity, e.g. in the upgrade path.

Of course, there's also the point that a large test matrix across OSs tends to uncover actual bugs or issues which may not show up in a reduced test environment[0],
so shrinking the matrix also comes at a price in reliability, which has to be weighed against the time saved.

The security issue (50 containers -> 50 copies of OpenSSL to patch) also still stands: the earlier question on this list (when to expect patched containers for a CVE affecting a library)
is still unanswered[1], so these are real-life concerns. In general, I don't know of any project that has ever managed to keep up with the workload of following
all CVEs of all its dependencies, announcing them and patching them, since this is comparable to the workload the security teams of Linux distributions handle.
In addition, you'll also need to address when and how to pull new images once patched containers become available, how and when to inform the administrator,
and how to orchestrate service restarts as needed (you'd basically need "needs-restarting" and friends). That's still quite a way to go, and it will be a constant developer effort
from now on.
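
Just to make that last point concrete, here is a minimal sketch of what such a "patched image available" check might look like, assuming skopeo is installed and the cluster is managed by cephadm; the image name and the state-file path are made up for illustration, and a real setup would of course notify the administrator rather than blindly starting an upgrade:

    #!/usr/bin/env python3
    # Hedged sketch: poll the registry for a newer (e.g. openssl-patched) Ceph
    # image digest and, if it changed, pin the cluster upgrade to that digest.
    # Assumes skopeo and "ceph orch" are available; IMAGE and STATE are examples.
    import json
    import pathlib
    import subprocess

    IMAGE = "quay.io/ceph/ceph:v16.2.4"                       # example image
    STATE = pathlib.Path("/var/lib/misc/ceph-image-digest")   # hypothetical state file

    def registry_digest(image: str) -> str:
        """Return the current manifest digest of the image in the registry."""
        out = subprocess.run(["skopeo", "inspect", f"docker://{image}"],
                             check=True, capture_output=True, text=True).stdout
        return json.loads(out)["Digest"]

    current = registry_digest(IMAGE)
    previous = STATE.read_text().strip() if STATE.exists() else ""
    if current != previous:
        # A new image has been published; here one would notify the admin
        # and/or kick off a rolling restart pinned to the new digest.
        repo = IMAGE.rsplit(":", 1)[0]
        subprocess.run(["ceph", "orch", "upgrade", "start",
                        "--image", f"{repo}@{current}"], check=True)
        STATE.write_text(current + "\n")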

That being said, Ceph may be the first project ever to manage to fulfil expectations here, thanks to the close coupling with those guys wearing red hats ;-).


Another point raised on this list is that some users are anxious about pushing a "magic" button which upgrades a whole cluster.
Sure, this button is super useful: it incorporates developer wisdom, and it allows the developers to test the full sequence and ship it to everybody.
So buttons like "ceph orch upgrade" are useful and, I should say, important.
However, by design, they hide the "inner workings" of Ceph, which is a major drawback for some users.
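
To illustrate, this is roughly what the button looks like from the admin's side in a cephadm-managed cluster (the target version is only an example); everything that happens underneath stays hidden, which is exactly the point:

    #!/usr/bin/env python3
    # Minimal sketch of "pushing the button" and then watching it. Uses only the
    # documented "ceph orch upgrade start/status" commands and simply prints the
    # status output instead of interpreting it.
    import subprocess
    import time

    subprocess.run(["ceph", "orch", "upgrade", "start", "--ceph-version", "16.2.5"],
                   check=True)
    for _ in range(30):                      # watch for up to ~30 minutes
        status = subprocess.run(["ceph", "orch", "upgrade", "status"],
                                check=True, capture_output=True, text=True).stdout
        print(status.strip())
        time.sleep(60)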


After this introduction, let me come to my main personal concern: the loss of the ability to integrate Ceph into existing systems.

Our model of operation is to have all machines (anything, be it your off-the-shelf desktop, a laptop,
a hypervisor, a compute node or a Ceph node) handled by the very same configuration management. It means all configuration is self-documenting,
reinstallation is done with the push of a button, and anybody who understands the configuration management and the services at a basic level can take over operations.
It's the only way we _can_ operate, given the huge number of services requested and required in the IT business these days.

To give just one example: we mount kerberised NFS on all our desktop nodes, backed by CephFS and exposed via nfs-ganesha. The desktops run Ubuntu/Debian, the file servers CentOS.
If I need to change the Kerberos configuration (order of KDCs, roll out new principals etc.), this is a change in a single place for us: we perform the change in Puppet,
wait 30 minutes, and all systems run the new configuration[2].
Operating a service with its own orchestrator instead means I have to adapt the configuration of that service manually.
I need someone who is able to do that (i.e. two configuration systems have to be learnt), and who then does it for all instances of the service (e.g. all Ceph clusters).
Hence, a previously simple change is multiplied in complexity. I can no longer just replace the OS disk of a Ceph OSD node, push "reinstall"
and let Puppet install all services and Ceph packages (so that I only have to adopt the disks; that step is not automated as a safety precaution);
now I also have to talk to the Ceph orchestrator.
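
For reference, the "adopt the disks" step in the package-based world is small; here is a minimal sketch, assuming Puppet has already put the packages, ceph.conf and the bootstrap keyrings back in place (whether to script it or keep it manual, as we do, is a site decision):

    #!/usr/bin/env python3
    # Hedged sketch: re-activate the existing LVM-based OSDs on a freshly
    # reinstalled, package-based OSD node and make sure they come up on boot.
    import subprocess

    def run(*cmd: str) -> None:
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    # Scan the LVM metadata of all existing OSD data devices and recreate
    # their tmpfs mounts and systemd units.
    run("ceph-volume", "lvm", "activate", "--all")
    # Enable the umbrella target so the OSDs also start on future reboots.
    run("systemctl", "enable", "--now", "ceph-osd.target")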

So automation is a must, and we also rely heavily on containers for scientific workloads to offer a large variety of software stacks to our users.
Automation works brilliantly on Linux (and likely also on similar open platforms), since almost all services can be combined as building blocks
and controlled by our configuration management, Puppet (or any other tool). It's basically a consequence of the Unix idea that every program does one thing well
and can be "glued" together with others as needed for site-specific requirements.

While the "cephadm with containers" approach does this kind of glueing for me, it makes it harder (impossible?) to integrate as-is into existing configuration management systems.
I think this is also a strong point, and reading through the list, one of the major reasons why larger ceph sites do not want to use cephadm with containers seems to be exactly that:
It can't easily be "cut into pieces" and be integrated with an existing system they use for all other infrastructure.


So in summary:
The orchestrator is a good thing, but my point is that the currently implemented solution is not the right one for a noticeable fraction of the existing community.
In addition to the users you had in mind when designing the current orchestrator,
there are also many active users of Ceph who want to have more direct access to the "complexity" of the system for two reasons[3]:
- To integrate it into existing automations.
- To learn how things work and interact.

The latter point is also a strong one, especially for me as an experimental physicist: I learned to love Ceph exactly because I played with the different components and their interactions,
tried to break them, saw how they react, and gained a deeper understanding that helps me make sense of any future issues.
That love never develops for me when I use a more "polished" product which has buttons doing things for me.
It's a major reason why I'd even choose Ceph(FS) over G*FS if the latter were available for free (of course, there are many more reasons).


So my conclusion is: the chosen path is changing the audience of Ceph, affecting both existing and new users.
The length of this thread has shown that the community has different opinions on this path forward, for many different reasons.

My personal feeling is that the solution embraces new users who want to set something up quickly,
and is mainly a problem for existing and future long-term production users with larger clusters who want to understand the full stack and integrate it into their environment.

Does it really free up developer resources in the long run?
I'm not sure about that: the community may shift towards users reporting issues like "I pushed the green upgrade button, and now it is stuck"
(similar mails have already arrived on the mailing list), with fewer reports from users who provide stack traces or dig into network communication issues between services
(will automated crash-dump reports be able to replace experienced bug reporters?). My personal feeling is that the latter type of user is the one
who stays with Ceph for years or decades, and even though they may seem to complain that Ceph is a complicated machine with many nuts and bolts,
it usually doesn't disturb them as much as it may appear.


So finally, what is my idea about a path forward?

In addition to continuing to deliver packages (will they stay?), I basically see two ways:

- Having an additional orchestrator module running on bare metal.
  Given the assumptions above and the mails on this list, most users who'd use it would install their packages differently anyway,
  and would only use the orchestrator e.g. to distribute cephx keys, set up the initial configuration, enable systemd units etc.,
  finally persisting a varying fraction of these things into their configuration management.
  So the orchestrator may be useful for them even without the actual complex capabilities of adding repositories and installing packages.
  At least, that's what I'd be happy about: an orchestrator doing all the Ceph-only things for me, ideally telling me what it does
  (a sketch of one such task follows after this list).

- Having really extensive manual installation instructions.
  Currently, I find these super useful for the basic first steps, but they basically stop once you have a mon, a mgr and an OSD.
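
To illustrate the kind of "Ceph-only" task I mean in both cases, here is a minimal sketch of bringing up one additional MDS on a node that already has the packages, ceph.conf and an admin keyring in place (e.g. via Puppet); the daemon id is made up, and the cap strings follow the manual-deployment documentation from memory, so please verify them for your release:

    #!/usr/bin/env python3
    # Hedged sketch: create the data directory and cephx key for one extra MDS
    # and start it via the packaged systemd unit.
    import pathlib
    import subprocess

    name = "fs-a"                                   # example daemon id
    mds_dir = pathlib.Path(f"/var/lib/ceph/mds/ceph-{name}")
    mds_dir.mkdir(parents=True, exist_ok=True)

    # Create a cephx key for the daemon and drop it into its data directory.
    keyring = subprocess.run(
        ["ceph", "auth", "get-or-create", f"mds.{name}",
         "mon", "profile mds", "mgr", "profile mds",
         "mds", "allow *", "osd", "allow *"],
        check=True, capture_output=True, text=True).stdout
    (mds_dir / "keyring").write_text(keyring)
    subprocess.run(["chown", "-R", "ceph:ceph", str(mds_dir)], check=True)

    # Enable and start the packaged systemd unit for this MDS.
    subprocess.run(["systemctl", "enable", "--now", f"ceph-mds@{name}"], check=True)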

In essence, these two points are the same: the modular orchestrator already has all these tasks coded inside, i.e. essentially,
the orchestrator is the most up-to-date, complete, tested and well-maintained piece of manual documentation we'll ever get, correct?
This is why my personal feeling is that a "bare-metal orchestrator" (i.e. an "SSH orchestrator" like ceph-deploy),
even without the features of adding repositories, installing packages, or upgrading at the push of a button, will be sufficient for those of us
having issues with the current solution.
It's essentially about turning the manual installation instructions into an orchestrator module (which is probably close to the current cephadm minus containers).

Is this scope a sufficiently low-hanging fruit to warrant the effort?
It may even be less work than writing and maintaining much more extensive manual installation documentation.

Cheers (and congratulations to all who made it to the end of this mail),
	Oliver


[0] As a basic example, g++ keeps getting better at warning about potentially unintended behaviour caused by common classes of bugs,
    so a newer g++ may find more issues during testing.
    The same may happen with different library versions, which may point out API usage bugs or reveal other issues earlier in testing.
[1] https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/message/PPLJIHT6WKYPDJ45HVJ3Z37375WIGKDW/
[2] Of course, you can (and should) use a staged rollout.
[3] Well, that's a presumption, but the fact that you mentioned user concerns about this in the survey seems to strengthen that point.


--
Oliver Freyermuth
Universität Bonn
Physikalisches Institut, Raum 1.047
Nußallee 12
53115 Bonn
--
Tel.: +49 228 73 2367
Fax:  +49 228 73 7869
--


