Re: Why you might want packages not containers for Ceph deployments

Thanks, Sage.  This is a terrific distillation of the challenges and benefits.

FWIW here are a few of my own perspectives, as someone experienced with Ceph but with limited container experience.  To be very clear, these are *perceptions* not *assertions*; my goal is discussion not argument.  For context, I have not used a release newer than Nautilus in production, in large part due to containers and cephadm.


>> Containers are more complicated than packages, making debugging harder.
> 
> I think that part of this comes down to a learning curve and some
> semi-arbitrary changes to get used to (e.g., systemd unit name has
> changed; logs now in /var/log/ceph/$fsid instead of /var/log/ceph).

Indeed, if there are logs at all. It seems that (by default?) one has to (know to) use journalctl to extract daemon or cluster logs, which is rather awkward compared to having straight files.  And those go away when the daemon restarts or is redeployed, losing data continuity?  Is logrotate used as usual, such that it can be adjusted?
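For what it's worth, the recipe I've pieced together so far for getting logs back into files (and for querying journald per daemon) looks roughly like this; the option names are as I remember them from the cephadm logging docs, so treat it as a sketch to verify rather than gospel:

    # query journald for one daemon; the unit name embeds the fsid
    journalctl -u ceph-$(ceph fsid)@mon.foo.service --since "1 hour ago"

    # or let cephadm resolve the unit name
    cephadm logs --name mon.foo

    # switch back to traditional file logging under /var/log/ceph/$fsid
    ceph config set global log_to_file true
    ceph config set global mon_cluster_log_to_file true
    ceph config set global log_to_stderr false
    ceph config set global mon_cluster_log_to_stderr false
    # my understanding is cephadm also drops /etc/logrotate.d/ceph-$fsid for these
    # files, so rotation stays adjustable -- but verify on your release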

If running multiple clusters on a set of hardware is deprecated, why include the fsid in the pathname?  This complicates scripting and monitoring / metrics collection.  Or have we retconned multiple clusters?
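For scripting and metrics collection, the best I've come up with is to resolve the fsid once and parametrize everything on it, along these lines:

    # resolve the fsid rather than hard-coding it in tooling
    fsid=$(ceph fsid)               # or parse it out of `cephadm ls` JSON
    log_dir=/var/log/ceph/${fsid}
    run_dir=/var/run/ceph/${fsid}
    ls "${log_dir}" "${run_dir}"

It works, but every collector and dashboard template now needs that extra indirection.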

The admin sockets are under a similar path in /var/run.  I have yet to discover an incantation of, e.g., `ceph daemon mon.foo` that works; indeed, specifying the whole path to the asok yields an error about the path being too long, so I’ve had to make a symlink to it.  This isn’t great usability, unless of course I’m missing something.
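The least-bad workaround I've found so far is to step inside the daemon's container, where the default socket path is what `ceph daemon` expects again; roughly (and I'd love to hear a cleaner incantation):

    # enter the mon container, then the usual admin-socket syntax works
    cephadm enter --name mon.foo
    ceph daemon mon.foo mon_status      # run inside the container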

>> Security (50 containers -> 50 versions of openssl to patch)
> 
> This feels like the most tangible critique.  It's a tradeoff.  We have
> had so many bugs over the years due to varying versions of our
> dependencies that containers feel like a huge win: we can finally test
> and distribute something that we know won't break due to some random
> library on some random distro.  But it means the Ceph team is on the
> hook for rebuilding our containers when the libraries inside the
> container need to be patched.

This seems REALLY congruent with the tradeoffs that accompanied shared/dynamic linking years ago.  Shared linking saves on binary size and facilitates sharing of address space among processes; dynamic shared linking lets one update dependencies (notably openssl, which has had plenty of exploits over time, but others too).  But that also means that changes to those libraries can break applications.  So we’ve long seen commercial / pre-built binaries statically linked to avoid regression and breakage.  Kind of a rock and a hard place situation.  Some assert that Ceph daemon systems should be mostly or entirely inaccessible from the Internet, and they usually don’t have a large set of users — or any customers — logging into them.  Thus it can be argued that they are less exposed to attacks, which would somewhat favor containerization.

One might say that containerization and orchestration make updates for security fixes trivial, but remember that in most cases such an upgrade is not against the immediately prior Ceph dot release, which means exposure to regressions and other unanticipated changes in behavior.  That is one reason why enterprises especially may stick with a specific dot release that works until compelled to move.  Updating upstream containers for security fixes lands us right back in the dependency hell situation, too.

> 
> On the flip side, cephadm's use of containers offer some huge wins:
> 
> - Package installation hell is gone.

As a user I never experienced much of this, but then I was mostly installing packages outside of ceph-deploy et al.  With at least three different container technologies in play, though, are we substituting one complexity for another?

> - Upgrades/downgrades can be carefully orchestrated. With packages,
> the version change is by host, with a limbo period (and occasional
> SIGBUS) before daemons were restarted.  Now we can run new or patched
> code on individual daemons and avoid an accidental upgrade when a
> daemon restarts.

Fair enough - that limbo period was never a problem for me, but regarding careful orchestration, we see people on this list experiencing orchestration failures all the time.  Is the list a nonrepresentative sample of people’s experience?  The opacity of said orchestration also complicates troubleshooting.
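When it does go sideways, the least opaque handles I've found are the cephadm module's own logs; commands as I recall them from the troubleshooting docs:

    # what cephadm thinks it has deployed, and where
    ceph orch ps
    ceph orch ls

    # read or follow the cephadm log channel
    ceph log last cephadm
    ceph -W cephadm

    # turn up cephadm verbosity while chasing a failure
    ceph config set mgr mgr/cephadm/log_to_cluster_level debug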

> - Ceph installations are carefully sandboxed.  Removing/scrubbing ceph
> from a host is trivial as only a handful of directories or
> configuration files are touched.

Plus of course any ancillary tools.  This seems like it would be advantageous in labs.  In production it’s not uncommon to reimage the entire box anyway.
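For reference, my understanding is that the "trivial removal" boils down to something like the line below, which is admittedly handy for lab churn even if production boxes get reimaged anyway; double-check the flags before running it anywhere you care about, since it is destructive:

    # wipe this cluster's daemons and data from the host -- destructive!
    cephadm rm-cluster --fsid <fsid> --force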

>  And we can safely run multiple
> clusters on the same machine without worry about bad interactions

Wasn’t it observed a few years ago that almost nobody actually did that, hence the deprecation of custom cluster names?

> - Cephadm deploys a bunch of non-ceph software as well to provide a
> complete storage system, including haproxy and keepalived for HA
> ingress for RGW and NFS, ganesha for NFS service, grafana, prometheus,
> node-exporter, and (soon) samba for SMB.  All neatly containerized to
> avoid bumping into other software on the host; testing and supporting
> the huge matrix of packages versions available via various distros
> would be a huge time sink.

One size fits all?  None? Many? Some?  Does that get in the way of sites that, e.g., choose nginx for LB/HA, or run their own Prometheus / Grafana infra for various reasons?  Is this more of the

> We've been beat up for years about how complicated and hard Ceph is.

True.  I was told in an interview once that one needs a PhD in Ceph.  Over the years operators have had to rework tooling with every release, so the substantial retooling that comes with containers and cephadm / ceph orch can be daunting.  Midstream changes, and changes made for no apparent reason, contribute to the perception.  JSON output is supposed to be invariant, or at least backward compatible, yet we saw the mon clock skew data move within the schema for no apparent reason, and there have been other breaking changes.  cf. the ceph_exporter source for more examples.
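To make the tooling churn concrete, here's the sort of defensive parsing one ends up writing; the specific old/new field paths below are from memory and purely illustrative, but the probe-several-locations-and-fail-soft pattern is the real cost:

    # tolerate the clock-skew data living in different places across releases
    ceph status --format json | jq -r '
      .health.checks.MON_CLOCK_SKEW.summary.message    # newer schema (assumed path)
      // .health.timechecks                            # older schema (assumed path)
      // "no clock skew reported"'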

> Rook and cephadm represent two of the most successful efforts to
> address usability (and not just because they enable deployment
> management via the dashboard!),

The goals here are totally worthy, to make things more turnkey.  I get that, I really do.  There are some wrinkles though:

* Are they successful, though?  I’m not saying they aren’t, I’m asking.  The frequency of cephadm / ceph orch SNAFUs posted to this list is daunting.  It seemed at one point that Rook would become the party line, but now it’s heterodox?

* Removing other complexity by introducing new complexity (containers).  There seems to have been an assumption here that operators already grok containers, in any of the three-plus flavors in play.  It’s easy to dismiss this as a learning curve, but it’s a rather significant one, and assuming that the operator will climb it in their Copious Free Time isn’t IMHO reasonable.

* Dashboard operation by pushing buttons can make it dead simple to deploy a single dead-simple configuration, but revision-controlled management of dozens of clusters is a different story.  Centralized config is one example (assuming the subtree limit bug has been fixed).  Absolutely, managing ceph.conf across system types and multiple clusters is a pain — brittle ERB or J2 templates, inscrutable Ansible errors.  But how does one link CLI-based centralized config with revision control and peer review of changes?  (A rough sketch follows after this list.)  One thing about turnkey solutions is that they are generally unreasonably simplistic or rigid in ways that are a bad fit for manageable enterprise deployment, and if we’re going to do everything for the user *and* make it difficult for them to dig deep or customize, then the bar for success is *very* high.

* Some might add ceph-ansible to that list.
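On the revision-control point above: the closest bridge I can see between the centralized config database and a git / peer-review workflow is something like the following.  The workflow itself is my own improvisation, not a cephadm feature, though `ceph config dump` and `ceph config assimilate-conf` are real commands:

    # snapshot the live config so it can be diffed against what's in git
    ceph config dump > live-config.$(date +%F).txt

    # apply a peer-reviewed, ini-style options file from the repo
    ceph config assimilate-conf -i reviewed-options.conf

That still leaves drift detection and rollback as an exercise for the operator, which is rather my point.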





_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
