Re: Why you might want packages not containers for Ceph deployments

Hi,

On 17/11/21 at 15:46, Dan van der Ster wrote:
My 2 cents:
* the best solution to ensure ceph's future rock solid stability is to continually improve our upstream testing. We have excellent unit testing to avoid regressions on specific bugs, and pretty adequate upgrade testing, but I'd like to know if we're missing some high level major upgrade testing that would have caught these issues and other unknown bugs.

* LTS alone wouldn't solve the root problem. Bugs can creep into LTS in the exact same way that the recent pacific bugs have, if the testing coverage is incomplete. (I'm not convinced that all of the recent urgent bugs have come from "new features" per se -- one which comes to mind is the fix related to detecting network binding addrs, for example -- something that would reasonably have landed in and broken LTS clusters.)

* I personally wouldn't want to run an LTS release based on ... what would that be now, Luminous + security patches? IMO, the new releases really are much more performant, much more scalable. N, O, and P are really much much *much* better than previous releases. For example, I would not enable snapshots on any cephfs except Pacific -- but do we really want to backport the "stray dir splitting" improvement all the way back to mimic or L? To me that seems extremely unwise and a waste of the developers' limited time.

So I would prioritize a short, one-off effort to review the upstream testing, ensuring it is as complete and representative of our real user environments as possible. And *also* we can complement this with RC point releases, whereby we invite community members to participate in the testing and give feedback before a broken point release goes out.

Why RCs? Because our environments are so diverse, complete test coverage will always be a challenge. So IMHO we need to help each other build confidence in the latest stable releases. We should regularly share upgrade stories on this list so it is very clear which use-cases are working or not for each release. We can also do things like pull broken releases from repos, clearly document known issues in old releases, even add health warns to clusters with known issues, ...

And we should be asking and understanding why some of us are still on mimic (or nautilus -- in my own case I know why...). What would convince us to upgrade to O or P? That would be a good metric, IMHO -- let's say, 4 months from now, who is not running pacific? Why not?

I understand and agree with your points, so here is why we wouldn't advise our customers to upgrade to P (today we upgraded one of them from N to O, and we advised against continuing on to P):

- We had our first downtime in our office Ceph cluster in about 8 years.

    - The mons needed 10x+ their normal hard disk space to recover from a lost quorum (two mons rebooted almost simultaneously in a 3-mon cluster with 15 OSDs). After they ran out of space the first time, we had to grow the mon partition repeatedly without knowing how much would be enough. We haven't changed disk specs nor cluster size in all those 8 years. (A rough monitoring sketch follows this list.)

    - One MGR was caught eating 40 GB of RAM before we shot the process down. Before that, the kernel OOM killer was shooting down OSDs.

    - The cluster used massive amounts of bandwidth (400 MB/s) between the mons losing quorum and 2 of them running out of disk space.

    - We upgrade our office cluster prior to upgrading our customers'.

    - Regaining confidence after this kind of incident will take time.

    + Luckily no data was harmed :)
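
For what it's worth, here is a rough sketch of the kind of watchdog we are thinking of adding so we notice the mon store growing before the partition fills up again. It is not anything official, just plain Python checking the size of the mon data directory against a threshold; the path and the ~15 GiB limit (roughly the default mon_data_size_warn, as far as I know) are assumptions, so adjust both for your own deployment.

#!/usr/bin/env python3
# Rough sketch: warn when the mon store grows past a threshold.
# Assumptions: default mon data location and a ~15 GiB limit
# (roughly Ceph's default mon_data_size_warn); adjust for your cluster.
import sys
from pathlib import Path

MON_DATA = Path("/var/lib/ceph/mon")   # assumed default mon store path
WARN_BYTES = 15 * 1024 ** 3            # assumed threshold, ~15 GiB

def dir_size(path: Path) -> int:
    # Total size in bytes of all regular files under path.
    return sum(f.stat().st_size for f in path.rglob("*") if f.is_file())

def main() -> int:
    if not MON_DATA.exists():
        print(f"mon data path {MON_DATA} not found", file=sys.stderr)
        return 2
    used = dir_size(MON_DATA)
    print(f"mon store size: {used / 1024 ** 3:.1f} GiB")
    if used > WARN_BYTES:
        print("WARNING: mon store is growing large; check quorum and "
              "consider compaction before the partition fills up",
              file=sys.stderr)
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(main())

Something like that run from cron would at least have given us a heads-up before the first out-of-space event.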


I know those tiny clusters aren't Ceph's target, but if I hit such a weird issue with a tiny cluster... god help me with a medium cluster if such a problem happens ;)

I also know that we aren't being very helpful here; no bugs reported at all. :-( We analyzed the incident to understand what went wrong, but didn't have time for a full diagnosis... we only looked into fixing the issue that made the two mons reboot. Also, it's a production cluster, so trying to reproduce the issue isn't an option. We'll try to do better next time.

I must also add that we are very grateful for this excellent storage system, even with the issues described.

Thank you all developers!!!

Also, thank you for trying to understand the concerns of users; I think that will make Ceph better.

Cheers


Eneko Lacunza
CTO | Zuzendari teknikoa

Binovo IT Human Project
Tel: 943 569 206
Email: elacunza@xxxxxxxxx
Web: binovo.es
Astigarragako Bidea, 2 - 2 izda. Oficina 10-11, 20180 Oiartzun
YouTube: https://www.youtube.com/user/CANALBINOVO/
LinkedIn: https://www.linkedin.com/company/37269706/




