Hi,
On 17/11/21 at 15:46, Dan van der Ster wrote:
My 2 cents:
* the best solution to ensure ceph's future rock solid stability is to
continually improve our upstream testing. We have excellent unit
testing to avoid regressions on specific bugs, and pretty adequate
upgrade testing, but I'd like to know if we're missing some high level
major upgrade testing that would have caught these issues and
other unknown bugs.
* LTS alone wouldn't solve the root problem. Bugs can creep into LTS
in the exact same way that the recent pacific bugs have, if the
testing coverage is incomplete.
(I'm not convinced that all of the recent urgent bugs have come from
"new features" per se -- one which comes to mind is the fix related to
detecting network binding addrs, for example -- something that would
reasonably have landed in and broken LTS clusters.)
* I personally wouldn't want to run an LTS release based on ... what
would that be now.. Luminous + security patches??. IMO, the new
releases really are much more performant, much more scalable. N, O,
and P are really much much *much* better than previous releases. For
example, I would not enable snapshots on any cephfs except Pacific --
but do we really want to backport the "stray dir splitting"
improvement all the way back to mimic or L? -- to me that seems
extremely unwise and a waste of the developers' limited time.
So I would prioritize a short one off effort to review the upstream
testing, ensuring it is as complete and representative of our real
user environments as possible.
And *also* we can complement this with RC point releases, whereby we
invite the community members to participate in the testing and give
feedback before a point release would have broken.
Why RCs? Because our environments are so diverse, complete test
coverage will always be a challenge. So IMHO we need to help each
other build confidence in the latest stable releases. We should
regularly share upgrade stories on this list so it is very clear which
use-cases are working or not for each release.
We can also do things like pull broken releases from repos, clearly
document known issues in old releases, even add health warns to
clusters with known issues, ...
And we should be asking and understanding why some of us are still on
mimic (or nautilus, which I know personally why...)? What would
convince us to upgrade to O or P ?
That would be a good metric, IMHO -- let's say, 4 months from now, who
is not running pacific?? Why not??
I understand and agree with your points. So why wouldn't we advise our
customers to upgrade to P? (Today we upgraded one of them from N to O,
and we advised against continuing to P.) Here is why:
- We had our first downtime in our office Ceph cluster in about 8 years.
- Mons needed more than 10x their normal disk space to recover from a
lost quorum (an almost simultaneous reboot of 2 mons in a 3-mon cluster
with 15 OSDs). We had to add space to the mon partition repeatedly,
without knowing how much would be enough, after they ran out of space
the first time. We haven't changed disk specs or cluster size in all
those 8 years.
- One MGR was caught eating 40GB of RAM before we shot it down in
the process. The kernel OOM killer was shooting down OSDs before that.
- The cluster used massive amounts of bandwidth (400MB/s) between the
mons losing quorum and 2 mons running out of disk space.
- We upgrade our office cluster before upgrading our customers' clusters.
- Regaining confidence after this kind of incident will take time.
+ Luckily no data was harmed :)
I know those tiny clusters aren't Ceph's target, but if I have such a
weird issue with a tiny cluster... god help me with a medium cluster if
such a problem happens ;)
I also know that we aren't being very helpful here; no bugs reported at
all. :-( We analyzed the incident to understand what went wrong, but
didn't have time for a full diagnosis... we focused on fixing the issue
that made the two mons reboot. Also, it's a production cluster, so
trying to reproduce the issue isn't an option. We'll try to do better
next time.
I must also add that we are very grateful for this excellent storage,
even with the issues described.
Thank you all developers!!!
Also thank you for trying to understand the concerns of users, I think
that will make Ceph better.
Cheers
Eneko Lacunza
CTO | Zuzendari teknikoa
Binovo IT Human Project
943 569 206
elacunza@xxxxxxxxx
binovo.es
Astigarragako Bidea, 2 - 2 izda. Oficina 10-11, 20180 Oiartzun
youtube <https://www.youtube.com/user/CANALBINOVO/>
linkedin <https://www.linkedin.com/company/37269706/>