Re: Ceph Leadership Team meeting 2021-07-14

Hi,

On Wed, Jul 14, 2021 at 10:50 PM Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
> ...
>
> - high-level development priorities doc (why as opposed to what)
>   - https://docs.google.com/document/d/1kF8GEXUwB8y-SKZP6TM9mhfYluldEw_2D2qxyOy0p74/edit
>

Re: "Ceph developed a reputation early on for being complicated and
hard to use":

Adding my 2 cents -- I believe what I will say here is already
considered an implicit priority, but I didn't see it written down,
so...

The solutions listed there seem to focus on easing Ceph installation
and management.
But IMHO we should not forget that Ceph is also considered difficult
for reasons related to availability, troubleshooting complexity, and
error recovery.

Better orchestration clearly helps prevent the procedural errors that
cause problems later on, but on those rare occasions when Ceph is
degraded or down, it can still be quite difficult to understand why
and to bring things back online. (And I think this may be especially
true for the new admins who benefit most from the easy-to-use
tooling.)

Part of solving this is having clear error codes and docs; there has
already been a lot of effort to simplify troubleshooting for known
errors. :+1:

But I think we need to keep focusing on core stability, and perhaps
even increase the effort spent improving Ceph's availability after
unexpected errors. These are also "ease of use" issues, IMHO.

To give some general examples -- they might be vague or
cherry-picked, but I mention them to help explain my point:

1. PGs sometimes don't peer properly, or at all, after outages:
https://tracker.ceph.com/issues/46847 -- it's difficult to reproduce
but the impact can be significant.
2. Quite a few bluestore/fs corruption issues and unbounded db
workloads over recent months. I think most known issues have been
fixed by now, but can the general testing approach be improved here?
3. An FS can go down because the MDS tends to ceph_assert to avoid
corruption: rather than assert, could we return EIO for the affected
individual files or directories (impacting one user) instead of
taking down an entire FS (impacting all users)? E.g.
https://tracker.ceph.com/issues/44565
https://tracker.ceph.com/issues/49132
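To make the idea in point 3 concrete, here is a minimal, hypothetical C++ sketch of the isolate-and-EIO pattern: when metadata for one inode fails validation, mark only that inode as damaged and fail its I/O with EIO, instead of asserting and taking the whole daemon down. The names (DamageTable, lookup, meta_ok) are illustrative only, not Ceph's actual MDS internals:

```cpp
#include <cerrno>
#include <map>
#include <set>

// Hypothetical sketch, not Ceph code: track inodes whose metadata
// was found to be damaged, so only their users see errors.
struct DamageTable {
    std::set<long> damaged;  // inode numbers known to be bad
    void mark(long ino) { damaged.insert(ino); }
    bool is_damaged(long ino) const { return damaged.count(ino) != 0; }
};

// Returns 0 on success, -EIO if this inode's metadata is bad.
// meta_ok stands in for whatever validation the real MDS performs.
int lookup(DamageTable& dt, const std::map<long, bool>& meta_ok, long ino) {
    if (dt.is_damaged(ino))
        return -EIO;  // previously detected damage, fail just this file
    auto it = meta_ok.find(ino);
    if (it == meta_ok.end() || !it->second) {
        // The assert-style alternative would abort the whole MDS here,
        // taking the entire FS offline for all users.
        dt.mark(ino);  // isolate the damage instead
        return -EIO;   // only this file's user is affected
    }
    return 0;  // healthy inode, serve normally
}
```

The point of the sketch is the availability trade-off: healthy inodes keep being served after damage is detected, and the blast radius of one corrupt record shrinks from the whole FS to one file.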

I really appreciate these CLT minutes!!

Thank you,

Dan
_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx


