Hi Dan,

Thanks for this! I've finally found a bit of time to come back to the Quincy priorities doc and flesh it out a bit, and I've added an item/section around continuous quality improvement. I think this area is hard in general because it is the cumulative effect of many, many small (and medium) changes, most of which are non-trivial and need careful thought. It almost feels like what's needed is a process to identify these quality-related issues and then work through them on an ongoing/continuous basis. Any thoughts? We can put this on the CLT agenda for next Wednesday...

In the meantime, I'd appreciate more eyes on and feedback about the doc. The idea is to have something that both lay and expert users can look at to understand why we are investing in the areas that we are.

https://docs.google.com/document/d/1kF8GEXUwB8y-SKZP6TM9mhfYluldEw_2D2qxyOy0p74/edit#

Thanks!
sage

On Thu, Jul 15, 2021 at 6:50 AM Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
>
> Hi,
>
> On Wed, Jul 14, 2021 at 10:50 PM Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
> > ...
> >
> > - high-level development priorities doc (why as opposed to what)
> > - https://docs.google.com/document/d/1kF8GEXUwB8y-SKZP6TM9mhfYluldEw_2D2qxyOy0p74/edit
>
> Re: "Ceph developed a reputation early on for being complicated and
> hard to use":
>
> Adding 2 cents -- I believe that what I will say here is already
> considered an implicit priority, but I didn't see it written down
> so...
>
> The solutions listed there seem to focus on easing Ceph from the PoV
> of installation and mgmt.
> But IMHO we should not forget that Ceph is also considered difficult
> for reasons related to availability, troubleshooting complexity, and
> error recovery.
>
> Better orchestration clearly helps prevent procedural errors which
> cause problems later on, but on those remaining rare occasions when
> Ceph is degraded or down, it can still be quite difficult to
> understand why and bring things back online. (And I think this might
> be especially true for new admins who benefit from the easy-to-use
> tooling.)
>
> Part of solving this is having clear error codes and docs, and there
> has already been a lot of effort to simplify these for known errors.
> :+1:
>
> But I think we need to continue to focus on core stability and perhaps
> even add effort to improve Ceph's availability after unexpected
> errors. These are also "ease of use" issues, IMHO.
>
> Some general examples -- they might be vague or cherry-picked, but
> I want to mention them to help explain my point:
>
> 1. PGs sometimes don't peer properly, or at all, after outages:
> https://tracker.ceph.com/issues/46847 -- it's difficult to reproduce,
> but the impact can be significant.
> 2. Quite a few bluestore/fs corruption issues and unbounded db
> workloads over recent months. I think most known issues have been
> fixed by now, but can the general testing approach be improved here?
> 3. An FS can go down because the MDS tends to ceph_assert to avoid
> corruption: rather than assert, could we return EIO for individual
> files or directories (affecting one user) instead of taking down an
> entire FS (affecting all users)? E.g.
> https://tracker.ceph.com/issues/44565
> https://tracker.ceph.com/issues/49132
>
> I really appreciate these CLT minutes!!
>
> Thank you,
>
> Dan