Re: Ceph needs your help with defining availability!

Hi guys,

Thank you so much for filling out the Ceph Cluster Availability survey!

We received a total of 59 responses from various groups of people, which is
enough to give us a much deeper understanding of what availability means to
everyone.

As promised, here is the link to the results of the survey:
https://docs.google.com/forms/d/1J5Ab5KCy6fceXxHI8KDqY2Qx3FzR-V9ivKp_vunEWZ0/viewanalytics

Also, I've summarized some of the written responses to make the results
easier to digest.

I hope you will find these responses helpful, and please feel free to reach
out if you have any questions!

Response summary of the question:

“””

In your own words, please describe what availability means to you in a Ceph
cluster. (For example, is it the ability to serve read and write requests
even if the cluster is in a degraded state?).

“””

In summary, the majority of people consider availability to be the ability to
serve I/O with reasonable performance (some suggest a 10-20% threshold, others
say it should be user configurable), plus the ability to provide other
services. A couple of people define availability as all PGs being in the
active+clean state, but as we will see in the next question, many people
disagree with this. Interestingly, a handful of people suggest that cluster
availability shouldn't be binary, but rather a scale or set of tiers; for
example, one response suggests the following (a rough sketch of how such tiers
might be expressed in code follows the list):


   1. Fully available - all services can serve I/O at normal performance.

   2. Partially available:
      1. Some access method, although configured, is not available, e.g.,
         CephFS works and RGW doesn't.
      2. Only reads or writes are possible on some storage pools.
      3. Some storage pools are completely unavailable while others are
         completely or partially available.
      4. Performance is severely degraded.
      5. Some services are stopped/crashed.

   3. Unavailable - when "Partially available" is not reached.

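To make the tiered idea concrete, here is a minimal sketch of how such tiers
might be represented. This is not existing Ceph code; the names
(AvailabilityTier, classify_tier) and the 20% default performance threshold
are illustrative assumptions only:

# Hypothetical sketch of the tiered availability idea from one survey response.
# The inputs would have to come from real cluster telemetry.
from enum import Enum

class AvailabilityTier(Enum):
    FULLY_AVAILABLE = "fully available"
    PARTIALLY_AVAILABLE = "partially available"
    UNAVAILABLE = "unavailable"

def classify_tier(services_up, pools_serving_io, perf_degradation_pct,
                  degradation_threshold_pct=20.0):
    # services_up: e.g. {"rgw": True, "cephfs": False}
    # pools_serving_io: e.g. {"rbd": True, "cephfs_data": True}
    everything_up = all(services_up.values()) and all(pools_serving_io.values())
    if everything_up and perf_degradation_pct <= degradation_threshold_pct:
        return AvailabilityTier.FULLY_AVAILABLE
    # Partially available: at least one service and one pool still serve I/O,
    # even if others are down or performance is severely degraded.
    if any(services_up.values()) and any(pools_serving_io.values()):
        return AvailabilityTier.PARTIALLY_AVAILABLE
    return AvailabilityTier.UNAVAILABLE

# Example: RGW down but CephFS still serving I/O => partially available.
print(classify_tier({"rgw": False, "cephfs": True}, {"cephfs_data": True}, 5.0))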

Moreover, some suggest that we should track availability on a per-pool basis,
to handle scenarios where pools have different CRUSH rules or where we can
afford for a particular pool to be unavailable. Furthermore, some responses
care more about the availability of one service than another; e.g., one
response states that they wouldn't care about the availability of RADOS if
RGW is unavailable.

Response summary of the question:

“””

Do you agree with the following metric in evaluating a cluster's
availability:

"All placement group (PG) state in a cluster must have 'active'  in them,
if at least 1 PG does not have 'active' in them, then the cluster as a
whole is deemed as unavailable".

“””

35.8% of users answered `No`

35.8% of users answered `Yes`

28.3% of users answered `Maybe`

The data clearly shows that this cannot be the sole criterion for
availability. Here are some of the reasons why the 64.1% who answered `No` or
`Maybe` do not fully agree with the statement.

If a client never interacts with the affected PG, then it is not important to
them; e.g., if 1 PG is inactive and the S3 endpoint is down but CephFS can
still serve I/O, we cannot say that the cluster is unavailable. Some disagree
because a PG belongs to a single pool, so only that particular pool would be
unavailable, not the cluster. Furthermore, some point out that there are
events that can temporarily leave PGs inactive, such as provisioning a new
OSD, creating a pool, or a PG split, and these events don't necessarily
indicate unavailability.
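
For illustration, here is a rough sketch of how the strict binary metric, and
a per-pool refinement of it, could be checked from PG state. It shells out to
`ceph pg ls -f json`; the exact JSON layout differs between Ceph releases
(older versions return a bare list, newer ones wrap it in a dict under
"pg_stats"), so treat this as an assumption rather than a reference
implementation:

# Rough sketch only: evaluates "every PG must contain 'active'" and groups
# any non-active PGs by pool (the pool id is the prefix of the pgid).
import json
import subprocess
from collections import defaultdict

def pg_stats():
    out = subprocess.check_output(["ceph", "pg", "ls", "-f", "json"])
    data = json.loads(out)
    # Newer releases wrap the list in a dict; older ones return it directly.
    return data.get("pg_stats", data) if isinstance(data, dict) else data

def availability_report():
    inactive_by_pool = defaultdict(list)
    for pg in pg_stats():
        if "active" not in pg["state"]:
            pool_id = pg["pgid"].split(".")[0]  # pgid looks like "<pool>.<pg>"
            inactive_by_pool[pool_id].append(pg["pgid"])
    cluster_available = not inactive_by_pool  # the strict, binary metric
    return cluster_available, dict(inactive_by_pool)

if __name__ == "__main__":
    ok, per_pool = availability_report()
    print("cluster available under the strict metric:", ok)
    for pool, pgs in per_pool.items():
        print(f"pool {pool}: {len(pgs)} PG(s) without 'active': {pgs}")

As several responses note, a point-in-time check like this would also flag
transient states caused by pool creation, PG splits, or adding OSDs, so any
real tracker would need to distinguish those from genuine outages.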

Response summary of the question:

“””

From your own experience, what are some of the most common events that
cause a Ceph cluster to be considered unavailable based on your definition
of availability.

“””

Top four responses:


   1. Network-related issues, e.g., network failure/instability.

   2. OSD-related issues, e.g., failure, slow ops, flapping.

   3. Disk-related issues, e.g., dead disks.

   4. PG-related issues, e.g., many PGs becoming stale, unknown, or stuck in
      peering.


Response summary of the question:

“””

Are there any events that you might consider a cluster to be unavailable
but you feel like it is not worth tracking and is dismissible?

“””

Top three responses:


   1. No, all unavailability events are worth tracking.

   2. Network-related issues.

   3. Scheduled upgrades or maintenance (a sketch of excluding such windows
      from an availability-over-time calculation follows this list).


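Since the wider goal is to track availability over time, here is a minimal
sketch of how dismissible windows such as scheduled maintenance could be
excluded from an availability figure. The function names are hypothetical and
it assumes maintenance windows do not overlap one another; nothing like this
exists in Ceph today:

# Minimal sketch with assumed inputs: compute an availability percentage over
# a period while excluding scheduled maintenance windows.
from datetime import datetime, timedelta

def downtime_excluding_maintenance(outages, maintenance_windows):
    # outages / maintenance_windows: lists of (start, end) datetime pairs;
    # maintenance windows are assumed not to overlap one another.
    counted = timedelta(0)
    for o_start, o_end in outages:
        overlap = timedelta(0)
        for m_start, m_end in maintenance_windows:
            latest_start = max(o_start, m_start)
            earliest_end = min(o_end, m_end)
            if earliest_end > latest_start:
                overlap += earliest_end - latest_start
        counted += (o_end - o_start) - overlap
    return counted

def availability_pct(period_start, period_end, outages, maintenance_windows):
    period = period_end - period_start
    down = downtime_excluding_maintenance(outages, maintenance_windows)
    return 100.0 * (1 - down / period)

# Example: a 2-hour outage of which 1 hour fell inside a maintenance window,
# so only 1 hour counts as downtime for the month.
start, end = datetime(2022, 8, 1), datetime(2022, 9, 1)
outages = [(datetime(2022, 8, 10, 2), datetime(2022, 8, 10, 4))]
maint = [(datetime(2022, 8, 10, 2), datetime(2022, 8, 10, 3))]
print(round(availability_pct(start, end, outages, maint), 3))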

On Tue, Aug 9, 2022 at 1:51 PM Kamoltat Sirivadhna <ksirivad@xxxxxxxxxx>
wrote:

> Hi John,
>
> Yes, I'm planning to summarize the results after this week. I will
> definitely share it with the community.
>
> Best,
>
> On Tue, Aug 9, 2022 at 1:19 PM John Bent <johnbent@xxxxxxxxx> wrote:
>
>> Hello Kamoltat,
>>
>> This sounds very interesting. Will you be sharing the results of the
>> survey back with the community?
>>
>> Thanks,
>>
>> John
>>
>> On Sat, Aug 6, 2022 at 4:49 AM Kamoltat Sirivadhna <ksirivad@xxxxxxxxxx>
>> wrote:
>>
>>> Hi everyone,
>>>
>>> One of the features we are looking into implementing for our upcoming
>>> Ceph release (Reef) is the ability to track cluster availability over time.
>>> However, the biggest *problem* that we are currently facing is basing
>>> our measurement on the *definition of availability* that matches user
>>> expectations or business objectives. Therefore, we think it is worthwhile
>>> to ask for your opinion on what you think defines availability in a Ceph
>>> cluster.
>>>
>>> *Please help us* by filling in a *survey* that won't take longer than *10
>>> minutes* to complete:
>>>
>>> https://forms.gle/aFYvTCUM3s9daTJg8
>>>
>>> Feel free to reach out to me if you have any questions,
>>>
>>> Thank you and have a great weekend!
>>> --
>>>
>>> Kamoltat Sirivadhna (HE/HIM)
>>>
>>> SoftWare Engineer - Ceph Storage
>>>
>>> ksirivad@xxxxxxxxxx    T: (857) 253-8927
>>>
>>> _______________________________________________
>>> Dev mailing list -- dev@xxxxxxx
>>> To unsubscribe send an email to dev-leave@xxxxxxx
>>>
>>
>
> --
>
> Kamoltat Sirivadhna (HE/HIM)
>
> SoftWare Engineer - Ceph Storage
>
> ksirivad@xxxxxxxxxx    T: (857) 253-8927
>
>

-- 

Kamoltat Sirivadhna (HE/HIM)

SoftWare Engineer - Ceph Storage

ksirivad@xxxxxxxxxx    T: (857) 253-8927
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



