Re: Ceph needs your help with defining availability!

Hi guys,

Thank you so much for filling out the Ceph Cluster Availability survey!

We received a total of 59 responses from various groups of people, which is enough to give us a much deeper understanding of what availability means to everyone.

As promised, here is the link to the results of the survey:
https://docs.google.com/forms/d/1J5Ab5KCy6fceXxHI8KDqY2Qx3FzR-V9ivKp_vunEWZ0/viewanalytics

Also, I've summarized some of the written responses below so that it's easier to make sense of the results.

I hope you will find these responses helpful, and please feel free to reach out if you have any questions!

Response summary of the question:

“””

In your own words, please describe what availability means to you in a Ceph cluster. (For example, is it the ability to serve read and write requests even if the cluster is in a degraded state?).

“””


In summary, the majority of respondents consider availability to be the ability to serve I/O with reasonable performance (some suggest within 10-20% of normal; others say the threshold should be user-configurable), plus the ability to provide other services. A couple of people define availability as all PGs being in the active+clean state, but as we will see in the next question, many people disagree with this. Interestingly, a handful of people suggest that cluster availability shouldn't be binary, but rather a scale or set of tiers; e.g., one response suggests that we should have the following (a rough code sketch of this tiered model follows the list):


  1. Fully available - all services can serve I/O at normal performance.

  2. Partially available

    1. some access method, although configured, is not available, e.g., CephFS works but RGW doesn't.

    2. only reads or writes are possible on some storage pools.

    3. some storage pools are completely unavailable while others are completely or partially available.

    4. performance is severely degraded.

    5. some services are stopped/crashed.

  3. Unavailable - when even the criteria for Partially available are not met.
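
To make that tiered proposal concrete, here is a minimal sketch in Python of how it could be modeled. This is not actual Ceph code; AvailabilityTier, ClusterSnapshot, and classify are hypothetical names invented purely for illustration.

from enum import Enum
from dataclasses import dataclass

class AvailabilityTier(Enum):
    FULLY_AVAILABLE = "fully available"
    PARTIALLY_AVAILABLE = "partially available"
    UNAVAILABLE = "unavailable"

@dataclass
class ClusterSnapshot:
    services_up: dict[str, bool]       # hypothetical input, e.g. {"rgw": False, "cephfs": True}
    pools_serving_io: dict[str, bool]  # per pool: can it serve reads and writes?
    severely_degraded: bool            # is performance severely degraded?

def classify(snap: ClusterSnapshot) -> AvailabilityTier:
    # Tier 1: every service and every pool serves I/O at normal performance.
    if (all(snap.services_up.values()) and all(snap.pools_serving_io.values())
            and not snap.severely_degraded):
        return AvailabilityTier.FULLY_AVAILABLE
    # Tier 2: something is wrong, but at least one service and one pool
    # can still serve I/O (covers cases 2.1 through 2.5 above).
    if any(snap.services_up.values()) and any(snap.pools_serving_io.values()):
        return AvailabilityTier.PARTIALLY_AVAILABLE
    # Tier 3: the bar for "partially available" is not even reached.
    return AvailabilityTier.UNAVAILABLE

For example, classify(ClusterSnapshot({"rgw": False, "cephfs": True}, {"data": True}, False)) returns PARTIALLY_AVAILABLE, matching case 2.1 above.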


Moreover, some suggest that we should track availability on a per-pool basis, to handle scenarios where pools have different CRUSH rules or where we can afford for a particular pool to be unavailable. Furthermore, some responses care more about the availability of one service than another; e.g., one response states that they wouldn't care about the availability of RADOS if RGW is unavailable.
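
For illustration, here is a rough sketch of what such a per-pool availability check could look like. It shells out to the real ceph CLI commands "ceph osd pool ls" and "ceph pg ls-by-pool", but the exact JSON layout of their output varies across releases, so treat that part as an assumption to verify rather than a reference.

import json
import subprocess

def ceph(*args):
    """Run a ceph CLI command and parse its JSON output."""
    out = subprocess.check_output(["ceph", *args, "--format=json"])
    return json.loads(out)

def pool_availability():
    """Map each pool name to True iff every one of its PGs is active."""
    result = {}
    for pool in ceph("osd", "pool", "ls"):
        pgs = ceph("pg", "ls-by-pool", pool)
        # Recent releases nest the PG list under "pg_stats"; older ones
        # return the list directly (an assumption worth double-checking).
        stats = pgs["pg_stats"] if isinstance(pgs, dict) else pgs
        result[pool] = all("active" in pg["state"] for pg in stats)
    return result

A report like {"rbd": True, "cephfs_data": False} then lets an operator (or a future availability tracker) decide per pool whether an outage actually matters.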


Response summary of the question:

“””

Do you agree with the following metric in evaluating a cluster's availability:


"All placement group (PG) state in a cluster must have 'active'  in them, if at least 1 PG does not have 'active' in them, then the cluster as a whole is deemed as unavailable". 

“””


35.8% of Users answered `No`

35.8% of Users answered `Yes`

28.3% of Users answered `Maybe`


The data clearly shows that we can't use this criterion alone for availability. Here are some of the reasons why the 64.1% who answered `No` or `Maybe` do not fully agree with the statement.


If the client does not interact with that particular PG, then it is not important; e.g., if 1 PG is inactive and the S3 endpoint is down but CephFS can still serve I/O, we cannot say that the cluster is unavailable. Some disagree because they believe that a PG belongs to a single pool; therefore, that particular pool will be unavailable, not the cluster. Furthermore, some point out that certain events can temporarily leave PGs inactive, such as provisioning a new OSD, creating a pool, or a PG split, and these events don't necessarily indicate unavailability.
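
To make that objection concrete, here is a minimal sketch contrasting the strict metric quoted above with a lenient variant reflecting this feedback. The state names ("peering", "activating", "creating") are real Ceph PG states, but the lenient rule itself is an invented illustration, not an agreed-upon definition.

# PG states a cluster passes through during normal events such as
# provisioning a new OSD, creating a pool, or a PG split.
TRANSIENT_STATES = {"peering", "activating", "creating"}

def cluster_available_strict(pg_states):
    """The metric from the survey question: every PG state string
    (e.g., "active+clean") must contain "active"."""
    return all("active" in state for state in pg_states)

def cluster_available_lenient(pg_states):
    """Variant: a PG that is merely in a transient state does not
    count against cluster availability."""
    def pg_ok(state):
        tokens = set(state.split("+"))
        return "active" in tokens or bool(tokens & TRANSIENT_STATES)
    return all(pg_ok(state) for state in pg_states)

# A cluster with one PG mid-peering is unavailable under the strict
# metric but still available under the lenient one.
states = ["active+clean", "peering"]
assert not cluster_available_strict(states)
assert cluster_available_lenient(states)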


Response summary of the question:

“””

From your own experience, what are some of the most common events that cause a Ceph cluster to be considered unavailable based on your definition of availability?

“””


Top four responses:


  1. Network-related issues, e.g., network failure/instability.

  2. OSD-related issues, e.g., failure, slow ops, flapping.

  3. Disk-related issues, e.g., dead disks.

  4. PG-related issues, e.g., many PGs becoming stale, unknown, or stuck in peering.


Response summary of the question:

“””

Are there any events during which you might consider a cluster to be unavailable, but that you feel are not worth tracking and are dismissible?

“””


Top three responses:


  1. No, all unavailable events are worth tracking.

  2. Network-related issues.

  3. Scheduled upgrades or maintenance.



On Tue, Aug 9, 2022 at 1:51 PM Kamoltat Sirivadhna <ksirivad@xxxxxxxxxx> wrote:
Hi John,

Yes, I'm planning to summarize the results after this week. I will definitely share it with the community.

Best,

On Tue, Aug 9, 2022 at 1:19 PM John Bent <johnbent@xxxxxxxxx> wrote:
Hello Kamoltat,

This sounds very interesting. Will you be sharing the results of the survey back with the community?

Thanks,

John

On Sat, Aug 6, 2022 at 4:49 AM Kamoltat Sirivadhna <ksirivad@xxxxxxxxxx> wrote:
Hi everyone,

One of the features we are looking into implementing for our upcoming Ceph release (Reef) is the ability to track cluster availability over time. However, the biggest problem we currently face is basing our measurement on a definition of availability that matches user expectations and business objectives. Therefore, we think it is worthwhile to ask for your opinion on what you think defines availability in a Ceph cluster.

Please help us by filling in a survey that won't take longer than 10 minutes to complete:

https://forms.gle/aFYvTCUM3s9daTJg8 

Feel free to reach out to me if you have any questions,

Thank you and have a great weekend!
--

Kamoltat Sirivadhna (HE/HIM)

Software Engineer - Ceph Storage

ksirivad@xxxxxxxxxx    T: (857)253-8927




--

Kamoltat Sirivadhna (HE/HIM)

Software Engineer - Ceph Storage

ksirivad@xxxxxxxxxx    T: (857)253-8927




--

Kamoltat Sirivadhna (HE/HIM)

Software Engineer - Ceph Storage

ksirivad@xxxxxxxxxx    T: (857)253-8927


_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx
