This is tricky, but I think you are correct: B and C will keep the new
object copy and not revert to the old object version (or to removing the
object if it did not exist before). To be 100% sure you would have to dig
into the code to verify this.
My understanding of what is guaranteed is:
1) On a successful ack to the client: obviously the last version of the
data written is what will be read back.
2) On either success or failure: the data is consistent; if you repeat
reads you will get the same data, even if the acting set changes.
3) RADOS objects are transactional at the OSD level: they are either
updated to the new version or reverted to the old version on failure at
the OSD level. Objects cannot be left half-written by a failed write (the
sketch below illustrates this at the librados level).
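To make 3) concrete, here is a minimal sketch using the Python rados
bindings (assuming a reasonably recent python-rados; the conffile path,
pool name and object name are placeholders). All operations batched in one
WriteOpCtx against a single object are applied atomically by the OSD, so a
reader sees either the old state or the new state, never a mix:

import rados

# Connect to the cluster (conffile path and pool name are illustrative).
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('testpool')

# All operations in this compound op hit one object and are applied
# atomically: either the whole op commits or none of it does.
with rados.WriteOpCtx() as op:
    op.write_full(b'new payload v2')              # replace the object data
    ioctx.set_omap(op, ('version',), (b'2',))     # and its omap metadata
    ioctx.operate_write_op(op, 'my_object')

# A concurrent or later reader never sees the new payload paired with the
# old omap value, or vice versa.
print(ioctx.read('my_object'))

ioctx.close()
cluster.shutdown()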
I think 2) is good enough. Ideally, if the ack to the client fails, you
may want the old object version to be restored (or the object removed if
it did not exist before); this is the revert you are asking B and C to
perform. That would be a more complex requirement than 2) and may not be
supported, or practical to support.
I think if clients want to guarantee that data is reverted on failure,
they need to add transactional code themselves, like database journaling,
etc. This is at least the case with block and filesystem clients, which
would typically have such transactional code anyway to handle physical
hardware failures and so on.
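As a rough illustration of what I mean by client-side transactional code
(this is not something RADOS provides; the object naming and journal
format are made up for the sketch): the client records its intent, plus
the old content, in its own journal object before touching the data
object, and removes the record only after the data write is acknowledged.
On restart, a leftover record tells it what to roll back:

import json
import rados

def journaled_write(ioctx, data_obj, new_data):
    # Very simplified write-ahead intent on top of librados:
    # 1. record the intent (and old content) in a journal object,
    # 2. write the data object,
    # 3. remove the journal record once the write is acknowledged.
    journal_obj = data_obj + '.journal'        # made-up naming convention
    try:
        old_data = ioctx.read(data_obj)
    except rados.ObjectNotFound:
        old_data = None
    intent = {'target': data_obj,
              'old': old_data.hex() if old_data is not None else None}
    ioctx.write_full(journal_obj, json.dumps(intent).encode())   # 1. intent
    ioctx.write_full(data_obj, new_data)                         # 2. data
    ioctx.remove_object(journal_obj)                             # 3. commit

def recover(ioctx, data_obj):
    # If a journal record is left over from a crash, restore the old state.
    journal_obj = data_obj + '.journal'
    try:
        intent = json.loads(ioctx.read(journal_obj))
    except rados.ObjectNotFound:
        return                                 # nothing was in flight
    if intent['old'] is None:
        try:
            ioctx.remove_object(intent['target'])   # did not exist before
        except rados.ObjectNotFound:
            pass                                    # write never happened
    else:
        ioctx.write_full(intent['target'], bytes.fromhex(intent['old']))
    ioctx.remove_object(journal_obj)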
Even if RADOS did provide this extra transactional level and had B and C
revert to the old object version, client data could still end up corrupt
if a client write spanned two RADOS objects and one write succeeded while
the other failed. Even though both RADOS objects are individually
consistent, the client data is corrupt because some of it is new and some
is old, so transactional code at the client will always be needed to
revert data on failures.
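To make the two-object case concrete, here is a hypothetical client write
striped across two RADOS objects (the object naming and the 4 MiB stripe
size are only for illustration):

import rados

STRIPE = 4 * 1024 * 1024            # illustrative 4 MiB stripe size

def striped_write(ioctx, prefix, payload):
    # One logical client write split across two RADOS objects.
    # Each write_full() is atomic on its own object, but if the second
    # call fails (OSD down, timeout, ...), the first object already holds
    # new data while the second still holds old data: the logical write
    # is torn even though every RADOS object is internally consistent.
    first, second = payload[:STRIPE], payload[STRIPE:]
    ioctx.write_full(prefix + '.0000', first)    # may succeed...
    ioctx.write_full(prefix + '.0001', second)   # ...while this one fails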
/maged
On 23/10/2024 06:54, Vigneshwar S wrote:
Hi Frédéric,
Section 5a states that divergent events would be tracked and deleted. But
in the scenario I've mentioned, both B and C will have the same history.
So if the new primary is B, then when peering, it peers with C, looks at
the history, concludes there are no divergent events, and keeps the
object, right?
Regards,
Vigneshwar
On Wed, 23 Oct 2024 at 9:09 AM, Frédéric Nass <
frederic.nass@xxxxxxxxxxxxxxxx> wrote:
Hi Vigneshwar,
You might want to check the '5a' section of the peering process
documentation [1].
Regards,
Frédéric.
[1]
https://docs.ceph.com/en/reef/dev/peering/#description-of-the-peering-process
------------------------------
*From:* Vigneshwar S <svigneshj@xxxxxxxxx>
*Sent:* Tuesday, 22 October 2024 11:05
*To:* ceph-users@xxxxxxx
*Subject:* How Ceph cleans stale object on primary OSD failure?
Hi everyone, I'm studying Ceph replication and peering and have a doubt
about how Ceph handles cleaning of stale objects on primary OSD failure.
Scenario:
Consider a placement group PG1 whose acting set is [A, B, C], with A as
the primary.
t1: Client writes an object to A
t2: A initiates local write
t3: A sends replication request to B
t4: A sends replication request to C
t5: A completes local write
t6: A gets replication response from B
t7: A gets replication response from C
t8: A crashes and has not sent an acknowledgement to the client
If A never comes back, the object written on B and C would be stale, as
it was not acknowledged to the client. How does Ceph clean up this stale
object?
Also, it would be helpful if you could share any documentation or
articles explaining Ceph replication and peering at a low level.
Regards,
Vigneshwar.
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx