Re: How Ceph cleans stale object on primary OSD failure?

Frédéric Nass <frederic.nass@xxxxxxxxxxxxxxxx> · Wed, 23 Oct 2024 11:29:03 +0200 (CEST)

I think both B and C would keep this object (without serving it, since it was never acknowledged to the client) until another OSD joins the acting set, and would then get rid of it. From the peering documentation:

"If the cluster failed before an operation was persisted by **all** members of the acting set, and the subsequent peering did not remember that operation, and a node that did remember that operation later rejoined, its logs would record a different (divergent) history than the authoritative history that was reconstructed in the peering after the failure. Since the divergent events were not recorded in other logs from that acting set, they were not acknowledged to the client, and there is no harm in discarding them (so that all OSDs agree on the authoritative history). But, we will have to instruct any OSD that stores data from a divergent update to delete the affected (and now deemed to be apocryphal) objects." 

I'm not sure this exactly covers the case you mentioned, but it seems like it. There might be smarter internal logic handling that case which this part of the dev documentation does not describe, though.
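
To make the quoted paragraph concrete, here is a toy Python model of "divergent" log entries. It is only a sketch of the reasoning and not how the OSD code works: the `(epoch, sequence)` version tuples, the "newest head wins" selection rule, and the names `authoritative_log` and `divergent_entries` are all assumptions made up for illustration.

```python
from typing import Dict, List, Tuple

Version = Tuple[int, int]        # (epoch, sequence), a simplified log version
LogEntry = Tuple[Version, str]   # (version, object name)


def authoritative_log(peer_logs: Dict[str, List[LogEntry]]) -> List[LogEntry]:
    """Toy rule (assumption): the log with the newest head wins."""
    return max(peer_logs.values(), key=lambda log: log[-1][0] if log else (0, 0))


def divergent_entries(local: List[LogEntry], auth: List[LogEntry]) -> List[LogEntry]:
    """Entries an OSD holds that the authoritative history does not contain."""
    known = set(auth)
    return [entry for entry in local if entry not in known]


base = [((1, 1), "obj1"), ((1, 2), "obj2")]
new_write = ((1, 3), "obj3")

# Case 1 (the quoted 5a text): only A persisted the write, then A failed.
auth_1 = authoritative_log({"B": base, "C": base})
case_1 = divergent_entries(base + [new_write], auth_1)  # A rejoins later

# Case 2 (the scenario in this thread): B and C both persisted the write.
survivors = {"B": base + [new_write], "C": base + [new_write]}
auth_2 = authoritative_log(survivors)
case_2 = {osd: divergent_entries(log, auth_2) for osd, log in survivors.items()}

print(case_1)
print(case_2)
```

In this toy model the write in case 2 is part of the history the survivors agree on, so nothing is flagged as divergent. Whether the real OSD code treats an unacknowledged but fully replicated write any differently is exactly the open question.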

Oh, I realize maybe you meant that A, B and C all persisted the operation and agreed on that history, but A crashed before it had the chance to acknowledge the write to the client?
In that case, one would need to dive into the OSD code and see how it works exactly, but I doubt this could occur. 

Maybe someone with in-depth knowledge can explain that. 

Regards, 
Frédéric. 

----- On 23 Oct 24, at 5:54, Vigneshwar <svigneshj@xxxxxxxxx> wrote:

> Hi Frédéric,

> Section 5a states that the divergent events would be tracked and deleted.
> But in the scenario I've mentioned, both B and C will have the same history.

> So if the new primary is B, then during peering it peers with C, looks at the
> history, concludes there are no divergent events, and keeps the object, right?

> Regards,
> Vigneshwar

> On Wed, 23 Oct 2024 at 9:09 AM, Frédéric Nass <frederic.nass@xxxxxxxxxxxxxxxx>
> wrote:

>> Hi Vigneshwar,
>> You might want to check section '5a' of the peering process documentation [1].
>> Regards,
>> Frédéric.

>> [1] https://docs.ceph.com/en/reef/dev/peering/#description-of-the-peering-process

>> From: Vigneshwar S <svigneshj@xxxxxxxxx>
>> Sent: Tuesday, 22 October 2024 11:05
>> To: ceph-users@xxxxxxx
>> Subject: How Ceph cleans stale object on primary OSD failure?

>> Hi everyone, I'm studying Ceph replication and peering and have a question
>> about how Ceph cleans up stale objects on primary OSD failure.

>> Scenario:

>> Consider a placement group: PG1, whose acting set is [A, B, C] and A is the
>> primary.

>> t1: Client writes an object to A
>> t2: A initiates local write
>> t3: A sends replication request to B
>> t4: A sends replication request to C
>> t5: A completes local write
>> t6: A gets replication response from B
>> t7: A gets replication response from C
>> t8: A crashes before sending the acknowledgement to the client

>> If A never comes back, the object written to B and C would be stale,
>> as it was never acknowledged to the client. How does Ceph clean up this
>> stale object?

>> Also, it would be helpful if you could share any documentation or articles
>> explaining Ceph replication and peering at a low level.

>> Regards,
>> Vigneshwar.
>> _______________________________________________
>> ceph-users mailing list -- ceph-users@xxxxxxx
>> To unsubscribe send an email to ceph-users-leave@xxxxxxx