Re: How does Ceph clean stale objects on primary OSD failure?

Thanks a lot Greg and Maged!

On Wed, Oct 23, 2024 at 8:02 PM Maged Mokhtar <mmokhtar@xxxxxxxxxxx> wrote:

>
> On 23/10/2024 16:40, Gregory Farnum wrote:
>
> On Wed, Oct 23, 2024 at 5:44 AM Maged Mokhtar <mmokhtar@xxxxxxxxxxx>
> wrote:
>
>>
>> This is tricky, but I think you are correct: B and C will keep the new
>> object copy and not revert to the old object version (or to removing
>> the object if it had not existed before).
>> To be 100% sure you would have to dig into the code to verify this.
>>
>> My understanding of what is guaranteed:
>> 1) On a success ack to the client: obviously, the last version of the
>> data written is what will be read back.
>> 2) On either success or failure: the data is consistent; if you repeat
>> reads you will get the same data, even if the acting set changes.
>> 3) rados objects are transactional at the OSD level: they are either
>> updated to the new version or reverted to the old version on failure.
>> Objects cannot be left half-written by a failed write.
>>
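>> As a rough illustration of 3), the librados API exposes compound write
>> operations that are applied atomically per object. A minimal sketch
>> using the Python rados binding (pool and object names are made up):
>> either both omap keys below land, or neither does.
>>
>>     import rados
>>
>>     cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
>>     cluster.connect()
>>     ioctx = cluster.open_ioctx("mypool")  # hypothetical pool name
>>
>>     # Both key/value pairs are committed as one transaction on the
>>     # OSD; a failure can never leave only one of them applied.
>>     with rados.WriteOpCtx() as op:
>>         ioctx.set_omap(op, ("k1", "k2"), (b"v1", b"v2"))
>>         ioctx.operate_write_op(op, "myobject")
>>
>>     ioctx.close()
>>     cluster.shutdown()
>>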
>> I think 2) is good enough. Ideally, in case of a client ack failure,
>> you may want the old object version to be restored, or the object
>> removed if it had not existed; this is the revert you are asking B and
>> C to perform. That would be a more complex requirement than 2) and may
>> not be supported, or practical to support.
>> I think if clients want to guarantee data is reverted on failure, they
>> need to add transactional code themselves, like database journaling.
>> This applies at least to block and filesystem clients, which would
>> typically have such transactional code anyway to handle physical
>> hardware failures and the like.
>>
>> Even if rados did provide this extra transactional level and had B and
>> C revert to the old object version, client data could still be
>> corrupted if a client write spanned 2 rados objects and one succeeded
>> while the other failed. Even though both rados objects are individually
>> consistent, the client data is corrupt, as some of it is new and some
>> old; so transactional code at the client will always be needed to
>> revert data on failures.
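>>
>> To make that concrete, here is a minimal sketch (not actual Ceph client
>> code) of the kind of client-side journaling meant above, using the
>> Python rados binding; pool and object names are made up. An intent
>> record is written before a two-object write, so a crashed writer leaves
>> evidence that the pair may be torn:
>>
>>     import json
>>     import rados
>>
>>     cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
>>     cluster.connect()
>>     ioctx = cluster.open_ioctx("mypool")  # hypothetical pool name
>>
>>     def journaled_write(txid, parts):
>>         """parts: {object_name: bytes}. All-or-nothing from the
>>         application's point of view, not from rados's."""
>>         intent = json.dumps(sorted(parts)).encode()
>>         ioctx.write_full("journal." + txid, intent)  # 1. record intent
>>         for oid, data in parts.items():              # 2. do the writes
>>             ioctx.write_full(oid, data)
>>         ioctx.remove_object("journal." + txid)       # 3. mark complete
>>
>>     def recover():
>>         # On restart, a surviving journal entry marks a write that may
>>         # mix old and new data; the app must redo or roll it back.
>>         for obj in ioctx.list_objects():
>>             if obj.key.startswith("journal."):
>>                 print("possibly torn write:", ioctx.read(obj.key))
>>
>>     journaled_write("tx1", {"obj.a": b"new A", "obj.b": b"new B"})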
>
>
> All of this is transparent to RADOS users. Written objects will generally
> remain written, and librados will retry operations (and get the same
> response on retry as they should have gotten on first submit) without
> returning errors to the caller.
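>
> From the caller's side that looks roughly like the sketch below (Python
> rados binding, illustrative names): a synchronous write simply blocks
> while librados resends the op to the new primary after a failover, and
> the application observes latency rather than an error.
>
>     import time
>     import rados
>
>     cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
>     cluster.connect()
>     ioctx = cluster.open_ioctx("mypool")  # hypothetical pool name
>
>     t0 = time.time()
>     # If the PG's primary dies mid-write, this call stalls until
>     # peering completes and the op is resent; it does not fail.
>     ioctx.write_full("myobject", b"new data")
>     print("acked after %.1f s" % (time.time() - t0))
>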
> The only way you can get the torn write you describe here is if the
> client had simultaneous outstanding writes and itself fails while they
> are in progress, and you can see the same torn write on a local FS too
> if there is a crash while simultaneous writes are in progress.
> -Greg
>
>
> Thanks Greg for the clarifications. It makes a lot of sense. So to
> answer the original question: both B and C will have the new copy of
> the object; A crashes without sending the client ack, and the librados
> layer will keep retrying, writing the same copy that B and C already
> have, plus to the new member D of the acting set. The client will never
> receive an error, just experience a delay. It is also interesting that
> if the client itself crashes while retrying, the new version of the
> object will still be kept... I think :)
>
> The case of the client receiving an error is probably limited to corner
> cases, like timeouts at levels above (iSCSI/SMB/NFS gateways, if set
> too low) or timeouts set at the application level.
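>
> As a hedged example of such an application-level timeout: librados
> itself can be told to give up instead of retrying forever via the
> rados_osd_op_timeout option, at which point the caller does see an
> error. A sketch with the Python rados binding; names and the 30 s
> value are illustrative:
>
>     import rados
>
>     cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
>     # An op stuck longer than 30 s now fails with a timeout instead
>     # of blocking while librados retries indefinitely.
>     cluster.conf_set("rados_osd_op_timeout", "30")
>     cluster.connect()
>     ioctx = cluster.open_ioctx("mypool")  # hypothetical pool name
>     try:
>         ioctx.write_full("myobject", b"data")
>     except rados.TimedOut:
>         print("op timed out; object state must be treated as unknown")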
>
>
>
>> /maged
>>
>> On 23/10/2024 06:54, Vigneshwar S wrote:
>> > Hi Frédéric,
>> >
>> > Section 5a states that divergent events would be tracked and
>> > deleted. But in the scenario I’ve mentioned, both B and C will have
>> > the same history.
>> >
>> > So if the new primary is B, then when peering, it peers with C,
>> > looks at the history, sees no divergent events, and keeps the
>> > object, right?
>> >
>> > Regards,
>> > Vigneshwar
>> >
>> > On Wed, 23 Oct 2024 at 9:09 AM, Frédéric Nass <
>> > frederic.nass@xxxxxxxxxxxxxxxx> wrote:
>> >
>> >> Hi Vigneshwar,
>> >>
>> >> You might want to check the '5a' section of the peering process
>> >> documentation [1].
>> >>
>> >> Regards,
>> >> Frédéric.
>> >>
>> >> [1]
>> >>
>> >> https://docs.ceph.com/en/reef/dev/peering/#description-of-the-peering-process
>> >>
>> >> ------------------------------
>> >> *From:* Vigneshwar S <svigneshj@xxxxxxxxx>
>> >> *Sent:* Tuesday, 22 October 2024 11:05
>> >> *To:* ceph-users@xxxxxxx
>> >> *Subject:* How does Ceph clean stale objects on primary OSD failure?
>> >>
>> >>
>> >> Hi everyone, I'm studying Ceph replication and peering, and I have a
>> >> question about how Ceph handles cleaning up stale objects on primary
>> >> OSD failure.
>> >>
>> >> Scenario:
>> >>
>> >> Consider a placement group PG1 whose acting set is [A, B, C], with A
>> >> as the primary.
>> >>
>> >> t1: Client writes an object to A
>> >> t2: A initiates local write
>> >> t3: A sends replication request to B
>> >> t4: A sends replication request to C
>> >> t5: A completes local write
>> >> t6: A gets replication response from B
>> >> t7: A gets replication response from C
>> >> t8: A crashes before sending an acknowledgement to the client
>> >>
>> >> If A never comes back, the object written to B and C would be stale,
>> >> as it was never acknowledged to the client. How does Ceph clean up
>> >> this stale object?
>> >>
>> >> Also, it would be helpful if you could share any documentation or
>> >> articles explaining Ceph replication and peering at a low level.
>> >>
>> >> Regards,
>> >> Vigneshwar.
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



