Re: How Ceph cleans stale object on primary OSD failure?

Maged Mokhtar <mmokhtar@xxxxxxxxxxx> · Wed, 23 Oct 2024 17:31:59 +0300

On 23/10/2024 16:40, Gregory Farnum wrote:
On Wed, Oct 23, 2024 at 5:44 AM Maged Mokhtar <mmokhtar@xxxxxxxxxxx> wrote:

    This is tricky, but I think you are correct: B and C will keep the new
    copy of the object and will not revert to the old version (or remove
    the object if it did not exist before).
    To be 100% sure, you would have to dig into the code to verify this.

    My understanding is that the following is guaranteed:
    1) On a success ack to the client: the last version of the data
    written is what will be read back.
    2) On either success or failure: data is consistent. If you repeat
    reads, you will get the same data even if the acting set changes.
    3) RADOS objects are transactional at the OSD level: they are either
    updated to the new version or reverted to the old version on failure
    at the OSD level. Objects cannot be left half-written by a write
    failure.

    I think 2) is good enough. Ideally, when the client ack fails, you
    may want the old object version restored, or the object removed if it
    had not existed. This is the condition you are asking about, where B
    and C revert. This would be a more complex requirement than 2) and
    may not be supported or practical to support.
    I think if clients want to guarantee that data is reverted on failure,
    they need to add transactional code themselves, such as database
    journaling. This applies at least to block and filesystem clients.
    These clients would typically have such transactional code anyway to
    handle physical hardware failures.

    Even if RADOS did provide this extra transactional level and had B
    and C revert to the old object version, client data could still be
    corrupted if a client write spanned 2 RADOS objects and one write
    succeeded while the other failed. Even though both RADOS objects are
    consistent, the client data is corrupt because some of it is new and
    some is old, so transactional code at the client will always be
    needed to revert data on failures.

All of this is transparent to RADOS users. Written objects will 
generally remain written, and librados will retry operations (and get 
the same response on retry as they should have gotten on first submit) 
without returning errors to the caller.
The only way you can get the torn write you describe here is if the 
client had simultaneous outstanding writes and itself fails while they 
are in-progress — and you can see this same torn write on a local FS 
too if there is a crash while simultaneous writes are in progress.
-Greg
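
To make the torn-write case concrete, here is a minimal Python sketch of a
client-side undo journal for an update that spans two objects. It is a toy
model under stated assumptions: the store and the journal are plain dicts
standing in for RADOS objects and a durable client journal, and
update_objects, recover, and demo are hypothetical names, not librados calls.

```python
class CrashError(Exception):
    """Stands in for the client dying in the middle of an update."""


def update_objects(store, journal, updates, crash_after=None):
    """Apply several object writes, recording the old values first."""
    # Record the previous contents before touching any object (None = absent).
    journal["undo"] = {name: store.get(name) for name in updates}
    for i, (name, data) in enumerate(updates.items()):
        if crash_after is not None and i == crash_after:
            raise CrashError("client died mid-update")
        store[name] = data
    # Commit point: once the undo record is gone, the update is final.
    journal.pop("undo")


def recover(store, journal):
    """Roll back an interrupted update; return True if a rollback happened."""
    undo = journal.pop("undo", None)
    if undo is None:
        return False
    for name, old in undo.items():
        if old is None:
            store.pop(name, None)   # object did not exist before the update
        else:
            store[name] = old
    return True


def demo():
    store = {"obj1": b"old1", "obj2": b"old2"}
    journal = {}
    try:
        # Crash after the first object has been written.
        update_objects(store, journal,
                       {"obj1": b"new1", "obj2": b"new2"}, crash_after=1)
    except CrashError:
        pass
    torn = dict(store)              # obj1 is new, obj2 is still old
    rolled_back = recover(store, journal)
    return torn, rolled_back, store
```

Each object is individually consistent in the torn state; only the client
knows the two writes belonged together, which is why the rollback logic has
to live in the client in this sketch.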

Thanks Greg for the clarifications. It makes a lot of sense. So to
answer the original question: both B and C will have the new copy of the
object. A crashes without sending the client ack, so the librados layer
keeps retrying the write, which B and C already have, against the new
acting set that includes a new member D. The client will never receive an
error; it will just experience a delay. It is also interesting that if the
client itself crashes while retrying, the new version of the object will
still be kept, I think :)
The cases where the client receives an error are probably corner cases,
such as timeouts at higher levels like iSCSI/SMB/NFS gateways if set too
low, or timeouts set at the application level.
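
The sequence above can be sketched as a small Python toy model. It assumes
that each OSD remembers the request id of every write it has applied, so a
retried write is recognised and answered with the original result instead of
being applied again. My understanding is that real OSDs use the PG log for
this, but I have not verified it against the code. ToyOSD, replicate, and
simulate are hypothetical names, and copying state from B to D stands in for
recovery/backfill.

```python
from dataclasses import dataclass, field


@dataclass
class ToyOSD:
    name: str
    objects: dict = field(default_factory=dict)  # object name -> data
    pg_log: dict = field(default_factory=dict)   # request id -> version

    def apply(self, reqid, obj, data):
        """Apply a write once; a repeated reqid returns the recorded version."""
        if reqid in self.pg_log:
            return self.pg_log[reqid]            # duplicate: nothing rewritten
        version = len(self.pg_log) + 1
        self.objects[obj] = data
        self.pg_log[reqid] = version
        return version


def replicate(acting, reqid, obj, data):
    """Fan a write out to the acting set and return the version to ack."""
    versions = {osd.apply(reqid, obj, data) for osd in acting}
    assert len(versions) == 1, "replicas disagree"
    return versions.pop()


def simulate():
    a, b, c, d = (ToyOSD(n) for n in "ABCD")
    reqid = "client.1:op1"

    # t1..t7: the write lands on A, B and C; A dies at t8 before the ack.
    first = replicate([a, b, c], reqid, "obj", b"new")

    # New acting set [B, C, D]; D is brought up to date from B.
    d.objects, d.pg_log = dict(b.objects), dict(b.pg_log)

    # The client never saw an ack, so it resends the same request.
    retry = replicate([b, c, d], reqid, "obj", b"new")
    return first, retry, [osd.objects["obj"] for osd in (b, c, d)]
```

In this model the retry returns the same version as the first attempt and
B, C and D all hold the new data, so nothing stale is left to clean up.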

    /maged

    On 23/10/2024 06:54, Vigneshwar S wrote:
    > Hi Frédéric,
    >
    > Section 5a states that the divergent events would be tracked and
    > deleted. But in the scenario I've mentioned, both B and C will have
    > the same history.
    >
    > So if the new primary is B, then when peering, it peers with C and
    > looks at the history, concludes there are no divergent events, and
    > keeps the object, right?
    >
    > Regards,
    > Vigneshwar
    >
    > On Wed, 23 Oct 2024 at 9:09 AM, Frédéric Nass
    > <frederic.nass@xxxxxxxxxxxxxxxx> wrote:
    >
    >> Hi Vigneshwar,
    >>
    >> You might want to check section '5a' of the peering process
    >> documentation [1].
    >>
    >> Regards,
    >> Frédéric.
    >>
    >> [1] https://docs.ceph.com/en/reef/dev/peering/#description-of-the-peering-process
    >>
    >> ------------------------------
    >> *From:* Vigneshwar S <svigneshj@xxxxxxxxx>
    >> *Sent:* Tuesday, 22 October 2024 11:05
    >> *To:* ceph-users@xxxxxxx
    >> *Subject:* How Ceph cleans stale object on primary OSD failure?
    >>
    >>
    >> Hi everyone, I'm studying Ceph replication and peering and have a
    >> question about how Ceph handles cleaning of stale objects on primary
    >> OSD failure.
    >>
    >> Scenario:
    >>
    >> Consider a placement group, PG1, whose acting set is [A, B, C] and
    >> A is the primary.
    >>
    >> t1: Client writes an object to A
    >> t2: A initiates local write
    >> t3: A sends replication request to B
    >> t4: A sends replication request to C
    >> t5: A completes local write
    >> t6: A gets replication response from B
    >> t7: A gets replication response from C
    >> t8: A crashes before sending the acknowledgement to the client
    >>
    >> If A never comes back, the object written to B and C would be stale,
    >> as it was never acknowledged to the client. How does Ceph clean up
    >> this stale object?
    >>
    >> Also, it would be helpful if you could share any documentation or
    >> articles explaining Ceph replication and peering at a low level.
    >>
    >> Regards,
    >> Vigneshwar.
    >> _______________________________________________
    >> ceph-users mailing list -- ceph-users@xxxxxxx
    >> To unsubscribe send an email to ceph-users-leave@xxxxxxx
    >>