Thanks a lot Greg and Maged!

On Wed, Oct 23, 2024 at 8:02 PM Maged Mokhtar <mmokhtar@xxxxxxxxxxx> wrote:
>
> On 23/10/2024 16:40, Gregory Farnum wrote:
>> On Wed, Oct 23, 2024 at 5:44 AM Maged Mokhtar <mmokhtar@xxxxxxxxxxx>
>> wrote:
>>>
>>> This is tricky but I think you are correct: B and C will keep the new
>>> object copy and not revert to the old object version (or to removing
>>> the object, if it had not existed before). To be 100% sure you have
>>> to dig into the code to verify this.
>>>
>>> My understanding: what is guaranteed is:
>>> 1) On a success ack to the client: obviously the last version of the
>>> data written is what will be read back.
>>> 2) On either success or failure: data is consistent; if you repeat
>>> reads you will get the same data, even if the acting set changes.
>>> 3) rados objects are transactional at the OSD level: they are either
>>> updated to the new version or reverted to the old version on failure
>>> at the OSD level. Objects cannot be left half-written by a failed
>>> write.
>>>
>>> I think 2) is good enough. Ideally, on a client ack failure you may
>>> want the old object version to be restored, or the object removed if
>>> it had not existed; this is the condition you are asking for B and C
>>> to revert to. It would be a more complex requirement than 2) and may
>>> not be supported or practical to support.
>>> I think if clients want to guarantee data is reverted on failure,
>>> they need to add transactional code themselves, like database
>>> journaling, etc. This is at least the case with block and filesystem
>>> clients, which would typically have such transactional code anyway
>>> to handle physical hardware failures, etc.
>>>
>>> Even if rados did provide this extra transactional level and had B
>>> and C revert to the old object version, this could still corrupt
>>> client data if the client write spanned 2 rados objects and one
>>> succeeded while the other failed. Even though both rados objects are
>>> consistent, the client data is corrupt, since some of it is new and
>>> some old, so transactional code at the client will always be needed
>>> to revert data on failures.
>>
>> All of this is transparent to RADOS users. Written objects will
>> generally remain written, and librados will retry operations (and get
>> the same response on retry as they should have gotten on first
>> submit) without returning errors to the caller.
>> The only way you can get the torn write you describe here is if the
>> client had simultaneous outstanding writes and itself fails while
>> they are in progress — and you can see this same torn write on a
>> local FS too if there is a crash while simultaneous writes are in
>> progress.
>> -Greg
>
> Thanks Greg for the clarifications. It makes a lot of sense. So to
> answer the original question: both B and C will keep the new copy of
> the object when A crashes without sending the client ack; the librados
> layer will keep retrying the same write, which B and C already have,
> against the new acting set, including any new member D. The client
> will never receive an error, just experience delay. It is also
> interesting that if the client itself crashes while retrying, the new
> version of the object will still be kept... I think :)
>
> The cases where the client does receive an error are probably corner
> cases, like timeouts at the levels above (iSCSI/SMB/NFS gateways) if
> set too low, or timeouts set at the application level.
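To make the retry behavior concrete, here is a minimal client-side
sketch using the Python rados bindings. The conffile path, pool name,
object name, and payload are placeholders for this example, and error
handling is trimmed:

    import rados

    # Placeholder cluster config and pool name for this sketch.
    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('rbd')
    try:
        # If the PG's primary dies after committing to the replicas but
        # before acking (the t8 case below), this call does not fail:
        # librados notices the osdmap change, resends the op to the new
        # primary, and returns only once the write is acknowledged. The
        # caller sees extra latency, never an error or a lost write.
        ioctx.write_full('myobject', b'new version of the data')

        # A read after the retry returns the acknowledged version.
        print(ioctx.read('myobject'))
    finally:
        ioctx.close()
        cluster.shutdown()

The asynchronous variants (aio_write_full and friends) behave the same
way: the completion callback fires only once the operation has actually
succeeded, however many internal resends that took.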
>>> /maged
>>>
>>> On 23/10/2024 06:54, Vigneshwar S wrote:
>>>> Hi Frédéric,
>>>>
>>>> The 5a section states that the divergent events would be tracked
>>>> and deleted. But in the scenario I've mentioned, both B and C will
>>>> have the same history.
>>>>
>>>> So if the new primary is B, then when peering, it peers with C,
>>>> looks at the history, thinks there are no divergent events, and
>>>> keeps the object, right?
>>>>
>>>> Regards,
>>>> Vigneshwar
>>>>
>>>> On Wed, 23 Oct 2024 at 9:09 AM, Frédéric Nass <
>>>> frederic.nass@xxxxxxxxxxxxxxxx> wrote:
>>>>
>>>>> Hi Vigneshwar,
>>>>>
>>>>> You might want to check the '5a' section of the peering process
>>>>> documentation [1].
>>>>>
>>>>> Regards,
>>>>> Frédéric.
>>>>>
>>>>> [1]
>>>>> https://docs.ceph.com/en/reef/dev/peering/#description-of-the-peering-process
>>>>>
>>>>> ------------------------------
>>>>> From: Vigneshwar S <svigneshj@xxxxxxxxx>
>>>>> Sent: Tuesday, October 22, 2024 11:05
>>>>> To: ceph-users@xxxxxxx
>>>>> Subject: How Ceph cleans stale object on primary OSD failure?
>>>>>
>>>>> Hi everyone, I'm studying Ceph replication and peering and have a
>>>>> doubt about how Ceph handles cleaning of stale objects on primary
>>>>> OSD failure.
>>>>>
>>>>> Scenario:
>>>>>
>>>>> Consider a placement group, PG1, whose acting set is [A, B, C] and
>>>>> A is the primary.
>>>>>
>>>>> t1: Client writes an object to A
>>>>> t2: A initiates local write
>>>>> t3: A sends replication request to B
>>>>> t4: A sends replication request to C
>>>>> t5: A completes local write
>>>>> t6: A gets replication response from B
>>>>> t7: A gets replication response from C
>>>>> t8: A crashes and hasn't sent an acknowledgement to the client
>>>>>
>>>>> If A never comes back, the object written to B and C would be
>>>>> stale, as it was not acknowledged to the client. How does Ceph
>>>>> clean this stale object?
>>>>>
>>>>> Also, it would be helpful if you could share any documentation or
>>>>> articles explaining Ceph replication and peering at a low level.
>>>>>
>>>>> Regards,
>>>>> Vigneshwar.
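Maged's caveat about a single logical write spanning two rados objects
is the part clients must handle themselves. Here is a minimal
write-ahead-intent sketch, again with the Python rados bindings; the
pool, object names, and transaction id are invented for illustration,
and a real implementation would also journal the data itself so that
recovery can roll the update forward or back:

    import json
    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('rbd')  # placeholder pool name
    try:
        new_parts = {'part.0': b'first half', 'part.1': b'second half'}

        # 1) Record the intent first. The intent is a single object, so
        #    per point 3) above it is written atomically: it is either
        #    fully present or absent, never torn.
        intent = {'txid': 42, 'objects': sorted(new_parts)}
        ioctx.write_full('journal.intent', json.dumps(intent).encode())

        # 2) Apply the multi-object update. A crash between these two
        #    writes is exactly the torn-write case; the leftover intent
        #    tells a recovery pass which objects may be inconsistent.
        for name, data in new_parts.items():
            ioctx.write_full(name, data)

        # 3) Commit: removing the intent marks the update complete.
        ioctx.remove_object('journal.intent')
    finally:
        ioctx.close()
        cluster.shutdown()

The per-object atomicity that rados does guarantee is what makes the
intent record reliable; the only window that needs client-side recovery
is the one between the two data writes.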