Jan,

I know that Sage has worked through a lot of this and spent a lot of time
on it, so I'm somewhat inclined to say that if he says it needs to be
there, then it needs to be there. I, however, have been known to stare at
the trees so much that I miss the forest, and I understand some of the
points you bring up about data consistency and recovery from the client
perspective.

One thing that might be helpful is for you (or someone else) to get into
the code, disable the journal pieces (I'm not sure how difficult this
would be), and test it against your theories. It seems like you have a
deep and sincere interest in seeing Ceph be successful. If your theory
holds up, then presenting the data and results will help others understand
it and be more interested in it. It took me a few months of this kind of
work with the WeightedPriorityQueue, and I think the developers now
understand the limitations of the PrioritizedQueue and how the
WeightedPriorityQueue can overcome them, thanks to the battery of tests
I've done with a proof of concept. Theory and actual results can differ,
but results are generally more difficult to argue with.

Some of the decisions about the journal may be based on RADOS and not RBD.
For instance, the decision may have been made that once a RADOS write has
been handed to the cluster, the write is to be assumed durable without
waiting for an ACK. I can't see why an S3/RADOS client couldn't wait for
an ACK from the web server/OSD, but I haven't gotten into that area yet
(I've put a small librados sketch of the ack/commit distinction further
down). That is something else to keep in mind.

Lionel,

I don't think the journal is used for anything more than crash consistency
of the OSD. I don't believe the journal is used as a playback instrument
for bringing other OSDs into sync. An OSD that is out of sync will write
its updates to its journal to speed up the process, but that is the extent
of it. The OSD providing the updates has to read them from disk/page
cache. My understanding is that the journal is "never" read from, except
when the OSD process crashes. I'm happy to be corrected if I've misstated
anything.
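To illustrate the mental model I have of the journal, here's a toy
write-ahead journal in a few lines of Python. This is nothing like the
actual FileStore code (the file names and JSON format are made up) --
it's just the shape of the idea: append and fsync to the journal, ack,
apply to the backing store, and the only *read* of the journal is the
replay at startup after a crash.

import json
import os

JOURNAL = 'osd.journal'   # made-up name; stand-in for the OSD journal
STORE = 'osd.store'       # made-up name; stand-in for the filestore

def apply_to_store(key, value):
    # Apply a transaction to the backing store (the slow, seeky part).
    with open(os.path.join(STORE, key), 'w') as f:
        f.write(value)

def submit_write(key, value):
    # 1. Append the transaction to the journal and force it to disk.
    with open(JOURNAL, 'a') as j:
        j.write(json.dumps({'key': key, 'value': value}) + '\n')
        j.flush()
        os.fsync(j.fileno())
    # 2. The write is now crash-safe, so the client can be acked here.
    # 3. Apply to the store (lazily in real life; immediately for brevity).
    apply_to_store(key, value)

def replay_journal():
    # The only read path for the journal: after a crash, re-apply every
    # journaled transaction that may not have reached the store.
    if not os.path.exists(JOURNAL):
        return
    with open(JOURNAL) as j:
        for line in j:
            txn = json.loads(line)
            apply_to_store(txn['key'], txn['value'])

if __name__ == '__main__':
    os.makedirs(STORE, exist_ok=True)
    replay_journal()               # startup: catch up after a possible crash
    submit_write('obj1', 'hello')  # normal operation: journal is write-only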
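Jan, going back to the ACK point above: here's a minimal sketch using the
Python rados bindings that times how long a small write takes to be acked
(in memory on all replicas) versus committed safe (journaled on all
replicas). The conffile path, pool name, and object name are placeholders
for whatever your test cluster uses -- treat it as a starting point for
experiments, not a benchmark.

import time
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')  # adjust as needed
cluster.connect()
ioctx = cluster.open_ioctx('rbd')  # placeholder pool name

timings = {}

def on_ack(completion):
    # Fires when all replicas have the write in memory (ack).
    timings['ack'] = time.time() - timings['start']

def on_safe(completion):
    # Fires when all replicas have committed the write to their journals.
    timings['safe'] = time.time() - timings['start']

timings['start'] = time.time()
completion = ioctx.aio_write('test-object', b'x' * 4096, 0, on_ack, on_safe)
completion.wait_for_safe()

print("acked after %.6fs, safe after %.6fs" % (timings['ack'], timings['safe']))

ioctx.close()
cluster.shutdown()

Comparing those two numbers, with and without the journal pieces in play,
would be one way to put hard results behind the theory.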
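Lionel, and for the size=3/min_size=2 scenario you describe (quoted
below), here's the toy way I picture the monitors' map history breaking
the tie. This is emphatically not the real peering algorithm (which works
from PG logs and past intervals, and I may well be wrong on the details);
it's just the reasoning, with made-up names.

# Each entry: (osdmap epoch, OSDs up for the PG in that interval).
MIN_SIZE = 2
history = [
    (1, {'osd.0', 'osd.1', 'osd.2'}),  # all three up
    (2, {'osd.1', 'osd.2'}),           # osd.0 down; >= min_size, writes continue
    (3, {'osd.2'}),                    # osd.1 down too; < min_size, I/O blocks
    (4, {'osd.0', 'osd.2'}),           # osd.0 returns; who has the newest data?
]

def last_writable(history, min_size):
    """Most recent past interval in which the PG could accept writes.
    Any OSD that was up then may hold updates the others never saw."""
    for epoch, up in reversed(history[:-1]):  # exclude the current interval
        if len(up) >= min_size:
            return epoch, up
    raise RuntimeError('no writable interval in history')

epoch, candidates = last_writable(history, MIN_SIZE)
current_up = history[-1][1]
authoritative = candidates & current_up
print("last writes could have landed in epoch %d on %s"
      % (epoch, sorted(candidates)))
print("authoritative copies among the OSDs now up: %s" % sorted(authoritative))
# -> epoch 2 on ['osd.1', 'osd.2']; of those, only osd.2 is up now, so
#    osd.0 must resync from osd.2 before the PG can go active+clean.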
Robert LeBlanc

----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1

On Fri, Jan 29, 2016 at 9:27 AM, Lionel Bouton
<lionel-subscription@xxxxxxxxxxx> wrote:
> On 29/01/2016 16:25, Jan Schermer wrote:
> > [...]
>
> But if I understand correctly, there is indeed a log of the recent
> modifications in the filestore which is used when a PG is recovering
> because another OSD is lagging behind (not when Ceph reports a full
> backfill, where I suppose all objects' versions of a PG are compared).
>
> > That list of transactions becomes useful only when an OSD crashes and
> > comes back up - it needs to catch up somehow, and this is one of the
> > options. But do you really need the "content" of those transactions,
> > which is what the journal holds? If you have no such list then you
> > need to either rely on things like the mtime of the object, or simply
> > compare the hashes of the objects (scrub).
>
> This didn't seem robust enough to me, but I think I had forgotten about
> the monitors' role in maintaining coherency.
>
> Let's say you use a pool with size=3 and min_size=2. You begin with a PG
> with 3 active OSDs, then you lose a first OSD for this PG and only two
> active OSDs remain: the clients still happily read and write to this PG,
> and the downed OSD is now lagging behind.
> Then one of the remaining active OSDs disappears. Client I/O blocks
> because of min_size. Now the first downed (lagging) OSD comes back. At
> this point Ceph has everything it needs to recover (enough OSDs to reach
> min_size, and, in the surviving OSD, all the data reported committed to
> disk to the client) but it must decide which of the two OSDs actually
> has the valid data.
>
> At this point I was under the impression that OSDs could determine this
> for themselves without any outside intervention. But reflecting on this
> situation I don't see how they could handle all cases by themselves (for
> example, an active primary should be able to determine by itself that it
> must send the last modifications to any other OSD, but this wouldn't
> work if all OSDs go down for a PG: when coming back, each could be the
> last primary from its own point of view, with no robust way to decide
> which is right without the monitors being involved).
> The monitors maintain the status of each OSD for each PG if I'm not
> mistaken, so I suppose the monitors' knowledge of the situation will be
> used to determine which OSDs have the good data (the last min_size OSDs
> up for each PG) and to trigger the others to resync before the PG
> reaches active+clean.
>
> That said, this doesn't address the other point: when the resync
> happens, using the journal content of the primary could theoretically be
> faster if the filestores are on spinning disks. I realize that recent
> writes to the filestore might be in the kernel's cache (which would
> avoid the costly seeks), and that using the journal instead would
> probably mean that the OSDs maintain an in-memory index of all the I/O
> transactions still stored in the journal to be efficient, so it isn't
> such a clear win.
>
> Thanks a lot for the explanations.
>
> Lionel