Hi,

unfortunately I'm not a dev, so it's going to be someone else ripping the
journal out and trying. But I came to understand that getting rid of the
journal is not that easy a task. To me, more important would be if the
devs understood what I'm trying to say :-) because without that, any new
development will still try to accommodate whatever the original design
provided, and in this way it will be inherently "flawed".

I never used RADOS directly (the closest I came was trying RadosGW to
replace a Swift cluster), so it's possible someone built a custom app on
top of it that uses more powerful features that RBD/S3 don't need. Is
there such a project? Is someone actually building on top of librados,
or is it in reality only tied to RBD/S3 as we know it?

If the journal is really only for crash consistency in case of abrupt
OSD failure, then I can tell you right now RBD doesn't need it. The
tricky part comes when you have to compare different data on the OSDs
after a crash - from a filesystem perspective anything goes, but we need
to stick to one "version" of the data (or leave it marked as unused in a
bitmap, if one is to be used; who cares what data was actually in
there). But no need to reiterate, I guess; there are more scenarios I
haven't thought of.

I'd like to see Ceph competitive with vSAN, ScaleIO or even some
concoction I can brew using NBD/DM/ZFS/whatever, and it's pretty obvious
to me something isn't right in the design - at least for the "most
important" workload, which is RBD.

Jan

> On 29 Jan 2016, at 18:05, Robert LeBlanc <robert@xxxxxxxxxxxxx> wrote:
>
> Signed PGP part
> Jan,
>
> I know that Sage has worked through a lot of this and spent a lot of
> time on it, so I'm somewhat inclined to say that if he says it needs
> to be there, then it needs to be there.
> I, however, have been known to stare at the trees so much that I miss
> the forest, and I understand some of the points that you bring up
> about data consistency and recovery from the client perspective. One
> thing that might be helpful is for you (or someone else) to get into
> the code and disable the journal pieces (not sure how difficult this
> would be) and test it against your theories. It seems like you have
> some deep and sincere interest in seeing Ceph be successful. If your
> theory holds up, then presenting the data and results will help
> others understand and be more interested in it. It took me a few
> months of this kind of work with the WeightedPriorityQueue, and I
> think the developers now understand the limitations of the
> PrioritizedQueue and how WeightedPriorityQueue can overcome them,
> given the battery of tests I've done with a proof of concept. Theory
> and actual results can differ, but results are generally more
> difficult to argue with.
>
> Some of the decisions about the journal may be based on RADOS and not
> RBD. For instance, the decision may have been made that once a RADOS
> write has been given to the cluster, the write is to be assumed
> durable without waiting for an ACK. I can't see why an S3/RADOS
> client couldn't wait for an ACK from the web server/OSD, but I
> haven't gotten into that area yet. That is something else to keep in
> mind.
>
> Lionel,
>
> I don't think the journal is used for anything more than crash
> consistency of the OSD. I don't believe the journal is used as a
> playback instrument for bringing other OSDs into sync. An OSD that is
> out of sync will write its updates to its journal to speed up the
> process, but that is the extent of it. The OSD providing the updates
> has to read the updates it sends from disk/page cache. My
> understanding is that the journal is "never" read from, except when
> the OSD process crashes.
>
> I'm happy to be corrected if I've misstated anything.
>
> Robert LeBlanc
>
> ----------------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
>
>
> On Fri, Jan 29, 2016 at 9:27 AM, Lionel Bouton
> <lionel-subscription@xxxxxxxxxxx> wrote:
> > On 29/01/2016 16:25, Jan Schermer wrote:
> >
> > [...]
> >
> > But if I understand correctly, there is indeed a log of the recent
> > modifications in the filestore which is used when a PG is
> > recovering because another OSD is lagging behind (not when Ceph
> > reports a full backfill, where I suppose all objects' versions of a
> > PG are compared).
> >
> > That list of transactions becomes useful only when an OSD crashes
> > and comes back up - it needs to catch up somehow, and this is one
> > of the options. But do you really need the "content" of those
> > transactions, which is what the journal stores? If you have no such
> > list, then you need to either rely on things like the mtime of the
> > object, or simply compare the hashes of the objects (scrub).
> >
> >
> > This didn't seem robust enough to me, but I think I had forgotten
> > about the monitors' role in maintaining coherency.
> >
> > Let's say you use a pool with size=3 and min_size=2. You begin with
> > a PG with 3 active OSDs, then you lose a first OSD for this PG and
> > only two active OSDs remain: the clients still happily read and
> > write to this PG, and the downed OSD is now lagging behind.
> > Then one of the remaining active OSDs disappears. Client I/O blocks
> > because of min_size. Now the first downed (lagging) OSD comes back.
> > At this point Ceph has everything it needs to recover (enough OSDs
> > to reach min_size, and all the data reported committed to disk to
> > the client in the surviving OSD) but must decide which of the two
> > OSDs actually has the valid data.
> >
> > At this point I was under the impression that OSDs could determine
> > this for themselves without any outside intervention.
> > But reflecting on this situation, I don't see how they could
> > handle all cases by themselves (for example, an active primary
> > should be able to determine by itself that it must send the last
> > modifications to any other OSD, but this wouldn't work if all OSDs
> > for a PG go down: when coming back, each could be the last primary
> > from its own point of view, with no robust way to decide which is
> > right without the monitors being involved).
> > The monitors maintain the status of each OSD for each PG, if I'm
> > not mistaken, so I suppose the monitors' knowledge of the situation
> > will be used to determine which OSDs have the good data (the last
> > min_size OSDs up for each PG) and trigger the others to resync
> > before the PG reaches active+clean.
> >
> > That said, this doesn't address the other point: when the resync
> > happens, using the journal content of the primary could
> > theoretically be faster if the filestores are on spinning disks. I
> > realize that recent writes to the filestore might be in the
> > kernel's cache (which would avoid the costly seeks), and that using
> > the journal instead would probably require the OSDs to maintain an
> > in-memory index of all the I/O transactions still stored in the
> > journal to be efficient, so it isn't such a clear win.
> >
> > Thanks a lot for the explanations.
> >
> > Lionel
> >
> > _______________________________________________
> > ceph-users mailing list
> > ceph-users@xxxxxxxxxxxxxx
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
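
P.S.: For anyone following along, the crash-consistency role of the
journal that Robert describes - append and sync a record, ack the
write, apply it to the backing store afterwards, and read the journal
back only after a crash - can be sketched in a few lines of Python.
This is an illustrative toy, not Ceph code; the `WalStore` class and
everything in it are invented for the example.

```python
import json
import os


class WalStore:
    """Toy write-ahead journal: a record is considered acked once it
    is durable in the journal; the backing store is updated afterwards
    and the journal is only read back on restart after a crash."""

    def __init__(self, journal_path):
        self.journal_path = journal_path
        self.data = {}  # stands in for the filestore's on-disk objects
        self._journal = open(journal_path, "a")

    def write(self, obj, payload):
        # 1. Append to the journal and make it durable; in Ceph terms,
        #    the ACK to the client would be sent at this point.
        rec = json.dumps({"obj": obj, "payload": payload})
        self._journal.write(rec + "\n")
        self._journal.flush()
        os.fsync(self._journal.fileno())
        # 2. Apply to the backing store (Ceph defers and batches this).
        self.data[obj] = payload

    def replay(self):
        # Only called after a crash: re-apply every journaled record so
        # the backing store catches up to everything already acked.
        with open(self.journal_path) as f:
            for line in f:
                rec = json.loads(line)
                self.data[rec["obj"]] = rec["payload"]
```

A fresh WalStore with an empty data dict models an OSD restarting
after a crash: replay() brings it back to the last acked state, and
that restart is the only time the journal is ever read - which is
exactly the "write twice, read never" cost being debated above.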