Hi,

unfortunately I'm not a dev, so it's going to be someone else ripping the
journal out and trying. But I came to understand that getting rid of the
journal is not that easy a task. To me, more important would be if the
devs understood what I'm trying to say :-) because without that, any new
development will still try to accommodate whatever the original design
provided, and in this way it will be inherently "flawed".

I never used RADOS directly (the closest I came was trying RadosGW to
replace a Swift cluster), so it's possible someone built a custom app on
top of it that uses more powerful features that RBD/S3 don't need. Is
there such a project? Is someone actually building on top of librados,
or is it in reality only tied to RBD/S3 as we know it?

If the journal is really only for crash consistency in case of abrupt
OSD failure, then I can tell you right now RBD doesn't need it. The
tricky part comes when you have to compare different data on the OSDs
after a crash - from a filesystem perspective anything goes, but we need
to stick to one "version" of the data (or leave it marked as unused in a
bitmap, if one is to be used; who cares what data was actually in
there). But no need to reiterate, I guess; there are more scenarios I
haven't thought of.

I'd like to see Ceph competitive with vSAN, ScaleIO or even some
concoction I can brew using NBD/DM/ZFS/whatever, and it's pretty obvious
to me something isn't right in the design - at least for the "most
important" workload, which is RBD.

Jan

> On 29 Jan 2016, at 18:05, Robert LeBlanc <robert@xxxxxxxxxxxxx> wrote:
>
> Signed PGP part
> Jan,
>
> I know that Sage has worked through a lot of this and spent a lot of
> time on it, so I'm somewhat inclined to say that if he says it needs
> to be there, then it needs to be there.
> I, however, have been known to stare at the trees so much that I miss
> the forest, and I understand some of the points that you bring up
> about data consistency and recovery from the client perspective. One
> thing that might be helpful is for you (or someone else) to get into
> the code and disable the journal pieces (not sure how difficult this
> would be) and test it against your theories. It seems like you have
> some deep and sincere interest in seeing Ceph be successful. If your
> theory holds up, then presenting the data and results will help
> others understand and be more interested in it. It took me a few
> months of this kind of work with the WeightedPriorityQueue, and I
> think the developers now understand the limitations of the
> PrioritizedQueue and how WeightedPriorityQueue can overcome them,
> given the battery of tests I've done with a proof of concept. Theory
> and actual results can differ, but results are generally more
> difficult to argue with.
>
> Some of the decisions about the journal may be based on RADOS and not
> RBD. For instance, the decision may have been made that once a RADOS
> write has been given to the cluster, the write is to be assumed
> durable without waiting for an ACK. I can't see why an S3/RADOS
> client couldn't wait for an ACK from the web server/OSD, but I
> haven't gotten into that area yet. That is something else to keep in
> mind.
>
> Lionel,
>
> I don't think the journal is used for anything more than crash
> consistency of the OSD. I don't believe the journal is used as a
> playback instrument for bringing other OSDs into sync. An OSD that is
> out of sync will write its updates to its journal to speed up the
> process, but that is the extent of it. The OSD providing the updates
> has to read the updates it sends from disk/page cache. My
> understanding is that the journal is "never" read from, except when
> the OSD process crashes.
>
> I'm happy to be corrected if I've misstated anything.
>
> Robert LeBlanc
>
> ----------------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
>
>
> On Fri, Jan 29, 2016 at 9:27 AM, Lionel Bouton
> <lionel-subscription@xxxxxxxxxxx> wrote:
> > On 29/01/2016 16:25, Jan Schermer wrote:
> >
> > [...]
> >
> > But if I understand correctly, there is indeed a log of the recent
> > modifications in the filestore which is used when a PG is
> > recovering because another OSD is lagging behind (not when Ceph
> > reports a full backfill, where I suppose all objects' versions of a
> > PG are compared).
> >
> > That list of transactions becomes useful only when an OSD crashes
> > and comes back up - it needs to catch up somehow, and this is one
> > of the options. But do you really need the "content" of those
> > transactions, which is what the journal stores? If you have no such
> > list, then you need to either rely on things like the mtime of the
> > object, or simply compare the hashes of the objects (scrub).
> >
> >
> > This didn't seem robust enough to me, but I think I had forgotten
> > about the monitors' role in maintaining coherency.
> >
> > Let's say you use a pool with size=3 and min_size=2. You begin with
> > a PG with 3 active OSDs, then you lose a first OSD for this PG and
> > only two active OSDs remain: the clients still happily read and
> > write to this PG, and the downed OSD is now lagging behind.
> > Then one of the remaining active OSDs disappears. Client I/O blocks
> > because of min_size. Now the first downed (lagging) OSD comes back.
> > At this point Ceph has everything it needs to recover (enough OSDs
> > to reach min_size, and all the data reported committed to disk to
> > the client in the surviving OSD) but must decide which of the two
> > OSDs actually has the valid data.
> >
> > At this point I was under the impression that OSDs could determine
> > this for themselves without any outside intervention.
> > But reflecting on this situation, I don't see how they could
> > handle all cases by themselves (for example, an active primary
> > should be able to determine by itself that it must send the last
> > modifications to any other OSD, but this wouldn't work if all OSDs
> > for a PG go down: when coming back, each could be the last primary
> > from its own point of view, with no robust way to decide which is
> > right without the monitors being involved).
> > The monitors maintain the status of each OSD for each PG, if I'm
> > not mistaken, so I suppose the monitors' knowledge of the situation
> > will be used to determine which OSDs have the good data (the last
> > min_size OSDs up for each PG) and trigger the others to resync
> > before the PG reaches active+clean.
> >
> > That said, this doesn't address the other point: when the resync
> > happens, using the journal content of the primary could
> > theoretically be faster if the filestores are on spinning disks. I
> > realize that recent writes to the filestore might be in the
> > kernel's cache (which would avoid the costly seeks), and that using
> > the journal instead would probably require the OSDs to maintain an
> > in-memory index of all the I/O transactions still stored in the
> > journal to be efficient, so it isn't such a clear win.
> >
> > Thanks a lot for the explanations.
> >
> > Lionel
> >
> > _______________________________________________
> > ceph-users mailing list
> > ceph-users@xxxxxxxxxxxxxx
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
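
P.S.: For anyone following along, the crash-consistency role of the
journal that Robert describes - append and sync a record, ack the
write, apply it to the backing store afterwards, and read the journal
back only after a crash - can be sketched in a few lines of Python.
This is an illustrative toy, not Ceph code; the `WalStore` class and
everything in it are invented for the example.

```python
import json
import os


class WalStore:
    """Toy write-ahead journal: a record is considered acked once it
    is durable in the journal; the backing store is updated afterwards
    and the journal is only read back on restart after a crash."""

    def __init__(self, journal_path):
        self.journal_path = journal_path
        self.data = {}  # stands in for the filestore's on-disk objects
        self._journal = open(journal_path, "a")

    def write(self, obj, payload):
        # 1. Append to the journal and make it durable; in Ceph terms,
        #    the ACK to the client would be sent at this point.
        rec = json.dumps({"obj": obj, "payload": payload})
        self._journal.write(rec + "\n")
        self._journal.flush()
        os.fsync(self._journal.fileno())
        # 2. Apply to the backing store (Ceph defers and batches this).
        self.data[obj] = payload

    def replay(self):
        # Only called after a crash: re-apply every journaled record so
        # the backing store catches up to everything already acked.
        with open(self.journal_path) as f:
            for line in f:
                rec = json.loads(line)
                self.data[rec["obj"]] = rec["payload"]
```

A fresh WalStore with an empty data dict models an OSD restarting
after a crash: replay() brings it back to the last acked state, and
that restart is the only time the journal is ever read - which is
exactly the "write twice, read never" cost being debated above.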