Jan,

I know that Sage has worked through a lot of this and spent a lot of time
on it, so I'm somewhat inclined to say that if he says it needs to be
there, then it needs to be there. I, however, have been known to stare at
the trees so much that I miss the forest, and I understand some of the
points you bring up about data consistency and recovery from the client
perspective.

One thing that might be helpful is for you (or someone else) to get into
the code, disable the journal pieces (I'm not sure how difficult this
would be), and test it against your theories. It seems like you have a
deep and sincere interest in seeing Ceph be successful. If your theory
holds up, then presenting the data and results will help others understand
it and be more interested in it. It took me a few months of this kind of
work with the WeightedPriorityQueue, and I think the developers now
understand the limitations of the PrioritizedQueue and how the
WeightedPriorityQueue can overcome them, thanks to the battery of tests
I've done with a proof of concept. Theory and actual results can differ,
but results are generally more difficult to argue with.

Some of the decisions about the journal may be based on RADOS and not RBD.
For instance, the decision may have been made that once a RADOS write has
been handed to the cluster, the write is to be assumed durable without
waiting for an ACK. I can't see why an S3/RADOS client couldn't wait for
an ACK from the web server/OSD, but I haven't gotten into that area yet
(I've put a small librados sketch of the ack/commit distinction further
down). That is something else to keep in mind.

Lionel,

I don't think the journal is used for anything more than crash consistency
of the OSD. I don't believe the journal is used as a playback instrument
for bringing other OSDs into sync. An OSD that is out of sync will write
its updates to its journal to speed up the process, but that is the extent
of it. The OSD providing the updates has to read them from disk/page
cache. My understanding is that the journal is "never" read from, except
when the OSD process crashes. I'm happy to be corrected if I've misstated
anything.
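To illustrate the mental model I have of the journal, here's a toy
write-ahead journal in a few lines of Python. This is nothing like the
actual FileStore code (the file names and JSON format are made up) --
it's just the shape of the idea: append and fsync to the journal, ack,
apply to the backing store, and the only *read* of the journal is the
replay at startup after a crash.

import json
import os

JOURNAL = 'osd.journal'   # made-up name; stand-in for the OSD journal
STORE = 'osd.store'       # made-up name; stand-in for the filestore

def apply_to_store(key, value):
    # Apply a transaction to the backing store (the slow, seeky part).
    with open(os.path.join(STORE, key), 'w') as f:
        f.write(value)

def submit_write(key, value):
    # 1. Append the transaction to the journal and force it to disk.
    with open(JOURNAL, 'a') as j:
        j.write(json.dumps({'key': key, 'value': value}) + '\n')
        j.flush()
        os.fsync(j.fileno())
    # 2. The write is now crash-safe, so the client can be acked here.
    # 3. Apply to the store (lazily in real life; immediately for brevity).
    apply_to_store(key, value)

def replay_journal():
    # The only read path for the journal: after a crash, re-apply every
    # journaled transaction that may not have reached the store.
    if not os.path.exists(JOURNAL):
        return
    with open(JOURNAL) as j:
        for line in j:
            txn = json.loads(line)
            apply_to_store(txn['key'], txn['value'])

if __name__ == '__main__':
    os.makedirs(STORE, exist_ok=True)
    replay_journal()               # startup: catch up after a possible crash
    submit_write('obj1', 'hello')  # normal operation: journal is write-only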
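Jan, going back to the ACK point above: here's a minimal sketch using the
Python rados bindings that times how long a small write takes to be acked
(in memory on all replicas) versus committed safe (journaled on all
replicas). The conffile path, pool name, and object name are placeholders
for whatever your test cluster uses -- treat it as a starting point for
experiments, not a benchmark.

import time
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')  # adjust as needed
cluster.connect()
ioctx = cluster.open_ioctx('rbd')  # placeholder pool name

timings = {}

def on_ack(completion):
    # Fires when all replicas have the write in memory (ack).
    timings['ack'] = time.time() - timings['start']

def on_safe(completion):
    # Fires when all replicas have committed the write to their journals.
    timings['safe'] = time.time() - timings['start']

timings['start'] = time.time()
completion = ioctx.aio_write('test-object', b'x' * 4096, 0, on_ack, on_safe)
completion.wait_for_safe()

print("acked after %.6fs, safe after %.6fs" % (timings['ack'], timings['safe']))

ioctx.close()
cluster.shutdown()

Comparing those two numbers, with and without the journal pieces in play,
would be one way to put hard results behind the theory.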
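Lionel, and for the size=3/min_size=2 scenario you describe (quoted
below), here's the toy way I picture the monitors' map history breaking
the tie. This is emphatically not the real peering algorithm (which works
from PG logs and past intervals, and I may well be wrong on the details);
it's just the reasoning, with made-up names.

# Each entry: (osdmap epoch, OSDs up for the PG in that interval).
MIN_SIZE = 2
history = [
    (1, {'osd.0', 'osd.1', 'osd.2'}),  # all three up
    (2, {'osd.1', 'osd.2'}),           # osd.0 down; >= min_size, writes continue
    (3, {'osd.2'}),                    # osd.1 down too; < min_size, I/O blocks
    (4, {'osd.0', 'osd.2'}),           # osd.0 returns; who has the newest data?
]

def last_writable(history, min_size):
    """Most recent past interval in which the PG could accept writes.
    Any OSD that was up then may hold updates the others never saw."""
    for epoch, up in reversed(history[:-1]):  # exclude the current interval
        if len(up) >= min_size:
            return epoch, up
    raise RuntimeError('no writable interval in history')

epoch, candidates = last_writable(history, MIN_SIZE)
current_up = history[-1][1]
authoritative = candidates & current_up
print("last writes could have landed in epoch %d on %s"
      % (epoch, sorted(candidates)))
print("authoritative copies among the OSDs now up: %s" % sorted(authoritative))
# -> epoch 2 on ['osd.1', 'osd.2']; of those, only osd.2 is up now, so
#    osd.0 must resync from osd.2 before the PG can go active+clean.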
Robert LeBlanc

----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1

On Fri, Jan 29, 2016 at 9:27 AM, Lionel Bouton
<lionel-subscription@xxxxxxxxxxx> wrote:
> On 29/01/2016 16:25, Jan Schermer wrote:
> > [...]
>
> But if I understand correctly, there is indeed a log of the recent
> modifications in the filestore which is used when a PG is recovering
> because another OSD is lagging behind (not when Ceph reports a full
> backfill, where I suppose all objects' versions of a PG are compared).
>
> > That list of transactions becomes useful only when an OSD crashes and
> > comes back up - it needs to catch up somehow, and this is one of the
> > options. But do you really need the "content" of those transactions,
> > which is what the journal holds? If you have no such list then you
> > need to either rely on things like the mtime of the object, or simply
> > compare the hashes of the objects (scrub).
>
> This didn't seem robust enough to me, but I think I had forgotten about
> the monitors' role in maintaining coherency.
>
> Let's say you use a pool with size=3 and min_size=2. You begin with a PG
> with 3 active OSDs, then you lose a first OSD for this PG and only two
> active OSDs remain: the clients still happily read and write to this PG,
> and the downed OSD is now lagging behind.
> Then one of the remaining active OSDs disappears. Client I/O blocks
> because of min_size. Now the first downed (lagging) OSD comes back. At
> this point Ceph has everything it needs to recover (enough OSDs to reach
> min_size, and, in the surviving OSD, all the data reported committed to
> disk to the client) but it must decide which of the two OSDs actually
> has the valid data.
>
> At this point I was under the impression that OSDs could determine this
> for themselves without any outside intervention. But reflecting on this
> situation I don't see how they could handle all cases by themselves (for
> example, an active primary should be able to determine by itself that it
> must send the last modifications to any other OSD, but this wouldn't
> work if all OSDs go down for a PG: when coming back, each could be the
> last primary from its own point of view, with no robust way to decide
> which is right without the monitors being involved).
> The monitors maintain the status of each OSD for each PG if I'm not
> mistaken, so I suppose the monitors' knowledge of the situation will be
> used to determine which OSDs have the good data (the last min_size OSDs
> up for each PG) and to trigger the others to resync before the PG
> reaches active+clean.
>
> That said, this doesn't address the other point: when the resync
> happens, using the journal content of the primary could theoretically be
> faster if the filestores are on spinning disks. I realize that recent
> writes to the filestore might be in the kernel's cache (which would
> avoid the costly seeks), and that using the journal instead would
> probably mean that the OSDs maintain an in-memory index of all the I/O
> transactions still stored in the journal to be efficient, so it isn't
> such a clear win.
>
> Thanks a lot for the explanations.
>
> Lionel