Re: SSD Journal

On 29/01/2016 16:25, Jan Schermer wrote:
> [...]
>> But if I understand correctly, there is indeed a log of the recent
>> modifications in the filestore which is used when a PG is recovering
>> because another OSD is lagging behind (not when Ceph reports a full
>> backfill where I suppose all objects' versions of a PG are compared).
> That list of transactions becomes useful only when an OSD crashes and
> comes back up - it needs to catch up somehow and this is one of the
> options. But do you really need the "content" of those transactions,
> which is what the journal stores?
> If you have no such list then you need to either rely on things like
> the mtime of the object, or simply compare the hash of the objects
> (scrub).

This didn't seem robust enough to me, but I think I had forgotten about the monitors' role in maintaining coherency.
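To check that I follow, here is a toy sketch of the two catch-up strategies (plain Python, every name made up; this is only a mental model, not how the actual Ceph code works):

import hashlib

def catch_up_with_pg_log(lagging_store, authoritative_store, pg_log):
    # The log lists only the names of recently modified objects, not
    # their content: the data itself is re-read from an up-to-date peer.
    for obj in pg_log:
        if obj in authoritative_store:
            lagging_store[obj] = authoritative_store[obj]
        else:
            lagging_store.pop(obj, None)  # the object was deleted meanwhile

def catch_up_with_scrub(lagging_store, authoritative_store):
    # No log available: hash and compare every object in the PG and
    # copy over whatever differs or is missing.
    for obj, data in authoritative_store.items():
        ours = hashlib.sha1(lagging_store.get(obj, b"")).digest()
        theirs = hashlib.sha1(data).digest()
        if ours != theirs:
            lagging_store[obj] = data

The log-based variant only touches the objects named in the log, while the scrub-style variant has to read and hash every object in the PG, which is why a full backfill is so much heavier.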

Let's say you use a pool with size=3 and min_size=2. You begin with a PG with 3 active OSDs, then you lose a first OSD for this PG, so only two active OSDs remain: the clients still happily read from and write to this PG, and the downed OSD is now lagging behind.
Then one of the two remaining active OSDs disappears. Client I/O blocks because of min_size. Now the first downed (lagging) OSD comes back. At this point Ceph has everything it needs to recover (enough OSDs to reach min_size, and all the data that was reported committed to the clients still present on the surviving OSD), but it must decide which of the two OSDs actually holds the valid data.
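A minimal timeline of that scenario, with made-up version counters standing in for whatever Ceph actually tracks:

# Toy timeline of the size=3 / min_size=2 scenario above.
# The numbers are made-up version counters, not real Ceph epochs.

replicas = {"osd.0": 10, "osd.1": 10, "osd.2": 10}  # all in sync

replicas.pop("osd.2")        # first OSD goes down, stuck at version 10
replicas["osd.0"] = 15       # clients keep writing: min_size=2 still met
replicas["osd.1"] = 15

replicas.pop("osd.1")        # second OSD goes down; client I/O blocks
replicas["osd.2"] = 10       # the first downed OSD comes back

# Two OSDs are up again, so min_size is met, but only osd.0 holds the
# writes that were acknowledged to the clients. Recovery must treat
# osd.0 as authoritative, never osd.2.
assert replicas["osd.0"] > replicas["osd.2"]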

At this point I was under the impression that OSDs could determine this by themselves without any outside intervention. But reflecting on this situation, I don't see how they could handle all cases on their own: an active primary could determine by itself that it must send the last modifications to the other OSDs, but that breaks down if all the OSDs of a PG go down. When they come back, each of them could believe it was the last primary from its own point of view, with no robust way to decide which one is right unless the monitors are involved.
If I'm not mistaken, the monitors maintain the status of each OSD for each PG, so I suppose the monitors' knowledge of the situation is used to determine which OSDs hold the good data (the last min_size OSDs that were up for each PG) and to trigger the others to resync before the PG reaches active+clean.
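If that's right, the decision reduces to something like the sketch below (all names hypothetical; the real peering logic is certainly far more involved): the cluster-wide knowledge records which OSDs were serving the PG the last time it was active, and only those can be authoritative.

def pick_authoritative(up_osds, last_acting):
    # last_acting: the set of OSDs that were serving the PG the last
    # time it was active, as recorded cluster-wide by the monitors.
    candidates = up_osds & last_acting
    if not candidates:
        # None of the OSDs that saw the last acknowledged writes are
        # back yet: the PG must stay down rather than lose writes.
        return None
    # Any member of the last acting set is safe to recover from;
    # the others get resynced from it.
    return sorted(candidates)[0]

print(pick_authoritative({"osd.0", "osd.2"}, {"osd.0", "osd.1"}))  # osd.0
print(pick_authoritative({"osd.2"}, {"osd.0", "osd.1"}))           # None, PG stays down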

That said, this doesn't address the other point: when the resync happens, using the journal content of the primary could theoretically be faster if the filestores are on spinning disks. I realize that recent writes to the filestore might still be in the kernel's page cache (which would avoid the costly seeks), and that using the journal instead would probably require the OSDs to maintain an in-memory index of all the I/O transactions still stored in the journal to be efficient, so it isn't such a clear win.
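The in-memory index I'm imagining would be something along these lines (purely hypothetical, assuming the journal is a flat file of appended transactions):

class JournalIndex:
    """Maps an object name to the offset of its latest entry in the
    journal file, so a recovery read could be served from the journal
    instead of seeking into the filestore."""

    def __init__(self):
        self.latest = {}                 # object name -> (offset, length)

    def record(self, obj, offset, length):
        # Called as each transaction is appended to the journal.
        self.latest[obj] = (offset, length)

    def read_for_recovery(self, journal_file, obj):
        entry = self.latest.get(obj)
        if entry is None:
            return None                  # trimmed; fall back to the filestore
        offset, length = entry
        journal_file.seek(offset)
        return journal_file.read(length)

Keeping such an index consistent as the journal gets trimmed is exactly the kind of bookkeeping that makes me doubt it would be a clear win.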

Thanks a lot for the explanations.

Lionel
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
