On 29/01/2016 01:12, Jan Schermer wrote:
> [...]
>> Second, I'm not familiar with Ceph internals, but OSDs must make sure
>> that their PGs are synced, so I was under the impression that the OSD
>> content for a PG on the filesystem should always be guaranteed to be on
>> all the other active OSDs *or* their journals (so you wouldn't apply
>> journal content unless the other journals had already committed the
>> same content). If you remove the journals, there is no intermediate
>> on-disk "buffer" that can be used to guarantee such a thing: one OSD
>> will always have data that isn't guaranteed to be on disk on the
>> others. As I understand it, this is some form of two-phase commit.
> You can simply commit the data (to the filestore), and it would in fact
> be faster.
> The client gets the write acknowledged when all the OSDs have the data -
> that doesn't change in this scenario. If one OSD gets ahead of the
> others and commits something the other OSDs don't before the whole
> cluster goes down, it doesn't hurt anything - you didn't acknowledge,
> so the client has to replay if it cares, _NOT_ the OSDs.
> The problem still exists, it just gets shifted elsewhere. But the client
> (guest filesystem) already handles this.

Hum, if one OSD gets ahead of the others, there must be a way for the
OSDs to resynchronize themselves. I assume that on resync the OSDs
compare, for each PG, something very much like a transaction id.

What I was expecting is that for a small backlog the journal - which by
design contains the most recent modifications - would be used during
recovery to fetch the recent transaction contents. That seemed efficient
to me: on rotating media in particular, fetching the data from the
journal avoids long seeks.

The first alternative I can think of is maintaining, in the filestore, a
separate log of the recently modified objects without the actual content
of the modifications. You can then fetch the objects from the filestore
as needed, but that probably seeks all over the place. When several PGs
are lagging behind on other OSDs, reading the local journal would be
even better: you have even more opportunities to order the reads and
avoid seeks on the journal, while the filestore would seek far more.

But if I understand correctly, there is indeed such a log of recent
modifications in the filestore (the PG log), and it is used when a PG is
recovering because another OSD is lagging behind (not when Ceph reports
a full backfill, where I suppose all the object versions of a PG are
compared).

Lionel
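P.S. In case my description of the alternative is unclear, here is a
rough sketch of what I mean by recovering from a log that only names the
recently modified objects, as opposed to replaying journal contents.
This is just illustrative Python with made-up names and structures, not
anything from the Ceph source:

    # Hypothetical sketch (not Ceph code): recovery driven by a per-PG log
    # of modified object names, without the modified data itself.
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class LogEntry:
        version: int   # monotonically increasing per-PG "transaction id"
        obj: str       # name of the object that was modified

    def missing_objects(authoritative_log, lagging_last_version):
        """Objects the lagging OSD must re-fetch: everything modified
        after the last version it has committed."""
        return {e.obj for e in authoritative_log
                if e.version > lagging_last_version}

    def recover(authoritative_log, lagging_last_version,
                read_from_filestore, write_locally):
        """Log-based recovery: only re-read the objects named in the log
        tail. read_from_filestore() stands in for a read from the
        up-to-date OSD's filestore, which is where the seeks happen in
        this scheme."""
        for obj in sorted(missing_objects(authoritative_log,
                                          lagging_last_version)):
            write_locally(obj, read_from_filestore(obj))

    # Example: the peer is two entries behind, so only obj_b and obj_c
    # are re-read from the filestore.
    log = [LogEntry(1, "obj_a"), LogEntry(2, "obj_b"), LogEntry(3, "obj_c")]
    recover(log, lagging_last_version=1,
            read_from_filestore=lambda o: f"<data of {o}>",
            write_locally=lambda o, data: print("recovered", o, data))

The point being that such a log only tells you *which* objects to
re-read; the data itself still has to come from the filestore, hence the
extra seeks compared to replaying a sequential journal.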