On 29/01/2016 01:12, Jan Schermer wrote:
> [...]
>> Second, I'm not familiar with Ceph internals, but OSDs must make sure
>> that their PGs are synced, so I was under the impression that the OSD
>> content for a PG on the filesystem should always be guaranteed to be on
>> all the other active OSDs *or* their journals (so you wouldn't apply
>> journal content unless the other journals had already committed the
>> same content). If you remove the journals, there is no intermediate
>> on-disk "buffer" that can be used to guarantee such a thing: one OSD
>> will always have data that isn't guaranteed to be on disk on the
>> others. As I understand it, this is some form of two-phase commit.
> You can simply commit the data (to the filestore), and it would in fact
> be faster.
> The client gets the write acknowledged when all the OSDs have the data -
> that doesn't change in this scenario. If one OSD gets ahead of the
> others and commits something the other OSDs don't before the whole
> cluster goes down, it doesn't hurt anything - you didn't acknowledge,
> so the client has to replay if it cares, _NOT_ the OSDs.
> The problem still exists, it just gets shifted elsewhere. But the client
> (guest filesystem) already handles this.

Hum, if one OSD gets ahead of the others, there must be a way for the
OSDs to resynchronize themselves. I assume that on resync the OSDs
compare, for each PG, something very much like a transaction id.

What I was expecting is that for a small backlog the journal - which by
design contains the most recent modifications - would be used during
recovery to fetch the recent transaction contents. That seemed efficient
to me: on rotating media in particular, fetching the data from the
journal avoids long seeks.

The first alternative I can think of is maintaining, in the filestore, a
separate log of the recently modified objects without the actual content
of the modifications. You can then fetch the objects from the filestore
as needed, but that probably seeks all over the place. When several PGs
are lagging behind on other OSDs, reading the local journal would be
even better: you have even more opportunities to order the reads and
avoid seeks on the journal, while the filestore would seek far more.

But if I understand correctly, there is indeed such a log of recent
modifications in the filestore (the PG log), and it is used when a PG is
recovering because another OSD is lagging behind (not when Ceph reports
a full backfill, where I suppose all the object versions of a PG are
compared).

Lionel
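P.S. In case my description of the alternative is unclear, here is a
rough sketch of what I mean by recovering from a log that only names the
recently modified objects, as opposed to replaying journal contents.
This is just illustrative Python with made-up names and structures, not
anything from the Ceph source:

    # Hypothetical sketch (not Ceph code): recovery driven by a per-PG log
    # of modified object names, without the modified data itself.
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class LogEntry:
        version: int   # monotonically increasing per-PG "transaction id"
        obj: str       # name of the object that was modified

    def missing_objects(authoritative_log, lagging_last_version):
        """Objects the lagging OSD must re-fetch: everything modified
        after the last version it has committed."""
        return {e.obj for e in authoritative_log
                if e.version > lagging_last_version}

    def recover(authoritative_log, lagging_last_version,
                read_from_filestore, write_locally):
        """Log-based recovery: only re-read the objects named in the log
        tail. read_from_filestore() stands in for a read from the
        up-to-date OSD's filestore, which is where the seeks happen in
        this scheme."""
        for obj in sorted(missing_objects(authoritative_log,
                                          lagging_last_version)):
            write_locally(obj, read_from_filestore(obj))

    # Example: the peer is two entries behind, so only obj_b and obj_c
    # are re-read from the filestore.
    log = [LogEntry(1, "obj_a"), LogEntry(2, "obj_b"), LogEntry(3, "obj_c")]
    recover(log, lagging_last_version=1,
            read_from_filestore=lambda o: f"<data of {o}>",
            write_locally=lambda o, data: print("recovered", o, data))

The point being that such a log only tells you *which* objects to
re-read; the data itself still has to come from the filestore, hence the
extra seeks compared to replaying a sequential journal.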