> On 29 Jan 2016, at 16:00, Lionel Bouton <lionel-subscription@xxxxxxxxxxx> wrote:
>
> On 29/01/2016 01:12, Jan Schermer wrote:
>> [...]
>>> Second, I'm not familiar with Ceph internals, but OSDs must make sure
>>> that their PGs are synced, so I was under the impression that the OSD
>>> content for a PG on the filesystem should always be guaranteed to be
>>> on all the other active OSDs *or* their journals (so you wouldn't apply
>>> journal content unless the other journals have already committed the
>>> same content). If you remove the journals there's no intermediate
>>> on-disk "buffer" that can be used to guarantee such a thing: one OSD
>>> will always have data that isn't guaranteed to be on disk on the
>>> others. As I understand it, you could say this is some form of 2-phase
>>> commit.
>> You can simply commit the data (to the filestore), and it would in fact
>> be faster.
>> The client gets the write acknowledged when all the OSDs have the data -
>> that doesn't change in this scenario. If one OSD gets ahead of the
>> others and commits something the other OSDs do not before the whole
>> cluster goes down, it doesn't hurt anything - you didn't acknowledge,
>> so the client has to replay if it cares, _NOT_ the OSDs.
>> The problem still exists, it just gets shifted elsewhere. But the client
>> (guest filesystem) already handles this.
>
> Hum, if one OSD gets ahead of the others there must be a way for the
> OSDs to resynchronize themselves. I assume that on resync, for each PG,
> OSDs probably compare something very much like a tx_id.

Why? Yes, it makes sense for them to have the same data when you scrub them,
but the client doesn't care. If it were a hard drive the situation would be
the same - maybe the data was written, maybe it was not. You have no way of
knowing and you don't care - the filesystem (or even any sane database)
handles this by design. It's your choice whether to replay the tx or roll it
back, because the client doesn't care either way - the block it wrote (or
didn't) is either unallocated or contains one of the two versions of the data
at that point. You clearly don't want to give the client two different
versions of the data, so something like data=journal should be used and the
data compared when the OSD comes back up... still nothing that requires the
"ceph journal", though.

> What I was expecting is that in the case of a small backlog the journal
> - containing the last modifications by design - would be used during
> recovery to fetch all the recent transaction contents. It seemed
> efficient to me: especially on rotating media, fetching data from the
> journal would avoid long seeks. The first alternative I can think of is
> maintaining a separate log of the recently modified objects in the
> filestore without the actual content of the modification. Then you can
> fetch the objects from the filestore as needed, but this probably seeks
> all over the place. In the case of multiple PGs lagging behind on other
> OSDs, reading the local journal would be even better as you have even
> more chances of ordering reads to avoid seeks on the journal, and many
> more seeks would happen on the filestore.
>
> But if I understand correctly, there is indeed a log of the recent
> modifications in the filestore which is used when a PG is recovering
> because another OSD is lagging behind (not when Ceph reports a full
> backfill, where I suppose all objects' versions in a PG are compared).

That list of transactions becomes useful only when an OSD crashes and comes
back up - it needs to catch up somehow, and this is one of the options
(rough sketch of what I mean below).
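Purely as a hypothetical sketch (made-up names in Python, not actual Ceph
code): keep a short per-PG list recording which object was touched and at
what version, and a replica that was down asks for everything past the last
version it committed:

# Hypothetical sketch (made-up names, not actual Ceph code) of catching up
# from a per-PG list of recent modifications that records *which* object
# changed and at what version, but not the data that was written.

from dataclasses import dataclass

@dataclass(frozen=True)
class LogEntry:
    version: int      # monotonically increasing per-PG sequence number
    object_name: str  # object touched by the transaction

def objects_to_recover(peer_log, last_version_committed):
    """Objects a lagging replica must re-read from a peer's filestore:
    everything modified after the last version it committed locally."""
    return sorted({e.object_name
                   for e in peer_log
                   if e.version > last_version_committed})

# The surviving replica has entries 1..5; the crashed one only got to 3.
log = [LogEntry(1, "obj_a"), LogEntry(2, "obj_b"), LogEntry(3, "obj_a"),
       LogEntry(4, "obj_c"), LogEntry(5, "obj_b")]
print(objects_to_recover(log, last_version_committed=3))  # ['obj_b', 'obj_c']

Note that a list like this only says which objects changed, not what was
written to them.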
But do you really need the "content" of those transactions, which is what
the journal keeps? If you have no such list, then you need to either rely on
things like the mtime of the object, or simply compare the hashes of the
objects (scrub). In the meantime you simply have to serve from the other
copies or stick to one copy of the data. But even if you stick to the "wrong"
version it does no harm, as long as you don't arbitrarily change that copy,
because the client didn't know what data ended up on disk and must be (and
is) prepared to use whatever you have.

>
> Lionel
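And to make the acknowledgment argument above concrete, a minimal toy sketch
(again hypothetical Python, not Ceph's actual I/O path): the write is
acknowledged only once every replica has committed it, so a write that
landed on one replica but was never acknowledged is simply something the
client replays.

# Hypothetical toy sketch (not Ceph's actual I/O path): the client only
# treats a write as durable once every replica has committed it, so a
# write that reached some replicas but was never acknowledged is simply
# replayed by the client (e.g. by the guest filesystem's own journal).

class Replica:
    def __init__(self, fails=False):
        self.store = {}
        self.fails = fails
    def commit(self, name, data):
        if self.fails:           # simulate a crash before the commit lands
            return False
        self.store[name] = data
        return True

def replicated_write(replicas, name, data):
    """Acknowledge only after *all* replicas committed; otherwise the
    client keeps the write in its own journal and replays it later."""
    for r in replicas:
        if not r.commit(name, data):
            return False         # no ack -> the client replays, not the OSDs
    return True                  # safe to acknowledge the client

replicas = [Replica(), Replica(fails=True), Replica()]
acked = replicated_write(replicas, "block_42", b"v2")
print("acknowledged:", acked)    # False: one replica now holds v2, but the
                                 # client never saw an ack, so it replays

Whichever copy the cluster ends up serving, it is a state the client was
already prepared for.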