On 29/01/2016 16:25, Jan Schermer wrote:
[...]

But if I understand correctly, there is indeed a log of the recent modifications in the filestore which is used when a PG is recovering because another OSD is lagging behind (not when Ceph reports a full backfill, where I suppose all object versions of a PG are compared). That list of transactions becomes useful only when an OSD crashes and comes back up: it needs to catch up somehow, and this is one of the options. But do you really need the "content" of those transactions, which is what the journal stores? If you have no such list, you have to rely on things like the mtime of each object, or simply compare object hashes (scrub).

This didn't seem robust enough to me, but I think I had forgotten about the monitors' role in maintaining coherency.

Let's say you use a pool with size=3 and min_size=2. You begin with a PG with 3 active OSDs, then you lose a first OSD for this PG and only two active OSDs remain: the clients still happily read and write to this PG, and the downed OSD is now lagging behind. Then one of the remaining active OSDs disappears. Client I/O blocks because of min_size. Now the first downed (lagging) OSD comes back. At this point Ceph has everything it needs to recover (enough OSDs to reach min_size, and all the data reported committed to the client present on the surviving OSD), but it must decide which of the two OSDs actually has the valid data.

I was under the impression that the OSDs could determine this by themselves, without any outside intervention. But reflecting on this situation, I don't see how they could handle all cases on their own. For example, an active primary can determine by itself that it must send the last modifications to any other OSD, but that breaks down if all OSDs of a PG go down: when they come back, each could be the last primary from its own point of view, with no robust way to decide which one is right without the monitors being involved.

The monitors maintain the status of each OSD for each PG if I'm not mistaken, so I suppose the monitors' knowledge of the situation is used to determine which OSDs have the good data (the last min_size OSDs up for each PG) and to trigger the others to resync before the PG reaches active+clean.

That said, this doesn't address the other point: when the resync happens, using the journal content of the primary could theoretically be faster if the filestores are on spinning disks. I realize that recent writes in the filestore might still be in the kernel's page cache (which would avoid the costly seeks), and that using the journal instead would probably require the OSDs to maintain an in-memory index of all the I/O transactions still stored in the journal to be efficient, so it isn't such a clear win.

Thanks a lot for the explanations. I've appended a few toy Python sketches after my signature to check that my mental model holds together.

Lionel
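
PS: first, the catch-up mechanism. Everything in this sketch (names, structures) is invented for illustration and is not Ceph's actual pg_log implementation; the point is just that the log only has to record *which* objects changed, not the write payloads themselves:

def catch_up(lagging_store, lagging_version, log, authoritative_store):
    # log: list of (version, object_name) pairs kept by the
    # up-to-date OSD, sorted by version and trimmed to a bounded
    # window. It records which objects changed, not the data:
    # payloads are re-read from the authoritative filestore.
    if log and lagging_version >= log[0][0] - 1:
        # Still inside the log window: copy only the objects
        # touched after our last applied version.
        for version, obj in log:
            if version > lagging_version:
                lagging_store[obj] = authoritative_store[obj]
        return log[-1][0]
    # Fell off the tail of the log: full backfill, i.e. copy
    # (or compare) every object, which is far more expensive.
    lagging_store.clear()
    lagging_store.update(authoritative_store)
    return log[-1][0] if log else lagging_version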
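Second, the failure sequence above as a toy simulation of how the monitors' map history could pick the authoritative OSD. Again the names are invented and this is not the real peering algorithm (Ceph's actual mechanism involves per-PG "past intervals", which I'm only crudely approximating):

mon_history = []   # list of (epoch, set_of_up_osds), a stand-in
                   # for the monitors' knowledge of the cluster

def record_epoch(epoch, up_osds):
    mon_history.append((epoch, frozenset(up_osds)))

def authoritative_osds(min_size, currently_up):
    # The OSDs holding valid data are those that were up in the
    # most recent epoch where writes were possible, i.e. where at
    # least min_size OSDs were up.
    for epoch, up in reversed(mon_history):
        if len(up) >= min_size:
            return up & currently_up
    return frozenset()

# The scenario from this mail:
record_epoch(1, {"osd.0", "osd.1", "osd.2"})  # all three up, writes served
record_epoch(2, {"osd.1", "osd.2"})           # osd.0 down, writes continue
record_epoch(3, {"osd.2"})                    # below min_size: I/O blocks
# osd.0 comes back; before the PG goes active again, peering asks
# who was up in the last interval where writes were possible:
print(authoritative_osds(2, {"osd.0", "osd.2"}))
# -> frozenset({'osd.2'}): only osd.2 was up in the last writable
#    epoch, so osd.0 must resync from it before active+clean.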
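Finally, the in-memory index I mentioned for the journal idea would look roughly like this. Purely hypothetical: as far as I know the FileStore journal is a raw ring buffer that is only replayed at startup, with no read path for recovery like this one:

journal = []        # append-only list of (obj, data) records
journal_index = {}  # obj -> position of its newest record

def journal_write(obj, data):
    journal_index[obj] = len(journal)   # remember the newest copy
    journal.append((obj, data))

def read_for_resync(obj, filestore):
    pos = journal_index.get(obj)
    if pos is not None:
        # Recent write still in the journal: serve it from there,
        # no seek into the filestore.
        return journal[pos][1]
    # Not in the journal window any more: fall back to the
    # filestore (possibly a seek on a spinning disk, unless the
    # page cache still holds the data).
    return filestore[obj]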