On Mon, 28 Dec 2015, Zhiqiang Wang wrote:
> 2015-12-27 20:48 GMT+08:00 Dong Wu <archer.wudong@xxxxxxxxx>:
> > Hi,
> > When an osd is added or removed, ceph will backfill to rebalance data.
> > e.g.:
> > - pg1.0 [1, 2, 3]
> > - add an osd (e.g. osd.7)
> > - ceph starts backfill, then pg1.0's osd set changes to [1, 2, 7]
> > - if [a, b, c, d, e] are objects needing to be backfilled to osd.7 and
> >   object a is now backfilling
> > - when a write io hits object a, the io needs to wait for it to
> >   complete, then goes on.
> > - but if the io hits object b, which has not been backfilled, the io
> >   reaches osd.1, then osd.1 sends the io to osd.2 and osd.7, but osd.7
> >   does not have object b, so osd.7 needs to wait for object b to be
> >   backfilled, then write. Is that right? Or does osd.1 only send the io
> >   to osd.2, not both?
>
> I think in this case, when the write of object b reaches osd.1, it
> holds the client write, raises the priority of the recovery of object
> b, and kicks off the recovery of it. When the recovery of object b is
> done, it requeues the client write, and then everything goes on as
> usual.

It's more complicated than that. In a normal (log-based) recovery
situation, it is something like the above: if the acting set is [1,2,3]
but 3 is missing the latest copy of A, a write to A will block on the
primary while the primary initiates recovery of A immediately. Once that
completes, the IO will continue.

For backfill, it's different. In your example, you start with [1,2,3]
then add in osd.7. The OSD will see that 7 has no data for the PG and
install a pg_temp entry mapping the PG back to [1,2,3] temporarily. Then
things will proceed normally while backfill happens to 7. Backfill won't
interfere with normal IO at all, except that IO to the portion of the PG
that has already been backfilled will also be sent to the backfill target
(7) so that it stays up to date. Once backfill completes, the pg_temp
entry is removed and the mapping changes back to [1,2,7].
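To make the two cases concrete, here is a minimal, hypothetical sketch in Python of how a primary might decide where a client write goes. It is a toy model, not Ceph code: the names PGState, acting_set, route_write, missing and last_backfill are invented for illustration, and it assumes objects are backfilled in plain string order of their names (real Ceph orders objects differently). It only encodes the two rules described above: a write to an object that an acting OSD is missing blocks (log-based recovery), while during backfill the acting set comes from pg_temp and the write is also sent to the backfill target only if the object is in the already-backfilled portion.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Hypothetical toy model; none of these names are Ceph's actual API.


@dataclass
class PGState:
    acting: list[int]  # OSDs currently serving IO for the PG
    # osd id -> names of objects that OSD lacks the latest copy of
    missing: dict[int, set[str]] = field(default_factory=dict)
    backfill_target: int | None = None  # OSD being backfilled, if any
    # Objects whose names sort <= this are already on the backfill target.
    # None means nothing has been backfilled yet. (Assumption: string order.)
    last_backfill: str | None = None


def acting_set(up: list[int], pg_temp: list[int] | None) -> list[int]:
    # A pg_temp entry, when present, overrides the computed (up) mapping.
    return list(pg_temp) if pg_temp else list(up)


def route_write(pg: PGState, obj: str) -> tuple[str, list[int]]:
    """Return ("block", []) if the write must wait for recovery of obj,
    otherwise ("send", osds) listing every OSD the write goes to."""
    # Log-based recovery: an acting OSD is missing obj, so the primary holds
    # the write and (in this model, implicitly) recovers obj first.
    for osd in pg.acting:
        if obj in pg.missing.get(osd, set()):
            return ("block", [])
    targets = list(pg.acting)
    # Backfill: only the already-backfilled portion is mirrored to the target.
    if (pg.backfill_target is not None
            and pg.last_backfill is not None
            and obj <= pg.last_backfill):
        targets.append(pg.backfill_target)
    return ("send", targets)


if __name__ == "__main__":
    # Backfill example: up set is [1, 2, 7], pg_temp pins acting to [1, 2, 3],
    # and objects up to "b" have already been copied to osd.7.
    pg = PGState(acting=acting_set([1, 2, 7], pg_temp=[1, 2, 3]),
                 backfill_target=7, last_backfill="b")
    print(route_write(pg, "a"))   # ('send', [1, 2, 3, 7])
    print(route_write(pg, "c"))   # ('send', [1, 2, 3])

    # Log-based recovery example: osd.3 is missing the latest copy of A.
    pg2 = PGState(acting=[1, 2, 3], missing={3: {"A"}})
    print(route_write(pg2, "A"))  # ('block', [])
```

In this model a write to an object that has not yet been backfilled never touches osd.7 at all, which is why backfill does not stall client IO; the backfill process itself carries that object over later.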
Only after that switch back to [1,2,7] is osd.3 allowed to remove its
copy of the PG.

sage