If we add osd.7 and it becomes the primary, pg1.0 [1, 2, 3] --> pg1.0
[7, 2, 3], is it similar to the example above? Is a pg_temp entry still
installed mapping the PG back to [1, 2, 3], with backfill happening to 7
and normal IO written to [1, 2, 3], and is IO to the portion of the PG
that has already been backfilled also sent to osd.7?

How about these examples of removing an OSD:
- pg1.0 [1, 2, 3]
- osd.3 goes down and is removed
- the mapping changes to [1, 2, 5], but osd.5 has no data, so a pg_temp
  entry is installed mapping the PG back to [1, 2], then backfill
  happens to 5
- normal IO is written to [1, 2]; if IO hits an object that has already
  been backfilled to osd.5, the IO is also sent to osd.5
- when backfill completes, the pg_temp entry is removed and the mapping
  changes back to [1, 2, 5]

Another example:
- pg1.0 [1, 2, 3]
- osd.3 goes down and is removed
- the mapping changes to [5, 1, 2], but osd.5 has no data for the PG, so
  a pg_temp entry is installed mapping the PG back to [1, 2], which
  makes osd.1 the temporary primary, then backfill happens to 5
- normal IO is written to [1, 2]; if IO hits an object that has already
  been backfilled to osd.5, the IO is also sent to osd.5
- when backfill completes, the pg_temp entry is removed and the mapping
  changes back to [5, 1, 2]

Is my analysis right?

2015-12-29 1:30 GMT+08:00 Sage Weil <sage@xxxxxxxxxxxx>:
> On Mon, 28 Dec 2015, Zhiqiang Wang wrote:
>> 2015-12-27 20:48 GMT+08:00 Dong Wu <archer.wudong@xxxxxxxxx>:
>> > Hi,
>> > When adding or removing an osd, ceph will backfill to rebalance data.
>> > eg:
>> > - pg1.0 [1, 2, 3]
>> > - add an osd (eg. osd.7)
>> > - ceph starts backfill, then the pg1.0 osd set changes to [1, 2, 7]
>> > - if [a, b, c, d, e] are objects needing to backfill to osd.7 and now
>> > object a is backfilling
>> > - when a write io hits object a, then the io needs to wait for it to
>> > complete, then goes on.
>> > - but if io hits object b, which has not been backfilled yet, the io
>> > reaches osd.1, then osd.1 sends the io to osd.2 and osd.7, but osd.7
>> > does not have object b, so osd.7 needs to wait for object b to be
>> > backfilled, then write. Is that right? Or does osd.1 only send the io
>> > to osd.2, not both?
>>
>> I think in this case, when the write of object b reaches osd.1, it
>> holds the client write, raises the priority of the recovery of object
>> b, and kicks off the recovery of it. When the recovery of object b is
>> done, it requeues the client write, and then everything goes on as
>> usual.
>
> It's more complicated than that. In a normal (log-based) recovery
> situation, it is something like the above: if the acting set is [1,2,3]
> but 3 is missing the latest copy of A, a write to A will block on the
> primary while the primary initiates recovery of A immediately. Once that
> completes the IO will continue.
>
> For backfill, it's different. In your example, you start with [1,2,3]
> then add in osd.7. The OSD will see that 7 has no data for the PG and
> install a pg_temp entry mapping the PG back to [1,2,3] temporarily. Then
> things will proceed normally while backfill happens to 7. Backfill won't
> interfere with normal IO at all, except that IO to the portion of the PG
> that has already been backfilled will also be sent to the backfill target
> (7) so that it stays up to date. Once it completes, the pg_temp entry is
> removed and the mapping changes back to [1,2,7]. Then osd.3 is allowed to
> remove its copy of the PG.
>
> sage
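
For reference, the backfill behaviour Sage describes in the quoted reply
can be modelled roughly as follows. This is only a toy sketch in Python,
not Ceph code: the class ToyPG and its methods write_targets and
backfill_one are hypothetical names, and it assumes objects are
backfilled in a fixed order and that progress is a simple counter. The
real OSD tracks backfill progress differently.

```python
from dataclasses import dataclass


@dataclass
class ToyPG:
    """Toy model of one PG during backfill (illustrative, not Ceph code)."""
    up: list               # mapping CRUSH wants, e.g. [1, 2, 7]
    pg_temp: list          # temporary acting set, e.g. [1, 2, 3]
    backfill_target: int   # OSD being filled, e.g. 7
    objects: list          # object names in (assumed) backfill order
    backfilled: int = 0    # how many objects are already on the target

    def acting(self):
        # While a pg_temp entry exists, it overrides the up set.
        return self.pg_temp if self.pg_temp else self.up

    def write_targets(self, obj):
        # Writes always go to the acting set. During backfill they are
        # also sent to the target if the object was already copied there.
        targets = list(self.acting())
        if self.pg_temp and obj in self.objects[:self.backfilled]:
            targets.append(self.backfill_target)
        return targets

    def backfill_one(self):
        # Copy the next object; drop pg_temp once everything is copied.
        if self.backfilled < len(self.objects):
            self.backfilled += 1
        if self.backfilled == len(self.objects):
            self.pg_temp = []


pg = ToyPG(up=[1, 2, 7], pg_temp=[1, 2, 3], backfill_target=7,
           objects=["a", "b", "c", "d", "e"])
pg.backfill_one()              # object a is now on osd.7
print(pg.write_targets("a"))   # [1, 2, 3, 7]
print(pg.write_targets("b"))   # [1, 2, 3]
```

In this model a write to an object that has not been backfilled yet
never waits on osd.7; it simply goes to the temporary acting set, and
the object reaches osd.7 later when backfill gets to it.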