On Thu, Jul 21, 2016 at 2:04 AM, Josh Durgin <jdurgin@xxxxxxxxxx> wrote:
> On 07/19/2016 02:26 PM, Ilya Dryomov wrote:
>>
>> Currently, osd_weight and osd_state fields are updated in the encoding
>> order.  This is wrong, because an incremental map may look like e.g.
>>
>>     new_up_client: { osd=6, addr=... }    # set osd_state and addr
>>     new_state: { osd=6, xorstate=EXISTS } # clear osd_state
>>
>> Suppose osd6's current osd_state is EXISTS (i.e. osd6 is down).  After
>> applying new_up_client, osd_state is changed to EXISTS | UP.  Carrying
>> on with the new_state update, we flip EXISTS and leave osd6 in a weird
>> "!EXISTS but UP" state.  A non-existent OSD is considered down by the
>> mapping code
>>
>> 2087         for (i = 0; i < pg->pg_temp.len; i++) {
>> 2088                 if (ceph_osd_is_down(osdmap, pg->pg_temp.osds[i])) {
>> 2089                         if (ceph_can_shift_osds(pi))
>> 2090                                 continue;
>> 2091
>> 2092                         temp->osds[temp->size++] = CRUSH_ITEM_NONE;
>>
>> and so requests get directed to the second OSD in the set instead of
>> the first, resulting in OSD-side errors like:
>>
>> [WRN] : client.4239 192.168.122.21:0/2444980242 misdirected
>> client.4239.1:2827 pg 2.5df899f2 to osd.4 not [1,4,6] in e680/680
>>
>> and hung rbds on the client:
>>
>> [  493.566367] rbd: rbd0: write 400000 at 11cc00000 (0)
>> [  493.566805] rbd: rbd0: result -6 xferred 400000
>> [  493.567011] blk_update_request: I/O error, dev rbd0, sector 9330688
>>
>> The fix is to decouple application from the decoding and:
>>
>> - apply new_weight first
>> - apply new_state before new_up_client
>> - twiddle osd_state flags if marking in
>> - clear out some of the state if osd is destroyed
>>
>> Fixes: http://tracker.ceph.com/issues/14901
>>
>> Cc: stable@xxxxxxxxxxxxxxx # 3.15+: 6dd74e44dc1d: libceph: set 'exists' flag for newly up osd
>> Cc: stable@xxxxxxxxxxxxxxx # 3.15+
>> Signed-off-by: Ilya Dryomov <idryomov@xxxxxxxxx>
>> ---
>>  net/ceph/osdmap.c | 156 +++++++++++++++++++++++++++++++++++++++---------------
>>  1 file changed, 113 insertions(+), 43 deletions(-)
>
> This matches userspace's incremental handling now. Looks good.

Technically, no.  new_primary_affinity is decoded along with new_weight
in userspace, and the crushmap is applied at the very end, after weights
and states are applied.  AFAICT the crushmap is applied last only because
of the userspace-only adjust_osd_weights(), which needs to have the new
weights.

As for primary-affinity, the following

    new_up_client: { osd=X, addr=... }
    new_state: { osd=X, xorstate=EXISTS }
    new_primary_affinity: { osd=X, aff=Y }

would be misinterpreted - the kernel client would end up with aff=Y
instead of aff=1 (reset).  However, I don't see an (obvious, at least)
way to get such a map, and, as I wanted this to be as small as possible
for backporting, I punted on that.

Let me know if you see otherwise.

Thanks,

                Ilya
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html
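
To make the ordering problem from the commit message concrete, here is a
minimal standalone sketch of the new_up_client/new_state example above.
The flag values and helper names are simplified stand-ins chosen for the
illustration, not the actual net/ceph/osdmap.c code:

#include <stdio.h>

/* Simplified stand-ins for the kernel's osd_state flags. */
#define EXISTS (1 << 0)
#define UP     (1 << 1)

/* new_up_client: mark the osd as existing and up. */
static unsigned apply_new_up_client(unsigned state)
{
	return state | EXISTS | UP;
}

/* new_state: xor the given bits into osd_state. */
static unsigned apply_new_state(unsigned state, unsigned xorstate)
{
	return state ^ xorstate;
}

int main(void)
{
	unsigned state = EXISTS;	/* osd6 exists but is down */

	/* Old behaviour: apply in encoding order (new_up_client first). */
	unsigned old = apply_new_state(apply_new_up_client(state), EXISTS);

	/* Fixed behaviour: apply new_state before new_up_client. */
	unsigned fixed = apply_new_up_client(apply_new_state(state, EXISTS));

	printf("old:   exists=%d up=%d\n", !!(old & EXISTS), !!(old & UP));
	printf("fixed: exists=%d up=%d\n", !!(fixed & EXISTS), !!(fixed & UP));
	/*
	 * old:   exists=0 up=1  -> "!EXISTS but UP", treated as down
	 * fixed: exists=1 up=1  -> osd is up, matching userspace
	 */
	return 0;
}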
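
Similarly, a toy sketch of the primary-affinity corner case discussed
above.  It assumes, per the userspace apply order described in the reply,
that clearing EXISTS resets the affinity to its default of 1; the struct,
helper names and aff=0.5 value are made up for the illustration:

#include <stdio.h>

#define EXISTS (1 << 0)
#define UP     (1 << 1)

struct osd {
	unsigned state;
	double aff;	/* primary affinity, default 1.0 */
};

/* Toy "apply" helpers mirroring the incremental entries in the example. */
static void up_client(struct osd *o)
{
	o->state |= EXISTS | UP;
}

static void new_state(struct osd *o, unsigned xorstate)
{
	o->state ^= xorstate;
	if (!(o->state & EXISTS))
		o->aff = 1.0;	/* assumed: affinity reset on removal */
}

static void primary_affinity(struct osd *o, double aff)
{
	o->aff = aff;
}

int main(void)
{
	double Y = 0.5;	/* arbitrary aff=Y from the example */

	/* Userspace order: affinity applied with weights, before states. */
	struct osd u = { EXISTS, 1.0 };
	primary_affinity(&u, Y);
	new_state(&u, EXISTS);
	up_client(&u);

	/* Kernel order: states, then up, with affinity applied last. */
	struct osd k = { EXISTS, 1.0 };
	new_state(&k, EXISTS);
	up_client(&k);
	primary_affinity(&k, Y);

	printf("userspace aff=%.1f, kernel aff=%.1f\n", u.aff, k.aff);
	/* userspace aff=1.0 (reset), kernel aff=0.5 (keeps aff=Y) */
	return 0;
}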