On 07/21/2016 08:27 AM, Ilya Dryomov wrote:
On Thu, Jul 21, 2016 at 2:04 AM, Josh Durgin <jdurgin@xxxxxxxxxx> wrote:
On 07/19/2016 02:26 PM, Ilya Dryomov wrote:
Currently, osd_weight and osd_state fields are updated in the encoding
order. This is wrong, because an incremental map may look like e.g.
new_up_client: { osd=6, addr=... } # set osd_state and addr
new_state: { osd=6, xorstate=EXISTS } # clear osd_state
Suppose osd6's current osd_state is EXISTS (i.e. osd6 is down). After
applying new_up_client, osd_state is changed to EXISTS | UP. Carrying
on with the new_state update, we flip EXISTS and leave osd6 in a weird
"!EXISTS but UP" state. A non-existent OSD is considered down by the
mapping code
for (i = 0; i < pg->pg_temp.len; i++) {
        if (ceph_osd_is_down(osdmap, pg->pg_temp.osds[i])) {
                if (ceph_can_shift_osds(pi))
                        continue;

                temp->osds[temp->size++] = CRUSH_ITEM_NONE;
and so requests get directed to the second OSD in the set instead of
the first, resulting in OSD-side errors like:
[WRN] : client.4239 192.168.122.21:0/2444980242 misdirected
client.4239.1:2827 pg 2.5df899f2 to osd.4 not [1,4,6] in e680/680
and hung rbds on the client:
[ 493.566367] rbd: rbd0: write 400000 at 11cc00000 (0)
[ 493.566805] rbd: rbd0: result -6 xferred 400000
[ 493.567011] blk_update_request: I/O error, dev rbd0, sector 9330688
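To make the osd_state arithmetic above concrete, here is a tiny standalone
sketch (not kernel code; the two flag bits are defined locally just to mirror
the EXISTS and UP states described above) of what applying the fields in
encoding order does to osd6:

#include <stdio.h>
#include <stdint.h>

#define OSD_EXISTS (1 << 0)
#define OSD_UP     (1 << 1)

int main(void)
{
        uint32_t state = OSD_EXISTS;       /* osd6 exists, currently down */

        state |= OSD_EXISTS | OSD_UP;      /* new_up_client: set exists + up */
        state ^= OSD_EXISTS;               /* new_state: xorstate=EXISTS */

        printf("exists=%d up=%d\n",
               !!(state & OSD_EXISTS), !!(state & OSD_UP));
        /* prints "exists=0 up=1" -- the "!EXISTS but UP" state that the
         * mapping code treats as down */
        return 0;
}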
The fix is to decouple application from the decoding and (see the sketch after this list):
- apply new_weight first
- apply new_state before new_up_client
- twiddle osd_state flags if marking in
- clear out some of the state if osd is destroyed
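For illustration only, a user-space sketch of that application order (the
struct and helper below are invented for the example and are not the actual
code in net/ceph/osdmap.c), applied to the osd6 incremental from above:

#include <stdio.h>
#include <stdint.h>

#define OSD_EXISTS (1 << 0)
#define OSD_UP     (1 << 1)

struct inc_osd {                       /* only the fields needed here */
        int      has_new_weight;
        uint32_t new_weight;
        uint32_t xorstate;             /* from new_state */
        int      marked_up;            /* new_up_client present? */
};

static void apply_inc(uint32_t *weight, uint32_t *state,
                      const struct inc_osd *inc)
{
        if (inc->has_new_weight)
                *weight = inc->new_weight;       /* 1. new_weight first */

        *state ^= inc->xorstate;                 /* 2. new_state ... */

        if (inc->marked_up)
                *state |= OSD_EXISTS | OSD_UP;   /* 3. ... before new_up_client */
}

int main(void)
{
        uint32_t weight = 0, state = OSD_EXISTS;  /* osd6 exists, down */
        struct inc_osd inc = {
                .xorstate  = OSD_EXISTS,   /* new_state: { osd=6, xorstate=EXISTS } */
                .marked_up = 1,            /* new_up_client: { osd=6, addr=... } */
        };

        apply_inc(&weight, &state, &inc);
        printf("exists=%d up=%d\n",
               !!(state & OSD_EXISTS), !!(state & OSD_UP));
        /* prints "exists=1 up=1" -- osd6 ends up existing and up, as intended */
        return 0;
}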
Fixes: http://tracker.ceph.com/issues/14901
Cc: stable@xxxxxxxxxxxxxxx # 3.15+: 6dd74e44dc1d: libceph: set 'exists' flag for newly up osd
Cc: stable@xxxxxxxxxxxxxxx # 3.15+
Signed-off-by: Ilya Dryomov <idryomov@xxxxxxxxx>
---
 net/ceph/osdmap.c | 156 +++++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 113 insertions(+), 43 deletions(-)
This matches userspace's incremental handling now. Looks good.
Technically, no. new_primary_affinity is decoded along with new_weight
in userspace and crushmap is applied at the very end, after weights and
states are applied. AFAICT crushmap is applied last only because of
the userspace-only adjust_osd_weights(), which needs to have the new
weights.
As for primary-affinity, the following
new_up_client: { osd=X, addr=... }
new_state: { osd=X, xorstate=EXISTS }
new_primary_affinity: { osd=X, aff=Y }
would be misinterpreted - the kernel client would stay on aff=Y instead
of aff=1 (reset), but I don't see an (obvious, at least) way to get such
a map, and, as I wanted this to be as small as possible for backporting,
I punted on that. Let me know if you see otherwise.
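To spell that corner case out, a toy sketch (plain doubles and invented
names; the "reset to aff=1" expectation is taken from the description above,
not verified against the userspace code):

#include <stdio.h>

int main(void)
{
        double aff = 1.0;      /* default primary affinity */
        double Y   = 0.5;      /* the non-default value carried by aff=Y */

        /* in decode order, new_state + new_up_client delete and re-create
         * osd X first, which per the above should reset its affinity ... */
        aff = 1.0;
        /* ... but new_primary_affinity is decoded (and applied) afterwards,
         * so the explicit value wins */
        aff = Y;

        printf("kernel client is left with aff=%g instead of the reset value 1\n",
               aff);
        return 0;
}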
Yeah, I also see no way for this to happen, and it isn't likely to be
needed in the future either. So keeping this update small sounds good
to me.
Josh