On Thu, Jul 21, 2016 at 2:04 AM, Josh Durgin <jdurgin@xxxxxxxxxx> wrote:
> On 07/19/2016 02:26 PM, Ilya Dryomov wrote:
>>
>> Currently, osd_weight and osd_state fields are updated in the encoding
>> order.  This is wrong, because an incremental map may look like e.g.
>>
>>     new_up_client: { osd=6, addr=... }    # set osd_state and addr
>>     new_state: { osd=6, xorstate=EXISTS } # clear osd_state
>>
>> Suppose osd6's current osd_state is EXISTS (i.e. osd6 is down).  After
>> applying new_up_client, osd_state is changed to EXISTS | UP.  Carrying
>> on with the new_state update, we flip EXISTS and leave osd6 in a weird
>> "!EXISTS but UP" state.  A non-existent OSD is considered down by the
>> mapping code
>>
>> 2087         for (i = 0; i < pg->pg_temp.len; i++) {
>> 2088                 if (ceph_osd_is_down(osdmap, pg->pg_temp.osds[i])) {
>> 2089                         if (ceph_can_shift_osds(pi))
>> 2090                                 continue;
>> 2091
>> 2092                         temp->osds[temp->size++] = CRUSH_ITEM_NONE;
>>
>> and so requests get directed to the second OSD in the set instead of
>> the first, resulting in OSD-side errors like:
>>
>> [WRN] : client.4239 192.168.122.21:0/2444980242 misdirected
>> client.4239.1:2827 pg 2.5df899f2 to osd.4 not [1,4,6] in e680/680
>>
>> and hung rbds on the client:
>>
>> [  493.566367] rbd: rbd0: write 400000 at 11cc00000 (0)
>> [  493.566805] rbd: rbd0: result -6 xferred 400000
>> [  493.567011] blk_update_request: I/O error, dev rbd0, sector 9330688
>>
>> The fix is to decouple application from the decoding and:
>>
>> - apply new_weight first
>> - apply new_state before new_up_client
>> - twiddle osd_state flags if marking in
>> - clear out some of the state if osd is destroyed
>>
>> Fixes: http://tracker.ceph.com/issues/14901
>>
>> Cc: stable@xxxxxxxxxxxxxxx # 3.15+: 6dd74e44dc1d: libceph: set 'exists' flag for newly up osd
>> Cc: stable@xxxxxxxxxxxxxxx # 3.15+
>> Signed-off-by: Ilya Dryomov <idryomov@xxxxxxxxx>
>> ---
>>  net/ceph/osdmap.c | 156 +++++++++++++++++++++++++++++++++++++++---------------
>>  1 file changed, 113 insertions(+), 43 deletions(-)
>
> This matches userspace's incremental handling now. Looks good.

Technically, no.  new_primary_affinity is decoded along with new_weight
in userspace, and the crushmap is applied at the very end, after weights
and states are applied.  AFAICT the crushmap is applied last only because
of the userspace-only adjust_osd_weights(), which needs to have the new
weights.

As for primary-affinity, the following

    new_up_client: { osd=X, addr=... }
    new_state: { osd=X, xorstate=EXISTS }
    new_primary_affinity: { osd=X, aff=Y }

would be misinterpreted - the kernel client would end up with aff=Y
instead of aff=1 (reset).  However, I don't see an (obvious, at least)
way to get such a map, and, as I wanted this to be as small as possible
for backporting, I punted on that.

Let me know if you see otherwise.

Thanks,

                Ilya
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html
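
To make the ordering problem from the commit message concrete, here is a
minimal standalone sketch of the new_up_client/new_state example above.
The flag values and helper names are simplified stand-ins chosen for the
illustration, not the actual net/ceph/osdmap.c code:

#include <stdio.h>

/* Simplified stand-ins for the kernel's osd_state flags. */
#define EXISTS (1 << 0)
#define UP     (1 << 1)

/* new_up_client: mark the osd as existing and up. */
static unsigned apply_new_up_client(unsigned state)
{
	return state | EXISTS | UP;
}

/* new_state: xor the given bits into osd_state. */
static unsigned apply_new_state(unsigned state, unsigned xorstate)
{
	return state ^ xorstate;
}

int main(void)
{
	unsigned state = EXISTS;	/* osd6 exists but is down */

	/* Old behaviour: apply in encoding order (new_up_client first). */
	unsigned old = apply_new_state(apply_new_up_client(state), EXISTS);

	/* Fixed behaviour: apply new_state before new_up_client. */
	unsigned fixed = apply_new_up_client(apply_new_state(state, EXISTS));

	printf("old:   exists=%d up=%d\n", !!(old & EXISTS), !!(old & UP));
	printf("fixed: exists=%d up=%d\n", !!(fixed & EXISTS), !!(fixed & UP));
	/*
	 * old:   exists=0 up=1  -> "!EXISTS but UP", treated as down
	 * fixed: exists=1 up=1  -> osd is up, matching userspace
	 */
	return 0;
}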
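
Similarly, a toy sketch of the primary-affinity corner case discussed
above.  It assumes, per the userspace apply order described in the reply,
that clearing EXISTS resets the affinity to its default of 1; the struct,
helper names and aff=0.5 value are made up for the illustration:

#include <stdio.h>

#define EXISTS (1 << 0)
#define UP     (1 << 1)

struct osd {
	unsigned state;
	double aff;	/* primary affinity, default 1.0 */
};

/* Toy "apply" helpers mirroring the incremental entries in the example. */
static void up_client(struct osd *o)
{
	o->state |= EXISTS | UP;
}

static void new_state(struct osd *o, unsigned xorstate)
{
	o->state ^= xorstate;
	if (!(o->state & EXISTS))
		o->aff = 1.0;	/* assumed: affinity reset on removal */
}

static void primary_affinity(struct osd *o, double aff)
{
	o->aff = aff;
}

int main(void)
{
	double Y = 0.5;	/* arbitrary aff=Y from the example */

	/* Userspace order: affinity applied with weights, before states. */
	struct osd u = { EXISTS, 1.0 };
	primary_affinity(&u, Y);
	new_state(&u, EXISTS);
	up_client(&u);

	/* Kernel order: states, then up, with affinity applied last. */
	struct osd k = { EXISTS, 1.0 };
	new_state(&k, EXISTS);
	up_client(&k);
	primary_affinity(&k, Y);

	printf("userspace aff=%.1f, kernel aff=%.1f\n", u.aff, k.aff);
	/* userspace aff=1.0 (reset), kernel aff=0.5 (keeps aff=Y) */
	return 0;
}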