Re: rbd kernel client fencing

Hi,

On Wed, Apr 19, 2017 at 9:08 PM, Chaofan Yu <chaofanyu@xxxxxxxxxxx> wrote:
> Thank you so much.
>
> The blacklist entries are stored in the osdmap, which is supposed to be tiny and clean.
> So we are doing similar cleanups after reboot.

In the face of churn this won't necessarily matter much, as I believe some
osdmap history is kept around, so the entries only eventually fall off. This
may also have improved since; my bad experiences were from around hammer.
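
For reference, the kind of post-reboot cleanup being described boils down to
something like this (a rough sketch; the ip is a placeholder and would be
detected on the host itself):

  # Drop any blacklist entries that belong to this host's ip.
  MY_IP=10.0.0.1
  ceph osd blacklist ls 2>/dev/null | grep "^${MY_IP}:" | awk '{print $1}' |
  while read addr; do
      ceph osd blacklist rm "$addr"
  done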

> I’m quite interested in how the host commits suicide and reboots,

echo b >/proc/sysrq-trigger # This is about as brutal as it gets

The machine is blacklisted; it has no hope of reading anything from or writing
anything to an rbd device.

There are a couple of caveats that come with this:
 - Your workload needs to structure its writes in such a way that it can
   recover from this kind of failure.
 - You need to engineer your workload so that it can tolerate a machine
   falling off the face of the earth (i.e. a combination of a workload
   scheduler like mesos/aurora/kubernetes and some HA where necessary).
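
To make the "commit suicide" part concrete, the watcher amounts to something
along these lines (a minimal sketch, not our actual implementation; the ip
detection in particular is simplified):

  #!/bin/sh
  # Poll the osdmap blacklist; if our own ip shows up, reboot as hard as possible.
  MY_IP=$(hostname -i | awk '{print $1}')
  while sleep 10; do
      if ceph osd blacklist ls 2>/dev/null | grep -q "^${MY_IP}:"; then
          echo b > /proc/sysrq-trigger
      fi
  done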

> can you successfully umount the folder and unmap the rbd block device
>
> after it is blacklisted?
>
> I wonder whether the IO will hang and the umount process will get stuck in D state,
>
> so that the host cannot be shut down, since it is waiting for the umount to finish

No, see previous comment.

> ==============================
>
> and now that the CentOS 7.3 kernel supports the exclusive-lock feature,
>
> could anyone give the new failover flow?

This may not be what you think it is, see e.g.:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-September/004857.html

(And I can't really provide you with much more context; I've primarily
registered that it isn't made for fencing image access. It's all about
arbitrating modification, in support of e.g. object-map.)
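
And since the original question was about the lock/blacklist flow: on the new
host, the sequence quoted further down amounts to roughly the following (a
sketch only; the image name, lock id and addresses are made up):

  IMG=mypool/myimage
  rbd lock list $IMG                        # note the locker, lock id and address,
                                            # e.g. client.4567 / mylock / 1.2.3.4:0/123456
  ceph osd blacklist add 1.2.3.4 3600       # blacklist the old owner by ip (i.e. 1.2.3.4:0/0)
  rbd lock remove $IMG mylock client.4567   # break the old lock
  rbd lock add $IMG mylock                  # take the lock on the new host
  rbd map $IMG                              # only now map the image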

>
> Thanks.
>
>
>> On 20 Apr 2017, at 6:31 AM, Kjetil Jørgensen <kjetil@xxxxxxxxxxxx> wrote:
>>
>> Hi,
>>
>> As long as you blacklist the old owner by ip, you should be fine. Do
>> note that rbd lock remove implicitly also blacklists unless you also
>> pass rbd lock remove the --rbd_blacklist_on_break_lock=false option.
>> (that is I think "ceph osd blacklist add a.b.c.d interval" translates
>> into blacklisting a.b.c.d:0/0 - which should block every client with
>> source ip a.b.c.d).
>>
>> Regardless, I believe the client taking out the lock (rbd cli) and the
>> kernel client mapping the rbd will be different (port, nonce), so
>> specifically if it is possible to blacklist a specific client by (ip,
>> port, nonce) it wouldn't do you much good where you have different
>> clients dealing with the locking and doing the actual IO/mapping (rbd
>> cli and kernel).
>>
>> We do a variation of what you are suggesting, although additionally we
>> check for watchers: if the image is watched, we give up and complain rather
>> than blacklist. If the previous lock was held by my ip, we just silently
>> reclaim. The hosts themselves run a process watching for blacklist
>> entries, and if they see themselves blacklisted they commit suicide and
>> reboot. On boot, the machine removes its blacklist entry and reclaims any
>> locks it used to hold before starting the things that might map rbd
>> images. There are some warts in there, but for the most part it works
>> well.
>>
>> If you are going the fencing route, I would strongly advise you to also
>> ensure your process can't end up causing cascading blacklists; in addition
>> to being highly disruptive, it causes osdmap churn. (We accidentally did
>> this and ended up almost running our monitors out of disk.)
>>
>> Cheers,
>> KJ
>>
>> On Wed, Apr 19, 2017 at 2:35 AM, Chaofan Yu <chaofanyu@xxxxxxxxxxx> wrote:
>>> Hi list,
>>>
>>>  I wonder if someone can help with rbd kernel client fencing (aimed at
>>> avoiding simultaneous rbd map on different hosts).
>>>
>>> I know the exclusive-lock rbd image feature was added later to avoid manual
>>> rbd lock CLIs, but I want to understand the earlier blacklist solution.
>>>
>>> The official workflow I’ve got is listed below (without the exclusive-lock
>>> feature):
>>>
>>> - identify old rbd lock holder (rbd lock list <img>)
>>> - blacklist old owner (ceph osd blacklist add <addr>)
>>> - break old rbd lock (rbd lock remove <img> <lockid> <addr>)
>>> - lock rbd image on new host (rbd lock add <img> <lockid>)
>>> - map rbd image on new host
>>>
>>>
>>> The blacklisted entry is identified by entity_addr_t (ip, port, nonce).
>>>
>>> However, as far as I know, the ceph kernel client will reconnect its socket
>>> if the connection fails. So I wonder whether it won’t work in this scenario:
>>>
>>> 1. old client network down for a while
>>> 2. perform below steps on new host to achieve failover
>>> - identify old rbd lock holder (rbd lock list <img>)
>>>
>>> - blacklist old owner (ceph osd blacklist add <addr>)
>>> - break old rbd lock (rbd lock remove <img> <lockid> <addr>)
>>> - lock rbd image on new host (rbd lock add <img> <lockid>)
>>> - map rbd image on new host
>>>
>>> 3. the old client's network comes back and it reconnects to the OSDs with a
>>> newly created socket client, i.e. a new (ip, port, nonce) tuple
>>>
>>> As a result, both the new and the old client can write to the same rbd
>>> image, which could potentially cause data corruption.
>>>
>>> So does this mean that if the kernel client does not support the
>>> exclusive-lock image feature, fencing is not possible?
>>>
>>> _______________________________________________
>>> ceph-users mailing list
>>> ceph-users@xxxxxxxxxxxxxx
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>
>>
>>
>> --
>> Kjetil Joergensen <kjetil@xxxxxxxxxxxx>
>> SRE, Medallia Inc
>> Phone: +1 (650) 739-6580
>



-- 
Kjetil Joergensen <kjetil@xxxxxxxxxxxx>
SRE, Medallia Inc
Phone: +1 (650) 739-6580
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



