Re: rbd map automation woes

On Tue, Apr 22, 2014 at 12:29 PM, Ilya Dryomov <ilya.dryomov@xxxxxxxxxxx> wrote:
> On Tue, Apr 22, 2014 at 1:58 PM, Hannes Landeholm <hannes@xxxxxxxxxxxxxx> wrote:
>> Hello,
>>
>> We're doing some rbd map automation and a week ago we had a problem
>> where an rbd map failed and the system contained the following error:
>>
>> systemd-udevd[138]: worker [24919] /devices/virtual/block/rbd28 timeout; kill it
>> systemd-udevd[138]: seq 6903 '/devices/virtual/block/rbd28' killed
>> systemd-udevd[138]: worker [24919] terminated by signal 9 (Killed)
>>
>> Afterwards the system had a problematic state where the block device
>> was mapped but no symlink was created for it in /dev/rbd/$pool/$name.
>> This caused us to investigate the problem, and we realized udevd is
>> responsible for creating this symlink and that it had for some reason timed
>> out. My guess is that ceph/rbd had some temporary slowness or
>> connectivity issue (which is normal and expected). The big problem
>> here is that rbd map relies on the udev system to complete and this
>> architecture/dependency makes rbd mapping unautomatable:
>>
>> "It's completely wrong to launch any long running task from a udev
>> rule and you should expect that it will be killed."
>>
>> http://lists.freedesktop.org/archives/systemd-devel/2012-November/007390.html
>
> Not exactly sure what happened there, but that task is by no means long
> running.  It's a small shell script that pokes some sysfs attributes
> and outputs a name of the symlink for udev to create.

Yeah, I just figured that a temporary network problem, a load spike,
etc. could cause the sysfs operation to block for a while; ideally
that should just make the rbd map take longer, not make it fail mid
operation. Since udev explicitly claims that it will never support
anything that runs for a "longer" time, this will keep introducing
failures. Even if the probability per operation is low because it
requires some coincidental state, the probability of it happening
regularly approaches 100% as the number of mappings and unmappings
grows large enough. Timeouts and sleeps are anti-patterns when
building robust automated systems.
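
For what it's worth, the recovery we end up scripting to get back to
square zero looks roughly like the sketch below. This is just a sketch
on our side: it assumes the kernel driver exposes
/sys/bus/rbd/devices/<id>/pool and /sys/bus/rbd/devices/<id>/name
(which is what we see on our hosts), and the helper name is ours, not
anything in ceph:

    import os

    SYSFS_RBD = "/sys/bus/rbd/devices"

    def find_rbd_id(pool, image):
        """Scan sysfs for the id the kernel allocated to pool/image.

        Returns the id as a string, or None if the image does not
        appear to be mapped on this host."""
        if not os.path.isdir(SYSFS_RBD):
            return None
        for dev_id in os.listdir(SYSFS_RBD):
            base = os.path.join(SYSFS_RBD, dev_id)
            try:
                with open(os.path.join(base, "pool")) as f:
                    dev_pool = f.read().strip()
                with open(os.path.join(base, "name")) as f:
                    dev_name = f.read().strip()
            except IOError:
                continue  # device went away while we were scanning
            if dev_pool == pool and dev_name == image:
                return dev_id
        return None

With that we can at least run "rbd unmap /dev/rbd<id>" after a
half-failed map without grepping the system log, but it is a
workaround for the missing guarantee, not a fix.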

>>
>> We could always use retry wrappers and other ugly race barriers (we
>> already do) around rbd mapping and unmapping even though it's a huge
>> architecture smell, but not even that helps us in this case because the
>> rbd map can both fail and have side effects that cause subsequent
>> retries to also fail. Commands with side effects are okay-ish as long
>> as they guarantee that they are idempotent, but in this case we don't
>> even have this guarantee.
>>
>> This caused us to consider switching to using raw rbd numbers to avoid
>> depending on the udev system at all. Unfortunately the design of "rbd
>> map" strive to be a "unix 'worse is better' side effect" rather than a
>> pure mathematical function that takes a pool and a block name and
>> returns an id allocated for it. This causes the actual allocated id
>> after a map to be unknown and prevents this raw-number workaround. If
>> you get stuck in the intermediary state you also have to manually look
>> in the system log to understand what number was actually allocated to
>> know what "rbd unmap" command to run to get back to square zero.
>>
>> Note that if rbd is designed with the assumption that a human is
>> punching in one command at a time then the current architecture is
>> fine; otherwise there are some guarantees that are currently missing
>> IMO.
>
> Our qa scripts have the same problem (retry loops, etc.).  I have
> a branch (that will hopefully make it into 0.81) that makes 'rbd map'
> output the device node it created:
>
>     $ sudo rbd map foo/bar
>     /dev/rbd3
>
> It also adds the guarantee that 'rbd map' will not return until the
> device node is created and can be stat(2)ed, open(2)ed, etc, and that
> 'rbd unmap' will not return until the device node is gone.  It uses
> libudev to achieve that, and so it still depends on udev, but I don't
> think there is a sane way to get around udev on modern systems.
>
> Thanks,
>
>                 Ilya

Okay, it's unfortunate that it still depends on udev, but we would
really appreciate having those guarantees.
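
If that lands, our wrapper could shrink to roughly the sketch below
(again just a sketch on our side, assuming "rbd map" prints the device
node on stdout and returns only once it can be stat(2)ed, as you
describe):

    import os
    import subprocess

    def map_image(pool, image):
        # Assumes the patched "rbd map" prints the device node
        # (e.g. "/dev/rbd3") and does not return until it exists.
        out = subprocess.check_output(["rbd", "map", "%s/%s" % (pool, image)])
        dev = out.decode().strip()
        if not os.path.exists(dev):
            # Should be unreachable if the guarantee holds; kept as
            # a cheap sanity check while we migrate.
            raise RuntimeError("rbd map printed %s but it is missing" % dev)
        return dev

No retry loops, no sleeps, no digging through the log.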

Does the "rbd returning the new device node" feature also depend on udev?

Also, is this branch public? I'd like to have a look.

Thank you for your time,
--
Hannes Landeholm
Co-founder & CTO
Jumpstarter - www.jumpstarter.io

☎ +46 72 301 35 62