Hello,

We're doing some rbd map automation, and a week ago we hit a problem where an rbd map failed and the system log contained the following errors:

systemd-udevd[138]: worker [24919] /devices/virtual/block/rbd28 timeout; kill it
systemd-udevd[138]: seq 6903 '/devices/virtual/block/rbd28' killed
systemd-udevd[138]: worker [24919] terminated by signal 9 (Killed)

Afterwards the system was left in a problematic state: the block device was mapped, but no symlink had been created for it in /dev/rbd/$pool/$name. Investigating, we realized that udevd is responsible for creating this symlink and that it had, for some reason, timed out. My guess is that ceph/rbd had some temporary slowness or connectivity issue (which is normal and expected).

The big problem here is that "rbd map" relies on the udev system to complete, and this architecture/dependency makes rbd mapping unautomatable:

"It's completely wrong to launch any long running task from a udev rule and you should expect that it will be killed."
http://lists.freedesktop.org/archives/systemd-devel/2012-November/007390.html

We could always wrap rbd mapping and unmapping in retry wrappers and other ugly race barriers (we already do), even though that is a huge architecture smell. But not even that helps in this case, because "rbd map" can both fail and leave behind a side effect that causes subsequent retries to also fail. Commands with side effects are okay-ish as long as they guarantee idempotence, but here we don't even have that guarantee.

This led us to consider switching to raw rbd device numbers so we don't depend on the udev system at all. Unfortunately, "rbd map" is designed as a Unix "worse is better" command with side effects rather than as a pure function that takes a pool and an image name and returns the id allocated for it. The id actually allocated by a map is therefore unknown to the caller, which rules out this raw-number workaround. And if you do get stuck in the intermediate state, you have to manually dig through the system log to find out which number was allocated, just to know which "rbd unmap" command to run to get back to square one. (A rough sketch of the sysfs lookup we are considering as a workaround is included below my signature.)

Note that if rbd is designed under the assumption that a human is punching in one command at a time, then the current architecture is fine; otherwise, there are some guarantees that are currently missing, IMO.

I would appreciate any thoughts or feedback on this.

Thank you for your time,

--
Hannes Landeholm
Co-founder & CTO
Jumpstarter - www.jumpstarter.io
☎ +46 72 301 35 62
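
P.S. For reference, here is the kind of lookup we are experimenting with to find the allocated device number without relying on the udev symlink or grepping the system log. It is only a rough sketch and it assumes that every mapped image is exposed under /sys/bus/rbd/devices/<id>/ with "pool" and "name" attributes (please correct me if that sysfs layout is not something we can rely on); the pool/image names at the bottom are made up for illustration:

    #!/usr/bin/env python3
    # Rough sketch: resolve a (pool, image) pair to the kernel-allocated
    # rbd device id by reading sysfs directly, so we do not have to wait
    # for the udev-created /dev/rbd/<pool>/<image> symlink or dig through
    # the system log after a half-failed "rbd map".
    #
    # Assumption: each mapped image appears as /sys/bus/rbd/devices/<id>/
    # with plain-text "pool" and "name" attributes.

    import os

    SYSFS_RBD = "/sys/bus/rbd/devices"

    def find_rbd_device(pool, image):
        """Return the numeric rbd id for pool/image, or None if not mapped."""
        if not os.path.isdir(SYSFS_RBD):
            return None  # rbd module not loaded / nothing mapped
        for dev_id in os.listdir(SYSFS_RBD):
            base = os.path.join(SYSFS_RBD, dev_id)
            try:
                with open(os.path.join(base, "pool")) as f:
                    dev_pool = f.read().strip()
                with open(os.path.join(base, "name")) as f:
                    dev_name = f.read().strip()
            except OSError:
                continue  # device went away while we were scanning
            if dev_pool == pool and dev_name == image:
                return int(dev_id)
        return None

    if __name__ == "__main__":
        # "mypool" / "myimage" are placeholders for illustration only.
        dev_id = find_rbd_device("mypool", "myimage")
        if dev_id is not None:
            # /dev/rbd<N> is usable even if the symlink was never created,
            # and the same id can be handed to "rbd unmap" for cleanup.
            print("/dev/rbd%d" % dev_id)

This would at least let our automation verify what was actually mapped and unmap it deterministically, but it is still a workaround for the missing "map returns the allocated id" guarantee described above.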