Re: Introduce Storage Instantiation Daemon - Fedora 33 System-Wide Change proposal

On 7/1/20 9:50 AM, Zbigniew Jędrzejewski-Szmek wrote:
> On Tue, Jun 30, 2020 at 03:18:57PM -0400, Ben Cotton wrote:
>> == Benefit to Fedora ==
>> The main benefit is centralizing the solution to issues that storage
>> subsystem maintainers have been hitting with udev, that is:
>>
>> * providing a central infrastructure for storage event processing,
>> currently targeted at udev events
>>
>> * improving the way storage events and their sequences are recognized,
>> which previously required complex udev rules
>>
>> * single notion of device readiness shared among various storage
>> subsystems (single API to set the state instead of setting various
>> variables by different subsystems)
>>
>> * providing richer possibilities to store and retrieve
>> storage-device-related records compared to the udev database
>>
>> * direct support for generic device grouping (matching
>> subsystem-related groups like LVM, multipath, MD... or creating
>> arbitrary groups of devices)
>>
>> * centralized solution for scheduling triggers with associated actions
>> defined on groups of storage devices
> 
> This sounds interesting. Assembling complex storage from udev rules is
> not easy, in particular because while it is easy to collect devices
> and handle the case where all awaited devices have been detected, it's
> much harder to do timeouts or partial assembly or conditional
> handling. A daemon can listen to hotplug events, keep internal state,
> and take decisions based on configuration, time and events.
> 

Exactly, that's also one of the areas we'd like to cover here - partial
activation based on policies. This is hard to do within pure udev... or at
least, at the moment, we'd need to combine several *external* pieces besides
udev to make it work at all. SID will try to provide the infrastructure to
implement this in one place.

> OTOH, based on this description, SID seems to want to take on some
> bigger role, e.g. by providing an alternate execution and device
> description mechanism. That sounds unnecessary (since udev does that
> part reasonably well) and complex (also because support would have to
> be added to consumers who currently get this data from udev). I would
> love to see a daemon to handle storage devices, but with close
> cooperation with udev and filling in the bits that udev cannot provide.
>
Not quite. If it sounds as if SID is taking over most of udev's
responsibility, then no. It's trying to build on top of it - still treating
udev as the low-level layer for event processing based on simple rules. SID
then adds the abstraction we need mainly for storage - that is, the grouping,
state recording and delayed trigger/action parts.

The issue with udev is that it concentrates on single-device processing and
on the current state (yes, we have IMPORT{db}, but that's only good for
simple records). But this is OK for a low-level tool.
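
(Just to illustrate what I mean by "simple records" - a rule can pull a
single flat key for the same device back out of the udev db, e.g. to reuse a
value detected on an earlier event; this is a simplified illustration, not a
rule we actually ship:)

    # reuse the filesystem type detected on an earlier event instead of probing again
    ACTION=="change", ENV{ID_FS_TYPE}=="", IMPORT{db}="ID_FS_TYPE"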

Also, udev's primary job is to record these single-device properties and then
to create the /dev content so the devices are accessible. But there are
actions we don't need to execute within the udev context at all - e.g. the
device activation itself. And there are other places where we fall short with
udev, like the rule language itself: if you need to define more complex
logic, you have to call out to external commands (and that is just another
fork, just another delay). Even comparing the values of two variables is not
possible in udev (you can only compare against a literal constant).
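
(A sketch of what I mean - the helper name and properties below are made up,
purely for illustration:)

    # possible: match a property against a literal constant or glob
    ENV{ID_FS_TYPE}=="LVM2_member", ENV{MY_FLAG}="1"
    # not possible: something like ENV{FOO}==ENV{BAR}; the only way is to
    # fork an external helper and match on its exit code
    PROGRAM=="/usr/lib/udev/my-compare-helper $env{FOO} $env{BAR}", ENV{MY_FLAG}="1"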

With SID, for backwards compatibility and for udev db readers, we still have
the possibility to export selected information from SID to the udev db, if
needed (importing from and exporting to the udev environment is just a matter
of using a dedicated namespace we have in the SID db). But I think storage
subsystems would go for SID directly if it provides this domain-specific
information - it's just adding more detail to what udev can see.

What I would like to see in the future, though, is closer cooperation between
udevd and SID, where udevd could still record those simple generic
single-device properties as it does today and, if it sees that a device falls
under a certain domain (like "storage" here), udevd itself could contact the
domain-specific daemon/resource for more information and then provide that
through its interface. Similar logic could apply for a "network" domain, etc.
All these domain-specific external resources could be registered with udevd.
But this is for a later time and much more discussion...

>> * adding a centralized solution for delayed actions on storage devices
>> and groups of devices (avoiding unnecessary work done within udev
>> context and hence avoiding frequent udev timeouts when processing
>> events for such devices)
> I don't think such timeouts are common. Currently the default worker
> timeout is 180s, and this should be enough to handle any device hotplug
> event. And if there are things that need to be executed that take a
> long time (for example some health check), then systemd units should be used
> for this. Udev already has a mechanism to schedule long-running systemd
> jobs in response to events. So I don't think we should add anything new
> here. So maybe I'm misunderstanding what this point is about?

Well, in my experience (reports and discussions from our support teams),
timeouts do appear quite often. 180s might not be enough for everyone if you
have hundreds or thousands of devices. And these timeouts are usually quite
hard to debug (most of them are also races of different kinds). Devices could
also be in a suspended state, inaccessible. And it's good for us to have a
wider picture of the device group, rather than a single device, to take the
proper actions. Also, on a timeout, udev usually silently drops the event,
without a possibility to execute fallback/corrective actions.
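
(For completeness: the worker timeout is tunable, e.g. via event_timeout= in
/etc/udev/udev.conf or udev.event_timeout= on the kernel command line, but on
boxes with thousands of devices raising it only shifts the problem:)

    # /etc/udev/udev.conf
    event_timeout=300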

Yes, we can use SYSTEMD_WANTS in udev to instantiate a service outside the
udev context (we actually do that in lvm2 with lvm2-pvscan@.service, which is
responsible for scanning, deciding whether to activate, and then activating
if applicable). The issue here is that a systemd unit itself doesn't have any
state - I mean, you can't store any information in it - you need an external
entity for that, a daemon? Or a database somewhere... which SID has all in
one place, and it tries to centralize/standardize this for storage devices.
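
(Roughly, the udev-side hook looks like this - a simplified sketch, not the
exact rule lvm2 ships:)

    # when an LVM PV signature is detected, pull in the pvscan service instance
    SUBSYSTEM=="block", ENV{ID_FS_TYPE}=="LVM2_member", TAG+="systemd", ENV{SYSTEMD_WANTS}+="lvm2-pvscan@$major:$minor.service"
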
When SID processes an event, it has what I call "stage A" and "stage B":
"A" executes while SID is being contacted from within a udev rule, and "B"
holds the triggers/actions - the delayed execution. The important thing is
that both stages use the same state, the same database underneath - it's all
in one place.

There are also other small issues, like the fact that I can bind a service
unit's existence only to a device unit's existence, which does not always
apply to storage. For various subsystems, the layers in a stack are defined
by the existence of a signature on disk... But that's just another little
detail that adds to the overall storage matters we'd like to solve here.

> 
>> == How To Test ==
>> * Basic testing involves (considering we have at least multipath
>> and/or LVM module present as well):
>> ** installing new 'sid' package
>> ** installing device-mapper-multipath and/or lvm module (presumably
>> named device-mapper-multipath-sid-module and lvm2-sid-module)
>> ** creating a device stack including device-mapper-multipath and/or LVM volumes
>> ** booting with 'sid.enabled=1' kernel command line
>> ** checking device-mapper-multipath and/or LVM volumes are correctly activated
> 
> Do you plan to handle multi-device btrfs?

Honestly, I haven't looked at btrfs in much detail to know exactly how it
does the scanning and activation part. But if it can do a single scan, add
the device to the detected group and then fire the "activation" (mount) when
the group is complete (or not necessarily complete, but "activatable"), it
could fit into a SID module. Btrfs is a bit special here, though - it's a
filesystem, not a block device, while SID concentrates primarily on block
devices. So I'd be a little wary here; I'd need to check the details to be
sure.

-- 
Peter



