Re: systemd-sysupdate support for slow rollout (aka A/B testing)

Nils Kattenbeck <nilskemail@xxxxxxxxx> · Tue, 2 Jan 2024 14:40:08 +0100

> > does sysupdate currently support any way to slowly roll out updates
> > where the server providing the files can be in control? [...]
>
> This is currently not available, no.
>
> The idea so far was always that the server is dumb, and the client
> picks the release it wants.

I feel it would be more flexible to have the client mostly handle
transferring and applying the data, with any additional logic handled
by either the server or secondary applications which call into
sysupdate (or its future D-Bus API).

> I have thought about this usecase a while back, and my thinking was
> that such a staged update logic should be driven by the machine
> ID. i.e. we should teach sysupdate a simple logic that allows pattern
> matching of new versions based on some arithmetic of the machine
> ID. More specifically, include some value in the URL pattern that
> indicates the percentage of hosts that shall update to this
> release. Then, each client takes its machine ID, treats it as an
> integer and calculates modulo 100 of it or so, and then checks if the
> resulting value is below the intended percentage, and if so it
> updates, otherwise it doesn't.
>
> (or something like that, the above is probably not ideal, since it
> would mean it's always the same hosts that try a new release first,
> and it probably should be evened out across the set of clients).

Any logic based on the machine ID would also have the problem I
mentioned below: the ratios would be skewed for stateless devices
which cannot persist their machine ID to disk.
One could at least override it with something persistent like a MAC
address, though. This could be exposed as an argument or environment
variable which a secondary application sets before calling sysupdate.
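
As a rough illustration of such a client-side check, here is a minimal
Python sketch. It is not what sysupdate does today. The ROLLOUT_ID
environment variable and all function names are made up for
illustration, and hashing the identifier together with the release
version is my own assumption for avoiding that the same hosts always go
first.

```python
import hashlib
import os


def rollout_bucket(identifier: str, release: str) -> int:
    """Map (identifier, release) to a stable bucket in the range 0..99.

    Mixing in the release means a host lands in a different bucket for
    each release, so early adopters are spread across the fleet.
    """
    digest = hashlib.sha256(f"{release}:{identifier}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % 100


def should_update(identifier: str, release: str, percentage: int) -> bool:
    """Return True if this host falls inside the rollout percentage."""
    return rollout_bucket(identifier, release) < percentage


def read_identifier(environ=None, path="/etc/machine-id") -> str:
    """Prefer a persistent override (hypothetical ROLLOUT_ID variable,
    e.g. a MAC address) and fall back to the machine ID file."""
    environ = os.environ if environ is None else environ
    override = environ.get("ROLLOUT_ID", "").strip()
    if override:
        return override
    with open(path) as f:
        return f.read().strip()
```

A wrapper around sysupdate could call
should_update(read_identifier(), "47.11", 3) and only trigger the
update when it returns True.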

> This would then mean for the server that it would first serve
> foobar_47.11_3.raw which would be version 47.11 of the OS, and 3% of
> the hosts would update to it. And then, once you collected enough
> feedback you'd rename the file to foobar_47.11_25.raw and 25% of the
> hosts would switch over. Finally you'd set the value to 100 (or maybe
> just drop it, which should be considered equivalent to 100), and then
> all remaining hosts would update.
>
> The effect of this is that clients could still explicitly upgrade if
> they want, and the updates would be entirely driven by the clients,
> but simply via naming the download images the server can control that
> "by default" only the chosen number of clients update.

The explicit update by clients is definitely a nice bonus, though this
can also be achieved by a secondary set of definitions looking for
files under s3.domain.com/rc/.
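
To make the naming scheme quoted above concrete, here is a small Python
sketch of how a client could split a file name like foobar_47.11_3.raw
into version and rollout percentage. The regular expression and the
function name are my own assumptions. This is not sysupdate's
MatchPattern= syntax, only an illustration of the idea that a missing
percentage counts as 100.

```python
import re
from typing import Optional, Tuple

# Hypothetical pattern: foobar_<version>[_<percent>].raw
IMAGE_NAME = re.compile(
    r"^foobar_(?P<version>[0-9]+(?:\.[0-9]+)*)(?:_(?P<percent>[0-9]{1,3}))?\.raw$"
)


def parse_image_name(name: str) -> Optional[Tuple[str, int]]:
    """Return (version, percent), or None if the name does not match.

    A name without a percentage field is treated as a 100% rollout.
    """
    match = IMAGE_NAME.match(name)
    if match is None:
        return None
    percent = int(match.group("percent")) if match.group("percent") else 100
    if percent > 100:
        return None  # reject nonsensical percentages
    return match.group("version"), percent
```

With this, foobar_47.11_3.raw and foobar_47.11_25.raw both parse as
version 47.11 with percentages 3 and 25, while foobar_47.11.raw is
treated as a full rollout.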

> > Currently it seems like I would have to implement a different service
> > which calls the sysupdate binary (or uses dbus once #28134 has landed)
> > and then decides based on some other information.
> >
> > One idea I had would be that systemd-pull could send the machine-id
> > based on which the server could then decide to provide the newer file
> > (e.g. last two chars == "00" would roll it out to ~1/256). Though I am
> > not sure if sd-pull is supposed to be "anonymous", i.e. do not provide
> > this identifying information. Another drawback of this would be that
> > stateless systems which reboot often get a new machine-id each boot,
> > thus having an increased chance to get the newer version.
>
> So this idea is not entirely different from my idea, I was just
> thinking about pushing this into sysupdate rather than pull.
>
> > Does anything like this already exist or is planned? Or should that be
> > done by different applications on the client side?
>
> I think it makes a ton of sense to add this to sysupdate. Would love
> to review/merge a patch for that.
>
> > I also remember there being a discussion about plugging in different
> > sd-pull like implementations/backends[1] to support delta updates,
> > other transports, or TLS client authentication. This could at least be
> > adapted to support my idea to send the machine-id as an HTTP header
> > (e.g. X-MACHINE-ID).
>
> If we can avoid it, I'd always adopt a logic where identifying info
> doesn't have to be sent to the server. After all the logic should be
> generic and applicable in scenarios where the client should get
> anonymity as much as it wants.

If the client automatically applies updates the server could always
deliver an image which exposes information by e.g. simply updating the
Path= to include %m somewhere in it.
Though I agree that always sending such information in headers would
not be optimal.

I also found out that sd-import drops query parameters from the URL.
If this were not the case my use case would already be possible by
embedding the machine ID as part of the query.
This would also make it possible to opt in to sending the information.

The problem, I think, is that there are two user groups of sysupdate
with different requirements.
On one hand we have end-user distributions with A/B style updates,
where the distribution has little to no interest in precise control
of updates and user devices, and the users wish for anonymity.
On the other hand there are enterprises which deploy sysupdate for
(I)IoT devices. In these cases devices commonly have to be registered
anyhow, and the enterprise controls how updates are rolled out etc.
Here anonymity is not necessary, and instead customers often pay
the enterprise to perform all the management on their behalf.

The latter scenario is what I am currently focusing on.
Especially once we start to talk about things like rolling out an
update for a specific customer or in a specific region first, this
logic would have to be implemented in another application or by the
server anyhow. Moving this logic to the server would also have the
advantage that decisions like including the "@[rollout percentage]" as
part of the URL do not have to be made up front but can be made
later on.
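
For comparison, a server-side decision could look roughly like the
following Python sketch. It assumes the client sends its identifier in
some form (e.g. the X-MACHINE-ID header idea from my first mail) and
that the identifier ends in two hex characters, which span 256 values.
The function name and the "slots" parameter are hypothetical.

```python
from typing import Optional

HEX_DIGITS = set("0123456789abcdef")


def served_version(machine_id: Optional[str], stable: str,
                   candidate: str, slots: int) -> str:
    """Pick the version to serve for one request.

    The last two hex characters of the identifier give a value in
    0..255; the candidate is served to hosts whose value is below
    `slots`, i.e. to roughly slots/256 of the fleet.
    """
    if not machine_id:
        return stable  # anonymous clients stay on the stable release
    suffix = machine_id.strip().lower()[-2:]
    if len(suffix) != 2 or not set(suffix) <= HEX_DIGITS:
        return stable  # malformed identifier, be conservative
    return candidate if int(suffix, 16) < slots else stable
```

With slots=1 only identifiers ending in "00" get the candidate, and
raising slots later widens the rollout without touching any URL or
file name.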

> The machine-id we usually consider a "half-secret", i.e. all local
> programs get access to it (unless sandboxed), but they are not
> supposed to send it across the wire. If they really need to send
> some identifier across the wire they should derive an app-specific ID
> instead, which we make easy to acquire via
> sd_id128_get_machine_app_specific().
>
> But better than app-specific machine IDs are no machine IDs at all in
> the protocol, if we can get away with it. Hence, my idea of doing the
> rollout percentage logic client-side.
>
> Lennart
>
> --
> Lennart Poettering, Berlin