Re: Working recovery with locked root user (rescue.service)

Chris Murphy <lists@xxxxxxxxxxxxxxxxx> · Thu, 10 Dec 2020 15:56:15 -0700

On Thu, Dec 10, 2020 at 1:07 PM Benjamin Berg <bberg@xxxxxxxxxx> wrote:
>
> Hi,
>
> On Thu, 2020-12-10 at 12:20 -0700, Chris Murphy wrote:
> > On Thu, Dec 10, 2020 at 5:40 AM Benjamin Berg <bberg@xxxxxxxxxx> wrote:
> > > Hi,
> > >
> > > so, the other day we had a major regression in the PAM stack[1] that,
> > > unfortunately, ended up hitting rawhide and the Fedora 33 testing (not
> > > stable) repository before being unpushed.
> > >
> > > In this case it was easy to work around as SSH was still working fine.
> > > But, it seems that rescue mode requires having a root password set,
> > > which we do not always do during the Fedora install.
> > >
> > >
> > > So, I think we should have an obvious way for users to enter recovery
> > > mode even with a locked root account.
> > >
> > > Currently rescue.service is executing "systemd-sulogin-shell" which in
> > > turn runs "sulogin" (part of util-linux). A workaround is to
> > > set SYSTEMD_SULOGIN_FORCE=1 in rescue.service, but that just disables
> > > authentication entirely.
> > >
> > > I suppose to improve this, we would need a kind of "sudologin" that
> > > accepts any user in the "wheel" group. Or maybe some other more rigid
> > > requirement like configuring the first admin user that was created.
> > >
> > > Anyone has a good idea on how to solve this?
> >
> > I solve it with early debug shell using boot param
> > systemd.debug-shell=1 but that presents a root login on tty9 without
> > needing a password.
>
> Yeah, if you are able to modify the command line and have the
> background, then it is really simple to bypass the authentication.
>
> > I'm under the impression authentication services aren't even available
> > for emergency or rescue targets (?). I also wonder what happens if we
> > move to systemd-homed and whether that can start sooner and provide
> > the ability to use rescue target? Or if it starts late enough that it
> > can't be used for rescue and then also what that means for non-root
> > use of rescue because with systemd-homed, there are no (human) users in
> > /etc at all.
>
> True, systemd-homed could be a problem.
>
> Maybe at the end of the day this is a lost cause?
>
> I mean, if you need to drop into rescue mode, you already need to have
> quite in-depth knowledge. So it could be better to focus on having more
> versatile solutions. Like being able to revert back to a known good
> state of the OS instead of providing a rescue shell.

There is also the case where sysroot fails to mount. That leaves us in
the initramfs, which is an even more limited environment. Falling over
at boot or during startup is certainly rare, but whatever the cause, it
often induces panic in even experienced users, in part because it's
rare.

rpm-ostree has a way to mostly solve the problem if the startup
failure is isolated to a particular deployment. But it could still
have the rare case where it falls over in the initramfs. So that's a
hole that would be nice to fix because it's something all Fedora
editions and spins could fall into.

There's a wish list item / idea for a recovery partition from which a
system could be booted. Maybe it's a limited "netinstall" kind of
environment, to keep it space efficient. (While it's in the Fedora
Btrfs tracker, that doesn't mean system root must be Btrfs.)
https://pagure.io/fedora-btrfs/project/issue/23

There are also a couple of Btrfs-specific snapshot-rollback ideas:
https://pagure.io/fedora-btrfs/project/issue/18
https://pagure.io/fedora-btrfs/project/issue/31

A bit more tangentially related: can we make it easy and cheap for
folks to back up consistently, so that a reset is less painful? This is
neat, but it's probably a hard sell to depend on most users opting in,
however good an idea it is to back up regularly.
https://pagure.io/fedora-btrfs/project/issue/12

There are other ways boot+startup can fail besides a regression in
a package. We kinda need to look at all of them and see if it's
possible to take a holistic approach that solves a large chunk of them
at once. It's one reason why I'm not pushing hard for /boot on Btrfs,
because we don't need another option just to have another option.
There are actually good reasons to put /boot on Btrfs no matter what
the sysroot file system is, so if there's a way to "standardize"
regardless of the sysroot file system, we're better off. But without
/boot on Btrfs, we need some other way to deal with the disconnect on
rollback between the kernels on /boot and the possibly older modules
on an older sysroot snapshot.
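
To make the kernel/modules disconnect concrete, here is a rough sketch
of a check that lists kernels on /boot with no matching modules
directory in a given sysroot (for example, a mounted older snapshot).
It assumes kernel images are named vmlinuz-<version>, that modules
live under usr/lib/modules/<version> in the sysroot, and that rescue
images are named vmlinuz-0-rescue-<id>. The function name is made up
for illustration; this is not an existing tool.

```python
from pathlib import Path


def kernels_missing_modules(boot_dir, sysroot):
    """Return kernel versions that have an image in boot_dir but no
    modules directory under sysroot (assumed layout, see above)."""
    boot = Path(boot_dir)
    modules_root = Path(sysroot) / "usr" / "lib" / "modules"
    missing = []
    for image in sorted(boot.glob("vmlinuz-*")):
        version = image.name[len("vmlinuz-"):]
        # Assumption: rescue images never have a modules directory of
        # their own, so they are skipped rather than reported.
        if version.startswith("0-rescue-"):
            continue
        if not (modules_root / version).is_dir():
            missing.append(version)
    return missing


if __name__ == "__main__":
    import sys

    # Usage: python3 check.py /boot /path/to/mounted/snapshot
    for version in kernels_missing_modules(sys.argv[1], sys.argv[2]):
        print(f"no modules for kernel {version}")
```

Any version it prints is a kernel that would boot from /boot but find
no modules after rolling back to that sysroot.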

I personally am gravitating toward the idea of not updating the
currently running OS (sometimes called transactional system updates).
The update is applied out-of-band to a separate system root, we test
it in something like a container or VM, and only if it passes do we
make it the next active system at reboot time. There are some
complexities there, but rpm-ostree has already learned a lot of those
lessons, so maybe we wouldn't have to relearn them. This might make it
possible to avoid the need for a rollback. If the update fails to
apply or fails to work, just throw away that system root.
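
As a toy illustration of that flow (not how rpm-ostree or any
existing tool does it), the logic might look roughly like the sketch
below. The names transactional_update, apply_update and passes_test
are hypothetical. A plain directory copy stands in for what would
really be a snapshot or a new deployment, and the "test" is whatever
check one would run in a container or VM.

```python
import shutil
import tempfile
from pathlib import Path


def transactional_update(active_root, apply_update, passes_test, work_dir):
    """Update a copy of active_root and return the root to boot next.

    apply_update(root) and passes_test(root) are caller-supplied hooks
    (hypothetical). The active root itself is never modified.
    """
    # Stage a copy of the active root; a real system would more likely
    # use a snapshot here instead of copying files.
    staged = Path(tempfile.mkdtemp(prefix="staged-root-", dir=str(work_dir)))
    shutil.copytree(active_root, staged, symlinks=True, dirs_exist_ok=True)
    try:
        apply_update(staged)
        ok = passes_test(staged)
    except Exception:
        # A failed update counts the same as a failed test.
        ok = False
    if ok:
        return staged  # becomes the active root at the next reboot
    # Update failed or did not pass: throw the staged root away.
    shutil.rmtree(staged)
    return Path(active_root)
```

The point is only that the failure path is a delete of the staged
root, so nothing ever has to be rolled back on the running system.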

-- 
Chris Murphy
_______________________________________________
devel mailing list -- devel@xxxxxxxxxxxxxxxxxxxxxxx