Securing bind with systemd methods (was: bind-mount of /run/systemd for chrooted bind9/named)

Marc Haber <mh+systemd-devel@xxxxxxxxxxxx> · Mon, 17 Jul 2023 14:44:03 +0200

Hi,

I'm back. This is my first attempt at a decent systemd unit for BIND 9
/ named, chrooted using named's own -t option, which keeps the chroot
minimal and free of code.

Here we go (this has been merged from various drop-in/override files, so
I don't guarantee correct syntax). I have interspersed my
comments/questions as # comments. If one of the suggested improvements
warrants filing an issue, let me know and I'll write well-explained
issues that can stand on their own.

The first phase of writing this unit was done with systemd 253 on Debian
unstable, the second phase on a production machine running Debian
stable with systemd 252.

[Unit]
Description=BIND Domain Name Server
Documentation=man:named(8)
After=network.target network-online.target
Wants=nss-lookup.target network-online.target
Before=nss-lookup.target
StartLimitIntervalSec=90s
StartLimitBurst=5

[Service]
Type=notify
ExecStart=/usr/sbin/named -f -u bind -c /etc/bind/named.conf -t /var/local/chroot/bind
# named(8): "In routine operation, signals should not be used to control
# the nameserver; rndc should be used instead." We're following
# upstream's advice here.
ExecReload=/usr/sbin/rndc reload
ExecStop=/usr/sbin/rndc stop
Restart=on-failure
RestartSec=5s
# I'd rather not have / as working directory and this looks the most
# sensible
WorkingDirectory=/var/local/chroot/bind
# Setting RootDirectory=/ results in service failure ("too many
# symlinks"), repeated StartLimitBurst times. I think this should be
# special-cased with a more descriptive error message if RootDirectory=/
# is unwanted. Why I tried that: a lot of the sandboxing directives
# only apply (or make sense) if RootDirectory= is set or the service is
# being chrooted. My service chroots itself, and I wanted systemd to
# know about that and enable the directives that only work when
# RootDirectory= is set. If I'm not making sense here, then it's a
# docs issue ;-)
#RootDirectory=/
ProtectProc=invisible
ProcSubset=pid
BindReadOnlyPaths=/run/systemd/notify:/var/local/chroot/bind/run/systemd/notify
BindReadOnlyPaths=/usr/share/dns:/var/local/chroot/bind/usr/share/dns
User=bind
Group=bind
UMask=077
# This means that my non-root service gets those three capabilities and
# is unable to obtain more, right? Would this warrant its own
# configuration directive along the lines of "service has exactly these
# capabilities, no more and no less"?
CapabilityBoundingSet=cap_net_admin cap_net_bind_service cap_sys_chroot
AmbientCapabilities=  cap_net_admin cap_net_bind_service cap_sys_chroot
NoNewPrivileges=true
# Haven't investigated the AppArmor profiles that come with bind yet
#AppArmorProfile
ProtectSystem=strict
ProtectHome=yes
# {Runtime,Cache,Configuration}Directory cannot be used
# because our bind chroots itself and those directives only
# create directories under the standard paths. This makes those
# directives useless in the case where a service chroots itself and
# needs its Cache, Configuration etc inside the chroot. Maybe it
# makes sense to adapt the functionality to support this case?
#RuntimeDirectory=bind
ReadWritePaths=/var/local/chroot/bind/run
#CacheDirectory=bind
ReadWritePaths=/var/local/chroot/bind/var/cache/bind
#ConfigurationDirectory=bind
ReadOnlyPaths=/
InaccessiblePaths=-/lost+found
NoExecPaths=/
# /lib is necessary here, or execve will fail without indicating the
# reason. That was a surprise and hard to debug because even strace
# didn't point me towards the real issue.
ExecPaths=/usr/sbin/named /usr/sbin/rndc /lib
PrivateTmp=true
PrivateDevices=true
PrivateIPC=true
# Enabling PrivateUsers=true causes bind to not bind to its ports and
# log "couldn't add command channel 127.0.0.1#953: permission denied".
# What does PrivateUsers= have to do with binding to ports?
ProtectHostname=true
ProtectClock=true
ProtectKernelTunables=true
ProtectKernelModules=true
ProtectKernelLogs=true
ProtectControlGroups=true
# if AF_UNIX is mentioned in systemd.exec(5), maybe mentioning
# AF_NETLINK would also be in order? This was also one of the
# solutions I had to pull from an strace.
RestrictAddressFamilies=AF_NETLINK AF_UNIX AF_INET AF_INET6
RestrictNamespaces=~user pid net uts mnt cgroup ipc
LockPersonality=true
MemoryDenyWriteExecute=true
RestrictRealtime=true
RestrictSUIDSGID=true
RemoveIPC=true
# My first version of SystemCallFilter was like ~@mount ~@swap
# ~@resources etc, which didn't work. Reading the docs with a computer
# scientist's mind ("Informatiker") gave a hint, but I think this is
# hard to understand for people who haven't had formal training. I
# also understand that this is hard to change without changing semantics
# for existing units, so maybe a few examples in systemd.exec(5) might
# ease this, although the SystemCallFilter chapter in systemd.exec(5) is
# already long. @raw-ip isn't available in systemd 252, so I had to
# template that in my Ansible. And setuid is setuid32 on 32-bit archs
# like armhf, so I had to template _that_ for my Banana Pi.
SystemCallFilter=~@mount @swap @raw-ip @resources @reboot @privileged @obsolete \
    @module @debug @cpu-emulation @clock
SystemCallFilter=chroot setuid
SystemCallArchitectures=native

[Install]
WantedBy=multi-user.target
# strangely, this alias only holds if the unit is enabled. If the unit
# is disabled, the alias is not available which was kind of a surprise.
Alias=bind9.service
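
To illustrate the SystemCallFilter= syntax point from the comments in the
unit: as far as I read systemd.exec(5), only a "~" as the first character
of the list inverts it. The following is a minimal sketch, assuming my
first attempt was written on a single line; the set names are shortened
for the example.

```ini
[Service]
# Did not work for me: "~" in front of every entry. Only the first
# character of the list is treated as the negation marker.
#SystemCallFilter=~@mount ~@swap ~@resources

# Works: one leading "~" turns the whole list into a deny-list.
SystemCallFilter=~@mount @swap @resources

# A later line without "~" takes the listed calls out of the deny-list
# again, which is how chroot and setuid stay usable in the unit above.
SystemCallFilter=chroot setuid
```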

Generally, the error messages I received during the debugging phase were
not very helpful. I frequently had to resort to strace -p 1 to find out
what exactly went wrong trying to start named.

For example, there is no exact feedback when the daemon is terminated
because of a SystemCallFilter violation. I'd like the system call in
question to be part of the log.
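
One possible way to narrow this down is a temporary drop-in that logs
instead of denies. This is an untested sketch, not something I used for
the unit above. The file name is hypothetical, SystemCallLog= needs
systemd 247 or later according to systemd.exec(5), and I am assuming
the kernel reports seccomp events to the kernel log or audit log, where
the system call shows up as a number rather than a name.

```ini
# /etc/systemd/system/named.service.d/debug-syscalls.conf
# (hypothetical file name, remove again after debugging)
[Service]
# An empty assignment resets the filter from the main unit. This
# weakens the sandbox for as long as the drop-in is in place.
SystemCallFilter=
# Log calls from the sets that the main unit would deny.
SystemCallLog=@mount @swap @resources @reboot @privileged @obsolete \
    @module @debug @cpu-emulation @clock

# Alternative that keeps the filter in place: let a denied call fail
# with EPERM instead of terminating the process, so that the daemon
# gets a chance to report the failing operation itself.
#SystemCallErrorNumber=EPERM
```

After "systemctl daemon-reload" and a restart of the service, the logged
calls should appear in "journalctl -k" or in the audit log, depending
on the setup.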

The same applies to sandboxing directives that take paths as arguments.
My way to debug was either removing directives at random to narrow down
the possible cause, or running strace again to find out what my daemon
tried to do before it was terminated.

Those things might be out of scope for systemd, I simply don't know.

With this unit, systemd-analyze security named is now down to "1.9 OK".
I think it was > 9 with the standard unit.

Thanks for your help, I wanted to give something back. I'll probably
suggest this unit for the Debian package once it has reached some
stability.

Greetings
Marc

-- 
-----------------------------------------------------------------------------
Marc Haber         | "I don't trust Computers. They | Mailadresse im Header
Leimen, Germany    |  lose things."    Winona Ryder | Fon: *49 6224 1600402
Nordisch by Nature |  How to make an American Quilt | Fax: *49 6224 1600421