Re: dm-crypt: Automatically enable no_{read,write}_workqueue on SSDs

On 1/30/24 03:07, Ignat Korchagin wrote:
> On Tue, Jan 30, 2024 at 2:14 AM Adrian Vovk <adrianvovk@xxxxxxxxx> wrote:
>> Hello all,
>>
>> I am working as a contractor for the GNOME Foundation to integrate
>> systemd-homed into the GNOME desktop and related components, as part of
>> a grant by the Sovereign Tech Fund[1]. systemd-homed is a component of
>> systemd that, among other things, puts each user's home directory into a
>> separate LUKS volume so that each user can have their data encrypted
>> independently.
> Seems filesystem encryption is better suited here, but I see it
> already supports this option.

Filesystem encryption does not protect metadata, so we consider it weaker encryption for the purposes of homed. Thus we prefer LUKS, which is definitely more complicated but protects all user data.

It's also easier to move a loopback file between devices, which is another goal of homed.

>> I have recently come across a blog post from Cloudflare[2] that details
>> a significant (~2x) improvement to throughput and latency that they were
>> able to achieve on dm-crypt devices by implementing[3] and using the
>> no_read_workqueue and no_write_workqueue performance flags. These flags
>> bypass queuing that was put in place to optimize for the access patterns of
>> spinning media and to work around limitations of kernel subsystems at the time.
>>
>> Thus, to me it looks like these flags should default to on for SSDs
>> (i.e. /sys/block/<name>/queue/rotational is 0). Such a distinction based
>> on the type of storage media has precedent (e.g. selecting the `none` IO
>> scheduler for SSDs to improve throughput). So, I was going to change
> This is a different layer IMO. The IO scheduler is more "tightly
> coupled" with the underlying storage, whereas the device mapper framework
> is a higher-level abstraction. Does it even have the capability to look
> into the underlying storage device? What if there are several layers
> of device mapper (dm-crypt->lvm->dm-raid->device)?

I don't know if you have convenient access on the kernel side, but from the userspace side we can fairly easily resolve what is backing a dm-crypt device by probing sysfs a little.

As for several layers: I don't know. My use-case isn't that complicated. I suppose if you have dm-crypt on top of LVM, then as far as our check is concerned it's not on an SSD and we just don't turn on the flags (i.e. we don't walk the layers down to an actual physical disk; we only check the immediate parent, as in the sketch below).
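
To make that concrete, here is a minimal sketch of what such a userspace check could look like, assuming the usual sysfs layout. The device name "dm-0" and the helper names are made up for illustration; this is not code homed actually ships:

#!/usr/bin/env python3
# Sketch of the userspace check described above: look only at the
# immediate parents of a dm device in sysfs and treat anything that is
# not a plain non-rotational block device as "not an SSD".

from pathlib import Path


def parent_is_ssd(block_name: str) -> bool:
    """True only for a non device-mapper device that reports itself
    as non-rotational in sysfs."""
    if block_name.startswith("dm-"):
        # Another device-mapper layer underneath (LVM, RAID, ...):
        # don't walk further down, just treat it as "not an SSD".
        return False
    dev = Path("/sys/class/block", block_name).resolve()
    rot = dev / "queue" / "rotational"
    if not rot.exists():
        # Partitions have no queue/ directory; check the parent disk.
        rot = dev.parent / "queue" / "rotational"
    return rot.read_text().strip() == "0"


def immediate_parents(dm_name: str) -> list[str]:
    """The devices directly underneath a device-mapper device."""
    return [p.name for p in Path("/sys/block", dm_name, "slaves").iterdir()]


def should_bypass_workqueues(dm_name: str) -> bool:
    parents = immediate_parents(dm_name)
    return bool(parents) and all(parent_is_ssd(p) for p in parents)


if __name__ == "__main__":
    name = "dm-0"  # example name; in practice resolved from the mapping
    print(name, "->", should_bypass_workqueues(name))

It deliberately treats another device-mapper layer underneath as "not an SSD" instead of recursing, matching the simplification above.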

> I'm mentally comparing device mapper to a network OSI model: should
> the TCP protocol tune itself based on what data link layer it runs on
> (ethernet vs token ring vs serial port)?

In theory I suppose not. But I _really_ wouldn't be surprised if in practice people do this.

But also, TCP does make an effort to tune itself based on the bandwidth etc. that it finds available. I do not think dm-crypt can do this so easily.
>> systemd to turn these flags on when it detected an SSD, but I spoke to
>> the maintainers first and they suggested that I try to implement this in
>> cryptsetup instead. So I reached out there[4], and they suggested that I
>> should write to you, and that the kernel should set optimal defaults in
>> cases like this. They also mentioned recalling some hardware
>> configurations where these flags shouldn't be on.
> At least within our testing it is not as straightforward as "if you
> run on SSDs - you always get performance benefits". Some of
> Cloudflare's workloads chose not to enable these flags to this day.
> With all the external feedback I received for these flags - most
> people reported improvements, but there were also a small number of
> reports of regressions.

Are these workloads more complicated than dm-crypt running directly on a drive? In other words, do these workloads involve many device-mapper layers?

Is this feedback public anywhere, or internal to Cloudflare? I would like to know in which situations the regressions can happen, and in which we get improvements.

> So ultimately more than just the type of storage goes into the
> decision whether to enable these flags, which in turn makes it a
> "policy", and the kernel should not dictate policy. I can see a world
> where spinning media is not used anymore at all, so the kernel could
> probably flip these to "default on" for everything, but more
> "complex decision trees" would be better off somewhere in userspace
> (systemd, cryptsetup, users themselves).

Sure, complicated policy probably starts to belong in userspace, especially if someone will need to tune it later for a given workload.

I'd hesitate to make it default on. That would mean someone running spinning media + a new kernel + an old userspace or config would suddenly see performance regressions. If userspace is dictating the policy, it will be deciding when to turn the flags on anyway.
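
For what it's worth, the mechanism for userspace to enact such a policy already exists. Here is a minimal sketch using cryptsetup's performance options; the device path and mapping name are placeholders, and it assumes a cryptsetup new enough (roughly 2.3.4+) to know these options:

#!/usr/bin/env python3
# Sketch: userspace enacting the "bypass the workqueues" decision by
# passing cryptsetup's performance options at activation time. The
# device path and mapping name are placeholders.

import subprocess


def open_luks(device: str, name: str, bypass_workqueues: bool) -> None:
    cmd = ["cryptsetup", "open", device, name]
    if bypass_workqueues:
        # --persistent stores the flags in the LUKS2 header, so later
        # activations pick them up without re-running the policy.
        cmd += ["--perf-no_read_workqueue",
                "--perf-no_write_workqueue",
                "--persistent"]
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # e.g. feed in the result of the rotational check sketched earlier
    open_luks("/dev/nvme0n1p3", "home-adrian", bypass_workqueues=True)

The --persistent part is a LUKS2 feature; dropping it keeps the decision per-activation instead of recording it in the header.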

>> Is there some reason these two flags aren't on by default for SSDs?
>>
>> Thanks,
>> Adrian
> Ignat

>> [1]: https://foundation.gnome.org/2023/11/09/gnome-recognized-as-public-interest-infrastructure/
>> [2]: https://blog.cloudflare.com/speeding-up-linux-disk-encryption/
>> [3]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/drivers/md/dm-crypt.c?id=39d42fa96ba1b7d2544db3f8ed5da8fb0d5cb877
>> [4]: https://gitlab.com/cryptsetup/cryptsetup/-/issues/862
Adrian



