On Tue, Jan 30, 2024 at 4:10 PM Adrian Vovk <adrianvovk@xxxxxxxxx> wrote:
>
> On 1/30/24 03:07, Ignat Korchagin wrote:
> > On Tue, Jan 30, 2024 at 2:14 AM Adrian Vovk <adrianvovk@xxxxxxxxx> wrote:
> >> Hello all,
> >>
> >> I am working as a contractor for the GNOME Foundation to integrate
> >> systemd-homed into the GNOME desktop and related components, as part of
> >> a grant by the Sovereign Tech Fund[1]. systemd-homed is a component of
> >> systemd that, among other things, puts each user's home directory into a
> >> separate LUKS volume so that each user can have their data encrypted
> >> independently.
> > Seems filesystem encryption is better suited here, but I see it
> > already supports this option.
>
> Filesystem encryption does not protect metadata, so we consider it weaker
> encryption for the purposes of homed. Thus we prefer LUKS, which is
> definitely more complicated but protects all user data.
>
> It's also easier to move a loopback file between devices, which is
> another goal of homed.
>
> >> I have recently come across a blog post from Cloudflare[2] that details
> >> a significant (~2x) improvement to throughput and latency that they were
> >> able to achieve on dm-crypt devices by implementing[3] and using the
> >> no_read_workqueue and no_write_workqueue performance flags. These flags
> >> bypass queuing that was in place to optimize for access patterns of
> >> spinning media and work around limitations of kernel subsystems at the time.
> >>
> >> Thus, to me it looks like these flags should default to on for SSDs
> >> (i.e. /sys/block/<name>/queue/rotational is 0). Such a distinction based
> >> on the type of storage media has precedent (i.e. selecting the `none` IO
> >> scheduler for SSDs to improve throughput). So, I was going to change
> > This is a different layer IMO. The IO scheduler is more "tightly
> > coupled" with the underlying storage, whereas the device mapper framework
> > is a higher-level abstraction. Does it even have the capability to look
> > into the underlying storage device? What if there are several layers
> > of device mapper (dm-crypt->lvm->dm-raid->device)?
>
> I don't know if you have convenient access on the kernel side, but from
> the userspace side we can fairly easily resolve what is backing a
> dm-crypt device by probing sysfs a little.
>
> As for several layers: I don't know. My use-case isn't that complicated.
> I suppose if you have dm-crypt on LVM, then it's not on an SSD and thus
> we don't turn on the flags (i.e. we don't walk the layers until we reach
> an actual physical disk; we just check the immediate parent).
>
> > I'm mentally comparing device mapper to a network OSI model: should
> > the TCP protocol tune itself based on what data link layer it runs on
> > (ethernet vs token ring vs serial port)?
>
> In theory I suppose not. But I _really_ wouldn't be surprised if in
> practice people do this.
>
> But also, TCP does make an effort to tune itself based on the
> bandwidth/etc it finds available. I do not think dm-crypt can do this so
> easily.
> >> systemd to turn these flags on when it detected an SSD, but I spoke to
> >> the maintainers first and they suggested that I try to implement this in
> >> cryptsetup instead. So I reached out there[4], and they suggested that I
> >> should write to you, and that the kernel should set optimal defaults in
> >> cases like this. They also mentioned recalling some hardware
> >> configurations where these flags shouldn't be on.
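As an aside, the "probing sysfs a little" mentioned above can be done by
listing a dm device's immediate parents under /sys/class/block/<dm>/slaves
and reading each parent's queue/rotational attribute. A rough, untested
Python sketch of that check (the device names are only illustrative, and it
deliberately looks at the immediate parent only, as described above):

import os

def immediate_parents(dm_name):
    # Devices directly underneath a dm device, e.g. ["sda2"] or ["loop0"].
    slaves_dir = f"/sys/class/block/{dm_name}/slaves"
    return os.listdir(slaves_dir) if os.path.isdir(slaves_dir) else []

def is_rotational(dev_name):
    # Partitions have no queue/ directory of their own; their sysfs node
    # sits inside the parent disk's directory, so ".." reaches the disk.
    for path in (f"/sys/class/block/{dev_name}/queue/rotational",
                 f"/sys/class/block/{dev_name}/../queue/rotational"):
        try:
            with open(path) as f:
                return f.read().strip() == "1"
        except OSError:
            continue
    return True  # unknown: treat as spinning media and leave the flags off

def backed_by_ssd_only(dm_name):
    parents = immediate_parents(dm_name)
    return bool(parents) and not any(is_rotational(p) for p in parents)

print(backed_by_ssd_only("dm-0"))  # hypothetical mapping name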
> > At least within our testing it is not as straightforward as "if you
> > run on SSDs - you always get performance benefits". Some of
> > Cloudflare's workloads chose not to enable these flags to this day.
> > With all the external feedback I received for these flags - most
> > people reported improvements, but there were also a small number of
> > reports of regressions.
>
> Are these workloads more complicated than dm-crypt running directly on a
> drive? In other words, are these workloads involving many device-mapper
> layers?

No. No layers, but for us it is some database/storage workloads. We
generally see that the performance benefit comes from workloads generating
"many small bios" (closer to sector size) vs "large bios" (big chunks of
data attached to a single bio).

> Is this feedback public anywhere or internal to Cloudflare? I would like
> to know in which situations the regressions can happen, and in which we
> get improvements.

It's in the mailing lists and (unfortunately) some off-list replies to my
patches.

> > So ultimately it is more than just the type of storage that goes into
> > the decision whether to enable these flags, which in turn creates a
> > "policy", which the kernel should not dictate. I can see a world where
> > spinning media would not be used anymore at all, so the kernel
> > probably can flip these for everything to "default on", but more
> > "complex decision trees" would be better off somewhere in userspace
> > (systemd, cryptsetup, users themselves).
>
> Sure, complicated policy probably starts to belong in userspace,
> especially if someone will need to tune it later for a given workload.

Exactly.

> I'd hesitate to make it default on. This would mean someone running
> spinning media + a new kernel + an old userspace or config will suddenly
> see performance regressions. If userspace is dictating the policy, it
> will be deciding when to turn the flags on anyway.
>
> >> Is there some reason these two flags aren't on by default for SSDs?
> >>
> >> Thanks,
> >> Adrian
> > Ignat
> >
> >> [1]:
> >> https://foundation.gnome.org/2023/11/09/gnome-recognized-as-public-interest-infrastructure/
> >>
> >> [2]: https://blog.cloudflare.com/speeding-up-linux-disk-encryption/
> >> [3]:
> >> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/drivers/md/dm-crypt.c?id=39d42fa96ba1b7d2544db3f8ed5da8fb0d5cb877
> >>
> >> [4]: https://gitlab.com/cryptsetup/cryptsetup/-/issues/862
> Adrian
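For completeness: if the policy does end up in userspace as discussed, the
flags can already be set per-device when opening the mapping. A hedged
sketch of what that decision might look like (the --perf-no_read_workqueue
and --perf-no_write_workqueue option names are from my reading of recent
cryptsetup releases and should be double-checked against the version you
actually ship; the function and device names below are made up for
illustration):

import subprocess

def open_encrypted_volume(device_or_image, mapping_name, on_ssd):
    # "on_ssd" could come from a sysfs check like the sketch earlier in
    # this thread; the flags are only added when the backing storage is
    # non-rotational.
    cmd = ["cryptsetup", "open", device_or_image, mapping_name]
    if on_ssd:
        # Bypass dm-crypt's internal read/write workqueues, as in the
        # Cloudflare post referenced above.
        cmd += ["--perf-no_read_workqueue", "--perf-no_write_workqueue"]
    subprocess.run(cmd, check=True)

open_encrypted_volume("/dev/sda2", "home-example", on_ssd=True)

If I remember correctly, recent systemd also accepts matching
no-read-workqueue/no-write-workqueue options in crypttab for mappings set
up at boot, but again, please verify against your versions.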