On Tue, Jan 30, 2024 at 4:10 PM Adrian Vovk <adrianvovk@xxxxxxxxx> wrote:
>
> On 1/30/24 03:07, Ignat Korchagin wrote:
> > On Tue, Jan 30, 2024 at 2:14 AM Adrian Vovk <adrianvovk@xxxxxxxxx> wrote:
> >> Hello all,
> >>
> >> I am working as a contractor for the GNOME Foundation to integrate
> >> systemd-homed into the GNOME desktop and related components, as part of
> >> a grant by the Sovereign Tech Fund[1]. systemd-homed is a component of
> >> systemd that, among other things, puts each user's home directory into a
> >> separate LUKS volume so that each user can have their data encrypted
> >> independently.
> > Seems filesystem encryption is better suited here, but I see it
> > already supports this option.
>
> Filesystem encryption does not protect metadata, so we consider it weaker
> encryption for the purposes of homed. Thus we prefer LUKS, which is
> definitely more complicated but protects all user data.
>
> It's also easier to move a loopback file between devices, which is
> another goal of homed.
>
> >> I have recently come across a blog post from Cloudflare[2] that details
> >> a significant (~2x) improvement to throughput and latency that they were
> >> able to achieve on dm-crypt devices by implementing[3] and using the
> >> no_read_workqueue and no_write_workqueue performance flags. These flags
> >> bypass queuing that was in place to optimize for access patterns of
> >> spinning media and work around limitations of kernel subsystems at the time.
> >>
> >> Thus, to me it looks like these flags should default to on for SSDs
> >> (i.e. /sys/block/<name>/queue/rotational is 0). Such a distinction based
> >> on the type of storage media has precedent (i.e. selecting the `none` IO
> >> scheduler for SSDs to improve throughput). So, I was going to change
> > This is a different layer IMO. The IO scheduler is more "tightly
> > coupled" with the underlying storage, whereas the device mapper framework
> > is a higher-level abstraction. Does it even have the capability to look
> > into the underlying storage device? What if there are several layers
> > of device mapper (dm-crypt->lvm->dm-raid->device)?
>
> I don't know if you have convenient access on the kernel side, but from
> the userspace side we can fairly easily resolve what is backing a
> dm-crypt device by probing sysfs a little.
>
> As for several layers: I don't know. My use-case isn't that complicated.
> I suppose if you have dm-crypt on LVM, then it's not on an SSD and thus
> we don't turn on the flags (i.e. we don't walk the layers until we reach
> an actual physical disk; we just check the immediate parent).
>
> > I'm mentally comparing device mapper to a network OSI model: should
> > the TCP protocol tune itself based on what data link layer it runs on
> > (ethernet vs token ring vs serial port)?
>
> In theory I suppose not. But I _really_ wouldn't be surprised if in
> practice people do this.
>
> But also, TCP does make an effort to tune itself based on the
> bandwidth/etc it finds available. I do not think dm-crypt can do this so
> easily.
> >> systemd to turn these flags on when it detected an SSD, but I spoke to
> >> the maintainers first and they suggested that I try to implement this in
> >> cryptsetup instead. So I reached out there[4], and they suggested that I
> >> should write to you, and that the kernel should set optimal defaults in
> >> cases like this. They also mentioned recalling some hardware
> >> configurations where these flags shouldn't be on.
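As an aside, the "probing sysfs a little" mentioned above can be done by
listing a dm device's immediate parents under /sys/class/block/<dm>/slaves
and reading each parent's queue/rotational attribute. A rough, untested
Python sketch of that check (the device names are only illustrative, and it
deliberately looks at the immediate parent only, as described above):

import os

def immediate_parents(dm_name):
    # Devices directly underneath a dm device, e.g. ["sda2"] or ["loop0"].
    slaves_dir = f"/sys/class/block/{dm_name}/slaves"
    return os.listdir(slaves_dir) if os.path.isdir(slaves_dir) else []

def is_rotational(dev_name):
    # Partitions have no queue/ directory of their own; their sysfs node
    # sits inside the parent disk's directory, so ".." reaches the disk.
    for path in (f"/sys/class/block/{dev_name}/queue/rotational",
                 f"/sys/class/block/{dev_name}/../queue/rotational"):
        try:
            with open(path) as f:
                return f.read().strip() == "1"
        except OSError:
            continue
    return True  # unknown: treat as spinning media and leave the flags off

def backed_by_ssd_only(dm_name):
    parents = immediate_parents(dm_name)
    return bool(parents) and not any(is_rotational(p) for p in parents)

print(backed_by_ssd_only("dm-0"))  # hypothetical mapping name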
> > At least within our testing it is not as straightforward as "if you
> > run on SSDs - you always get performance benefits". Some of
> > Cloudflare's workloads chose not to enable these flags to this day.
> > With all the external feedback I received for these flags - most
> > people reported improvements, but there were also a small number of
> > reports of regressions.
>
> Are these workloads more complicated than dm-crypt running directly on a
> drive? In other words, are these workloads involving many device-mapper
> layers?

No. No layers, but for us it is some database/storage workloads. We
generally see that the performance benefit comes from workloads generating
"many small bios" (closer to sector size) vs "large bios" (big chunks of
data attached to a single bio).

> Is this feedback public anywhere or internal to Cloudflare? I would like
> to know in which situations the regressions can happen, and in which we
> get improvements.

It's in the mailing lists and (unfortunately) some off-list replies to my
patches.

> > So ultimately it is more than just the type of storage that goes into
> > the decision whether to enable these flags, which in turn creates a
> > "policy", which the kernel should not dictate. I can see a world where
> > spinning media would not be used anymore at all, so the kernel
> > probably can flip these for everything to "default on", but more
> > "complex decision trees" would be better off somewhere in userspace
> > (systemd, cryptsetup, users themselves).
>
> Sure, complicated policy probably starts to belong in userspace,
> especially if someone will need to tune it later for a given workload.

Exactly.

> I'd hesitate to make it default on. This would mean someone running
> spinning media + a new kernel + an old userspace or config will suddenly
> see performance regressions. If userspace is dictating the policy, it
> will be deciding when to turn the flags on anyway.
>
> >> Is there some reason these two flags aren't on by default for SSDs?
> >>
> >> Thanks,
> >> Adrian
> > Ignat
> >
> >> [1]:
> >> https://foundation.gnome.org/2023/11/09/gnome-recognized-as-public-interest-infrastructure/
> >>
> >> [2]: https://blog.cloudflare.com/speeding-up-linux-disk-encryption/
> >> [3]:
> >> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/drivers/md/dm-crypt.c?id=39d42fa96ba1b7d2544db3f8ed5da8fb0d5cb877
> >>
> >> [4]: https://gitlab.com/cryptsetup/cryptsetup/-/issues/862
> Adrian
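For completeness: if the policy does end up in userspace as discussed, the
flags can already be set per-device when opening the mapping. A hedged
sketch of what that decision might look like (the --perf-no_read_workqueue
and --perf-no_write_workqueue option names are from my reading of recent
cryptsetup releases and should be double-checked against the version you
actually ship; the function and device names below are made up for
illustration):

import subprocess

def open_encrypted_volume(device_or_image, mapping_name, on_ssd):
    # "on_ssd" could come from a sysfs check like the sketch earlier in
    # this thread; the flags are only added when the backing storage is
    # non-rotational.
    cmd = ["cryptsetup", "open", device_or_image, mapping_name]
    if on_ssd:
        # Bypass dm-crypt's internal read/write workqueues, as in the
        # Cloudflare post referenced above.
        cmd += ["--perf-no_read_workqueue", "--perf-no_write_workqueue"]
    subprocess.run(cmd, check=True)

open_encrypted_volume("/dev/sda2", "home-example", on_ssd=True)

If I remember correctly, recent systemd also accepts matching
no-read-workqueue/no-write-workqueue options in crypttab for mappings set
up at boot, but again, please verify against your versions.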