Hi Milan, On 28/11/2021 17:33, Itai Handler wrote: > On 27/11/2021 13:19, Milan Broz wrote: >> On 11/25/21 17:28, Itai Handler wrote: >>> On 14/11/2021 13:56, Itai Handler wrote: >>>> On 11/11/2021 15:07, Milan Broz wrote: >>>>> On 10/11/2021 18:43, Itai Handler wrote: >>>>>> Maximum sector size of dm-crypt is currently limited to 4096 bytes. >>>>>> >>>>>> On systems where PAGE_SIZE is larger than 4096 bytes, using larger >>>>>> sectors can be beneficial for performance reasons. >>>>> The limit to 4096 was set because this is the smallest possible >>>>> page size that all platform supports. >>>>> >>>>> If you allow a higher size here, the device cannot be activated on a >>>>> platform >>>>> with the smaller page size. (Encrypted sector size becomes >>>>> atomic sector size for all upper layers - as you mention below, not >>>>> all fs support bigger sectors.) >>>>> >>>>> For LUKS, this is not acceptable - the format is portable by definition. >>>>> >>>>> For specific dm-crypt device, I am not sure. I would better kept >>>>> the 4096 page size limit here. >>>> I considered only plain dm-crypt since I am unfamiliar with LUKS. >>>> Does LUKS assume that dm-crypt sector size is limited to 4K? >>>> If so, maybe I'll be able to also patch LUKS regarding this issue. >>>> >>>>> It also depends on crypto API driver here (performance is usually >>>>> optimized to 4k). >>>>> What cipher and encryption mode did you use for test? >>>> The cipher I used for the test is not publicly available but I can say >>>> that it's performance >>>> is not optimized to 4k blocks. >>>> I believe that this results from the high overhead of setting up DMA >>>> transfers. (my >>>> cipher uses DMA to transfer data between memory and programmable logic). >>>> There are many additional ciphers that use DMA in the tree, but I >>>> cannot run any >>>> benchmark with them at the moment. >>>> >>>> I have performed some additional benchmarks using the ARM >>>> Cryptographic Extensions >>>> CPU ciphers and saw that increasing block size beyond 4K does increase >>>> performance, >>>> albeit the performance improvement isn't as large as I've seen when >>>> using my cipher. >>>> >>>> Following are "tcrypt mode=600 sec=5 num_mb=512" results for >>>> xts-aes-ce decryption >>>> (ARM CPU Cryptographic Extensions cipher): >>>> testing speed of multibuffer xts(aes) (xts-aes-ce) decryption >>>> ... >>>> trcypt: test 5 (256 bit key, 4096 byte blocks): 801792 operations in >>>> 5 seconds (3284140032 bytes) >>>> ... >>>> trcypt: test 9 (256 bit key, 65536 byte blocks): 63488 operations in >>>> 5 seconds (4160749568 bytes) >>>> >>>> That translates to: >>>> 657 MB/s for 4K byte blocks. >>>> 832 MB/s for 64K blocks. >>>> >>>> That means that there is about 27 percents improvement when >>>> transitioning to 64K sectors, >>>> for the cipher alone (only tcrypt benchmark). >>>> >>>> This benchmark had been performed on an ARM Cortex A53 CPU. >>>> (Note that in all of my benchmarks PAGE_SIZE=64K). >>>> >>>>> How the number looks for random access? Linear test is usually >>>>> misleading. >>>>> I expect there will be big performance problem if you write small >>>>> data chunks, >>>>> writes and encryption will be amplified to full big sectors here...) >>>> I understand your concern. >>>> However my patch does not force anyone to use large sectors - it only >>>> opens up this >>>> possibility for those interested in that option. >>>> This is similarly to the option to format an ext4 filesystem with 64K >>>> sectors. >>>> By the way: when you do that, you get a warning saying that the >>>> filesystem >>>> will not be usable on most systems. >>>> >>>> Sometime users need to store mostly large files on a filesystem, for >>>> example for >>>> backup or for video files. >>>> I think that in these cases random access time is not so important. >>>> Some users may also be able to reserve a dedicated partition for >>>> storing such >>>> large files. >>>> >>>>> (Technical detail: such pat MUST increase dm-crypt minor version.) >>>> Thanks for pointing that out. I believe that I can prepare a v2 patch >>>> that will >>>> address that issue. >>>>> Milan >>>>> >>>>>> This patch changes maximum sector size from 4096 bytes to PAGE_SIZE, >>>>>> and in addition it changes the type of sector_size in >>>>>> struct crypt_config from 'unsigned short int' to 'unsigned int', in >>>>>> order to be able to represent larger values. >>>>>> >>>>>> On a prototype system which has PAGE_SIZE of 65536 bytes, I saw about >>>>>> x2 performance improvement in sequential read throughput benchmark >>>>>> while using only about half of the CPU usage, after simply increasing >>>>>> sector size from 4096 to 65536 bytes. >>>>>> I used ext4 filesystem for that benchmark, which supports 64KiB >>>>>> sectors. >>>>>> >>>>>> Note: A small change should be made in cryptsetup in order to add >>>>>> support for sectors larger than 4096 bytes. >>>>>> >>>>>> Signed-off-by: Itai Handler <itai.handler@xxxxxxxxx> >>>>>> --- >>>>>> drivers/md/dm-crypt.c | 6 +++--- >>>>>> 1 file changed, 3 insertions(+), 3 deletions(-) >>>>>> >>>>>> diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c >>>>>> index 916b7da16de2..78c239443bd5 100644 >>>>>> --- a/drivers/md/dm-crypt.c >>>>>> +++ b/drivers/md/dm-crypt.c >>>>>> @@ -168,7 +168,7 @@ struct crypt_config { >>>>>> } iv_gen_private; >>>>>> u64 iv_offset; >>>>>> unsigned int iv_size; >>>>>> - unsigned short int sector_size; >>>>>> + unsigned int sector_size; >>>>>> unsigned char sector_shift; >>>>>> >>>>>> union { >>>>>> @@ -3115,9 +3115,9 @@ static int crypt_ctr_optional(struct dm_target >>>>>> *ti, unsigned int argc, char **ar >>>>>> cc->cipher_auth = kstrdup(sval, GFP_KERNEL); >>>>>> if (!cc->cipher_auth) >>>>>> return -ENOMEM; >>>>>> - } else if (sscanf(opt_string, "sector_size:%hu%c", >>>>>> &cc->sector_size, &dummy) == 1) { >>>>>> + } else if (sscanf(opt_string, "sector_size:%u%c", >>>>>> &cc->sector_size, &dummy) == 1) { >>>>>> if (cc->sector_size < (1 << SECTOR_SHIFT) || >>>>>> - cc->sector_size > 4096 || >>>>>> + cc->sector_size > PAGE_SIZE || >>>>>> (cc->sector_size & (cc->sector_size - >>>>>> 1))) { >>>>>> ti->error = "Invalid feature value for >>>>>> sector_size"; >>>>>> return -EINVAL; >>>>>> >>>> I appreciate your valuable comments. >>>> >>>> Itai >>>> >>> Milan, can you comment on the above? >> Hi, >> >> well, if you want my opinion, let's summarize it: >> >> - you tested it on a driver that is proprietary and is not in >> the mainline kernel > You are correct here. I also suspect that some drivers in the mainline may > benefit as well, but I didn't have the opportunity to verify that yet. >> - provided numbers are with tcrypt - I think this is >> an internal crypto API test; it says nothing about performance >> with your dm-crypt patch above it, right? > You are correct. Please see additional benchmark results below. >> - no numbers for random access/small file access here (note, >> if you need to reencrypt 64k sector vs. 4k of mostly unused data, >> it WILL have a huge performance impact). >> >> IMO, optimizing for big linear access is better on the fs layer >> (and even years ago, I noticed some vendor's patches for setting >> large fs sectors in video recording equipment). > Interesting. Can you point me to some of those patches? >> The same applies to encryption - I think FDE is not the best layer >> here if you need to work with large sectors (encryption of not-used >> sectors is just huge waste of resources - fs encrypts only used space, >> I hope... :). >> >> I said that LUKS must remain multiplatform. Adding larger than >> 4k sectors is not an option; I will undoubtedly reject that patch >> in cryptsetup. >> So this patch can be used only for plain dm-crypt mapping, >> where you need to handle your key management (that could be ok, though). > In the future I plan to submit a cryptsetup patch that will allow > larger sectors only for the plain dm-crypt mapping. Such a change will > not affectthe portability of LUKS. >> Sorry, but without any performance numbers that prove this really >> helps and does not create huge performance problems... > Following are some benchmark results using the ARM Cryptographic > Extensions CPU cipher "xts(aes)", on an ARM Cortex A53 CPU, > and a 2 TB Samsung EVO 970 NVME SSD: > > Random read, dm crypt sector size 4K: 966 MB/s > Random read, dm crypt sector size 64K: 1054 MB/s (9% improvement) > > Random write, sector size 4K: 841 MB/s > Random write, sector size 64K: 983 MB/s (16% improvement) > > Following are cryptsetup and fio parameters used for this test: > cryptsetup open --type plain --sector-size $SECTOR_SIZE \ > -c 'capi:xts(aes)-plain64' > /dev/nvme0n1p1 dmcrypt0 > > (SECTOR_SIZE is set to 4096 or 65536). > > fio --ioengine pvsync --filename=/dev/mapper/dmcrypt0 \ > --readwrite=$READWRITE --bs=65536 --blockalign=65536 \ > --iomem_align=65536 --time_based --group_reporting \ > --runtime=50 --direct=1 --num_jobs=32 --name=plain > > (READWRITE is set to randread or randwrite) > > Note that I suspect that with ciphers that use DMA we will be able to see > much larger improvement (similarly to what I have reported in my initial > post), however I don't have suitable hardware for testing that at the > moment. > It would be nice if someone who has such hardware could test that or > point me to publicly available hardware for buying for such an experiment. > > Note that this benchmark randomly reads/writes 64K files to the SSD. > For smaller than 64KB files, the performance is indeed expected to drop > significantly, however: > 1. Today users can format (unencrypted) ext4 filesystems to 64K sectors. > If they store small files on such filesystems, they probably already > suffer from performance problems. > 2. My patch allows to use all sector sizes between 4K and 64K > (i.e. 4K, 8K, 16K, 32K and 64K). When we use 8K sectors, the performance > drop for small files isn't expected to be so high and still I can show in > that case huge performance improvement, at least when using my proprietary > cipher (I may be able to show improvement with other hardware when I > get my hands on such hardware). > 3. Users might be able to partition their disk to several partitions, with > the 64K bytes sectors partitions reserved for larger files. > >> NACK from me for this patch (but it is up to Mike to decide here). >> >> Milan > Thanks. > Please note that I'll be off-work from now until next Tuesday. > > Itai > I finally managed to get my hands on an Open-Q 865 SOM Development board and have just performed a few benchmarks using the qce (qcrypt) crypto accelerator, which is available also in the mainline kernel. Following is a summary of these benchmarks: +------------------+-----------+-----------+-----------+-------------+ | sector_size/test | seq_read | rand_read | seq_write | rand_write | +------------------+-----------+-----------+-----------+-------------+ | 4 KiB sectors | 13.3 | 22.8 | 27.2 | 24 | | 64 KiB sectors | 581 | 582 | 586 | 588 | +------------------+-----------+-----------+-----------+-------------+ (All numbers in MB/s). These results show significant improvements: x43 sequential read. x25 random read. x21 sequential write. x24 random write. So it appears to me that there is real benefit for enabling 64K sectors in dm-crypt at least for the particular use case I showed (qce driver on arm with 64K PAGE_SIZE, plain dm-crypt mode). Notes: 1. These benchmarks have been performed using a ramdisk. 2. These benchmarks have been performed on an MSM kernel, which is based on linux kernel 4.19.81. Kernel sources are available at: https://source.codeaurora.org/quic/la/kernel/msm-4.19 commit id: c815d7543db92377d85d214af0dbf1d11324726e In all benchmarks, CONFIG_ARM64_64K_PAGES=y, the ramdisk size is 1 GB (/dev/ram0). For benchmarking with 4 KiB sectors, I created the dm-crypt device as follows: cryptsetup open --type plain --sector-size 4096 \ -c 'capi:qcom-xts(aes)-plain64' /dev/ram0 dmcrypt0 For 64 KiB sectors, simply replace 4096 with 65536. Following are the benchmark commands: fio --filename=/dev/mapper/dmcrypt0 --readwrite=read --bs=1048576 \ --time_based --group_reporting --runtime=10 --direct=1 --iodepth=4 \ --ioengine=libaio --name=plain fio --filename=/dev/mapper/dmcrypt0 --readwrite=randread --bs=1048576 \ --time_based --group_reporting --runtime=10 --direct=1 --iodepth=4 \ --ioengine=libaio --name=plain fio --filename=/dev/mapper/dmcrypt0 --readwrite=write --bs=1048576 \ --time_based --group_reporting --runtime=10 --direct=1 --iodepth=4 \ --ioengine=libaio --name=plain fio --filename=/dev/mapper/dmcrypt0 --readwrite=randwrite --bs=1048576 \ --time_based --group_reporting --runtime=10 --direct=1 --iodepth=4 \ --ioengine=libaio --name=plain - Itai -- dm-devel mailing list dm-devel@xxxxxxxxxx https://listman.redhat.com/mailman/listinfo/dm-devel