Hi

I placed the dm-crypt parallelization patches at:
http://people.redhat.com/~mpatocka/patches/kernel/dm-crypt-paralelizace/current/

The patches parallelize dm-crypt and make it possible to use all processor
cores.

The patch dm-crypt-remove-percpu.patch removes some percpu variables and
replaces them with per-request variables.

The patch dm-crypt-unbound-workqueue.patch sets WQ_UNBOUND on the
encryption workqueue, allowing the encryption to be distributed to all
CPUs in the system.

The patch dm-crypt-offload-writes-to-thread.patch moves submission of all
write requests to a single thread.

The patch dm-crypt-sort-requests.patch sorts write requests submitted by
that single thread. The requests are sorted by sector number; an rb-tree
is used for efficient sorting.

Some usage notes:

* turn off automatic cpu frequency scaling (or set it to the "performance"
  governor) - cpufreq doesn't recognize the encryption workload correctly;
  sometimes it underclocks all the CPU cores when there is encryption work
  to do, resulting in bad performance

* when using a filesystem on an encrypted dm-crypt device, reduce the
  maximum request size with "/sys/block/dm-2/queue/max_sectors_kb"
  (substitute "dm-2" with the real name of your dm-crypt device). If the
  requests are too big, there are only a few of them and they cannot be
  distributed to all available processors in parallel, which results in
  worse performance. If the requests are too small, the per-request
  overhead is high, which also reduces performance. So you must find the
  optimal request size for your system and workload. For me, when testing
  this on a ramdisk, the optimum is 8KiB.

---

Now, the problem with the I/O scheduler: when doing performance testing,
it turns out that the parallel version is sometimes worse than the
previous implementation.

When I create a 4.3GiB dm-crypt device on top of dm-loop on top of an
ext2 filesystem on a 15k SCSI disk and run this command

    time fio --rw=randrw --size=64M --bs=256k --filename=/dev/mapper/crypt --direct=1 --name=job1 --name=job2 --name=job3 --name=job4 --name=job5 --name=job6 --name=job7 --name=job8 --name=job9 --name=job10 --name=job11 --name=job12

the results are these:

CFQ scheduler:
--------------
no patches:                              21.9s
patch 1:                                 21.7s
patches 1,2:                             2:33s
patches 1,2 (+ nr_requests = 1280000):   2:18s
patches 1,2,3:                           20.7s
patches 1,2,3,4:                         20.7s

deadline scheduler:
-------------------
no patches:                              27.4s
patch 1:                                 27.4s
patches 1,2:                             27.8s
patches 1,2,3:                           29.6s
patches 1,2,3,4:                         29.6s

We can see that CFQ performs badly with patch 2, but improves with
patch 3. All that patch 3 does is move write requests from the encryption
threads to a separate thread.

So it seems that CFQ has a deficiency: it cannot merge adjacent requests
submitted by different processes.

The problem is this:
- we have a 256k direct-I/O write request
- it is split into 4k bios (because we run on dm-loop on a filesystem with
  a 4k block size)
- encryption of these 4k bios is distributed to 12 processes on a 12-core
  machine
- encryption finishes out of order and in different processes; the 4k bios
  with encrypted data are submitted to CFQ
- CFQ doesn't merge them
- the disk is flooded with random 4k write requests and performs much
  worse than with 256k requests

Increasing nr_requests to 1280000 helps a little, but not much - it is
still an order of magnitude slower.

I'd like to ask if someone who knows the CFQ scheduler (Jens?) could look
at it and find out why it doesn't merge requests from different processes.

Why do I have to do a seemingly senseless operation (handing over write
requests to a separate thread) in patch 3 to improve performance?

Mikulas
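P.S. To make the merging problem above easier to picture, here is a small userspace sketch in Python. It is only a toy model and not kernel code: it assumes a scheduler that merges a bio only when it starts exactly where the previously submitted bio ended, which is far simpler than what CFQ or deadline really do. The helper names (split_into_bios, count_requests_after_back_merge, simulate) and the starting sector 2048 are made up for the illustration. It splits one 256k write into 4k bios, submits them in order, in shuffled order (standing in for encryption finishing out of order), and sorted by sector again (roughly the idea behind the sorting patch), and counts how many separate requests result.

```python
import random

SECTOR_SIZE = 512
BIO_SIZE = 4096              # 4k bios, as in the dm-loop case described above
REQUEST_SIZE = 256 * 1024    # one 256k direct-I/O write


def split_into_bios(start_sector, request_size=REQUEST_SIZE, bio_size=BIO_SIZE):
    """Return (start_sector, n_sectors) tuples covering one large request."""
    sectors_per_bio = bio_size // SECTOR_SIZE
    n_bios = request_size // bio_size
    return [(start_sector + i * sectors_per_bio, sectors_per_bio)
            for i in range(n_bios)]


def count_requests_after_back_merge(bios):
    """Toy model (an assumption, not CFQ's real logic): a bio is merged only
    if it starts exactly where the previously submitted bio ended."""
    requests = 0
    next_expected = None
    for start, length in bios:
        if start != next_expected:
            requests += 1        # cannot merge, a new request is started
        next_expected = start + length
    return requests


def simulate(seed=0):
    """Compare in-order, out-of-order and re-sorted submission."""
    rng = random.Random(seed)
    bios = split_into_bios(start_sector=2048)   # hypothetical start sector
    out_of_order = bios[:]
    rng.shuffle(out_of_order)    # stands in for out-of-order encryption
    return {
        "in_order": count_requests_after_back_merge(bios),
        "out_of_order": count_requests_after_back_merge(out_of_order),
        "sorted_before_submit":
            count_requests_after_back_merge(sorted(out_of_order)),
    }


if __name__ == "__main__":
    for name, n in simulate().items():
        print(f"{name}: {n} request(s)")
```

In this model the in-order and re-sorted cases collapse to a single request, while the shuffled case leaves many small ones. It says nothing about why CFQ in particular fails to merge across processes; that question still stands.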