For the vast majority of cases, Linus is correct. Pre-fetch is a hardware specific tweak that just doesn't apply across the breadth of systems Linus has to pay attention to. In this case, our tunnel-vision is somewhat encouraged by the code and logic at hand. The AVX2 and AVX512 code is pretty specific to new, 64-bit, x86_64 platforms that tend to share a lot of cache behavior. It is not like this will ever run on an Atom or ARM CPU. Even more important, the data access patterns are 100% defined by the task at hand. With RAID parity calculation, you are basically taking a stroll through RAM with 100% defined patterns. This is one case, where you can pre-fetch memory enough ahead of time so that the hardware prefetch unit never triggers and you will always be correct. I am somewhat surprised you see 10%. 10% on an expensive function like this is a lot. Even more amazing is that the AVX2 and AVX512 code look to be memory bandwidth limited. 256 bytes of source reads in about 30 clocks is about 17 GB/sec, which is faster than a single RAM channel. Congrats to Intel for "finding the next bottleneck". I also see a follow-up from Linus, and again, totally agree with him. I guess my take is that prefetch only works when you have a use case where you are 100% correct with your pre-fetches. I would also note that this "raid case" is the default, clean array, parity calc code. The same logic probably needs to be applied to the recovery cases, but I have not looked at those yet. Doug On Fri, Oct 7, 2016 at 10:39 AM, Shaohua Li <shli@xxxxxxxxxx> wrote: > On Thu, Oct 06, 2016 at 10:39:21PM -0700, Doug Dumitru wrote: >> Mr. Li, >> >> There is another thread in [linux-raid] discussing pre-fetches in the >> raid-6 AVX2 code. My testing implies that the prefetch distance is >> too short. In your new AVX512 code, it looks like there are 24 >> instructions, each with latencies of 1, between the prefetch and the >> actual memory load. I don't have a AVX512 CPU to try this on, but the >> prefetch might do better at a bigger distance. If I am not mistaken, >> it takes a lot longer than 24 clocks to fetch 4 cache lines. >> >> Just a comment while the code is still fluid. > > I did try your patch and it improved 10% in my machine, but this isn't > relevent to the pull. We can do the tunning later if necessary. I'm > hoping the intel guys can share some hints, but apparently Linus isn't a > fan for such tuning. > > Thanks, > Shaohua > >> On Thu, Oct 6, 2016 at 5:38 PM, Shaohua Li <shli@xxxxxxxxxx> wrote: >> > Hi Linus, >> > Please pull MD update for 4.9. This update includes: >> > - new AVX512 instruction based raid6 gen/recovery algorithm >> > - A couple of md-cluster related bug fixes >> > - Fix a potential deadlock >> > - Set nonrotational bit for raid array with SSD >> > - Set correct max_hw_sectors for raid5/6, which hopefuly can improve >> > performance a little bit >> > - Other minor fixes >> > >> > Thanks, >> > Shaohua >> > >> > The following changes since commit 7d1e042314619115153a0f6f06e4552c09a50e13: >> > >> > Merge tag 'usercopy-v4.8-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux (2016-09-20 17:11:19 -0700) >> > >> > are available in the git repository at: >> > >> > git://git.kernel.org/pub/scm/linux/kernel/git/shli/md.git tags/md/4.9-rc1 >> > >> > for you to fetch changes up to bb086a89a406b5d877ee616f1490fcc81f8e1b2b: >> > >> > md: set rotational bit (2016-10-03 10:20:27 -0700) >> > >> > ---------------------------------------------------------------- >> > Chao Yu (1): >> > raid5: fix to detect failure of register_shrinker >> > >> > Gayatri Kammela (5): >> > lib/raid6: Add AVX512 optimized gen_syndrome functions >> > lib/raid6: Add AVX512 optimized recovery functions >> > lib/raid6/test/Makefile: Add avx512 gen_syndrome and recovery functions >> > lib/raid6: Add AVX512 optimized xor_syndrome functions >> > raid6/test/test.c: bug fix: Specify aligned(alignment) attributes to the char arrays >> > >> > Guoqing Jiang (9): >> > md-cluster: call md_kick_rdev_from_array once ack failed >> > md-cluster: use FORCEUNLOCK in lockres_free >> > md-cluster: remove some unnecessary dlm_unlock_sync >> > md: changes for MD_STILL_CLOSED flag >> > md-cluster: clean related infos of cluster >> > md-cluster: protect md_find_rdev_nr_rcu with rcu lock >> > md-cluster: convert the completion to wait queue >> > md-cluster: introduce dlm_lock_sync_interruptible to fix tasks hang >> > md-cluster: make resync lock also could be interruptted >> > >> > Shaohua Li (5): >> > raid5: allow arbitrary max_hw_sectors >> > md/bitmap: fix wrong cleanup >> > md: fix a potential deadlock >> > raid5: handle register_shrinker failure >> > md: set rotational bit >> > >> > arch/x86/Makefile | 5 +- >> > drivers/md/bitmap.c | 4 +- >> > drivers/md/md-cluster.c | 99 ++++++--- >> > drivers/md/md.c | 44 +++- >> > drivers/md/md.h | 5 +- >> > drivers/md/raid5.c | 11 +- >> > include/linux/raid/pq.h | 4 + >> > lib/raid6/Makefile | 2 +- >> > lib/raid6/algos.c | 12 + >> > lib/raid6/avx512.c | 569 +++++++++++++++++++++++++++++++++++++++++++++++ >> > lib/raid6/recov_avx512.c | 388 ++++++++++++++++++++++++++++++++ >> > lib/raid6/test/Makefile | 5 +- >> > lib/raid6/test/test.c | 7 +- >> > lib/raid6/x86.h | 10 + >> > 14 files changed, 1111 insertions(+), 54 deletions(-) >> > create mode 100644 lib/raid6/avx512.c >> > create mode 100644 lib/raid6/recov_avx512.c >> > -- >> > To unsubscribe from this list: send the line "unsubscribe linux-raid" in >> > the body of a message to majordomo@xxxxxxxxxxxxxxx >> > More majordomo info at http://vger.kernel.org/majordomo-info.html >> >> >> >> -- >> Doug Dumitru >> EasyCo LLC -- Doug Dumitru EasyCo LLC -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html