Fio / pmem benchmark with MGLRU TLDR ==== With the MGLRU, fio achieved 95% CIs [38.95, 40.26]%, [4.12, 6.64]% and [9.26, 10.36]% higher throughput, respectively, for random access, Zipfian (distribution) access and Gaussian (distribution) access, when the average number of jobs per CPU is 1; 95% CIs [42.32, 49.15]%, [9.44, 9.89]% and [20.99, 22.86]% higher throughput, respectively, for random access, Zipfian access and Gaussian access, when the average number of jobs per CPU is 2. Background ========== Many applications running on warehouse-scale computers heavily use POSIX read(2)/write(2) and page cache, e.g., Apache Kafka, a distributed streaming application used by "more than 80% of all Fortune 100 companies" [1] and PostgreSQL, "the world's most advanced open source relational database" [2]. Intel DC Persistent Memory, as an affordable alternative to DRAM, can deliver large capacity and data persistence. Specifically, the device used in this benchmark can achieve up to 36 GiB/s and 15 GiB/s throughput, respectively, for sequential and random read access. Our research group at the University of Rochester focuses on the intersection of computer architecture and system software. My current research interest is memory management on tiered memory systems. Matrix ====== Kernels: version [+ patchset] * Baseline: 5.15 * Patched: 5.15 + MGLRU Access patterns (4KB read): * Random (uniform) * Zipfian (theta 0.8; the recommended range is 0-2) * Gaussian (deviation 40; the possible range is 0-100) Concurrency conditions (the average number of jobs per CPU): * 1 * 2 Total file size (GB): 400 (~2x memory capacity) Total configurations: 12 Data points per configuration: 10 Total run duration (minutes) per data point: ~30 Notes ----- 1. All files were stored on pmem. Each job had the exclusive access to a single file. 2. Due to the hardware limitation when accessing remote pmem [3], numactl was used to bind the fio processes to the local pmem. Only one of the two NUMA nodes was used during the benchmark. 3. During dry runs, we observed that the throughput doesn't improve beyond 2 jobs per CPU for random access. Moreover, the patched kernel showed consistent improvements over the baseline kernel when using 3 or 4 jobs per CPU. 4. We wanted to simulate the real-world scenarios and therefore used default swap configuration (on). Moreover, we didn't observe any negative impact on performance with dry runs that disabled swap. Procedure ========= <for each kernel> grub2-reboot <baseline, patched> <for each concurrency condition> <generate test files> <for each access pattern> <for each data point> <reboot> <run fio> Hardware -------- Memory (GiB per socket): 192 CPU (# per socket): 40 Pmem (GiB per socket): 768 Fio --- $ fio -version fio-3.28 $ numactl --cpubind=0 --membind=0 fio --name=randread \ --directory=/mnt/pmem/ --size={10G, 5G} --io_size=1000TB \ --time_based --numjobs={40, 80} --ioengine=io_uring \ --ramp_time=20m --runtime=10m --iodepth=128 \ --iodepth_batch_submit=32 --iodepth_batch_complete=32 \ --rw=randread --random_distribution={random, zipf:0.8, normal:40} \ --direct=0 --norandommap --group_reporting Results ======= Throughput ---------- The patched kernel achieved substantially higher throughput for all three access patterns and two concurrency conditions. Specifically, comparing the patched with the baseline kernel, fio achieved 95% CIs [38.95, 40.26]%, [4.12, 6.64]% and [9.26, 10.36]% higher throughput, respectively, for random access, Zipfian access, and Gaussian access, when the average number of jobs per CPU is 1; 95% CIs [42.32, 49.15]%, [9.44, 9.89]% and [20.99, 22.86]% higher throughput, respectively, for random access, Zipfian access and Gaussian access, when the average number of jobs per CPU is 2. +---------------------+---------------+---------------+ | Mean MiB/s [95% CI] | 1 job / CPU | 2 jobs / CPU | +---------------------+---------------+---------------+ | Random access | 8411 / 11742 | 8417 / 12267 | | | [3275, 3387] | [3562, 4137] | +---------------------+---------------+---------------+ | Zipfian access | 14576 / 15360 | 12932 / 14181 | | | [600, 967] | [1220, 1279] | +---------------------+---------------+---------------+ | Gaussian access | 14564 / 15993 | 11513 / 14037 | | | [1348, 1508] | [2417, 2631] | +---------------------+---------------+---------------+ Table 1. Throughput comparison between the baseline and the patched kernels The patched kernel exhibited less degradation in throughput when running more concurrent jobs. Comparing 2 jobs per CPU with 1 job per CPU, fio achieved 95% CIs [-11.54, -11.02]%, [-16.91, -12.01]% and [-21.61, -20.30]% higher throughput, respectively, for random access, Zipfian access and Gaussian access, when using the baseline kernel; 95% CIs [2.04, 6.92]%, [-8.86, -6.48]% and [-12.83, -11.64]% higher throughput, respectively, for random access, Zipfian access and Gaussian access, when using the patched kernel. There were no statistically significant changes in throughput for the rest of the test matrix. +---------------------+-----------------+----------------+ | Mean MiB/s [95% CI] | Baseline kernel | Patched kernel | +---------------------+-----------------+----------------+ | Random access | 8411 / 8417 | 11741 / 12267 | | | [-55, 69] | [239, 812] | +---------------------+-----------------+----------------+ | Zipfian access | 14576 / 12932 | 15360/ 14181 | | | [-1682, -1607] | [-1361, -996] | +---------------------+-----------------+----------------+ | Gaussian access | 14565 / 11513 | 15993 / 14037 | | | [-3147, -2957] | [-2051, -1861] | +---------------------+-----------------+----------------+ Table 2. Throughput comparison between 1 job per CPU and 2 jobs per CPU Tail Latency ------------ Comparing the patched with the baseline kernel, fio experienced 95% CIs [-41.77, -40.35]% and [6.64, 13.95]% higher latency at the 99th percentile, respectively, for random access and Gaussian access, when the average number of jobs per CPU is 1; 95% CIs [-41.97, -40.59]%, [-47.74, -47.04]% and [-51.32, -50.27]% higher latency at the 99th percentile, respectively, for random access, Zipfian access and Gaussian access, when the average number of jobs per CPU is 2. There were no statistically significant changes in latency at the 99th percentile for the rest of the test matrix. +------------------------------+----------------+------------------+ | 99th percentile latency (us) | 1 job / CPU | 2 jobs / CPU | +------------------------------+----------------+------------------+ | Random access | 12466 / 7347 | 25560 / 15008 | | | [-5207, -5030] | [-10729, -10375] | +------------------------------+----------------+------------------+ | Zipfian access | 3395 / 3382 | 14563 / 7661 | | | [-131, 105] | [-6953,-6850] | +------------------------------+----------------+------------------+ | Gaussian access | 3280 / 3618 | 15611 / 7681 | | | [217, 457] | [-8012, -7848] | +------------------------------+----------------+------------------+ Table 3. Comparison of the 99th percentile latency between the baseline and the patched kernels (lower is better) Metrics collected during each run are available at: https://github.com/zhaishuang1/MglruPerf/tree/master A peek at 5.16-rc6 ------------------ We also ran the benchmark on 5.16-rc6 with swap off. However, we haven't collected enough data points to establish a 95% CI. Here are a few numbers we've collected: +----------------+------------+----------+----------------+----------+ | Access pattern | Jobs / CPU | 5.16-rc6 | 5.16-rc6-mglru | % change | +----------------+------------+----------+----------------+----------+ | Random access | 1 | 7467 | 10440 | 39.8% | +----------------+------------+----------+----------------+----------+ | Random access | 2 | 7504 | 13417 | 78.8% | +----------------+------------+----------+----------------+----------+ | Random access | 3 | 7511 | 13954 | 85.8% | +----------------+------------+----------+----------------+----------+ | Random access | 4 | 7542 | 13925 | 84.6% | +----------------+------------+----------+----------------+----------+ Reference ========= [1] https://kafka.apache.org/documentation/#design_filesystem [2] https://www.postgresql.org/docs/11/runtime-config-resource.html#RUNTIME-CONFIG-RESOURCE-MEMORY [3] System Evaluation of the Intel Optane byte-addressable NVM, MEMSYS 2019. Appendix ======== Throughput ---------- $ cat raw_data_fio.r v <- c( # baseline 40 procs random 8467.89, 8428.34, 8383.32, 8253.12, 8464.65, 8307.42, 8424.78, 8434.44, 8474.88, 8468.26, # baseline 40 procs zipf 14570.44, 14598.03, 14550.74, 14640.29, 14591.4, 14573.35, 14503.18, 14613.39, 14598.61, 14522.27, # baseline 40 procs gaussian 14504.95, 14427.23, 14652.19, 14519.47, 14557.97, 14617.92, 14555.87, 14446.94, 14678.12, 14688.33, # baseline 80 procs random 8427.51, 8267.23, 8437.48, 8432.37, 8441.4, 8454.26, 8413.13, 8412.44, 8444.36, 8444.32, # baseline 80 procs zipf 12980.12, 12946.43, 12911.95, 12925.83, 12952.75, 12841.44, 12920.35, 12924.19, 12944.38, 12967.72, # baseline 80 procs gaussian 11666.29, 11624.72, 11454.82, 11482.36, 11462.24, 11379.46, 11691.5, 11471.19, 11402.08, 11494.13, # patched 40 procs random 11706.69, 11778.1, 11774.07, 11750.07, 11744.97, 11766.65, 11727.79, 11708.41, 11745.3, 11716.45, # patched 40 procs zipf 15498.31, 14647.94, 15423.35, 15467.32, 15467.05, 15342.49, 15511.34, 15414.06, 15401.1, 15431.57, # patched 40 procs gaussian 15957.86, 15957.13, 16022.69, 16035.85, 16150.2, 15904.5, 15943.36, 16036.78, 16025.95, 15900.56, # patched 80 procs random 12568.51, 11772.25, 11622.15, 12057.66, 11971.72, 12693.36, 12399.71, 12553.23, 12242.74, 12793.34, # patched 80 procs zipf 14194.78, 14213.61, 14148.66, 14182.35, 14183.91, 14192.23, 14163.2, 14179.7, 14162.12, 14196.34, # patched 80 procs gaussian 14084.86, 13706.34, 14089.42, 14058.4, 14096.74, 14108.06, 14043.41, 14072.15, 14088.44, 14024.51 ) a <- array(v, dim = c(10, 3, 2, 2)) # baseline vs patched for (concurr in 1:2) { for (dist in 1:3) { r <- t.test(a[, dist, concurr, 1], a[, dist, concurr, 2]) print(r) p <- r$conf.int * 100 / r$estimate[1] if ((p[1] > 0 && p[2] < 0) || (p[1] < 0 && p[2] > 0)) { s <- sprintf("concurr%d dist%d: no significance", concurr, dist) } else { s <- sprintf("concurr%d dist%d: [%.2f, %.2f]%%", concurr, dist, -p[2], -p[1]) } print(s) } } # low concurr vs high concurr for (kern in 1:2) { for (dist in 1:3) { r <- t.test(a[, dist, 1, kern], a[, dist, 2, kern]) print(r) p <- r$conf.int * 100 / r$estimate[1] if ((p[1] > 0 && p[2] < 0) || (p[1] < 0 && p[2] > 0)) { s <- sprintf("kern%d dist%d: no significance", kern, dist) } else { s <- sprintf("kern%d dist%d: [%.2f, %.2f]%%", kern, dist, -p[2], -p[1]) } print(s) } } $ R -q -s -f raw_data_fio.r Welch Two Sample t-test data: a[, dist, concurr, 1] and a[, dist, concurr, 2] t = -132.15, df = 11.177, p-value < 2.2e-16 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: -3386.514 -3275.766 sample estimates: mean of x mean of y 8410.71 11741.85 [1] "concurr1 dist1: [38.95, 40.26]%" Welch Two Sample t-test data: a[, dist, concurr, 1] and a[, dist, concurr, 2] t = -9.5917, df = 9.4797, p-value = 3.463e-06 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: -967.8353 -600.7307 sample estimates: mean of x mean of y 14576.17 15360.45 [1] "concurr1 dist2: [4.12, 6.64]%" Welch Two Sample t-test data: a[, dist, concurr, 1] and a[, dist, concurr, 2] t = -37.744, df = 17.33, p-value < 2.2e-16 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: -1508.328 -1348.850 sample estimates: mean of x mean of y 14564.90 15993.49 [1] "concurr1 dist3: [9.26, 10.36]%" Welch Two Sample t-test data: a[, dist, concurr, 1] and a[, dist, concurr, 2] t = -30.144, df = 9.3334, p-value = 1.281e-10 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: -4137.381 -3562.653 sample estimates: mean of x mean of y 8417.45 12267.47 [1] "concurr2 dist1: [42.32, 49.15]%" Welch Two Sample t-test data: a[, dist, concurr, 1] and a[, dist, concurr, 2] t = -92.164, df = 13.276, p-value < 2.2e-16 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: -1279.417 -1220.931 sample estimates: mean of x mean of y 12931.52 14181.69 [1] "concurr2 dist2: [9.44, 9.89]%" Welch Two Sample t-test data: a[, dist, concurr, 1] and a[, dist, concurr, 2] t = -49.453, df = 17.863, p-value < 2.2e-16 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: -2631.656 -2417.052 sample estimates: mean of x mean of y 11512.88 14037.23 [1] "concurr2 dist3: [20.99, 22.86]%" Welch Two Sample t-test data: a[, dist, 1, kern] and a[, dist, 2, kern] t = -0.22947, df = 16.403, p-value = 0.8213 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: -68.88155 55.40155 sample estimates: mean of x mean of y 8410.71 8417.45 [1] "kern1 dist1: no significance" Welch Two Sample t-test data: a[, dist, 1, kern] and a[, dist, 2, kern] t = 91.86, df = 17.875, p-value < 2.2e-16 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: 1607.021 1682.287 sample estimates: mean of x mean of y 14576.17 12931.52 [1] "kern1 dist2: [-11.54, -11.02]%" Welch Two Sample t-test data: a[, dist, 1, kern] and a[, dist, 2, kern] t = 67.477, df = 17.539, p-value < 2.2e-16 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: 2956.815 3147.225 sample estimates: mean of x mean of y 14564.90 11512.88 [1] "kern1 dist3: [-21.61, -20.30]%" Welch Two Sample t-test data: a[, dist, 1, kern] and a[, dist, 2, kern] t = -4.1443, df = 9.0781, p-value = 0.002459 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: -812.1507 -239.0833 sample estimates: mean of x mean of y 11741.85 12267.47 [1] "kern2 dist1: [2.04, 6.92]%" Welch Two Sample t-test data: a[, dist, 1, kern] and a[, dist, 2, kern] t = 14.566, df = 9.1026, p-value = 1.291e-07 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: 996.0064 1361.5196 sample estimates: mean of x mean of y 15360.45 14181.69 [1] "kern2 dist2: [-8.86, -6.48]%" Welch Two Sample t-test data: a[, dist, 1, kern] and a[, dist, 2, kern] t = 43.826, df = 15.275, p-value < 2.2e-16 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: 1861.263 2051.247 sample estimates: mean of x mean of y 15993.49 14037.23 [1] "kern2 dist3: [-12.83, -11.64]%" 99th Percentile Latency ----------------------- $ cat raw_data_fio_lat.r v <- c( # baseline 40 procs random 12649, 12387, 12518, 12518, 12518, 12387, 12518, 12518, 12387, 12256, # baseline 40 procs zipf 3458, 3294, 3425, 3294, 3294, 3359, 3752, 3326, 3294, 3458, # baseline 40 procs gaussian 3326, 3458, 3195, 3392, 3326, 3228, 3228, 3326, 3130, 3195, # baseline 80 procs random 25560, 26084, 25560, 25560, 25297, 25297, 25822, 25560, 25560, 25297, # baseline 80 procs zipf 14484, 14615, 14615, 14484, 14484, 14615, 14615, 14615, 14615, 14484, # baseline 80 procs gaussian 15664, 15664, 15533, 15533, 15533, 15664, 15795, 15533, 15664, 15533, # patched 40 procs random 7439, 7242, 7373, 7373, 7373, 7439, 7242, 7308, 7308, 7373, # patched 40 procs zipf 3261, 3425, 3392, 3294, 3359, 3556, 3228, 3490, 3458, 3359, # patched 40 procs gaussian 3687, 3523, 3556, 3523, 3752, 3654, 3884, 3490, 3392, 3720, # patched 80 procs random 15008, 15008, 15008, 15008, 15008, 15008, 15008, 15008, 15008, 15008, # patched 80 procs zipf 7701, 7635, 7701, 7701, 7635, 7635, 7701, 7635, 7635, 7635, # patched 80 procs gaussian 7635, 7898, 7701, 7635, 7635, 7635, 7635, 7635, 7701, 7701 ) a <- array(v, dim = c(10, 3, 2, 2)) # baseline vs patched for (concurr in 1:2) { for (dist in 1:3) { r <- t.test(a[, dist, concurr, 1], a[, dist, concurr, 2]) print(r) p <- r$conf.int * 100 / r$estimate[1] if ((p[1] > 0 && p[2] < 0) || (p[1] < 0 && p[2] > 0)) { s <- sprintf("concurr%d dist%d: no significance", concurr, dist) } else { s <- sprintf("concurr%d dist%d: [%.2f, %.2f]%%", concurr, dist, -p[2], -p[1]) } print(s) } } $ R -q -s -f raw_data_fio_lat.r Welch Two Sample t-test data: a[, dist, concurr, 1] and a[, dist, concurr, 2] t = 123.52, df = 15.287, p-value < 2.2e-16 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: 5030.417 5206.783 sample estimates: mean of x mean of y 12465.6 7347.0 [1] "concurr1 dist1: [-41.77, -40.35]%" Welch Two Sample t-test data: a[, dist, concurr, 1] and a[, dist, concurr, 2] t = 0.23667, df = 16.437, p-value = 0.8158 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: -104.7812 131.1812 sample estimates: mean of x mean of y 3395.4 3382.2 [1] "concurr1 dist2: no significance" Welch Two Sample t-test data: a[, dist, concurr, 1] and a[, dist, concurr, 2] t = -5.9754, df = 16.001, p-value = 1.94e-05 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: -457.5065 -217.8935 sample estimates: mean of x mean of y 3280.4 3618.1 [1] "concurr1 dist3: [6.64, 13.95]%" Welch Two Sample t-test data: a[, dist, concurr, 1] and a[, dist, concurr, 2] t = 134.89, df = 9, p-value = 3.437e-16 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: 10374.74 10728.66 sample estimates: mean of x mean of y 25559.7 15008.0 [1] "concurr2 dist1: [-41.97, -40.59]%" Welch Two Sample t-test data: a[, dist, concurr, 1] and a[, dist, concurr, 2] t = 288.1, df = 13.292, p-value < 2.2e-16 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: 6849.566 6952.834 sample estimates: mean of x mean of y 14562.6 7661.4 [1] "concurr2 dist2: [-47.74, -47.04]%" Welch Two Sample t-test data: a[, dist, concurr, 1] and a[, dist, concurr, 2] t = 203.64, df = 17.798, p-value < 2.2e-16 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: 7848.616 8012.384 sample estimates: mean of x mean of y 15611.6 7681.1 [1] "concurr2 dist3: [-51.32, -50.27]%"