Kernel / Apache Spark benchmark with MGLRU

TLDR
====
With the MGLRU, Apache Spark took 95% CIs [9.28, 11.19]% and
[12.20, 14.93]% less wall time to sort 3 billion random integers,
respectively, under the medium- and high-concurrency conditions when
slightly overcommitting memory. There were no statistically significant
changes in wall time when sorting the same dataset under other
conditions.

Background
==========
Memory overcommit can increase utilization and, if carried out
properly, can also increase throughput. The challenges are to improve
working set estimation and to optimize page reclaim. The risks are
performance degradations and OOM kills. Short of overcoming the
challenges, the only way to reduce the risks is to underutilize memory.

Apache Spark is one of the most popular open-source big-data
frameworks. Dataset sorting is the most widely used benchmark for such
frameworks.

Matrix
======
Kernels: version [+ patchset]
* Baseline: 5.14
* Patched: 5.14 + MGLRU

Memory conditions: % of memory size
* Underutilizing: ~10% on inactive file list
* Overcommitting: ~10% swapped out

Concurrency conditions: average # of workers per CPU
* Low: 1
* Medium: 2
* High: 3

Cluster mode: local
Dataset size: 3 billion random integers (57GB text)
Total configurations: 12
Data points per configuration: 10
Total run duration (minutes) per data point: ~20

Procedure
=========
The latest MGLRU patchset for the 5.14 kernel is available at

  git fetch https://linux-mm.googlesource.com/page-reclaim \
      refs/changes/30/1430/1

Baseline and patched 5.14 kernel images are available at

  https://drive.google.com/drive/folders/1eMkQleAFGkP2vzM_JyRA21oKE0ESHBqp

<install and configure OS>
spark-shell < gen.scala

<for each kernel>
    grub2-set-default <baseline, patched>
    <for each memory condition>
        <update run_spark.sh>
        <for each concurrency condition>
            <update run_spark.sh>
            <for each data point>
                reboot
                run_spark.sh
                <collect wall time>

Hardware
========
Memory (GB): 64
CPU (total #): 32
NVMe SSD (GB):
1024

OS
==
$ cat /etc/redhat-release
Red Hat Enterprise Linux release 8.4 (Ootpa)

$ cat /proc/swaps
Filename        Type        Size        Used    Priority
/dev/nvme0n1p3  partition   32970748    0       -2

$ cat /proc/cmdline
<existing parameters> systemd.unified_cgroup_hierarchy=1

$ cat /sys/fs/cgroup/user.slice/memory.min
4294967296

$ cat /proc/sys/vm/overcommit_memory
1

$ cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]

Apache Spark
============
$ spark-shell --version
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.1.2
      /_/

Using Scala version 2.12.10, OpenJDK 64-Bit Server VM, 11.0.12
Branch HEAD
Compiled by user centos on 2021-05-24T04:27:48Z
Revision de351e30a90dd988b133b3d00fa6218bfcaba8b8
Url https://github.com/apache/spark
Type --help for more information.

$ cat gen.scala
import java.io._
import scala.collection.mutable.ArrayBuffer

object GenData {
    def main(args: Array[String]): Unit = {
        val file = new File("dataset.txt")
        val writer = new BufferedWriter(new FileWriter(file))
        val buf = ArrayBuffer(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L)

        for (_ <- 0 until 300000000) {
            for (i <- 0 until 10) {
                buf.update(i, scala.util.Random.nextLong())
            }
            writer.write(s"${buf.mkString(",")}\n")
        }
        writer.close()
    }
}
GenData.main(Array())

$ cat sort.scala
import java.time.temporal.ChronoUnit
import org.apache.spark.sql.SparkSession

object SparkSort {
    def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().getOrCreate()
        val file = sc.textFile("dataset.txt", 32)
        val start = java.time.Instant.now()
        val results = file.flatMap(_.split(",")).map(x => (x, 1)).sortByKey().takeOrdered(10)
        val finish = java.time.Instant.now()

        println(s"wall time: ${ChronoUnit.SECONDS.between(start, finish)}")
        results.foreach(println)
        spark.stop()
    }
}
SparkSort.main(Array())

$ cat run_spark.sh
spark-shell --master local\[<32, 64, 96>\] --driver-memory <52G, 62G> < sort.scala

Results
=======
Comparing the patched with the baseline kernel, Apache
Spark took 95% CIs [9.28, 11.19]% and [12.20, 14.93]% less wall time
to sort the dataset, respectively, under the medium- and
high-concurrency conditions when slightly overcommitting memory. There
were no statistically significant changes in wall time under other
conditions.

+--------------------+-----------------------+-----------------------+
| Mean wall time (s) | Underutilizing memory | Overcommitting memory |
| [95% CI]           |                       |                       |
+--------------------+-----------------------+-----------------------+
| Low concurrency    | 1037.1 / 1037.0       | 1038.2 / 1036.6       |
|                    | [-1.41, 1.21]         | [-3.67, 0.47]         |
+--------------------+-----------------------+-----------------------+
| Medium concurrency | 1141.8 / 1142.6       | 1297.9 / 1165.1       |
|                    | [-1.35, 2.95]         | [-145.21, -120.38]    |
+--------------------+-----------------------+-----------------------+
| High concurrency   | 1239.3 / 1236.4       | 1456.8 / 1259.2       |
|                    | [-7.81, 2.01]         | [-217.53, -177.66]    |
+--------------------+-----------------------+-----------------------+

Table 1. Comparison between the baseline and patched kernels
(mean wall times shown as baseline / patched)

Comparing overcommitting with underutilizing memory, Apache Spark took
95% CIs [12.58, 14.76]% and [15.95, 19.15]% more wall time to sort the
dataset, respectively, under the medium- and high-concurrency
conditions when using the baseline kernel; 95% CIs [1.78, 2.16]% and
[1.42, 2.27]% more wall time, respectively, under the medium- and
high-concurrency conditions when using the patched kernel. There were
no statistically significant changes in wall time under other
conditions.
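As a quick sanity check, the percentage CIs quoted above can be
reproduced from the absolute CIs (in seconds) printed by t.test in the
appendix, divided by the corresponding baseline mean. A minimal sketch
in Python (numbers taken from the appendix's R output):

```python
# Convert an absolute wall-time CI (seconds, as reported by t.test for
# "mean of x" minus "mean of y") into the "% less wall time" CI quoted
# in the text, relative to the baseline mean.
def pct_less(baseline_mean, ci_seconds):
    lo, hi = ci_seconds
    return (round(lo / baseline_mean * 100, 2),
            round(hi / baseline_mean * 100, 2))

# Overcommitting memory, medium concurrency (mem2 con2):
# t.test reports CI [120.3881, 145.2119] with baseline mean 1297.9.
print(pct_less(1297.9, (120.3881, 145.2119)))  # (9.28, 11.19)
```

The same conversion applied to the mem2 con3 pair (baseline mean
1456.8, CI [177.6666, 217.5334]) yields the [12.20, 14.93]% figure.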
+--------------------+------------------------+----------------------+
| Mean wall time (s) | Baseline kernel        | Patched kernel       |
| [95% CI]           |                        |                      |
+--------------------+------------------------+----------------------+
| Low concurrency    | 1037.1 / 1038.2        | 1037.0 / 1036.6      |
|                    | [-0.31, 2.51]          | [-2.43, 1.63]        |
+--------------------+------------------------+----------------------+
| Medium concurrency | 1141.8 / 1297.9        | 1142.6 / 1165.1      |
|                    | [143.68, 168.51]       | [20.33, 24.66]       |
+--------------------+------------------------+----------------------+
| High concurrency   | 1239.3 / 1456.8        | 1236.4 / 1259.2      |
|                    | [197.62, 237.37]       | [17.55, 28.04]       |
+--------------------+------------------------+----------------------+

Table 2. Comparison between underutilizing and overcommitting memory
(mean wall times shown as underutilizing / overcommitting)

Metrics collected during each run are available at

  https://github.com/ediworks/KernelPerf/tree/master/mglru/spark/5.14

Appendix
========
$ cat raw_data_spark.r
v <- c(
    # baseline 52g 32t
    1034, 1036, 1036, 1037, 1037, 1037, 1038, 1038, 1038, 1040,
    # baseline 52g 64t
    1139, 1139, 1140, 1140, 1142, 1143, 1143, 1144, 1144, 1144,
    # baseline 52g 96t
    1236, 1237, 1238, 1238, 1238, 1239, 1240, 1241, 1243, 1243,
    # baseline 62g 32t
    1036, 1036, 1038, 1038, 1038, 1038, 1039, 1039, 1040, 1040,
    # baseline 62g 64t
    1266, 1277, 1284, 1296, 1299, 1302, 1311, 1313, 1314, 1317,
    # baseline 62g 96t
    1403, 1431, 1440, 1447, 1460, 1461, 1467, 1475, 1487, 1497,
    # patched 52g 32t
    1035, 1036, 1036, 1037, 1037, 1037, 1037, 1038, 1038, 1039,
    # patched 52g 64t
    1138, 1140, 1140, 1143, 1143, 1143, 1144, 1145, 1145, 1145,
    # patched 52g 96t
    1228, 1228, 1233, 1234, 1235, 1236, 1236, 1240, 1246, 1248,
    # patched 62g 32t
    1032, 1035, 1035, 1035, 1036, 1036, 1037, 1039, 1040, 1041,
    # patched 62g 64t
    1162, 1164, 1164, 1164, 1164, 1164, 1166, 1166, 1168, 1169,
    # patched 62g 96t
    1252, 1256, 1256, 1258, 1260, 1260, 1260, 1260, 1265, 1265
)

a <- array(v, dim = c(10, 3, 2, 2))

# baseline vs patched
for (mem in 1:2) {
    for (con in 1:3) {
        r <-
            t.test(a[, con, mem, 1], a[, con, mem, 2])
        print(r)

        p <- r$conf.int * 100 / r$estimate[1]
        if ((p[1] > 0 && p[2] < 0) || (p[1] < 0 && p[2] > 0)) {
            s <- sprintf("mem%d con%d: no significance", mem, con)
        } else {
            s <- sprintf("mem%d con%d: [%.2f, %.2f]%%", mem, con, -p[2], -p[1])
        }
        print(s)
    }
}

# 52g vs 62g
for (ker in 1:2) {
    for (con in 1:3) {
        r <- t.test(a[, con, 1, ker], a[, con, 2, ker])
        print(r)

        p <- r$conf.int * 100 / r$estimate[1]
        if ((p[1] > 0 && p[2] < 0) || (p[1] < 0 && p[2] > 0)) {
            s <- sprintf("ker%d con%d: no significance", ker, con)
        } else {
            s <- sprintf("ker%d con%d: [%.2f, %.2f]%%", ker, con, -p[2], -p[1])
        }
        print(s)
    }
}

$ R -q -s -f raw_data_spark.r

	Welch Two Sample t-test

data:  a[, con, mem, 1] and a[, con, mem, 2]
t = 0.16059, df = 16.4, p-value = 0.8744
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.21749  1.41749
sample estimates:
mean of x mean of y
   1037.1    1037.0

[1] "mem1 con1: no significance"

	Welch Two Sample t-test

data:  a[, con, mem, 1] and a[, con, mem, 2]
t = -0.78279, df = 17.565, p-value = 0.4442
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2.950923  1.350923
sample estimates:
mean of x mean of y
   1141.8    1142.6

[1] "mem1 con2: no significance"

	Welch Two Sample t-test

data:  a[, con, mem, 1] and a[, con, mem, 2]
t = 1.2933, df = 11.303, p-value = 0.2217
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2.019103  7.819103
sample estimates:
mean of x mean of y
   1239.3    1236.4

[1] "mem1 con3: no significance"

	Welch Two Sample t-test

data:  a[, con, mem, 1] and a[, con, mem, 2]
t = 1.6562, df = 13.458, p-value = 0.1208
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.4799188  3.6799188
sample estimates:
mean of x mean of y
   1038.2    1036.6

[1] "mem2 con1: no significance"

	Welch Two Sample t-test

data:  a[, con, mem, 1] and a[, con, mem, 2]
t =
24.096, df = 9.2733, p-value = 1.115e-09
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 120.3881 145.2119
sample estimates:
mean of x mean of y
   1297.9    1165.1

[1] "mem2 con2: [-11.19, -9.28]%"

	Welch Two Sample t-test

data:  a[, con, mem, 1] and a[, con, mem, 2]
t = 22.289, df = 9.3728, p-value = 1.944e-09
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 177.6666 217.5334
sample estimates:
mean of x mean of y
   1456.8    1259.2

[1] "mem2 con3: [-14.93, -12.20]%"

	Welch Two Sample t-test

data:  a[, con, 1, ker] and a[, con, 2, ker]
t = -1.6398, df = 17.697, p-value = 0.1187
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2.5110734  0.3110734
sample estimates:
mean of x mean of y
   1037.1    1038.2

[1] "ker1 con1: no significance"

	Welch Two Sample t-test

data:  a[, con, 1, ker] and a[, con, 2, ker]
t = -28.33, df = 9.2646, p-value = 2.57e-10
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -168.5106 -143.6894
sample estimates:
mean of x mean of y
   1141.8    1297.9

[1] "ker1 con2: [12.58, 14.76]%"

	Welch Two Sample t-test

data:  a[, con, 1, ker] and a[, con, 2, ker]
t = -24.694, df = 9.1353, p-value = 1.12e-09
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -237.3794 -197.6206
sample estimates:
mean of x mean of y
   1239.3    1456.8

[1] "ker1 con3: [15.95, 19.15]%"

	Welch Two Sample t-test

data:  a[, con, 1, ker] and a[, con, 2, ker]
t = 0.42857, df = 12.15, p-value = 0.6757
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.630775  2.430775
sample estimates:
mean of x mean of y
   1037.0    1036.6

[1] "ker2 con1: no significance"

	Welch Two Sample t-test

data:  a[, con, 1, ker] and a[, con, 2, ker]
t = -21.865, df = 17.646, p-value = 3.151e-14
alternative hypothesis: true difference in means is not equal
to 0
95 percent confidence interval:
 -24.66501 -20.33499
sample estimates:
mean of x mean of y
   1142.6    1165.1

[1] "ker2 con2: [1.78, 2.16]%"

	Welch Two Sample t-test

data:  a[, con, 1, ker] and a[, con, 2, ker]
t = -9.2738, df = 14.72, p-value = 1.561e-07
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -28.04897 -17.55103
sample estimates:
mean of x mean of y
   1236.4    1259.2

[1] "ker2 con3: [1.42, 2.27]%"
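The R script above uses Welch's two-sample t-test, which does not
assume equal variances between the two groups. For readers without R,
the t statistic and Welch-Satterthwaite degrees of freedom can be
cross-checked with only the Python standard library; a sketch using
the "baseline 62g 64t" and "patched 62g 64t" rows (overcommitting
memory, medium concurrency):

```python
from statistics import mean, variance

# Raw wall times (s) copied from raw_data_spark.r above.
baseline = [1266, 1277, 1284, 1296, 1299, 1302, 1311, 1313, 1314, 1317]
patched  = [1162, 1164, 1164, 1164, 1164, 1164, 1166, 1166, 1168, 1169]

def welch(x, y):
    # Welch's t statistic: difference of means over the combined
    # standard error, using per-sample variances.
    nx, ny = len(x), len(y)
    vx, vy = variance(x) / nx, variance(y) / ny
    t = (mean(x) - mean(y)) / (vx + vy) ** 0.5
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (vx + vy) ** 2 / (vx ** 2 / (nx - 1) + vy ** 2 / (ny - 1))
    return t, df

t, df = welch(baseline, patched)
print(t, df)  # R reports t = 24.096, df = 9.2733 for this pair
```

This reproduces the statistic and degrees of freedom of the mem2 con2
test; the confidence interval additionally needs the t distribution's
quantile function, which R provides via qt().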