In our environment we use systemd portable containers in squashfs
format, attach them to loop devices, and mount them.

NAME                  MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
loop5                   7:5    0  76.4M  0 loop
`-BaseImageM1908      252:3    0  76.4M  1 crypt /BaseImageM1908
loop6                   7:6    0    20K  0 loop
`-test_launchperf20   252:17   0   1.3M  1 crypt /app/test_launchperf20
loop7                   7:7    0    20K  0 loop
`-test_launchperf18   252:4    0   1.5M  1 crypt /app/test_launchperf18
loop8                   7:8    0     8K  0 loop
`-test_launchperf8    252:25   0    28K  1 crypt app/test_launchperf8
loop9                   7:9    0   376K  0 loop
`-test_launchperf14   252:29   0  45.7M  1 crypt /app/test_launchperf14
loop10                  7:10   0    16K  0 loop
`-test_launchperf4    252:11   0   968K  1 crypt app/test_launchperf4
loop11                  7:11   0   1.2M  0 loop
`-test_launchperf17   252:26   0 150.4M  1 crypt /app/test_launchperf17
loop12                  7:12   0    36K  0 loop
`-test_launchperf19   252:13   0   3.3M  1 crypt /app/test_launchperf19
loop13                  7:13   0     8K  0 loop
...

We have over 50 loop devices, all of which are mounted during boot.

We observed contention on loop_ctl_mutex.

Sample contention stacks:

Contention 1:
__blkdev_get()
   bdev->bd_disk->fops->open()
      lo_open()
         mutex_lock_killable(&loop_ctl_mutex); <- contention

Contention 2:
__blkdev_put()
   disk->fops->release()
      lo_release()
         mutex_lock(&loop_ctl_mutex); <- contention

On our machine (69 loop devices), the total time spent waiting for
loop_ctl_mutex during boot is ~18.8s across 8 CPUs, or 2.35s per CPU.

Scaling this lock by making it per-device eliminates this contention
entirely and improves boot time by 2s on our machine.

Pavel Tatashin (1):
  loop: scale loop device by introducing per device lock

 drivers/block/loop.c | 86 ++++++++++++++++++++++++-------------------
 drivers/block/loop.h |  1 +
 2 files changed, 48 insertions(+), 39 deletions(-)

-- 
2.25.1