On 20-Jul-24 1:51 AM, Yu Zhao wrote:
>> However, during the weekend mglru-enabled run (with the above fix to
>> isolate_lru_folios() and also the previous two patches, truncate.patch
>> and mglru.patch, and the inode fix provided by Mateusz), another hard
>> lockup related to the lruvec spinlock was observed.
>
> Thanks again for the stress tests.
>
> I can't come up with any reasonable band-aid at this moment, i.e.,
> something not too ugly to work around a more fundamental scalability
> problem.
>
> Before I give up: what type of dirty data was written back to the nvme
> device? Was it page cache or swap?
This is what a typical dstat report looks like when we start to see
the lruvec spinlock problem.
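(The exact dstat invocation isn't in the original post; a command along
these lines, i.e. dstat's memory and swap stats sampled at a fixed
interval, produces a report in this format. The 10-second interval is
an assumption.)

  dstat -m -s 10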
------memory-usage----- ----swap---
used free buff cach| used free|
14.3G 20.7G 1467G 185M| 938M 15G|
14.3G 20.0G 1468G 174M| 938M 15G|
14.3G 20.3G 1468G 184M| 938M 15G|
14.3G 19.8G 1468G 183M| 938M 15G|
14.3G 19.9G 1468G 183M| 938M 15G|
14.3G 19.5G 1468G 183M| 938M 15G|
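The same breakdown can also be cross-checked straight from
/proc/meminfo while the workload is running. This is only a suggested
sanity check, not something captured in the original report:

  # Buffers and Cached cover the block-device buffer cache and the file
  # page cache; SwapTotal/SwapFree show how much swap is actually used.
  grep -E '^(Buffers|Cached|Dirty|Writeback|SwapTotal|SwapFree)' /proc/meminfo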
As the dstat numbers show, most of the memory sits in the buffer cache
(expected for buffered IO directly to the raw NVME partitions) and swap
is hardly used. Just to recap from the original post...
====
FIO is run with a size of 1TB on each NVME partition with different
combinations of ioengine/blocksize/mode parameters and buffered-IO.
Selected FS tests from LTP are run on 256GB partitions of all the NVME
disks. This is the typical NVME partition layout (lsblk output):
nvme2n1 259:4 0 3.5T 0 disk
├─nvme2n1p1 259:6 0 256G 0 part /data_nvme2n1p1
└─nvme2n1p2 259:7 0 3.2T 0 part
Though the workload includes many different runs, the combination that
triggers the problem is the buffered-IO run with the sync ioengine:
fio -filename=/dev/nvme1n1p2 -direct=0 -thread -size=1024G \
-rwmixwrite=30 --norandommap --randrepeat=0 -ioengine=sync -bs=4k \
-numjobs=400 -runtime=25000 --time_based -group_reporting -name=mytest
====
Regards,
Bharata.