Hi Theodore, I forgot to include two important details , the stack trace of loop0 driver and sar output, which clearly nail the problem. The losetup with Ubuntu16.05 does not have O_DIRECT support and we were not aware of bypassing of journalling if O_DIRECT combined with preallocated file scenario. Using the loop we could track the swap load using sysstat, but otherwise no other major requirement. With loop I could reproduce the problem only twice using stress-ng being run for 20 time for 15 seconds interval. The problem happens highly random so it may take more number of tries in some cases. With direct swap file I tried many times and could not see the issue. The system continues to function fine after terminating the tests. The loop is now removed and swap file is activated directly in the deployed systems where the issue was seen every couple of weeks. Awaiting for few weeks before concluding about the problem/solution. Logs below. backtrace: 6,1725650,1121675639291,-;loop0 D 0 2090 2 0x80000080 4,1725651,1121675639293,-;Call Trace: 4,1725652,1121675639295,-; __schedule+0x3d6/0x8b0 4,1725653,1121675639296,-; schedule+0x36/0x80 4,1725654,1121675639298,-; wait_transaction_locked+0x8a/0xd0 4,1725655,1121675639300,-; ? wait_woken+0x80/0x80 4,1725656,1121675639302,-; add_transaction_credits+0x1cd/0x2b0 4,1725657,1121675639303,-; ? __wake_up_common_lock+0x8e/0xc0 4,1725658,1121675639305,-; start_this_handle+0x103/0x410 4,1725659,1121675639306,-; ? _cond_resched+0x1a/0x50 4,1725660,1121675639310,-; ? kmem_cache_alloc+0x115/0x1c0 4,1725661,1121675639311,-; jbd2__journal_start+0xdb/0x1f0 4,1725662,1121675639314,-; ? ext4_dirty_inode+0x32/0x70 4,1725663,1121675639317,-; __ext4_journal_start_sb+0x6d/0x120 4,1725664,1121675639318,-; ext4_dirty_inode+0x32/0x70 4,1725665,1121675639322,-; __mark_inode_dirty+0x184/0x3b0 4,1725666,1121675639325,-; generic_update_time+0x7b/0xd0 4,1725667,1121675639326,-; ? current_time+0x32/0x70 4,1725668,1121675639328,-; file_update_time+0xbe/0x110 4,1725669,1121675639330,-; __generic_file_write_iter+0x9d/0x1f0 4,1725670,1121675639331,-; ? kmem_cache_free+0x1d1/0x1e0 4,1725671,1121675639333,-; ext4_file_write_iter+0xc4/0x3b0 4,1725672,1121675639336,-; do_iter_readv_writev+0x111/0x180 4,1725673,1121675639337,-; do_iter_write+0x87/0x1a0 4,1725674,1121675639339,-; vfs_iter_write+0x19/0x30 4,1725675,1121675639343,-; lo_write_bvec+0x69/0x110 4,1725676,1121675639344,-; loop_queue_work+0x8ff/0xa60 4,1725677,1121675639346,-; ? __switch_to_asm+0x35/0x70 4,1725678,1121675639347,-; ? __switch_to_asm+0x35/0x70 4,1725679,1121675639348,-; ? __schedule+0x3de/0x8b0 4,1725680,1121675639350,-; kthread_worker_fn+0x85/0x1f0 4,1725681,1121675639351,-; loop_kthread_worker_fn+0x1e/0x20 4,1725682,1121675639352,-; kthread+0x105/0x140 4,1725683,1121675639353,-; ? loop_get_status64+0x90/0x90 4,1725684,1121675639354,-; ? kthread_bind+0x40/0x40 4,1725685,1121675639356,-; ret_from_fork+0x35/0x40 sar output: root@maglev-master-190:/data/tmp/customers/weber/var/log/sysstat# sar -d -f sa10 | grep dev7 Linux 4.15.0-117-generic (maglev-master-1) 03/10/21 _x86_64_ (88 CPU) 00:00:01 DEV tps rkB/s wkB/s areq-sz aqu-sz await svctm %util 00:05:01 dev7-0 1.74 0.92 6.02 4.00 0.29 184.17 28.01 4.86 00:15:01 dev7-0 4.77 3.35 15.74 4.00 0.53 119.11 14.67 7.00 00:25:01 dev7-0 0.48 1.54 0.37 4.00 0.00 1.31 0.20 0.01 00:35:01 dev7-0 0.48 1.03 0.88 4.00 0.00 4.80 0.25 0.01 00:45:01 dev7-0 0.56 1.43 0.82 4.00 0.00 1.13 0.31 0.02 00:55:01 dev7-0 0.88 2.39 1.13 4.00 0.03 34.22 11.67 1.03 01:05:01 dev7-0 0.88 1.47 2.03 4.00 0.06 75.12 18.69 1.64 01:15:01 dev7-0 1.45 0.08 5.71 4.00 94.30 387.62 605.50 87.62 01:25:01 dev7-0 0.00 0.00 0.00 0.00 128.00 0.00 0.00 100.00 01:35:01 dev7-0 0.00 0.00 0.00 0.00 128.00 0.00 0.00 100.00 On Wed, Mar 17, 2021 at 10:36 PM Theodore Ts'o <tytso@xxxxxxx> wrote: > > On Wed, Mar 17, 2021 at 08:30:56PM +0530, Shashidhar Patil wrote: > > Hi Theodore, > > Thank you for the details about the journalling layer and > > insight into the block device layer. > > I think Good luck might have clicked. The swap file in our case is > > attached to a loop block device before enabling swap using swapon. > > Since loop driver processes its IO requests by calling > > vfs_iter_write() the write requests re-enter the ext4 > > filesystem/journalling code. > > Is that right ? There seems to be a possibility of cylic dependency. > > If that hypothesis is correct, you should see an example of that in > one of your stack traces; do you? The loop device creates struct file > where the file is opened using O_DIRECT. In the O_DIRECT code path, > assuming the file was fully allocate and initialized, it shouldn't > involve starting a journal handle. > > That being said, why are you using a loop device for a swap device at > all? Using a swap file directly is going to be much more efficient, > and decrease the stack depth and CPU cycles needed to do a swap out if > nothing else. If you can reliably reproduce the problem, what happens > if you use a swap file directly and cut out the loop device as a swap > device? Does it make the problem go away? > > - Ted