Hi Theodore,

Thank you for the details about the journalling layer and the insight
into the block device layer. I think something might have clicked.

The swap file in our case is attached to a loop block device before
swap is enabled with swapon. Since the loop driver processes its I/O
requests by calling vfs_iter_write(), the write requests re-enter the
ext4 filesystem/journalling code. Is that right? There seems to be a
possibility of a cyclic dependency.

Thanks
-Shashidhar

On Sun, Mar 14, 2021 at 9:08 AM Theodore Ts'o <tytso@xxxxxxx> wrote:
>
> On Sat, Mar 13, 2021 at 01:29:43PM +0530, Shashidhar Patil wrote:
> > > From what I can tell zswap is using writepage(), and since the swap
> > > file should be *completely* preallocated and initialized, we should
> > > never be trying to start a handle from zswap. This should prevent the
> > > deadlock from happening. If zswap is doing something which is causing
> > > ext4 to start a handle when it tries to writeout a swap page, then
> > > that would certainly be a problem. But that really shouldn't be the
> > > case.
> >
> > Yes. But the first sys_write() called by the application did
> > allocate a journal handle as required and since
> > this specific request now is waiting for IO to complete the handle is
> > not closed. Elsewhere in jbd2 task the commit_transaction is
> > blocked since there is one or more open journalling handles. Is my
> > understanding correct?
>
> Yes, that's correct. When we start a transaction commit, either
> because the 5 second commit interval has been reached, or there isn't
> enough room in the journal for a particular handle to start (when we
> start a file system mutation, we estimate the worst case number of
> blocks that might need to be modified, and hence require space in the
> journal), we first (a) stop any new handles from being started, and
> then (b) wait for all currently running handles to complete.
>
> If one handle takes a lot longer to complete than all the others,
> while we are waiting for that last handle to finish, the commit can
> not make forward progress, and no other file system operation which
> requires modifying metadata can proceed. As a result, we try to keep
> the time between starting a handle and stopping a handle as short as
> possible. For example, if possible, we will try to read a block that
> might be needed by a mutation operation *before* we start the handle.
> That's not always possible, but we try to do that whenever possible,
> and there are various tracepoints and other debugging facilities so we
> can see which types of file system mutations require holding handles
> longest, so we can try to optimize them.
>
> > 4,1737846,1121675697013,-; schedule+0x36/0x80
> > 4,1737847,1121675697015,-; io_schedule+0x16/0x40
> > 4,1737848,1121675697016,-; blk_mq_get_tag+0x161/0x250
> > 4,1737849,1121675697018,-; ? wait_woken+0x80/0x80
> > 4,1737850,1121675697020,-; blk_mq_get_request+0xdc/0x3b0
> > 4,1737851,1121675697021,-; blk_mq_make_request+0x128/0x5b0
> > 4,1737852,1121675697023,-; generic_make_request+0x122/0x2f0
> > 4,1737853,1121675697024,-; ? bio_alloc_bioset+0xd2/0x1e0
> > 4,1737854,1121675697026,-; submit_bio+0x73/0x140
> > .....
> > So all those IO requests are waiting for response from the raid port,
> > is that right?
> >
> > But the megaraid_sas driver (the system has LSI MEGARAID port) in most
> > cases handles the unresponsive behavior
> > by resetting the device. In this case the reset did not happen, maybe
> > there is some other bug in the megaraid driver.
>
> Yes, it's not necessarily a problem with the storage device or the
> host bus adapter; it could also be some kind of bug in the device
> driver --- or even the block layer, although that's much, much less
> likely (mostly because a lot of people would be complaining if that
> were the case).
>
> If you have access to a SCSI/SATA bus snooper which can be inserted in
> between the storage device (HDD/SSD) and the LSI Megaraid, that might
> be helpful in terms of trying to figure out what is going on. Failing
> that, you'll probably need to find some way to add/use debugging
> hooks/tracepoints in the driver.
>
> Good luck,
>
> - Ted
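
The commit rule described in the quoted text, (a) no new handles once a
commit has started and (b) the commit waits for every running handle to
stop, can be sketched as a toy model. This is only an illustration under
simplifying assumptions: ToyJournal and its methods are hypothetical
names, not the jbd2 API, and "blocking" is represented by returning
False rather than by actually sleeping.

```python
class ToyJournal:
    """Toy model of the commit rule above; NOT the real jbd2 interface."""

    def __init__(self):
        self.running_handles = 0
        self.commit_pending = False

    def start_handle(self):
        # Rule (a): once a commit has started, new handles must wait.
        if self.commit_pending:
            return False  # the caller would block here
        self.running_handles += 1
        return True

    def stop_handle(self):
        assert self.running_handles > 0
        self.running_handles -= 1

    def begin_commit(self):
        # E.g. the commit interval expired or the journal is short on space.
        self.commit_pending = True

    def try_finish_commit(self):
        # Rule (b): the commit proceeds only after all running handles stop.
        if self.running_handles > 0:
            return False
        self.commit_pending = False
        return True


journal = ToyJournal()
first_ok = journal.start_handle()        # sys_write() opens a handle, then waits on I/O
journal.begin_commit()                   # a commit starts while that handle is open
commit_done = journal.try_finish_commit()    # commit cannot finish yet
reentrant_ok = journal.start_handle()        # a write that needs a new handle would block
journal.stop_handle()                    # the I/O completes and the handle is stopped
commit_after_stop = journal.try_finish_commit()
```

In this model, the suspected cycle would be the case where the I/O that
the first handle is waiting on can only complete through a write that
itself needs a new handle (the reentrant_ok step). Whether the loop
device path actually needs a handle for a fully preallocated swap file
is exactly the open question in this thread; the sketch does not answer
it.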