On Sun, Jan 1, 2023 at 4:54 PM Hillf Danton <hdanton@xxxxxxxx> wrote:
>
> >  ni_lock fs/ntfs3/ntfs_fs.h:1122 [inline]

Something holds the ni_lock, so this process has blocked on it, and
this all happens inside mmap():

> >  attr_data_get_block+0x4a6/0x2e40 fs/ntfs3/attrib.c:919
> >  ntfs_file_mmap+0x4cc/0x780 fs/ntfs3/file.c:296
> >  call_mmap include/linux/fs.h:2191 [inline]
> >  mmap_region+0x1022/0x1e60 mm/mmap.c:2621
> >  do_mmap+0x8d9/0xf30 mm/mmap.c:1411
> >  vm_mmap_pgoff+0x1e5/0x2f0 mm/util.c:520

so this code holds the mmap_lock for writing, which is why all those
other processes are hung on getting it for reading for page faults
etc.

End result: ignore all those page fault processes, this mmap_lock ->
ni_lock explains them all, and they aren't the cause.

> >  folio_wait_bit_common+0x8ca/0x1390 mm/filemap.c:1297
> >  folio_lock include/linux/pagemap.h:938 [inline]
> >  truncate_inode_pages_range+0xc8d/0x1650 mm/truncate.c:421
> >  truncate_inode_pages mm/truncate.c:448 [inline]
> >  truncate_pagecache mm/truncate.c:743 [inline]
> >  truncate_setsize+0xcb/0xf0 mm/truncate.c:768
> >  ntfs_truncate fs/ntfs3/file.c:395 [inline]

.. and this thread is waiting on the page lock (well, folio, same
thing), and the IO apparently isn't completing.

And that seems to be because this one is busy reading the page, and
blocked on that same ni_lock:

> > task:syz-executor394 state:D stack:24072 pid:6048 ppid:5125 flags:0x00004004
> > Call Trace:
> >  <TASK>
> >  ni_lock fs/ntfs3/ntfs_fs.h:1122 [inline]
> >  attr_data_get_block+0x4a6/0x2e40 fs/ntfs3/attrib.c:919
> >  ntfs_get_block_vbo+0x374/0xd20 fs/ntfs3/inode.c:573
> >  do_mpage_readpage+0x98b/0x1bb0 fs/mpage.c:208
> >  mpage_read_folio+0x103/0x1d0 fs/mpage.c:379

But our debugging output looks a bit bogus:

> > Showing all locks held in the system:
> > 3 locks held by syz-executor394/5214:
> >  #0: ffff88801ee04460 (sb_writers#9){.+.+}-{0:0}, at: do_sendfile+0x61c/0xfd0 fs/read_write.c:1254
> >  #1: ffff888073930ca0 (mapping.invalidate_lock#3){.+.+}-{3:3}, at: filemap_invalidate_lock_shared include/linux/fs.h:811 [inline]
> >  #1: ffff888073930ca0 (mapping.invalidate_lock#3){.+.+}-{3:3}, at: filemap_update_page+0x72/0x550 mm/filemap.c:2478
> >  #2: ffff888073930860 (&ni->ni_lock/4){+.+.}-{3:3}, at: ni_lock fs/ntfs3/ntfs_fs.h:1122 [inline]
> >  #2: ffff888073930860 (&ni->ni_lock/4){+.+.}-{3:3}, at: attr_data_get_block+0x4a6/0x2e40 fs/ntfs3/attrib.c:919

It's showing 394/5214 as "holding" the lock, even though it's just
waiting for it - it's the one doing the readpage.

I think it's just because lockdep ends up adding the lock to the queue
before it actually gets the lock, so anybody pending will be shown as
"holding" it.

And the 5221 one:

> > 2 locks held by syz-executor394/5221:
> >  #0: ffff88802c7bc758 (&mm->mmap_lock){++++}-{3:3}, at: mmap_write_lock_killable include/linux/mmap_lock.h:87 [inline]
> >  #0: ffff88802c7bc758 (&mm->mmap_lock){++++}-{3:3}, at: vm_mmap_pgoff+0x18f/0x2f0 mm/util.c:518
> >  #1: ffff888073930860 (&ni->ni_lock/4){+.+.}-{3:3}, at: ni_lock fs/ntfs3/ntfs_fs.h:1122 [inline]
> >  #1: ffff888073930860 (&ni->ni_lock/4){+.+.}-{3:3}, at: attr_data_get_block+0x4a6/0x2e40 fs/ntfs3/attrib.c:919

is that mmap() one, which is waiting for the ni_lock too (while
holding the mmap_lock, which is why the page faulters are all
blocked).

But 5222 is interesting: it is the truncate one, and it's waiting for
the page lock, and it really does seem to hold the ni_lock:

> > 3 locks held by syz-executor394/5222:
> >  #0: ffff88801ee04460 (sb_writers#9){.+.+}-{0:0}, at: mnt_want_write+0x3b/0x80 fs/namespace.c:508
> >  #1: ffff888073930b00 (&sb->s_type->i_mutex_key#14){+.+.}-{3:3}, at: inode_lock include/linux/fs.h:756 [inline]
> >  #1: ffff888073930b00 (&sb->s_type->i_mutex_key#14){+.+.}-{3:3}, at: do_truncate+0x205/0x300 fs/open.c:63
> >  #2: ffff888073930860 (&ni->ni_lock/4){+.+.}-{3:3}, at: ni_lock fs/ntfs3/ntfs_fs.h:1122 [inline]
> >  #2: ffff888073930860 (&ni->ni_lock/4){+.+.}-{3:3}, at: ntfs_truncate fs/ntfs3/file.c:393 [inline]
> >  #2: ffff888073930860 (&ni->ni_lock/4){+.+.}-{3:3}, at: ntfs3_setattr+0x596/0xca0 fs/ntfs3/file.c:696

So I think that we have:

 - ntfs_truncate() gets the ni_lock (fs/ntfs3/file.c:393)

 - it then - while holding that lock - calls (on line 395):

     truncate_setsize ->
       truncate_pagecache ->
         truncate_inode_pages ->
           truncate_inode_pages_range ->
             folio_lock

but that deadlocks on another process that wants to read that page,
and that needs ni_lock to do so.

So yes, it does look like an ntfs3 deadlock involving ni_lock.

Anyway, the above is just me trying to make sense of the call traces
and trying to cut out all the noise. I might have mis-read something.

               Linus
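
For anyone who wants to check the reasoning above mechanically, here is a
minimal sketch of the same wait-for analysis as a toy Python model. It is not
kernel code. The task names and the find_wait_cycle() helper are made up for
illustration, every lock is treated as exclusive (the real mmap_lock and
invalidate_lock are rwsems), and it assumes the reader task owns the folio
lock while its read is in flight, which is what the traces suggest.

```python
# Toy model of the lock state read off the traces; NOT kernel code.
# Each task maps to (set of locks it really holds, lock it is blocked on).
# Task names are illustrative labels, not identifiers from the report.
TASKS = {
    "truncate":   ({"i_mutex", "ni_lock"}, "folio_lock"),
    # Assumption: the folio stays locked by the reader until its read completes.
    "readpage":   ({"invalidate_lock", "folio_lock"}, "ni_lock"),
    "mmap":       ({"mmap_lock"}, "ni_lock"),
    "page_fault": (set(), "mmap_lock"),
}


def find_wait_cycle(tasks):
    """Return the tasks forming a wait-for cycle, or None if there is none."""
    # Map each held lock to its owner (every lock modelled as exclusive).
    owner = {lock: name for name, (held, _) in tasks.items() for lock in held}
    for start in tasks:
        path = []
        current = start
        # Follow "blocked on lock -> owner of that lock" until the chain
        # ends at a free lock or comes back to a task already visited.
        while current is not None and current not in path:
            path.append(current)
            wanted = tasks[current][1]
            current = owner.get(wanted)
        if current is not None:
            return path[path.index(current):]
    return None


if __name__ == "__main__":
    print(find_wait_cycle(TASKS))
```

In this model the mmap and page_fault tasks are only queued behind the cycle;
they never show up in it, which matches the "ignore all those page fault
processes" conclusion above.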