On Mon 15-05-17 11:46:34, Johannes Weiner wrote: > We have observed across several workloads situations where kswapd and > direct reclaimers get stuck in the inode shrinker of the ext4 / mount, > causing allocation latencies across tasks in the system, while there > are dozens of gigabytes of clean page cache covering multiple disks. > > The stack traces of such an instance looks like this: > > [<ffffffff812b3225>] jbd2_log_wait_commit+0x95/0x110 > [<ffffffff812b4f29>] jbd2_complete_transaction+0x59/0x90 > [<ffffffff812668da>] ext4_evict_inode+0x2da/0x480 > [<ffffffff811f2230>] evict+0xc0/0x190 > [<ffffffff811f2339>] dispose_list+0x39/0x50 > [<ffffffff811f323b>] prune_icache_sb+0x4b/0x60 > [<ffffffff811dba71>] super_cache_scan+0x141/0x190 > [<ffffffff8116e755>] shrink_slab+0x235/0x440 > [<ffffffff81172b48>] shrink_zone+0x268/0x2d0 > [<ffffffff81172f04>] do_try_to_free_pages+0x164/0x410 > [<ffffffff81173265>] try_to_free_pages+0xb5/0x160 > [<ffffffff811656b6>] __alloc_pages_nodemask+0x636/0xb30 > [<ffffffff811acac8>] alloc_pages_current+0x88/0x120 > [<ffffffff816d4e46>] skb_page_frag_refill+0xc6/0xf0 > [<ffffffff816d4e8d>] sk_page_frag_refill+0x1d/0x80 > [<ffffffff8173f86b>] tcp_sendmsg+0x28b/0xb10 > [<ffffffff81769727>] inet_sendmsg+0x67/0xa0 > [<ffffffff816d0488>] sock_sendmsg+0x38/0x50 > [<ffffffff816d0518>] sock_write_iter+0x78/0xd0 > [<ffffffff811d774e>] do_iter_readv_writev+0x5e/0xa0 > [<ffffffff811d8468>] do_readv_writev+0x178/0x210 > [<ffffffff811d871c>] vfs_writev+0x3c/0x50 > [<ffffffff811d8782>] do_writev+0x52/0xd0 > [<ffffffff811d9810>] SyS_writev+0x10/0x20 > [<ffffffff81002910>] do_syscall_64+0x50/0xa0 > [<ffffffff817eed3c>] return_from_SYSCALL_64+0x0/0x6a > [<ffffffffffffffff>] 0xffffffffffffffff > > The inode shrinker has provisions to skip any inodes that require > writeback, to avoid tarpitting the entire system behind a single > object when there are many other pools to recycle memory from. But > that logic doesn't cover the situation where an ext4 inode is clean > but journaled and tied to a commit that yet needs to hit the platter. > > Add a superblock operation that lets the generic inode shrinker query > the filesystem whether evicting a given inode will require any IO; add > an ext4 implementation that checks whether the journal is caught up to > the commit id associated with the inode. > > Fixes: 2d859db3e4a8 ("ext4: fix data corruption in inodes with journalled data") > Signed-off-by: Johannes Weiner <hannes@xxxxxxxxxxx> OK. I have to say I'm somewhat surprised you use data journalling on some of your files / filesystems but whatever - maybe these are long symlink after all which would make sense. And I'm actually doubly surprised you can see these stack traces as these days inode_lru_isolate() checks inode->i_data.nrpages and uncommitted pages cannot be evicted from pagecache (ext4_releasepage() will refuse to free them) so I don't see how such inode can get to dispose_list(). But maybe the inode doesn't really have any pages and i_datasync_tid just happens to be set to the current transaction because it is initialized that way and we are evicting inode that was recently read from disk. Anyway if you add: "&& inode->i_data.nrpages" to the test in ext4_evict_inode() do the stalls go away? Honza -- Jan Kara <jack@xxxxxxxx> SUSE Labs, CR