On Sat, Jul 04, 2020 at 01:20:19PM -0700, Hugh Dickins wrote:
> On Mon, 29 Jun 2020, Matthew Wilcox (Oracle) wrote:
> > Hugh, I would love it if you could test this. It didn't introduce
> > any new regressions to the xfstests, but shmem does exercise
> > different paths and of course I don't have a clean xfstests run
> > yet, so there could easily still be bugs.
>
> I have been trying it, and it's not good yet. I had hoped to work
> out what's wrong and send a patch, but I'm not making progress,
> so better hand back to you with what I've found.

Thank you so much! I've made some progress.

> First problem was that it did not quite build on top of 5.8-rc3
> plus your 1-7 THP prep patches: was there some other series we
> were supposed to add in too? If so, that might make a big
> difference, but I fixed up __add_to_page_cache_locked() as in the
> diff below (and I'm not bothering about hugetlbfs, so haven't
> looked to see if its page indexing is or isn't still exceptional
> with your series).

Oops. I shifted some patches around and clearly didn't get it quite
right. I'll fix it up.

> Second problem was fs/inode.c:530 BUG_ON(inode->i_data.nrpages),
> after warning from shmem_evict_inode(). Surprisingly, that first
> happened on a machine with CONFIG_TRANSPARENT_HUGEPAGE not set,
> while doing an "rm -rf".

I've seen that occasionally too.

> The original non-THP machine ran the same load for ten hours
> yesterday, but hit no problem. The only significant difference in
> what ran successfully is that I've been surprised by all the
> non-zero entries I saw in xarray nodes, exceeding total entry
> "count" (I've also been bothered by non-zero "offset" at root, but
> imagine that's just noise that never gets used). So I've changed
> the kmem_cache_alloc()s in lib/radix-tree.c to kmem_cache_zalloc()s,
> as in the diff below: not suggesting that as necessary, just a
> temporary precaution in case something is not being initialized as
> intended.

Umm. ->count should always be accurate and match the number of
non-NULL entries in a node. The zalloc shouldn't be necessary, and
will probably break the workingset code. Actually, it should BUG,
because we have both a constructor and an instruction to zero the
allocation, and they can't both be right.

You're right that ->offset is never used at root. I had plans to
repurpose that to support smaller files more efficiently, but never
got round to implementing those plans.

> These problems were either mm/filemap.c:1565 find_lock_entry()
> VM_BUG_ON_PAGE(page_to_pgoff(page) != offset, page); or hangs,
> which (at least the ones that I went on to investigate) turned out
> also to be find_lock_entry(), circling around with
> page_mapping(page) != mapping. It seems that find_get_entry() is
> sometimes supplying the wrong page, and you will work out why much
> quicker than I shall. (One tantalizing detail of the bad offset
> crashes: very often page pgoff is exactly one less than the
> requested offset.)
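For reference, the function Hugh is pointing at looks roughly like
this (paraphrased from 5.8-era mm/filemap.c from memory, so check the
tree rather than trusting me on the details):

	struct page *find_get_entry(struct address_space *mapping,
			pgoff_t offset)
	{
		XA_STATE(xas, &mapping->i_pages, offset);
		struct page *page;

		rcu_read_lock();
	repeat:
		xas_reset(&xas);
		page = xas_load(&xas);
		if (xas_retry(&xas, page))
			goto repeat;
		/* Shadow or swap entry; return it without taking a ref */
		if (!page || xa_is_value(page))
			goto out;
		if (!page_cache_get_speculative(page))
			goto repeat;
		/* Page moved or split under us?  Then start over. */
		if (unlikely(page != xas_reload(&xas))) {
			put_page(page);
			goto repeat;
		}
		page = find_subpage(page, offset);
	out:
		rcu_read_unlock();
		return page;
	}

That find_subpage() at the end is what should turn the compound
page's head into the tail covering @offset, so it's the obvious
place to instrument.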
I added this:

@@ -1535,6 +1535,11 @@ struct page *find_get_entry(struct address_space *mapping, pgoff_t offset)
 		goto repeat;
 	}
 	page = find_subpage(page, offset);
+	if (page_to_index(page) != offset) {
+		printk("offset %ld xas index %ld offset %d\n", offset, xas.xa_index, xas.xa_offset);
+		dump_page(page, "index mismatch");
+		printk("xa_load %p\n", xa_load(&mapping->i_pages, offset));
+	}
 out:
 	rcu_read_unlock();
 

and I have a good clue now:

1322 offset 631 xas index 631 offset 48
1322 page:000000008c9a9bc3 refcount:4 mapcount:0 mapping:00000000d8615d47 index:0x276
1322 flags: 0x4000000000002026(referenced|uptodate|active|private)
1322 mapping->aops:0xffffffff88a2ebc0 ino 1800b82 dentry name:"f1141"
1322 raw: 4000000000002026 dead000000000100 dead000000000122 ffff98ff2a8b8a20
1322 raw: 0000000000000276 ffff98ff1ac271a0 00000004ffffffff 0000000000000000
1322 page dumped because: index mismatch
1322 xa_load 000000008c9a9bc3

0x276 is decimal 630. So we're looking up a tail page and getting
its erstwhile head page. I'll dig in and figure out exactly how
that's happening.
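To spell out why that matches Hugh's "exactly one less" observation:
both helpers involved do their arithmetic relative to the head page.
Simplified (again from memory, and leaving out the hugetlbfs special
case), they look like:

	static inline struct page *find_subpage(struct page *head,
			pgoff_t offset)
	{
		/* Step from the head to the tail covering @offset;
		 * for an order-0 page the mask is zero. */
		return head + (offset & (compound_nr(head) - 1));
	}

	static inline pgoff_t page_to_index(struct page *page)
	{
		/* Tail pages don't keep ->index; derive from the head */
		return compound_head(page)->index +
				(page - compound_head(page));
	}

If an order-1 page covering indices 630-631 has been split, but the
entry at 631 still points to the erstwhile head, that page is order-0
now: the mask in find_subpage() is zero, we hand back the head
itself, and page_to_index() reports 630 for a lookup of 631. One
less, just as Hugh saw.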