On Mon, 4 Jan 2021, Linus Torvalds wrote:
> On Mon, Jan 4, 2021 at 12:41 PM Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> wrote:
> >
> > Linus, how confident are you in those wait_on_page_bit_common()
> > changes?
>
> Pretty confident. The atomicity of the bitops themselves is fairly simple.
>
> But in the writeback bit? No. The old code would basically _loop_ if
> it was woken up and the writeback bit was set again, and would hide
> any problems with it.
>
> The new code basically goes "ok, the writeback bit was clear at one
> point, so I've waited enough".
>
> We could easily turn the "if ()" in wait_on_page_writeback() into a "while()".
>
> But honestly, it does smell to me like the bug is always in the caller
> not having serialized with whatever actually starts writeback. Hugh
> figured out one such case.
>
> This code holds the page lock, but I don't see where
> set_page_writeback() would always be done with the page lock held. So
> what really protects against PG_writeback simply being set again?
>
> The whole BUG_ON() seems entirely buggy to me.
>
> In fact, even if you hold the page lock while doing
> set_page_writeback(), since the actual IO does *NOT* hold the page
> lock, the unlock happens without it. So even if every single case of
> setting the page writeback were to hold the page lock,

I did an audit when this came up before, and though not 100% confident
in my diligence, it certainly looked that way; and others looked too
(IIRC Matthew had a patch to add a WARN_ON_ONCE or whatever, but that
didn't go upstream).

> what keeps this from happening:
>
>   CPU1 = end previous writeback
>   CPU2 = start new writeback under page lock
>   CPU3 = write_cache_pages()
>
>   CPU1                           CPU2                    CPU3
>   ----                           ----                    ----
>
>   end_page_writeback()
>     test_clear_page_writeback(page)
>     ... delayed...
>
>
>                                  lock_page();
>                                  set_page_writeback()
>                                  unlock_page()
>
>
>                                                          lock_page()
>                                                          wait_on_page_writeback();
>
>     wake_up_page(page, PG_writeback);
>     .. wakes up CPU3 ..
>
>                                                          BUG_ON(PageWriteback(page));
>
> IOW, that BUG_ON() really feels entirely bogus to me. Notice how it
> wasn't actually serialized with the waking up of the _previous_ bit.

Well. That looks so obvious now you suggest it, that I feel very stupid
for not seeing it before, so have tried hard to disprove you. But I
think you're right.

>
> Could we make the wait_on_page_writeback() just loop if it sees the
> page under writeback again?

Sure.

>
> Could we make the wait_on_page_bit_common() code say "if this is
> PG_writeback, I won't wake it up after all, because the bit is set
> again?"

Sure.

>
> But I feel it's really that end_page_writeback() itself is
> fundamentally buggy, because the "wakeup" is not atomic with the bit
> clearing _and_ it doesn't actually hold the page lock that is
> allegedly serializing this all.

And we won't be adding a lock_page() into end_page_writeback()!

>
> That raciness was what caused the "stale wakeup from previous owner"
> thing too. And I think that Hugh fixed the page re-use case, but the
> fundamental problem of end_page_writeback() kind of remained.
>
> And yes, I think this was all hidden by wait_on_page_writeback()
> effectively looping over the "PageWriteback(page)" test because of how
> wait_on_page_bit() worked.
>
> So the one-liner of changing the "if" to "while" in
> wait_on_page_writeback() should get us back to what we used to do.

I think that is the realistic way to go.

>
> Except I still get the feeling that the bug really is not in
> wait_on_page_writeback(), but in the end_page_writeback() side.
>
> Comments?
>
> I'm perfectly happy doing the one-liner. I would just be
> _happier_ with end_page_writeback() having the serialization..
>
> The real problem is that "wake_up_page(page, bit)" is not the thing
> that actually clears the bit. So there's a fundamental race between
> clearing the bit and waking something up.
>
> Which makes me think that the best option would actually be to move
> the bit clearing _into_ wake_up_page(). But that looks like a very big
> change.

I'll be surprised if that direction is even possible, without
unpleasant extra locking. If there were only one wakeup to be done,
perhaps, but potentially there are many. When I looked before, it
seemed that the clear bit needs to come before the wakeup, and the
wakeup needs to come before the clear bit. And the BOOKMARK case
drops q->lock.

It should be possible to rely on the XArray's i_pages lock rather than
the page lock for serialization, much as I did in one variant of the
patch I sent originally. Updated version appended below for show (most
of it rearrangement+cleanup rather than functional change); but I think
it's slightly incomplete (__test_set_page_writeback() should take the
i_pages lock even in the !mapping_use_writeback_tags case); and last
time around you were worried by the NULL mapping case - which I believe
just comes from an abundance of caution (who writes back without any
backing? and truncation/eviction wait_on_page_writeback()).

But I expect you to take one look and quickly opt instead for the
"while" in wait_on_page_writeback(): which is not so bad now that
you've shown a scenario which justifies it.
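
To be concrete, the "while" one-liner I have in mind is just this
(untested sketch against 5.11-rc2's mm/page-writeback.c, keeping the
existing tracepoint):

	void wait_on_page_writeback(struct page *page)
	{
		/* Writeback may have restarted: re-check after each wakeup */
		while (PageWriteback(page)) {
			trace_wait_on_page_writeback(page, page_mapping(page));
			wait_on_page_bit(page, PG_writeback);
		}
	}

so a waiter re-checks PG_writeback after every wakeup, instead of
assuming that the one observed clear is still good.
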
--- 5.11-rc2/include/linux/page-flags.h	2020-12-27 20:39:36.819923814 -0800
+++ linux/include/linux/page-flags.h	2021-01-04 17:27:32.771920607 -0800
@@ -549,7 +549,6 @@ static __always_inline void SetPageUptod
 
 CLEARPAGEFLAG(Uptodate, uptodate, PF_NO_TAIL)
 
-int test_clear_page_writeback(struct page *page);
 int __test_set_page_writeback(struct page *page, bool keep_write);
 
 #define test_set_page_writeback(page)			\
--- 5.11-rc2/include/linux/pagemap.h	2020-12-13 14:41:30.000000000 -0800
+++ linux/include/linux/pagemap.h	2021-01-04 17:27:32.775920636 -0800
@@ -660,6 +660,7 @@ static inline int lock_page_or_retry(str
  */
 extern void wait_on_page_bit(struct page *page, int bit_nr);
 extern int wait_on_page_bit_killable(struct page *page, int bit_nr);
+extern void wake_up_page_bit(struct page *page, int bit_nr);
 
 /*
  * Wait for a page to be unlocked.
--- 5.11-rc2/kernel/sched/wait_bit.c	2020-03-29 15:25:41.000000000 -0700
+++ linux/kernel/sched/wait_bit.c	2021-01-04 17:27:32.779920665 -0800
@@ -90,9 +90,8 @@ __wait_on_bit_lock(struct wait_queue_hea
 		ret = action(&wbq_entry->key, mode);
 		/*
 		 * See the comment in prepare_to_wait_event().
-		 * finish_wait() does not necessarily takes wwq_head->lock,
-		 * but test_and_set_bit() implies mb() which pairs with
-		 * smp_mb__after_atomic() before wake_up_page().
+		 * finish_wait() does not necessarily take
+		 * wwq_head->lock, but test_and_set_bit() implies mb().
 		 */
 		if (ret)
 			finish_wait(wq_head, &wbq_entry->wq_entry);
--- 5.11-rc2/mm/filemap.c	2020-12-27 20:39:37.659932212 -0800
+++ linux/mm/filemap.c	2021-01-04 17:28:59.464560914 -0800
@@ -1093,7 +1093,7 @@ static int wake_page_function(wait_queue
 	return (flags & WQ_FLAG_EXCLUSIVE) != 0;
 }
 
-static void wake_up_page_bit(struct page *page, int bit_nr)
+void wake_up_page_bit(struct page *page, int bit_nr)
 {
 	wait_queue_head_t *q = page_waitqueue(page);
 	struct wait_page_key key;
@@ -1147,13 +1147,6 @@ static void wake_up_page_bit(struct page
 	spin_unlock_irqrestore(&q->lock, flags);
 }
 
-static void wake_up_page(struct page *page, int bit)
-{
-	if (!PageWaiters(page))
-		return;
-	wake_up_page_bit(page, bit);
-}
-
 /*
  * A choice of three behaviors for wait_on_page_bit_common():
  */
@@ -1466,40 +1459,6 @@ void unlock_page(struct page *page)
 }
 EXPORT_SYMBOL(unlock_page);
 
-/**
- * end_page_writeback - end writeback against a page
- * @page: the page
- */
-void end_page_writeback(struct page *page)
-{
-	/*
-	 * TestClearPageReclaim could be used here but it is an atomic
-	 * operation and overkill in this particular case. Failing to
-	 * shuffle a page marked for immediate reclaim is too mild to
-	 * justify taking an atomic operation penalty at the end of
-	 * ever page writeback.
-	 */
-	if (PageReclaim(page)) {
-		ClearPageReclaim(page);
-		rotate_reclaimable_page(page);
-	}
-
-	/*
-	 * Writeback does not hold a page reference of its own, relying
-	 * on truncation to wait for the clearing of PG_writeback.
-	 * But here we must make sure that the page is not freed and
-	 * reused before the wake_up_page().
-	 */
-	get_page(page);
-	if (!test_clear_page_writeback(page))
-		BUG();
-
-	smp_mb__after_atomic();
-	wake_up_page(page, PG_writeback);
-	put_page(page);
-}
-EXPORT_SYMBOL(end_page_writeback);
-
 /*
  * After completing I/O on a page, call this routine to update the page
  * flags appropriately
--- 5.11-rc2/mm/page-writeback.c	2020-12-13 14:41:30.000000000 -0800
+++ linux/mm/page-writeback.c	2021-01-04 17:28:59.464560914 -0800
@@ -589,7 +589,7 @@ static void wb_domain_writeout_inc(struc
 
 /*
  * Increment @wb's writeout completion count and the global writeout
- * completion count. Called from test_clear_page_writeback().
+ * completion count. Called from end_page_writeback().
  */
 static inline void __wb_writeout_inc(struct bdi_writeback *wb)
 {
@@ -2719,49 +2719,69 @@ int clear_page_dirty_for_io(struct page
 }
 EXPORT_SYMBOL(clear_page_dirty_for_io);
 
-int test_clear_page_writeback(struct page *page)
+/**
+ * end_page_writeback - end writeback against a page
+ * @page: the page
+ */
+void end_page_writeback(struct page *page)
 {
-	struct address_space *mapping = page_mapping(page);
+	struct address_space *mapping;
 	struct mem_cgroup *memcg;
 	struct lruvec *lruvec;
-	int ret;
+	unsigned long flags;
+	int writeback;
+
+	/*
+	 * TestClearPageReclaim could be used here but it is an atomic
+	 * operation and overkill in this particular case. Failing to
+	 * shuffle a page marked for immediate reclaim is too mild to
+	 * justify taking an atomic operation penalty at the end of
+	 * every page writeback.
+	 */
+	if (PageReclaim(page)) {
+		ClearPageReclaim(page);
+		rotate_reclaimable_page(page);
+	}
 
+	mapping = page_mapping(page);
+	WARN_ON_ONCE(!mapping);
 	memcg = lock_page_memcg(page);
 	lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
+
+	dec_lruvec_state(lruvec, NR_WRITEBACK);
+	dec_zone_page_state(page, NR_ZONE_WRITE_PENDING);
+	inc_node_page_state(page, NR_WRITTEN);
+
+	if (mapping)
+		xa_lock_irqsave(&mapping->i_pages, flags);
+
+	writeback = TestClearPageWriteback(page);
+	/* No need for smp_mb__after_atomic() after TestClear */
+	if (PageWaiters(page))
+		wake_up_page_bit(page, PG_writeback);
+
 	if (mapping && mapping_use_writeback_tags(mapping)) {
 		struct inode *inode = mapping->host;
 		struct backing_dev_info *bdi = inode_to_bdi(inode);
-		unsigned long flags;
 
-		xa_lock_irqsave(&mapping->i_pages, flags);
-		ret = TestClearPageWriteback(page);
-		if (ret) {
-			__xa_clear_mark(&mapping->i_pages, page_index(page),
+		__xa_clear_mark(&mapping->i_pages, page_index(page),
 						PAGECACHE_TAG_WRITEBACK);
-			if (bdi->capabilities & BDI_CAP_WRITEBACK_ACCT) {
-				struct bdi_writeback *wb = inode_to_wb(inode);
+		if (bdi->capabilities & BDI_CAP_WRITEBACK_ACCT) {
+			struct bdi_writeback *wb = inode_to_wb(inode);
 
-				dec_wb_stat(wb, WB_WRITEBACK);
-				__wb_writeout_inc(wb);
-			}
+			dec_wb_stat(wb, WB_WRITEBACK);
+			__wb_writeout_inc(wb);
 		}
+		if (inode && !mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK))
+			sb_clear_inode_writeback(inode);
+	}
 
-		if (mapping->host && !mapping_tagged(mapping,
-						     PAGECACHE_TAG_WRITEBACK))
-			sb_clear_inode_writeback(mapping->host);
-
+	if (mapping)
 		xa_unlock_irqrestore(&mapping->i_pages, flags);
-	} else {
-		ret = TestClearPageWriteback(page);
-	}
-	if (ret) {
-		dec_lruvec_state(lruvec, NR_WRITEBACK);
-		dec_zone_page_state(page, NR_ZONE_WRITE_PENDING);
-		inc_node_page_state(page, NR_WRITTEN);
-	}
 	__unlock_page_memcg(memcg);
-	return ret;
+	BUG_ON(!writeback);
 }
+EXPORT_SYMBOL(end_page_writeback);
 
 int __test_set_page_writeback(struct page *page, bool keep_write)
 {
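
To spell out the "slightly incomplete" remark above: with
end_page_writeback() now clearing PG_writeback and waking under the
i_pages lock, __test_set_page_writeback() ought to take that same lock
when setting the bit in the !mapping_use_writeback_tags case too -
i.e. replace its plain "} else { ret = TestSetPageWriteback(page); }"
branch with something like this (rough sketch only, not tested, and
still dodging the NULL mapping question):

	} else if (mapping) {
		/* Serialize the set against end_page_writeback()'s clear+wake */
		unsigned long flags;

		xa_lock_irqsave(&mapping->i_pages, flags);
		ret = TestSetPageWriteback(page);
		xa_unlock_irqrestore(&mapping->i_pages, flags);
	} else {
		ret = TestSetPageWriteback(page);
	}

That way the bit cannot be set again in the window between the clear
and the wakeup, which is the window your CPU1/CPU2/CPU3 diagram
exploits.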