Matthew Wilcox <willy@xxxxxxxxxxxxx> writes:

> On Fri, Jul 23, 2021 at 01:23:07PM -0700, Hugh Dickins wrote:
>> I was wary because, if the (never observed) race to be fixed is in
>> swap_cluster_readahead(), why was shmem_swapin_page() being patched?
>> Not explained in its commit message, probably a misunderstanding of
>> how mm/shmem.c already manages races (and prefers not to be involved
>> in swap_info_struct stuff).
>>
>> But why do I now say it's bad? Because even if you correct the EINVAL
>> to -EINVAL, that's an unexpected error: -EEXIST is common, -ENOMEM is
>> not surprising, -ENOSPC can need consideration, but -EIO and anything
>> else just end up as SIGBUS when faulting (or as error from syscall).
>> So, 2efa33fc7f6e converts a race with swapoff to SIGBUS: not good,
>> and I think much more likely than the race to be fixed (since
>> swapoff's percpu_ref_kill() rightly comes before synchronize_rcu()).
>
> Yes, I think a lot more thought was needed here. And I would have
> preferred to start with a reproducer instead of "hey, this could
> happen". Maybe something like booting a 1GB VM, adding two 2GB swap
> partitions, swapon(partition A); run a 2GB memhog and then
>
> loop:
>     swapon(part B);
>     swapoff(part A);
>     swapon(part A);
>     swapoff(part B);
>
> to make this happen.
>
> but if it does happen, why would returning EINVAL be the right thing
> to do? We've swapped it out. It must be on swap somewhere, or we've
> really messed up. So I could see there being a race where we get
> preempted between looking up the swap entry and calling get_swap_device().
> But if that does happen, then the page gets brought in, and potentially
> reswapped to the other swap device.
>
> So returning -EEXIST here would actually work. That forces a re-lookup
> in the page cache, so we'll get the new swap entry that tells us which
> swap device the page is now on.

Yes, -EEXIST is the right error code. We already use it in
shmem_swapin_page() to deal with this kind of race condition.

> But I REALLY REALLY REALLY want a reproducer. Right now, I have a hard
> time believing this, or any of the other races can really happen.

I think the race is only theoretical too. Firstly, swapoff is a rare
operation in practice; secondly, the race window is really small.

Best Regards,
Huang, Ying

>> 2efa33fc7f6e was intending to fix a race introduced by two-year-old
>> 8fd2e0b505d1 ("mm: swap: check if swap backing device is congested
>> or not"), which added a call to inode_read_congested(). Certainly
>> relying on si->swap_file->f_mapping->host there was new territory:
>> whether actually racy I'm not sure offhand - I've forgotten whether
>> synchronize_rcu() waits for preempted tasks or not.
>>
>> But if it is racy, then I wonder if the right fix might be to revert
>> 8fd2e0b505d1 too. Convincing numbers were offered for it, but I'm
>> puzzled: because Matthew has in the past noted that the block layer
>> broke and further broke bdi congestion tracking (I don't know the
>> relevant release numbers), so I don't understand how checking
>> inode_read_congested() is actually useful there nowadays.
>
> It might be useful for NFS? I don't think congestion is broken there
> (except how does the NFS client have any idea whether the server is
> congested or not?)
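
For anyone who wants to try Matthew's suggested reproducer, the
swapon/swapoff loop could be driven from user space with something like
the sketch below. It is untested and only one way to do it: the device
paths are placeholders, the program needs CAP_SYS_ADMIN, and the 2GB
memhog is assumed to be running in parallel in the 1GB VM.

    #include <stdio.h>
    #include <sys/swap.h>

    /*
     * Untested sketch of the suggested swapon/swapoff loop.  The
     * device paths are placeholders; part A is assumed to already be
     * enabled at start, as in Matthew's description.  Stop with ^C.
     */
    int main(void)
    {
            const char *part_a = "/dev/vdb1";   /* placeholder */
            const char *part_b = "/dev/vdb2";   /* placeholder */

            for (;;) {
                    if (swapon(part_b, 0) < 0)
                            perror("swapon part B");
                    if (swapoff(part_a) < 0)
                            perror("swapoff part A");
                    if (swapon(part_a, 0) < 0)
                            perror("swapon part A");
                    if (swapoff(part_b) < 0)
                            perror("swapoff part B");
            }
            return 0;
    }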
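
And to make the -EEXIST point concrete, the handling discussed above
would look roughly like the following sketch. This is simplified, not
the exact mm/shmem.c code: declarations, locking, and the other error
paths are elided.

    /*
     * Pin the swap device before using the entry read from the page
     * cache.  If swapoff has already begun, get_swap_device() fails;
     * returning -EEXIST (instead of -EINVAL) makes the caller repeat
     * the page-cache lookup, where it will find either the page
     * itself or the entry for the new swap device.
     */
    si = get_swap_device(swap);
    if (!si) {
            error = -EEXIST;
            goto failed;
    }

    /* ... swap-in proceeds with the device pinned ... */

    put_swap_device(si);

A failure from get_swap_device() here just means the entry we read is
stale, which is the same situation -EEXIST already signals on the other
retry paths in shmem_swapin_page().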