On Wed 29-06-22 09:33:23, Qu Wenruo wrote:
>
>
> On 2022/6/28 16:00, Jan Kara wrote:
> > On Tue 28-06-22 08:24:07, Qu Wenruo wrote:
> > > On 2022/6/27 18:19, Jan Kara wrote:
> > > > On Sat 25-06-22 11:11:43, Christoph Hellwig wrote:
> > > > > On Fri, Jun 24, 2022 at 03:07:50PM +0200, Jan Kara wrote:
> > > > > > I'm not sure I get the context 100% right but pages getting randomly dirty
> > > > > > behind filesystem's back can still happen - most commonly with RDMA and
> > > > > > similar stuff which calls set_page_dirty() on pages it has got from
> > > > > > pin_user_pages() once the transfer is done. page_maybe_dma_pinned() should
> > > > > > be usable within filesystems to detect such cases and protect the
> > > > > > filesystem but so far neither me nor John Hubbart has got to implement this
> > > > > > in the generic writeback infrastructure + some filesystem as a sample case
> > > > > > others could copy...
> > > > >
> > > > > Well, so far the strategy elsewhere seems to be to just ignore pages
> > > > > only dirtied through get_user_pages. E.g. iomap skips over pages
> > > > > reported as holes, and ext4_writepage complains about pages without
> > > > > buffers and then clears the dirty bit and continues.
> > > > >
> > > > > I'm kinda surprised that btrfs wants to treat this so special
> > > > > especially as more of the btrfs page and sub-page status will be out
> > > > > of date as well.
> > > >
> > > > I agree btrfs probably needs a different solution than what it is currently
> > > > doing if they want to get things right. I just wanted to make it clear that
> > > > the code you are ripping out may be a wrong solution but to a real problem.
> > >
> > > IHMO I believe btrfs should also ignore such dirty but not managed by fs
> > > pages.
> > >
> > > But I still have a small concern here.
> > >
> > > Is it ensured that, after RDMA dirtying the pages, would we finally got
> > > a proper notification to fs that those pages are marked written?
> >
> > So there is ->page_mkwrite() notification happening when RDMA code calls
> > pin_user_pages() when preparing buffers.
>
> I'm wondering why page_mkwrite() is only called when preparing the buffer?

Because that's the moment when the page fault happens. After this moment we
simply give the page's physical address to the HW card and the card is free
to modify that memory as it wishes without telling the kernel about it. That
is simply how the HW is designed.

> Wouldn't it make more sense to call page_mkwrite() when the buffered is
> released from RDMA?

Well, but this is long after the page contents have been modified and in
fact the page need not be mapped into the process' virtual address space
anymore by that time (it is perfectly fine to do: addr = mmap(file), pass
addr to HW, munmap(addr)). So we don't have enough context for the
page_mkwrite() callback anymore. Essentially all we can provide is already
provided in the ->set_page_dirty() callback the filesystem gets.

> Sorry for all these dumb questions, as the core-api/pin_user_pages.rst
> still doesn't explain thing to my dumb brain...

Yeah, these things are subtle and somewhat hard to grasp...

> Another thing is, RDMA doesn't really need to respect things like page
> locked/writeback, right?

Correct.

> As to RDMA calls, all pages should be pinned and seemingly exclusive to
> them.
>
> And in that case, I think btrfs should ignore writing back those pages,
> other than doing fixing ups.
>
> As the btrfs csum requires everyone modifying the page to wait for
> writeback, or the written data will be out-of-sync with the calculated
> csum and cause future -EIO when reading it from disk.

Yes, I know. Ignoring writeback of page_maybe_dma_pinned() pages is a
reasonable choice for the fs to make. The only exception tends to be data
integrity writeback - stuff like fsync(2) or sync(2).
There the filesystem might need to write back the page to make sure
everything is consistent on disk (and stale data is not exposed) in case of
a crash. So in these special cases it may be necessary to use bounce pages
for submitting the IO (and computing checksums etc.) so that the
inconsistencies you mention above are not possible.

> > The trouble is that although later
> > page_mkclean() makes page not writeable from page tables, it may be still
> > written by RDMA code (even hours after ->page_mkwrite() notification, RDMA
> > buffers are really long-lived) and that's what eventually confuses the
> > filesystem. Otherwise set_page_dirty() is the notification that page
> > contents was changed and needs writing out...
>
> Another thing I still didn't get is, is there any explicit
> mkwrite()/set_page_dirty() calls when those page are unpinned.
>
> If no such explicit calls, these dirty pages caused by RDMA would always
> be ignored by fses (except btrfs), and would never got proper written back.

When the pages are unpinned, the holder must call set_page_dirty() to let
the rest of the kernel know that the hardware may have modified the page
contents. The filesystem can hook in there with the ->set_page_dirty()
callback if it needs to take some action.

								Honza
--
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR
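
For illustration only, the writeback policy described in this message could
be modelled in userspace roughly as below. This is a sketch, not kernel code
and not an existing implementation: struct model_page, model_writepage(),
toy_csum() and the WB_* names are all hypothetical, the maybe_dma_pinned flag
merely stands in for what page_maybe_dma_pinned() would report, and leaving
the page dirty after a bounce write is an assumption of the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SZ 4096

/* Hypothetical userspace stand-in for a page cache page. */
struct model_page {
	uint8_t data[PAGE_SZ];
	bool dirty;
	bool maybe_dma_pinned;	/* models the result of page_maybe_dma_pinned() */
};

/* Background flushing vs. data integrity writeback (fsync(2), sync(2)). */
enum wb_mode { WB_BACKGROUND, WB_INTEGRITY };

enum wb_action { WB_SKIP, WB_WRITE_IN_PLACE, WB_WRITE_BOUNCE };

struct wb_result {
	enum wb_action action;
	uint32_t csum;		/* checksum of exactly the bytes written */
};

/* Toy checksum; a real filesystem would use crc32c or similar. */
uint32_t toy_csum(const uint8_t *buf, size_t len)
{
	uint32_t sum = 0;

	for (size_t i = 0; i < len; i++)
		sum = sum * 31u + buf[i];
	return sum;
}

/* Decide how to write one page; "disk" models the on-disk block. */
struct wb_result model_writepage(struct model_page *page, enum wb_mode mode,
				 uint8_t *disk)
{
	struct wb_result res = { WB_SKIP, 0 };
	uint8_t bounce[PAGE_SZ];

	assert(page != NULL && disk != NULL);

	if (!page->dirty)
		return res;

	if (page->maybe_dma_pinned) {
		/* Background writeback: skip, the device may still write. */
		if (mode == WB_BACKGROUND)
			return res;

		/*
		 * Integrity writeback: snapshot into a bounce buffer so the
		 * checksum and the submitted data cannot diverge.
		 */
		memcpy(bounce, page->data, PAGE_SZ);
		res.csum = toy_csum(bounce, PAGE_SZ);
		memcpy(disk, bounce, PAGE_SZ);
		res.action = WB_WRITE_BOUNCE;
		/* Assumption: page stays dirty while it may still be pinned. */
		return res;
	}

	/* Not pinned: contents are stable, write directly and clean the page. */
	res.csum = toy_csum(page->data, PAGE_SZ);
	memcpy(disk, page->data, PAGE_SZ);
	page->dirty = false;
	res.action = WB_WRITE_IN_PLACE;
	return res;
}
```

In this model, a later device write into page->data after a bounce write
leaves the on-disk copy and its checksum in agreement, which is the property
the bounce pages are meant to provide.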