Hi Jan, On Wed, Jun 10, 2020 at 10:21 AM Jan Kara <jack@xxxxxxx> wrote: > > Hello! > > Firstly, thanks for the patches and I'm sorry that it took me so long to > get to this. > No worries at all. Thank you very much for reviewing, and explaining your understanding of the problem and why this patchset doesn't fully address it. I believe I understood the rationale and the pieces involved in the solution you suggested (thanks for the level of detail), and should be able to work on it and send something within a few weeks. cheers, Mauricio > On Thu 23-04-20 20:36:54, Mauricio Faria de Oliveira wrote: > > Ted Ts'o explains, in the linux-ext4 thread [1], that currently > > data=journal + journal_checksum + mmap is not handled correctly, > > and outlines the changes to fix it (+ items/breaks for clarity): > > > > """ > > The fix is going to have to involve > > - fixing __ext4_journalled_writepage() to call set_page_writeback() > > before it unlocks the page, > > - adding a list of pages under data=journalled writeback > > which is attached to the transaction handle, > > - have the jbd2 commit hook call end_page_writeback() > > on all of these pages, > > - and then in the places where ext4 calls wait_for_stable_page() > > or grab_cache_page_write_begin(), we need to add: > > > > if (ext4_should_journal_data(inode)) > > wait_on_page_writeback(page); > > > > It's all relatively straightforward except for the part > > where we have to attach a list of pages to the currently > > running transaction. That will require adding some > > plumbing into the jbd2 layer. > > """ > > So I was pondering about this general design for some time. First let me > state here how I see the problem you are hitting in data=journal mode: > > The journalling machinery expects the buffer contents cannot change from > the moment transaction commit starts (more exactly from the moment > transaction exists LOCKED state) until the moment transaction commit > finishes. Places like jbd2_journal_get_write_access() are there exactly to > ascertain this - they copy the "to be committed" contents into a bounce > buffer (jh->b_frozen_data) or wait for commit to finish (if it's too late > for copying and the data is already in flight). data=journal mode breaks > this assumption because although ext4_page_mkwrite() calls > do_journal_get_write_access() for each page buffer (and thus makes sure > there's no commit with these buffers running at that moment), the commit > can start the moment we call ext4_journal_stop() in ext4_page_mkwrite() and > then this commit runs while buffers in the committing transaction are > writeably mapped to userspace. > > However I don't think Ted's solution actually fully solves the above > problem. Think for example about the following scenario: > > fd = open('file'); > addr = mmap(fd); > addr[0] = 'a'; > -> caused page fault, ext4_page_mkwrite() is called, page is > dirtied, all buffers are added to running transaction > pwrite(fd, buf, 1, 1); > -> page dirtied again, all buffer are dirtied in the running > transaction > > jbd2 thread commits transaction now > -> the problem with committing buffers that are writeably mapped is still > there and your patches wouldn't solve it because > ext4_journalled_writepage() didn't get called at all. > > Also, as you found out, leaving pages under writeback while we didn't even > start transaction commit that will end writeback on them is going to cause > odd stalls in various unexpected places. > > So I'd suggest a somewhat different solution. Instead of trying to mark, > when page is under writeback, do it the other way around and make jbd2 kick > ext4 on transaction commit to writeprotect journalled pages. Then, because > of do_journal_get_write_access() in ext4_page_mkwrite() we are sure pages > cannot be writeably mapped into page tables until transaction commit > finishes (or we'll copy data to commit to b_frozen_data). > > Now let me explain a bit more the "make jbd2 kick ext4 on transaction > commit to writeprotect journalled pages" part. I think we could mostly > reuse the data=ordered machinery for that. data=ordered machinery tracks > with each running transaction also a list of inodes for which we need to > flush data pages before doing commit of metadata into the journal. Now we > could use the same mechanism for data=journal mode - we'd track in the > inode list inodes with writeably mapped pages and on transaction commit we > need to writeprotect these pages using clear_page_dirty_for_io(). Probably > the best place to add inode to this list is ext4_journalled_writepage(). > To do the commit handling we probably need to introduce callbacks that jbd2 > will call instead of journal_submit_inode_data_buffers() in > journal_submit_data_buffers() and instead of > filemap_fdatawait_range_keep_errors() in > journal_finish_inode_data_buffers(). In data=ordered mode ext4 will just do > what jbd2 does in its callback, when inode's data should be journalled, the > callback will use write_cache_pages() to iterate and writeprotect all dirty > pages. The writepage callback just returns AOP_WRITEPAGE_ACTIVATE, some > care needs to be taken to redirty a page in the writepage callback if not > all its underlying buffers are part of the committing transaction or if > some buffer already has b_next_transaction set (since in that case the page > was already dirtied also against the following transaction). But it should > all be reasonably doable. > > Honza > > -- > Jan Kara <jack@xxxxxxxx> > SUSE Labs, CR -- Mauricio Faria de Oliveira