Hi Mauricio! On Fri 20-11-20 12:27:56, Mauricio Faria de Oliveira wrote: > On Tue, Oct 27, 2020 at 1:10 PM Mauricio Faria de Oliveira > <mfo@xxxxxxxxxxxxx> wrote: > > On Tue, Oct 27, 2020 at 10:27 AM Jan Kara <jack@xxxxxxx> wrote: > > > > > > Commit afb585a97f81 "ext4: data=journal: write-protect pages on > > > j_submit_inode_data_buffers()") added calls ext4_jbd2_inode_add_write() > > > to track inode ranges whose mappings need to get write-protected during > > > transaction commits. However the added calls use wrong start of a range > > > (0 instead of page offset) and so write protection is not necessarily > > > effective. Use correct range start to fix the problem. > > > > > > Fixes: afb585a97f81 ("ext4: data=journal: write-protect pages on j_submit_inode_data_buffers()") > > > Signed-off-by: Jan Kara <jack@xxxxxxx> > > > --- > > > fs/ext4/inode.c | 5 +++-- > > > 1 file changed, 3 insertions(+), 2 deletions(-) > > > > > > Mauricio, I think this could be the reason for occasional test failures you > > > were still seeing. Can you try whether this patch fixes those for you? Thanks! > > > > > > > Thanks! Nice catch. Sure, I'll give it a try and follow up. > > TL;DR: > > 1) Thanks! The patch fixed almost 100% of the checksum failures. > 2) I can send a debug patch to verify buffer checksums before write-out. > 3) The remaining, rare checksum failures seem to be due to > a race between commit/write-protect and page writeback > related to PageChecked(), clearing pagecache dirty tag used to > write-protect. > 4) Test results statistics confirm that the occurrence of checksum > failures is really low. Glad to hear that! > Thanks for the patch! The results with v5.10-rc4 are almost 100%: > > There are now _very rare_ occasions of journal checksum failures detected; with > _zero_ recovery failures in stress-ng/crash/reboot/mount in 1187 loops > overnight. > (Previously I'd get 3-5 out of 10.) > > I plan to send the debug patch used to verify the buffer checksum in the tag > before write-out (catches the checksum failures that fail recovery in advance), > if you think it might be useful. I thought of it under CONFIG_JBD2_DEBUG. > > ... > > The remaining checksum changes due to write-protect failures _seem_ to be > a race between our write-protect with write_cache_pages() and writeback/sync. > But I'm not exactly sure as it's been hard to find the timing/steps > for both threads. > The idea is, > > Our write_cache_pages() during commit / > ext4_journalled_submit_inode_data_buffers() > depends on PAGECACHE_TAG_DIRTY being set, for pagevec_lookup_range_tag() > to put such pages in the array to be write-protected with > clear_page_dirty_for_io(). > > With a debug patch to dump_stack() callers of > __test_set_page_writeback(), that can > xas_clear_mark() it, _while_ the page is still going in our call to > write_cache_pages(), > we see this: wb_workfn() -> ext4_writepage() -> ext4_ext4_bio_write_page(), I guess there was something between wb_workfn() and ext4_writepage(), wasn't there? There should be write_cache_pages()... > i.e., _not_ going through ext4_journalled_writepage(), which > knows/waits on journal. > The other leg of ext4_writepage() _doesn't_, and thus can clear the > tag and prevent > write-protect while the journal commit / our write_cache_pages() is running. So I don't think this is quite it. If there are two writebacks racing, either of them will writeprotect the page which is enough for commit to be safe (as the mapped write will then trigger ext4_page_mkwrite() which will wait for the end of running commit in jbd2_journal_get_write_access()). Or am I missing something? But there must be still some race as you can still see occasional checksum changes... So I must be missing something. > Since the switch to either leg is PageChecked(), I guess this might be > due to clearing it right away in ext4_journalled_writepage(), before > write-protection. The write-protection happens in clear_page_dirty_for_io() call so that's before our ->writepage() callback is even called. Honza -- Jan Kara <jack@xxxxxxxx> SUSE Labs, CR