On Wed, Sep 4, 2013 at 7:57 AM, Jan Kara <jack@xxxxxxx> wrote: > On Thu 22-08-13 17:03:19, Andy Lutomirski wrote: >> Filesystems that defer cmtime updates should update cmtime when any >> of these events happen after a write via a mapping: >> >> - The mapping is written back to disk. This happens from all kinds >> of places, most of which eventually call ->writepages. (The >> exceptions are vmscan and migration.) >> >> - munmap is called or the mapping is removed when the process exits >> >> - msync(MS_ASYNC) is called. Linux currently does nothing for >> msync(MS_ASYNC), but POSIX says that cmtime should be updated some >> time between an mmaped write and the subsequent msync call. >> MS_SYNC calls ->writepages, but MS_ASYNC needs special handling. >> >> Filesystems are responsible for checking for pending deferred cmtime >> updates in .writepages (a helper is provided for this purpose) and >> for doing the actual update in .update_cmtime_deferred. >> >> These changes have no effect by themselves; filesystems must opt in >> by implementing .update_cmtime_deferred and removing any >> file_update_time call in .page_mkwrite. >> >> This patch does not implement the MS_ASYNC case; that's in the next >> patch. >> >> Signed-off-by: Andy Lutomirski <luto@xxxxxxxxxxxxxx> > ... >> +/** >> + * generic_update_cmtime_deferred - update cmtime after an mmapped write >> + * @mapping: The mapping >> + * >> + * This library function implements .update_cmtime_deferred. It is unlikely >> + * that any filesystem will want to do anything here except update the time >> + * (using this helper) or nothing at all (by leaving .update_cmtime_deferred >> + * NULL). >> + */ >> +void generic_update_cmtime_deferred(struct address_space *mapping) >> +{ >> + struct blk_plug plug; >> + blk_start_plug(&plug); >> + inode_update_time_writable(mapping->host); >> + blk_finish_plug(&plug); >> +} >> +EXPORT_SYMBOL(generic_update_cmtime_deferred); >> + > You can remove the pluggin here. Inode update will likely result in a > single write so there's no point. > >> @@ -1970,6 +1988,39 @@ int write_one_page(struct page *page, int wait) >> } >> EXPORT_SYMBOL(write_one_page); >> >> +void mapping_flush_cmtime(struct address_space *mapping) >> +{ >> + if (mapping_test_clear_cmtime(mapping) && >> + mapping->a_ops->update_cmtime_deferred) >> + mapping->a_ops->update_cmtime_deferred(mapping); >> +} >> +EXPORT_SYMBOL(mapping_flush_cmtime); > Hum, is there a reason for update_cmtime_deferred() operation? I can > hardly imagine anyone will want to do anything else than what > inode_update_time_writable() does so why bother? You mention tmpfs & co. > don't fit into your scheme well with which I agree so let's just keep > file_update_time() in their page_mkwrite() operation. But I don't see a > real need for avoiding the deferred cmtime logic... I think there might be odd corner cases. For example, mmap a tmpfs file, write it, and unmap it. Then, an hour later, maybe the system will be under memory pressure and page out the file. This could trigger a surprising time update. (I'm not sure this can actually happen on tmpfs, but maybe it would on some other filesystem.) Does this actually matter? A flag to turn the feature on or off would do the trick, but I don't think there's precedent for sticking a flag in a_ops. > >> + >> +void mapping_flush_cmtime_nowb(struct address_space *mapping) >> +{ >> + /* >> + * We get called from munmap and msync. Both calls can race >> + * with fs freezing. If the fs is frozen after >> + * mapping_test_clear_cmtime but before the time update, then >> + * sync_filesystem will miss the cmtime update (because we >> + * just cleared it) and we don't be able to write (because the >> + * fs is frozen). On the other hand, we can't just return if >> + * we're in the SB_FREEZE_PAGEFAULT state because our caller >> + * expects the timestamp to be synchronously updated. So we >> + * get write access without blocking, at the SB_FREEZE_FS >> + * level. If the fs is already fully frozen, then we already >> + * know we have nothing to do. >> + */ >> + >> + if (!mapping_test_cmtime(mapping)) >> + return; /* Optimization: nothing to do. */ >> + >> + if (__sb_start_write(mapping->host->i_sb, SB_FREEZE_FS, false)) { >> + mapping_flush_cmtime(mapping); >> + __sb_end_write(mapping->host->i_sb, SB_FREEZE_FS); >> + } >> +} > This is wrong because SB_FREEZE_FS level is targetted for filesystem > internal use. Also it is racy. mapping_flush_cmtime() ends up calling > mark_inode_dirty() and filesystems such as ext4 or xfs will start a > transaction to store inode in the journal. This gets freeze protection at > SB_FREEZE_FS level again. If freeze_super() sets s_writers.frozen to > SB_FREEZE_FS before this second protection, things will deadlock. Whoops -- I assumed that it was safe to recursively take freeze protection at the same level. I'm worried about the following race: Thread 1 (in munmap): Check AS_CMTIME set sb_start_pagefault Thread 2 (freezing the fs): frozen = SB_FREEZE_PAGEFAULT; sync_filesystem() Thread 1 is now stuck. It doesn't need to be, because sync_filesystem will flush out the cmtime write. But there doesn't seem to be a clean mechanism to wait for the freeze to finish. Is there a clean way to avoid this? I don't want to return immediately if a freeze is in progress, because userspace expects that munmap will update cmtime synchronously. And ugly but simple solution is: if (!mapping_test_cmtime(mapping)) return; /* Optimization: nothing to do. */ if (__sb_start_write(mapping->host->i_sb, SB_FREEZE_FS, false)) { mapping_flush_cmtime(mapping); __sb_end_write(mapping->host->i_sb, SB_FREEZE_FS); } else { /* Freeze is or was in progress. The part of freezing from SB_FREEZE_PAGEFAULT through sync_filesystem holds s_umount for write, so we can wait for it to finish by taking s_umount for read. */ down_read(&sb->s_umount); up_read(&sb->s_umount); } --Andy > > Since the callers of this function hold mmap_sem, using SB_FREEZE_PAGEFAULT > protection would be appropriate. Also since there are just two places that > need the freeze protection I'd be inclined to open code the protection in > the two places rather than hiding it in a special function. Given that this is rather subtle (I've gotten it wrong multiple times), I'd rather leave it in one place and comment it well. --Andy -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html