Re: [PATCH] ext4: guarantee already started handles to successfully finish while ro remounting

Andreas Dilger <adilger@xxxxxxxxx> · Fri, 6 May 2016 14:01:17 -0600

On May 6, 2016, at 7:00 AM, Theodore Ts'o <tytso@xxxxxxx> wrote:
> 
> On Fri, May 06, 2016 at 06:01:11AM +0000, Daeho Jeong wrote:
>>> Hmm, I'm not really comfortable with putting this hack in, since this
>>> is papering over the real problem, which is that Android is trying to
>>> use the emergency remount read-only sysrq option and this is
>>> fundamentally unsafe.  I'm not sure what else could break if it is
>>> situation normal that there is active processes busily writing to the
>>> file system and sysrq-u followed by reboot is the normal way the
>>> Android kernel does a reboot.
>> 
>>> A much better solution would be to change the Android userspace to
>>> call the FIFREEZE ioctl on each mounted file system, and then call for
>>> a reboot.
>> 
>> I agree with you. I know that current Android shutdown procedure is
>> not a safe way. But, without this patch, "even not in Android system",
>> when we trigger the emergency read-only remount while evicting inodes,
>> i_size of the inode becomes zero and the inode is not in orphan list,
>> but blocks of the inode are still allocated to the inode, because
>> ext4_truncate() will fail while stating the handle which was already started
>> by ext4_evict_inode(). This causes the filesystem inconsistency and
>> we will encounter an ext4 kernel panic in the next boot by this problem.
>> 
>> I think that this kind of filesystem crash can occur anywhere that
>> the same journal handle is repeatly used. During an atomic filesystem
>> operation, a part of the operation will succeed and the other one will fail.
> 
> Sure, but if this were a bug, then it could happen under a normal
> crash scenario.  In this particular case, under what circumstances
> could i_size be set to zero *and* the inode is not on the orphan list
> *and* ext4_evict_inode() is getting called?
> 
> If that could happen, then we could have problems without emergency
> unmount, and we should fix that problem.
> 
> The real problem here is that we want emergency unmount to be
> completely safe, but setting MS_RDONLY randomly isn't safe against file
> system corruption.  The idea is you do this only when you have no
> other choice, and the consequences would be worse --- and where you
> would be prepared to do a file system consistency check afterwards.
> 
> We can either fix Android userspace, or we could add a per-file system
> callback which tries (as much as possible) to make an emergency
> unmount safe.  In this particular case, it would probably involve
> setting the "file system is corrupt flag" to force a file system
> consistency check at the next reboot.  Because if you use emergency
> unmount, all bets are off, and there may be other problems....

The problem is that emergency remount-ro doesn't block in-progress writes,
since most operations only check the MS_RDONLY at the start of an operation.
It would be possible for do_emergency_remount() call ->freeze_fs() first for
all the filesystems, then doing the remount read-only (would need a change to
do_remount_ro() to allow this)?

That ensures the filesystem is in a (more) consistent state when force
remount-ro is called (i.e. which doesn't block or return an error if there
are writers on the filesystem).  I don't think this is an issue for normal
remount-ro, since there can't be any writes in progress.

This would also avoid the need for Android to know more about the internals
of how to remount read-only, and probably help normal sysadmins too.

Cheers, Andreas

Attachment:
signature.asc

Description: Message signed with OpenPGP using GPGMail