Re: strange ext{3,4}_settattr logic

On Mar 15, 2008  19:07 +0300, Dmitri Monakhov wrote:
> I've found that the ext3_setattr() code has some strange logic. I'm talking
> about the truncate path.
> 
> int ext3_setattr(struct dentry *dentry, struct iattr *attr)
> {
> ...
> 	if (S_ISREG(inode->i_mode) &&
>             attr->ia_valid & ATTR_SIZE && attr->ia_size < inode->i_size) {
>                 handle_t *handle;
> <<< This is the shrinking case, and according to the function comment:
> <<< "In particular, we want to make sure that when the VFS
> <<< * shrinks i_size, we put the inode on the orphan list and modify
> <<< * i_disksize immediately"
> <<< we are about to write i_disksize. But WHY do we have to do it explicitly?
> <<< Later inode_setattr() will call ext3_truncate(), which will do
> <<< this work for us.

The reason i_disksize is updated here, rather than left to ext3_truncate(),
is that the new size and the orphan list entry must be recorded together
before the journal handle is stopped.  Once that is done, then in case of a
crash the orphan recovery code will detect the unfinished truncate and
complete it before mounting the filesystem.

Without this it is possible to get a partial truncate after a crash because
the truncate may span several transactions due to the potentially large
number of blocks that need to be modified.  What is important with ext3
is that because e2fsck is not run on each boot whatever is on disk needs
to be consistent after a crash.
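
To make the ordering concrete, here is a toy userspace model of a shrinking
truncate that is interrupted part way through.  It is only a sketch: every
name in it (trunc_model, toy_begin_truncate(), TOY_BLOCKS_PER_TXN and so on)
is invented for illustration and none of it is kernel code.  The per-
transaction block limit is an arbitrary assumption.

```c
#include <assert.h>

/* Assumed limit on blocks freed per transaction (arbitrary for this model). */
#define TOY_BLOCKS_PER_TXN 4UL

/* Hypothetical stand-in for the on-disk state of one inode. */
struct trunc_model {
    unsigned long disksize;   /* size recorded on disk, in blocks */
    unsigned long nblocks;    /* blocks still allocated on disk */
    int on_orphan_list;       /* is the inode recorded as an orphan? */
};

/* First transaction: record the intent (orphan entry + new size) together. */
void toy_begin_truncate(struct trunc_model *ino, unsigned long new_size)
{
    assert(new_size < ino->disksize);   /* shrinking case only */
    ino->on_orphan_list = 1;
    ino->disksize = new_size;
}

/*
 * Free blocks beyond disksize, one transaction at a time.
 * max_txns < 0 means "no limit"; otherwise a crash is simulated after
 * max_txns transactions.  Returns 1 if the truncate completed, 0 if not.
 */
int toy_free_blocks(struct trunc_model *ino, int max_txns)
{
    int txns = 0;

    while (ino->nblocks > ino->disksize) {
        unsigned long extra = ino->nblocks - ino->disksize;

        if (max_txns >= 0 && txns == max_txns)
            return 0;                   /* simulated crash */
        ino->nblocks -= extra < TOY_BLOCKS_PER_TXN ? extra : TOY_BLOCKS_PER_TXN;
        txns++;
    }
    ino->on_orphan_list = 0;            /* finished: drop the orphan record */
    return 1;
}

/* Mount-time recovery: finish whatever the orphan record says is pending. */
void toy_recover(struct trunc_model *ino)
{
    if (ino->on_orphan_list)
        toy_free_blocks(ino, -1);
}
```

If the crash happens after toy_begin_truncate(), recovery has both pieces of
information it needs: the orphan flag says the inode needs attention and
disksize says where to truncate to.  Without the first step, the partially
freed file would simply stay that way.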

If a file is being truncated or unlinked when the system crashes, the
operation needs to be completed after the crash or the blocks will be
leaked.  To ensure this happens, there
is a singly-linked list of inodes on the disk called the "orphan list"
that keeps track of all inodes currently undergoing truncate or unlink.
After a crash the kernel or e2fsck will walk this list and finish the
truncate or unlink of the inode, freeing the blocks.
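
A rough userspace sketch of such a list and of the recovery walk follows.
The structures and function names (toy_fs, toy_disk_inode, toy_orphan_add(),
toy_orphan_cleanup()) are made up for this example and the layout is
simplified; it only assumes what is described above, namely a list head plus
a per-inode "next" link, with inode number 0 marking the end of the list.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_NINODES 16   /* inode numbers 1..15; 0 means "end of list" */

/* Hypothetical, simplified on-disk inode. */
struct toy_disk_inode {
    uint32_t nlink;        /* 0 means the file was unlinked */
    uint32_t disksize;     /* size to truncate to, in blocks */
    uint32_t nblocks;      /* blocks currently allocated */
    uint32_t next_orphan;  /* next inode number on the list, 0 = end */
};

/* Hypothetical filesystem: a list head plus an inode table. */
struct toy_fs {
    uint32_t last_orphan;  /* head of the orphan list */
    struct toy_disk_inode inodes[TOY_NINODES];
};

/* Push inode number ino onto the head of the singly-linked list. */
void toy_orphan_add(struct toy_fs *fs, uint32_t ino)
{
    assert(ino > 0 && ino < TOY_NINODES);
    fs->inodes[ino].next_orphan = fs->last_orphan;
    fs->last_orphan = ino;
}

/*
 * Walk the list after a crash.  Unlinked inodes lose all their blocks,
 * linked inodes are truncated to disksize.  Returns the blocks freed.
 */
uint32_t toy_orphan_cleanup(struct toy_fs *fs)
{
    uint32_t freed = 0;

    while (fs->last_orphan != 0) {
        struct toy_disk_inode *di = &fs->inodes[fs->last_orphan];
        uint32_t keep = di->nlink ? di->disksize : 0;

        if (di->nblocks > keep) {
            freed += di->nblocks - keep;
            di->nblocks = keep;
        }
        fs->last_orphan = di->next_orphan;  /* take it off the list */
        di->next_orphan = 0;
    }
    return freed;
}
```

Note that the walk needs nothing beyond what is on disk: the list head, the
link in each inode, the link count and the recorded size.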

>         rc = inode_setattr(inode, attr);
> <<< Now the most interesting question. What do we have to do now in
> <<< case of error? We are in a tricky situation. The truncate has not
> <<< happened and the blocks are visible to the user, but i_disksize was
> <<< already written, so later memory reclaim/read_inode will result in an
> <<< unexpected update of i_size.

The only ways inode_setattr() can fail are:
- expanding vmtruncate hits EFBIG, but we checked that above
- shrinking vmtruncate on a swapfile returns ETXTBSY.  This was added
  after the ext3_setattr() code was written.

If the ext3_truncate() or mark_inode_dirty() call fails, it does not
return an error code.  For ext3 the only way this can fail is if the
journal is aborted, which means the filesystem is already in read-only
mode and nothing can be done to clean up the truncate until the next
mount, at which point the orphan recovery code discussed above will
finish the operation.

>         /* If inode_setattr's call to ext3_truncate failed to get a
>          * transaction handle at all, we need to clean up the in-core
>          * orphan list manually. */
> <<< The following code will remove the inode only from the in-memory
> <<< orphan list (because handle = NULL). Could someone please explain
> <<< what these lines are actually supposed to do?
>         if (inode->i_nlink)
>                 ext3_orphan_del(NULL, inode);

This only matters if the operation above failed.  ext3_truncate() normally
removes the inode from the orphan list when it finishes, but here we cannot
be sure that it was called, so we remove the inode again.  The call is safe
even if the inode is not on the list, and it ensures we don't hit the
J_ASSERT() in the unmount code (ext3_put_super()) that checks that the
orphan list is empty.
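
As an illustration of why the second call is harmless, here is a small
userspace sketch of an in-core list with an idempotent delete.  The names
(toy_list, toy_mem_inode, toy_orphan_del(), toy_put_super()) are invented
for this example; a NULL handle stands in for "no transaction available",
in which case only the in-memory list is touched.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal circular doubly-linked list, in the style of a kernel list head. */
struct toy_list { struct toy_list *prev, *next; };

void toy_list_init(struct toy_list *l) { l->prev = l->next = l; }
int toy_list_empty(const struct toy_list *l) { return l->next == l; }

void toy_list_add(struct toy_list *item, struct toy_list *head)
{
    item->next = head->next;
    item->prev = head;
    head->next->prev = item;
    head->next = item;
}

/* Hypothetical in-core inode: list linkage plus a flag for the disk copy. */
struct toy_mem_inode {
    struct toy_list orphan;
    int on_disk_orphan;
};

/* Returns 1 if the inode was removed, 0 if it was not on the list. */
int toy_orphan_del(void *handle, struct toy_mem_inode *ino)
{
    if (toy_list_empty(&ino->orphan))
        return 0;                       /* already gone: nothing to do */

    ino->orphan.prev->next = ino->orphan.next;
    ino->orphan.next->prev = ino->orphan.prev;
    toy_list_init(&ino->orphan);        /* so a repeat call sees "empty" */

    if (handle != NULL)
        ino->on_disk_orphan = 0;        /* only with a valid transaction */
    return 1;
}

/* Unmount-time check, modelled on the assertion mentioned above. */
void toy_put_super(const struct toy_list *orphans)
{
    assert(toy_list_empty(orphans));
}
```

Calling toy_orphan_del(NULL, ...) twice in a row is fine: the first call
unlinks the inode from the in-memory list and leaves the on-disk state
alone, and the second returns immediately.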

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
