Re: an issue of ext4

"Dilger, Andreas" <andreas.dilger@xxxxxxxxx> · Thu, 6 Mar 2014 23:57:15 +0000

On 2014/03/05, 5:51 AM, "Theodore Ts'o" <tytso@xxxxxxx> wrote:

>On Wed, Mar 05, 2014 at 12:33:32PM +0000, Zhang, Hongchao wrote:
>> 
>> in ext4_fill_super, the variables related to statfs should be
>> initialized after journal recovery is completed.  otherwise, if a
>> large number of blocks were being allocated before the filesystem
>> crashed, then the blocks and inode counters may become negative
>> during use and report incorrect values to statfs call.
>
>The ext4_statfs() doesn't use the free blocks and inodes count from
>the superblock.  For scalability reasons, we no longer update the
>journal values in the superblock while they are in use, but rather
>compute them from the sum of the values from the blockgroup
>descriptors, and then track them via percpu counters.

Ted,
This doesn't relate to using the summary counters in the superblock.

The problem is that the percpu counters are initialized from the
group descriptors (or block and inode bitmaps if EXT4_DEBUG is on)
at mount time _before_ the journal has been replayed.  That means
journal replay can still change the group descriptors (or bitmaps)
after the counters are initialized, and statfs(), allocators, etc.
will use the wrong values for the rest of the mount.

If the journal is large and there was heavy allocation happening
before the reboot, then the counters can be significantly incorrect.
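
As a rough illustration of the ordering problem, here is a toy model in
plain C.  It is only a sketch: the toy_* structures and functions are
made up for this example and are not the ext4 code, and the "journal"
is reduced to a list of newer free-block values per group.

```c
#include <stddef.h>

/* Toy stand-ins only; none of these are real ext4 types or functions. */
struct toy_group_desc {
	long free_blocks;	/* free blocks recorded for this group */
};

struct toy_journal_rec {
	size_t group;		/* group descriptor the record updates */
	long free_blocks;	/* newer value logged before the crash */
};

/* Sum the per-group free counts, as a mount-time counter init would. */
static long toy_sum_free_blocks(const struct toy_group_desc *gd, size_t n)
{
	long sum = 0;

	for (size_t i = 0; i < n; i++)
		sum += gd[i].free_blocks;
	return sum;
}

/* "Replay" the journal: overwrite descriptors with the logged values. */
static void toy_replay_journal(struct toy_group_desc *gd,
			       const struct toy_journal_rec *rec, size_t nrec)
{
	for (size_t i = 0; i < nrec; i++)
		gd[rec[i].group].free_blocks = rec[i].free_blocks;
}

/*
 * Initialize a counter before replay, replay, then recompute.
 * Returns how many free blocks the early counter overstates.
 */
static long toy_stale_overcount(void)
{
	struct toy_group_desc gd[4] = { {1000}, {1000}, {1000}, {1000} };
	const struct toy_journal_rec log[2] = { {0, 200}, {2, 500} };

	long early = toy_sum_free_blocks(gd, 4);	/* before replay */
	toy_replay_journal(gd, log, 2);
	long late = toy_sum_free_blocks(gd, 4);		/* after replay */

	return early - late;
}
```

In this toy case the counter taken before replay overstates the free
space by 1300 blocks, and nothing corrects it for the rest of the
"mount" unless the sum is recomputed after replay.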

However, looking more closely at the upstream kernel, I see that this
code was changed by Dmitry Monakhov in v2.6.34-rc7-16-g84061e0 to
move the counter initialization after journal init (almost the same
as Hongchao's patch does), but then you submitted a patch,
v2.6.37-rc1-3-gce7e010, to initialize the percpu counters both before
and after the journal is loaded.  It isn't clear from your commit
comment why the patch to load them both before and after was needed.

It seems we hit this problem on RHEL6 (which is missing both of
these changes), and your patch made upstream look like the original
unpatched code, which loaded the counters only before the journal is
replayed, so Hongchao's patch still applied to upstream.

So I guess upstream is OK, with the exception that it isn't clear
why commit ce7e010 was made.  I guess we need to ask Eric to backport
84061e0 and ce7e010 to RHEL6, and use those patches in place of
our own in the meantime.

Cheers, Andreas
-- 
Andreas Dilger

Lustre Software Architect
Intel High Performance Data Division
