Re: howto downgrade ext4 to ext3

Theodore Tso wrote:
On Fri, Sep 18, 2009 at 11:21:08PM +0200, jehan procaccia wrote:
I would love to test that option (-o nodelalloc) instead of moving back to ext3. However, I don't understand what it is... Am I taking a risk in terms of data integrity if I set it, or just losing performance? Anyway, I'm not sure it is available: when I search for it in "man mount", I can't find it. Is it an undocumented option?

The mount man page is part of the util-linux package, and so it tends
to get updated a bit slower than the kernel.  The ext4 mount options
are fully documented in the kernel documentation; so if you install
the kernel-doc RPM, and look in the Documentation/filesystems/ext4.txt
you'll get a comprehensive list of ext4 mount options.  (Well, as
comprehensive as we can make it; occasionally we forget to update it,
but in general we've been pretty good at documenting everything.)

(Checking....)

Ugh, the description for nodelalloc in ext4.txt is pretty horrible;
it doesn't even parse as a valid English sentence.  I don't know how
that slipped by me (Mingming, Eric; can either of you see if your
respective companies can snag us a tech writer resource for a day or
two?), but I'll get that one fixed up.
Indeed, there's not a lot, and it's not very understandable:
$ less /usr/share/doc/kernel-doc-2.6.18/Documentation/filesystems/ext4.txt
delalloc        (*)     Deferring block allocation until write-out time.
nodelalloc              Disable delayed allocation. Blocks are allocation
                       when data is copied from user to page cache.
Anyway, delayed allocation is a feature of ext4 which allows us to
delay allocating blocks until the very last minute --- when the VM
page writeback routine decides it's time to write dirty pages to disk
(aka "cleaning pages", or "when the page cleaner runs" --- yeah, OS
programmers sometimes like to perpetuate some really horrible puns),
or when a program explicitly forces a file to be written to disk via
the fsync() system call.  This allows the block allocator to make more
intelligent decisions, which tends to avoid disk fragmentation and
tends to increase performance.  Delayed allocation is one of the
reasons why simply mounting an unconverted ext2 or ext3 filesystem
using the ext4 file system driver can result in better performance.

OK, understood ...
The problem is that in older kernels, we didn't properly
account for quota.  Since we don't attempt to allocate blocks until
the page cleaner runs, which could potentially be well after the
program which wrote the data has exited, the out-of-quota error
only gets noticed when the delayed allocation writepages function is
trying to clean up dirty pages.  This is a "should never happen"
situation, and to avoid causing the VM to loop forever trying to write
pages where the write operation would never succeed, the writepages
function prints an extremely scary message --- and then throws away the
user's data.
That part is a bit obscure to me... If I understood correctly, you have described the situation I ran into?
By using the nodelalloc mount option, ext4 will try to allocate blocks
while processing each and every write(2) system call.  This allows
quota to be checked right away, and if the user is over quota, the
write system call will return an error right away.  This is less
efficient in terms of CPU usage, and the block allocator will not be
able to do as good a job, since it doesn't know how big the file
will ultimately be when it is doing block-by-block allocation.
However, it avoids the nasty bug that happens when the user hits an
over-quota situation in the delalloc writepage function --- and it's
no worse than what ext3 does.
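
To confirm that a filesystem really is mounted with nodelalloc, one option is to look at its entry in /proc/mounts. The following is a minimal shell sketch, not something from the original reply: the helper name has_mount_option is made up for illustration, /disk00 is the mount point mentioned in this thread, and it assumes the running kernel lists nodelalloc among the options shown in /proc/mounts when delayed allocation is disabled.

```shell
# has_mount_option MOUNTS_FILE MOUNT_POINT OPTION   (hypothetical helper)
# Exits 0 if MOUNT_POINT appears in MOUNTS_FILE with OPTION among its
# comma-separated options, 1 otherwise. MOUNTS_FILE uses the /proc/mounts
# layout: device mountpoint fstype options dump pass
has_mount_option() {
    awk -v mp="$2" -v opt="$3" '
        $2 == mp {
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++)
                if (opts[i] == opt) found = 1
        }
        END { exit(found ? 0 : 1) }
    ' "$1"
}

# Example: /disk00 is the mount point from this thread; adjust as needed.
if has_mount_option /proc/mounts /disk00 nodelalloc; then
    echo "/disk00: nodelalloc is set"
else
    echo "/disk00: nodelalloc not set (or /disk00 is not mounted)"
fi
```

Options are compared as whole words, so a plain "delalloc" entry is not mistaken for "nodelalloc".
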
OK, that's where I should go now: mount with nodelalloc and accept lower performance, but no more "should never happen" situation ;-).
In more modern kernels, we've added quota checking in the write(2)
system call, so that even if we're not allocating the blocks right away
(and so don't yet know where the block will be located on disk), we charge
the block against the user's quota right away, and the write(2) system call
can signal the over-quota situation to the user program.
Unfortunately, these patches aren't present in the version of ext4
that was backported to RHEL 5.4.

From which kernel version have you "added quota checking in the write(2) system call"? Should the problem no longer arise with a recent kernel while still using delalloc? Would 2.6.30 be OK? For RHEL, the Fedora project has more recent kernel packages as source RPMs:
kernel-2.6.29.4-167.fc11.src.rpm or kernel-2.6.31-33.fc12.src.rpm
Recompiling one of these for RHEL 5.4 could probably be a workaround instead of using nodelalloc?
But now, how can I check that there are no more problems on that specific partition (/disk00)?
When the kernel complains this way, for example:
Sep 16 18:06:45 gizeh kernel: mpage_da_map_blocks block allocation failed for inode 39419 at logical offset 0 with max blocks 2 with error -122
Sep 16 18:06:45 gizeh kernel: This should not happen.!! Data will be lost
I have no indication of which partition that inode is on. There are so many error messages like this that it won't be easy to tell that none of them comes from /disk00.

Well, error code 122 is EDQUOT, or "Quota exceeded".  So it's very
likely that this is some other partition.  This is a bug; we really
should print the disk that was involved, and not just the inode number.
I'll fix that in future kernels (but of course that won't help you for
RHEL 5.4).  What you can do to prove this is to check a quota report,
and see which users are over quota.  You can then check all of your
ext4 partitions to see which has an inode 39419 which is owned by one
of your over-quota users, using debugfs:

   debugfs -c -R "stat <39419>" /dev/sdXXX
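
With many such messages in the log, the lookup above can be prepared with a small script. This is only a sketch under assumptions: the helper names failed_inodes and print_debugfs_commands are hypothetical, the message format is taken from the log lines quoted in this thread, and /dev/sdXXX remains a placeholder for your own ext4 devices. It only prints the debugfs commands (on RHEL 5.4 the binary may be called debuge4fs, as in the transcript further down); it does not run them.

```shell
# failed_inodes ERRNO < logfile   (hypothetical helper)
# Prints the unique inode numbers from mpage_da_map_blocks failures whose
# error code is -ERRNO (122 is EDQUOT), one per line, in numeric order.
failed_inodes() {
    sed -n "s/.*mpage_da_map_blocks block allocation failed for inode \([0-9][0-9]*\) .* with error -$1 *\$/\1/p" | sort -un
}

# print_debugfs_commands INODE DEVICE...   (hypothetical helper)
# Dry run: prints one debugfs invocation per device; nothing is executed.
print_debugfs_commands() {
    inode=$1
    shift
    for dev in "$@"; do
        printf 'debugfs -c -R "stat <%s>" %s\n' "$inode" "$dev"
    done
}

# Demo on the message quoted in this thread; /dev/sdXXX is a placeholder.
printf '%s\n' 'Sep 16 18:06:45 gizeh kernel: mpage_da_map_blocks block allocation failed for inode 39419 at logical offset 0 with max blocks 2 with error -122' |
    failed_inodes 122 |
    while read -r ino; do
        print_debugfs_commands "$ino" /dev/sdXXX
    done
```

In practice the log file (for example the syslog file that receives kernel messages) would be fed to failed_inodes on stdin, and every ext4 device passed to print_debugfs_commands.
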

Good. Indeed, I only get -122 errors, and thanks to the search example I noticed that those errors happened only for users who are apparently over quota. Here's an example:

gizeh kernel: mpage_da_map_blocks block allocation failed for inode 3542694 at logical offset 0 with max blocks 1 with error -122
Message from syslogd@ at Sat Sep 19 21:08:03 2009 ...

[root@gizeh ~]
$ debuge4fs -c -R "stat <3542694>" /dev/mapper/VolGroup02S2IA-LVVG02Users07
debuge4fs 1.41.5 (23-Apr-2009)
/dev/mapper/VolGroup02S2IA-LVVG02Users07: catastrophic mode - not reading inode or group bitmaps
Inode: 3542694   Type: regular    Mode:  0644   Flags: 0x80000
Generation: 2336084861    Version: 0x00000000:00000001
User: 42658   Group:   426   Size: 0
File ACL: 0    Directory ACL: 0
Links: 1   Blockcount: 0
Fragment:  Address: 0    Number: 0    Size: 0
ctime: 0x4ab52c13:81a9f0d4 -- Sat Sep 19 21:08:03 2009
atime: 0x4ab52c13:816ce76c -- Sat Sep 19 21:08:03 2009
mtime: 0x4ab52c13:816ce76c -- Sat Sep 19 21:08:03 2009
crtime: 0x4ab52c13:812fde04 -- Sat Sep 19 21:08:03 2009
Size of extra inode fields: 28
BLOCKS:

[root@gizeh ~]
$ getent passwd |grep 42658
karipha:x:42658:426:Karipha BOUMER:/mci/mast2008/karipha:/usr/local/bin/bash
[root@gizeh ~]
$ quota -s karipha
Disk quotas for user karipha (uid 42658):
Filesystem blocks quota limit grace files quota limit grace
/dev/mapper/VolGroup02S2IA-LVVG02Users07
                  603M*   489M    538M   39:07    6622   50000   55000

$ find /disk07 -inum 3542694
/disk07/mast2008/karipha/.recently-used.xbel
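
The manual steps above (inode to uid, uid to user name, user name to quota report) could be chained roughly as follows. This is a hedged sketch: owner_uid and inode_owner_quota are hypothetical helper names, the DEBUGFS variable is an assumption to allow for debuge4fs, and the parsing assumes the "User:" line format shown in the debuge4fs output above. It needs root and the real device to do anything useful.

```shell
# owner_uid   (hypothetical helper)
# Reads the output of a debugfs "stat <inode>" request on stdin and prints
# the numeric uid found on the "User:" line.
owner_uid() {
    awk '$1 == "User:" { print $2; exit }'
}

# inode_owner_quota INODE DEVICE   (hypothetical helper)
# Looks up the owner of INODE on DEVICE and prints that user's quota report.
# Set DEBUGFS=debuge4fs if that is what your distribution ships.
inode_owner_quota() {
    uid=$("${DEBUGFS:-debugfs}" -c -R "stat <$1>" "$2" 2>/dev/null | owner_uid)
    [ -n "$uid" ] || return 1
    # Query by uid directly; grepping the whole passwd database for the
    # number could also match a gid or part of another field.
    user=$(getent passwd "$uid" | cut -d: -f1)
    [ -n "$user" ] || return 1
    quota -s "$user"
}
```

For the inode in the example above, the call would be `DEBUGFS=debuge4fs inode_owner_quota 3542694 /dev/mapper/VolGroup02S2IA-LVVG02Users07`.
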

The other inodes implicated showed the same result: over quota. So if the user's data ultimately cannot be written, well... quota wouldn't have allowed it anyway.

Hope this helps you understand what's going on.
							- Ted
Yes, thanks for that detailed answer.
Regards, jehan.