That sounds a lot like the floating point rounding error I encountered
last year.
On Mar 20, 2007, at 6:59 PM, Theodore Tso wrote:
Well, keep in mind that the float is just an optimization over doing a
simple binary search. So it doesn't have to be precise; an
approximation is fine, except when mid ends up being larger than
high.
But it's simple enough to catch that particular case where the
division goes to 1 instead of 0.99999 as we might expect. Catching
that should be enough, I expect.
- Ted
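For concreteness, here is a simplified sketch (not the actual icount.c
code) of the kind of interpolated binary search being described; the
function name and array types are invented. The float only chooses
where to probe, so an imprecise ratio is harmless as long as mid is
clamped back into [low, high]:

/*
 * Simplified sketch (not the actual icount.c code) of an interpolated
 * binary search over a sorted array of inode numbers.  The float only
 * chooses where to probe; correctness comes from the ordinary
 * binary-search narrowing, which is why an imprecise ratio is harmless
 * as long as mid is clamped back into [low, high].
 */
static int find_ino(const unsigned int *list, int count, unsigned int ino)
{
	int low = 0, high = count - 1;

	while (low <= high) {
		unsigned int lowval = list[low], highval = list[high];
		float range;
		int mid;

		if (ino <= lowval)
			range = 0;
		else if (ino >= highval)
			range = 1;
		else	/* interpolate: guess where ino sits in [low, high] */
			range = ((float) (ino - lowval)) / (highval - lowval);

		mid = low + ((int) (range * (high - low)));
		if (mid > high)		/* rounding pushed the probe past the end */
			mid = high;
		if (mid < low)
			mid = low;

		if (list[mid] == ino)
			return mid;	/* found */
		if (list[mid] < ino)
			low = mid + 1;
		else
			high = mid - 1;
	}
	return -1;			/* not present */
}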
With a float, you're still trying to cram 32 bits into a 24-bit
mantissa (23 bits + implicit bit). If nothing else, the float
should get changed to a double, which has a 53-bit mantissa (52 +
implicit bit). Just catching the case where the division goes to one
causes it to do a linear search. Given that this only occurs on
really big filesystems, that's probably not what you want to do...
Brian
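To make the rounding concrete, here is a small standalone demo (not
e2fsprogs code); the inode numbers are taken from the figures reported
elsewhere in this thread:

/*
 * Standalone demo (not e2fsprogs code) of the precision loss Brian
 * describes: with a 24-bit mantissa, large 32-bit inode numbers that
 * differ by less than the float's spacing (64 at this magnitude)
 * collapse to the same value, so the interpolation ratio rounds to
 * exactly 1.0 even though ino < highval.  A double, with its 53-bit
 * mantissa, represents both values exactly.
 */
#include <stdio.h>

int main(void)
{
	unsigned int lowval  = 0;
	unsigned int highval = 732577792;	/* inode count reported in this thread */
	unsigned int ino     = 732577791;	/* one less than highval */

	float  f = ((float)  (ino - lowval)) / (highval - lowval);
	double d = ((double) (ino - lowval)) / (highval - lowval);

	printf("float : %.8f\n", f);	/* prints 1.00000000 */
	printf("double: %.10f\n", d);	/* prints 0.9999999986 */
	return 0;
}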
Here's the patch I applied to e2fsck to get around the issue; it does
the trick.
--- e2fsprogs-1.39/lib/ext2fs/icount.c	2005-09-06 05:40:14.000000000 -0400
+++ e2fsprogs-1.39-test/lib/ext2fs/icount.c	2007-03-13 10:56:19.000000000 -0400
@@ -251,6 +251,10 @@
 			range = ((float) (ino - lowval)) /
 				(highval - lowval);
 			mid = low + ((int) (range * (high-low)));
+			if (mid > high)
+				mid = high;
+			if (mid < low)
+				mid = low;
 		}
 #endif
 		if (ino == icount->list[mid].ino) {
Our inode count is 732,577,792 on a 5.4 TB filesystem with 5.0 TB in
use (94% in use). It took about 9 hours to run and used 4GB of
memory.
Hope this helps.
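For comparison, the interpolation step done in a double, as Brian
suggests, would look roughly like this. This is only a sketch, not a
patch that was posted; the helper name and parameters are invented,
and the clamp is kept as a safety net:

/*
 * Sketch (not a posted patch) of the alternative Brian suggests:
 * doing the interpolation in a double, whose 53-bit mantissa holds
 * any 32-bit inode number exactly, so the ratio cannot round up to
 * 1.0 while ino < highval.  Helper name and parameters are invented;
 * the clamp is kept only as a safety net.
 */
static int interp_probe(int low, int high,
			unsigned int lowval, unsigned int highval,
			unsigned int ino)
{
	double range;
	int mid;

	if (ino <= lowval)
		range = 0;
	else if (ino >= highval)
		range = 1;
	else
		range = ((double) (ino - lowval)) / (highval - lowval);

	mid = low + ((int) (range * (high - low)));
	if (mid > high)
		mid = high;
	if (mid < low)
		mid = low;
	return mid;
}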
On Apr 8, 2008, at 5:15 PM, Justin Hahn wrote:
Hello all,
I recently encountered a problem that I thought I should bring to
the ext3 devs. I've seen some evidence of similar issues in the
past, but it wasn't clear that anyone had experienced it at quite
this scale.
The short summary is that I let 'e2fsck -C 0 -y -f' run for more
than 24 hours on a 4.25TB filesystem before having to kill it. It
had been stuck at "70.1%" in Pass 2 (checking directory structure)
for about 10 hours. e2fsck was using about 4.4GB of RAM and was
maxing out 1 CPU core (out of 8).
This filesystem is used for disk-to-disk backups with dirvish[1].
The volume was 4.25TB, and about 90% full. I was doing an fsck
prior to running resize2fs, as required by said tool. (I ended up
switching to ext2online, which worked fine.)
I suspect the large # of hard links and the large file system size
are what did me in. Fortunately, my filesystem is clean for now.
What I'm worried about is the day when it actually needs a proper
fsck to correct problems. I have no idea how long the fsck would
have taken had I not cancelled it. I fear it would have been more
than 48 hours.
Any suggestions (including undocumented command line options) I can
try to accelerate this in the future would be welcome. As this
system is for backups and is idle for about 12-16 hours a day, I can
un-mount the volume and perform some (non-destructive!!) tests if
there is interest. Unfortunately, I cannot provide remote access to
the system for security reasons as this is our backup archive.
I'm using CentOS 4.5 as my distro.
'uname -a' reports:
Linux backups-00.dc-00.rbm.local 2.6.9-55.0.12.ELsmp #1 SMP Fri Nov
2 12:38:56 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux
The underlying hardware is a Dell PE 2950 with a PERC 5i RAID
controller, 6x 1TB SATA drives, and 8GB of RAM. I/O performance
has been fine for my purposes, but I have not benchmarked, tuned or
tweaked it in any way.
Thanks!
--jeh
[1] Dirvish is an rsync/hardlink based set of perl scripts -- see http://www.dirvish.org/
for more details.
_______________________________________________
Ext3-users mailing list
Ext3-users@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/ext3-users