Re: bug: xfs_repair becomes very slow when file system has a large sparse file

On 8/19/2011 8:26 PM, Dave Chinner wrote:
On Fri, Aug 19, 2011 at 12:37:05PM -0400, Joe Landman wrote:
(If you prefer we file this on a bug reporting system, please let me
know where and I'll do this).

Scenario:  xfs_repair being run against an ~17TB volume containing one
large sparse file.  Logical size of 7PB, actual size a few hundred GB.

Metadata:  Kernel = 2.6.32.41, 2.6.39.4, and others. xfsprogs 3.1.5.
Hardware RAID ~17TB LUN.  Base OS: CentOS 5.6 + updates + updated
XFS tools + our kernels.  Using an external journal on a different
device.

What we observe:

Running xfs_repair

	xfs_repair -l /dev/md2 -vv /dev/sdd2

can you post the actual output of xfs_repair?

[root@jr4-2 ~]# xfs_repair -l /dev/md2 -vv /dev/sdd2
Phase 1 - find and verify superblock...
- max_mem = 37094400, icount = 1346752, imem = 5260, dblock = 4391112384, dmem = 2144097
- block cache size set to 4361880 entries
Phase 2 - using external log on /dev/md2
- zero log...
zero_log: head block 126232 tail block 126232
- scan filesystem freespace and inode maps...
agf_freeblks 11726908, counted 11726792 in ag 1
sb_ifree 2366, counted 2364
sb_fdblocks 2111548832, counted 2111548716
- found root inode chunk
libxfs_bcache: 0x8804c0
Max supported entries = 4361880
Max utilized entries = 4474
Active entries = 4474
Hash table size = 545235
Hits = 0
Misses = 4474
Hit ratio = 0.00
MRU 0 entries = 4474 (100%)
MRU 1 entries = 0 ( 0%)
MRU 2 entries = 0 ( 0%)
MRU 3 entries = 0 ( 0%)
MRU 4 entries = 0 ( 0%)
MRU 5 entries = 0 ( 0%)
MRU 6 entries = 0 ( 0%)
MRU 7 entries = 0 ( 0%)
MRU 8 entries = 0 ( 0%)
MRU 9 entries = 0 ( 0%)
MRU 10 entries = 0 ( 0%)
MRU 11 entries = 0 ( 0%)
MRU 12 entries = 0 ( 0%)
MRU 13 entries = 0 ( 0%)
MRU 14 entries = 0 ( 0%)
MRU 15 entries = 0 ( 0%)
Hash buckets with 0 entries 541170 ( 0%)
Hash buckets with 1 entries 3765 ( 84%)
Hash buckets with 2 entries 242 ( 10%)
Hash buckets with 3 entries 15 ( 1%)
Hash buckets with 4 entries 36 ( 3%)
Hash buckets with 5 entries 6 ( 0%)
Hash buckets with 6 entries 1 ( 0%)
Phase 3 - for each AG...
- scan and clear agi unlinked lists...
- process known inodes and perform inode discovery...
- agno = 0
bad magic number 0xc88 on inode 5034047
bad version number 0x40 on inode 5034047
bad inode format in inode 5034047
correcting nblocks for inode 5034046, was 185195 - counted 0
bad magic number 0xc88 on inode 5034047, resetting magic number
bad version number 0x40 on inode 5034047, resetting version number
bad inode format in inode 5034047
cleared inode 5034047

(then it hangs here)


The system gets to Phase 3 and the first AG.  Then it appears to
stop. After an hour or so, we strace it, and we see

	pread(...) = 4096

and the same for the strace, along with syscall completion times?
(i.e. strace -ttt -T .....) That will tell us if the time is spent
doing IO or in the repair binary.
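For reference, an attach-to-the-running-process invocation along those lines might look like the sketch below (the output path is just an example, and it assumes a single xfs_repair instance so pidof finds the right one):

	# timestamp each syscall (-ttt) and record the time spent in it (-T)
	strace -ttt -T -p $(pidof xfs_repair) -o /tmp/xfs_repair.strace

A few minutes of that should show whether the time is going into the preads themselves or into userspace between them.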

Unfortunately, we already fixed the original system. We'll see if we
can recreate this condition in the lab and then post the output.

What is the CPU usage when this happens? How much memory do you

Very low.  The machine is effectively idle, with a user load of 0.01 or so.

48 GB of RAM in these machines.

have? Is the machine swapping while it is slowing down? A couple of

No.  I've not seen xfs_repair use more than a few hundred MB on this problem.

minutes of output from 'vmstat 5' when it is in this state would be handy.
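For completeness, capturing a couple of minutes of that into a file (path is just an example) would be something like:

	# 24 samples at 5-second intervals = 2 minutes
	vmstat 5 24 > /tmp/vmstat.during-repair.txt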

occurring about 2-3 times per second.  An hour later, it's down to 1 per
second.  An hour after that, it's once every 2 seconds.

Also, somewhere on this disk, someone has created an unfortunately
large file

[root@jr4-2 ~]# ls -alF /data/brick-sdd2/dht/scratch/xyzpdq
total 4652823496
d---------   2 1232 1000               86 Jun 27 20:31 ./
drwx------ 104 1232 1000            65536 Aug 17 23:53 ../
-rw-------   1 1232 1000               21 Jun 27 09:57 Default.Route
-rw-------   1 1232 1000              250 Jun 27 09:57 Gau-00000.inp
-rw-------   1 1232 1000                0 Jun 27 09:57 Gau-00000.d2e
-rw-------   1 1232 1000 7800416534233088 Jun 27 20:18 Gau-00000.rwf

[root@jr4-2 ~]# ls -ahlF /data/brick-sdd2/dht/scratch/xyzpdq
total 4.4T
d---------   2 1232 1000   86 Jun 27 20:31 ./
drwx------ 104 1232 1000  64K Aug 17 23:53 ../
-rw-------   1 1232 1000   21 Jun 27 09:57 Default.Route
-rw-------   1 1232 1000  250 Jun 27 09:57 Gau-00000.inp
-rw-------   1 1232 1000    0 Jun 27 09:57 Gau-00000.d2e
-rw-------   1 1232 1000 7.0P Jun 27 20:18 Gau-00000.rwf

This isn't a 7PB file system; it's a 100TB file system across 3
machines, roughly 17TB per brick or OSS.  The Gau-00000.rwf is
obviously a sparse file, as can be seen with an ls -alsF.

What does du tell you about it?  xfs_io -f -c "stat" <large file>?
xfs_bmap -vp <large file>?

ls -alsF told me it was a few hundred GB.  du gave a similar number.
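For reference, the logical-versus-allocated distinction shows up directly in a few commands (file name taken from the listing above):

	ls -ls Gau-00000.rwf                  # first column: allocated blocks; size column: logical bytes
	du -h Gau-00000.rwf                   # allocated space only (a few hundred GB here)
	du -h --apparent-size Gau-00000.rwf   # logical size (7PB here)
	xfs_bmap -vp Gau-00000.rwf            # extent map; unallocated ranges show up as holes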


Upon removing that file, xfs_repair completes within ~10 minutes.
Leaving that file in place, xfs_repair does not terminate; it just
gets asymptotically slower.

That could simply be the memory footprint causing more swapping per
operation to occur. Or it could be that something is simply getting
too large for the index type being used. If the machine is not
swapping, can you point 'perf top -p <pid of xfs_repair>' at it so
we might see where the CPU time is being spent?  (You might need to
use a non-stripped version of the binary to get any useful
information.)
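For reference, a non-interactive way to capture the same data (output path is just an example, and it again assumes a single xfs_repair instance) would be something like:

	# sample the repair process for a minute, then summarize by symbol
	perf record -g -p $(pidof xfs_repair) -o /tmp/xfs_repair.perf.data -- sleep 60
	perf report -i /tmp/xfs_repair.perf.data --stdio

With a non-stripped binary the report should show which xfs_repair function, if any, is burning the time.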

OK, we will try to recreate the condition in the lab and capture this output.
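For anyone trying to reproduce this, a sparse file of comparable logical size can be created on a scratch XFS file system with something like the following (the path is just an example; note the original file also had a few hundred GB of real data spread through it, which a purely sparse file will not have):

	# create a 7PB logical-size file that allocates essentially no blocks
	xfs_io -f -c "truncate 7p" /scratch/test/huge-sparse-file
	# or, with coreutils:
	truncate -s 7P /scratch/test/huge-sparse-file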


Please let me know if you need more information, or if you would
like me to file this somewhere else for official reportage.

This is the right place to let us know about problems.

Thanks!



--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.,
email: landman@xxxxxxxxxxxxxxxxxxxxxxx
web  : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 866 888 3112
cell : +1 734 612 4615

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs

