Re: managing GFS corruption on large FS

Riaan van Niekerk wrote:
hi all

We have a large GFS holding 4 TB of maildir data. There is corruption on this GFS which causes nodes to be withdrawn intermittently.

The fs corruption was caused by user error and a lack of documentation (initially not having the clustered flag enabled on the VG when growing the LV/GFS). We now know better, and will avoid this particular cause of corruption. However, management wants to know from us how we can prevent corruption, or minimize the downtime incurred if this should happen again.

For the current problem, since a gfs_fsck will take too long (we cannot afford the 1 - 3 days of downtime it would take to complete), we are planning to migrate the data to a new GFS, and at the same time set up the new environment so that any future corruption causes the minimum of downtime.

One option is to split the one big GFS into a number of smaller GFS's. Unfortunately, our environment does not lend itself to being split up into (for example) a number of 200GB GFS's. Also, this negates a lot of the advantages of GFS (e.g. having your storage consolidated onto one big GFS, and scaling it out by growing the GFS and adding nodes).

I would really like to know how others on this list manage the threat/risk of FS corruption, and the corruption itself, if it does happen. Also, w.r.t. data protection, if you do snapshots, SAN-based mirroring, or backup to disk/tape, I would really appreciate it if you could give me detailed information such as
a) mechanism (e.g snaps, backup, etc)
b) type of data (e.g. many small files)
c) size of GFS
d) the time it takes to perform the action

thank you
Riaan
Hi Riaan,

You've raised a good question, and I thought I'd address some of your issues.
I'm just throwing these out in no particular order.

Running gfs_fsck is understandably slow, but there are a few things to bear
in mind:

1. A 4TB file system is not excessive by any means.  As I stated in the
   cluster FAQ, a customer reported running gfs_fsck on a 45TB file system
   and it only took 48 hours, and that was slower than it should have been
   because it ran out of memory and started swapping to disk.  Your 4TB
   file system should take a lot less time since it's less than a tenth of
   the size.  That depends, of course, on hardware issues as well.  See:

http://sources.redhat.com/cluster/faq.html#gfs_fsck1

2. I've recently figured out a couple of ways to improve the speed of
   gfs_fsck.  For example, for a recent bugzilla, I patched a memory leak
   and combined passes through the file system inside the duplicate
   checking code, pass1b.  For a list of improvements, see this bugzilla,
   especially comment #33:

https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=208836

   I think this should be available in RHEL4 U5.

3. gfs_fsck takes a lot of memory to run, and when it runs out of memory,
   it will start swapping to disk, and that will slow it down considerably.
   So be sure to run it on a system with lots of memory.
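   If it helps, here's a trivial sketch (Python, just reading /proc/meminfo;
   nothing official, and the "free plus cache" figure is only my rough idea
   of usable memory) that you could run before kicking off gfs_fsck to see
   how much memory and swap the box actually has available.  How much
   memory gfs_fsck really needs depends on the file system, so treat any
   cutoff you pick as a guess:

      #!/usr/bin/env python
      # Quick look at free memory and swap before starting gfs_fsck.
      # How much memory gfs_fsck needs depends on the file system, so
      # this only reports the numbers; it doesn't decide for you.

      def meminfo():
          info = {}
          for line in open("/proc/meminfo"):
              parts = line.split()
              if len(parts) >= 2 and parts[0].endswith(":"):
                  info[parts[0][:-1]] = int(parts[1])   # values are in kB
          return info

      m = meminfo()
      free_mb = (m.get("MemFree", 0) + m.get("Cached", 0)) / 1024
      swap_mb = m.get("SwapFree", 0) / 1024
      print("Free memory (incl. cache): %d MB, free swap: %d MB"
            % (free_mb, swap_mb))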

4. We're continuing to improve the gfs_fsck code all the time.
   Jon Brassow and I have done some brainstorming and hope to keep
   making it faster.  I've come up with some more memory saving ideas
   that might make it faster, but I have yet to try them out.  Maybe soon.

5. Another thing that slows down gfs_fsck is running it in verbose mode.
   Sometimes it's useful to have the verbose output, but it will slow you
   down considerably.  Don't use -v or -vv unless you have to.
   If you're only using -v to figure out where fsck is in the process, I
   have a couple of improvements:  In the most recent version of gfs_fsck
   (for the bugzilla above) I've added more "% complete" messages.  Also,
   if you interrupt that version by hitting <ctrl-c>, it will tell you
   what block it's currently working on and allow you to continue.  Again,
   I think this should be in RHEL4 U5.

6. I recently discovered an issue that impacts GFS performance for large
   file systems, not only for gfs_fsck but for general performance as well.
   The issue has to do with the size of the GFS resource groups (RGs), an
   internal GFS structure for managing the data (not to be confused with
   rgmanager's Resource Groups).  Some file system slowdown can be blamed
   on having a large number of RGs, and the bigger your file system, the
   more RGs you need.  By default, gfs_mkfs carves your file system into
   256MB RGs, but it allows you to specify a preferred RG size.  The
   default, 256MB, is good for average size file systems, but you can
   increase performance on a bigger file system by using a bigger RG size.
   For example, my 40TB file system requires approximately 156438 RGs of
   256MB each, and whenever GFS has to walk that linked list, it takes a
   long time.  The same 40TB file system can be created with bigger
   RGs--2048MB--requiring only 19555 of them.  The time savings is
   dramatic: it took nearly 23 minutes for my system to read in all 156438
   RG structures (with 256MB RGs), but only 4 minutes to read in the 19555
   RG structures for the 2048MB RGs.  The time to do an operation like df
   on an empty file system dropped from 24 seconds with 256MB RGs to under
   a second with 2048MB RGs.  I'm sure that increasing the size of the RGs
   would help gfs_fsck's performance as well.  I can't make any performance
   promises; I can only tell you what I observed in this one case.  The
   issue is documented in this bugzilla:

https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=213763

   I'm going to try to get a KnowledgeBase article written up about this,
   by the way, and I'll try to put something into the FAQ too.

   For RHEL5, I'm changing gfs_mkfs so that it picks a more intelligent RG
   size based on the file system size, to let users take advantage of this
   performance benefit without ever knowing or caring about the RG size.

   Unfortunately, there's no way to change the RG size once a file system
   has been made.  It only happens at gfs_mkfs time.
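   To put rough numbers on the RG counts above, here is a back-of-the-
   envelope sketch (Python, just arithmetic; not an official tool).  The
   real counts come out a little lower than this math gives, since
   gfs_mkfs reserves space for journals and other metadata (which is why
   I see ~156438 rather than the 163840 this gives for 40TB at 256MB):

      #!/usr/bin/env python
      # Back-of-the-envelope estimate of GFS resource group (RG) counts.
      # Actual counts from gfs_mkfs are somewhat lower because journals
      # and other metadata take space out of the file system.

      def estimate_rg_count(fs_size_tb, rg_size_mb):
          fs_size_mb = fs_size_tb * 1024 * 1024
          return fs_size_mb // rg_size_mb

      for rg_mb in (256, 2048):
          print("40TB with %4dMB RGs: roughly %d RGs"
                % (rg_mb, estimate_rg_count(40, rg_mb)))

   If I remember the option right, the RG size is given to gfs_mkfs with
   the -r flag (in megabytes), along the lines of the following (the
   cluster, file system and device names here are made up):

      gfs_mkfs -p lock_dlm -t mycluster:maildata -j 4 -r 2048 /dev/myvg/mylv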

7. As for file system corruption, that's a tough issue.  First of all,
   it's very rare.  In virtually all the cases I've seen, it was caused by
   influences outside of GFS itself, like the case you mentioned:  (1)
   someone swapping a hard drive that resided in the middle of a GFS
   logical volume, (2) someone running gfs_fsck while the volume was still
   mounted by a node, or (3) someone messing with the SAN from a machine
   outside of the cluster.  If there are other ways to cause GFS file
   system corruption, we need users to open bugzillas so we can work on
   the problem, and even then it's nearly impossible to tell how corruption
   occurred unless it can be recreated here in our lab.

I'm going to continue to search for ways to improve the performance of
GFS and gfs_fsck because you're right: the needs of our users are increasing
and people are using bigger and bigger file systems all the time.

Regards,

Bob Peterson
Red Hat Cluster Suite

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
