Re: External Journal scenario - good idea?

Jeremy Rumpf wrote:
On Wednesday 30 October 2002 07:44 am, Vinnie wrote:
Currently, the array is partitioned with a /boot partition, and a /
partition, each as ext3 with the default data="ordered" journaling mode.
I have begun to realize gradually why it is a decent idea to break up
the filesystem into separate mount points and partitions, and may yet
end up doing that. But that's a rabbit to hunt another day, unless
taking care of this is also required to solve this problem.

This is _very_ advisable.
Yep, now (I think) I understand.  Since I have one large / filesystem, all writes go through the same "funnel": they all use the same journal, going to the same "drive" (the array).  And since the same drives that hold the shared dirs for SMB clients also handle the reads/writes to the NFS mailbox dirs and everything else, NFS requests and MySQL requests have to "get in line" with SMB requests when it's busy.

But if these other requests (NFS mailboxes, MySQL, etc.) are on separate spindles, drives which are not part of the RAID5 array, they are in a different line waiting to be processed.  This makes sense.
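Once things are on separate spindles, I guess I can actually watch the separate lines form with something like this (assuming the sysstat package is installed; purely illustrative):

    # extended per-device statistics every 5 seconds, to see which
    # spindles the requests are actually queuing on
    iostat -x 5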
This file server performs 5 key fileserver-related roles, due to its
having the large RAID5 file storage for the network:

1. Serves the mailboxes for our domain to the two frontend mail/web
servers via NFS mount

2. Runs the master SQL server - the two mail/web servers run local slave
copies of the mail account databases

3. Stores the master copy of web documents served by the web servers
(and will replicate them to web servers when documents change, still
working on this though)

4. Samba file server for storage needs on the network

5. Limited/restricted-access FTP server for web clients

Do any of these require more than 120GB of storage (meaning are they too large
to fit on a single 120GB RAID1 set)?
Currently our complete usage of the single RAID5 array is right around 100GB.  It is mostly file storage/backups from other hosts on the network.  This will no doubt represent the largest file storage requirements of all the fileserver functions for this machine.

In light of the smaller amount of space really needed for all of the other functions combined, and the fact that each 120GB drive we pull off the RAID5 array costs us around 100GB of RAID5 storage capacity (and the drives would have to come out of the array in PAIRS for each RAID1 array we create in this external 8-bay unit), it seems the best use of the external RAID enclosure and the 120GB drives in it is to keep the large array for file storage and create the other arrays elsewhere.  Whether I keep a RAID5 array going at all is something I'll have to think about; I may decide I can settle for something else, like a RAID0+1 array or smaller RAID1 arrays.
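Rough raw numbers (before formatting overhead), assuming the current array is the 7-drive set I mention below and all drives are 120GB:

    7-drive RAID5:                 (7-1) x 120GB = 720GB usable
    pull a pair, build one RAID1:  (5-1) x 120GB = 480GB RAID5 + 120GB RAID1
    RAID0+1 across all 8 bays:     (8/2) x 120GB = 480GB usable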

As you said, using a pair of 120GB drives for each RAID1 array used for other storage purposes (mailboxes, ftp, SQL database) would be a really big waste of space.

Also, I'm not so sure I would gain much by making RAID1 arrays in the same external unit, assuming I still had a RAID5 array in there, at least if what I am seeing has much to do with the parity calculation speed of the RAID controller in this external subsystem.  If the controller is swamped with XOR calculations while writing to a 7-drive array, it would probably not be much less swamped calculating parity for a 4-5 drive array.  And even a separate RAID1 array behind the same controller may suffer write performance issues, since the data still has to pass through that one controller to actually get written to the RAID1 drives.

But I am really not even sure that what we're seeing here is a problem with the speed of the RAID controller.  From some other reading I have done, it seems that grabbing up RAM to cache writes and combine it all into one big write is something that the 2.4 kernel series is rather notorious for.  I saw an article/review of external RAID subsystems (both SCSI and ATA-to-SCSI type) which said the same thing - that Windows 2000 servers were a lot better at asynchronous I/O than kernel 2.4-based Linux, and proceeded to describe much of the same malady I have been seeing here.  They did say that a lot of work is going into newer Linux kernels to make it better at async disk I/O.

I did try building a 2.4.19 kernel this past weekend, and it crashed MISERABLY during a large write test.  Major SCSI driver error messages, and it hung the SCSI bus to the point that I had to not only hit the reset button on the server, but also cycle the power on the RAID unit, before I could successfully RE-boot.  I saw in the Changelogs for 2.4.19 that the Adaptec 78xx drivers have been revamped a couple times since 2.4.18.  I guess I'm just going to have to stay with 2.4.18 for a while.

I have performed the recommended bdflush sysctl tweak to try to make the kernel write dirty buffers more often, and while I am seeing a marked increase in SCSI bus activity, write performance doesn't seem to have improved a great deal.  But from the "free" command (and this has always been the case), it's not the "buffers" RAM usage that is so high when heavy disk write I/O is going on, it's the "cached" RAM usage that hits the roof.
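For the record, the tweak was along these lines; the nine fields and their exact meanings shift between 2.4 revisions, so these numbers are purely illustrative, and Documentation/sysctl/vm.txt for the running kernel is the authority:

    # show the current nine bdflush parameters
    cat /proc/sys/vm/bdflush
    # lower nfract (1st field) and nfract_sync (7th field) so writeout
    # starts, and then becomes forced, at a lower dirty-buffer percentage
    echo "10 500 64 256 500 3000 20 0 0" > /proc/sys/vm/bdflush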

I am going to split up the single large filesystem into multiple mounts as you suggested, since (thanks to your reply) that is much more clearly a good idea. But I am concerned that even after doing this, since it is the same kernel with the same "cache it first, then write it all at once" semantics, I may not be in much better shape.

It's really a shame to suspect so strongly that I would get the most improved write performance out of this machine by dropping from 2GB of RAM to 256MB. ;)  Operating on the concept that if it has nowhere to cache it, it HAS to write more often... ;)

Remember though, you can move the journal to an external device at any time. I
would heavily recommend that you break up your spindles and allocate the
journal with the filesystem (a large journal with the filesystem) to start
out with. Then if performance still demands it, grab some small(er) disks and
move the journals off to them.

When I say large journal, I usually think around the 250MB range. I personally
wouldn't recommend allocating a super large one (greater than 1GB), but I'll
defer and let the FS experts advise on that issue.

I was considering the massive journal size for the samba share mount on the idea that if the journal is big enough to act as a "staging area" for client file copy operations that may total around 2GB or more, we could keep the journal commit activity largely asynchronous, rather than a chain of panic-mode synchronous commits because we keep straddling that 25-50 percent full trigger until the data stops coming from the client machine.  But I'm not 100% sure I understand how it all works just yet; I have to do some more reading.  It could actually be counter-productive to have such a large journal.
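From poking at the mke2fs and tune2fs man pages, I think the knobs involved look something like this (device names are just placeholders, and I'd want to try it on a scratch partition first):

    # new ext3 filesystem with a ~256MB internal journal
    mke2fs -j -J size=256 /dev/sdX1

    # or on an existing (unmounted, clean) ext3 filesystem:
    # drop the old journal, then add a bigger one
    tune2fs -O ^has_journal /dev/sdX1
    tune2fs -j -J size=256 /dev/sdX1

    # and later, to move the journal off to a dedicated device:
    mke2fs -O journal_dev /dev/sdY1
    tune2fs -O ^has_journal /dev/sdX1
    tune2fs -j -J device=/dev/sdY1 /dev/sdX1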

CAN WE CHANGE JOURNAL LOCATION ON EXISTING EXT3 PARTITIONS?
One other snag it seems we may run into is the fact that the / partition
already has a journal (/.journal, I presume), since it's already an ext3
partition. Is it possible to tell the system we want the journal
somewhere else instead? Strikes me that when we're ready to move to the
external journal, we may have to mount the / partition ext2, then remove
the journal, and create the new one and point the / partition to it with
the e2fs tools?

Yes, except I would _not_ advise moving the / partition journal to an external
device. The / partition should have very little activity (assuming /var or
/var/log is a separate file system). This is the prime reason you should not
be allocating one huge / filesystem. Break it up into something like:

/
/var
/tmp
/usr
/usr/local
So these (above) should at least be on separate partitions, possibly on the same drive, but at least separate partitions (which would give them separate journals)?  And for the ones below:

and create special mounts for your samba, mysql, webroot (NFS), mail (NFS),
stuff.

/usr/local/mysql
/usr/local/webs
/usr/local/filestore
Since this is where the majority of the real file activity is going on, put each of these on separate drives (or RAID1 arrays), so we not only have separate journals but separate spindles too?  I've sketched below what I think the end result looks like.
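Something like this in /etc/fstab (devices are placeholders; I've left the options at the defaults, which gives data=ordered on ext3):

    /dev/sda1   /                     ext3  defaults  1 1
    /dev/sda2   /var                  ext3  defaults  1 2
    /dev/sda3   /tmp                  ext3  defaults  1 2
    /dev/sda5   /usr                  ext3  defaults  1 2
    /dev/sda6   /usr/local            ext3  defaults  1 2
    /dev/sdb1   /usr/local/mysql      ext3  defaults  1 2
    /dev/sdc1   /usr/local/webs       ext3  defaults  1 2
    /dev/sdd1   /usr/local/filestore  ext3  defaults  1 2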

Jeremy, thank you so much for your reply.  This has really given me a lot to chew on.  And looking at my watch I see that it's Friday again... meaning I can actually work on this for a few days... <grin>

TTYL,
vinnie
