Re: External Journal scenario - good idea?

  Jeremy Rumpf wrote:

>On Wednesday 30 October 2002 07:44 am, Vinnie wrote:
>
>>Currently, the array is partitioned with a /boot partition, and a /
>>partition, each as ext3 with the default data=ordered journaling mode.
>> I have begun to realize gradually why it is a decent idea to break up
>>the filesystem into separate mount points and partitions, and may yet
>>end up doing that.  But that's a rabbit to hunt another day, unless
>>taking care of this is also required to solve this problem.
>>
>
>This is _very_ advisable.
>
Yep, now (I think) I understand.  Since I have one large / filesystem, 
all writes go through the same "funnel": they all use the same journal, 
going to the same "drive" (the array).  And since the same drives that 
serve the shared dirs for SMB clients also handle the reads/writes to 
the NFS mailbox dirs and everything else, NFS and MySQL requests have to 
"get in line" behind SMB requests when it's busy.

But if these other requests (NFS mailboxes, MySQL, etc.) are on separate 
spindles, drives which are not part of the RAID5 array, they are in a 
different line waiting to be processed.  This makes sense.
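
(For the archives: a simple way to watch this contention, assuming the 
sysstat package with its iostat tool is installed on the box, is to take 
extended per-device stats every few seconds:

    # one report every 5 seconds; with everything on one array you
    # should see a single device soaking up requests while the NFS,
    # SMB, and SQL traffic all queues behind it
    iostat -x 5

With separate spindles, the load should show up spread across devices 
instead.)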

>>This file server performs 5 key fileserver-related roles, due to its
>>having the large RAID5 file storage for the network:
>>
>>1. Serves the mailboxes for our domain to the two frontend mail/web
>>servers via NFS mount
>>
>>2. Runs the master SQL server - the two mail/web servers run local slave
>>copies of the mail account databases
>>
>>3. Stores the master copy of web documents served by the web servers
>>(and will replicate them to web servers when documents change, still
>>working on this though)
>>
>>4. Samba file server for storage needs on the network
>>
>>5. Limited/restricted-access FTP server for web clients
>>
>
>Do any of these require more than 120GB of storage (meaning are they too large 
>to fit on a single 120GB RAID1 set)?
>
Currently our complete usage of the single RAID5 array is right around 
100GB.  It is mostly file storage/backups from other hosts on the 
network.  This will no doubt represent the largest file storage 
requirements of all the fileserver functions for this machine.

In light of how little space all of the other functions really need 
(combined), and the fact that each 120GB drive we pull off the RAID5 
array costs us around 100GB of RAID5 storage capacity (and the drives 
would have to come out of the array in PAIRS for each RAID1 array we 
created in this external 8-bay unit), it seems the best use of the 
external RAID enclosure and the 120GB drives in it is to keep the large 
array for file storage and create the other arrays elsewhere.  As for 
whether I keep a RAID5 array going at all, I'm going to have to think on 
it and decide if I can settle for something else, like a RAID0+1 array 
or smaller RAID1 arrays.
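
To put rough numbers on it (assuming the 7-drive array I mention below, 
and nominal 120GB sizes; formatted capacity runs lower, hence the ~100GB 
real figure per drive):

    7 x 120GB RAID5              -> 6 x 120GB = ~720GB usable
    pull a pair for one RAID1:
      5 x 120GB RAID5            -> 4 x 120GB = ~480GB usable
      2 x 120GB RAID1            ->             ~120GB usable
    net: ~240GB of RAID5 capacity traded for ~120GB of mirrored space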

As you said, using a pair of 120GB drives for each RAID1 array used for 
other storage purposes (mailboxes, ftp, SQL database) would be a really 
big waste of space.

Also, I'm not so sure I would gain much by making RAID1 arrays in the 
same external unit, assuming I still had a RAID5 array in there too. 
That is, if what I am seeing really comes down to the parity calculation 
speed of the RAID controller in this external subsystem: if it is 
swamped with XOR calculations while writing to a 7-drive array, it would 
probably not be much less swamped calculating parity for a 4-5 drive 
array, and even a separate RAID1 array behind the same controller could 
suffer write performance problems, since its data still has to pass 
through that same controller to get written to the RAID1 drives.

But I am really not even sure that what we're seeing here is a problem 
with the speed of the RAID controller.  From some other reading I have 
done, it seems that grabbing up RAM to cache writes and combine it all 
into one big write is something that the 2.4 kernel series is rather 
notorious for.  I saw an article/review of external RAID subsystems 
(both SCSI and ATA-to-SCSI type) which said the same thing - that 
Windows 2000 servers were a lot better at asynchronous I/O than kernel 
2.4-based Linux, and proceeded to describe much of the same malady I 
have been seeing here.  They did say that a lot of work is going into 
newer Linux kernels to make it better at async disk I/O.

I did try building a 2.4.19 kernel this past weekend, and it crashed 
MISERABLY during a large write test: major SCSI driver error messages, 
and it hung the SCSI bus to the point that I had to not only hit the 
reset button on the server but also cycle the power on the RAID unit 
before I could reboot successfully.  I saw in the changelogs for 2.4.19 
that the Adaptec 78xx drivers have been revamped a couple of times since 
2.4.18.  I guess I'm just going to have to stay with 2.4.18 for a while.

I have performed the recommended bdflush sysctl tweak to try to make the 
kernel write dirty buffers more often, and while I am seeing a marked 
increase in SCSI bus activity, write performance doesn't seem to have 
improved a great deal.  But judging from the "free" command (and this 
has always been the case), it's not the "buffers" RAM usage that goes so 
high when heavy disk write I/O is going on, it's the "cached" RAM usage 
that hits the roof.
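
For the archives, the tweak was along these lines.  The nine fields of 
/proc/sys/vm/bdflush shift around between 2.4.x releases, so treat the 
exact values as an example and check Documentation/sysctl/vm.txt for the 
running kernel; the idea is to lower nfract (1st field) and nfract_sync 
(7th field) so writeback starts, and writers get throttled, at a smaller 
percentage of dirty buffers:

    # defaults on many 2.4 kernels are roughly "30 500 0 0 500 3000 60 20 0"
    echo "10 500 0 0 500 3000 20 5 0" > /proc/sys/vm/bdflush

(The 8th field, nfract_stop_bdflush, is dropped to 5 here just to keep 
it below the new nfract.)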

I am going to split up the single large filesystem into multiple mounts 
as you suggested; thanks to your reply, it's much clearer now why that's 
a good idea.  But I am concerned that even after doing this, since it is 
the same kernel with the same "cache it first, then write it all at 
once" semantics, I may not be in much better shape.

It's really a shame to suspect so strongly that I would get the most 
improved write performance out of this machine by dropping from 2GB of 
RAM to 256MB. ;)  Operating on the concept that if it has nowhere to 
cache it, it HAS to write more often... ;)

>
>Remember though, you can move the journal to an external device at any time. I 
>would heavily recommend that you break up your spindles and allocate the 
>journal with the filesystem (a large journal with the filesystem) to start 
>out with. Then if performance still demands it, grab some small(er) disks and 
>move the journals off to them.
>
>When I say large journal, I usually think around the 250MB range. I personally 
>wouldn't recommend allocating a super large one (greater than 1GB), but I'll 
>step aside and let the FS experts advise on that issue.
>
I was considering the massive journal size for the samba share mount on 
the theory that if the journal is big enough to act as a "staging area" 
for client file copy operations that may total around 2GB or more 
(possibly), we could keep the journal commit activity a largely 
asynchronous operation, rather than a chain of panic-mode synchronous 
commits because we're straddling that 25-50 percent full trigger until 
the data stops coming from the client machine.  But I'm not 100% sure I 
understand how it all works just yet; I have to do some more reading. 
It could actually be counter-productive to have such a large journal.
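
If I do try the big journal, the mke2fs/tune2fs man pages suggest it can 
be set at creation time, or re-created on an existing ext3 filesystem by 
dropping the old journal first (sizes in megabytes; the device name 
below is made up):

    # at filesystem creation time
    mke2fs -j -J size=250 /dev/sdc1

    # or on an existing ext3 filesystem (unmounted):
    tune2fs -O ^has_journal /dev/sdc1    # remove the current journal
    tune2fs -j -J size=250 /dev/sdc1     # re-create it at 250MB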

>
>>CAN WE CHANGE JOURNAL LOCATION ON EXISTING EXT3 PARTITIONS?
>>One other snag it seems we may run into is the fact that the / partition
>>already has a journal (/.journal, I presume), since it's already an ext3
>>partition.  Is it possible to tell the system we want the journal
>>somewhere else instead?  Strikes me that when we're ready to move to the
>>external journal, we may have to mount the / partition ext2, then remove
>>the journal, and create the new one and point the / partition to it with
>>the e2fs tools?
>>
>
>Yes, except I would _not_ advise moving the / partition journal to an external 
>device. The / partition should have very little activity (assuming /var or 
>/var/log is a separate file system). This is the prime reason you should not 
>be allocating one huge / filesystem. Break it up into something like:
>
>/
>/var
>/tmp
>/usr
>/usr/local
>
So for these (above), have them at least on separate partitions, 
possibly on the same drive, but separate partitions at minimum (which 
would give them separate journals)?  And for the ones below:

>
>and create special mounts for your samba, mysql, webroot (NFS), mail (NFS), 
>stuff.
>
>/usr/local/mysql
>/usr/local/webs
>/usr/local/filestore
>
since this is where the majority of the real file activity goes on, put 
each of these on separate drives (or RAID1 arrays), so we not only have 
separate journals but separate spindles too?
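
And to answer my own earlier question for the archives: if I'm reading 
the e2fsprogs man pages right, moving a journal to an external device 
later would go roughly like this (device names made up; the journal 
device's block size has to match the filesystem's):

    # dedicate a small disk/partition as an external journal device
    mke2fs -O journal_dev -b 4096 /dev/sde1

    # on the (unmounted) filesystem: drop the internal journal,
    # then re-attach it pointing at the external device
    tune2fs -O ^has_journal /dev/sdb1
    tune2fs -j -J device=/dev/sde1 /dev/sdb1

But as you say, per-filesystem internal journals first, and external 
devices only if performance still demands it.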

Jeremy, thank you so much for your reply.  This has really given me a 
lot to chew on.  And looking at my watch I see that it's Friday again... 
meaning I can actually work on this for a few days... <grin>

TTYL,
vinnie




_______________________________________________
Ext3-users@redhat.com
https://listman.redhat.com/mailman/listinfo/ext3-users
