Odd result of increasing journal size?

Matthew Berg <galt@xxxxxxxxxxxxxx> · Wed, 04 Feb 2004 12:51:46 -0500

[NOTE: I apologize in advance if this shows up as a duplicate.  I sent
it once by accident from the wrong account, so that message has been
awaiting moderation.  If the moderator reads the list, hopefully they'll
notice I already posted :)]

I have a number of machines which are used for mail storage.  We've had
issues with sporadic slow connections to these machines, seemingly
blocking on I/O.

After running some tests it seemed as if we might be filling the
journal.  In some synthetic testing we did, increasing journal size
eliminated the spikes we saw at regular intervals with a large amount of
simultaneous reading and writing.

This weekend we increased journal size from 32MB to 256MB on a group of
machines.  There was one with the following configuration:

        2 x P3/667
        256MB
        ServeRAID 4L (16MB cache, writethrough)
        5 x 18GB 10k Ultra160 (RAID5, 8KB stripe)
        Red Hat 7.2 w/ kernel 2.4.20-18.7

The rest are:

        2 x Xeon/2.4
        1024MB
        ServeRAID 6i (128MB cache, writethrough)
        5 x 36GB 15k Ultra320 (RAID5EE, 8KB stripe)
        Red Hat 7.2 w/ kernel 2.4.20-24.7 (addl patch: ips 6.10 driver)

This actually seemed to fix the problem on the older machine.  The slow
connections are pretty much eliminated.  

The newer machines, on the other hand, are getting *more* slow
connections, and load average, which previously never exceeded 6, has
been seeing occasional quick spikes as high as 30 (but by the time the
machine is actively viewed, the run queue is pretty much empty).
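
Since the spikes are gone by the time anyone looks at the machine, one
way to capture them is to sample the load average continuously.  The
following is only a rough sketch in Python, assuming a Linux
/proc/loadavg; the function names, the one-second interval and the
threshold of 6 (the previous ceiling mentioned above) are illustrative
assumptions, not part of the setup described here.

```python
import time


def parse_loadavg(text):
    """Parse one line in /proc/loadavg format.

    Example line: "30.12 8.40 3.10 27/412 11206"
    Fields: 1/5/15-minute load, runnable/total tasks, last PID.
    """
    fields = text.split()
    load1, load5, load15 = (float(x) for x in fields[:3])
    runnable, total = (int(x) for x in fields[3].split("/"))
    return {
        "load1": load1,
        "load5": load5,
        "load15": load15,
        "runnable": runnable,
        "total": total,
    }


def sample(path="/proc/loadavg", interval=1.0, count=60, threshold=6.0):
    """Read `path` `count` times and return the samples above `threshold`.

    All defaults are hypothetical values chosen for illustration.
    """
    spikes = []
    for _ in range(count):
        with open(path) as f:
            s = parse_loadavg(f.read())
        if s["load1"] > threshold:
            # Keep the timestamp so spikes can be lined up with other logs.
            spikes.append((time.time(), s))
        time.sleep(interval)
    return spikes
```

Calling sample() from cron or a small wrapper and logging the returned
list would show whether the spikes line up with anything periodic.  The
"runnable" field is the instantaneous run queue, which is often more
telling than the one-minute average for very short bursts.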

It's worth noting that the newer systems didn't exhibit any problematic
behaviour with the arrays in write-through mode.  However, a power
outage on one of those machines had previously resulted in an
exceedingly large amount of data corruption on files that hadn't been
modified in as long as an hour, despite the daemon calling fsync after
write.  Since the drives were already in writethrough, and the files had
both been fsynced and were old enough that they should have been flushed
anyways, we assumed that the contents of the controller's cache were
the likely culprit (since the controller has battery-backed cache, we're
asking IBM why that cache wasn't flushed when the array was brought
back online).
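
For reference, the write-then-fsync pattern referred to above looks
roughly like the sketch below.  This is not the daemon's actual code;
the function name and file mode are assumptions made for illustration.
It also shows why the controller is suspect: fsync only returns once
the kernel has handed the data to the storage device, so anything still
sitting in a controller cache is outside what the call itself can
guarantee.

```python
import os


def write_and_fsync(path, data):
    """Write `data` (bytes) to `path` and fsync before closing.

    Hypothetical helper for illustration only.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        view = memoryview(data)
        # os.write may write fewer bytes than requested, so loop.
        while len(view) > 0:
            written = os.write(fd, view)
            view = view[written:]
        # Ask the kernel to push the file's data and metadata to the device.
        os.fsync(fd)
    finally:
        os.close(fd)
```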

Anyways, any ideas on why a larger journal would cause worse
performance?  Nothing I've read in the archives suggests that
should happen.

-- 
Matthew Berg <galt@xxxxxxxxxxxxxx>

_______________________________________________

Ext3-users@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/ext3-users