Re: ext3 performance issue with a Berkeley db application

Andrew Morton <akpm@digeo.com> · Mon, 3 Feb 2003 17:48:32 -0800

Matthias Andree <ma+ext3@dt.e-technik.uni-dortmund.de> wrote:
>
> 
> Hoho. Seems the kernel doesn't like full write queues too much.

ext3 is making it write much _more_ data, due to the 5-second commit,
the application's redirtying and an ext3 gremlin.

> 
> > Now, looking at the enormous amount of system time which the commit=120 run
> > took, I assume that the application is doing a _ton_ of overwriting. 
> > Redirtying the same pages again and again and again.  So poor old ext3 keeps
> > rewriting them again and again.
> 
> The profile says c. 99% overwrites vs 1% writes to new pages.

Ow.

> However,
> in my experiments and AFAIR in Greg's, the system times were quite
> reasonable. I'm going with the default commit interval (5 s if I read my
> logs right). Killing my test program after a minute:
> 
> real    0m57.872s
> user    0m1.750s
> sys     0m4.920s

It's the ratio between system and user which shows that it's doing a
lot of overwrite.
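
As a rough illustration of that ratio, the figures from the quoted `time` output can be plugged into a few lines of Python. The function name and the returned keys are made up for this sketch; only the three timings come from the thread.

```python
def cpu_breakdown(real_s: float, user_s: float, sys_s: float) -> dict:
    """Summarise a `time` report: kernel-vs-user ratio and CPU share of wall time."""
    return {
        # How many seconds of kernel time were spent per second of user time.
        "sys_to_user": sys_s / user_s,
        # Fraction of wall-clock time the CPU was busy at all; the rest is waiting.
        "cpu_fraction": (user_s + sys_s) / real_s,
    }


if __name__ == "__main__":
    # Timings from the quoted run: real 0m57.872s, user 0m1.750s, sys 0m4.920s.
    result = cpu_breakdown(57.872, 1.750, 4.920)
    print(f"sys/user ratio: {result['sys_to_user']:.2f}")
    print(f"CPU busy fraction: {result['cpu_fraction']:.3f}")
```

On these numbers the kernel used nearly three times as much CPU as the application, and the process was on the CPU for only a small part of the wall-clock time.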

> This is an AMD Duron 700 MHz with PC-133 mem, but I don't recall if I run it
> as PC-133 CL3 or PC-100 CL2.
> 
> > You'll hit similar problems with ext2 - on a slower computer, or on a larger
> > database, or on a system with the kupdate interval decreased from the 30
> > second default.
> 
> decreased or increased?

Decreased.  If you decrease the ext2 or reiserfs writeout expiry time to
match ext3's 5-second commit interval, you may see similar problems.

See, your test takes 25 seconds, which just squeezes inside the default
writeback timeout.  If it happened to take 40 seconds, you might hit this problem
on other filesystems.  Or if the amount of dirty data exceeded 40% of physical
memory.
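
The two conditions above can be written down as a small check. This is only a sketch: the function and parameter names are invented here, the 30-second and 40% defaults are the figures quoted in this thread, and the 256 MB memory size in the example is an assumption (the machine's RAM size is not stated).

```python
MB = 1024 * 1024


def background_writeback_expected(runtime_s: float,
                                  dirty_bytes: int,
                                  mem_bytes: int,
                                  expiry_s: float = 30.0,
                                  dirty_fraction: float = 0.40) -> bool:
    """True if writeback would start while the test is still running.

    Either the run outlives the expiry timeout, or the dirty data
    exceeds the given fraction of physical memory.
    """
    outlives_timeout = runtime_s > expiry_s
    too_much_dirty = dirty_bytes > dirty_fraction * mem_bytes
    return outlives_timeout or too_much_dirty


if __name__ == "__main__":
    # 25-second run, 24 MB of data, hypothetical 256 MB of RAM.
    print(background_writeback_expected(25, 24 * MB, 256 * MB))
    # Same data, but the run takes 40 seconds.
    print(background_writeback_expected(40, 24 * MB, 256 * MB))
```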

What is happening is that once writeback kicks in, that slows the userspace
application down because in certain circumstances, userspace has to wait on
writeout before it can get access to a buffer.  And slowing down userspace
in this way causes an exponential increase in runtime, because the longer
userspace takes to run, the more commit intervals that run will span.  It
feeds on itself.
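
A toy model of that feedback, under stated assumptions: the run needs `base_s` seconds when undisturbed, and every commit interval it spans adds a fixed stall. The per-commit stall values are hypothetical; only the 25-second run and the 5-second commit come from this thread. It is not a model of the real kernel, just of the "feeds on itself" arithmetic.

```python
from typing import Optional


def runtime_with_commits(base_s: int,
                         commit_interval_s: int,
                         stall_per_commit_s: int,
                         max_iter: int = 1000) -> Optional[int]:
    """Solve T = base + stall * floor(T / interval) by repeated substitution.

    Returns the settled runtime, or None if it keeps growing
    (the stall per commit is at least as long as the interval).
    """
    t = base_s
    for _ in range(max_iter):
        commits_spanned = t // commit_interval_s
        new_t = base_s + stall_per_commit_s * commits_spanned
        if new_t == t:
            return t  # a longer run no longer spans any extra commits
        t = new_t
    return None


if __name__ == "__main__":
    for stall in (0, 2, 4, 5):
        print(stall, runtime_with_commits(25, 5, stall))
```

In this model a 2-second stall per commit stretches the 25-second run to 39 seconds, a 4-second stall stretches it to 109 seconds, and a stall as long as the commit interval never settles at all.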

Now, generally the kernel will attempt to prevent serialising userspace
behind background writeout.  But there's one spot in do_get_write_access():

                if (jh->b_jlist == BJ_Shadow) {

where a random mark_inode_dirty() call will serialise behind the ongoing
transaction commit.  This, and the 5-second commit, is the crux of the problem.

If another filesystem (or the VFS) happens to run a random lock_buffer(),
the same could happen there.  It _shouldn't_, but it might.  Testing
those filesystems with a larger dataset, or a slower computer, or with the
kupdate intervals wound down would tell that.

> So what's special about the combination of "ext3fs and IDE"?

Nothing.  Possibly you got lucky on SCSI, and the serialisation against
an under-commit buffer did not happen.  Or the SCSI disk has a larger
writeback cache.  Don't know.  It will happen on SCSI as well.

> The interesting thing in my test is vmstat 1 -- with SCSI, I get some
> hundred blocks trickled out every once in a while. With IDE, I get a
> constant write rate of some hundred blocks per second. (IDE lacks the
> big write at the end because I abort it prematurely and the fsync() is
> missed therefore).

That's because for some unknown reason, IDE triggered the regenerative
slowdown and SCSI didn't.  Try varying a few things and you'll see
SCSI do the same thing.

> Seriously, as long as ext3 + IDE is a problem and ext2 + IDE isn't (with
> 2.4 at least), reiserfs + IDE isn't, ext3 + SCSI isn't, there's no
> compelling reason to change the application code.

Try larger datasets.  ext2 _may_ be OK; it's pretty good about avoiding
serialisation behind I/O.

> Is there anything that might get in the way? Write barrier code?

The locked shadow buffer.

> Is the ext3fs jbd code shared with other file system types I could test?

No.

> I hope my vmstat data is useful. I can compile and test a specific
> kernel version if needed.

Probably ext3 needs to be changed to take a copy of the buffer in there
rather than waiting on the commit.
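
The difference between the two behaviours can be sketched in user-space Python. Everything here is a toy stand-in: `ToyBuffer`, its fields and `get_write_access()` are invented for illustration and are not the JBD data structures or the real do_get_write_access() logic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyBuffer:
    data: bytearray
    under_commit: bool = False      # stands in for "on the shadow list"
    frozen: Optional[bytes] = None  # snapshot the commit would write out


def get_write_access(buf: ToyBuffer, copy_out: bool) -> bool:
    """Return True if the caller may modify buf.data right now."""
    if not buf.under_commit:
        return True
    if copy_out:
        # Proposed behaviour: keep a snapshot for the commit to write,
        # and let the writer carry on with the live data.
        if buf.frozen is None:
            buf.frozen = bytes(buf.data)
        return True
    # Behaviour described above: the caller has to wait for the commit.
    return False


if __name__ == "__main__":
    buf = ToyBuffer(bytearray(b"old"), under_commit=True)
    print(get_write_access(buf, copy_out=False))  # writer would block
    print(get_write_access(buf, copy_out=True))   # writer proceeds
    buf.data[:] = b"new"
    print(buf.frozen, bytes(buf.data))
```

The cost of the copying variant is one extra buffer copy per collision; the benefit is that the writer is never held up behind the commit's I/O.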

But you're only writing 24 megs of data!  Delay the writeout.

_______________________________________________

Ext3-users@redhat.com
https://listman.redhat.com/mailman/listinfo/ext3-users