Re: ext3 performance issue with a Berkeley db application

Matthias Andree <matthias.andree@gmx.de> · Mon, 3 Feb 2003 12:08:28 +0100

On Sun, 02 Feb 2003, Andrew Morton wrote:

> > Here's the ext3 part.  On another machine I got the following times to
> > build that list of 530,000 tokens, starting by creating empty db files
> > on a partition mounted as:
> 
> How large are the generated output files?

Some tens of MB, around 30.

> > Some further findings:
> > 
> > o  It doesn't matter if the source (mbox) file is on ext3/ordered or
> >    ext2; the difference in time is insignificant.
> > 
> > o  Just processing a message to classify it takes about four times as
> >    long if the .db files are on ext3/ordered as it does if the .db
> >    files are on ext2.
> 
> I would be suspecting that the database is opening the files with O_SYNC or
> is running fsync or such.  Maybe.

The graphs on my web site were created by running strace against
bogofilter, taking the pwrite() offsets, dividing them by 4096 to get
the "page number", and plotting the page numbers against the line number
in the strace output. There is exactly one fsync(), and it directly
precedes the close().
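
A minimal sketch of that post-processing step, not the script actually
used for the graphs. It assumes strace lines of the usual form
pwrite(fd, "..."..., count, offset) = ret (or pwrite64 on newer strace);
the regular expression and the function name page_numbers are
illustrative assumptions.

```python
import re

PAGE_SIZE = 4096  # divisor mentioned in the text

# Assumed strace line shape: pwrite(fd, "data"..., count, offset) = ret
# The greedy ".*" skips the quoted buffer; the last two numeric
# arguments before ")" are taken as count and offset.
PWRITE_RE = re.compile(
    r'pwrite(?:64)?\(\d+,.*,\s*(\d+),\s*(\d+)\)\s*=\s*(-?\d+)'
)


def page_numbers(lines):
    """Yield (line_number, page_number) for each successful pwrite()."""
    for lineno, line in enumerate(lines, start=1):
        m = PWRITE_RE.search(line)
        if m is None:
            continue  # not a pwrite() line
        if int(m.group(3)) < 0:
            continue  # the call failed; skip it
        offset = int(m.group(2))
        yield lineno, offset // PAGE_SIZE
```

Feeding it the lines of an strace log gives the (x, y) pairs for a
plot of page number over strace line number.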

In either case, the database file is opened with O_RDWR|O_LARGEFILE,
without O_SYNC (I checked this with strace). There is no fdatasync()
either.

> > o  Dumping the tokens and counts from the database in text form and
> >    reloading them into a new database file is not subject to serious
> >    performance problems; on the machine that needs 24 minutes to
> >    build from a 200-Mb mbox, rebuilding the database from a list of
> >    tokens took eight seconds -- this was on ext3 in ordered mode.
> 
> How does this operation differ from the operation which is "slow"?

The access pattern changes a lot, because the database is dumped in
traversal order, which gives reinserting the entries into a fresh tree
MUCH better data locality. Most writes are then in sequential order with
respect to the file offset, with some excursions to offsets 0 and 4096
(pages #0 and #1), as you can see on

http://mandree.home.pages.de/bogofilter/bogoutil.png   <- write positions
http://mandree.home.pages.de/bogofilter/bogoutil-f.png <- page frequency

There are also fewer write accesses altogether.
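
To put numbers on "locality", one could compute the page frequency and
the share of sequential writes from a list of page numbers. This is only
a sketch: the function names are made up for illustration, and leaving
out pages #0 and #1 when judging sequentiality is an assumption based on
the excursions described above.

```python
from collections import Counter


def page_frequency(pages):
    """Count how often each page number is written."""
    return Counter(pages)


def sequential_fraction(pages, ignore=(0, 1)):
    """Fraction of consecutive writes that hit the same or the next page.

    Pages listed in `ignore` (assumed here to be pages 0 and 1) are
    dropped before comparing neighbours.
    """
    data = [p for p in pages if p not in ignore]
    if len(data) < 2:
        return 1.0  # nothing to compare, treat as fully sequential
    sequential = sum(1 for a, b in zip(data, data[1:]) if 0 <= b - a <= 1)
    return sequential / (len(data) - 1)
```

A value near 1.0 would correspond to the mostly sequential pattern of
the reload, a value near 0.0 to scattered writes.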

> > These data were obtained on machines running linux kernels
> > 2.4.21-pre3-ac4 and -ac5 and 2.4.21-pre4-ac1; kernel 2.4.20-ac2 appears
> > to give similar results though this has not been thoroughly tested.
> > Results like those reported were initially obtained with db-3.1.17; the
> > tests shown here used db-4.1.25.
> > 
> > More info available on request; tuning hints most gratefully received
> > and tested.
> 
> If you can suggest an easy way in which I can reproduce this, that would be
> efficient.

If it's acceptable for you to build the current bogofilter package
http://bogofilter.sourceforge.net/ then Greg could provide you with a
Perl script to create a proper input.

If that's too much effort for you, which I'd perfectly understand,
just say so and I'll ask Greg to send me an strace of his "slow"
program and write a short standalone C or perhaps Perl program that
reproduces exactly the scattered pwrite() sequence we observe in our
application.
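
Such a reproducer could look roughly like the following. It is a sketch
in Python rather than C or Perl, and it rests on assumptions: every
write is taken to be one 4096-byte page of zeros, O_CREAT is added so
the sketch can start from a missing file, and the name replay_pwrites
is hypothetical. The single fsync() right before close() follows the
strace observation above.

```python
import os

PAGE_SIZE = 4096  # assumed size of every write


def replay_pwrites(path, pages, page_size=PAGE_SIZE):
    """Replay a sequence of page-sized pwrite() calls against `path`.

    `pages` is the page-number sequence, e.g. taken from an strace log.
    """
    block = b"\0" * page_size
    # O_RDWR without O_SYNC, as in the traced application. O_LARGEFILE
    # is left out because 64-bit Python builds use large offsets anyway.
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        for page in pages:
            os.pwrite(fd, block, page * page_size)
        os.fsync(fd)  # exactly one fsync(), directly before close()
    finally:
        os.close(fd)
```

Timing the same page sequence with the target file on ext2 and on
ext3/ordered would then show whether the write pattern alone triggers
the slowdown.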

Are you aware of a module that applies to recent kernel versions and
that traces the block numbers of ll_rw_block()? It might turn up some
useful information -- then we'd more easily know the ext3 "output" to
the hard disk; we already know the "input" from the application at the
syscall level.

BTW: what's the status of the dirsync patches with respect to
2.4.21-pre? Is further testing needed, or just a "ping" to get them
merged? Or does Marcelo not want the patch?

-- 
Matthias Andree

_______________________________________________

Ext3-users@redhat.com
https://listman.redhat.com/mailman/listinfo/ext3-users