RE: Mainline kernel OLTP performance update

"Ma, Chinang" <chinang.ma@xxxxxxxxx> · Thu, 15 Jan 2009 00:11:51 -0700

Trying to answer some of the questions below:

-Chinang

>-----Original Message-----
>From: Steven Rostedt [mailto:srostedt@xxxxxxxxxx]
>Sent: Wednesday, January 14, 2009 6:27 PM
>To: Andrew Morton
>Cc: Matthew Wilcox; Wilcox, Matthew R; Ma, Chinang;
>linux-kernel@xxxxxxxxxxxxxxx; Tripathi, Sharad C; arjan@xxxxxxxxxxxxxxx; Kleen,
>Andi; Siddha, Suresh B; Chilukuri, Harita; Styner, Douglas W; Wang, Peter
>Xihong; Nueckel, Hubert; chris.mason@xxxxxxxxxx; linux-scsi@xxxxxxxxxxxxxxx;
>Andrew Vasquez; Anirban Chakraborty; Ingo Molnar; Thomas Gleixner; Peter
>Zijlstra; Gregory Haskins
>Subject: Re: Mainline kernel OLTP performance update
>
>(added Ingo, Thomas, Peter and Gregory)
>
>On Wed, 2009-01-14 at 18:04 -0800, Andrew Morton wrote:
>> On Wed, 14 Jan 2009 18:21:47 -0700 Matthew Wilcox <matthew@xxxxxx> wrote:
>>
>> > On Wed, Jan 14, 2009 at 04:35:57PM -0800, Andrew Morton wrote:
>> > > On Tue, 13 Jan 2009 15:44:17 -0700
>> > > "Wilcox, Matthew R" <matthew.r.wilcox@xxxxxxxxx> wrote:
>> > > >
>> > >
>> > > (top-posting repaired.  That @intel.com address is a bad influence ;))
>> >
>> > Alas, that email address goes to an Outlook client.  Not much to be
>> > done about that.
>>
>> aspirin?
>>
>> > > (cc linux-scsi)
>> > >
>> > > > > This is latest 2.6.29-rc1 kernel OLTP performance result. Compare
>> > > > > to 2.6.24.2 the regression is around 3.5%.
>> > > > >
>> > > > > Linux OLTP Performance summary
>> > > > > Kernel#            Speedup(x)   Intr/s  CtxSw/s us%  sys%   idle%  iowait%
>> > > > > 2.6.24.2                1.000   21969   43425   76   24     0      0
>> > > > > 2.6.27.2                0.973   30402   43523   74   25     0      1
>> > > > > 2.6.29-rc1              0.965   30331   41970   74   26     0      0
>> >
>> > > But the interrupt rate went through the roof.
>> >
>> > Yes.  I forget why that was; I'll have to dig through my archives for
>> > that.
>>
>> Oh.  I'd have thought that this alone could account for 3.5%.
>>
>> > > A 3.5% slowdown in this workload is considered pretty serious, isn't
>> > > it?
>> >
>> > Yes.  Anything above 0.3% is statistically significant.  1% is a big
>> > deal.  The fact that we've lost 3.5% in the last year doesn't make
>> > people happy.  There's a few things we've identified that have a big
>> > effect:
>> >
>> >  - Per-partition statistics.  Putting in a sysctl to stop doing them
>> >    gets some of that back, but not as much as taking them out (even when
>> >    the sysctl'd variable is in a __read_mostly section).  We tried a
>> >    patch from Jens to speed up the search for a new partition, but it
>> >    had no effect.
>>
>> I find this surprising.
>>
>> >  - The RT scheduler changes.  They're better for some RT tasks, but not
>> >    the database benchmark workload.  Chinang has posted about
>> >    this before, but the thread didn't really go anywhere.
>> >    http://marc.info/?t=122903815000001&r=1&w=2
>
>I read the whole thread before I found what you were talking about here:
>
>http://marc.info/?l=linux-kernel&m=122937424114658&w=2
>
>With this comment:
>
>"When setting foreground and log writer to rt-prio, the log latency reduced
>to 4.8ms. Performance is about 1.5% higher than the CFS result.
>On a side note, we had been using rt-prio on all DBMS processes and log
>writer (in higher priority) for the best OLTP performance. That has worked
>pretty well until 2.6.25 when the new rt scheduler introduced the pull/push
>task for lower scheduling latency for rt-task. That has negative impact on
>this workload, probably due to the more elaborated load
>calculation/balancing for hundred of foreground rt-prio processes. Also,
>there is that question of no production environment would run DBMS with
>rt-prio. That is why I am going back to explore CFS and see whether I can
>drop rt-prio for good."
>

>A couple of questions:
>
>1) how does the latest rt scheduler compare?  There has been a lot of
>improvements.

It is difficult for me to isolate the recent rt scheduler improvements, as so many other changes were introduced to the kernel at the same time. A more accurate comparison would revert just the rt scheduler to the previous version and measure the delta. I am not sure how to get that done.

>2) how many rt tasks?   
	Around 250 rt tasks.

>3) what were the prios, producer compared to consumers, not actual numbers
	I suppose the single log writer is the main producer (rt-prio 49, the highest rt-prio in this workload), which wakes up all foreground processes when the log write is done. The 240 foreground processes are the consumers (rt-prio 48). At any given time, some number of the 240 foreground processes are waiting for the log writer to finish flushing out the log data.
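
As an illustration only, the priority layout described above could be set up roughly as follows. This is a sketch, not the scripts used for the benchmark: the helper names and the pids are hypothetical, and SCHED_FIFO is an assumption (the policy could equally be SCHED_RR; only the rt-prio values 49 and 48 come from the description above).

```python
import os

LOG_WRITER_PRIO = 49   # highest rt-prio in the workload (the single producer)
FOREGROUND_PRIO = 48   # foreground processes (the consumers), one step below


def priority_plan(log_writer_pid, foreground_pids):
    """Map each pid to the rt priority it should run at."""
    plan = {pid: FOREGROUND_PRIO for pid in foreground_pids}
    plan[log_writer_pid] = LOG_WRITER_PRIO
    return plan


def apply_plan(plan):
    """Apply the plan. Linux only; needs root or CAP_SYS_NICE.

    SCHED_FIFO is assumed here, it is not stated in the thread.
    """
    for pid, prio in plan.items():
        os.sched_setscheduler(pid, os.SCHED_FIFO, os.sched_param(prio))
```

Keeping the plan separate from the privileged call makes it easy to inspect the layout (one pid at 49, everything else at 48) before touching any running process.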

>4) have you tried pinning tasks?
>
We did try pinning the foreground rt processes to CPUs. That recovered about 1% performance but introduced idle time on some CPUs. Without load balancing, my solution is to pin more processes to the idle CPUs. I don't think this is a practical solution for the idle time problem, as the process distribution needs to be adjusted again when upgrading to a different server.
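
A minimal sketch of what static pinning looks like, to show why the distribution is tied to one machine. The function names, the round-robin policy and the CPU counts are assumptions made for illustration; they are not the actual setup used in the test.

```python
import os


def pin_plan(pids, cpus):
    """Spread pids over cpus round-robin; returns {pid: cpu}."""
    cpus = list(cpus)
    return {pid: cpus[i % len(cpus)] for i, pid in enumerate(pids)}


def apply_pins(plan):
    """Pin each pid to its single assigned cpu (Linux only)."""
    for pid, cpu in plan.items():
        os.sched_setaffinity(pid, {cpu})
```

With 240 processes, a hypothetical 16-CPU box gets 15 processes per CPU, while a 24-CPU box would get 10, so the plan has to be regenerated for every server. An equal process count per CPU also does not guarantee equal load, which is consistent with the idle time seen on some CPUs.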

>RT is more about determinism than performance. The old scheduler
>migrated rt tasks the same as other tasks. This helps with performance
>because it will keep several rt tasks on the same CPU and cache hot even
>when a rt task can migrate. This helps performance, but kills
>determinism (I was seeing 10 ms wake up times from the next-highest-prio
>task on a cpu, even when another CPU was available).
>
>If you pin a task to a cpu, then it skips over the push and pull logic
>and will help with performance too.
>
>-- Steve
>
>
>
>>
>> Well.  It's more a case that it wasn't taken anywhere.  I appear to
>> have recently been informed that there have never been any
>> CPU-scheduler-caused regressions.  Please persist!
>>
>> > SLUB would have had a huge negative effect if we were using it -- on
>> > the order of 7% iirc.  SLQB is at least performance-neutral with SLAB.
>>
>> We really need to unblock that problem somehow.  I assume that
>> enterprise distros are shipping slab?
>>
