On Mon, Aug 29, 2022 at 11:04:34AM +0200, Jan Kara wrote:
> On Sat 27-08-22 20:06:00, Ojaswin Mujoo wrote:
> > On Fri, Aug 26, 2022 at 12:15:22PM +0200, Jan Kara wrote:
> > > Hi Stefan,
> > >
> > > On Thu 25-08-22 18:57:08, Stefan Wahren wrote:
> > > > > Perhaps if you just download the archive manually, call sync(1),
> > > > > and measure how long it takes to (untar the archive + sync) in
> > > > > mb_optimize_scan=0/1 we can see whether plain untar is indeed
> > > > > making the difference or there's something else influencing the
> > > > > result as well (I have checked and rpi-update does a lot of
> > > > > other deleting & copying as part of the update)? Thanks.
> > > >
> > > > mb_optimize_scan=0 -> almost 5 minutes
> > > >
> > > > mb_optimize_scan=1 -> almost 18 minutes
> > > >
> > > > https://github.com/lategoodbye/mb_optimize_scan_regress/commit/3f3fe8f87881687bb654051942923a6b78f16dec
> > >
> > > Thanks! So now the iostat data indeed looks substantially different.
> > >
> > >                          nooptimize   optimize
> > > Total written            183.6 MB     190.5 MB
> > > Time (recorded)          283 s        1040 s
> > > Avg write request size   79 KB        41 KB
> > >
> > > So indeed with mb_optimize_scan=1 we do submit substantially
> > > smaller requests on average. So far I'm not sure why that is. Since
> > > Ojaswin can reproduce as well, let's see what he can see from block
> > > location info. Thanks again for help with debugging this and enjoy
> > > your vacation!
> >
> > Hi Jan and Stefan,
> >
> > Apologies for the delay, I was on leave yesterday and couldn't find
> > time to get to this.
> >
> > So I was able to collect the block numbers using the method you
> > suggested. I converted the block numbers to BG numbers and plotted
> > that data to visualize the allocation spread. You can find them here:
> >
> > mb-opt=0, patched kernel: https://github.com/OjaswinM/mbopt-bug/blob/master/grpahs/mbopt-0-patched.png
> > mb-opt=1, patched kernel: https://github.com/OjaswinM/mbopt-bug/blob/master/grpahs/mbopt-1-patched.png
> > mb-opt=1, unpatched kernel: https://github.com/OjaswinM/mbopt-bug/blob/master/grpahs/mbopt-1-unpatched.png
> >
> > Observations:
> > * Before the patch, mb_optimize_scan=1 allocations were way more
> >   spread out, across 40 different BGs.
> > * With the patch, we still allocate in 36 different BGs, but the
> >   majority of allocations happen in just 1 or 2 BGs.
> > * With mb_optimize_scan=0, we allocate in just 7 unique BGs, which
> >   could explain why this is faster.
>
> Thanks for testing Ojaswin! Based on iostats from Stefan, I'm
> relatively confident the spread between block groups is responsible
> for the performance regression. Iostats show pretty clearly that the
> write throughput is determined by the average write request size,
> which is directly related to the number of block groups we allocate
> from.
>
> Your stats for the patched kernel show that there are two block
> groups which get a big part of the allocations (these are likely the
> target block groups) but the remaining ~1/3 is spread a lot. I'm not
> yet sure why that is... I guess I will fiddle some more with my test
> VM and try to reproduce these allocation differences (on my test
> server the allocation pattern on the patched kernel is very similar
> with mb_optimize_scan=0/1).
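For reference, the block-to-BG conversion I mentioned above is
straightforward. A minimal sketch along these lines (not the exact
script I used; it assumes the default 4k-block geometry of 32768
blocks per group, and "blocks.txt" with one physical block number per
line is a hypothetical input file):

    from collections import Counter

    # ext4 default: 8 * block_size blocks per group, i.e. 32768 for
    # 4k blocks; check dumpe2fs output if your geometry differs.
    BLOCKS_PER_GROUP = 32768

    # Map each allocated block to its block group number.
    with open("blocks.txt") as f:
        groups = [int(line) // BLOCKS_PER_GROUP
                  for line in f if line.strip()]

    # Summarize the allocation spread across block groups.
    spread = Counter(groups)
    print(f"{len(spread)} unique BGs touched")
    for bg, count in spread.most_common(10):
        print(f"BG {bg:5d}: {count} allocations")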
> > Also, one strange thing I'm seeing is that the perf profiles don't
> > really show any particular function causing the regression, which
> > is surprising considering mb_optimize_scan=1 takes almost 10 times
> > more time.
>
> Well, the time is not spent by CPU. We spend more time waiting for
> IO, which is not visible in perf profiles. You could plot something
> like offcpu flame graphs, where the difference would be visible, but
> I don't expect you'd see anything more than that we spend more time
> in functions waiting for writeback to complete.

Ahh I see, that makes sense.

> > Lastly, FWIW I'm not able to replicate the regression when using
> > loop devices, and mb_optimize_scan=1 performs similarly to
> > mb_optimize_scan=0 (without patches as well). Not sure if this is
> > related to the issue or just some side effect of using loop devices.
>
> This is because eMMC devices seem to be very sensitive to the IO
> pattern (and write request size). For loop devices, we don't care
> about request size much, so that's why mb_optimize_scan makes no big
> difference. But can you still see the difference in the allocation
> pattern with the loop device?

So I tested with loop devices and yes, I do see a similar allocation
pattern, although the regression is not there.

Thanks,
Ojaswin
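P.S. In case it helps others reproduce the iostat comparison above:
the average write request size can also be derived from
/proc/diskstats deltas taken around the workload. A rough sketch (the
device name "mmcblk0" and the 10 s window are placeholders; field
layout per Documentation/admin-guide/iostats.rst):

    import time

    def write_stats(dev):
        # /proc/diskstats: fields[7] = writes completed,
        # fields[9] = sectors written (sectors are always 512 bytes).
        with open("/proc/diskstats") as f:
            for line in f:
                fields = line.split()
                if fields[2] == dev:
                    return int(fields[7]), int(fields[9])
        raise ValueError(f"device {dev!r} not found")

    w0, s0 = write_stats("mmcblk0")
    time.sleep(10)  # run the untar workload during this window
    w1, s1 = write_stats("mmcblk0")

    if w1 > w0:
        avg_kb = (s1 - s0) * 512 / (w1 - w0) / 1024
        print(f"avg write request size: {avg_kb:.1f} KB")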