On Mon, Aug 29, 2022 at 11:04:34AM +0200, Jan Kara wrote:
> On Sat 27-08-22 20:06:00, Ojaswin Mujoo wrote:
> > On Fri, Aug 26, 2022 at 12:15:22PM +0200, Jan Kara wrote:
> > > Hi Stefan,
> > >
> > > On Thu 25-08-22 18:57:08, Stefan Wahren wrote:
> > > > > Perhaps if you just download the archive manually, call sync(1),
> > > > > and measure how long it takes to (untar the archive + sync) in
> > > > > mb_optimize_scan=0/1 we can see whether plain untar is indeed
> > > > > making the difference or there's something else influencing the
> > > > > result as well (I have checked and rpi-update does a lot of
> > > > > other deleting & copying as part of the update)? Thanks.
> > > >
> > > > mb_optimize_scan=0 -> almost 5 minutes
> > > >
> > > > mb_optimize_scan=1 -> almost 18 minutes
> > > >
> > > > https://github.com/lategoodbye/mb_optimize_scan_regress/commit/3f3fe8f87881687bb654051942923a6b78f16dec
> > >
> > > Thanks! So now the iostat data indeed looks substantially different.
> > >
> > >                          nooptimize   optimize
> > > Total written            183.6 MB     190.5 MB
> > > Time (recorded)          283 s        1040 s
> > > Avg write request size   79 KB        41 KB
> > >
> > > So indeed with mb_optimize_scan=1 we do submit substantially
> > > smaller requests on average. So far I'm not sure why that is. Since
> > > Ojaswin can reproduce as well, let's see what he can see from block
> > > location info. Thanks again for help with debugging this and enjoy
> > > your vacation!
> >
> > Hi Jan and Stefan,
> >
> > Apologies for the delay, I was on leave yesterday and couldn't find
> > time to get to this.
> >
> > So I was able to collect the block numbers using the method you
> > suggested. I converted the block numbers to BG numbers and plotted
> > that data to visualize the allocation spread. You can find them here:
> >
> > mb-opt=0, patched kernel: https://github.com/OjaswinM/mbopt-bug/blob/master/grpahs/mbopt-0-patched.png
> > mb-opt=1, patched kernel: https://github.com/OjaswinM/mbopt-bug/blob/master/grpahs/mbopt-1-patched.png
> > mb-opt=1, unpatched kernel: https://github.com/OjaswinM/mbopt-bug/blob/master/grpahs/mbopt-1-unpatched.png
> >
> > Observations:
> > * Before the patch, mb_optimize_scan=1 allocations were way more
> >   spread out, across 40 different BGs.
> > * With the patch, we still allocate in 36 different BGs, but the
> >   majority of allocations happen in just 1 or 2 BGs.
> > * With mb_optimize_scan=0, we allocate in just 7 unique BGs, which
> >   could explain why this is faster.
>
> Thanks for testing Ojaswin! Based on iostats from Stefan, I'm
> relatively confident the spread between block groups is responsible
> for the performance regression. Iostats show pretty clearly that the
> write throughput is determined by the average write request size,
> which is directly related to the number of block groups we allocate
> from.
>
> Your stats for the patched kernel show that there are two block
> groups which get a big part of the allocations (these are likely the
> target block groups) but the remaining ~1/3 is spread a lot. I'm not
> yet sure why that is... I guess I will fiddle some more with my test
> VM and try to reproduce these allocation differences (on my test
> server the allocation pattern on the patched kernel is very similar
> with mb_optimize_scan=0/1).
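For reference, the block-to-BG conversion I mentioned above is
straightforward. A minimal sketch along these lines (not the exact
script I used; it assumes the default 4k-block geometry of 32768
blocks per group, and "blocks.txt" with one physical block number per
line is a hypothetical input file):

    from collections import Counter

    # ext4 default: 8 * block_size blocks per group, i.e. 32768 for
    # 4k blocks; check dumpe2fs output if your geometry differs.
    BLOCKS_PER_GROUP = 32768

    # Map each allocated block to its block group number.
    with open("blocks.txt") as f:
        groups = [int(line) // BLOCKS_PER_GROUP
                  for line in f if line.strip()]

    # Summarize the allocation spread across block groups.
    spread = Counter(groups)
    print(f"{len(spread)} unique BGs touched")
    for bg, count in spread.most_common(10):
        print(f"BG {bg:5d}: {count} allocations")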
> > Also, one strange thing I'm seeing is that the perf profiles don't
> > really show any particular function causing the regression, which
> > is surprising considering mb_optimize_scan=1 takes almost 10 times
> > more time.
>
> Well, the time is not spent by CPU. We spend more time waiting for
> IO, which is not visible in perf profiles. You could plot something
> like offcpu flame graphs, where the difference would be visible, but
> I don't expect you'd see anything more than that we spend more time
> in functions waiting for writeback to complete.

Ahh I see, that makes sense.

> > Lastly, FWIW I'm not able to replicate the regression when using
> > loop devices, and mb_optimize_scan=1 performs similarly to
> > mb_optimize_scan=0 (without patches as well). Not sure if this is
> > related to the issue or just some side effect of using loop devices.
>
> This is because eMMC devices seem to be very sensitive to the IO
> pattern (and write request size). For loop devices, we don't care
> about request size much, so that's why mb_optimize_scan makes no big
> difference. But can you still see the difference in the allocation
> pattern with the loop device?

So I tested with loop devices and yes, I do see a similar allocation
pattern, although the regression is not there.

Thanks,
Ojaswin
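P.S. In case it helps others reproduce the iostat comparison above:
the average write request size can also be derived from
/proc/diskstats deltas taken around the workload. A rough sketch (the
device name "mmcblk0" and the 10 s window are placeholders; field
layout per Documentation/admin-guide/iostats.rst):

    import time

    def write_stats(dev):
        # /proc/diskstats: fields[7] = writes completed,
        # fields[9] = sectors written (sectors are always 512 bytes).
        with open("/proc/diskstats") as f:
            for line in f:
                fields = line.split()
                if fields[2] == dev:
                    return int(fields[7]), int(fields[9])
        raise ValueError(f"device {dev!r} not found")

    w0, s0 = write_stats("mmcblk0")
    time.sleep(10)  # run the untar workload during this window
    w1, s1 = write_stats("mmcblk0")

    if w1 > w0:
        avg_kb = (s1 - s0) * 512 / (w1 - w0) / 1024
        print(f"avg write request size: {avg_kb:.1f} KB")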