Hi Cristian,

On 29/10/2024 05:57, Cristian Prundeanu wrote:
> Hi Gautham,
>
> On 2024-10-25, 09:44, "Gautham R. Shenoy" <gautham.shenoy@xxxxxxx> wrote:
>
>> On Thu, Oct 24, 2024 at 07:12:49PM +1100, Benjamin Herrenschmidt wrote:
>>> On Sat, 2024-10-19 at 02:30 +0000, Prundeanu, Cristian wrote:
>>>>
>>>> The hammerdb test is a bit more complex than sysbench. It uses two
>>>> independent physical machines to perform a TPC-C derived test [1], aiming
>>>> to simulate a real-world database workload. The machines are allocated as
>>>> an AWS EC2 instance pair on the same cluster placement group [2], to avoid
>>>> measuring network bottlenecks instead of server performance. The SUT
>>>> instance runs mysql configured to use 2 worker threads per vCPU (32
>>>> total); the load generator instance runs hammerdb configured with 64
>>>> virtual users and 24 warehouses [3]. Each test consists of multiple
>>>> 20-minute rounds, run consecutively on multiple independent instance
>>>> pairs.
>>>
>>> Would it be possible to produce something that Prateek and Gautham
>>> (Hi Gautham btw !) can easily consume to reproduce ?
>>>
>>> Maybe a container image or a pair of container images hammering each
>>> other ? (the simpler the better).
>>
>> Yes, that would be useful. Please share your recipe. We will try and
>> reproduce it at our end. In our testing from a few months ago (some of
>> which was presented at OSPM 2024), most of the database related
>> regressions that we observed with EEVDF went away after running these
>> the server threads under SCHED_BATCH.
>
> I am working on a repro package that is self contained and as simple to
> share as possible.
>
> My testing with SCHED_BATCH is meanwhile concluded. It did reduce the
> regression to less than half - but only with WAKEUP_PREEMPTION enabled.
> When using NO_WAKEUP_PREEMPTION, there was no performance change compared
> to SCHED_OTHER.

Which tasks did you set to SCHED_BATCH here?
I'm assuming the mysql 'connection' tasks on the SUT (1 task for each
virtual user, I guess). I did this and see that the regression goes away.

I'm using a similar test setup (hammerdb - mysql on AWS EC2 instances).
I'm not sure yet how reliable my results are. The big unknown is the host
system when I use AWS EC2 instances for hammerdb (Load Gen) and mysql
(server). If I gather test results over multiple days, the host system
might have changed.

I also tried the (not-mainlined) RESPECT_SLICE (NO_RUN_TO_PARITY) feature,
which shows results similar to SCHED_BATCH for those threads. IIRC,
RESPECT_SLICE also helped Gautham get the performance back for his
'sysbench + mysql' workload. Link to his OSPM 24 presentation:

https://youtu.be/jrEN4pJiRWU?t=1115

> (At the risk of stating the obvious, using SCHED_BATCH only to get back to
> the default CFS performance is still only a workaround, just as disabling
> PLACE_LAG+RUN_TO_PARITY is; these give us more room to investigate the
> root cause in EEVDF, but shouldn't be seen as viable alternate solutions.)
>
> Do you have more detail on the database regressions you saw a few months
> ago? What was the magnitude, and which workloads did it manifest on?
>
> -Cristian
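
For anyone who wants to repeat the SCHED_BATCH experiment on the mysql threads, below is a minimal sketch of one way to do it from Python on Linux. It is not the script used for the results in this thread. The helper names (thread_ids, set_batch_for_threads) are made up for illustration, and selecting threads whose comm is "connection" is an assumption based on the task name mentioned above; check /proc/<pid>/task/*/comm on your own SUT first.

```python
import os
from pathlib import Path

# SCHED_BATCH is policy 3 in the Linux UAPI; os.SCHED_BATCH only exists on Linux.
SCHED_BATCH = getattr(os, "SCHED_BATCH", 3)


def thread_ids(pid, proc_root="/proc"):
    """Return the sorted thread IDs listed under <proc_root>/<pid>/task."""
    task_dir = Path(proc_root) / str(pid) / "task"
    return sorted(int(e.name) for e in task_dir.iterdir() if e.name.isdigit())


def _set_batch(tid):
    # SCHED_BATCH is a non-realtime policy, so the static priority must be 0.
    os.sched_setscheduler(tid, SCHED_BATCH, os.sched_param(0))


def set_batch_for_threads(pid, select=lambda tid, name: True,
                          proc_root="/proc", apply=_set_batch):
    """Switch the selected threads of `pid` to SCHED_BATCH; return their TIDs."""
    changed = []
    for tid in thread_ids(pid, proc_root):
        comm = Path(proc_root) / str(pid) / "task" / str(tid) / "comm"
        try:
            name = comm.read_text().strip()
        except FileNotFoundError:
            continue  # the thread exited between listing and reading
        if select(tid, name):
            apply(tid)
            changed.append(tid)
    return changed


# Hypothetical usage, with mysqld_pid being the PID of the mysqld process:
#   set_batch_for_threads(mysqld_pid,
#                         select=lambda tid, name: name == "connection")
```

Threads created after the call are not covered unless they inherit the policy from a thread that was already switched, so the call probably has to be repeated once all virtual users are connected. Changing the policy of another process's threads normally requires running as the same user or with CAP_SYS_NICE.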