On Fri, Mar 07, 2025 at 08:31:26AM -0500, Theodore Ts'o wrote:
> On Fri, Mar 07, 2025 at 06:51:23AM -0500, Kent Overstreet wrote:
> >
> > Better bisection algorithm? Standard bisect does really badly when
> > fed noisy data, but it wouldn't be hard to fix that: after N
> > successive passes or fails, which is unlikely because bisect tests
> > are coinflips, backtrack and gather more data in the part of the
> > commit history where you don't have much.
>
> My general approach when handling some test failure is to try running
> the reproducer 5-10 times on the original commit where the failure was
> detected, to see if the reproducer is reliable.  Once it's been
> established whether the failure reproduces 100% of the time, or some
> fraction of the time, say 25% of the time, then we can establish how
> many times we should try running the reproducer before we can conclude
> that a particular commit is "good" --- and the first time we detect a
> failure, we can declare the commit is "bad", even if it happens on the
> 2nd out of the 25 tries that we might need to run a test if it is
> particularly flaky.

That does sound like a nice trick. I think we'd probably want both
approaches, though: I've seen cases where a test starts out failing
perhaps 5% of the time and then jumps up to 40% later on - some other
behavioural change makes your race or what have you easier to hit.

Really what we're trying to do is determine the shape of an unknown
function by sampling; we hope it's just a single stepwise change, but
if not we need to keep gathering more data until we get a clear enough
picture (and we need a way to present that data, too).

> > Maybe this is something Syzbot could implement?

Wouldn't it be better to have it in 'git bisect'?

> And if someone is familiar with the Go language, patches to implement
> this in gce-xfstests's ltm server would be great!  It's something I've
> wanted to do, but I haven't gotten around to implementing it yet so it
> can be fully automated.
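To put rough numbers on the "how many times" question above: if a bad
commit fails each run independently with probability p, then N clean
runs leave a (1 - p)^N chance of mislabeling it "good". A quick sketch
(hypothetical helper, not anything that exists in gce-xfstests today):

```python
import math

def runs_needed(fail_rate: float, confidence: float = 0.95) -> int:
    """Consecutive clean runs needed before declaring a commit "good",
    assuming a bad commit fails each run independently with fail_rate.

    P(N clean runs on a bad commit) = (1 - fail_rate) ** N, so we want
    the smallest N with (1 - fail_rate) ** N <= 1 - confidence.
    """
    return math.ceil(math.log(1 - confidence) / math.log(1 - fail_rate))
```

At a 25% failure rate that works out to 11 runs for 95% confidence, and
25 runs gets you past 99.9%.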
> Right now, ltm's git branch watcher reruns any failing test 5 times,
> so I get an idea of whether a failure is flaky or not.  I'll then
> manually run a potentially flaky test 30 times, and based on how
> reliable or flaky the test failure happens to be, I then tell
> gce-xfstests to do a bisect running each test N times, without having
> it stop once the test fails.  It wastes a bit of test resources, but
> since it doesn't block my personal time (results land in my inbox
> when the bisect completes), it hasn't risen to the top of my todo
> list.

If only we had interns and grad students for this sort of thing :)
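For what it's worth, the "run each commit N times, bad on first
failure, good after N clean runs" bisect is only a few lines; a sketch
(illustrative only - `commits` and `test` are stand-ins, not a real git
interface):

```python
def noisy_bisect(commits, test, runs_per_commit):
    """Bisect a linear history where test(commit) is flaky.

    A commit is declared bad on the first observed failure (test()
    returning False), and good only after runs_per_commit clean runs,
    with runs_per_commit sized from the estimated failure rate.
    Assumes the last commit is bad; returns the index of the first
    bad commit.
    """
    lo, hi = 0, len(commits) - 1      # invariant: commits[hi] is bad
    while lo < hi:
        mid = (lo + hi) // 2
        if all(test(commits[mid]) for _ in range(runs_per_commit)):
            lo = mid + 1              # looked good: first bad is later
        else:
            hi = mid                  # saw a failure: mid is bad
    return lo
```

The step-function assumption is the weak point, as above: if the
failure rate itself changes partway through the range, no fixed
runs_per_commit is right for the whole history.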