Hello everyone,

I decided to share some ideas before jumping into them, along with how they align with the TODO list. Let's just assume for now that we have something similar to the socket cache for FDs (i.e., they are also reproducible to some extent). I am planning to work on that sooner rather than later, but it is not crucial for the current discussion.

Here are my thoughts:

One runs trinity either with a single child process or with multiple child processes. With a single child we don't catch bugs that depend on a race condition, but we do catch everything else (the usual bugs, where a parameter's value is handled incorrectly). [Q: Am I missing something here?]

Now assume that we caught a bug. How can we reproduce it? What is the minimal set of details we need? I think that, in general, we need only the rand seed. On a rerun we use that number to reseed the PRNG, and trinity should follow the same execution path (i.e., take the same branch at every place where rand() is called for branching) and generate the same input parameters. Of course, having the iteration number at which the kernel oopsed would help speed up the replay, like a fast forward.

However, the way it is implemented right now is a bit risky: the rand functions are called in place, and we cannot guarantee that no other code (such as the gcc libraries) has made a call to rand(), thus moving the rand queue forward. If such calls repeat identically on every run it is not a big deal; it is a problem, however, if that behavior is stochastic.

The idea I have is to keep a buffer of rand numbers, filled by the parent process, and to report in the logs what is at the beginning of that buffer and what is at the end. That way, when we later use the "replay" function, we can be sure that the same rand buffer is used. If we need several buffers, we can always use these details in the logs to fast forward the rand function when we see that its output does not match the buffer recorded in the logs.

Now for the multiple-child approach.
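(A quick aside before the multi-child case, to make the rand buffer idea above more concrete. This is only a rough sketch under my own assumptions, not existing trinity code: the names struct rand_buf, rand_buf_fill, rand_buf_next, rand_buf_matches and rand_buf_seek, the buffer size, and the choice of the first and last values as the logged "fingerprint" are all hypothetical.)

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define RAND_BUF_LEN 1024	/* hypothetical size, not taken from trinity */

/* Hypothetical structure: a block of pre-generated numbers plus a cursor. */
struct rand_buf {
	uint32_t vals[RAND_BUF_LEN];
	size_t next;			/* index of the next unused value */
	unsigned long generation;	/* how many buffers were generated so far */
};

/* Parent side: regenerate the whole buffer from the libc PRNG. */
void rand_buf_fill(struct rand_buf *rb)
{
	for (size_t i = 0; i < RAND_BUF_LEN; i++)
		rb->vals[i] = (uint32_t)rand();
	rb->next = 0;
	rb->generation++;
}

/* Seed the PRNG once and generate the first buffer. */
void rand_buf_init(struct rand_buf *rb, unsigned int seed)
{
	assert(rb != NULL);
	srand(seed);
	rb->generation = 0;
	rand_buf_fill(rb);
}

/* Consumer side: take the next value; refill when the buffer runs out.
 * (In the real design the refill would be done by the parent.) */
uint32_t rand_buf_next(struct rand_buf *rb)
{
	if (rb->next == RAND_BUF_LEN)
		rand_buf_fill(rb);
	return rb->vals[rb->next++];
}

/* Compare the buffer against the head/tail values reported in the logs. */
int rand_buf_matches(const struct rand_buf *rb, uint32_t head, uint32_t tail)
{
	return rb->vals[0] == head && rb->vals[RAND_BUF_LEN - 1] == tail;
}

/* Replay side: fast forward by regenerating until the logged fingerprint
 * shows up, giving up after max_refills regenerations.
 * Returns 1 on a match, 0 otherwise. */
int rand_buf_seek(struct rand_buf *rb, uint32_t head, uint32_t tail,
		  unsigned int max_refills)
{
	unsigned int done = 0;

	while (!rand_buf_matches(rb, head, tail)) {
		if (done++ == max_refills)
			return 0;
		rand_buf_fill(rb);
	}
	return 1;
}
```

On replay one would call rand_buf_init() with the logged seed and then rand_buf_seek() with the logged head and tail values. If the seek fails, something else has consumed numbers from the PRNG and the replay is not going to follow the original path.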
With multiple children it gets even more interesting. When we fork a child, we reseed it. This approach is very good from a simplicity point of view, but there is a small caveat: we do not explore any seed deep enough into its random queue. Also, we now have to keep track of all the child seeds, which makes the replay function a bit more complicated.

The idea I had is that the SHM struct holds buffers of rand numbers generated before we fork a child, and each child uses its own buffer so that it is predictable. Once a buffer is empty, the parent process regenerates it. I.e., we basically avoid calls to the rand function from a child completely. This makes the handling of randomness uniform regardless of whether you run with a single child or with multiple child processes.

For the replay function to be more successful with multiple child processes, it should follow the exact call sequence of the original run in which the bug was discovered. E.g., if child A executed syscalls 1,2,3,4,4,4 and only then child B executed syscalls 5 and 6, the replay has to do exactly the same. Replaying without enforcing that order is also OK, but in my view such an approach has a lower chance of reproducing race-condition bugs. Of course, exact repetition of the sequence does not guarantee that the bug repeats; still, it should have a somewhat higher chance.

To summarize: for replay with multiple child processes we need a "call script" which stores the sequence of syscall execution. Writing it into a text file like the current logs is bad, since it will definitely become an IO problem. Having binary logs, like you mentioned in the TODO, seems to be the only option. Then we can add a parameter such as --parselog=mylog.bin to rerun the exact same things. Of course this log should also store all the config params.

Another benefit of having such a call script is that we can put the syscall result into it as well, as an int, and then use that data to analyse which syscalls fail constantly.
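As a rough illustration of what such a call script could look like, here is a small sketch in C. Everything in it is an assumption of mine for the sake of discussion: struct call_rec, its field layout, and the helpers call_log_append, call_log_next and call_log_count_failures do not exist in trinity. It writes the raw struct to the file, so the format is tied to the machine that produced it; a real format would also need a header carrying the config params.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical fixed-size record, one per executed syscall (12 bytes). */
struct call_rec {
	uint32_t seq;		/* global order across all children */
	uint16_t child;		/* which child made the call */
	uint16_t nr;		/* index of the syscall in the syscall table */
	int32_t result;		/* return value; negative means failure here */
};

/* Append one record to the binary log. Returns 1 on success, 0 on error. */
int call_log_append(FILE *f, const struct call_rec *r)
{
	assert(f != NULL && r != NULL);
	return fwrite(r, sizeof(*r), 1, f) == 1;
}

/* Read the next record in execution order. Returns 1 if one was read,
 * 0 at end of file or on error. */
int call_log_next(FILE *f, struct call_rec *r)
{
	return fread(r, sizeof(*r), 1, f) == 1;
}

/* Scan the whole log and count how often syscall nr failed. */
size_t call_log_count_failures(FILE *f, uint16_t nr)
{
	struct call_rec r;
	size_t failures = 0;

	rewind(f);	/* also makes it safe to switch from writing to reading */
	while (call_log_next(f, &r))
		if (r.nr == nr && r.result < 0)
			failures++;
	return failures;
}
```

A replay would read the records back with call_log_next() and make the child named in each record execute the given syscall before moving on to the next record, so that the A: 1,2,3,4,4,4 then B: 5,6 ordering from the example is preserved. The same file can be scanned with call_log_count_failures() to gather the failure stats.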
Recording per-syscall results like this is in line with the TODO item about gathering more stats on the current syscalls' success/failure rate. Having these stats might reveal syscalls that need better call handling (i.e., input parameter guessing and other related parts).

So that's it for now. My gut feeling is that I should first look into the use of the rand function, then work on FDs, and only then go on to the bin-logs.

What do you think of it?

Thanks,
Ildar