On Mon, 18 Oct 2021 16:41:14 +0800, Dongliang Mu said:
> I want to log all the executed instructions of a user process (e.g.,
> poc.c in syzkaller) in the kernel mode and then would like to leverage
> backward analysis to capture the root cause of kernel panic/crash.
> Therefore, I need the instruction-level tracing mechanisms or tools.

Tracing just the instructions won't get you where you want to be if you're
going through this approach.  You *also* need to track all the data - the
instruction path inside two different runs of syzkaller may be essentially
identical, but pass 2 different values as the 3rd parameter of a syscall.

You may also have to deal with insane amounts of data - the actual error
could have been minutes or even hours before, or in the interaction between
two different processes.

You probably want to take a *really* close look at how perf and friends
avoid infinite regress when code execution drops inside the perf code,
because you're going to hit the same issues.

Or.... You can work smarter rather than harder, and ask yourself what's the
minimum amount and type of additional information needed to make a
significant improvement in the debugging of system crashes.

For example, 95% of the time you can figure out what the bug is by merely
looking at the stack traceback.  For most of the remaining cases, simply
capturing the parameter values from the syscall and the basic info for page
faults and other interrupts is probably sufficient, and you can probably
leverage the audit subsystem for most of that.  It can already record
syscall parameters, while logging page faults and other interrupts can
probably be done with perf.

At that point, you don't actually *need* every instruction - tracing only
branch and call instructions is sufficient, because you already know that
each instruction between the target of a branch/call and the next
branch/call will be executed.  (A rough sketch of doing this with
perf_event_open() is appended at the end of this mail.)

Similarly, the lockdep code will catch most locking issues.  But it won't
flag issues with data that should be protected by a lock but is bereft of
any locking.  So ask yourself: what ways are there to analyze the code and
detect critical sections prone to race conditions?  Is there a
sparse-on-steroids approach that will do the heavy lifting for those?
(Note that this isn't an easy task for the general case, but identifying
two or three specific common patterns and finding a way to detect them may
be worthwhile - a toy example of one such pattern is also appended below.)

And many of the remaining crashes are timing related, and "let's trace
every single instruction" is almost guaranteed to make things slow enough
to change or bypass the timing issue.

So... What's left that would be the most helpful with the least amount of
data?  Go look at some threads on linux-kernel.  Look at the kernel bugs
that were the result of a Homer Simpson "D'oh!" moment.  What can we do to
make those bugs less likely to make it into the code in the first place?
For the more subtle bugs, what data finally made the debugging come
together?
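
To make the branch/call-only idea concrete, here's a minimal userspace
sketch (my own, not an existing tool) that asks perf to sample the taken
branches of one PID via perf_event_open().  It assumes hardware with
last-branch-record support (e.g. x86 LBR), a kernel built with perf events,
and enough privilege (root or a relaxed perf_event_paranoid); the target
PID and the 10-second window are placeholders, and the ring-buffer decoding
is only described in a comment, not implemented.

#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>
#include <stdlib.h>

static long sys_perf_event_open(struct perf_event_attr *attr, pid_t pid,
                                int cpu, int group_fd, unsigned long flags)
{
        return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(int argc, char **argv)
{
        struct perf_event_attr attr;
        pid_t pid;
        int fd;

        if (argc != 2) {
                fprintf(stderr, "usage: %s <pid>\n", argv[0]);
                return 1;
        }
        pid = (pid_t)atoi(argv[1]);             /* placeholder: PID to trace */

        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = PERF_TYPE_HARDWARE;
        attr.config = PERF_COUNT_HW_BRANCH_INSTRUCTIONS;
        attr.sample_period = 1000;              /* one sample per 1000 branches;
                                                   each sample carries a batch of
                                                   last-branch records */
        attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_BRANCH_STACK;
        attr.branch_sample_type = PERF_SAMPLE_BRANCH_ANY;
        attr.disabled = 1;                      /* started explicitly below */

        fd = sys_perf_event_open(&attr, pid, -1, -1, 0);
        if (fd < 0) {
                perror("perf_event_open");
                return 1;
        }

        /* A real tool would mmap() fd here and parse PERF_RECORD_SAMPLE
         * entries from the ring buffer: each one carries (from, to) pairs
         * for taken branches, which is enough to reconstruct the executed
         * basic blocks without logging every single instruction. */
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
        sleep(10);                              /* placeholder trace window */
        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
        close(fd);
        return 0;
}

In practice "perf record -b" and the Intel PT support already do the heavy
lifting here; the sketch is mostly to show how little raw data a
branch-only trace needs compared to full instruction logging.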
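
And for the lockdep point, here's a toy userspace (pthreads) illustration
of the pattern lockdep can't see: a field that is supposed to be protected
by a lock, with one code path updating it without taking the lock.
Data-race detectors aim at exactly this - ThreadSanitizer in userspace,
KCSAN in the kernel.  All names in it are made up for the example.

#include <pthread.h>
#include <stdio.h>

struct counter {
        pthread_mutex_t lock;
        long count;                     /* supposed to be protected by .lock */
};

static struct counter c = { .lock = PTHREAD_MUTEX_INITIALIZER };

static void *locked_inc(void *arg)
{
        (void)arg;
        for (int i = 0; i < 100000; i++) {
                pthread_mutex_lock(&c.lock);
                c.count++;              /* fine: lock held */
                pthread_mutex_unlock(&c.lock);
        }
        return NULL;
}

static void *unlocked_inc(void *arg)
{
        (void)arg;
        for (int i = 0; i < 100000; i++)
                c.count++;              /* buggy: same field, no lock taken.
                                           lockdep never sees this; a data-race
                                           detector does. */
        return NULL;
}

int main(void)
{
        pthread_t a, b;

        pthread_create(&a, NULL, locked_inc, NULL);
        pthread_create(&b, NULL, unlocked_inc, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        printf("count = %ld (expected 200000, usually less)\n", c.count);
        return 0;
}

Build it with "gcc -fsanitize=thread -pthread" and ThreadSanitizer should
report the race on c.count; the interesting question for the kernel is how
much of that detection can be pushed to static analysis instead.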