On Tue, Dec 21, 2021 at 05:08:04PM +0800, Hillf Danton wrote:
> 
> I am trying to find the cause of same jbd2 journal thread blocked for
> more than 120 seconds on a customer's system of linux-4.18 with RT
> turned on and 12 CPUs in total bootup.

So here's the tricky bit with trying to use ext4 (or any file system, really; the details will be different but the fundamental issues will remain the same). When a thread either calls fsync(2) or tries to initiate a file system mutation by creating a jbd2 handle and there isn't enough space left in the journal, the process will wake up the jbd2 thread and block until a new transaction has been started.

The first thing the jbd2 thread does is wait for all currently open handles to close, since it can only commit the current transaction once every handle attached to that transaction has been closed. If some non-real-time process happens to have an open handle, but it can't make forward progress for some reason, then this will prevent the commit from completing, and this in turn will prevent any other process which needs to make changes to the file system from making forward progress, since those processes will be blocked by the jbd2 commit thread, which in turn is blocked waiting for the low-priority process to make forward progress --- and if that low-priority process can't run because higher-priority processes are hogging the CPU, that's the classic definition of "priority inversion".

> Without both access to it and
> clue of what RT apps running in the system, what I proposed is to
> launch one more FIFO task of priority MAX_RT_PRIO-1 in the system like
> 
> 	for (;;) {
> 		unsigned long i;
> 
> 		for (;;) /* spin for 150 seconds */
> 			i++;
> 		sleep a second;
> 	}
> 
> in bid to observe the changes in behavior of underlying hardware using
> the diff below.

I'm not sure what you hope to learn by doing something like that. It will certainly perturb the system, and every 150 seconds the task is going to let other tasks/threads run --- but only whatever happens to be the next highest priority thread.

What you want to do is figure out which thread is still holding a handle open, and why it can't run. Is it because there are enough higher-priority threads running that it can't get a time slice in which to complete its file system operation and release its handle? Is it blocked behind a memory allocation (perhaps because it is in a memory-constrained cgroup)? Is it blocked waiting on some mutex, perhaps because it's doing something crazy like sendfile()? Or some kind of io_uring system call? Etc., etc., etc.

What would probably make sense is to use "ps -eLcl" before the system hangs, so you can see which processes and threads are running with which real-time or non-real-time priorities. Or, if the system has already hung, use the magic SysRq key to find out what threads are running on each CPU, and grab a stack trace from all of the running processes, so you can figure out where some task might be blocked, and in particular which task might be blocked inside a codepath where it would be holding a handle.

If that level of access means you have to get a government security clearance, or permission from a finance company's vice president, get that clearance ahead of time, even if it takes months and involves background investigations and polygraph tests. Because you *will* need that kind of access to debug these sorts of real-time locking issues.
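As an aside, if the system is wedged and all you have is a root shell (with no serial console wired up to capture SysRq output from the keyboard), you can trigger the same dumps programmatically by poking /proc/sysrq-trigger. Here's a rough sketch (run as root; the output goes to the kernel log, so read it back with dmesg or over a serial console or netconsole):

/*
 * Sketch: dump blocked tasks, per-CPU backtraces, and the full task
 * list via the magic SysRq interface.  Requires root; output goes to
 * the kernel log.
 */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static void sysrq(char c)
{
	int fd = open("/proc/sysrq-trigger", O_WRONLY);

	if (fd < 0) {
		perror("/proc/sysrq-trigger");
		return;
	}
	if (write(fd, &c, 1) != 1)
		perror("write");
	close(fd);
}

int main(void)
{
	sysrq('w');	/* show blocked (uninterruptible) tasks */
	sysrq('l');	/* show backtraces of all active CPUs */
	sysrq('t');	/* show state and stack of every task */
	return 0;
}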
There is an extremely high (99.9%) probability that the bug is in the system configuration or the application logic, so you will need full source code access to the workload to understand what might have gone wrong. It's almost never a kernel bug, but rather a fundamental application design or system configuration problem.

> Is it a well-designed system in general if it would take more than
> three seconds for the IO to complete with hardware glitch ruled out?

Well, it depends on your definition of "well-designed" and "the I/O", doesn't it? If you are using a cost-optimized cheap-sh*t flash device from Shenzhen, it can take minutes for I/O to complete. Just try copying a DVD's worth of data using buffered writes to said slow USB device, then run "sync" or "umount /mnt", and watch the process hang for a long, long time.

Or if you are in a cloud environment, and you are using a virtual block device which is provisioned for a small number of IOPS, whose fault is it? The cloud system, for throttling I/O to the IOPS that were provisioned for the device? The person who created the VM, for not provisioning enough IOPS? Or the application programmer? Or the kernel programmer? (And if you've ever worked at a Linux distro or a cloud provider, you can be sure that at least one platinum customer will try to blame the kernel programmer. :-)

Or if you are using a storage area network, and you have a real-time process which is logging to a file, and battle damage takes out part of the storage area network, and now the real-time process (which might be responsible for ship navigation or missile defense) hangs because a file write is hanging, is that a "well-designed system"? This is why you never, never, *NEVER* write to a file from a mission- or life-critical real-time thread. Instead, you log to a ring buffer which is shared with a non-real-time process, and that non-real-time process saves the log information to a file (see the sketch at the end of this message). And if the non-real-time process can't keep up with writing the log, which is worse? Missing log information? Or a laggy or deadlocked missile defense system?

The bottom line is that, especially if you are trying to use real-time threads, the entire system configuration, including the choice of hardware and the overall system architecture, needs to be part of a holistic design. You can't just be "the OS person"; you need to be part of the overall system architecture team.

Cheers,

					- Ted
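P.S.  Here's a rough, minimal sketch of the log-to-a-ring-buffer pattern described above, in case it's useful. It's single-producer/single-consumer, the buffer sizes and the file name are made up for illustration, and all of the actual real-time setup (sched_setscheduler() with SCHED_FIFO, mlockall(), CPU affinity, and so on) is omitted. The only point it is making is that the real-time side never blocks and never touches the file system; only the non-real-time writer thread does.

/*
 * Minimal sketch of "RT thread logs to a ring buffer, non-RT thread
 * writes the file".  Single producer / single consumer, fixed-size
 * records, and the producer *drops* records rather than ever blocking.
 * Build with:  cc -O2 -pthread rtlog.c
 */
#include <stdatomic.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <pthread.h>

#define RING_SLOTS	1024		/* must be a power of two */
#define MSG_LEN		128

static char ring[RING_SLOTS][MSG_LEN];
static atomic_uint head;		/* next slot the producer fills */
static atomic_uint tail;		/* next slot the consumer drains */
static atomic_uint dropped;		/* records lost to overflow */

/* Called from the real-time thread: never blocks, never touches the disk. */
static void rt_log(const char *msg)
{
	unsigned int h = atomic_load_explicit(&head, memory_order_relaxed);
	unsigned int t = atomic_load_explicit(&tail, memory_order_acquire);

	if (h - t >= RING_SLOTS) {	/* ring full: drop, don't block */
		atomic_fetch_add_explicit(&dropped, 1, memory_order_relaxed);
		return;
	}
	strncpy(ring[h % RING_SLOTS], msg, MSG_LEN - 1);
	ring[h % RING_SLOTS][MSG_LEN - 1] = '\0';
	atomic_store_explicit(&head, h + 1, memory_order_release);
}

/* The non-real-time writer: the only thread that ever calls into the fs. */
static void *log_writer(void *arg)
{
	FILE *fp = arg;

	for (;;) {
		unsigned int t = atomic_load_explicit(&tail, memory_order_relaxed);
		unsigned int h = atomic_load_explicit(&head, memory_order_acquire);

		while (t != h) {
			fprintf(fp, "%s\n", ring[t % RING_SLOTS]);
			t++;
		}
		atomic_store_explicit(&tail, t, memory_order_release);
		fflush(fp);		/* may block; that's fine here */
		usleep(10000);		/* poll; a condvar would also work */
	}
	return NULL;
}

int main(void)
{
	FILE *fp = fopen("rt.log", "w");
	pthread_t writer;

	if (!fp)
		return 1;
	pthread_create(&writer, NULL, log_writer, fp);

	/* Stand-in for the real-time loop. */
	for (int i = 0; i < 100; i++) {
		char msg[MSG_LEN];

		snprintf(msg, sizeof(msg), "iteration %d", i);
		rt_log(msg);
		usleep(1000);
	}
	sleep(1);	/* let the writer drain before exiting */
	printf("dropped %u records\n", atomic_load(&dropped));
	return 0;
}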