On Sun, Apr 11, 2021 at 03:45:01AM +0800, Wen Yang wrote:
> At this time, some logs are lost. It is suspected that the hard disk itself
> is faulty.

If you have a kernel crash dump, that means you can extract out the dmesg buffer, correct? Are there any I/O error messages in the kernel log?

What is the basis of the suspicion that the hard drive is faulty? Kernel dmesg output? Error reporting from smartctl?

> There are many hard disks on our server. Maybe we should not occupy 100% CPU
> for a long time just because one hard disk fails.

It depends on the nature of the hard drive failure. How is it failing?

One thing we do need to be careful about is that when we focus on preventing a failure caused by some particular (potentially extreme) scenario, we don't cause problems in more common scenarios (for example, a heavily loaded server, and/or a file system that is almost full, where multiple files are "fighting" over a small number of free blocks).

In general, my attitude is that the best way to protect against hard drive failures is to have processes which monitor the health of the system. If there is evidence of a failed drive, we immediately kill all jobs which are relying on that drive (which we call "draining" a particular drive); and if a sufficiently large percentage of the drives have failed, or the machine can no longer do its job, we automatically move all of its jobs to other servers (e.g., "drain" the server) and then send the machine to some kind of data center repair service, where the failed hard drives can be replaced.

I'm skeptical of attempts to make the file system somehow continue to "work" in the face of hard drive failures, since failures can be highly atypical, and what might work well in one failure scenario might be catastrophic in another.
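To make the "monitor and drain" idea concrete, here is a minimal sketch of the kind of check a health-monitoring daemon might run. It only assumes smartctl from smartmontools; the PASSED-line matching and the idea of printing "drain candidates" are illustrative, not any particular production system's policy.

```shell
#!/bin/sh
# Sketch: decide whether a drive looks healthy based on smartctl's
# overall-health self-assessment line. The helper just parses text on
# stdin, so it can be fed from "smartctl -H /dev/sdX".
health_ok() {
    # Succeeds if the SMART overall-health line reports PASSED.
    grep -q 'SMART overall-health self-assessment test result: PASSED'
}

# Illustrative usage: flag drives that should be drained. The device
# list here is an assumption; a real monitor would enumerate drives.
for dev in /dev/sda /dev/sdb; do
    if ! smartctl -H "$dev" 2>/dev/null | health_ok; then
        echo "drain candidate: $dev"
    fi
done
```

Note that SMART is only one signal; as discussed above, a drive can also fail by being slow or by silently returning bad data, which this kind of check will not catch.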
It's especially problematic if the HDD is not explicitly signalling an error condition, but rather is being slow (because it's doing a huge number of retries), or is returning data which is simply different from what was previously written. The best we can do in that case is to detect that something is wrong (this is where metadata checksums would be very helpful), and then either remount the file system read-only, panic the machine, and/or signal to userspace that the file system in question should be drained.

Cheers,

					- Ted