Thank you for the clear explanation.

Best regards,
Daegyu

2019-09-17 21:54 GMT+09:00, Theodore Y. Ts'o <tytso@xxxxxxx>:
> On Tue, Sep 17, 2019 at 09:44:00AM +0900, Daegyu Han wrote:
>> It started with my curiosity.
>> I know this is not the right way to use a local file system, and it
>> may seem strange.  I just wanted to set up that situation and
>> experiment with it.
>>
>> I thought it would work if I flushed Node B's cached file system
>> metadata by dropping the caches, but it didn't.
>>
>> I googled for something other than mounting and unmounting, and I
>> found a Stack Overflow post suggesting that the file system be
>> synced via blockdev --flushbufs.
>>
>> So I ran blockdev --flushbufs after dropping the caches.  However,
>> I still do not know why I can read the data stored on the shared
>> storage via Node B.
>
> There are many problems, but the primary one is that Node B has
> caches.  If it has a cached version of the inode table block, why
> should it reread it after Node A has modified it?  The VFS also
> keeps negative dentry caches, which are very important for path
> lookup performance.  Consider, for example, a compiler that may need
> to look in many directories for a particular header file.  If the C
> program has:
>
> #include "amazing.h"
>
> the C compiler may need to look in a dozen or more directories
> trying to find the header file amazing.h, and each successive C
> compiler process will need to keep looking in all of those same
> directories.  So the kernel keeps a "negative cache": if
> /usr/include/amazing.h doesn't exist, it won't ask the file system
> again when the 2nd, 3rd, 4th, 5th, ... compiler process tries to
> open /usr/include/amazing.h.
>
> You can disable all of the caches, but that makes the file system
> terribly, terribly slow.  Network file systems have schemes whereby
> they can cache safely, because the network file system protocol
> provides a way to tell a client that its cached information must be
> reread.  Local disk file systems don't have anything like this.
>
> There are shared-disk file systems that are designed for
> multi-initiator setups; examples in Linux include gfs and ocfs2.
> You will find that they often trade performance for the scalability
> of supporting multiple initiators.
>
> You can use ext4 for fallback schemes, where the primary server has
> exclusive access to the disk and, when the primary dies, the
> fallback server can take over.  The ext4 multi-mount protection
> scheme is designed for those sorts of use cases, and it's used by
> Lustre servers.  But only one system is actively reading or writing
> to the disk at a time, and the fallback server has to replay the
> journal and ensure that the primary server won't "come back to
> life".  Those are sometimes called STONITH schemes ("shoot the other
> node in the head"), and they might involve network-controlled power
> strips, etc.
>
> Regards,
>
> 					- Ted
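
To make the negative-dentry point above concrete, here is a minimal C
sketch (not part of the original thread) of the kind of header search
Ted describes; the directory list and the file name amazing.h are
illustrative assumptions, not a real compiler's search path.  Each
open() that fails with ENOENT is exactly the sort of lookup the kernel
can later answer from a cached negative dentry, without going back to
the on-disk file system.

/*
 * Minimal sketch of an include-file search (illustrative only).
 * Every miss below can populate or hit a negative dentry, so repeated
 * searches never need to consult the file system again.
 */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    /* Hypothetical search path; a real compiler's list is longer. */
    const char *dirs[] = {
        "/usr/local/include",
        "/usr/include",
        "/usr/include/x86_64-linux-gnu",
    };
    char path[4096];

    for (size_t i = 0; i < sizeof(dirs) / sizeof(dirs[0]); i++) {
        snprintf(path, sizeof(path), "%s/amazing.h", dirs[i]);
        int fd = open(path, O_RDONLY);
        if (fd >= 0) {
            printf("found %s\n", path);
            close(fd);
            return 0;
        }
        /* Usually ENOENT: this miss creates or reuses a negative dentry. */
        printf("%s: %s\n", path, strerror(errno));
    }
    return 1;
}

Running it twice illustrates the point of the negative cache: the
second run's failed lookups can be satisfied from the dcache, which is
also why Node B in the scenario above keeps answering from its own
caches instead of rereading what Node A wrote to the shared disk.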