Re: Sharing ext4 on target storage to multiple initiators using NVMeoF

On Tue, Sep 17, 2019 at 09:44:00AM +0900, Daegyu Han wrote:
> It started out of curiosity.
> I know this is not the right way to use a local file system, and it
> may seem strange to some. I just wanted to set up this situation and
> experiment with it.
> 
> I thought it would work if I flushed Node B's cached file system
> metadata by dropping the caches, but it didn't.
> 
> I googled for alternatives to unmounting and remounting, and found a
> Stack Overflow post suggesting that the file system be synced via
> blockdev --flushbufs.
> 
> So I ran blockdev --flushbufs after dropping the caches.
> However, I still do not know why I can read the data stored on the
> shared storage via Node B.

There are many problems, but the primary one is that Node B has its
own caches.  If it has a cached copy of an inode table block, why
should it reread it after Node A has modified it?  The VFS also keeps
negative dentry caches, which are very important for search-path
performance.  Consider, for example, a compiler that may need to look
in many directories for a particular header file.  If the C program has:

#include "amazing.h"

The C compiler may need to look in a dozen or more directories to find
the header file amazing.h, and each successive compiler process will
need to search all of those same directories again.  The kernel
therefore keeps a "negative cache": if /usr/include/amazing.h doesn't
exist, it won't ask the file system again when the 2nd, 3rd, 4th,
5th, ... compiler process tries to open /usr/include/amazing.h.
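
As a rough sketch of that probing (the directory list below is made
up, and a real compiler's is longer), every failed open() here is
exactly the lookup that the negative dentry cache answers on later
runs without going back to the file system:

#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	/* Hypothetical search path; a real compiler's list is longer. */
	const char *dirs[] = { "/usr/local/include", "/usr/include", "." };
	char path[4096];
	int i, fd;

	for (i = 0; i < 3; i++) {
		snprintf(path, sizeof(path), "%s/amazing.h", dirs[i]);
		fd = open(path, O_RDONLY);
		if (fd >= 0) {
			printf("found %s\n", path);
			close(fd);
			return 0;
		}
		/* Each ENOENT here becomes a negative dentry, so later
		 * processes asking for the same missing name are
		 * answered from the VFS cache, not the disk. */
		printf("%s: %s\n", path, strerror(errno));
	}
	return 1;
}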

You can disable all of the caches, but that makes the file system
terribly, terribly slow.  Network file systems have schemes that let
them cache safely: the network file system protocol has a way to tell
a client that its cached information must be reread.  Local disk file
systems don't have anything like this.
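
For what it's worth, the closest a local file system comes to "no
caching" is direct I/O, and even that is only a partial answer:
O_DIRECT bypasses the page cache for file data, but metadata (inode
tables, directories, dentries) stays cached, so it doesn't make a
second initiator coherent.  A rough sketch, assuming a 4096-byte
logical block size:

#define _GNU_SOURCE		/* O_DIRECT is a Linux extension */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	void *buf;
	ssize_t n;
	int fd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}

	fd = open(argv[1], O_RDONLY | O_DIRECT);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* O_DIRECT requires the buffer (and usually the I/O size and
	 * offset) to be aligned to the device's logical block size. */
	if (posix_memalign(&buf, 4096, 4096)) {
		fprintf(stderr, "posix_memalign failed\n");
		return 1;
	}

	n = read(fd, buf, 4096);	/* served from the disk, not the page cache */
	printf("read %zd bytes\n", n);

	free(buf);
	close(fd);
	return 0;
}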

There are shared-disk file systems that are designed for
multi-initiator setups; examples in Linux include gfs and ocfs2.  You
will find that they often trade some performance for the ability to
scale to multiple initiators.

You can use ext4 in failover schemes, where the primary server has
exclusive access to the disk and, when the primary dies, a backup
server takes over.  The ext4 multi-mount protection (MMP) feature is
designed for those sorts of use cases, and it's used by Lustre
servers.  But only one system is actively reading or writing the disk
at a time, and the backup server has to replay the journal and ensure
that the primary server won't "come back to life".  Those are
sometimes called STONITH schemes ("shoot the other node in the head"),
and might involve network-controlled power strips, etc.

Regards,

						- Ted


