Thank you for the clear explanation.

Best regards,
Daegyu

2019-09-17 21:54 GMT+09:00, Theodore Y. Ts'o <tytso@xxxxxxx>:
> On Tue, Sep 17, 2019 at 09:44:00AM +0900, Daegyu Han wrote:
>> It started with my curiosity.
>> I know this is not the right way to use a local file system, and it
>> may seem strange.  I just wanted to set up that situation and
>> experiment with it.
>>
>> I thought it would work if I flushed Node B's cached file system
>> metadata by dropping the caches, but it didn't.
>>
>> I googled for something other than mounting and unmounting, and I
>> found a Stack Overflow post suggesting that the file system be
>> synced via blockdev --flushbufs.
>>
>> So I ran blockdev --flushbufs after dropping the caches.  However,
>> I still do not know why I can read the data stored on the shared
>> storage via Node B.
>
> There are many problems, but the primary one is that Node B has
> caches.  If it has a cached version of the inode table block, why
> should it reread it after Node A has modified it?  The VFS also
> keeps negative dentry caches, which are very important for path
> lookup performance.  Consider, for example, a compiler that may need
> to look in many directories for a particular header file.  If the C
> program has:
>
> #include "amazing.h"
>
> the C compiler may need to look in a dozen or more directories
> trying to find the header file amazing.h, and each successive C
> compiler process will need to keep looking in all of those same
> directories.  So the kernel keeps a "negative cache": if
> /usr/include/amazing.h doesn't exist, it won't ask the file system
> again when the 2nd, 3rd, 4th, 5th, ... compiler process tries to
> open /usr/include/amazing.h.
>
> You can disable all of the caches, but that makes the file system
> terribly, terribly slow.  Network file systems have schemes whereby
> they can cache safely, because the network file system protocol
> provides a way to tell a client that its cached information must be
> reread.  Local disk file systems don't have anything like this.
>
> There are shared-disk file systems that are designed for
> multi-initiator setups; examples in Linux include gfs and ocfs2.
> You will find that they often trade performance for the scalability
> of supporting multiple initiators.
>
> You can use ext4 for fallback schemes, where the primary server has
> exclusive access to the disk and, when the primary dies, the
> fallback server can take over.  The ext4 multi-mount protection
> scheme is designed for those sorts of use cases, and it's used by
> Lustre servers.  But only one system is actively reading or writing
> to the disk at a time, and the fallback server has to replay the
> journal and ensure that the primary server won't "come back to
> life".  Those are sometimes called STONITH schemes ("shoot the other
> node in the head"), and they might involve network-controlled power
> strips, etc.
>
> Regards,
>
> 					- Ted
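
To make the negative-dentry point above concrete, here is a minimal C
sketch (not part of the original thread) of the kind of header search
Ted describes; the directory list and the file name amazing.h are
illustrative assumptions, not a real compiler's search path.  Each
open() that fails with ENOENT is exactly the sort of lookup the kernel
can later answer from a cached negative dentry, without going back to
the on-disk file system.

/*
 * Minimal sketch of an include-file search (illustrative only).
 * Every miss below can populate or hit a negative dentry, so repeated
 * searches never need to consult the file system again.
 */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    /* Hypothetical search path; a real compiler's list is longer. */
    const char *dirs[] = {
        "/usr/local/include",
        "/usr/include",
        "/usr/include/x86_64-linux-gnu",
    };
    char path[4096];

    for (size_t i = 0; i < sizeof(dirs) / sizeof(dirs[0]); i++) {
        snprintf(path, sizeof(path), "%s/amazing.h", dirs[i]);
        int fd = open(path, O_RDONLY);
        if (fd >= 0) {
            printf("found %s\n", path);
            close(fd);
            return 0;
        }
        /* Usually ENOENT: this miss creates or reuses a negative dentry. */
        printf("%s: %s\n", path, strerror(errno));
    }
    return 1;
}

Running it twice illustrates the point of the negative cache: the
second run's failed lookups can be satisfied from the dcache, which is
also why Node B in the scenario above keeps answering from its own
caches instead of rereading what Node A wrote to the shared disk.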