Re: Question about GFS2 and mmap

Scooter Morris <scooter@xxxxxxxxxxxx> · Mon, 17 Jan 2011 11:06:53 -0800

Steven,
    Thanks for getting back to me.  Yes, I've checked and noatime is 
definitely set.  While blast was running, I did a lockdump and the 
mmap()ed files had EX locks on them:

G:  s:EX n:2/5497229 f:q t:EX d:EX/0 l:0 a:0 r:3
 I: n:1055314/88699433 t:8 f:0x10 d:0x00000000 s:55237024/55237024

where inode 88699433 is one of the mapped files:

[root@crick blast]# ls -li /databases/mol/blast/db_current/nr.01.pin
88699433 -rw-r--r-- 1 rpcuser sacs 55237024 Jan 17 02:53 /databases/mol/blast/db_current/nr.01.pin

so that explains the behavior.  What I don't understand is why they had
EX locks.  I ran strace on blast, and what I see when the files are
mmap()ed is something like:

stat("/databases/mol/blast/db/nr.01.pin", {st_mode=S_IFREG|0644, st_size=55237024, ...}) = 0
open("/databases/mol/blast/db/nr.01.pin", O_RDONLY) = 8
mmap(NULL, 55237024, PROT_READ, MAP_SHARED, 8, 0) = 0x2b9ec1a14000

where /databases/mol/blast is the gfs2 filesystem.  So the files are
not opened read/write, and the mmap()ed segment is not writable either.
It's not clear to me why gfs2 would take an exclusive glock on this
file.  Does this make any sense to you?
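
For anyone who wants to reproduce this without BLAST, the access pattern
from the strace can be approximated in a few lines of Python.  This is
only a minimal sketch and not what BLAST does internally: the function
name map_readonly_and_touch and the one-byte-per-page loop are my own
invention.  It opens the file O_RDONLY, maps it with PROT_READ and
MAP_SHARED, and reads one byte from each page so that every page gets
faulted in.

```python
import mmap
import os


def map_readonly_and_touch(path):
    """Map `path` as in the strace (O_RDONLY, PROT_READ, MAP_SHARED) and
    read one byte from each page. Returns (file size, pages touched)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        if size == 0:
            return 0, 0  # a zero-length file cannot be mapped
        m = mmap.mmap(fd, size, flags=mmap.MAP_SHARED, prot=mmap.PROT_READ)
        try:
            touched = 0
            for offset in range(0, size, mmap.PAGESIZE):
                m[offset]  # read-only access; this is what faults the page in
                touched += 1
            return size, touched
        finally:
            m.close()
    finally:
        os.close(fd)
```

Calling map_readonly_and_touch("/databases/mol/blast/db/nr.01.pin") on
one node and taking a lockdump while it runs should show whether a
purely read-only mapping is enough to end up with the EX glock.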

-- scooter

On 01/16/2011 07:32 AM, Steven Whitehouse wrote:
> Hi,
>
> On Sat, 2011-01-15 at 16:46 -0800, Scooter Morris wrote:
>> We have a RedHat cluster (5.5 currently) with 3 nodes, and are sharing a
>> number of gfs2 filesystems across all nodes.  One of the applications we
>> run is a standard bioinformatics application called BLAST that searches
>> large indexed files to find similar dna (or protein) sequences.  BLAST
>> will typically mmap a fair amount of data into memory from the index
>> files.  Normally, this significantly speeds up subsequent executions of
>> BLAST.  This doesn't appear to work on gfs2, however, when I involve
>> other nodes.  For example, if I run blast three times on a single node,
>> the first execution is very slow, but subsequent executions are
>> significantly quicker.  If I then run it on another node in the cluster
>> (accessing the same data files over gfs2), the first execution is slow,
>> and subsequent executions are quicker.  This makes sense.  The problem
>> is that when I run it on multiple nodes, the speeds of subsequent runs
>> on the same node are no quicker.  It almost seems as if gfs2 is flushing
>> the in-memory copy (which is read only) immediately when the file is
>> accessed on another node.  Is this the case?  If so, is there a reason
>> for this, or is it a bug?  If it's a known bug, is there a workaround?
>>
>> Any help would be appreciated!  This is a critical application for us.
>>
>> Thanks in advance,
>>
>> -- scooter
>
> Are you sure that the noatime mount option has been used? I can't figure
> out why that shouldn't work if the BLAST processes are really only
> reading the files and not writing to them.
>
> GFS2 is able to tell the difference between read and write accesses to
> shared, writable mmap()ed files (unlike GFS which has to assume that all
> accesses are write accesses). Some early versions of GFS2 did that too,
> but anything recent (has ->page_mkwrite() in the source) and certainly
> 5.5 does, should be ok.
>
> You can use the glock dump to see what mode the glock associated with
> the mmap()ed inode is in. With RHEL6/Fedora/upstream you can use the
> tracepoints to watch the state dynamically during the operations. I'm
> afraid that isn't available on RHEL5. All you need to know is the inode
> number of the file in question and then look for a type 2 glock with the
> same number.
>
> Let us know if that helps narrow down the issue. BLAST is something that
> I'd like to see running well on GFS2,
>
> Steve.
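
A footnote on the "type 2 glock with the same number" lookup: the glock
dump prints the glock number in hex, while ls -i prints the inode number
in decimal, so one of them has to be converted before they can be
compared.  In the dump at the top of this message, n:2/5497229 is
0x5497229 = 88699433, which is the inode that ls -li reports.  The
sketch below does that lookup.  The helper names are mine (hypothetical),
and the regular expression is based only on the G: line shown in this
thread, so dumps from other versions may need it adjusted.

```python
import re

# Matches the start of a glock line such as
# "G:  s:EX n:2/5497229 f:q t:EX ..." (format taken from the dump in this thread).
GLOCK_LINE = re.compile(
    r"^G:\s+s:(?P<state>\S+)\s+n:(?P<type>\d+)/(?P<number>[0-9a-fA-F]+)"
)


def glock_name_for_inode(inode_number):
    """Return the 'type/number' name of the inode glock (type 2), number in hex."""
    return "2/%x" % inode_number


def glock_state(dump_text, inode_number):
    """Return the state (e.g. 'SH' or 'EX') of the type 2 glock for the inode,
    or None if the dump contains no such glock."""
    for line in dump_text.splitlines():
        match = GLOCK_LINE.match(line)
        if (
            match
            and match.group("type") == "2"
            and int(match.group("number"), 16) == inode_number
        ):
            return match.group("state")
    return None
```

With the lockdump text in a string, glock_state(dump, os.stat(path).st_ino)
(after an import os) gives the state for a particular file; for
nr.01.pin in the dump above that is EX.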

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
