Christopher Barry wrote:
On Wed, 2007-07-11 at 13:01 -0400, Wendy Cheng wrote:
Christopher Barry wrote:
On Tue, 2007-07-10 at 22:23 -0400, Wendy Cheng wrote:
Pavel Stano wrote:
and then run touch on node 1:
serpico# touch /d/0/test
and ls on node 2:
dinorscio:~# time ls /d/0/
test
What did you expect from a cluster filesystem? When you touch a file
on node 1, it is a "create" that requires at least 2 exclusive locks
(the directory lock and the file lock itself, among many other things). On a
local filesystem such as ext3, disk activity is delayed by the
filesystem cache: "touch" writes the data into the cache and "ls" reads
it back from the cache on the very same node - all memory operations. On a
cluster filesystem, when you do an "ls" on node 2, node 2 needs to ask node 1
to release the locks (a few ping-pong messages between the two nodes and the
lock managers over the network), and the contents of node 1's cache need to
be synced to the shared storage. After node 2 gets the locks, it has to
read the contents from disk.
I hope the above explanation is clear.
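If you want a rough feel for how much of that remote "ls" time is the data
flush versus the lock traffic, one crude experiment (just a sketch, using the
/d/0 mount point from the post above) is to force the flush on node 1 before
node 2 asks for the lock:

node1# touch /d/0/test
node1# sync             # push node 1's dirty data out to the shared storage
node2# time ls /d/0/    # the lock still has to travel, but the flush is done

If the remote "ls" is still slow after the sync, most of the time is probably
going into the lock round-trip rather than the data flush.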
And one last thing: I tried gfs2, but got the same result.
-- Wendy
This seems a little bit odd to me. I'm running a RH 7.3 cluster with
pre-Red Hat Sistina GFS, lock_gulm, and 1GB FC shared disk, and have been
since ~2002.
Here's the timing I get for the same basic test between two nodes:
[root@sbc1 root]# cd /mnt/gfs/workspace/cbarry/
[root@sbc1 cbarry]# mkdir tst
[root@sbc1 cbarry]# cd tst
[root@sbc1 tst]# time touch testfile
real 0m0.094s
user 0m0.000s
sys 0m0.000s
[root@sbc1 tst]# time ls -la testfile
-rw-r--r-- 1 root root 0 Jul 11 12:20 testfile
real 0m0.122s
user 0m0.010s
sys 0m0.000s
[root@sbc1 tst]#
Then immediately from the other node:
[root@sbc2 root]# cd /mnt/gfs/workspace/cbarry/
[root@sbc2 cbarry]# time ls -la tst
total 12
drwxr-xr-x 2 root root 3864 Jul 11 12:20 .
drwxr-xr-x 4 cbarry cbarry 3864 Jul 11 12:20 ..
-rw-r--r-- 1 root root 0 Jul 11 12:20 testfile
real 0m0.088s
user 0m0.010s
sys 0m0.000s
[root@sbc2 cbarry]#
Now, you cannot tell me 10 seconds is 'normal' for a clustered fs. That
just does not fly. My guess is DLM is causing problems.
From the previous post we really can't tell, since the network and disk
speeds are unknown variables. However, look at your data:
local "ls" is 0.122s
remote "ls" is 0.088s
I bet the disk flushing happened during the first "ls" (and different base
kernels handle their dirty-data flushing and IO scheduling differently). I
can't be convinced that DLM is the issue - not unless the experiment has
collected enough samples to have statistical significance.
-- Wendy
Ok :) I admire your curiosity. I'm not saying 10 seconds is ok. I'm
saying a single command run doesn't prove anything (since there are so
many variables there). You need to try a few more runs before
concluding anything is wrong.
Where is all the time being spent? Certainly, it should not take 10
seconds.
Let me see if I get the series of events correct here, and you can
correct me where I'm wrong.
Node1:
touch is run, and asks (indirectly) for 2 exclusive write locks.
dlm grants the locks.
File is created into cache.
locks are released (now?)
Not necessarily (if there is no other request pending, GFS caches the
locks, assuming the next request will most likely come from this same node).
local ls is run, and asks for read lock
dlm grants lock.
reads cache.
returns results to screen
lock is released
In your case, the lock was downgraded from write to read and the file was
flushed, all within the local node, before the remote "ls" was issued. This
is different from the previous post. The previous poster didn't do a local
"ls", so he paid the price of the extra network traffic plus the
synchronization (wait) cost (waiting for the lock manager to communicate and
for the file to sync to disk). And remember the lock manager is implemented
as a daemon. You send the daemon a message and it may not be woken up in
time to receive the message. A lot of variables there.
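A quick way to see that difference (again just a sketch, reusing the /d/0
mount from the original post) is to time the remote "ls" with and without a
local "ls" in between:

# case 1: create on node 1, list from node 2 right away
node1# touch /d/0/a
node2# time ls /d/0/

# case 2: create and list locally first (lock downgraded and data flushed
# on node 1), then list from node 2
node1# touch /d/0/b
node1# ls -la /d/0/
node2# time ls /d/0/

If case 2 is consistently faster, that would back up the point above: the
cost is in the cross-node lock release plus flush, not in DLM itself.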
Node2:
remote ls is run, and asks for read lock
... what happens here?
DLM sends messages (over the network) to node 1 to ask for the lock. After
the lock is granted, GFS reads the file from disk.
I think you're saying DLM looks at the lock request and says, I can't give
it to you because the buffer has not been synced to disk yet.
No, DLM says, I need to ask whoever is holding the lock to release it,
and GFS waits until the lock is granted. Whoever owns the lock needs
to act accordingly: if it is an exclusive lock, the file needs
to be flushed before the lock can be shared.
Does node2 wait and retry asking for the lock after some time period,
doing this in a loop? Does the DLM on Node1 request that the data be synced
so that the requesting Node2 can access the data faster?
It is not a loop. It is event-wait-wakeup logic.
If Pavel used dd to create a file, rather than touch, with a size larger
than the buffer, and then used ls on Node2, would this show far better
performance? Is the real issue the corner-case of a 0 byte file being
created?
No, I don't think so. I'm not sure off the top of my head how "dd" is
implemented internally. However, remember that a "create" competes with "ls"
for the directory lock, but a file write itself doesn't compete with "ls"
since it only requires the file lock. On the other hand, "ls -la" is another
story - it needs the file size, so it will take the file (inode) locks as
well. So there is another variation there.
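To poke at that variation, a rough sketch (paths from the original post,
sizes arbitrary) would be:

# on node 1: create a file with real data in it
node1# dd if=/dev/zero of=/d/0/bigfile bs=1M count=100

# on node 2: a plain "ls" mostly needs the directory lock;
# "ls -la" also needs each file's inode for size, owner and times
node2# time ls /d/0/
node2# time ls -la /d/0/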
Basically, I think you're saying that the kernel is keeping the 0-byte
touched file in cache, and GFS and/or DLM cannot help with this
situation. Is that correct?
No, I'm not saying that. Again, I'm saying you need to run the commands a
few times, instead of a one-shot run, before concluding anything, since
there are simply too many variations and variables underneath these
simple "touch" and "ls" commands in a cluster environment.
-- Wendy
--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster