On Wed, 2007-07-11 at 18:03 -0400, Wendy Cheng wrote:
> Christopher Barry wrote:
> > On Wed, 2007-07-11 at 13:01 -0400, Wendy Cheng wrote:
> >
> >> Christopher Barry wrote:
> >>
> >>> On Tue, 2007-07-10 at 22:23 -0400, Wendy Cheng wrote:
> >>>
> >>>> Pavel Stano wrote:
> >>>>
> >>>>> and then run touch on node 1:
> >>>>> serpico# touch /d/0/test
> >>>>>
> >>>>> and ls on node 2:
> >>>>> dinorscio:~# time ls /d/0/
> >>>>> test
> >>>>>
> >>>> What did you expect from a cluster filesystem? When you touch a file
> >>>> on node 1, it is a "create" that requires at least 2 exclusive locks
> >>>> (the directory lock and the file lock itself, among many other things).
> >>>> On a local filesystem such as ext3, disk activity is delayed by the
> >>>> filesystem cache: "touch" writes the data into cache and "ls" reads
> >>>> it back from cache on the very same node - all memory operations. On a
> >>>> cluster filesystem, when you do an "ls" on node 2, node 2 needs to ask
> >>>> node 1 to release the locks (a few ping-pong messages between the two
> >>>> nodes and the lock managers via the network), and the contents of node
> >>>> 1's cache need to be synced to the shared storage. After node 2 gets
> >>>> the locks, it has to read the contents from disk.
> >>>>
> >>>> I hope the above explanation is clear.
> >>>>
> >>>>> and one last thing, I tried gfs2, but same result
> >>>>>
> >>>> -- Wendy
> >>>>
> >>> This seems a little odd to me. I'm running a RH 7.3 cluster,
> >>> pre-Red Hat Sistina GFS, lock_gulm, 1Gb FC shared disk, and have been
> >>> since ~2002.
> >>>
> >>> Here's the timing I get for the same basic test between two nodes:
> >>>
> >>> [root@sbc1 root]# cd /mnt/gfs/workspace/cbarry/
> >>> [root@sbc1 cbarry]# mkdir tst
> >>> [root@sbc1 cbarry]# cd tst
> >>> [root@sbc1 tst]# time touch testfile
> >>>
> >>> real    0m0.094s
> >>> user    0m0.000s
> >>> sys     0m0.000s
> >>> [root@sbc1 tst]# time ls -la testfile
> >>> -rw-r--r--  1 root  root  0 Jul 11 12:20 testfile
> >>>
> >>> real    0m0.122s
> >>> user    0m0.010s
> >>> sys     0m0.000s
> >>> [root@sbc1 tst]#
> >>>
> >>> Then immediately from the other node:
> >>>
> >>> [root@sbc2 root]# cd /mnt/gfs/workspace/cbarry/
> >>> [root@sbc2 cbarry]# time ls -la tst
> >>> total 12
> >>> drwxr-xr-x  2 root    root    3864 Jul 11 12:20 .
> >>> drwxr-xr-x  4 cbarry  cbarry  3864 Jul 11 12:20 ..
> >>> -rw-r--r--  1 root    root       0 Jul 11 12:20 testfile
> >>>
> >>> real    0m0.088s
> >>> user    0m0.010s
> >>> sys     0m0.000s
> >>> [root@sbc2 cbarry]#
> >>>
> >>> Now, you cannot tell me 10 seconds is 'normal' for a clustered fs. That
> >>> just does not fly. My guess is DLM is causing problems.
> >>>
> >> From the previous post we really can't tell, since the network and disk
> >> speeds are variables and unknown. However, look at your data:
> >>
> >> local "ls" is 0.122s
> >> remote "ls" is 0.088s
> >>
> >> I bet the disk flushing happened during the first "ls" (and different
> >> base kernels treat their dirty-data flushing and IO scheduling
> >> differently). I can't be convinced that DLM is the issue - unless the
> >> experiment collects enough samples to have statistical significance.
> >>
> >> -- Wendy
> >>
> > ok :)
>
> I admire your curiosity. I'm not saying 10 seconds is ok. I'm saying one
> single command doesn't imply anything (since there are so many variables
> there). You need to try a few more runs before concluding anything is
> wrong.
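Fair enough. For anyone else following along, here's the sort of loop I
plan to use to collect a sample worth arguing about (a rough sketch only:
the mount point is from my setup above, adjust to your own; run the
matching "time ls" on the other node between iterations):

#!/bin/bash
# Repeat the touch test N times on node 1 and record wall-clock time per
# run. MNT is assumed to be a GFS mount shared with the second node.
MNT=/mnt/gfs/workspace/cbarry/tst
for i in $(seq 1 20); do
    /usr/bin/time -f "run $i: touch took %e s" touch "$MNT/testfile.$i"
done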
> > Where is all the time being spent? Certainly, it should not take 10
> > seconds.
> >
> > Let me see if I get the series of events correct here, and you can
> > correct me where I'm wrong.
> >
> > Node1:
> > touch is run, and asks (indirectly) for 2 exclusive write locks.
> > dlm grants the locks.
> > File is created into cache.
> > locks are released (now?)
>
> Not necessarily (if there is no other request pending, GFS caches the
> locks, assuming the next request will most likely come from this node).
>
> > local ls is run, and asks for read lock
> > dlm grants lock.
> > reads cache.
> > returns results to screen
> > lock is released
>
> In your case, the lock was downgraded from write to read and the file was
> flushed, all within the local node, before the remote "ls" was issued.
> This is different from the previous post. The previous poster didn't do a
> local "ls", so he paid the price for the extra network traffic, plus the
> synchronization (wait) cost (waiting for the lock manager to communicate
> and the file to sync to disk). And remember the lock manager is
> implemented as a daemon: you send the daemon a message, and it may not be
> woken up in time to receive it. A lot of variables there.
>
> > Node2:
> > remote ls is run, and asks for read lock
> > ... what happens here?
>
> DLM sends messages (via the network) to node 1 to ask for the lock.
> After the lock is granted, GFS reads the file from the disk.
>
> > I think you're saying dlm looks at the lock request, and says I can't
> > give it to you, because the buffer has not been sync'd to disk yet.
>
> No, DLM says I need to ask whoever is holding the lock to release it.
> And GFS waits until the lock is granted. Whoever owns the lock needs to
> act accordingly: if it is an exclusive lock, the file needs to be
> flushed before the lock can be shared.
>
> > Does node2 wait, and retry asking for the lock after some time period,
> > and do this in a loop? Does the dlm on Node1 request the data be sync'd
> > so that the requesting Node2 can access the data faster?
>
> It is not in a loop. It is an event-wait-wakeup logic.
>
> > If Pavel used dd to create a file, rather than touch, with a size
> > larger than the buffer, and then used ls on Node2, would this show far
> > better performance? Is the real issue the corner case of a 0-byte file
> > being created?
>
> No, I don't think so. Off the top of my head I'm not sure how "dd" is
> implemented internally. However, remember that "create" competes with
> "ls" for the directory lock, while a file write itself doesn't compete
> with "ls", since it only requires the file lock. On the other hand,
> "ls -la" is another story - it requires the file size, so it will need
> the file (inode) locks. So there is another variation there.
>
> > Basically, I think you're saying that the kernel is keeping the 0-byte
> > touched file in cache, and GFS and/or dlm cannot help with this
> > situation. Is that correct?
>
> No, I'm not saying that. Again, I'm saying you need to run the commands
> a few times, instead of a one-shot test, before concluding anything,
> since there are simply too many variations and variables underneath
> these simple "touch" and "ls" commands in a cluster environment.
>
> -- Wendy

Thank you for the lesson, Wendy.
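I'll test the dd theory as well; something along these lines on node 1
(the 64MB size and the paths are just my guess at "larger than the
buffer", not a recommendation):

# Node 1: create a 64MB file instead of an empty one
dd if=/dev/zero of=/mnt/gfs/workspace/cbarry/tst/bigfile bs=1024k count=64

# Node 2, immediately after: plain ls (directory lock only) vs
# ls -la (which also needs the inode locks Wendy mentions)
time ls /mnt/gfs/workspace/cbarry/tst/
time ls -la /mnt/gfs/workspace/cbarry/tst/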
;^)

Another question you'll likely know the answer to: is there a preferred
IO scheduler to use with GFS?

--
Regards,
-C
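P.S. While waiting on an answer, here's how I'll check what my newer
2.6-based boxes are currently running (device name assumed; the old RH
7.3 nodes are 2.4-based and predate this sysfs knob):

# Show available schedulers; the active one is in brackets
cat /sys/block/sda/queue/scheduler

# Switch at runtime, e.g. to deadline
echo deadline > /sys/block/sda/queue/scheduler

It can also be set at boot with the elevator= kernel parameter
(e.g. elevator=deadline).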