On Tue, Jun 12, 2007 at 04:01:04PM +0200, Ferenc Wagner wrote:
> Hi David,
>
> Sorry if all that follows is misguided nonsense. I'm eager to learn...
>
> David Teigland <teigland@xxxxxxxxxx> writes:
>
> > The new code has much better caching in the dlm which will benefit flocks,
> > look at these flock numbers I sent before: [...]
> >
> > This is testing raw flock performance. The dlm locks for normal file
> > operations should be cached and locally mastered also, so I'm not sure
> > what's causing the long times. Make sure that drop_count is zero again,
> > now it's in sysfs:
> >   echo 0 > /sys/fs/gfs/<foo>:<bar>/lock_module/drop_count
> >
> > Also, mount debugfs so we can check some stuff later:
> >   mount -t debugfs none /sys/kernel/debug
> >
> > Then run some tests:
> > - mount on nodeA
> > - run the test on nodeA
> > - count locks on nodeA
> >   (cat /sys/kernel/debug/dlm/<bar> | grep Master | wc -l)
> > - mount on nodeB (don't do anything on this node)
> > - run the test again on nodeA
> > - count locks on nodeA and nodeB (see above)
> > - mount on nodeC (don't do anything on nodes B or C)
> > - run the test again on nodeA
> > - count locks on nodes A, B and C (see above)
> >
> > We're basically trying to produce the best-case performance from one node,
> > nodeA. That means making sure that nodeA is mastering all locks and doing
> > maximum caching. That's why it's important that we not do anything at all
> > that accesses the fs on nodes B or C, or do any extra mounts/unmounts.
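
The counting step is the same on every node, so it can be wrapped in a tiny
helper if that makes the runs easier -- this is only a rough sketch, assuming
debugfs is mounted at /sys/kernel/debug as above, with the lockspace name
(the <bar> part of the fs name) passed as an argument:

    #!/bin/sh
    # count_masters.sh (hypothetical name): print how many dlm lock
    # resources this node currently masters for the given lockspace.
    # Equivalent to: cat /sys/kernel/debug/dlm/<bar> | grep Master | wc -l
    LOCKSPACE=${1:?usage: count_masters.sh <lockspace>}
    grep -c Master "/sys/kernel/debug/dlm/$LOCKSPACE"

Run it on nodes A, B and C after each test step to get the counts above.
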
> I made all the above tests and composed the reply a long time ago, but
> now, getting back to it after that long time, I decided to satisfy your
> curiosity, behold...
>
> > Plocks will be much slower and are probably not interesting to test, but
> > I'm curious if you added the "-l0" option to gfs_controld? That option
> > turns off the code that intentionally limits the rate of plocks. See the
> > old results again: [...]
>
> Now, that switch makes ALL the difference. With a single node
> switched on, I get results like this (with abbreviated strace -c
> output appended):
>
> without -l0:
>
> filecount=500
> iteration=0 elapsed time=10.444446 s
> iteration=1 elapsed time=9.693618 s
> iteration=2 elapsed time=10.520073 s
> iteration=3 elapsed time=10.521504 s
> iteration=4 elapsed time=10.520183 s
> total elapsed time=51.699824 s
> Process 5265 detached
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  83.27    0.048525           6      7551           read
>   6.73    0.003923           2      2502           fcntl64
>   4.47    0.002606           1      2528           close
>   3.09    0.001801           1      2551        23 open
>   0.74    0.000432           0      2507           write
>   0.71    0.000415           0      5033           mmap2
>   0.41    0.000237           0     12528         3 _llseek
>   0.31    0.000178           0      5001           munmap
>   0.18    0.000107           0      5015           fstat64
>   0.08    0.000049           0      2506           gettimeofday
>   0.00    0.000000           0        16        14 ioctl
>   0.00    0.000000           0       202       182 stat64
> ------ ----------- ----------- --------- --------- ----------------
> 100.00    0.058273                 47974       229 total
>
> with -l0:
>
> filecount=500
> iteration=0 elapsed time=5.966146 s
> iteration=1 elapsed time=0.582058 s
> iteration=2 elapsed time=0.528272 s
> iteration=3 elapsed time=0.936438 s
> iteration=4 elapsed time=0.528147 s
> total elapsed time=8.541061 s
> Process 10030 detached
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  57.17    0.016527           2      7551           read
>  21.49    0.006213           2      2528           close
>   8.16    0.002358           1      2502           fcntl64
>   6.59    0.001904           1      2551        23 open
>   2.21    0.000638           0      2507           write
>   1.46    0.000421           0      5033           mmap2
>   0.86    0.000249         249         1           execve
>   0.73    0.000212           0      5001           munmap
>   0.65    0.000187           0     12528         3 _llseek
>   0.57    0.000165           0      5015           fstat64
>   0.12    0.000034           0      2506           gettimeofday
>   0.00    0.000000           0        16        14 ioctl
>   0.00    0.000000           0       202       182 stat64
> ------ ----------- ----------- --------- --------- ----------------
> 100.00    0.028908                 47974       229 total
>
> Looks like the bottleneck isn't the explicit locking (be it plock or
> flock), but something else, like the built-in GFS locking.

I'm guessing that these were run with a single node in the cluster? The
second set of numbers (with -l0) wouldn't make much sense otherwise. I
think if you add nodes to the cluster, the -l0 numbers will go up quite
a bit. In the end I expect that flocks are still going to be the fastest
for you.
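
If you want a rough number for the flock side on the same working set
without changing the test program, the flock(1) utility can take one
flock(2) per file -- just a sketch, assuming the 500 test files already
exist under a directory on the gfs mount and that flock(1) from util-linux
is available:

    #!/bin/bash
    # Time one exclusive flock(2) lock/unlock cycle per test file.
    # flocks are handled by the dlm, not by gfs_controld, so they are
    # not subject to the plock rate limit that -l0 disables.
    DIR=${1:-/mnt/gfs/test}              # hypothetical test directory
    time for i in $(seq 1 500); do
        flock -x "$DIR/file$i" true      # lock, run true, unlock
    done

It's not the same access pattern as your test, but it gives a feel for
the flock path next to the plock numbers above.
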
> Similar dramatic speedup can be achieved (with a single node switched
> on, again) by the lockproto=lock_nolock mount option, even if used
> together with ignore_local_fs. If I understand it right, this
> combination leaves the cluster-wide [pf]locks alone, just eliminates
> the GFS internal locking, which guards the internal consistency of the
> file system (please correct me if I'm wrong).

With nolock there is no cluster (lock_nolock just returns 0 for
everything), so the cluster-wide [pf]locks have zero cost. So this test
doesn't tell you anything.

> What's strange is that gfs_controld -l0 seems like a perfectly safe
> invocation (what's the catch, i.e. why was the artificial limit
> introduced?),

The rate limit was introduced to prevent bad programs from flooding the
network with plock operations. It may not be a very real problem,
though, so we might eventually disable it (-l0) by default.

> still it achieves almost the same speedup as using
> lock_nolock, which would be a disaster with more than one node
> mounting the fs. (Also, this trick scales pretty well to 4000 files.)

No, -l0 is not going to give you the performance of nolock. I think you
must have been running with a single node in the cluster. In that case
there are no other nodes to send/recv messages to/from, so the plock
messages are very fast.

> Again, the above tests were done with a single node switched on, and
> I'm not sure whether the results carry over to the real cluster setup;
> will test it soon.

Ah, yep. When you add nodes the plocks will become much slower. Again,
I think you'll have better luck with flocks.

> I didn't touch drop_count either; everything was left at the defaults,
> except for the mount options and the -l option.
>
> Also, I can send the results of the scenario suggested by you, if it's
> still relevant. In short: the locks are always mastered on node A
> only, but the performance is poor nevertheless.

Poor even in the first step, when you're just mounting on nodeA?

Dave

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster