Vincent Fox <vbfox@xxxxxxxxxxx> wrote:

> Our working hypothesis is that CYRUS is what is choking up at a certain
> activity level due to bottlenecks with simultaneous access to some shared
> resource for each instance.

Did you run "lockstat -Pk sleep 30" (with "-x destructive" when it
complains about the system being unresponsive)? We got this result,
among others:

Adaptive mutex block: 2339 events in 30.052 seconds (78 events/sec)

Count indv cuml rcnt       nsec Lock                Caller
-------------------------------------------------------------------------------
  778  79%  79% 0.00  456354473 0xffffffffa4867730  zfs_zget
   61   6%  85% 0.00  466021696 0xffffffffa4867130  zfs_zget
    8   1%  87% 0.00  748812180 0xffffffffa4867780  zfs_zget
   26   1%  88% 0.00  200187703 0xffffffff9cf97598  dmu_object_alloc
    2   1%  89% 0.00 1453472066 0xffffffffa4867de0  zfs_zget
   12   1%  89% 0.00  204437906 0xffffffffa4863ad8  dmu_object_alloc
    4   1%  90% 0.00  575866919 0xffffffffa4867838  zfs_zinactive
    5   1%  90% 0.00  458982547 0xffffffffa48677b8  zfs_zget
    4   1%  91% 0.00  563367350 0xffffffffa4867868  zfs_zinactive
    3   0%  91% 0.00  629688255 0xffffffffa48677b0  zfs_zinactive

Nearly all of the blocking is caused by ZFS. The SAN disk system is NOT
the bottleneck, though: it shows average service times of 5-8 ms and no
wait queue. 456354473 nsec is 0.456 sec; that is *LONG*.

What is also interesting is tracing open() calls via DTrace. Just use
this:

#!/usr/sbin/dtrace -s

#pragma D option destructive
#pragma D option quiet

syscall::open:entry
{
        /* remember the start time and the user-space pathname pointer */
        self->ts = timestamp;
        self->filename = arg0;
}

syscall::open:return
/self->ts > 0/
{
        /* this-> makes the duration clause-local instead of a racy global */
        this->zeit = timestamp - self->ts;
        printf("%10d %s\n", this->zeit, copyinstr(self->filename));
        @["open duration"] = quantize(this->zeit);
        self->ts = 0;
        self->filename = 0;
}

It will show you every file opened and the time (in nanoseconds) needed
to accomplish that. After hitting CTRL-C, it will print a summary:

  open duration
           value  ------------- Distribution ------------- count
            1024 |                                         0
            2048 |@                                        80
            4096 |@@@@@@@@@@@@@@@@@@@@@                    1837
            8192 |@@@@@@                                   521
           16384 |@@@@@@@                                  602
           32768 |@@@                                      229
           65536 |@                                        92
          131072 |                                         2
          262144 |                                         0
          524288 |                                         1
         1048576 |                                         1
         2097152 |                                         1
         4194304 |                                         3
         8388608 |                                         12
        16777216 |@                                        51
        33554432 |                                         38
        67108864 |                                         25
       134217728 |                                         9
       268435456 |                                         2
       536870912 |                                         3
      1073741824 |                                         0

You can see the ARC (memory) activity at 4-65 microseconds and the disk
activity at 8-33 ms. And you can see some big hits at 0.13-0.5 sec (!).
That is far too long, and I have not figured out why it happens. As more
users connect, these really long opens become more and more frequent.

We have a Postfix spool running on the same machine, and we got some
relief from deactivating its directory hashing scheme; ZFS seems to get
very "angry" about deep directory structures. But these long opens still
occur.

Pascal
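
P.S.: To check whether it is really the znode lookups that stall, you
can also quantize zfs_zget() itself (the caller taken from the lockstat
output above) with the fbt provider. This is just an untested sketch,
and the fbt probe names are an assumption; verify them first with
"dtrace -l -n 'fbt::zfs_zget:'":

#!/usr/sbin/dtrace -s

#pragma D option quiet

/* time every zfs_zget() call, kernel-wide */
fbt::zfs_zget:entry
{
        self->ts = timestamp;
}

fbt::zfs_zget:return
/self->ts/
{
        /* aggregate the elapsed nanoseconds into a power-of-two histogram */
        @["zfs_zget duration (ns)"] = quantize(timestamp - self->ts);
        self->ts = 0;
}

If that histogram shows the same 0.1-0.5 sec outliers as the open()
trace, it matches the mutex blocking that lockstat reported.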