Hello ssh list,

Long-time user of OpenSSH, but relatively new to ssh multiplexing. I'm
experiencing some issues and haven't figured out how to troubleshoot them
yet. I would appreciate some help if possible.

I'm using ssh as a communications mechanism to pass text-file-based
messages between 2 hosts. There are programs on each side that send and
receive these messages. When I found out about ssh multiplexing, I was
excited to use it because we were seeing several hundred ssh connections
going back and forth between the 2 hosts. When I tried ssh multiplexing,
the message latency dropped tenfold!

However, now that this mechanism has been in use for a week, I'm starting
to see some problems. First, this is the .ssh/config contents:

Host *
    ControlPath ~/.ssh/cm-%r@%h:%p
    ControlMaster auto
    ControlPersist 10m

Everything seems to work for a few days, but then ssh starts to hang, and
we start seeing several hundred ssh processes all trying to send their
messages but unable to. When I try to run ssh by hand, this is what I get:

$ ssh -vvv boss@ui1
OpenSSH_6.6.1, OpenSSL 1.0.1e-fips 11 Feb 2013
debug1: Reading configuration data /var/lib/worker/.ssh/config
debug1: /var/lib/worker/.ssh/config line 1: Applying options for *
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: /etc/ssh/ssh_config line 56: Applying options for *
debug1: auto-mux: Trying existing master

And it hangs at that point indefinitely until Ctrl-C.

At this point, the ssh mux process is still running:

$ ps -eo pid,user,args | awk '$2=="worker" && $3=="ssh:" && $5=="[mux]" {print}'
29305 worker ssh: /var/lib/worker/.ssh/cm-boss@ui1:22 [mux]

I attached strace to the ssh mux process, and this is what I see while
the problem is happening:

select(1024, [3 5 9], [], NULL, {0, 11336}) = 0 (Timeout)
clock_gettime(0x7 /* CLOCK_??? */, {17873813, 778030739}) = 0
clock_gettime(0x7 /* CLOCK_??? */, {17873813, 778085461}) = 0
clock_gettime(0x7 /* CLOCK_??? */, {17873813, 778109973}) = 0
select(1024, [3 4 5 9], [], NULL, NULL) = 1 (in [4])
clock_gettime(0x7 /* CLOCK_??? */, {17873813, 778186890}) = 0
accept(4, 0x7ffe26b34360, [128]) = -1 EMFILE (Too many open files)
clock_gettime(0x7 /* CLOCK_??? */, {17873813, 778263743}) = 0
clock_gettime(0x7 /* CLOCK_??? */, {17873813, 778298340}) = 0
clock_gettime(0x7 /* CLOCK_??? */, {17873813, 778343707}) = 0
select(1024, [3 5 9], [], NULL, {1, 0}) = 0 (Timeout)
clock_gettime(0x7 /* CLOCK_??? */, {17873814, 778457543}) = 0
clock_gettime(0x7 /* CLOCK_??? */, {17873814, 778518096}) = 0
clock_gettime(0x7 /* CLOCK_??? */, {17873814, 778546349}) = 0
select(1024, [3 4 5 9], [], NULL, NULL) = 1 (in [4])
clock_gettime(0x7 /* CLOCK_??? */, {17873814, 778627517}) = 0
accept(4, 0x7ffe26b34360, [128]) = -1 EMFILE (Too many open files)
clock_gettime(0x7 /* CLOCK_??? */, {17873814, 778693493}) = 0
clock_gettime(0x7 /* CLOCK_??? */, {17873814, 778725395}) = 0
clock_gettime(0x7 /* CLOCK_??? */, {17873814, 778749417}) = 0
select(1024, [3 5 9], [], NULL, {1, 0}) = 0 (Timeout)
clock_gettime(0x7 /* CLOCK_??? */, {17873815, 778904087}) = 0
clock_gettime(0x7 /* CLOCK_??? */, {17873815, 778963540}) = 0
clock_gettime(0x7 /* CLOCK_??? */, {17873815, 778988943}) = 0
select(1024, [3 4 5 9], [], NULL, NULL) = 1 (in [4])
clock_gettime(0x7 /* CLOCK_??? */, {17873815, 779072887}) = 0
accept(4, 0x7ffe26b34360, [128]) = -1 EMFILE (Too many open files)
clock_gettime(0x7 /* CLOCK_??? */, {17873815, 779158255}) = 0
clock_gettime(0x7 /* CLOCK_??? */, {17873815, 779191597}) = 0
clock_gettime(0x7 /* CLOCK_??? */, {17873815, 779216201}) = 0
select(1024, [3 5 9], [], NULL, {1, 0}) = 0 (Timeout)
clock_gettime(0x7 /* CLOCK_??? */, {17873816, 779334945}) = 0
clock_gettime(0x7 /* CLOCK_??? */, {17873816, 779393178}) = 0
clock_gettime(0x7 /* CLOCK_??? */, {17873816, 779418473}) = 0

Does this indicate an open file limit for this user? Or is this something
else?

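
One way to test the open-file-limit theory directly is to compare the number of descriptors the mux process holds against the limit that applies to that process. The following is only a minimal sketch, assuming Linux with /proc mounted and that it is run as the same user as the mux process (or as root); fd_usage is a hypothetical helper name, not part of OpenSSH. /proc/PID/limits reports the limit of the running process, which may differ from what ulimit shows in a fresh login shell.

```shell
#!/bin/sh
# Sketch (assumes Linux /proc): print how many file descriptors a process
# currently holds next to its soft "Max open files" limit.
# fd_usage is a hypothetical helper name used for illustration only.
fd_usage() {
    pid=$1
    # Each entry under /proc/PID/fd is one open descriptor of that process.
    open=$(ls "/proc/$pid/fd" 2>/dev/null | wc -l | tr -d ' ')
    # In /proc/PID/limits, the fourth field of the "Max open files" row
    # is the soft limit (the fifth is the hard limit).
    soft=$(awk '/^Max open files/ {print $4}' "/proc/$pid/limits" 2>/dev/null)
    echo "pid=$pid open_fds=$open soft_limit=$soft"
}

# Example call with the mux PID from the ps output above:
# fd_usage 29305
```

If open_fds sits at or near soft_limit while accept() is failing with EMFILE, listing the directory with ls -l /proc/PID/fd shows what each descriptor points to, which may help tell whether they are client sockets that were never closed.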
This is ulimit -a for that user:

-bash-4.2$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 2062375
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 4096
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

Any advice on how to troubleshoot this further?

Thanks in advance...

_______________________________________________
openssh-unix-dev mailing list
openssh-unix-dev@xxxxxxxxxxx
https://lists.mindrot.org/mailman/listinfo/openssh-unix-dev