Hello everyone,

I hope this is the correct mailing list to post this problem on. I'm seeing some weird behaviour with the pop3 daemon on a GFS HA cluster with load balancing.

The general situation is as follows: I have 3 servers here, each installed with CentOS 5.1 and the latest RedHat cluster. Each server runs cyrus 2.3.12p2 from the Invoca distribution. The servers share two common partitions for data storage on a SAN: one 1 GB partition mounted on /var/lib/imap, and one 1.2 TB partition mounted on /var/spool/imap. On the /var/lib/imap partition I have set up the following directories so that they point to individual directories for each node: backup, proc and socket. The backup directory was split out separately because some cron.daily entries locked each other up during the night, rendering the cluster useless.

In front of the three backend servers is a load balancer, which balances pop3, imap, lmtp and timsieved on a round-robin basis across the nodes. The load balancer is used (or will be used ;) ) by two perdition servers which connect to the pop or imap port on the LB, which distributes them to a running node. The idea behind this is that we can shut down any node without a noticeable service interruption, and that we only have one backend system instead of several. We want to migrate away from a murder-based setup, so any comments in that direction won't be very useful for me at this stage ;)

The problematic behaviour I see at the moment: I have migrated ~100 test mailboxes from the old backend system, and I'm in the process of performing load tests on the new system to get an impression of how the performance will be, and whether we are on the right track. Of the mailboxes, around 80 are empty, 10 are moderately filled and 10 are filled to the maximum storage, which is about the distribution we will be talking about after putting the system live.
The load test is performed with jakarta-jmeter from apache.org, which chooses one of the mailboxes and performs either a pop3 or an imap login to the backend via the load balancer. The distribution is roughly 5 pop3 logins for every 1 imap login, at a rate of about 5 logins/sec.

After 30 to 60 seconds into the test, the pop3d on one randomly chosen backend server stops working. It still accepts connections, but doesn't send a banner anymore. The load balancer considers this "working" (as the port is still open), so one after another all my connections hit the malfunctioning server and the test basically stalls. A restart of the cyrus service stops the problem for another 30 to 60 seconds. If I just stop the one offending server, so that it won't be used by the LB anymore, the test usually finishes without a problem.

At first I thought that this was a problem related to entropy, but it persisted even after I turned off "allowapop" and unconfigured everything relating to TLS (as SSL/TLS will be handled completely by perdition, we don't need it).

My personal guess is that it is somehow related to the port tests by the load balancer, as a connection from the load balancer is normally the last thing I see in the log of the offending backend server. The port tests are easily distinguishable, as the LB just opens a TCP connection and instantly resets it before it reads any data from the pop3d, not even waiting for a banner. After this happens, there are no more log entries regarding pop3d, and no log entries from the master saying that it spawns new pop3 processes.

My second guess was that it is related to locking, but the IMAP server just continues to run fine and doesn't have a problem.

At the moment I'm running out of ideas where to look, and my knowledge about cyrus debugging is quite limited (never had such a problem before ;) ), so any ideas or pointers on how to debug the problem would be appreciated.
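To check whether the port test alone is enough to trigger the hang, the load balancer's probe could be replayed by hand against a single node. The following is only a minimal Python sketch under assumptions: the function names reset_probe and banner_ok are made up for illustration, the SO_LINGER trick is assumed to approximate what the LB does (its real implementation is unknown to me), and the struct layout is the Linux one. The second function shows the kind of check that would notice a pop3d which accepts connections but never sends a greeting.

```python
import socket
import struct


def reset_probe(host, port, timeout=3.0):
    """Mimic the assumed LB port test: connect, then reset without reading."""
    sock = socket.create_connection((host, port), timeout=timeout)
    # l_onoff=1, l_linger=0: close() aborts the connection with a RST
    # instead of the normal FIN handshake (Linux struct linger layout).
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))
    sock.close()


def banner_ok(host, port, timeout=3.0):
    """Return True only if a POP3 '+OK' greeting arrives within the timeout."""
    greeting = b""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            # Read until the end of the first line or a small size limit.
            while b"\n" not in greeting and len(greeting) < 512:
                chunk = sock.recv(512)
                if not chunk:
                    break
                greeting += chunk
    except OSError:
        # Covers refused connections as well as timeouts while waiting.
        return False
    return greeting.startswith(b"+OK")
```

Calling reset_probe("node1", 110) in a loop and then banner_ok("node1", 110) after each round (with "node1" standing in for one backend) should show whether the resets by themselves are enough to make the pop3d go silent, independent of the jmeter load.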
Oh yes, I tried to strace the pop3d: the pop3d which generates the last log entry normally receives a SIGPIPE, as the remote end is no longer connected to it. It looks a bit like master doesn't recognize that there is a problem with spawning new children, and keeps assigning new connections to a dysfunctional pop3d.

Any ideas, hints or questions will be greatly appreciated. If information is missing, I will provide what I can :)

Thanks in advance!

Regards,
Jens

----
Cyrus Home Page: http://cyrusimap.web.cmu.edu/
Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html