Hello,

On Sat, 25 Aug 2012, Dmitry Akindinov wrote:

> Hello,
>
> We are currently stuck with the following ipvs problem:
>
> 1. The configuration includes a (potentially large) set of servers
> providing various services besides HTTP (POP, IMAP, LDAP, SMTP, XMPP,
> etc.) The test setup includes just 2 servers, though.
> 2. Each server runs a stock version of CentOS 6.0

OK, I don't know what kernel and patches every distribution includes.
Can you at least tell us what uname -a shows?

> 3. The application software (CommuniGate Pro) controls the ipvs
> kernel module using the ipvsadm commands.
> 4. On each server, iptables are configured to:
>    a) disable connection tracking for the VIP address(es)
>    b) mark all packets coming to the VIP address(es) with the mark
>       value of 100.
> 5. On the currently active load balancer, ipvsadm is used to
> configure ipvs to load-balance packets with the mark 100:
>    -A -f 100 -s rr -p 1
>    -a -f 100 -r <server1> -g
>    -a -f 100 -r <server2> -g
>    ....
> where the active balancer itself is one of the <serverN>
> 6. All other servers (just 1 "other" server in our test config) are
> running ipvs, but with an empty rule set.

I think running the slaves without the same rules is a mistake. When a
slave receives a sync message, it has to assign the connection to some
virtual server and even assign a real server for it. But if this slave
is also a real server, things get complicated. I checked the code just
now and do not see where we prevent a backup from scheduling traffic
received from the current master. The master hands the traffic to the
backup because it considers it a real server, but this backup, having
rules, decides to schedule it to a different real server. This problem
cannot happen for NAT, only for DR/TUN, and I see that you are using
the DR forwarding method. So, currently, do IPVS users avoid adding
ipvsadm rules on the backup for DR/TUN for this reason?
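(For reference, a minimal sketch of what steps 4 and 5 describe,
assuming the raw-table NOTRACK target is used for 4a; the VIP
192.0.2.100 and the real server addresses below are placeholders, not
values taken from this report:

    # 4a) skip connection tracking for packets to the VIP
    iptables -t raw -A PREROUTING -d 192.0.2.100 -j NOTRACK
    # 4b) mark all packets addressed to the VIP with fwmark 100
    iptables -t mangle -A PREROUTING -d 192.0.2.100 -j MARK --set-mark 100
    # 5) balance fwmark 100: round-robin, 1s persistence, DR forwarding
    ipvsadm -A -f 100 -s rr -p 1
    ipvsadm -a -f 100 -r 10.0.0.1 -g
    ipvsadm -a -f 100 -r 10.0.0.2 -g
)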
> 7. The active load balancer runs the sync daemon started with
> ipvsadm --start-daemon master
> 8. All other servers run the sync daemon started with
> ipvsadm --start-daemon backup.
>
> As a result, all servers have duplicated ipvs connection tables. If
> the active balancer fails, some other server assumes its role by
> arp-broadcasting the VIP and loading the ipvs rule set listed above.

In your initial email you said:

"Now, we initiate a failover. During the failover, the ipvs table on
the old "active" balancer is cleared,"

Why do you clear the connection table? What if you decide after 10
seconds to return control back to the first master?

> When a connection is being established to the VIP address, and the
> active load balancer directs it to itself, everything works fine.

I assume you are talking about box 2 (the new master).

> When a connection is being established to the VIP address, and the
> active load balancer directs it to some other server, the connection
> is established fine, and if the protocol is POP, IMAP, SMTP, the
> server prompt is sent to the client via the VIP, and it is seen by
> the client just fine.

Do you mean that a new connection does the 3-way handshake via the new
master to other real servers and succeeds, or that a connection
established before the failover keeps working after the failover? Is
the packet directed to the old master?

> But when the client tries to send anything to the server, the packet
> (according to tcpdump) reaches the load balancer server, and from
> there it reaches the "other" server. Where the packet is dropped.
> The client resends that packet, it goes to the active balancer, then
> to the "other" server, and it is dropped again.

Why does this real server drop the packet? What is different in this
packet? Are you talking about connections created before the failover,
and that they cannot continue to work after the failover? Maybe the
problem happens for DR. Can you show tcpdump output on the old master
confirming that the 3-way handshake traffic is received, and also that
it is answered by the old master itself, not by some real server?

The problem can happen only if the master sends new traffic to a
backup (its real server). For example:

- the master schedules a SYN to a real server which is a backup with
  the same rules
- a SYNC conn is not sent before the IPVS conn enters ESTABLISHED
  state, so the backup does not know about such a connection; it
  looks like a new one
- the backup has rules, so it decides to use real server 3 and directs
  the SYN there. This can happen only for DR/TUN, because the daddr is
  the VIP; that is why people overcome the problem by checking that
  the packet comes from some master and not from the uplink gateway
  MAC. For NAT there is no such double-step scheduling, because the
  backup's rules do not match the internal real server IP in the
  daddr; they work only for the VIP
- more traffic comes, and the backup directs it to real server 3
- the first SYNC message for this connection comes from the master,
  but that SYNC message claims the backup is the real server for this
  connection. Looking at the current code, ip_vs_proc_conn ignores
  the fact that the master wants the backup as the real server for
  this connection, so the backup will continue to use real server 3.

For now, I don't see where this can fail, except if persistence comes
into the game, or if failover happens to another backup, which will
then use real server 3. The result is that the backup acts as a
balancer even if it is just a backup without the master function.

> Observations:
> *) if ipvs is switched off on that "other" server, everything works
> just fine (service ipvsadm stop)

So, something stops the SYN traffic in the backup?

> *) if ipvs is left running on that "other" server, but the syncing
> daemon is switched off, everything works just fine.

Without rules in this backup?

> We are 95% sure that the problem appears only if the "other server"
> ipvs connection table gets a copy of this connection from the active
> balancer. If the copy is not there (the sync daemon was stopped when
> the connection was established, and restarted immediately after),
> everything works just fine.

Interesting. The new master forwards to the old master, so it should
send a SYNC message naming the old master as the real server. How can
there be a problem then? Maybe your kernel does not properly support
the local real server function, which was fixed 2 years ago.

> *) the problem exists for protocols like POP, IMAP, SMTP - where the
> server immediately sends some data (prompt) to the client, as soon
> as the connection is established.

The SYNC packets always go after the traffic, so I'm not sure why the
SYN would work while other traffic behaves differently. Maybe your
kernel version reacts differently when the first SYNC message claims
server 3 is the real server rather than backup 1, and the
double-scheduling breaks after the 3-way handshake.

> When the HTTP protocol is used, the problem does not exist, but only
> if the entire request is sent as one packet. If the HTTP connection
> is a "keep-alive" one, subsequent requests in the same connection do
> not reach the application either.
> I.e. it looks like the "idling" ipvs allows only one incoming data
> packet in, and only if there has been no outgoing packet on that
> connection yet.

Maybe the SYNC message changes the destination in the backup, as I
already said above? Some tcpdump output will be helpful, in case you
don't know how to dig into the sources of your kernel.
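For example, something along these lines, run on the director and on
the problem real server, would show where each packet really comes
from (the interface, VIP and port here are placeholders):

    # -e prints the link-level header, so for DR you can see whether a
    # packet arrived from the router's MAC or from another director
    tcpdump -n -e -i eth0 host 192.0.2.100 and tcp port 110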
> *) Sometimes (we still cannot reproduce this reliably) the ksoftirqd
> threads on the "other" server jump to 100% CPU utilization, and when
> it happens, it happens in reaction to one connection being
> established.

This sounds like a problem fixed 2 years ago:

http://marc.info/?t=128428786100001&r=1&w=2

At that time even fwmark was not supported for sync purposes. Note
that many changes happened in this 2-year period, some adding fwmark
support to IPVS sync, some fixing the 100% loops. Without knowing the
kernel version, I'm not willing to flood you with changes to check for
in your kernel, in case it contains additional patches.

> Received suggestions:
> *) it was suggested that we use iptables to filter the packets to
> the VIP that come from other servers in the farm (using their MAC
> addresses) and direct them directly to the local application,
> bypassing ipvs processing. We cannot do that, as servers in the farm
> can be added at any moment, and updating the list of MACs on all
> servers is not trivial. It may be easier to filter the packets that
> come from the router(s), which are less numerous and do not change
> that often.
> But it does not look like a good solution. If the ipvs table on the
> "inactive" balancer drops packets, why would it stop dropping them
> when it becomes an "active" balancer? Just because there will be
> ipvs rules present?
>
> *) The suggestion to separate the load balancer(s) and real servers
> won't work for us at all.
>
> *) We tried not to empty the ipvs table on the "other" server(s).
> Instead, we left it balancing - but with only one "real server" -
> this server itself. Now, the "active" load balancer distributes
> packets to itself and other servers, and when the packets hit the
> "other" server(s), they get to the ipvs again, where they are
> balanced again, but to the local server only.

Very good, except that you need a recent kernel for this, 2010-Nov or
later; there are fixes even after that time.

> It looks like it does solve the problem. But now the ipvs connection
> table on the "other" server(s) is filled both by that server's ipvs
> itself and by the sync daemon. While the locally-generated
> connection table entries should be the same as the corresponding
> entries received via the sync daemon, it does not look good when the
> same table is modified from two sources.

Sync happens only in one direction at a time, from the current master
to the current backup (there can be more than one). The benefit is
that all servers used for sync have the same table, and you can switch
between them at any time. Of course, there is some performance price
for traffic that goes to the local stack of the backups, but they
should receive from the current master only traffic for their own
stack.

> Any comment, please? Should we use the last suggestion?

I think with a fresh kernel your setup should be supported. Once you
show the kernel version, we can decide on further steps. I'm not sure
whether we need to change the kernel not to schedule new connections
in the BACKUP && !MASTER configuration. That way a backup could have
the same rules as the master, which would work for DR/TUN. Without
such a change we cannot do a role change without breaking connections,
because the SYNC protocol declares real server 1 as the server while
some backup overrides this decision and uses real server 3, a decision
not known to the other potential masters.
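(To make the last suggestion concrete, a sketch of what each backup
would then carry; the fwmark matches the master's rules, and the
address below is a placeholder for the backup's own IP:

    # same fwmark virtual service as on the master, but the only
    # "real server" is this host itself, so packets re-scheduled by
    # the backup are delivered to the local stack
    ipvsadm -A -f 100 -s rr -p 1
    ipvsadm -a -f 100 -r 10.0.0.2 -g

As noted above, delivering to a local real server this way needs the
local real server support from 2010-Nov or later.)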
> --
> Best regards,
> Dmitry Akindinov

Regards

--
Julian Anastasov <ja@xxxxxx>