On 05 Apr 2013 14:40:00 +0100 J. Bruce Fields <bfields@xxxxxxxxxxxx> wrote:

> On Thu, Apr 04, 2013 at 07:59:35PM +0200, Bodo Stroesser wrote:
> > There is no reason for apologies. The thread meanwhile seems to be a bit
> > confusing :-)
> >
> > Current state is:
> >
> > - Neil Brown has created two series of patches. One for SLES11-SP1 and a
> >   second one for -SP2
> >
> > - AFAICS, the series for -SP2 will match with mainline also.
> >
> > - Today I found and fixed the (hopefully) last problem in the -SP1 series.
> >   My test using this patchset will run until Monday.
> >
> > - Provided the test on SP1 succeeds, probably on Tuesday I'll start to test
> >   the patches for SP2 (and mainline). If it runs fine, we'll have a tested
> >   patchset not later than Mon 15th.
>
> OK, great, as long as it hasn't just been forgotten!
>
> I'd also be curious to understand why we aren't getting a lot of
> complaints about this from elsewhere.... Is there something unique
> about your setup? Do the bugs that remain upstream take a long time to
> reproduce?
>
> --b.

It's no secret what we are doing, so let me try to explain:

We build appliances for storage purposes. Each appliance mainly consists of
a cluster of servers and a bunch of FibreChannel RAID systems. The servers
of the appliance run SLES11. One or more of the servers in the cluster can
act as an NFS server. Each NFS server is connected to the RAID systems and
has two 10 GBit/s Ethernet controllers for the link to the clients. The
appliance not only offers NFS access for clients, but also has some other
types of interfaces to be used by the clients.

For QA of the appliances we use a special test system that runs the entire
appliance with all its interfaces under heavy load.

For the test of the NFS interfaces of the appliance, we connect the
Ethernet links one by one to 10 GBit/s Ethernet controllers on a Linux
machine of the test system. For each Ethernet link, the SW on the test
system uses 32 TCP connections to the NFS server in parallel. So between
the NFS server of the appliance and the Linux machine of the test system we
have two 10 GBit/s links with 32 TCP/RPC/NFSv3 connections each. Each link
is running at up to 1 GByte/s throughput (per second and per link a total
of 32k NFS3_READ or NFS3_WRITE RPCs of 32k data each).

A normal Linux NFS client opens only a single connection to a specific NFS
server, even if there are multiple mounts. We do not use the Linux built-in
client, but create an RPC client by clnttcp_create() and do the NFS
handling directly (a minimal sketch of such a client setup is at the end of
this mail). Thus we can have multiple connections and we immediately see if
something goes wrong (e.g. if an RPC request is dropped), while the
built-in Linux client probably would do a silent retry. (But probably one
could see single connections hang for a few minutes sporadically. Maybe
someone hit by this would complain about the network ...)

As a side effect of this test setup, all 64 connections to the NFS server
use the same uid/gid, and all 32 connections on one link come from the same
IP address. This - as we know now - maximizes the stress on a single entry
of the caches.

With our test setup, at the beginning we had more than two dropped RPC
requests per hour and per NFS server. (Of course, this rate varied widely.)
With each single change in cache.c the rate went down. The latest drop,
caused by a missing detail in the latest patchset for -SP1, occurred after
more than 2 days of testing! Thus, to verify the patches I schedule a test
for at least 4 days.
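For illustration only - this is not our production test code, just a
minimal sketch of how a user-space RPC client to an NFSv3 server can be set
up with the classic Sun RPC API. The server address and port are
placeholders, and it only sends the NULL procedure instead of NFS3_READ/
NFS3_WRITE, so it stays self-contained. Our test SW opens 32 such clients
per link; repeating clnttcp_create() with RPC_ANYSOCK gives one independent
TCP connection per call. (On current glibc you would build this against
libtirpc; on SLES11 the API is still in libc.)

/* Minimal sketch: one TCP connection to an NFSv3 server via Sun RPC. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <rpc/rpc.h>
#include <netinet/in.h>
#include <arpa/inet.h>

#define NFS_PROGRAM 100003UL   /* well-known NFS RPC program number */
#define NFS_V3      3UL

int main(void)
{
	struct sockaddr_in srv;
	int sock = RPC_ANYSOCK;            /* let the library open the socket */
	struct timeval tout = { 25, 0 };   /* per-call timeout */
	CLIENT *clnt;

	memset(&srv, 0, sizeof(srv));
	srv.sin_family = AF_INET;
	srv.sin_port = htons(2049);                   /* standard NFS port */
	srv.sin_addr.s_addr = inet_addr("192.0.2.1"); /* placeholder server */

	/* Each clnttcp_create() with RPC_ANYSOCK yields its own TCP
	 * connection, so N calls give N parallel connections. */
	clnt = clnttcp_create(&srv, NFS_PROGRAM, NFS_V3, &sock, 0, 0);
	if (clnt == NULL) {
		clnt_pcreateerror("clnttcp_create");
		exit(1);
	}

	/* NFSPROC3_NULL (procedure 0): a no-op round trip.  A dropped
	 * reply shows up here as RPC_TIMEDOUT - no silent kernel retry. */
	if (clnt_call(clnt, NULLPROC,
		      (xdrproc_t)xdr_void, NULL,
		      (xdrproc_t)xdr_void, NULL, tout) != RPC_SUCCESS) {
		clnt_perror(clnt, "NFSPROC3_NULL");
		clnt_destroy(clnt);
		exit(1);
	}

	printf("NULL call succeeded\n");
	clnt_destroy(clnt);
	return 0;
}

The real test replaces the NULL call with NFS3_READ/NFS3_WRITE requests
built from the NFSv3 XDR definitions and tracks every outstanding RPC, so a
dropped request is detected immediately instead of being masked by a retry.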
HTH
Bodo