On Tue, 2016-09-06 at 11:47 -0400, Oleg Drokin wrote:
> On Sep 6, 2016, at 11:18 AM, Jeff Layton wrote:
> 
> > On Tue, 2016-09-06 at 10:58 -0400, Oleg Drokin wrote:
> > > 
> > > On Sep 6, 2016, at 10:30 AM, Jeff Layton wrote:
> > > > 
> > > > On Mon, 2016-09-05 at 00:55 -0400, Oleg Drokin wrote:
> > > > > 
> > > > > Hello!
> > > > > 
> > > > > I have a somewhat mysterious problem with my nfs test rig that I
> > > > > suspect is something stupid I am missing, but I cannot figure it
> > > > > out and would appreciate any help.
> > > > > 
> > > > > The NFS server is Fedora 23 with 4.6.7-200.fc23.x86_64 as the kernel.
> > > > > The clients are a bunch of 4.8-rc5 nodes, nfsroot.
> > > > > If I only start one of them, all is fine; if I start all 9 or 10,
> > > > > then suddenly all operations grind to a halt (nfs-wise). On the NFS
> > > > > server side there is very little load.
> > > > > 
> > > > > I hit this (or something similar) back in June, when testing 4.6-rcs
> > > > > (and the server was running 4.4.something, I believe), and back then
> > > > > after some mucking around I set:
> > > > > net.core.rmem_default=268435456
> > > > > net.core.wmem_default=268435456
> > > > > net.core.rmem_max=268435456
> > > > > net.core.wmem_max=268435456
> > > > > 
> > > > > and while I have no idea why, that helped, so I stopped looking into
> > > > > it completely.
> > > > > 
> > > > > Fast forward to now: I am back at the same problem and the workaround
> > > > > above does not help anymore.
> > > > > 
> > > > > I also have a bunch of "NFSD: client 192.168.10.191 testing state ID
> > > > > with incorrect client ID" in my logs (I also had these in June; I
> > > > > tried to disable NFS 4.2 and 4.1 and that did not help).
> > > > > 
> > > > > So anyway, I discovered nfsdcltrack and such, and I noticed that
> > > > > whenever the kernel calls it, it is always with the same hexid of
> > > > > 4c696e7578204e465376342e32206c6f63616c686f7374
> > > > > 
> > > > > Naturally, if I try to list the content of the sqlite file, I get:
> > > > > sqlite> select * from clients;
> > > > > Linux NFSv4.2 localhost|1473049735|1
> > > > > sqlite> select * from clients;
> > > > > Linux NFSv4.2 localhost|1473049736|1
> > > > > sqlite> select * from clients;
> > > > > Linux NFSv4.2 localhost|1473049737|1
> > > > > sqlite> select * from clients;
> > > > > Linux NFSv4.2 localhost|1473049751|1
> > > > > sqlite> select * from clients;
> > > > > Linux NFSv4.2 localhost|1473049752|1
> > > > > sqlite>
> > > > 
> > > > Well, not exactly. It sounds like the clients are all using the same
> > > > long-form clientid string. The server sees that and tosses out any
> > > > state that was previously established by the earlier client, because
> > > > it assumes that the client rebooted.
> > > > 
> > > > The easiest way to work around this is to use the nfs4_unique_id
> > > > nfs.ko module parameter on the clients to give them each a unique
> > > > string id. That should prevent the collisions.
> > > 
> > > Hm, but it did work ok in the past.
> > > What determines the unique id now by default?
> > > The clients do start with different ip addresses, for one, so that
> > > seems to be a much better proxy for a unique id (or local ip/server ip,
> > > as is the case in CentOS 7) than whatever the local hostname happens to
> > > be at any random point in time during boot (where it might not be set
> > > yet, apparently).
> > 
> > The v4.1+ clientid is (by default) determined entirely from the
> > hostname.
> > 
> > IP addresses are a poor choice given that they can easily change for
> > clients that have them dynamically assigned. That's the main reason
> > that v4.0 behaves differently here. The big problems there really come
> > into play with NFSv4 migration. See this RFC draft for the gory
> > details:
> > 
> > https://tools.ietf.org/html/draft-ietf-nfsv4-migration-issues-10
> 
> Duh, so "ip addresses are unreliable, let's use something even less
> reliable". The hostname is also dynamic in a bunch of cases, btw.
> Worst of all, there are very many valid cases where nfs might be mounted
> before the hostname is set (or do you regard that as a bug in the
> environment, and should I just file a ticket in the Fedora bugzilla?)
> 
> Looking over the draft, the two cases are:
> what if the client reboots, and how do we reclaim state ASAP; and
> what if there is server migration, but the same client.
> 
> The second case is trivial as long as the client id stays constant no
> matter what server you connect to, and it could be any number of
> constant identifiers, random or not.
> 
> On the other hand, the rebooted client is more interesting. Of course
> there's also lease expiration (that's what we do in Lustre too: if the
> client dies, it'll be expired eventually, but also if we talk to it and
> it does not reply, we kick it out as well, and this has a much shorter
> timeout, so it's not as disruptive).
> 
> Couldn't some more unique identifier be used by default?
> Say "mac address of the primary interface, whatever that happens to be";
> in that case, as long as your client remains on the same physical box
> (and the network card has not changed), you should be fine.
> I guess there are other ways.
> Ideally, the kernel would offer an API (there might be one already, but
> I cannot find it) that could be queried for a unique id like that (with
> inputs from mac addresses, various identifiable serial numbers and such).

Shrug... feel free to propose a better scheme for generating unique ids
if you can think of one. Unfortunately, there are always cases where
these mechanisms for getting a persistent+unique id break down. That's
the reason that nfs provides an interface to allow setting a uniquifier
from userland via a module parameter.

Cheers,
-- 
Jeff Layton <jlayton@xxxxxxxxxx>
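
As a side note on the hexid that nfsdcltrack keeps logging: it appears to
be just the long-form clientid string in hex, so any hex decoder shows why
every client collides on the same record, e.g.:

    $ echo 4c696e7578204e465376342e32206c6f63616c686f7374 | xxd -r -p
    Linux NFSv4.2 localhost

In other words, every nfsroot client seems to identify itself as
"Linux NFSv4.2 localhost", presumably because the hostname is still
"localhost" at the point where the root filesystem gets mounted.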
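
For completeness, a minimal sketch of the nfs4_unique_id workaround
discussed above. The file name and the id value are made-up placeholders;
any modprobe.d conf file works, and the string just needs to be unique and
stable per client:

    # /etc/modprobe.d/nfs-unique-id.conf  (use a different value on each client)
    options nfs nfs4_unique_id=testclient-01-0123456789abcdef

For nfsroot clients, where the NFS client code is built into the kernel
rather than loaded as nfs.ko, the same parameter can be passed on the
kernel command line instead, e.g. nfs.nfs4_unique_id=testclient-01-0123456789abcdef.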