> On Oct 31, 2024, at 10:48 AM, Rick Macklem <rick.macklem@xxxxxxxxx> wrote:
>
> On Thu, Oct 31, 2024 at 4:43 AM Jeff Layton <jlayton@xxxxxxxxxx> wrote:
>>
>> On Wed, 2024-10-30 at 15:48 -0700, Rick Macklem wrote:
>>> On Wed, Oct 30, 2024 at 10:08 AM Chuck Lever III <chuck.lever@xxxxxxxxxx> wrote:
>>>>
>>>>> On Oct 30, 2024, at 12:37 PM, Cedric Blancher <cedric.blancher@xxxxxxxxx> wrote:
>>>>>
>>>>> On Wed, 30 Oct 2024 at 17:15, Chuck Lever III <chuck.lever@xxxxxxxxxx> wrote:
>>>>>>
>>>>>>> On Oct 30, 2024, at 10:55 AM, Cedric Blancher <cedric.blancher@xxxxxxxxx> wrote:
>>>>>>>
>>>>>>> On Tue, 29 Oct 2024 at 17:03, Chuck Lever III <chuck.lever@xxxxxxxxxx> wrote:
>>>>>>>>
>>>>>>>>> On Oct 29, 2024, at 11:54 AM, Brian Cowan <brian.cowan@xxxxxxxxxxxxxxxx> wrote:
>>>>>>>>>
>>>>>>>>> Honestly, I don't know the use case for re-exporting another server's
>>>>>>>>> NFS export in the first place. Is this someone trying to share NFS
>>>>>>>>> through a firewall? I've seen people share remote NFS exports via
>>>>>>>>> Samba in an attempt to avoid paying their NAS vendor for SMB support.
>>>>>>>>> (I think it's "standard equipment" now, but 10+ years ago? Not
>>>>>>>>> always...) But re-exporting another server's NFS exports? Haven't seen
>>>>>>>>> anyone do that in a while.
>>>>>>>>
>>>>>>>> The "re-export" case is where there is a central repository
>>>>>>>> of data and branch offices that access that via a WAN. The
>>>>>>>> re-export servers cache some of that data locally so that
>>>>>>>> local clients have a fast persistent cache nearby.
>>>>>>>>
>>>>>>>> This is also effective in cases where a small cluster of
>>>>>>>> clients want fast access to a pile of data that is
>>>>>>>> significantly larger than their own caches. Say, HPC or
>>>>>>>> animation, where the small cluster is working on a small
>>>>>>>> portion of the full data set, which is stored on a central
>>>>>>>> server.
>>>>>>>>
>>>>>>> Another use case is "isolation": IT shares a filesystem with your
>>>>>>> department, and you need to re-export only a subset to another
>>>>>>> department or home office. Part of such a scenario might also be
>>>>>>> policy related, e.g. IT shares the full filesystem with you but
>>>>>>> will do NOTHING else, and any further compartmentalization must be
>>>>>>> done in your own department.
>>>>>>> This is the typical use case for gov NFS re-export.
>>>>>>
>>>>>> It's not clear to me from this description why re-export is
>>>>>> the right tool for this job. Please explain why ACLs are not
>>>>>> used in this case -- this is exactly what they are designed
>>>>>> to do.
>>>>>
>>>>> 1. IT departments want better/harder/immutable isolation than ACLs
>>>>
>>>> So you want MAC, and the storage administrator won't set
>>>> that up for you on the NFS server. NFS doesn't do MAC
>>>> very well, if at all.
>>>>
>>>>> 2. Linux NFSv4 only implements POSIX draft ACLs, not full Windows or
>>>>> NFSv4 ACLs. So there is no proper way to prevent ACL editing,
>>>>> rendering them useless in this case.
>>>>
>>>> Er. Linux NFSv4 stores the ACLs as POSIX draft, because
>>>> that's what Linux file systems can support. NFSD, via
>>>> NFSv4, makes these appear like NFSv4 ACLs.
>>>>
>>>> But I think I understand.
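
To make the ACL point concrete: on the exported filesystem the ACL is
stored in POSIX draft form, and NFSD presents a translated NFSv4 ACL to
v4 clients. A rough sketch of how that mapping can be observed; the
group name, export path, and mount point below are made up:

  # on the server: add a POSIX draft ACL entry for one extra group
  setfacl -m g:finance:rX /export/projects
  getfacl /export/projects

  # on an NFSv4 client (nfs4-acl-tools): the same entry shows up as
  # NFSv4 ACEs that NFSD generated from the POSIX draft ACL
  nfs4_getfacl /mnt/projects

Editing the ACL from the client with nfs4_setfacl is translated back
the same way, so ACEs that have no POSIX draft equivalent cannot be
stored -- which is the limitation being described here.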
>>>>
>>>>> There is a reason why POSIX draft ACLs were abandoned - they are not
>>>>> fine-grained enough for real world usage outside the Linux universe.
>>>>> As soon as interoperability is required these things just bite you
>>>>> HARD.
>>>>
>>>> You, of course, have the ability to run some other NFS
>>>> server implementation that meets your security requirements
>>>> more fully.
>>>>
>>>>> Also, just running more nfsd in parallel on the origin NFS server is
>>>>> not a better option - remember the debate about non-2049 ports for nfsd?
>>>>
>>>> I'm not sure where this is going. Do you mean the storage
>>>> administrator would provide NFS service on alternate
>>>> ports that each expose a separate set of exports?
>>>>
>>>> So the only option Linux has there is using containers or
>>>> libvirt. We've continued to privately discuss the ability
>>>> for NFSD to support a separate set of exports on alternate
>>>> ports, but it doesn't look feasible. The export management
>>>> infrastructure and user space tools would need to be
>>>> rewritten.
>>>>
>>>>>> And again, clients of the re-export server need to mount it
>>>>>> with local_lock. Apps can still use locking in that case,
>>>>>> but the locks are not visible to apps on other clients. Your
>>>>>> description does not explain why local_lock is not
>>>>>> sufficient or feasible.
>>>>>
>>>>> Because:
>>>>> - it breaks applications running on more than one machine?
>>>>
>>>> Yes, obviously. Your description needs to mention that this is
>>>> a requirement, since there are a lot of applications that
>>>> don't need locking across multiple clients.
>>>>
>>>>> - it breaks use cases like NFS--->SMB bridges, because without locking
>>>>> the typical Windows .NET application will refuse to write to a file
>>>>
>>>> That's a quagmire, and I don't think we can guarantee that
>>>> will work. Linux NFS doesn't support "deny" modes, for
>>>> example.
>>>>
>>>>> - it breaks even SIMPLE things like Microsoft Excel
>>>>
>>>> If you need SMB semantics, why not use Samba?
>>>>
>>>> The upshot appears to be that this usage is a stack of
>>>> mismatched storage protocols that work around a bunch of
>>>> local IT bureaucracy. I'm trying to be sympathetic, but
>>>> it's hard to say that /anyone/ would fully support this.
>>>>
>>>>> Of course the happy echo "hello Linux-NFSv4-only world" >/nfs/file
>>>>> will always work.
>>>>>
>>>>>>> Of course no one needs the gov customers, so feel free to break locking.
>>>>>>
>>>>>> Please have a look at the patch description again: lock
>>>>>> recovery does not work now, and cannot work without
>>>>>> changes to the protocol. Isn't that a problem for such
>>>>>> workloads?
>>>>>
>>>>> Nope, because of UPS (Uninterruptible power supply). Either everything
>>>>> is UP, or *everything* is DOWN. Boolean.
>>>>
>>>> Power outages are not the only reason lock recovery might
>>>> be necessary. Network partitions, re-export server
>>>> upgrades or reboots, etc. So I'm not hearing anything
>>>> to suggest this kind of workload is not impacted by
>>>> the current lock recovery problems.
>>>>
>>>>>> In other words, locking is already broken on NFSv4 re-export,
>>>>>> but the current situation can lead to silent data corruption.
>>>>>
>>>>> Would storing the locking information in persistent files help, i.e.
>>>>> files which persist across nfsd server restarts?
>>>>
>>>> Yes, but it would make things horribly slow.
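
An aside on the local_lock option that came up above: it is selected
per mount on the clients of the re-export server, so applications on
each client can still take locks; those locks are simply not visible
from other clients. A sketch, with an invented server name and paths:

  # on a client of the re-export server; names are placeholders
  mount -t nfs -o vers=4.2,local_lock=all \
      rexport.example.com:/projects /mnt/projects

local_lock=all localizes both POSIX (fcntl) and flock locks; "posix"
or "flock" can be used instead to localize only one of the two.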
>>>>
>>>> And of course there would be a lot of coding involved
>>>> to get this to work.
>>> I suspect this suggestion might be a fair amount of code too
>>> (and I am certainly not volunteering to write it), but I will mention it.
>>>
>>> Another possibility would be to have the re-exporting NFSv4 server
>>> just pass locking ops through to the backend NFSv4 server.
>>> - It is roughly the inverse of what I did when I constructed a flex files
>>> pNFS server. The MDS did the locking ops and any I/O ops were
>>> passed through to the DS(s). Of course, it was hoped the client
>>> would use layouts and bypass the MDS for I/O.
>>>
>>
>> How do you handle reclaim in this case? IOW, suppose the backend server
>> crashes but the re-exporter stays up. How do you coordinate the grace
>> periods between the two so that the client can reclaim its lock on the
>> backend?
> Well, I'm not saying it is trivial.
> I think you would need to pass through all state operations:
> ExchangeID, Open,...,Lock,LockU
> - The tricky bit would be sessions, since the re-exporter would need to
> maintain sessions.
> --> Maybe the re-exporter would need to save the ClientID (from the
> backend nfsd) in non-volatile storage.
>
> When the backend server crashes/reboots, the re-exporter would see
> this as a failure (usually NFS4ERR_BAD_SESSION) and would pass
> that to the client.
> The only recovery RPC that would not be passed through would be
> Create_session, although the re-exporter would do a Create_session
> for connection(s) it has against the backend server.
> I think something like that would work for the backend crash/recovery.

The backend server would be in grace, and the re-exporter would be able
to recover its lock state on the backend server using normal state
recovery. I think the re-exporter would not need to expose the backend
server's crash to its own clients.

> A crash of the re-exporter could be more of a problem, I think.
> It would need to have the ClientID (stored in non-volatile storage)
> so that it could do a Create_session with it against the backend server.
> - It would also depend on the backend server being courteous, so that
> a re-exporter crash/reboot that takes long enough for the lease to expire
> doesn't result in a loss of state on the backend server.

The backend server would not be in grace after the re-export server
crashes. There's no way for the re-export server's NFS client to
recover its lock state from the backend server. The re-export server
recovers by re-learning lock state from its own clients. The question
is how the re-export server could re-initialize this state in its
local client of the backend server.

--
Chuck Lever
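
For readers who have not set one of these up, the basic shape of a
re-export is an ordinary NFS client mount on the middle box that its
knfsd then exports again. Hostnames, paths, and the fsid value below
are only placeholders:

  # on the re-export server
  mount -t nfs -o vers=4.2 backend.example.com:/export /srv/reexport

  # /etc/exports on the re-export server; a re-exported NFS mount needs
  # an explicit fsid= because it has no persistent fsid of its own
  /srv/reexport  10.0.0.0/24(rw,no_subtree_check,fsid=1000)

The recovery discussion above is about what happens to lock state that
flows through that middle hop when either server restarts.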