Re: Viability of NVMeOF/TCP for VMWare

For NFS (e.g., as implemented by NFS-Ganesha), the situation is also
quite stupid.

Without high availability (HA), it works (that is, until you update
the NFS-Ganesha version), but corporate architects won't let you
deploy any system without HA, because, in their view, non-HA systems
are not production-ready by definition. (And BTW, the current NVMe-oF
gateway also has no multipath support and thus no viable HA.)

If you attempt to set up HA for NFS, you will hit at least the
following showstoppers:

For NFS v4.1:

* VMware refuses to work (until manual admin intervention) if it sees
any change in the "owner" and "scope" fields of the EXCHANGE_ID
message between the previous and the current NFS connection.
* NFS-Ganesha sets both fields from the hostname by default, and the
patch that makes these fields configurable is quite recent (it landed
in version 4.3). This is important because otherwise every NFS server
fail-over would trip up VMware, thus defeating the point of a
high-availability setup.
* There is a regression in NFS-Ganesha that manifests as a deadlock
(easily triggerable even without Ceph by running xfstests), which is
critical because systemd cannot restart deadlocked services.
Unfortunately, the last NFS-Ganesha version before the regression
(4.0.8) does not contain the patch that allows setting the "owner"
and "scope" fields.
* Cephadm-based deployments do not set these configuration options anyway.
* If you would like to use the "rados_cluster" NFSv4 recovery backend
(used for grace periods), you need to be extra careful with the
various "server names", because they are also used to decide whether
to end the grace period. If the recovery backend has seen two server
names (corresponding to two NFS-Ganesha instances, for scale-out),
then both must be up for the grace period to end. If there is only
one server name, you are allowed to run only one instance. If you
want high availability together with scale-out, you need to be able
to schedule two NFS-Ganesha instances (with names like "a" and "b",
not corresponding to the names of the hosts where they run) on two
out of three available servers. Orchestrators do not do this; you
have to implement it on your own (see the sketch right after this
list).
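
To make that last point concrete, here is a minimal Python sketch of
the kind of placement logic you end up writing yourself. The host
names are made up and the actual service start/stop hooks are left
out; the only point is that the logical instance names "a" and "b"
stay fixed while the host they run on changes during fail-over.

# Minimal sketch of the placement logic described in the last item above:
# keep exactly two logical NFS-Ganesha instances ("a" and "b") running on
# two out of three hosts, and on a host failure move the *logical name* to
# a surviving host instead of inventing a new one.

from typing import Dict, List, Optional

INSTANCES = ["a", "b"]   # stable logical names, not host names


def place_instances(alive_hosts: List[str],
                    current: Dict[str, Optional[str]]) -> Dict[str, str]:
    """Return a mapping instance name -> host, keeping an instance where it
    is if its host is still alive, and otherwise moving it to a free host."""
    used = {current.get(name) for name in INSTANCES}
    free_hosts = [h for h in alive_hosts if h not in used]
    placement: Dict[str, str] = {}

    for name in INSTANCES:
        host = current.get(name)
        if host in alive_hosts:
            placement[name] = host               # keep it where it is
        elif free_hosts:
            placement[name] = free_hosts.pop(0)  # fail over, same logical name
        # else: fewer than two hosts are alive, so this instance stays down,
        # and the rados_cluster grace period will not end until it is back.
    return placement


# Example: host2 died; instance "b" keeps its name but moves to host3.
if __name__ == "__main__":
    print(place_instances(["host1", "host3"], {"a": "host1", "b": "host2"}))
    # -> {'a': 'host1', 'b': 'host3'}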

For NFS v3:

* NFS-Ganesha opens files and acquires MDS locks just in case, to make
sure that another client cannot modify them while the original client
might have cached something.
* If NFS-Ganesha crashes or a server reboots, then the other
NFS-Ganesha instance, brought up to replace the original one, will
stumble upon these locks, because the MDS sees it as a different
client. The result: it has to wait until the locks time out, which
takes too long (minutes!), long enough for the guest OS in VMware to
time out its storage.
* To avoid the problem mentioned above and get a seamless fail-over,
the replacement instance of NFS-Ganesha must present itself as the
same client (i.e., with the same fake hostname) to the MDS, but no
known orchestrator facilitates this (see the sketch below).
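
For illustration only, a minimal Python sketch of the "stable fake
hostname" workaround that both lists point at, assuming a
containerized deployment. The image name, config path, and port
mapping are placeholders, and no orchestrator ships this logic today.
The idea is that the container's hostname is tied to the logical
instance name rather than the physical host, so a fail-over does not
change the identity NFS-Ganesha presents to VMware (owner/scope are
derived from the hostname by default) or to the MDS.

# Minimal sketch: start an NFS-Ganesha container whose hostname is the
# stable logical instance name (e.g. "nfs-a"), independent of the physical
# host it currently runs on. Image, config path, and ports are placeholders.

import subprocess


def start_ganesha(instance: str, config_path: str) -> None:
    """Launch NFS-Ganesha in a container with a stable fake hostname."""
    subprocess.run(
        [
            "podman", "run", "-d",
            "--name", f"nfs-ganesha-{instance}",
            "--hostname", f"nfs-{instance}",      # stable fake hostname
            "-p", "2049:2049",                    # NFS port (placeholder mapping)
            "-v", f"{config_path}:/etc/ganesha/ganesha.conf:ro",
            "quay.io/example/nfs-ganesha:latest", # placeholder image
        ],
        check=True,
    )


# Example: after the original host dies, run the same call on a surviving
# host; the replacement still identifies itself as "nfs-a".
if __name__ == "__main__":
    start_ganesha("a", "/etc/ganesha/instance-a.conf")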

Conclusion: please use iSCSI or sacrifice HA, as there are no working
alternatives yet.

On Fri, Jun 28, 2024 at 1:31 AM Anthony D'Atri <anthony.datri@xxxxxxxxx> wrote:
>
> There are folks actively working on this gateway and there's a Slack channel.  I haven't used it myself yet.
>
> My understanding is that ESXi supports NFS.  Some people have had good success mounting KRBD volumes on a gateway system or VM and re-exporting via NFS.
>
>
>
> > On Jun 27, 2024, at 09:01, Drew Weaver <drew.weaver@xxxxxxxxxx> wrote:
> >
> > Howdy,
> >
> > I recently saw that Ceph has a gateway which allows VMWare ESXi to connect to RBD.
> >
> > We had another gateway like this a while back: the iSCSI gateway.
> >
> > The iSCSI gateway ended up being... let's say, problematic.
> >
> > Is there any reason to believe that NVMeOF will also end up on the floor, and has anyone who uses VMware extensively evaluated its viability?
> >
> > Just curious!
> >
> > Thanks,
> > -Drew
> >



-- 
Alexander Patrakov
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



