Re: Viability of NVMeOF/TCP for VMWare

On 28/06/2024 17:59, Frédéric Nass wrote:

We came to the same conclusions as Alexander when we studied replacing Ceph's iSCSI implementation with Ceph's NFS-Ganesha implementation: HA was not working.
During failovers, vmkernel would fail with messages like this:
2023-01-14T09:39:27.200Z Wa(180) vmkwarning: cpu18:2098740)WARNING: NFS41: NFS41ProcessExidResult:2499: 'Cluster Mismatch due to different server scope. Probable server bug. Remount data store to access.'

We replaced Ceph's iSCSI implementation with PetaSAN's iSCSI gateways plugged into our external Ceph cluster (an unsupported setup) and never looked back.
It's HA, active/active, highly scalable, robust whatever the situation (network issues, slow requests, ceph osd pause), and it rocks performance-wise.


Good to hear. Interfacing PetaSAN with external clusters can be set up manually, but it is not something we currently support; we may add this in future releases. Our NFS setup also works well with VMWare, with HA and active/active giving throughput on par with iSCSI but with slightly lower IOPS.



We're using a PSP of type RR with the SATP rule below:
esxcli storage nmp satp rule add -s VMW_SATP_ALUA -P VMW_PSP_RR -V PETASAN -M RBD -c tpgs_on -o enable_action_OnRetryErrors -O "iops=1" -e "Ceph iSCSI ALUA RR iops=1 PETASAN"
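
If needed, the rule can be verified afterwards with something like the following (filtering on the PETASAN vendor string set above):
esxcli storage nmp satp rule list | grep PETASAN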

And these adapter settings:
esxcli system settings advanced set -o /ISCSI/MaxIoSizeKB -i 512
esxcli iscsi adapter param set -A $(esxcli iscsi adapter list | grep iscsi_vmk | cut -d ' ' -f1) --key FirstBurstLength --value 524288
esxcli iscsi adapter param set -A $(esxcli iscsi adapter list | grep iscsi_vmk | cut -d ' ' -f1) --key MaxBurstLength --value 524288
esxcli iscsi adapter param set -A $(esxcli iscsi adapter list | grep iscsi_vmk | cut -d ' ' -f1) --key MaxRecvDataSegment --value 524288
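
To double-check that the values took effect, something like this should work (reusing the same adapter lookup as above):
esxcli system settings advanced list -o /ISCSI/MaxIoSizeKB
esxcli iscsi adapter param get -A $(esxcli iscsi adapter list | grep iscsi_vmk | cut -d ' ' -f1)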

Note that it's important to **not** use the object-map feature on RBD images, to avoid the issue mentioned in [1].
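
For illustration, features can be checked and disabled on an existing image with something like the following (the pool/image name here is just a placeholder):
rbd info rbd/vmware-lun-01
rbd feature disable rbd/vmware-lun-01 fast-diff object-map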


With version 3.2 and earlier, our iSCSI will not connect if fast-diff/object-map is enabled. With 3.3 we do support fast-diff/object-map, but in that case we recommend configuring clients in an active/failover setup. It will still work with active/active, but with degraded performance: supporting fast-diff/object-map requires the exclusive lock, and the lock will ping-pong among the nodes, hence the active/passive recommendation. To get top scale-out performance, active/active is recommended, which means no fast-diff/object-map.
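
As a sketch, an image intended for active/active use could be created with only the layering feature enabled, along these lines (pool, name and size are just examples):
rbd create rbd/vmware-lun-02 --size 4T --image-feature layering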



Regarding NVMe-oF, there's been an incredible amount of work done over a year and a half but we'll likely need to wait a bit longer to have something fully production-ready (hopefully active/active).


We will definitely be supporting this. Note that active/active will not be possible with fast-diff/object-map/exclusive-lock, for the same reasons noted above for iSCSI.

/maged

Regards,
Frédéric.

[1] https://croit.io/blog/fixing-data-corruption

----- On 28 Jun 24, at 7:12, Alexander Patrakov patrakov@xxxxxxxxx wrote:

For NFS (e.g., as implemented by NFS-ganesha), the situation is also
quite stupid.

Without high availability (HA), it works (that is, until you update
the NFS-Ganesha version), but corporate architects won't let you deploy
any system without HA because, in their view, non-HA systems are not
production-ready by definition. (And BTW, the current NVMe-oF gateway
also has no multipath and thus no viable HA.)

With an attempt to set up HA for NFS, you'll get at least the
following showstoppers:

For NFS v4.1:

* VMware refuses to work until manual admin intervention if it
sees any change in the "owner" and "scope" fields of the EXCHANGE_ID
message between the previous and the current NFS connection.
* NFS-Ganesha sets both fields from the hostname by default, and the
patch that makes these fields configurable is "quite recent" (in
version 4.3). This is important, as otherwise every NFS server
fail-over would trip up VMware, thus defeating the point of a
high-availability setup.
* There is a regression in NFS-Ganesha that manifests as a deadlock
(easily triggerable even without Ceph by running xfstests), which is
critical, because systemd cannot restart deadlocked services.
Unfortunately, the last NFS-Ganesha version before the regression
(4.0.8) does not contain the patch that allows manipulating the
"owner" and "scope" fields.
* Cephadm-based deployments do not set these configuration options anyway.
* If you would like to use the "rados_cluster" NFSv4 recovery backend
(used for grace periods), you also need to be extra careful with the
various "server names", because they are used to decide whether to end
the grace period. If the recovery backend has seen two server names
(corresponding to two NFS-Ganesha instances, for scale-out), then both
must be up for the grace period to end. If there is only one server
name, you are allowed to run only one instance. If you want high
availability together with scale-out, you need to be able to schedule
two NFS-Ganesha instances (with names like a and b, not corresponding
to the names of the hosts where they run) on two out of three available
servers. Orchestrators do not do this; you need to implement it on
your own.

For NFS v3:

* NFS-Ganesha opens files and acquires MDS locks just in case, to make
sure that another client cannot modify them while the original client
might have cached something.
* If NFS-Ganesha crashes or a server reboots, then the other
NFS-Ganesha, brought up to replace the original one, will also stumble
upon these locks, as the MDS recognizes it as a different client.
Result: it waits until the locks time out, which is too long
(minutes!), as the guest OS in VMware would then time out its storage.
* To avoid the problem mentioned above and to get seamless fail-over,
the replacement instance of NFS-Ganesha must present itself as the
same client (i.e., as the same fake hostname) to the MDS, but no known
orchestrators facilitate this.

Conclusion: please use iSCSI or sacrifice HA, as there are no working
alternatives yet.

On Fri, Jun 28, 2024 at 1:31 AM Anthony D'Atri <anthony.datri@xxxxxxxxx> wrote:
There are folks actively working on this gateway and there's a Slack channel.  I
haven't used it myself yet.

My understanding is that ESXi supports NFS.  Some people have had good success
mounting KRBD volumes on a gateway system or VM and re-exporting via NFS.
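
For reference, a minimal sketch of that KRBD + NFS re-export approach on the gateway host (the image name, mount point and export subnet are just placeholders):
rbd map rbd/vmware-nfs-01
mkfs.xfs /dev/rbd0
mkdir -p /export/vmware
mount /dev/rbd0 /export/vmware
echo "/export/vmware 10.0.0.0/24(rw,sync,no_root_squash)" >> /etc/exports
exportfs -ra
The ESXi side would then mount /export/vmware as a regular NFS datastore.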



On Jun 27, 2024, at 09:01, Drew Weaver <drew.weaver@xxxxxxxxxx> wrote:

Howdy,

I recently saw that Ceph has a gateway which allows VMWare ESXi to connect to
RBD.

We had another gateway like this a while back: the iSCSI gateway.

The iSCSI gateway ended up being... let's say problematic.

Is there any reason to believe that NVMe-oF will also end up on the floor, and has
anyone that uses VMWare extensively evaluated its viability?

Just curious!

Thanks,
-Drew



--
Alexander Patrakov
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



