Here are the yaml files I used to create the NFS and ingress services:
nfs-ingress.yaml:

service_type: ingress
service_id: nfs.xcpnfs
placement:
  count: 2
spec:
  backend_service: nfs.xcpnfs
  frontend_port: 2049
  monitor_port: 9000
  virtual_ip: 172.16.172.199/24

nfs.yaml:

service_type: nfs
service_id: xcpnfs
placement:
  hosts:
    - san1
    - san2
spec:
  port: 20490
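For completeness, this is roughly how I applied them (a sketch from memory, using the standard cephadm orchestrator commands; the file names are simply what I saved the specs as):

# Apply the NFS service first, then the ingress service in front of it
ceph orch apply -i nfs.yaml
ceph orch apply -i nfs-ingress.yaml

# Check that both services show up
ceph orch ls | grep -E 'nfs|ingress'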
Am I missing something here? Is there another mailing list where I
should be asking about this?
On 31/08/2023 10:38 am, Thorne Lawler wrote:
If there isn't any documentation for this yet, can anyone tell me:
* How do I inspect/change my NFS/haproxy/keepalived configuration? (Where I've been looking so far is sketched below.)
* What is it supposed to look like? Does someone have a working example?
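For what it's worth, this is where I have been looking. The orchestrator commands are straightforward; the on-disk paths are my best guess at how cephadm lays out the generated configs, so treat those as assumptions:

# Export the service specs the orchestrator currently holds
ceph orch ls ingress --export
ceph orch ls nfs --export

# See which hosts the haproxy, keepalived and nfs daemons landed on
ceph orch ps --daemon-type haproxy
ceph orch ps --daemon-type keepalived
ceph orch ps --daemon-type nfs

# On each host, the generated configs seem to live under the cluster fsid,
# something like (guessed from my cluster's layout):
#   /var/lib/ceph/<fsid>/haproxy.nfs.xcpnfs.<host>.<id>/haproxy/haproxy.cfg
#   /var/lib/ceph/<fsid>/keepalived.nfs.xcpnfs.<host>.<id>/keepalived.conf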
Thank you.
On 31/08/2023 9:36 am, Thorne Lawler wrote:
Sorry everyone,
Is there any more detailed documentation on the high availability NFS
functionality in current Ceph?
This is a pretty serious sticking point.
Thank you.
On 30/08/2023 9:33 am, Thorne Lawler wrote:
Fellow cephalopods,
I'm trying to get quick, seamless NFS failover happening on my
four-node Ceph cluster.
I followed the instructions here:
https://docs.ceph.com/en/latest/cephadm/services/nfs/#high-availability-nfs
but testing shows that failover doesn't happen. When I placed node 2
("san2") in maintenance mode, the NFS service shut down:
Aug 24 14:19:03 san2 ceph-e2f1b934-ed43-11ec-80fa-04421a1a1d66-nfs-xcpnfs-1-0-san2-datsvq[1962479]: 24/08/2023 04:19:03 : epoch 64b8af5a : san2 : ganesha.nfsd-8[Admin] do_shutdown :MAIN :EVENT :Removing all exports.
Aug 24 14:19:13 san2 bash[3235994]: time="2023-08-24T14:19:13+10:00" level=warning msg="StopSignal SIGTERM failed to stop container ceph-e2f1b934-ed43-11ec-80fa-04421a1a1d66-nfs-xcpnfs-1-0-san2-datsvq in 10 seconds, resorting to SIGKILL"
Aug 24 14:19:13 san2 bash[3235994]: ceph-e2f1b934-ed43-11ec-80fa-04421a1a1d66-nfs-xcpnfs-1-0-san2-datsvq
Aug 24 14:19:13 san2 systemd[1]: ceph-e2f1b934-ed43-11ec-80fa-04421a1a1d66@nfs.xcpnfs.1.0.san2.datsvq.service: Main process exited, code=exited, status=137/n/a
Aug 24 14:19:14 san2 systemd[1]: ceph-e2f1b934-ed43-11ec-80fa-04421a1a1d66@nfs.xcpnfs.1.0.san2.datsvq.service: Failed with result 'exit-code'.
Aug 24 14:19:14 san2 systemd[1]: Stopped Ceph nfs.xcpnfs.1.0.san2.datsvq for e2f1b934-ed43-11ec-80fa-04421a1a1d66.
And that's it. The ingress IP didn't move.
Odder still, the cluster seems to have placed the ingress IP on node 1
(san1) while still using the NFS service on node 2.
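This is how I checked which host has what (assuming I am reading the output correctly):

# Where the orchestrator thinks the NFS and ingress daemons are running
ceph orch ps --service-name nfs.xcpnfs
ceph orch ps --service-name ingress.nfs.xcpnfs

# Which host actually holds the virtual IP right now
ssh san1 ip -br addr | grep 172.16.172.199
ssh san2 ip -br addr | grep 172.16.172.199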
Do I need to connect the NFS service more tightly to the keepalived
and haproxy services, or do I need to expand the ingress service to
refer to multiple NFS services?
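To be clear about what I mean by connecting them more tightly: something like pinning the ingress daemons to the same hosts as the NFS daemons, e.g. the variant below. This is only my guess at the spec I should be using, not something I have tested:

service_type: ingress
service_id: nfs.xcpnfs
placement:
  hosts:
    - san1
    - san2
spec:
  backend_service: nfs.xcpnfs
  frontend_port: 2049
  monitor_port: 9000
  virtual_ip: 172.16.172.199/24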
Thank you.
--
Regards,
Thorne Lawler - Senior System Administrator
*DDNS* | ABN 76 088 607 265
First registrar certified ISO 27001-2013 Data Security Standard ITGOV40172
P +61 499 449 170