Permission denied

I reported last week that I was getting "permission denied" when pcs was
starting a gfs2 resource. I thought it was due to the resource being
defined incorrectly, but that doesn't appear to be the case. On rare
occasions the mount works, but most of the time one node gets it mounted
while the other gets denied. I've enabled a number of logging options and
done straces on both sides, but I'm not getting anywhere.
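(For reference, the straces were attached to the running daemons roughly
like this; the output paths are just illustrative:

# strace -f -tt -o /tmp/dlm_controld.strace -p $(pidof dlm_controld)
# strace -f -tt -o /tmp/corosync.strace -p $(pidof corosync)

-f follows any children the daemons fork, -tt timestamps each call.)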

My cluster looks like:

# pcs resource show
 Clone Set: dlm-clone [dlm]
   Started: [ rh7cn1.devlab.sinenomine.net rh7cn2.devlab.sinenomine.net ]
 Resource Group: apachegroup
   VirtualIP	(ocf::heartbeat:IPaddr2):	Started
   Website	(ocf::heartbeat:apache):	Started
   httplvm	(ocf::heartbeat:LVM):	Started
   http_fs	(ocf::heartbeat:Filesystem):	Started
 Clone Set: clvmd-clone [clvmd]
   Started: [ rh7cn1.devlab.sinenomine.net rh7cn2.devlab.sinenomine.net ]
 Clone Set: clusterfs-clone [clusterfs]
   Started: [ rh7cn1.devlab.sinenomine.net ]
   Stopped: [ rh7cn2.devlab.sinenomine.net ]

The gfs2 resource is defined:

# pcs resource show clusterfs
 Resource: clusterfs (class=ocf provider=heartbeat type=Filesystem)
  Attributes: device=/dev/vg_cluster/ha_lv directory=/mnt/gfs2-demo fstype=gfs2 options=noatime
  Operations: start interval=0s timeout=60 (clusterfs-start-timeout-60)
              stop interval=0s timeout=60 (clusterfs-stop-timeout-60)
              monitor interval=10s on-fail=fence (clusterfs-monitor-interval-10s)
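To take the resource agent out of the picture, the mount it performs
should reduce to roughly this (reconstructed from the attributes above;
the Filesystem agent may pass options I'm not accounting for):

# mount -t gfs2 -o noatime /dev/vg_cluster/ha_lv /mnt/gfs2-demo

Running that by hand on node 2 should show whether the "permission
denied" comes from the kernel mount itself or from something the agent
does around it.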


When the mount is attempted on node 2, the log contains:


Oct 13 11:10:42 rh7cn2 kernel: GFS2: fsid=rh7cluster:vol1: Trying to join
cluster "lock_dlm", "rh7cluster:vol1"
Oct 13 11:10:42 rh7cn2 corosync[47978]: [QB    ]
ipc_setup.c:handle_new_connection:485 IPC credentials authenticated
(47978-48271-30)
Oct 13 11:10:42 rh7cn2 corosync[47978]: [QB    ]
ipc_shm.c:qb_ipcs_shm_connect:294 connecting to client [48271]
Oct 13 11:10:42 rh7cn2 corosync[47978]: [QB    ]
ringbuffer.c:qb_rb_open_2:236 shm size:1048589; real_size:1052672;
rb->word_size:263168
Oct 13 11:10:42 rh7cn2 corosync[47978]: message repeated 2 times: [[QB
] ringbuffer.c:qb_rb_open_2:236 shm size:1048589; real_size:1052672;
rb->word_size:263168]
Oct 13 11:10:42 rh7cn2 corosync[47978]: [MAIN  ]
ipc_glue.c:cs_ipcs_connection_created:272 connection created
Oct 13 11:10:42 rh7cn2 corosync[47978]: [CPG   ]
cpg.c:cpg_lib_init_fn:1532 lib_init_fn: conn=0x2ab16a953a0,
cpd=0x2ab16a95a64
Oct 13 11:10:42 rh7cn2 corosync[47978]: [CPG   ]
cpg.c:message_handler_req_exec_cpg_procjoin:1349 got procjoin message from
cluster node 0x2 (r(0) ip(172.17.16.148) ) for pid 48271
Oct 13 11:10:43 rh7cn2 kernel: GFS2: fsid=rh7cluster:vol1: Joined cluster.
Now mounting FS...
Oct 13 11:10:43 rh7cn2 corosync[47978]: [CPG   ]
cpg.c:message_handler_req_lib_cpg_leave:1617 got leave request on 0x2ab16a953a0
Oct 13 11:10:43 rh7cn2 corosync[47978]: [CPG   ]
cpg.c:message_handler_req_exec_cpg_procleave:1365 got procleave message from
cluster node 0x2 (r(0) ip(172.17.16.148) ) for pid 48271
Oct 13 11:10:43 rh7cn2 corosync[47978]: [CPG   ]
cpg.c:message_handler_req_lib_cpg_finalize:1655 cpg finalize for
conn=0x2ab16a953a0
Oct 13 11:10:43 rh7cn2 dlm_controld[48271]: 251492 cpg_dispatch error 9
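(If I'm reading corosync's error codes right, error 9 from cpg_dispatch
is CS_ERR_BAD_HANDLE; cs_error_t mirrors the SAF AIS codes, e.g. with
corosynclib-devel installed:

# grep BAD_HANDLE /usr/include/corosync/corotypes.h
	CS_ERR_BAD_HANDLE = 9,

which would suggest dlm_controld is dispatching on a CPG handle that has
already been finalized.)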


Is the "leave request" symptomatic or causal? If the latter, why is it
generated?

On the other side:
Oct 13 11:10:41 rh7cn1 corosync[10423]: [QUORUM]
vsf_quorum.c:message_handler_req_lib_quorum_getquorate:395 got quorate
request on 0x2ab0e33c8b0
Oct 13 11:10:41 rh7cn1 corosync[10423]: [QUORUM]
vsf_quorum.c:message_handler_req_lib_quorum_getquorate:395 got quorate
request on 0x2ab0e33c8b0
Oct 13 11:10:42 rh7cn1 corosync[10423]: [CPG   ]
cpg.c:message_handler_req_exec_cpg_procjoin:1349 got procjoin message from
cluster node 0x2 (r(0) ip(172.17.16.148) ) for pid 48271
Oct 13 11:10:43 rh7cn1 kernel: GFS2: fsid=rh7cluster:vol1.0: recover
generation 6 done
Oct 13 11:10:43 rh7cn1 corosync[10423]: [CPG   ]
cpg.c:message_handler_req_exec_cpg_procleave:1365 got procleave message from
cluster node 0x2 (r(0) ip(172.17.16.148) ) for pid 48271
Oct 13 11:10:43 rh7cn1 kernel: GFS2: fsid=rh7cluster:vol1.0: recover
generation 7 done

dlm_tool dump shows:

251469 dlm:ls:vol1 conf 2 1 0 memb 1 2 join 2 left
251469 vol1 add_change cg 6 joined nodeid 2
251469 vol1 add_change cg 6 counts member 2 joined 1 remove 0 failed 0
251469 vol1 stop_kernel cg 6
251469 write "0" to "/sys/kernel/dlm/vol1/control"
251469 vol1 check_ringid done cluster 43280 cpg 1:43280
251469 vol1 check_fencing done
251469 vol1 send_start 1:6 counts 5 2 1 0 0
251469 vol1 receive_start 1:6 len 80
251469 vol1 match_change 1:6 matches cg 6
251469 vol1 wait_messages cg 6 need 1 of 2
251469 vol1 receive_start 2:1 len 80
251469 vol1 match_change 2:1 matches cg 6
251469 vol1 wait_messages cg 6 got all 2
251469 vol1 start_kernel cg 6 member_count 2
251469 dir_member 1
251469 set_members mkdir
"/sys/kernel/config/dlm/cluster/spaces/vol1/nodes/2"
251469 write "1" to "/sys/kernel/dlm/vol1/control"
251469 vol1 prepare_plocks
251469 vol1 set_plock_data_node from 1 to 1
251469 vol1 send_all_plocks_data 1:6
251469 vol1 send_all_plocks_data 1:6 0 done
251469 vol1 send_plocks_done 1:6 counts 5 2 1 0 0 plocks_data 0
251469 vol1 receive_plocks_done 1:6 flags 2 plocks_data 0 need 0 save 0
251470 dlm:ls:vol1 conf 1 0 1 memb 1 join left 2
251470 vol1 add_change cg 7 remove nodeid 2 reason leave
251470 vol1 add_change cg 7 counts member 1 joined 0 remove 1 failed 0
251470 vol1 stop_kernel cg 7
251470 write "0" to "/sys/kernel/dlm/vol1/control"
251470 vol1 purged 0 plocks for 2
251470 vol1 check_ringid done cluster 43280 cpg 1:43280
251470 vol1 check_fencing done
251470 vol1 send_start 1:7 counts 6 1 0 1 0
251470 vol1 receive_start 1:7 len 76
251470 vol1 match_change 1:7 matches cg 7
251470 vol1 wait_messages cg 7 got all 1
251470 vol1 start_kernel cg 7 member_count 1
251470 dir_member 2
251470 dir_member 1
251470 set_members rmdir
"/sys/kernel/config/dlm/cluster/spaces/vol1/nodes/2"
251470 write "1" to "/sys/kernel/dlm/vol1/control"
251470 vol1 prepare_plocks
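(For the record, the dlm logging I enabled earlier was along these
lines; the option name is from dlm.conf(5) and the path assumes the
stock RHEL 7 dlm package:

# cat /etc/dlm/dlm.conf
log_debug=1

followed by restarting dlm_controld on both nodes.)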



I would appreciate any debugging suggestions. I've straced
dlm_controld/corosync but haven't gained much clarity.
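Next I plan to watch the membership from both sides while reproducing,
along these lines (both tools ship with the RHEL 7 corosync and dlm
packages, as far as I know):

# corosync-cpgtool
# dlm_tool ls -n

to see whether the CPG group and the lockspace membership still agree at
the moment the leave arrives.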

Neale

