On Sep 8, 2014, at 11:17 AM, David Teigland <teigland@xxxxxxxxxx> wrote:

> On Mon, Sep 08, 2014 at 02:44:49PM +0000, Neale Ferguson wrote:
>> 1410147820 lvclusdidiz0360 store_plocks first 66307 last 88478 r_count 45 p_count 63 sig 5ab0
>> 1410147820 lvclusdidiz0360 receive_plocks_stored 2:8 flags a sig 5ab0 need_plocks 0
>
>> 1410147820 lvclusdidiz0360 set_plock_ckpt_node from 0 to 2
>> 1410147820 lvclusdidiz0360 receive_plocks_stored 2:8 flags a sig 5ab0 need_plocks 1
>>
>> However, when NODE1 attempts to retrieve plocks it reports:
>>
>> 1410147820 lvclusdidiz0360 retrieve_plocks
>> 1410147820 lvclusdidiz0360 retrieve_plocks first 0 last 0 r_count 0 p_count 0 sig 0
>
> You mentioned previously that it reported an error attempting to open the
> checkpoint (SA_AIS_ERR_NOT_EXIST) in retrieve_plocks. That's a slightly
> different error than successfully opening the checkpoint and finding it
> empty, although both have the same effect. Did the other node report any
> errors when it attempted to create this checkpoint?

That problem still exists but it appears to be related to the clvmd lockspace. I'm still looking at this but I'm also looking at the lockspace that corresponds to the gfs2 target.
Here's what Node1 says about the clvmd checkpoint area:

1410147820 clvmd set_plock_ckpt_node from 0 to 2
1410147820 clvmd receive_plocks_stored 2:9 flags a sig 0 need_plocks 1
1410147820 clvmd match_change 2:9 matches cg 1
1410147820 clvmd retrieve_plocks
1410147820 retrieve_plocks ckpt open error 12 clvmd
1410147820 lockspace clvmd plock disabled our sig bbfa1301 nodeid 2 sig 0

Node 2 has this to say about clvmd:

1410147820 clvmd set_plock_ckpt_node from 2 to 2
1410147820 clvmd store_plocks saved ckpt uptodate
1410147820 clvmd store_plocks first 0 last 0 r_count 0 p_count 0 sig 0
1410147820 clvmd send_plocks_stored cg 9 flags a data2 0 counts 8 2 1 0 0
1410147820 clvmd receive_plocks_stored 2:9 flags a sig 0 need_plocks 0

As for the gfs2 lockspace, Node 2 reports this when dealing with the checkpoint area:

1410147820 lvclusdidiz0360 set_plock_ckpt_node from 2 to 2
1410147820 lvclusdidiz0360 unlink ckpt 520eedd100000002
1410147820 lvclusdidiz0360 unlink ckpt error 12 lvclusdidiz0360
1410147820 lvclusdidiz0360 unlink ckpt status error 12 lvclusdidiz0360
1410147820 unlink ckpt 520eedd100000002 close err 12 lvclusdidiz0360
1410147820 lvclusdidiz0360 store_plocks r_count 45 p_count 63 total_size 2520 max_section_size 280
1410147820 lvclusdidiz0360 store_plocks open ckpt handle 6157409500000003
1410147820 lvclusdidiz0360 store_plocks first 66307 last 88478 r_count 45 p_count 63 sig 5ab0

>
>> Because of the mismatch between sig 0 and sig 5ab0 plocks get disabled and the F_SETLK operation on the gfs2 target will fail on NODE1.
>>
>> I'm trying to understand the checkpointing process and from where this information is actually being retrieved.
>
> The checkpoints have always been a source of problems, both from the user
> side in dlm_controld, and from the implementation in corosync/openais. I
> added the signatures to detect these problems more directly (and quit
> using checkpoints altogether in the RHEL7 version.)
> In this case it's not
> yet clear which side is responsible for the problem. If it's on the
> dlm_controld side, then it's probably related to unlinking or not
> unlinking a previous checkpoint, which causes subsequent failures when
> creating new checkpoints.

I'm still not grokking the checkpoint process. Where is this checkpoint information kept?

Also, when I try to imitate the situation by holding a R/W lock and then causing that node to restart without shutting down (and releasing the lock), the other node purges the lock when it detects the failing node has disappeared. I don't understand why the locks reported in the previous mail aren't purged as well.

Thanks for your comments, every bit helps me understand.

Neale

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
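
For anyone comparing the two nodes' logs by hand: the sig 0 versus sig 5ab0 mismatch discussed in this thread can be spotted mechanically. The following is only a minimal Python sketch. It assumes the store_plocks and retrieve_plocks summary lines always have the shape quoted in this mail ("first N last N r_count N p_count N sig HEX"), and the function names are hypothetical helpers, not part of dlm_controld.

```python
import re

# Summary line shape as quoted in this thread (an assumption, not a spec), e.g.
# "1410147820 lvclusdidiz0360 store_plocks first 66307 last 88478 r_count 45 p_count 63 sig 5ab0"
SUMMARY = re.compile(
    r"^\d+\s+(?P<ls>\S+)\s+(?P<op>store_plocks|retrieve_plocks)\s+"
    r"first\s+\d+\s+last\s+\d+\s+r_count\s+\d+\s+p_count\s+\d+\s+"
    r"sig\s+(?P<sig>[0-9a-f]+)$"
)


def plock_signatures(lines):
    """Map (lockspace, operation) to the signature from each summary line."""
    sigs = {}
    for line in lines:
        m = SUMMARY.match(line.strip())
        if m:
            sigs[(m.group("ls"), m.group("op"))] = int(m.group("sig"), 16)
    return sigs


def mismatched_lockspaces(storing_node_log, retrieving_node_log):
    """List (lockspace, stored_sig, retrieved_sig) where the two sigs differ."""
    stored = plock_signatures(storing_node_log)
    retrieved = plock_signatures(retrieving_node_log)
    mismatches = []
    for (ls, op), sig in stored.items():
        if op != "store_plocks":
            continue
        got = retrieved.get((ls, "retrieve_plocks"))
        # Skip lockspaces with no retrieve summary (e.g. the ckpt open failed).
        if got is not None and got != sig:
            mismatches.append((ls, sig, got))
    return mismatches


# Lines copied from the logs in this mail.
NODE2_LOG = [
    "1410147820 lvclusdidiz0360 store_plocks r_count 45 p_count 63 total_size 2520 max_section_size 280",
    "1410147820 lvclusdidiz0360 store_plocks first 66307 last 88478 r_count 45 p_count 63 sig 5ab0",
    "1410147820 clvmd store_plocks first 0 last 0 r_count 0 p_count 0 sig 0",
]
NODE1_LOG = [
    "1410147820 lvclusdidiz0360 retrieve_plocks",
    "1410147820 lvclusdidiz0360 retrieve_plocks first 0 last 0 r_count 0 p_count 0 sig 0",
]

for ls, stored_sig, retrieved_sig in mismatched_lockspaces(NODE2_LOG, NODE1_LOG):
    print(f"{ls}: stored sig {stored_sig:x}, retrieved sig {retrieved_sig:x}")
```

The sketch deliberately ignores lockspaces where the retrieving node logged no summary line at all, such as clvmd here, where the checkpoint open failed with error 12; that case needs to be looked at separately.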