On Mon, Sep 08, 2014 at 02:44:49PM +0000, Neale Ferguson wrote:
> 1410147820 lvclusdidiz0360 store_plocks first 66307 last 88478 r_count 45 p_count 63 sig 5ab0
> 1410147820 lvclusdidiz0360 receive_plocks_stored 2:8 flags a sig 5ab0 need_plocks 0
> 1410147820 lvclusdidiz0360 set_plock_ckpt_node from 0 to 2
> 1410147820 lvclusdidiz0360 receive_plocks_stored 2:8 flags a sig 5ab0 need_plocks 1
>
> However, when NODE1 attempts to retrieve plocks it reports:
>
> 1410147820 lvclusdidiz0360 retrieve_plocks
> 1410147820 lvclusdidiz0360 retrieve_plocks first 0 last 0 r_count 0 p_count 0 sig 0

You mentioned previously that it reported an error attempting to open the
checkpoint (SA_AIS_ERR_NOT_EXIST) in retrieve_plocks. That's a slightly
different error than successfully opening the checkpoint and finding it
empty, although both have the same effect. Did the other node report any
errors when it attempted to create this checkpoint?

> Because of the mismatch between sig 0 and sig 5ab0 plocks get disabled
> and the F_SETLK operation on the gfs2 target will fail on NODE1.
>
> I am trying to understand the checkpointing process and from where this
> information is actually being retrieved.

The checkpoints have always been a source of problems, both from the user
side in dlm_controld, and from the implementation in corosync/openais.
I added the signatures to detect these problems more directly (and quit
using checkpoints altogether in the RHEL7 version.)

In this case it's not yet clear which side is responsible for the problem.
If it's on the dlm_controld side, then it's probably related to unlinking
or not unlinking a previous checkpoint, which causes subsequent failures
when creating new checkpoints.

-- 
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
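
To compare what one node stored against what the other retrieved, something
like the following could be run over the debug logs from both nodes. This is
only a rough sketch and not part of dlm_controld: it assumes the summary
lines always have the format shown in the excerpts above, and the names
parse_plock_line and compare_checkpoint are made up for illustration.

```python
import re

# Assumed format, taken from the log excerpts quoted above, e.g.
# "1410147820 lvclusdidiz0360 store_plocks first 66307 last 88478 r_count 45 p_count 63 sig 5ab0"
PLOCK_LINE = re.compile(
    r"^(?P<time>\d+)\s+(?P<ls>\S+)\s+(?P<op>store_plocks|retrieve_plocks)\s+"
    r"first (?P<first>\d+) last (?P<last>\d+) "
    r"r_count (?P<r_count>\d+) p_count (?P<p_count>\d+) sig (?P<sig>[0-9a-f]+)$"
)


def parse_plock_line(line):
    """Parse one summary line; return a dict, or None if it does not match."""
    m = PLOCK_LINE.match(line.strip())
    if m is None:
        return None
    rec = m.groupdict()
    for key in ("time", "first", "last", "r_count", "p_count"):
        rec[key] = int(rec[key])
    rec["sig"] = int(rec["sig"], 16)  # sig is logged in hex
    return rec


def compare_checkpoint(store_log, retrieve_log):
    """Return (lockspace, stored, retrieved) tuples where the sigs differ.

    store_log: lines from the node that wrote the checkpoint.
    retrieve_log: lines from the node that read it.
    """
    stored = {}
    for line in store_log:
        rec = parse_plock_line(line)
        if rec and rec["op"] == "store_plocks":
            stored[rec["ls"]] = rec  # keep the most recent store per lockspace

    mismatches = []
    for line in retrieve_log:
        rec = parse_plock_line(line)
        if rec and rec["op"] == "retrieve_plocks":
            st = stored.get(rec["ls"])
            if st is not None and st["sig"] != rec["sig"]:
                mismatches.append((rec["ls"], st, rec))
    return mismatches
```

Fed the lines quoted above, this would flag lvclusdidiz0360 with a stored sig
of 5ab0 against a retrieved sig of 0. It cannot tell a failed checkpoint open
apart from an empty checkpoint, since both leave the retrieved counts and sig
at zero; that still needs the error message from the log.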