Further to the problem described last week. What I'm seeing is that the node (NODE2) that keeps going when NODE1 fails has many entries in dlm_tool log_plocks output:

1410147734 lvclusdidiz0360 receive plock 10303 LK WR 0-7fffffffffffffff 1/8112/13d5000 w 0
1410147734 lvclusdidiz0360 receive plock 1305c LK WR 0-7fffffffffffffff 1/8147/1390400 w 0
1410147734 lvclusdidiz0360 receive plock 50081 LK WR 0-7fffffffffffffff 1/8182/7ce04400 w 0
1410147736 lvclusdidiz0360 receive plock 10303 LK WR 0-7fffffffffffffff 1/8112/13d5000 w 0
1410147736 lvclusdidiz0360 receive plock 1305c LK WR 0-7fffffffffffffff 1/8147/1390400 w 0
1410147736 lvclusdidiz0360 receive plock 50081 LK WR 0-7fffffffffffffff 1/8182/7ce04400 w 0
1410147738 lvclusdidiz0360 receive plock 10303 LK WR 0-7fffffffffffffff 1/8112/13d5000 w 0
1410147738 lvclusdidiz0360 receive plock 1305c LK WR 0-7fffffffffffffff 1/8147/1390400 w 0
1410147738 lvclusdidiz0360 receive plock 50081 LK WR 0-7fffffffffffffff 1/8182/7ce04400 w 0
1410147740 lvclusdidiz0360 receive plock 10303 LK WR 0-7fffffffffffffff 1/8112/13d5000 w 0
1410147740 lvclusdidiz0360 receive plock 1305c LK WR 0-7fffffffffffffff 1/8147/1390400 w 0
1410147740 lvclusdidiz0360 receive plock 50081 LK WR 0-7fffffffffffffff 1/8182/7ce04400 w 0
1410147742 lvclusdidiz0360 receive plock 10303 LK WR 0-7fffffffffffffff 1/8112/13d5000 w 0
1410147742 lvclusdidiz0360 receive plock 1305c LK WR 0-7fffffffffffffff 1/8147/1390400 w 0
1410147742 lvclusdidiz0360 receive plock 50081 LK WR 0-7fffffffffffffff 1/8182/7ce04400 w 0

i.e. with no corresponding unlock entry.

NODE1 is brought down by init 6 and when it restarts it gets as far as "Starting cman" before NODE2 fences it (I assume we need a higher post_join_delay).
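For anyone wanting to check a saved copy of the log_plocks output for lock requests that never see an unlock, here is a rough sketch. It is not part of dlm_tool; the field layout is inferred only from the entries quoted above, and the "UN" marker for unlock entries is an assumption (none appear in this log), so it is passed in as a parameter. The function name unmatched_locks is made up for illustration.

```python
import re
from collections import Counter

# Assumed layout, inferred from the quoted entries:
# <time> <lockspace> receive plock <number> <op> <mode> <start>-<end> <nodeid>/<pid>/<owner> w <wait>
PLOCK_RE = re.compile(
    r"^(?P<time>\d+) (?P<ls>\S+) receive plock (?P<number>[0-9a-f]+) "
    r"(?P<op>\S+) (?P<mode>\S+) (?P<start>[0-9a-f]+)-(?P<end>[0-9a-f]+) "
    r"(?P<nodeid>\d+)/(?P<pid>\d+)/(?P<owner>[0-9a-f]+) w (?P<wait>\d+)$"
)


def unmatched_locks(lines, unlock_op="UN"):
    """Count LK entries per (lockspace, number, nodeid, pid, owner)
    that are not followed by an unlock entry for the same key.

    unlock_op is an assumption about how unlock entries are labelled.
    """
    pending = Counter()
    for line in lines:
        m = PLOCK_RE.match(line.strip())
        if m is None:
            continue  # not a "receive plock" entry
        key = (m["ls"], m["number"], m["nodeid"], m["pid"], m["owner"])
        if m["op"] == "LK":
            pending[key] += 1
        elif m["op"] == unlock_op:
            pending.pop(key, None)  # unlock clears the outstanding requests
    return pending


sample = [
    "1410147734 lvclusdidiz0360 receive plock 10303 LK WR 0-7fffffffffffffff 1/8112/13d5000 w 0",
    "1410147734 lvclusdidiz0360 receive plock 1305c LK WR 0-7fffffffffffffff 1/8147/1390400 w 0",
    "1410147734 lvclusdidiz0360 receive plock 50081 LK WR 0-7fffffffffffffff 1/8182/7ce04400 w 0",
    "1410147736 lvclusdidiz0360 receive plock 10303 LK WR 0-7fffffffffffffff 1/8112/13d5000 w 0",
    "1410147736 lvclusdidiz0360 receive plock 1305c LK WR 0-7fffffffffffffff 1/8147/1390400 w 0",
    "1410147736 lvclusdidiz0360 receive plock 50081 LK WR 0-7fffffffffffffff 1/8182/7ce04400 w 0",
]

result = unmatched_locks(sample)
for key, count in sorted(result.items()):
    print(key, count)
```

Run over the full log, this should list the same three resources (10303, 1305c, 50081) with a count that grows every two seconds, which is the pattern described above.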
When the node is fenced I see:

1410147774 clvmd purged 0 plocks for 1
1410147774 lvclusdidiz0360 purged 3 plocks for 1

So it looks like it tried to do some cleanup, but then when NODE1 attempts to join, NODE2 examines the lockspace and reports the following:

1410147820 lvclusdidiz0360 wr sect ro 0 rf 0 len 40 "r78067.0"
1410147820 lvclusdidiz0360 wr sect ro 0 rf 0 len 40 "r78068.0"
1410147820 lvclusdidiz0360 wr sect ro 0 rf 0 len 40 "r78059.0"
1410147820 lvclusdidiz0360 wr sect ro 0 rf 0 len 40 "r88464.0"
1410147820 lvclusdidiz0360 wr sect ro 0 rf 0 len 40 "r88478.0"
1410147820 lvclusdidiz0360 store_plocks first 66307 last 88478 r_count 45 p_count 63 sig 5ab0
1410147820 lvclusdidiz0360 receive_plocks_stored 2:8 flags a sig 5ab0 need_plocks 0

So it believes NODE1 will have 45 plocks to process when it comes back. NODE1 receives that plock information:

1410147820 lvclusdidiz0360 set_plock_ckpt_node from 0 to 2
1410147820 lvclusdidiz0360 receive_plocks_stored 2:8 flags a sig 5ab0 need_plocks 1

However, when NODE1 attempts to retrieve plocks it reports:

1410147820 lvclusdidiz0360 retrieve_plocks
1410147820 lvclusdidiz0360 retrieve_plocks first 0 last 0 r_count 0 p_count 0 sig 0

Because of the mismatch between sig 0 and sig 5ab0, plocks get disabled and the F_SETLK operation on the gfs2 target will fail on NODE1.

I'm trying to understand the checkpointing process and where this information is actually being retrieved from.

Neale
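P.S. To compare what NODE2 stored against what NODE1 retrieved without eyeballing the logs, something like the following rough sketch may help. It only parses the store_plocks and retrieve_plocks summary lines in the form quoted above; the function names parse_summary and compare_summaries are made up for illustration, and nothing here reflects how dlm_controld itself performs the check.

```python
import re

# Assumed layout, inferred from the two summary lines quoted above.
SUMMARY_RE = re.compile(
    r"^(?P<time>\d+) (?P<ls>\S+) (?P<kind>store_plocks|retrieve_plocks) "
    r"first (?P<first>\S+) last (?P<last>\S+) "
    r"r_count (?P<r_count>\d+) p_count (?P<p_count>\d+) sig (?P<sig>[0-9a-f]+)$"
)

FIELDS = ("first", "last", "r_count", "p_count", "sig")


def parse_summary(line):
    """Return the summary fields as a dict, or None if the line does not match."""
    m = SUMMARY_RE.match(line.strip())
    if m is None:
        return None
    fields = m.groupdict()
    # Only the counts are converted; first/last/sig stay as logged strings.
    fields["r_count"] = int(fields["r_count"])
    fields["p_count"] = int(fields["p_count"])
    return fields


def compare_summaries(stored_line, retrieved_line):
    """Return {field: (stored, retrieved)} for every field that differs."""
    stored = parse_summary(stored_line)
    retrieved = parse_summary(retrieved_line)
    if stored is None or retrieved is None:
        raise ValueError("line is not a store_plocks/retrieve_plocks summary")
    return {
        name: (stored[name], retrieved[name])
        for name in FIELDS
        if stored[name] != retrieved[name]
    }


stored_line = ("1410147820 lvclusdidiz0360 store_plocks first 66307 last 88478 "
               "r_count 45 p_count 63 sig 5ab0")
retrieved_line = ("1410147820 lvclusdidiz0360 retrieve_plocks first 0 last 0 "
                  "r_count 0 p_count 0 sig 0")

diff = compare_summaries(stored_line, retrieved_line)
for name, (stored, retrieved) in diff.items():
    print(f"{name}: stored={stored} retrieved={retrieved}")
```

With the two lines from this incident every field differs, which suggests NODE1 read an empty checkpoint rather than a partially written one.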