Adding the linux-block mailing list.

---------- Forwarded message ---------
From: Dongsheng Yang <dongsheng081251@xxxxxxxxx>
Date: Thu, Feb 6, 2020 at 9:44 AM
Subject: Fwd: Bug Report : meet an unexcepted WFBitMapS status after restarting the primary
To: <lars.ellenberg@xxxxxxxxxx>, <philipp.reisner@xxxxxxxxxx>, <linux-block@xxxxxxxxxxxxxxx>, <joel.colledge@xxxxxxxxxx>, <drbd-dev@xxxxxxxxxxxxxxxx>
Cc: <duan.zhang@xxxxxxxxxxxx>

Hi Philipp and Lars,

Any suggestions?

Thanks

---------- Forwarded message ---------
From: Dongsheng Yang <dongsheng081251@xxxxxxxxx>
Date: Wed, Feb 5, 2020 at 7:06 PM
Subject: Bug Report : meet an unexcepted WFBitMapS status after restarting the primary
To: <joel.colledge@xxxxxxxxxx>
Cc: <drbd-dev@xxxxxxxxxxxxxxxx>, <duan.zhang@xxxxxxxxxxxx>

Hi guys,

Version: drbd-9.0.21-1

Layout: drbd.res within 3 nodes -- node-1 (Secondary), node-2 (Primary), node-3 (Secondary)

Description:
a. Reboot node-2 while the cluster is running.
b. Bring drbd.res up again on node-2 after it has restarted.
c. An expected resync from node-3 to node-2 happens. When the resync is done, however, node-1 reports an unexpected WFBitMapS replication state and never recovers to normal.

Status output:

node-1: drbdadm status
    drbd6 role:Secondary
      disk:UpToDate
      hotspare connection:Connecting
      node-2 role:Primary
        replication:WFBitMapS peer-disk:Consistent
      node-3 role:Secondary
        peer-disk:UpToDate

node-2: drbdadm status
    drbd6 role:Primary
      disk:UpToDate
      hotspare connection:Connecting
      node-1 role:Secondary
        peer-disk:UpToDate
      node-3 role:Secondary
        peer-disk:UpToDate

Based on my reading of the source code for this version, I assume the sequence of events is as follows (events listed under the same number happen at about the same time on different nodes):

 1. node-2: restarted with CRASHED_PRIMARY
 2. node-2: start sync with node-3 as target
    node-3: start sync with node-2 as source
    …
 3. node-2: end sync with node-3
    node-3: end sync with node-2
 4. node-2: w_after_state_change
 5. node-2: loop 1 within the for loop, against node-1: (a) send uuid with UUID_FLAG_GOT_STABLE & CRASHED_PRIMARY to node-1
    node-1: receive_uuids10
 6. node-1: receive uuid of node-2 with CRASHED_PRIMARY
    node-2: loop 2 within the for loop, against node-3: (b) clear CRASHED_PRIMARY
 7. node-1: send uuid to node-2 with UUID_FLAG_RESYNC
    node-2: receive_uuids10
 8. node-1: sync_handshake to SYNC_SOURCE_IF_BOTH_FAILED
    node-2: sync_handshake to NO_SYNC
 9. node-1: change repl state to WFBitMapS

The key problem is the order of step (a) and step (b): node-2 sends the unexpected CRASHED_PRIMARY flag to node-1 even though it is actually no longer a crashed primary after syncing with node-3.

So I have the following questions:
a. Is this really a bug, or is it the expected result?
b. Is there already a patch that fixes this in the newest version?
c. Is there a workaround for this kind of unexpected status? I have run into many other problems like this one :(
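
P.S. To make the ordering of step (a) and step (b) concrete, here is a toy model in Python. It is only a sketch of my assumption above, not DRBD code: `notify_peers`, `PEERS`, the `resync_source` key and the `clear_before_send` switch are hypothetical names, and only the CRASHED_PRIMARY flag name comes from the report. It shows that when the flag is cleared inside the per-peer loop, a peer visited earlier still sees the stale flag.

```python
# Toy model of the step (a) / step (b) ordering. NOT DRBD code.
# Everything except the flag name is a hypothetical illustration.

CRASHED_PRIMARY = "CRASHED_PRIMARY"

# Peers in the order the loop on node-2 is assumed to visit them.
PEERS = [
    {"name": "node-1", "resync_source": False},
    {"name": "node-3", "resync_source": True},  # node-2 just resynced from it
]


def notify_peers(local_flags, peers, clear_before_send):
    """Return, per peer name, the flags carried by the UUID packet sent to it."""
    flags = set(local_flags)
    seen = {}
    if clear_before_send:
        # Hypothetical alternative ordering: clear the flag once, up front.
        flags.discard(CRASHED_PRIMARY)
    for peer in peers:
        if peer["resync_source"]:
            # Step (b): the flag is cleared only when the loop reaches
            # the peer that the resync has just finished with.
            flags.discard(CRASHED_PRIMARY)
        # Step (a): the packet carries whatever flags are set right now.
        seen[peer["name"]] = frozenset(flags)
    return seen


observed = notify_peers({CRASHED_PRIMARY}, PEERS, clear_before_send=False)
print(CRASHED_PRIMARY in observed["node-1"])  # True: node-1 sees the stale flag
print(CRASHED_PRIMARY in observed["node-3"])  # False

reordered = notify_peers({CRASHED_PRIMARY}, PEERS, clear_before_send=True)
print(CRASHED_PRIMARY in reordered["node-1"])  # False
```

In this model the stale flag reaches node-1 only because node-1 is visited before node-3. Whether the real code can safely clear the flag before the loop is exactly what I am asking about.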