Fwd: Bug Report : meet an unexpected WFBitMapS status after restarting the primary

Dongsheng Yang <dongsheng081251@xxxxxxxxx> · Thu, 6 Feb 2020 10:01:07 +0800

Adding the linux-block mailing list.

---------- Forwarded message ---------
From: Dongsheng Yang <dongsheng081251@xxxxxxxxx>
Date: Thu, Feb 6, 2020, 9:44 AM
Subject: Fwd: Bug Report : meet an unexpected WFBitMapS status after
restarting the primary
To: <lars.ellenberg@xxxxxxxxxx>, <philipp.reisner@xxxxxxxxxx>,
<linux-block@xxxxxxxxxxxxxxx>, <joel.colledge@xxxxxxxxxx>,
<drbd-dev@xxxxxxxxxxxxxxxx>
Cc: <duan.zhang@xxxxxxxxxxxx>

Hi Philipp and Lars,
     Any suggestions?

Thanks
---------- Forwarded message ---------
From: Dongsheng Yang <dongsheng081251@xxxxxxxxx>
Date: Wed, Feb 5, 2020, 7:06 PM
Subject: Bug Report : meet an unexpected WFBitMapS status after
restarting the primary
To: <joel.colledge@xxxxxxxxxx>
Cc: <drbd-dev@xxxxxxxxxxxxxxxx>, <duan.zhang@xxxxxxxxxxxx>

Hi guys,

Version: drbd-9.0.21-1

Layout: drbd.res across 3 nodes: node-1 (Secondary), node-2 (Primary),
node-3 (Secondary)

Description:
a. Reboot node-2 while the cluster is working.
b. Bring drbd.res up again on node-2 after it has restarted.
c. The expected resync from node-3 to node-2 happens. When the resync is
   done, however, node-1 enters an unexpected WFBitMapS replication state
   and never recovers to normal.

Status output:

node-1: drbdadm status

drbd6 role:Secondary
  disk:UpToDate
  hotspare connection:Connecting
  node-2 role:Primary
    replication:WFBitMapS peer-disk:Consistent
  node-3 role:Secondary
    peer-disk:UpToDate

node-2: drbdadm status

drbd6 role:Primary
  disk:UpToDate
  hotspare connection:Connecting
  node-1 role:Secondary
    peer-disk:UpToDate
  node-3 role:Secondary
    peer-disk:UpToDate
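
To spot this condition automatically, the status text above can be scanned for peers whose replication state is WFBitMapS. The following is only a minimal sketch: the function name `find_peers_in_state` is hypothetical, it assumes the output describes a single resource, and it relies only on the `name key:value` layout visible in the output above, not on any documented machine-readable format.

```python
# Sample copied from the node-1 output in this report.
SAMPLE = """\
drbd6 role:Secondary
  disk:UpToDate
  hotspare connection:Connecting
  node-2 role:Primary
    replication:WFBitMapS peer-disk:Consistent
  node-3 role:Secondary
    peer-disk:UpToDate
"""


def find_peers_in_state(status_text, state="WFBitMapS"):
    """Return (resource, peer) pairs whose replication field equals `state`.

    Assumption: the text covers one resource. The first line that starts
    with a bare name is the resource; later bare names are peers.
    Blank lines and indentation are ignored.
    """
    resource = None
    peer = None
    hits = []
    for raw in status_text.splitlines():
        tokens = raw.split()
        if not tokens:
            continue
        if ":" not in tokens[0]:
            # A bare name starts a new resource or peer entry.
            if resource is None:
                resource = tokens[0]
            else:
                peer = tokens[0]
            tokens = tokens[1:]
        # Remaining tokens are key:value attributes of the current entry.
        attrs = dict(t.split(":", 1) for t in tokens if ":" in t)
        if peer is not None and attrs.get("replication") == state:
            hits.append((resource, peer))
    return hits


print(find_peers_in_state(SAMPLE))
```

A monitoring script could feed it the captured output of `drbdadm status` and raise an alert if a pair stays in the list for longer than a resync should take.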

I assume the following sequence of events, based on my reading of the
source code for this version:

node-1                            node-2                              node-3
                                  restarted with CRASHED_PRIMARY
                                  start sync with node-3 as target    start sync with node-2 as source
                                  …                                   …
                                  end sync with node-3                end sync with node-2
                                  w_after_state_change
                                  loop 1 within for loop against node-1: (a)
receive_uuids10                   send uuid with UUID_FLAG_GOT_STABLE &
                                    CRASHED_PRIMARY to node-1
receive uuid of node-2 with       loop 2 within for loop against node-3:
  CRASHED_PRIMARY
                                  clear CRASHED_PRIMARY (b)
send uuid to node-2 with          receive uuids10
  UUID_FLAG_RESYNC
sync_handshake to                 sync_handshake to NO_SYNC
  SYNC_SOURCE_IF_BOTH_FAILED
change repl state to WFBitMapS

The key problem is the order of step (a) and step (b): node-2 sends the
unexpected CRASHED_PRIMARY flag to node-1 even though it is actually no
longer a crashed primary after syncing with node-3.
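
To make the suspected ordering problem concrete, here is a toy model of the exchange between node-2 and node-1. It is not DRBD code: the `Node` class, the `simulate` function and both decision rules are simplified assumptions that only mirror the sequence described in this report, and "Established" is used as the assumed normal replication state.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    crashed_primary: bool = False
    repl_state: str = "Established"


def simulate(clear_before_send):
    """Toy model of the UUID exchange after node-2 finishes its resync.

    clear_before_send=False models the reported order, (a) then (b).
    clear_before_send=True models a hypothetical order, (b) then (a).
    """
    node1 = Node("node-1")
    node2 = Node("node-2", crashed_primary=True)  # flag still set when the resync ends

    if clear_before_send:
        node2.crashed_primary = False  # hypothetical: clear the flag first

    # (a) node-2 sends its UUIDs to node-1 with the flag as currently set.
    flag_seen_by_node1 = node2.crashed_primary

    # (b) node-2 clears the flag (a no-op in the hypothetical order).
    node2.crashed_primary = False

    # Assumed rule: node-1 wants to be sync source if the peer says it crashed.
    node1_decision = "SYNC_SOURCE_IF_BOTH_FAILED" if flag_seen_by_node1 else "NO_SYNC"
    # By the time node-1's reply arrives, node-2 has cleared the flag, so per
    # the report it concludes that no sync is needed.
    node2_decision = "NO_SYNC"

    if node1_decision != "NO_SYNC":
        node1.repl_state = "WFBitMapS"  # node-1 waits for a bitmap that never comes
    return node1_decision, node2_decision, node1.repl_state


print(simulate(clear_before_send=False))
print(simulate(clear_before_send=True))
```

In the reported order the two sides disagree, which would leave node-1 waiting in WFBitMapS; in the hypothetical order both sides agree on NO_SYNC. Whether the real fix is to reorder these steps is exactly the open question here.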
So I have the following questions:
a. Is this really a bug, or is it expected behavior?
b. Is there already a patch that fixes this in the newest version?
c. Is there a workaround for this kind of unexpected status? I keep
   running into many other problems like this :(