Goldwyn> Thanks for the review.

You're welcome. I'm not qualified to comment on the actual code really,
but I felt that you needed to be a lot more up-front and detailed in your
docs, especially since you were writing a readme on this. You should also
talk more about the split-brain possibilities, especially with
non-cluster-aware filesystems like ext3/4 which might be set up and used
on top of this. The detailed design docs are great too, but maybe they
should really be in md-cluster-design.txt, while the md-cluster.txt file
talks about how to use it and what to expect.

Goldwyn> On 12/19/2014 09:38 AM, John Stoffel wrote:
>>>>>>> "Goldwyn" == Goldwyn Rodrigues <rgoldwyn@xxxxxxx> writes:
>>
>> This is an interesting concept, but I think you're glossing over the
>> details here way too much. You're so close to the trees that you're
>> missing the forest. You need to spell out the requirements in terms
>> of software, configuration, etc. ahead of time.
>>
>> Showing how people can configure this for testing would be good as
>> well. Right now though, I wouldn't touch this with a ten foot pole.

Goldwyn> I mentioned a quick howto in patch zero. However, putting it in the
Goldwyn> design document will not hurt. Currently, it is known to work with
Goldwyn> corosync 2.3.x and pacemaker 1.1 on kernels 3.14.x.
>>
Goldwyn> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@xxxxxxxx>
Goldwyn> ---
Goldwyn>  Documentation/md-cluster.txt | 178 +++++++++++++++++++++++++++++++++++++++++++
Goldwyn>  1 file changed, 178 insertions(+)
Goldwyn>  create mode 100644 Documentation/md-cluster.txt
>>
Goldwyn> diff --git a/Documentation/md-cluster.txt b/Documentation/md-cluster.txt
Goldwyn> new file mode 100644
Goldwyn> index 0000000..038d0f0
Goldwyn> --- /dev/null
Goldwyn> +++ b/Documentation/md-cluster.txt
Goldwyn> @@ -0,0 +1,178 @@
Goldwyn> +The cluster MD is a shared-device RAID for a cluster.
>>
>> How is this cluster set up? What are the restrictions? You jump
>> straight into the on-disk format, without any introduction to the
>> problem and how you solve it.

Goldwyn> The cluster is a regular corosync/pacemaker cluster with DLM set up.
Goldwyn> I mentioned this in patch zero as well. However, I assumed configuring
Goldwyn> a cluster is not in the scope of the design document. This is the
Goldwyn> design of cluster-md. I agree it could use a foreword though.
>>
Goldwyn> +
Goldwyn> +
Goldwyn> +1. On-disk format
Goldwyn> +
Goldwyn> +Separate write-intent bitmaps are used for each cluster node.
Goldwyn> +The bitmaps record all writes that may have been started on that node,
Goldwyn> +and may not yet have finished. The on-disk layout is:
Goldwyn> +
Goldwyn> +0                    4k                     8k                    12k
Goldwyn> +-------------------------------------------------------------------
Goldwyn> +| idle                | md super            | bm super [0] + bits |
Goldwyn> +| bm bits[0, contd]   | bm super[1] + bits  | bm bits[1, contd]   |
Goldwyn> +| bm super[2] + bits  | bm bits [2, contd]  | bm super[3] + bits  |
Goldwyn> +| bm bits [3, contd]  |                     |                     |
Goldwyn> +
Goldwyn> +During "normal" functioning we assume the filesystem ensures that only
Goldwyn> +one node writes to any given block at a time, so a write request will
Goldwyn> + - set the appropriate bit (if not already set)
Goldwyn> + - commit the write to all mirrors
Goldwyn> + - schedule the bit to be cleared after a timeout.
Goldwyn> +
Goldwyn> +Reads are just handled normally. It is up to the filesystem to
Goldwyn> +ensure one node doesn't read from a location where another node (or
Goldwyn> +the same node) is writing.
>>
>> GAH! So what filesystem(s) are supported and known to work? Why is
>> this information not in the introduction? You just toss off this
>> statement without any context.

Goldwyn> The point here is that data integrity is the responsibility of the
Goldwyn> filesystem. The cluster-md just ensures that all it has confirmed as
Goldwyn> written is stable and mirrored (RAID1). As for filesystem support, all
Goldwyn> device-based filesystems are supported. However, we are targeting
Goldwyn> cluster-based filesystems such as ocfs2. Yes, it could be moved to the
Goldwyn> Introduction.
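
Fair enough on pushing data integrity up to the filesystem. To check that
I'm reading the write path in section 1 correctly, here is how I picture a
single write being handled, as a rough C sketch. Every type, function and
callback name below is my own invention for illustration; none of it is
taken from your patch:

/* Sketch only: hypothetical names, not the md-cluster implementation. */
#include <stdbool.h>
#include <stddef.h>

#define BITS_PER_LONG (8 * sizeof(unsigned long))

struct node_bitmap {
    unsigned long *bits;        /* one bit per bitmap region           */
    unsigned int region_shift;  /* log2 of the region size, in sectors */
};

/* Set the bit covering 'sector' (a no-op if it is already set). */
static void set_region_bit(struct node_bitmap *bm, unsigned long long sector)
{
    size_t region = (size_t)(sector >> bm->region_shift);

    bm->bits[region / BITS_PER_LONG] |= 1UL << (region % BITS_PER_LONG);
}

/*
 * One write request, following the three steps in section 1: set the
 * appropriate bit, commit the write to all mirrors, then schedule the
 * bit to be cleared after a timeout.
 */
static void cluster_write(struct node_bitmap *bm, unsigned long long sector,
                          void (*commit_to_all_mirrors)(unsigned long long),
                          void (*schedule_bit_clear)(unsigned long long))
{
    set_region_bit(bm, sector);
    commit_to_all_mirrors(sector);
    schedule_bit_clear(sector);   /* cleared lazily, after a timeout */
}

If I've mis-read that, it's probably a sign the section needs another
sentence or two spelling it out.
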
>> And you also seem to imply that I can't just put LVM volumes on top of
>> this mirror either, which to me is a huge layering violation. If I'm
>> using MD to build RAID1 devices, I don't care how MD handles
>> reads/writes being out of sync. My filesystem or volumes on top get
>> consistent storage without having to know anything special.

Goldwyn> No, I am not implying LVM cannot be used. LVM can be used in
Goldwyn> conjunction with cluster-md.

Goldwyn> If you are reading the design document of cluster-md, I think you
Goldwyn> should be concerned with how out-of-sync data is handled in order to
Goldwyn> understand the design better. Filesystems just treat this as a normal
Goldwyn> block device and do not need to know anything special.
>>
Goldwyn> +2. DLM Locks for management
Goldwyn> +
Goldwyn> +There are two locks for managing the device:
Goldwyn> +
Goldwyn> +2.1 Bitmap lock resource (bm_lockres)
Goldwyn> +
Goldwyn> + The bm_lockres protects individual node bitmaps. They are named in
Goldwyn> + the form bitmap001 for node 1, bitmap002 for node 2 and so on. When
Goldwyn> + a node joins the cluster, it acquires the lock in PW mode and it
Goldwyn> + stays so
>>
>> PW is what? Make sure you expand all your acronyms the first time you
>> use them so we can confirm we all understand them please.

Goldwyn> PW is Protected Write. I will add that.
>>
Goldwyn> + during the lifetime the node is part of the cluster. The lock
Goldwyn> + resource number is based on the slot number returned by the DLM
Goldwyn> + subsystem. Since DLM starts node count from one and bitmap slots
Goldwyn> + start from zero, one is subtracted from the DLM slot number to
Goldwyn> + arrive at the bitmap slot number.
>>
>> Why do you bother? Why not just make the bitmap slots start at 1 and
>> reserve zero for a special case? Say that the bitmap is set up but not
>> initialized?

Goldwyn> What would that special case be? The bitmap setup is not a two-step
Goldwyn> process. If it is set up, it is also initialized.
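
OK. Putting the naming and the off-by-one together, my understanding of the
slot mapping is roughly the snippet below; again, this is only my own
illustration with an invented helper name, not code from your patch:

/* Sketch only: hypothetical helper, not code from the patch. */
#include <stdio.h>

/*
 * DLM slots are numbered from 1 and bitmap slots from 0, so the bitmap
 * slot is (dlm_slot - 1), while the lock resource for DLM slot 1 is
 * "bitmap001", for slot 2 "bitmap002", and so on.
 */
static int bitmap_slot_and_name(int dlm_slot, char *name, size_t len)
{
    if (dlm_slot < 1)
        return -1;                      /* invalid DLM slot */
    snprintf(name, len, "bitmap%03d", dlm_slot);
    return dlm_slot - 1;                /* bitmap slot number */
}

/* Example: DLM slot 1 -> lock resource "bitmap001", bitmap slot 0. */

If that's what actually happens, a one-line note in the doc saying the
resource name uses the DLM slot number while the bitmap index uses slot
minus one would have saved me the question.
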
>>
Goldwyn> +
Goldwyn> +3. Communication
Goldwyn> +
Goldwyn> +Each node has to communicate with other nodes when starting or ending
Goldwyn> +resync, and for metadata superblock updates.
>>
>> HOW!!!! Does this all depend on DRBD being installed? Or some other
>> HA software?

Goldwyn> DLM. Mentioned later in the design. Yes, I will add that as well.
>>
Goldwyn> +
Goldwyn> +3.1 Message Types
Goldwyn> +
Goldwyn> + There are 3 types of messages which are passed:
Goldwyn> +
Goldwyn> + 3.1.1 METADATA_UPDATED: informs other nodes that the metadata has
Goldwyn> + been updated, and the node must re-read the md superblock. This is
Goldwyn> + performed synchronously.
Goldwyn> +
Goldwyn> + 3.1.2 RESYNC: informs other nodes that a resync is initiated or
Goldwyn> + ended so that each node may suspend or resume the region.
Goldwyn> +
Goldwyn> +3.2 Communication mechanism
Goldwyn> +
Goldwyn> + The DLM LVB is used to communicate within the nodes of the cluster.
Goldwyn> + There are three resources used for the purpose:
Goldwyn> +
Goldwyn> + 3.2.1 Token: The resource which protects the entire communication
Goldwyn> + system. The node holding the token resource is allowed to
Goldwyn> + communicate.
Goldwyn> +
Goldwyn> + 3.2.2 Message: The lock resource which carries the data to
Goldwyn> + communicate.
Goldwyn> +
Goldwyn> + 3.2.3 Ack: The resource; acquiring it means the message has been
Goldwyn> + acknowledged by all nodes in the cluster. The BAST of the resource
Goldwyn> + is used to inform the receiving node that a node wants to
Goldwyn> + communicate.
Goldwyn> +
Goldwyn> +The algorithm is:
Goldwyn> +
Goldwyn> + 1. receive status
Goldwyn> +
Goldwyn> +    sender        receiver        receiver
Goldwyn> +    ACK:CR        ACK:CR          ACK:CR
Goldwyn> +
Goldwyn> + 2. sender get EX of TOKEN
Goldwyn> +    sender get EX of MESSAGE
Goldwyn> +
Goldwyn> +    sender        receiver        receiver
Goldwyn> +    TOKEN:EX      ACK:CR          ACK:CR
Goldwyn> +    MESSAGE:EX
Goldwyn> +    ACK:CR
Goldwyn> +
Goldwyn> +    Sender checks that it still needs to send a message. Messages
Goldwyn> +    received or other events that happened while waiting for the
Goldwyn> +    TOKEN may have made this message inappropriate or redundant.
Goldwyn> +
Goldwyn> + 3. sender write LVB.
Goldwyn> +    sender down-convert MESSAGE from EX to CR
Goldwyn> +    sender try to get EX of ACK
Goldwyn> +    [ wait until all receivers have *processed* the MESSAGE ]
Goldwyn> +
Goldwyn> +    [ triggered by bast of ACK ]
Goldwyn> +    receiver get CR of MESSAGE
Goldwyn> +    receiver read LVB
Goldwyn> +    receiver processes the message
Goldwyn> +    [ wait finish ]
Goldwyn> +    receiver release ACK
Goldwyn> +
Goldwyn> +    sender        receiver        receiver
Goldwyn> +    TOKEN:EX      MESSAGE:CR      MESSAGE:CR
Goldwyn> +    MESSAGE:CR
Goldwyn> +    ACK:EX
Goldwyn> +
Goldwyn> + 4. triggered by grant of EX on ACK (indicating all receivers have
Goldwyn> +    processed the message)
Goldwyn> +    sender down-convert ACK from EX to CR
Goldwyn> +    sender release MESSAGE
Goldwyn> +    sender release TOKEN
Goldwyn> +    receiver upconvert to EX of MESSAGE
Goldwyn> +    receiver get CR of ACK
Goldwyn> +    receiver release MESSAGE
Goldwyn> +
Goldwyn> +    sender        receiver        receiver
Goldwyn> +    ACK:CR        ACK:CR          ACK:CR
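
This algorithm took me a couple of passes, so here is the sender side as I
understand it, written as rough C. lock_res(), convert_res(), unlock_res()
and write_lvb() are invented stand-ins for whatever DLM calls you actually
make, and EX/CR are just the modes from your description; treat this as a
sketch of my reading, not of your implementation:

/*
 * Sketch only: my own restatement of the sender side of the algorithm
 * above, using invented helpers rather than the real DLM interface.
 */
enum mode { CR, EX };

struct resource;                        /* opaque handle: TOKEN, MESSAGE, ACK */

extern void lock_res(struct resource *res, enum mode mode);    /* blocking */
extern void convert_res(struct resource *res, enum mode mode); /* up/down  */
extern void unlock_res(struct resource *res);
extern void write_lvb(struct resource *res, const void *msg, int len);

/* Precondition (step 1): this node already holds ACK in CR mode. */
static void send_cluster_msg(struct resource *token, struct resource *message,
                             struct resource *ack, const void *msg, int len,
                             int (*still_needed)(void))
{
    /* Step 2: only the TOKEN holder may communicate. */
    lock_res(token, EX);
    lock_res(message, EX);

    /* Events while waiting for TOKEN may have made the message redundant. */
    if (!still_needed()) {
        unlock_res(message);
        unlock_res(token);
        return;
    }

    /* Step 3: the payload travels in the LVB of MESSAGE. */
    write_lvb(message, msg, len);
    convert_res(message, CR);   /* receivers can now take CR and read the LVB */
    convert_res(ack, EX);       /* the conflicting EX request delivers a BAST
                                 * on ACK to each receiver; it is granted only
                                 * after they have processed the message and
                                 * released ACK                               */

    /* Step 4: back to the idle state of step 1. */
    convert_res(ack, CR);
    unlock_res(message);
    unlock_res(token);
}

If that matches the code, a short prose summary along those lines before
the lock-state diagrams would make section 3.2 much easier to follow.
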
Goldwyn> +
Goldwyn> +
Goldwyn> +4. Handling Failures
Goldwyn> +
Goldwyn> +4.1 Node Failure
Goldwyn> + When a node fails, the DLM informs the cluster with the slot. The node
>>
>> This needs to be re-worded. The cluster is the entire group of
>> machines, I think you mean:
>>
>> The DLM informs the node with the slot.

Goldwyn> Correct.
>>
>> And is a node failure as simple as a reboot? How about if the entire
>> cluster crashes, how do you know which node is the most up to date
>> and should be the master?

Goldwyn> There is no concept of a master here since everything is distributed.
Goldwyn> We do not want a central dependency. A node failure is its inability
Goldwyn> to respond. It is usually STONITHed (Shoot The Other Node In The Head)
Goldwyn> by the cluster resource manager.

Goldwyn> The concept of the bitmap is that data needs to be synced (that is
Goldwyn> what I had been trying to explain in the point where you mentioned the
Goldwyn> filesystem). In case of a cluster-wide failure, the first node to come
Goldwyn> up performs the "bitmap recovery" for all the bitmaps.
>>
Goldwyn> + starts a cluster recovery thread. The cluster recovery thread:
Goldwyn> + - acquires the bitmap<number> lock of the failed node
Goldwyn> + - opens the bitmap
Goldwyn> + - reads the bitmap of the failed node
Goldwyn> + - copies the set bits to the local node
Goldwyn> + - cleans the bitmap of the failed node
Goldwyn> + - releases the bitmap<number> lock of the failed node
Goldwyn> + - initiates resync of the bitmap on the current node
Goldwyn> +
Goldwyn> + The resync process is the regular md resync. However, in a clustered
Goldwyn> + environment when a resync is performed, it needs to tell other nodes
Goldwyn> + of the areas which are suspended. Before a resync starts, the node
Goldwyn> + sends out RESYNC_START with the (lo,hi) range of the area which
Goldwyn> + needs to be suspended. Each node maintains a suspend_list, which
Goldwyn> + contains the list of ranges which are currently suspended. On
Goldwyn> + receiving RESYNC_START, the node adds the range to the suspend_list.
Goldwyn> + Similarly, when the node performing resync finishes, it sends
Goldwyn> + RESYNC_FINISHED to other nodes and the other nodes remove the
Goldwyn> + corresponding entry from the suspend_list.
Goldwyn> +
Goldwyn> + A helper function, should_suspend(), can be used to check if a
Goldwyn> + particular I/O range should be suspended or not.
Goldwyn> +
Goldwyn> +4.2 Device Failure
Goldwyn> + Device failures are handled and communicated with the metadata
Goldwyn> + update routine.
Goldwyn> +
Goldwyn> +5. Adding a new Device
Goldwyn> +For adding a new device, it is necessary that all nodes "see" the new
Goldwyn> +device to be added. For this, the following algorithm is used:
Goldwyn> +
Goldwyn> + 1. Node 1 issues mdadm --manage /dev/mdX --add /dev/sdYY, which
Goldwyn> +    issues ioctl(ADD_NEW_DISK with disc.state set to
Goldwyn> +    MD_DISK_CLUSTER_ADD)
Goldwyn> + 2. Node 1 sends NEWDISK with uuid and slot number
Goldwyn> + 3. Other nodes issue kobject_uevent_env with uuid and slot number
Goldwyn> +    (Steps 4 and 5 could be a udev rule)
Goldwyn> + 4. In userspace, the node searches for the disk, perhaps
Goldwyn> +    using blkid -t SUB_UUID=""
Goldwyn> + 5. Other nodes issue either of the following depending on whether
Goldwyn> +    the disk was found:
Goldwyn> +    ioctl(ADD_NEW_DISK with disc.state set to MD_DISK_CANDIDATE and
Goldwyn> +          disc.number set to slot number)
Goldwyn> +    ioctl(CLUSTERED_DISK_NACK)
Goldwyn> + 6. Other nodes drop the lock on no-new-devs (CR) if the device is
Goldwyn> +    found
Goldwyn> + 7. Node 1 attempts EX lock on no-new-devs
Goldwyn> + 8. If node 1 gets the lock, it sends METADATA_UPDATED after
Goldwyn> +    unmarking the disk as SpareLocal
Goldwyn> + 9. If it does not get the no-new-dev lock, it fails the operation
Goldwyn> +    and sends METADATA_UPDATED
Goldwyn> + 10. Other nodes learn whether the disk was added or not from the
Goldwyn> +     following METADATA_UPDATED.
Goldwyn> +
Goldwyn> +
Goldwyn> --
Goldwyn> 2.1.2
>>

Goldwyn> --
Goldwyn> Goldwyn
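
P.S. One last sketch, to confirm I follow the suspend_list handling in 4.1.
I imagine should_suspend() boils down to an overlap check against the
(lo,hi) ranges collected from RESYNC_START messages, something like the
illustration below; the type and field names are mine, not necessarily how
you've implemented it:

/* Sketch only: hypothetical types and names, not the md-cluster code. */
#include <stdbool.h>

/* One entry per RESYNC_START received and not yet RESYNC_FINISHED. */
struct suspend_range {
    unsigned long long lo;
    unsigned long long hi;
    struct suspend_range *next;
};

/*
 * Return true if the I/O range [lo, hi] overlaps any range another node
 * has announced it is resyncing, in which case the I/O must wait.
 */
static bool should_suspend(const struct suspend_range *suspend_list,
                           unsigned long long lo, unsigned long long hi)
{
    const struct suspend_range *r;

    for (r = suspend_list; r; r = r->next)
        if (lo <= r->hi && hi >= r->lo)   /* ranges overlap */
            return true;
    return false;
}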