Thank you for the clarification. To make sure I fully understand: an
application that requires consistency should use O_DIRECT and enforce an
R/W lock on top of the mirrored device?

Asaf

On Sun, Mar 19, 2023 at 11:55 AM Geoff Back <geoff@xxxxxxxxxxxxxxx> wrote:
>
> Hi Asaf,
>
> Yes, in principle there are all sorts of cases where you can perform a
> read of newly written data that is not yet on the underlying disk and
> hence the possibility of reading the old data following recovery from an
> intervening catastrophic event (such as a crash). This is a fundamental
> characteristic of write caching and applies with any storage device and
> any write operation where something crashes before the write is complete
> - you can get this with a single disk or SSD without having RAID in the
> mix at all.
>
> The correct and only way to guarantee that you can never have your
> "consistency issue" is to flush the write through to the underlying
> devices before reading. If you explicitly flush the write operation
> (which will block until all writes are complete on A, B, M) and the
> flush completes successfully, then all reads will be of the new data and
> there is no consistency issue.
>
> Your scenario describes a concern for the higher-level code, not for the
> storage system. If your application needs to be absolutely certain that,
> even after a crash, you cannot end up reading old data having previously
> read new data, then it is the responsibility of the application to flush
> the writes to the storage before executing the read. You would also
> need to ensure that the application cannot read the data between
> write and flush; there are several different ways to achieve that
> (O_DIRECT may be helpful). Alternatively, you might want to look at
> using something other than the disk for your data interchange between
> processes.
>
> Regards,
>
> Geoff.
>
> Geoff Back
> What if we're all just characters in someone's nightmares?
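[Editor's note: the flush-then-read discipline Geoff describes could be sketched roughly as below. This is a minimal illustration, not md driver code; the file path, lock, and function names are all hypothetical, and a real application would want a proper reader/writer lock rather than a plain mutex.]

```python
# Sketch: the writer holds the lock across write + fsync, so no reader
# can observe data that has not yet reached the underlying devices.
# os.fsync() blocks until the kernel has flushed the data to the device
# (for an md RAID1, that means all active mirror legs).
import os
import threading

_rw_lock = threading.Lock()  # stand-in for a real R/W lock

def durable_write(path: str, data: bytes) -> None:
    with _rw_lock:
        fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
        try:
            os.write(fd, data)
            os.fsync(fd)  # do not release the lock until the flush completes
        finally:
            os.close(fd)

def read_back(path: str) -> bytes:
    with _rw_lock:
        with open(path, "rb") as f:
            return f.read()
```

Under this scheme a reader can never see data that is not yet durable, so a crash of one mirror leg cannot make previously-read data "disappear".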
>
> On 19/03/2023 09:13, Asaf Levy wrote:
> > Hi John,
> >
> > Thank you for your quick response, I'll try to elaborate further.
> > What we are trying to understand is if there is a potential race
> > between reads and writes when mirroring 2 devices.
> > This is unrelated to the fact that the write was not acked.
> >
> > The scenario is: let's assume we have a reader R and a writer W and 2
> > MD devices A and B. A and B are managed under a device M which is
> > configured to use A and B as mirrors (RAID 1). Currently, we have some
> > data on A and B, let's call it V1.
> >
> > W issues a write (V2) to the managed device M.
> > The driver sends the write both to A and B at the same time.
> > The write to device A (V2) completes.
> > R issues a read to M which directs it to A and returns the result (V2).
> > Now the driver and device A fail at the same time before the write
> > ever gets to device B.
> >
> > When the driver recovers, all it is left with is device B, so future
> > reads will return older data (V1) than the data that was returned to
> > R.
> >
> > Thanks,
> > Asaf
> >
> > On Fri, Mar 17, 2023 at 10:58 PM John Stoffel <john@xxxxxxxxxxx> wrote:
> >>>>>>> "Ronnie" == Ronnie Lazar <ronnie.lazar@xxxxxxxxxxxx> writes:
> >>
> >>> I'm trying to understand how mdadm protects against inconsistent data
> >>> read in the face of failures that occur while writing to a device that
> >>> has raid1.
> >> You need to give a better test case, with examples.
> >>
> >>> Here is the scenario: I have set up raid1 that has 2 mirrors. First
> >>> one is on local storage and the second is on remote storage. The
> >>> remote storage mirror is configured with write-mostly.
> >> Configuration details? And what is the remote device?
> >>
> >>> We have parallel jobs: 1 writing to an area on the device and the
> >>> other reading from that area.
> >> So you create /dev/md9 and are writing/reading from it, then the
> >> system crashes and you lose the local half of the mirror, right?
> >>
> >>> The write operation writes the data to the first mirror, and at that
> >>> point the read operation reads the new data from the first mirror.
> >> So how is your write succeeding if it's not written to both halves of
> >> the MD device? You need to give more details and maybe even some
> >> example code showing what you're doing here.
> >>
> >>> Now, before data has been written to the second (remote) mirror, a
> >>> failure has occurred which caused the first machine to fail. When
> >>> the machine comes up, the data is recovered from the second, remote,
> >>> mirror.
> >> Ah... some more details. It sounds like you have a system A which is
> >> writing to a SITE local device as well as a REMOTE site device
> >> in the MD mirror, is this correct?
> >>
> >> Are these iSCSI devices? FibreChannel? NBD devices? More details
> >> please.
> >>
> >>> Now when reading from this area, the users will receive the older
> >>> value, even though, in the first read, they got the newer value that
> >>> was written.
> >>> Does mdadm protect against this inconsistency?
> >> It shouldn't be returning success on the write until both sides of the
> >> mirror are updated. But we can't really tell until you give more
> >> details and an example.
> >>
> >> I assume you're not building a RAID1 device and then writing to the
> >> individual devices behind its back or something silly like that,
> >> right?
> >>
> >> John
> >>
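[Editor's note: the race Asaf lays out earlier in the thread can be modeled as a toy timeline. This is purely illustrative; the class and method names are invented for the sketch and do not correspond to md driver internals.]

```python
# Toy model of the RAID1 read-after-write anomaly: a write reaches leg A
# but not leg B, a read observes the new value from A, then A is lost and
# recovery serves the old value from B.

class Mirror:
    def __init__(self, value: str) -> None:
        self.a = value        # contents of mirror leg A
        self.b = value        # contents of mirror leg B
        self.a_alive = True

    def write_to_a_only(self, value: str) -> None:
        # Models the in-flight window: the write completed on A,
        # but has not yet reached B.
        self.a = value

    def read(self) -> str:
        # M serves reads from A while it is alive, otherwise from B.
        return self.a if self.a_alive else self.b

    def crash_a(self) -> None:
        self.a_alive = False

m = Mirror("V1")
m.write_to_a_only("V2")   # W's write lands on A only
first_read = m.read()     # R reads V2 from A
m.crash_a()               # driver and A fail before B is updated
second_read = m.read()    # after recovery, only B (still V1) remains
```

The pair (first_read, second_read) comes out as ("V2", "V1"): the second read returns older data than the first, which is exactly the inconsistency the thread is discussing, and why Geoff's answer is that the application must flush before allowing reads.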