On 06/03/2010 03:41 PM, Vladislav Bolkhovitin wrote:
> Boaz Harrosh, on 06/03/2010 04:07 PM wrote:
>> On 06/03/2010 02:20 PM, Vladislav Bolkhovitin wrote:
>>> There's one interesting problem here, at least theoretically, with SCSI
>>> or similar transports which allow to have commands queue depth >1 and
>>> allowed to internally reorder queued requests. I don't know the FS/block
>>> layers sufficiently well to tell if sending several requests for the
>>> same page really possible or not, but we can see a real life problem,
>>> which can be well explained if it's possible.
>>>
>>> The problem could be if the second (rewrite) request (SCSI command) for
>>> the same page queued to the corresponding device before the original
>>> request finished. Since the device allowed to freely reorder requests,
>>> there's a probability that the original write request would hit the
>>> permanent storage *AFTER* the retry request, hence the data changes it's
>>> carrying would be lost, hence welcome data corruption.
>>>
>>
>> I might be totally wrong here but I think NCQ can reorder sectors but
>> not writes. That is if the sector is cached in device memory and a later
>> write comes to modify the same sector then the original should be
>> replaced not two values of the same sector be kept in device cache at the
>> same time.
>>
>> Failing to do so is a scsi device problem.
>
> SCSI devices supporting Full task management model (almost all) and
> having QUEUE ALGORITHM MODIFIER bits in Control mode page set to 1
> allowed to freely reorder any commands with SIMPLE task attribute. If an
> application wants to maintain order of some commands for such devices,
> it must issue them with ORDERED task attribute and over a _single_ MPIO
> path to the device.
>
> Linux neither uses ORDERED attribute, nor honors or enforces anyhow
> QUEUE ALGORITHM MODIFIER bits, nor takes care to send commands with
> order dependencies (overlapping writes in our case) over a single MPIO path.
>

OK, I take your word for it. But that sounds stupid to me. I would think that
sectors can be ordered, not commands per se. What happens with reads then? Do
they get ordered too? I mean, for a read in between the two writes, which value
is read? It gets so complicated that only a sector model makes sense to me.

>> Please note that page-to-sector is not necessarily constant. And the same page
>> might get written at a different sector, next time. But FSs will have to
>> barrier in this case.
>>
>>> For single parallel SCSI or SAS devices such race may look practically
>>> impossible, but for sophisticated clusters when many nodes pretending to
>>> be a single SCSI device in a load balancing configuration, it becomes
>>> very real.
>>>
>>> The real life problem we can see in an active-active DRBD-setup. In this
>>> configuration 2 nodes act as a single SCST-powered SCSI device and they
>>> both run DRBD to keep their backstorage in-sync. The initiator uses them
>>> as a single multipath device in an active-active round-robin
>>> load-balancing configuration, i.e. sends requests to both nodes in
>>> parallel, then DRBD takes care to replicate the requests to the other node.
>>>
>>> The problem is that sometimes DRBD complains about concurrent local
>>> writes, like:
>>>
>>> kernel: drbd0: scsi_tgt0[12503] Concurrent local write detected!
>>> [DISCARD L] new: 144072784s +8192; pending: 144072784s +8192
>>>
>>> This message means that DRBD detected that both nodes received
>>> overlapping writes on the same block(s) and DRBD can't figure out which
>>> one to store. This is possible only if the initiator sent the second
>>> write request before the first one completed.
>>
>> It is totally possible in today's code.
>>
>> DRBD should store the original command_sn of the write and discard
>> the sector with the lower SN. It should appear as a single device
>> to the initiator.
>
> How can it find the SN?
> The commands were sent over _different_ MPIO
> paths to the device, so at the moment of the sending all the order
> information was lost.
>

I'm not firm on the specifics here. But I think the initiator has either set
the same SN on the two paths, or has incremented it between paths. You said:

> The initiator uses them as a single multipath device in an active-active
> round-robin load-balancing configuration, i.e. sends requests to both nodes
> in parallel.

So what was the SN sent to each side? Is there a relationship between them, or
do they each advance independently?

If there is a relationship, then the targets on the two sides should store the
SN for later comparison. (Life is hard)

> Until SCSI generally allowed to preserve ordering information between
> MPIO paths in such configurations the only way to maintain commands
> order would be queue draining. Hence, for safety all initiators working
> with such devices must do it.
>
> But looks like Linux doesn't do it, so unsafe with MPIO clusters?
>
> Vlad
>

Thanks
Boaz
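
P.S. To make the "store the SN for later comparison" idea concrete, here is a
minimal sketch. It is not DRBD or SCST code. It assumes the initiator stamps
every write with one sequence number shared across both paths (exactly the
relationship asked about above, which may not exist), that the "+8192" in the
log is a byte count, and that sectors are 512 bytes. The names Write, overlaps
and pick_winner are made up for illustration.

```python
from dataclasses import dataclass

SECTOR_SIZE = 512  # assumption: 512-byte sectors


@dataclass(frozen=True)
class Write:
    sn: int      # hypothetical initiator-wide sequence number
    sector: int  # start sector, e.g. 144072784 in the log above
    size: int    # length in bytes, e.g. 8192 in the log above

    @property
    def end_sector(self) -> int:
        # First sector past the end of this write (size rounded up).
        return self.sector + (self.size + SECTOR_SIZE - 1) // SECTOR_SIZE


def overlaps(a: Write, b: Write) -> bool:
    # Two half-open sector ranges overlap if each starts before the other ends.
    return a.sector < b.end_sector and b.sector < a.end_sector


def pick_winner(pending: Write, new: Write):
    """Return the write whose data should survive, or None if the SNs
    carry no ordering information (equal SNs)."""
    if not overlaps(pending, new):
        raise ValueError("writes do not overlap; nothing to resolve")
    if pending.sn == new.sn:
        return None  # ambiguous: the targets cannot tell which came first
    # The higher SN was issued later, so the lower-SN data is discarded.
    return new if new.sn > pending.sn else pending


if __name__ == "__main__":
    pending = Write(sn=41, sector=144072784, size=8192)
    new = Write(sn=42, sector=144072784, size=8192)
    print(pick_winner(pending, new))
```

The sketch ignores SN wraparound, and discarding the whole lower-SN write is
only right when the two ranges are identical, as in the log message. A partial
overlap would need the older write trimmed instead of dropped. If the SNs on
the two paths advance independently, none of this works and queue draining is
what is left.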