On Fri, Jul 23, 2010 at 3:16 PM, Vladislav Bolkhovitin <vst@xxxxxxxx> wrote:
> Gennadiy Nerubayev, on 07/23/2010 09:59 PM wrote:
>>
>> On Thu, Jun 3, 2010 at 7:20 AM, Vladislav Bolkhovitin<vst@xxxxxxxx>
>> wrote:
>>>
>>> James Bottomley, on 06/01/2010 05:27 PM wrote:
>>>>
>>>> On Tue, 2010-06-01 at 12:30 +0200, Christof Schmitt wrote:
>>>>>
>>>>> What is the best strategy to continue with the invalid guard tags on
>>>>> write requests? Should this be fixed in the filesystems?
>>>>
>>>> For write requests, as long as the page dirty bit is still set, it's
>>>> safe to drop the request, since it's already going to be repeated. What
>>>> we probably want is an error code we can return that the layer that sees
>>>> both the request and the page flags can make the call.
>>>>
>>>>> Another idea would be to pass invalid guard tags on write requests
>>>>> down to the hardware, expect an "invalid guard tag" error and report
>>>>> it to the block layer where a new checksum is generated and the
>>>>> request is issued again. Basically implement a retry through the whole
>>>>> I/O stack. But this also sounds complicated.
>>>>
>>>> No, no ... as long as the guard tag is wrong because the fs changed the
>>>> page, the write request for the updated page will already be queued or
>>>> in-flight, so there's no need to retry.
>>>
>>> There's one interesting problem here, at least theoretically, with SCSI
>>> or similar transports which allow to have commands queue depth>1 and allowed
>>> to internally reorder queued requests. I don't know the FS/block layers
>>> sufficiently well to tell if sending several requests for the same page
>>> really possible or not, but we can see a real life problem, which can be
>>> well explained if it's possible.
>>>
>>> The problem could be if the second (rewrite) request (SCSI command) for
>>> the same page queued to the corresponding device before the original request
>>> finished.
>>> Since the device allowed to freely reorder requests, there's a
>>> probability that the original write request would hit the permanent storage
>>> *AFTER* the retry request, hence the data changes it's carrying would be
>>> lost, hence welcome data corruption.
>>>
>>> For single parallel SCSI or SAS devices such race may look practically
>>> impossible, but for sophisticated clusters when many nodes pretending to be
>>> a single SCSI device in a load balancing configuration, it becomes very
>>> real.
>>>
>>> The real life problem we can see in an active-active DRBD-setup. In this
>>> configuration 2 nodes act as a single SCST-powered SCSI device and they both
>>> run DRBD to keep their backstorage in-sync. The initiator uses them as a
>>> single multipath device in an active-active round-robin load-balancing
>>> configuration, i.e. sends requests to both nodes in parallel, then DRBD
>>> takes care to replicate the requests to the other node.
>>>
>>> The problem is that sometimes DRBD complains about concurrent local
>>> writes, like:
>>>
>>> kernel: drbd0: scsi_tgt0[12503] Concurrent local write detected! [DISCARD
>>> L] new: 144072784s +8192; pending: 144072784s +8192
>>>
>>> This message means that DRBD detected that both nodes received
>>> overlapping writes on the same block(s) and DRBD can't figure out which one
>>> to store. This is possible only if the initiator sent the second write
>>> request before the first one completed.
>>>
>>> The topic of the discussion could well explain the cause of that. But,
>>> unfortunately, people who reported it forgot to note which OS they run on
>>> the initiator, i.e. I can't say for sure it's Linux.
>>
>> Sorry for the late chime in, but here's some more information of
>> potential interest as I've previously inquired about this to the drbd
>> mailing list:
>>
>> 1. It only happens when using blockio mode in IET or SCST. Fileio,
>> nv_cache, and write_through do not generate the warnings.
>
> Some explanations for those who not familiar with the terminology:
>
> - "Fileio" means Linux IO stack on the target receives IO via
> vfs_readv()/vfs_writev()
>
> - "NV_CACHE" means all the cache synchronization requests
> (SYNCHRONIZE_CACHE, FUA) from the initiator are ignored
>
> - "WRITE_THROUGH" means write through, i.e. the corresponding backend file
> for the device open with O_SYNC flag.
>
>> 2. It happens on active/passive drbd clusters (on the active node
>> obviously), NOT active/active. In fact, I've found that doing round
>> robin on active/active is a Bad Idea (tm) even with a clustered
>> filesystem, until at least the target software is able to synchronize
>> the command state of either node.
>> 3. Linux and ESX initiators can generate the warning, but I've so far
>> only been able to reliably reproduce it using a Windows initiator and
>> sqlio or iometer benchmarks. I'll be trying again using iometer when I
>> have the time.
>> 4. It only happens using a random write io workload (any block size),
>> with initiator threads>1, OR initiator queue depth>1. The higher
>> either of those is, the more spammy the warnings become.
>> 5. The transport does not matter (reproduced with iSCSI and SRP)
>> 6. If DRBD is disconnected (primary/unknown), the warnings are not
>> generated. As soon as it's reconnected (primary/secondary), the
>> warnings will reappear.
>
> It would be great if you prove or disprove our suspicions that Linux can
> produce several write requests for the same blocks simultaneously. To be
> sure we need:
>
> 1. The initiator is Linux. Windows and ESX are not needed for this
> particular case.
>
> 2. If you are able to reproduce it, we will need full description of which
> application used on the initiator to generate the load and in which mode.
>
> Target and DRBD configuration doesn't matter, you can use any.
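
To make the condition under test concrete: the warning corresponds to a
second write being submitted for a sector range that overlaps a write
which has not completed yet. Below is a minimal sketch of that check in
Python. It is only an illustrative model, not DRBD's or the kernel's
code. The names (Write, overlaps, InFlightTracker) are made up for this
example, and it assumes 512-byte sectors and that the "144072784s +8192"
fields in the log mean "start sector" and "length in bytes".

```python
from dataclasses import dataclass

SECTOR_SIZE = 512  # assumption: 512-byte sectors


@dataclass(frozen=True)
class Write:
    """Hypothetical model of one write request."""
    sector: int  # starting sector, e.g. 144072784
    size: int    # length in bytes, e.g. 8192

    @property
    def end_sector(self) -> int:
        # Exclusive end sector, rounding a partial sector up.
        return self.sector + (self.size + SECTOR_SIZE - 1) // SECTOR_SIZE


def overlaps(a: Write, b: Write) -> bool:
    """True if the two writes touch at least one common sector."""
    return a.sector < b.end_sector and b.sector < a.end_sector


class InFlightTracker:
    """Tracks writes that were submitted but have not completed yet."""

    def __init__(self) -> None:
        self._pending = []

    def submit(self, w: Write) -> list:
        # Return the pending writes the new one conflicts with,
        # then record the new write as pending.
        conflicts = [p for p in self._pending if overlaps(p, w)]
        self._pending.append(w)
        return conflicts

    def complete(self, w: Write) -> None:
        # Drop one pending entry equal to w (same sector and size).
        self._pending.remove(w)


if __name__ == "__main__":
    tracker = InFlightTracker()
    first = Write(sector=144072784, size=8192)
    second = Write(sector=144072784, size=8192)
    tracker.submit(first)
    # 'first' has not completed, so this submission conflicts with it.
    for pending in tracker.submit(second):
        print(f"concurrent write: new {second} vs pending {pending}")
```

If the initiator always waits for the first write to complete before
sending the rewrite, submit() never returns a conflict in this model,
which is the behaviour a Linux initiator would have to violate for the
suspicion above to hold.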

I just tried, and this particular DRBD warning is not reproducible with
I/O (iometer) coming from a Linux initiator (2.6.30.10). The same iometer
parameters were used as on Windows, and both the base device and a
filesystem (ext3) were tested; both came back negative. I'll try a few
more tests, but it seems that this is a non-issue with a Linux initiator.

Hope that helps,

-Gennadiy