I'm using stgt to export a block device that's replicated in primary/primary mode with DRBD, upon which the initiators will ultimately mount a GFS2 filesystem (after some indirection through CLVM). Basically, what I'm finding is that some nodes get stale cached reads while other nodes get the correct data. The gory details are below...

I have the following systems (VMs, at the moment) set up:

  gfs2-1: data node (stgt exports the DRBD block device)
  gfs2-2: data node (stgt exports the DRBD block device)
  gfs2-3: initiator
  gfs2-4: initiator
  gfs2-5: initiator

I have disabled the write cache on both gfs2-1 and gfs2-2:

[root@gfs2-1 ~]# grep -Pv '^\s*#' /etc/tgt/targets.conf
default-driver iscsi

<target iqn.2010-11.com.polldev:gfs2-1.gfs2>
    backing-store /dev/drbd1
    write-cache off
</target>

DRBD is using protocol "C" (fully synchronous) on /dev/vdb (a virtio disk). All of the initiator machines import the target from both gfs2-1 and gfs2-2, which is then accessed in a multibus fashion via dm-multipath. The idea is that I can reboot or otherwise remove one of the data nodes from service at any time without any other nodes knowing or caring.

Now, this all works swimmingly except that about half the nodes get stale cached data back when they read, depending on which data node they're actually reading from. For example:

  1. gfs2-4 and gfs2-5 read sector n from gfs2-2
  2. gfs2-3 issues a write to sector n on gfs2-1
  3. gfs2-1 commits the write to disk
  4. gfs2-1 replicates the write to gfs2-2 via DRBD

At this point, any attempt to read sector n through gfs2-2 over iSCSI returns the same result that it did during step 1 and does /not/ reflect the data written in step 2. Unfortunately, this means that those reads return incorrect data. Reads issued to gfs2-1 return the correct data.

Now, the especially interesting part is that reading directly from /dev/drbd1 on gfs2-1 or gfs2-2 (avoiding iSCSI) always returns the correct data. Furthermore, if I issue an "echo 3 >/proc/sys/vm/drop_caches" on both gfs2-1 and gfs2-2 (but not on the rest of the nodes), the correct data is returned via iSCSI until more writes occur.

Given the above, I'm fairly certain that the problem is somewhere in tgtd, but to further confirm, I tried exporting the block devices on gfs2-1 and gfs2-2 via GNBD instead of iSCSI, and the problem disappeared: reads always returned the correct value. (There are other issues with GNBD which make me hesitant to go any further with it, but that's neither here nor there.)

My specific test case is to append a '.' to a file every second on one node and to run "watch ls -l" in that directory on every node. As the nodes switch from one path to another in the multipath, some nodes inevitably get back stale data while others get back fresh data. Though, as mentioned above, gfs2-1 and gfs2-2 always get fresh data because they mount /dev/drbd1 directly instead of going through iSCSI.

I had a peek at the code, and it appears that rdwr does not open the file with O_DIRECT. This means that in the case of dual-primary DRBD, and thus the block device changing behind tgtd's back, the page cache would be stale but not invalidated as it should be. Though I haven't tested it (yet), I'm fairly sure that if rdwr opened the file with O_DIRECT, this would work correctly for me.

I also tried replacing gfs2-1 and gfs2-2 with otherwise identically configured RHEL 6 machines, but this produced the same results.
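For what it's worth, the staleness can also be seen without going through tgtd at all, by comparing a buffered read against a direct read of the same region on one of the data nodes (e.g. on gfs2-2 after a write has gone through gfs2-1). This is just a sketch (the skip offset is a placeholder, and it assumes the dd on these hosts supports iflag=direct):

  # buffered read: goes through the page cache, like tgtd's rdwr backing store
  dd if=/dev/drbd1 bs=4096 count=1 skip=1234 2>/dev/null | md5sum

  # direct read: bypasses the page cache
  dd if=/dev/drbd1 bs=4096 count=1 skip=1234 iflag=direct 2>/dev/null | md5sum

If the two checksums differ, and a drop_caches makes them agree again, then the stale copy is sitting in the data node's page cache, which matches the rdwr-without-O_DIRECT theory above.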
I eventually stumbled upon the bs-type config setting. Although I tried mmap, sg, and aio, only aio seemed to work at all. Both mmap and sg failed with an error like this:

  tgtadm: invalid request
  Command:
      tgtadm --lld iscsi --op new --mode logicalunit --tid 1 --lun 1 -b /dev/drbd1 --bstype mmap
  exited with code: 22.

Notably, when using aio, all the nodes appear to get the correct data on read. Unfortunately, when RHEL 5 clients connect to RHEL 6 targets using aio, I'm unable to mount the filesystem (a slew of I/O errors are returned and the system withdraws from the filesystem).

All of the machines are running RHEL 5.5, which means scsi-target-utils is version 0.0-6.20091205snap.el5_5.3 and iscsi-initiator-utils is version 6.2.0.871-0.16.el5. In my brief experiment with RHEL 6, scsi-target-utils was version 1.0.4-3.el6 and iscsi-initiator-utils was version 6.2.0.872-10.el6.

Ultimately, my question is twofold:

1) Is it intentional that rdwr does not use O_DIRECT?
2) Should RHEL 5 clients be able to connect successfully to RHEL 6 targets using aio?

- Neil
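P.S. For clarity, the aio case I'm describing is the same logical unit creation as in the failing command above, just with the backing-store type switched (bs-type aio in targets.conf), which ends up running something like:

  tgtadm --lld iscsi --op new --mode logicalunit --tid 1 --lun 1 \
      -b /dev/drbd1 --bstype aio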