I'm using stgt to export a block device that's replicated in primary/primary mode with DRBD, upon which the initiators will ultimately mount a GFS2 filesystem (after some indirection through CLVM). Basically, what I'm finding is that some nodes get stale cached reads while other nodes get the correct data. The gory details are below...

I have the following systems (VMs, at the moment) set up:

  gfs2-1: data node (stgt exports the DRBD block device)
  gfs2-2: data node (stgt exports the DRBD block device)
  gfs2-3: initiator
  gfs2-4: initiator
  gfs2-5: initiator

I have disabled the write cache on both gfs2-1 and gfs2-2:

[root@gfs2-1 ~]# grep -Pv '^\s*#' /etc/tgt/targets.conf
default-driver iscsi

<target iqn.2010-11.com.polldev:gfs2-1.gfs2>
    backing-store /dev/drbd1
    write-cache off
</target>

DRBD is using protocol "C" (fully synchronous) on /dev/vdb (a virtio disk). All of the initiator machines import the target from both gfs2-1 and gfs2-2, which is then accessed in a multibus fashion via dm-multipath. The idea is that I can reboot or otherwise remove one of the data nodes from service at any time without any other nodes knowing or caring.

Now, this all works swimmingly except that about half the nodes get stale cached data back when they read, depending on which data node they're actually reading from. For example:

  1. gfs2-4 and gfs2-5 read sector n from gfs2-2
  2. gfs2-3 issues a write to sector n on gfs2-1
  3. gfs2-1 commits the write to disk
  4. gfs2-1 replicates the write to gfs2-2 via DRBD

At this point, any attempt to read sector n through gfs2-2 over iSCSI returns the same result that it did during step 1 and does /not/ reflect the data written in step 2. Unfortunately, this means that those reads return incorrect data. Reads issued to gfs2-1 return the correct data.

Now, the especially interesting part is that reading directly from /dev/drbd1 on gfs2-1 or gfs2-2 (avoiding iSCSI) always returns the correct data. Furthermore, if I issue an "echo 3 >/proc/sys/vm/drop_caches" on both gfs2-1 and gfs2-2 (but not on the rest of the nodes), the correct data is returned via iSCSI until more writes occur.

Given the above, I'm fairly certain that the problem is somewhere in tgtd, but to further confirm, I tried exporting the block devices on gfs2-1 and gfs2-2 via GNBD instead of iSCSI, and the problem disappeared: reads always returned the correct value. (There are other issues with GNBD which make me hesitant to go any further with it, but that's neither here nor there.)

My specific test case is to append a '.' to a file every second on one node and to run "watch ls -l" in that directory on every node. As the nodes switch from one path to another in the multipath, some nodes inevitably get back stale data while others get back fresh data. Though, as mentioned above, gfs2-1 and gfs2-2 always get fresh data because they mount /dev/drbd1 directly instead of going through iSCSI.

I had a peek at the code, and it appears that rdwr does not open the file with O_DIRECT. This means that in the case of dual-primary DRBD, and thus the block device changing behind tgtd's back, the page cache would be stale but not invalidated as it should be. Though I haven't tested it (yet), I'm fairly sure that if rdwr opened the file with O_DIRECT, this would work correctly for me.

I also tried replacing gfs2-1 and gfs2-2 with otherwise identically configured RHEL 6 machines, but this produced the same results.
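For what it's worth, the staleness can also be seen without going through tgtd at all, by comparing a buffered read against a direct read of the same region on one of the data nodes (e.g. on gfs2-2 after a write has gone through gfs2-1). This is just a sketch (the skip offset is a placeholder, and it assumes the dd on these hosts supports iflag=direct):

  # buffered read: goes through the page cache, like tgtd's rdwr backing store
  dd if=/dev/drbd1 bs=4096 count=1 skip=1234 2>/dev/null | md5sum

  # direct read: bypasses the page cache
  dd if=/dev/drbd1 bs=4096 count=1 skip=1234 iflag=direct 2>/dev/null | md5sum

If the two checksums differ, and a drop_caches makes them agree again, then the stale copy is sitting in the data node's page cache, which matches the rdwr-without-O_DIRECT theory above.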
I eventually stumbled upon the bs-type config setting. Although I tried mmap, sg, and aio, only aio seemed to work at all. Both mmap and sg failed with an error like this:

  tgtadm: invalid request
  Command:
      tgtadm --lld iscsi --op new --mode logicalunit --tid 1 --lun 1 -b /dev/drbd1 --bstype mmap
  exited with code: 22.

Notably, when using aio, all the nodes appear to get the correct data on read. Unfortunately, when RHEL 5 clients connect to RHEL 6 targets using aio, I'm unable to mount the filesystem (a slew of I/O errors are returned and the system withdraws from the filesystem).

All of the machines are running RHEL 5.5, which means scsi-target-utils is version 0.0-6.20091205snap.el5_5.3 and iscsi-initiator-utils is version 6.2.0.871-0.16.el5. In my brief experiment with RHEL 6, scsi-target-utils was version 1.0.4-3.el6 and iscsi-initiator-utils was version 6.2.0.872-10.el6.

Ultimately, my question is twofold:

1) Is it intentional that rdwr does not use O_DIRECT?
2) Should RHEL 5 clients be able to connect successfully to RHEL 6 targets using aio?

- Neil
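P.S. For clarity, the aio case I'm describing is the same logical unit creation as in the failing command above, just with the backing-store type switched (bs-type aio in targets.conf), which ends up running something like:

  tgtadm --lld iscsi --op new --mode logicalunit --tid 1 --lun 1 \
      -b /dev/drbd1 --bstype aio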