LIO iSCSI memory usage issue when running read IOs w/ IOMeter.

Hi Nicholas, all,

Here is a recap of our issue:

* Running iSCSI read IOs against LIO with IOMeter (version 2006.07.27) on Windows, with a queue depth (IOs/target) of 10 or above, causes memory usage on the target to grow nearly as large and as fast as the read IOs being served, with IO performance degrading in proportion to the amount of extra memory used (especially visible over a fast 10GbE link). The extra memory used rarely goes over 1 to 3GB, after which it is suddenly released back to its original level, at which point the cycle restarts.

This is probably not very clear, so here is a bit more detail:

Free memory starts at 3.8GB.
Running read IOs over a GbE link at 100MB/sec, free memory decreases by ~100MB per second.
Ten seconds or so later, free memory reaches 2.8GB.
The next second, free memory has gone back to 3.8GB (recovered).
The cycle above then restarts continuously.

Some more detail about the conditions needed to reproduce the issue:

* The issue only happens on 3.5+ kernels. Works fine on 3.4 kernels.
* The issue only happens when using IOMeter; it does not happen using xdd with similar IO settings.
* The IO settings to reproduce the issue are: read IOs, 10 IOs per target (queue depth: 10), sequential, 1MB IOs (IOMeter config file attached; a rough fio equivalent is sketched after this list).
* The issue is intermittent, but happens more often than not (~90% of the time so far).
* The issue happens on both 32-bit and 64-bit OSes.
* The issue does not seem to happen on Fibre Channel, only iSCSI.
* If the memory usage goes beyond the free memory, Linux's OOM killer is invoked.
* Swap is never used.
* Inactive (page cache) memory is not used (inactive memory does not grow).
* Nothing suspicious growing out of proportion is seen in slabtop (slabtop output available in the file archive).
* After stopping IOs, the memory usage returns to the same pre-IOs level.
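
For what it's worth, here is a rough Linux-side approximation of that IOMeter profile using fio. This is only a sketch (our testing used IOMeter on Windows and xdd, not fio), so the device path and runtime below are placeholders:

    # Hypothetical fio job approximating the IOMeter settings above
    # (1MB sequential reads, queue depth 10). /dev/sdX stands in for the
    # iSCSI disk as seen by the initiator.
    fio --name=seqread --filename=/dev/sdX --rw=read --bs=1M \
        --iodepth=10 --ioengine=libaio --direct=1 \
        --runtime=60 --time_based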

Regarding the kernel versions, here's what we tried:

* Linux 3.4 (commit 76e10d158efb6d4516018846f60c2ab5501900bc): works fine; when running read IOs over iSCSI, memory usage does *not* go up.
* Ubuntu 12.10 Server (x86_64) with its default 3.5 kernel (Linux ubuntu-redintel 3.5.0-17-generic): memory usage does grow.
* Linux 3.6.6 (commit 3820288942d1c1524c3ee85cbf503fee1533cfc3): memory usage does grow.
* Linux 3.8.8 (commit 19f949f52599ba7c3f67a5897ac6be14bfcb1200): memory usage does grow.
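
Since the regression shows up between v3.4 and v3.5, a plain git bisect of the kernel tree should be able to pin down the offending commit. This is only a sketch, assuming each candidate kernel gets built, booted and tested against the IOMeter read workload:

    # Mark the known-bad and known-good releases, then test each kernel
    # git suggests and mark the result until the first bad commit is found.
    git bisect start
    git bisect bad v3.5      # memory usage grows
    git bisect good v3.4     # memory usage stays flat
    # ... build, boot, run the read workload, check /proc/meminfo, then:
    git bisect good          # or: git bisect bad
    # repeat as prompted; once done:
    git bisect reset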

While we did reproduce the issue on a target running over an mcp_ramdisk backstore, we did most of our testing over an IBLOCK backstore created on top of a RAID0 of 9 15K enterprise SAS drives. The issue did occur on an mcp_ramdisk, but note that an IBLOCK backstore had previously been created during the same session and was later replaced with the mcp_ramdisk. We then had trouble recreating the problem with an mcp_ramdisk on its own, though keep in mind that this could be explained by the slightly intermittent nature of the problem. So while I *think* the issue should happen regardless of the backstore, if you have trouble reproducing it I would suggest trying an IBLOCK backstore first.
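
For reference, the sketch below shows roughly how the two backstores could be recreated; the device names, the ramdisk size and the exact targetcli arguments are assumptions from memory rather than the precise commands we ran (the attached targetcli_config is the authoritative record):

    # RAID0 across nine SAS drives (device names are placeholders):
    mdadm --create /dev/md0 --level=0 --raid-devices=9 /dev/sd[b-j]

    # Inside the targetcli shell: an IBLOCK backstore on top of the RAID0
    # device, and an mcp_ramdisk backstore (argument syntax is assumed and
    # may differ between targetcli versions):
    /backstores/iblock create name=raid0_lun dev=/dev/md0
    /backstores/rd_mcp create name=ram_lun size=1GB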

To see if the issue occurs, simply run "watch -n1 head /proc/meminfo" and observe the free memory dropping fast as IOs are being run.
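
If numbers are more useful than a live view, a trivial shell loop like the one below (just a convenience sketch, nothing LIO-specific) logs MemFree once per second so the drop-and-recover cycle can be compared across runs:

    # Append a timestamped MemFree sample to memfree.log every second.
    while true; do
        echo "$(date +%s) $(grep MemFree /proc/meminfo)"
        sleep 1
    done >> memfree.log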

Here's a link to a file archive containing various debug information:

https://www.dropbox.com/s/gpb4vh82bcsbi75/lio_iscsi_captures.tar.gz

Here is what each file in the archive corresponds to:

iometer.1M.qdepth10.read.seq.pcap: 10MB Wireshark capture of the traffic to and from the LIO target while the issue was occurring.
iometer.icf: IOMeter config used to reproduce the issue.
1M.qdepth10.read.seq.iostat: dump of "iostat -xm 2" on the Ubuntu target system while the issue was occurring.
1M.qdepth10.read.seq.slabtop: sample output of slabtop while the issue was occurring. slabtop refreshes every few seconds, but I could not spot any values out of the ordinary or using a large amount of RAM (no cache using more than 20MB, and most under 15MB).
targetcli_config: copy of the targetcli config after running "saveconfig" in targetcli.
iblock_backstore: details about the Linux RAID backstore configuration and its disk members.

Something of interest: the memory used is not directly proportional to the number of initiators nor to the speed of the IOs. Running read IOs from one initiator over a GbE link at 100MB/sec loses memory at ~100MB/sec for a total of ~1GB, while running read IOs from two more initiators, one of them on a 10GbE link, at an aggregate speed of ~400-500MB/sec loses memory only slightly faster than 100MB/sec, for a total of ~2GB of RAM.

Also, when an initiator stops running IOs, a larger drop in memory usage is observed over the next second or so, perhaps as the pending IOs from that initiator are all flushed (pure supposition, based on nothing concrete).

Another odd thing is that the issue is intermittent. The pattern behind the intermittence is hard to pin down (most test runs hit the issue, and when one didn't, a restart on both sides would usually make it happen). However, when testing with three separate initiators against three separate targets on the same system (one on a 10GbE link, the second and third on two GbE links; three separate physical initiators running different versions of Windows and IOMeter), the issue was happening, but if the two GbE initiators were stopped, the 10GbE IOs would "stabilize" and memory usage would become constant again. Restarting the other two initiators would restart the issue. I could reproduce this without fail during that particular session.

In conclusion, most of the time, when using IOMeter to run intensive (high queue depth, large IOs) read IOs, the issue will appear. The fact that it never happened on 3.4 and always happens from 3.5 onwards points to a change introduced in between. It looks as though LIO is using the memory for some sort of caching and releasing it later than it should. The fact that it recovers after a few seconds is also interesting.

That's quite a lot of information so I hope this is not too confusing.

I hope you can manage to reproduce the issue on your side so that you can see with more clarity what the issue is about.

Thanks a million in advance for your help!

Regards,

Ben@MPSTOR

