On Fri, 2016-02-26 at 21:25 -0500, Dan Lane wrote:
> Okay, despite my ongoing efforts to resolve this, I am no closer to
> having a stable storage solution.  Here is what I know so far; let's
> figure this out.
>
> First of all, I am NOT alone with my problems: I have a friend who is
> experiencing these same problems using FC.  In fact, I would like to
> know if anyone is actually running this successfully - surely one of
> the developers must have a lab set up to validate the code, right?
>
> Kernel 4.3.4 symptoms:
> Kernel panics randomly (as reported in the past).  Nicholas
> identified a bug that he believed was causing this and created a
> patch.
>
> Kernel 4.5rc? - built from the latest kernel source code with the
> patch from Nicholas about 3 weeks ago

The patch series below was merged for v4.5-rc4, but has not made it
back to the stable tree as of yet:

[PATCH-v4 0/5] Fix LUN_RESET active I/O + TMR handling
http://www.spinics.net/lists/target-devel/msg11822.html

That means you'll still need to use target-pending/4.4-stable until
these patches make their way into the stable kernel tree.

> Runs fine for about a day, then simply stops responding.  There is
> absolutely nothing in /var/log/messages when this happens, and the
> service seems to still be running, but no servers can see the
> storage.  Is there anywhere else I can look at logs?  Is there a way
> to enable more verbose logging?  Additionally, once this happens it
> is impossible to stop the service; even running a kill -9 on the
> process never succeeds.  The only thing that can be done at this
> point is to reboot the target server.
> Oddly, my FC switch still sees the target server.
> Here is what "ps aux | grep target" shows for the process after it
> crashes and I try to stop it; note the "D", i.e. "uninterruptible
> sleep":
> root     17055  0.0  0.0 214848 15444 ?   Ds   19:35   0:00
>   /usr/bin/python3 /usr/bin/targetctl clear

Not sure what you mean by 'crashing' here, but if you're not using
v4.5-rc4+ or the target-pending/4.4-stable branch I prepared for you a
month ago, then you're going to keep hitting the same original bug.

When a task is stuck in uninterruptible sleep like that, the kernel
stack trace of the blocked task is the most useful thing to capture
(see the SysRq example further below).

> Kernel 3.5.0 (old target server, unpatched because I'm afraid of
> breaking target!)
> Runs flawlessly all day long, never fails for any reason, no matter
> what version of ESXi is used

Good, it's helpful to know that the same setup works pre-VAAI.

> I believe the problem is related to the LIO implementation of VAAI
> for the following reason:
> when I used ESXi 5 without VAAI enabled and the 4.3.4 target, I
> didn't have any problems.  When I tried ESXi 5.5 and 6 (with VAAI
> enabled), LIO crashed.  I also don't have any problems with ESXi 6
> against my old target based on kernel 3.5, which is prior to the
> implementation of VAAI.

That's what I've been waiting to hear.  You're probably hitting this
well-known ATS heartbeat bug in ESX v5.5u2 and above:

https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2113956

You'll need to manually disable ATS heartbeat from the ESX host side
on all volumes, as recommended by every major storage vendor (see the
esxcli example below).

> I'm really tired of having my equipment blamed for the problem, or
> the idea that I'm using an old firmware on my FC cards or FC switch.

We don't know what your hardware environment is, and getting all makes
+ models + firmware revs for all host + target HBAs helps us
understand what we're dealing with.  I've still never seen anything
about your host side FC hardware or switch setup, after asking
multiple times.
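For the record, those details can be pulled straight from sysfs on the
Linux target side (paths assume the standard fc_host class, which
qla2xxx and lpfc both register):

   lspci | grep -i "fibre channel"
   cat /sys/class/fc_host/host*/symbolic_name
   cat /sys/class/fc_host/host*/port_name
   cat /sys/class/fc_host/host*/speed

symbolic_name normally includes the HBA model plus driver + firmware
versions, which is exactly the information that's been missing so far.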
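For the ATS heartbeat workaround mentioned above, the KB boils down to
one advanced setting per ESXi 5.5u2+ host (double-check the exact
option name against the KB for your ESXi version):

   # Disable ATS-only heartbeating on VMFS5 datastores
   esxcli system settings advanced set -i 0 -s /VMFS3/useATSForHBOnVMFS5

   # Verify the current value
   esxcli system settings advanced list -o /VMFS3/useATSForHBOnVMFS5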
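And for the hung targetctl process, assuming CONFIG_MAGIC_SYSRQ is
enabled in your kernel config, a blocked-task dump will show where in
the kernel it is stuck:

   # Dump stack traces of all D state tasks into the kernel log
   echo w > /proc/sysrq-trigger
   dmesg

   # Or grab the stack of the specific PID from the ps output above
   # (requires CONFIG_STACKTRACE)
   cat /proc/17055/stack

Posting those traces here would tell us definitively whether you're
still hitting the known LUN_RESET bug or something new.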
> I actually spent a large amount of money building the server that
> I'm using for LIO because of the suggestion in the past that maybe
> the backend wasn't keeping up.  All firmwares have been updated to
> their latest, and I have seen the problem even when using a direct
> connection (no switch).  I've also used three different physical
> servers as the target.  In addition, a friend of mine has had the
> exact same issues, as I mentioned earlier.  One thing that has been
> suggested is that my backend disks can't keep up.  My backend is 20x
> 10k SAS drives in RAID 6 with an Intel 730 SSD acting as read and
> write cache using LSI CacheCade 2.0.  Testing with this setup prior
> to a crash has shown 400MB/s reads and writes (likely being limited
> by the single 4Gb FC connection) and 1-10ms latency; needless to
> say, I don't think the problem is my back end.  Additionally, my old
> target server that never crashes only has 6 old Seagate 7200 RPM
> drives in RAID 6, which are good for about 200MB/s reads and writes.

Like I've explained many times before, ESX is particularly sensitive
to backend device latency.  That is, if ESX keeps repeatedly
generating ABORT_TASKs, it will eventually take the LUN offline.

I've still never seen 'iostat -xm 2' output from your backend while
workloads are running, which would be useful in order to confirm the
average read + write latency of the backend itself (see the example
below).

If the backend device latency is above the default ql2xcmdtimeout, as
previously discussed here:

http://www.spinics.net/lists/target-devel/msg11844.html

then you'll keep seeing ABORT_TASKs, and eventually the ESX host will
take the device offline.

> I'm open to doing just about any troubleshooting that could help.
> Also, as mentioned in the past, I have access to absolutely any
> version of ESXi and I have multiple available hosts, so if you would
> like to test anything related to that I would be happy to assist.

Let's see the full vmkernel.log output from the ESX hosts.
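For the iostat capture mentioned above, something like the following,
recorded while an ESX workload is actually hitting the LUN, is what's
needed (the await columns are average I/O latency in milliseconds;
exact column names vary by sysstat version):

   # Per-device latency + throughput, 2 second intervals, MB units
   iostat -xm 2

   # For comparison, if your qla2xxx build exposes it as a module
   # parameter, the command timeout (in seconds) shows up here:
   cat /sys/module/qla2xxx/parameters/ql2xcmdtimeout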
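On each ESXi host the live log is /var/log/vmkernel.log.  A quick
first pass for the usual failure signatures looks something like this
(adjust the patterns as needed):

   grep -iE 'abort|ats miscompare' /var/log/vmkernel.log

but please attach the full, uncut log rather than just the grep hits.

--nab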