Re-sending in plain text mode.

On Sat, Apr 30, 2016 at 12:34 AM, DAVID S <shraderdm@xxxxxxxxx> wrote:
> Hi,
>
> I would like to add that I have also seen these same issues, and I can
> replicate the exact same problems by repeating the steps that Dan noted
> in his original email.
>
> I also have a similar setup (multiple hosts running ESXi with fibre
> channel HBAs, and a storage server running the LIO target on kernel
> 4.5). I too have tested multiple configurations, e.g. with fibre channel
> switches between the hosts and storage, direct connections between the
> storage server and hosts, and every possible mix of the two. I have
> implemented all the changes that the developers have noted in these
> emails, such as disabling the VAAI heartbeat (see the command sketch at
> the end of this thread), and my target still crashes regularly.
>
> I am also a consultant (for Red Hat, not VMware), and I use this system
> to test architecture designs for customers. I have moved to using only
> local storage on my hosts because the unpredictable LIO crashes were
> hindering my ability to do anything in my environment.
>
> Even if the answer is to remove the support for VAAI in the name of
> reliability, I would definitely prefer having a working target over an
> unreliable one that supports VAAI.
>
> I am very willing to assist with testing this as well, and I can provide
> logs, etc. Please let me know what I can do to help; I think everyone
> would appreciate this being fixed.
>
> Thanks,
> David
>
> TL;DR Can we work together to get this fixed? Can we get a fork for
> testing without VAAI?
>
> On Fri, Apr 29, 2016 at 10:49 AM, Dan Lane <dracodan@xxxxxxxxx> wrote:
>>
>> On Tue, Apr 26, 2016 at 11:27 PM, Dan Lane <dracodan@xxxxxxxxx> wrote:
>> > On Tue, Apr 26, 2016 at 11:24 PM, Dan Lane <dracodan@xxxxxxxxx> wrote:
>> >>
>> >> On Apr 24, 2016 11:50 PM, "Dan Lane" <dracodan@xxxxxxxxx> wrote:
>> >>>
>> >>> Nick,
>> >>> I tried raising the timeouts (to double the original) and also
>> >>> increasing/decreasing the queue depth; neither seemed to make a
>> >>> difference, although at this point the infrequency of the crashes
>> >>> has made troubleshooting quite challenging.
>> >>>
>> >>> I recently had an idea though - I've always felt like it was the
>> >>> target giving up, not the ESXi hosts, but I couldn't think of a
>> >>> good way to prove this before. It occurred to me that I could test
>> >>> with a subset of hosts powered on until the target became
>> >>> unavailable, then power those hosts off before powering on a
>> >>> different host that would be unaware of the other hosts deciding
>> >>> the link was unavailable. The result was that the host I powered on
>> >>> alone did NOT see the storage! Once I rebooted the target server,
>> >>> the host in question could see the storage. If you can think of a
>> >>> way that the information about the target being unusable would be
>> >>> passed to the host that was only powered on later (other than
>> >>> vCenter, which was powered off), I would love to hear it. I
>> >>> honestly may be missing something here, but I can't think of
>> >>> anything that would cause this other than the storage server
>> >>> failing.
>> >>>
>> >>> Thanks,
>> >>> Dan
>> >>>
>> >>> On Thu, Mar 31, 2016 at 2:48 AM, Nicholas A. Bellinger
>> >>> <nab@xxxxxxxxxxxxxxx> wrote:
>> >>> > On Wed, 2016-03-30 at 15:01 -0400, Dan Lane wrote:
>> >>> >> Nicholas,
>> >>> >> Can you please take another look at the fiber channel code and
>> >>> >> see if you can find the cause of the problems with ESXi/VAAI?
>> >>> >
>> >>> > There are two options at this point to get to the root cause of
>> >>> > why your ESX hosts are continuously generating I/O timeouts and
>> >>> > subsequent ABORT_TASKs, even when the host is idle.
>> >>> >
>> >>> > First, have you looked into changing the ESX host FC LLD side
>> >>> > timeout and queue depth that I asked about earlier here..?
>> >>> >
>> >>> > http://www.spinics.net/lists/target-devel/msg11844.html
>> >>> >
>> >>> > What effect does reducing the queue depth and increasing the
>> >>> > timeout have on the rate at which ABORT_TASKs are being
>> >>> > generated..?
>> >>> >
>> >>> > Beyond that, you'll need to engage the QLogic folks (CC'ed)
>> >>> > directly to generate a firmware dump on both the host and target
>> >>> > sides for them to analyze and figure out what's going on.
>> >>> >
>> >>> > Are you using QLogic HBAs on the ESX host side as well..?
>> >>> >
>> >>
>> >> Resending this in case it was ignored because I got excited and
>> >> forgot to correct the automatic top posting that Google does... Yep,
>> >> totally blaming Google there!
>> >>
>> >> Nick,
>> >> I tried raising the timeouts (to double the original) and also
>> >> increasing/decreasing the queue depth; neither seemed to make a
>> >> difference, although at this point the infrequency of the crashes has
>> >> made troubleshooting quite challenging.
>> >>
>> >> I recently had an idea though - I've always felt like it was the
>> >> target giving up, not the ESXi hosts, but I couldn't think of a good
>> >> way to prove this before. It occurred to me that I could test with a
>> >> subset of hosts powered on until the target became unavailable, then
>> >> power those hosts off before powering on a different host that would
>> >> be unaware of the other hosts deciding the link was unavailable. The
>> >> result was that the host I powered on alone did NOT see the storage!
>> >> Once I rebooted the target server, the host in question could see the
>> >> storage. If you can think of a way that the information about the
>> >> target being unusable would be passed to the host that was only
>> >> powered on later (other than vCenter, which was powered off), I would
>> >> love to hear it. I honestly may be missing something here, but I
>> >> can't think of anything that would cause this other than the storage
>> >> server failing.
>> >>
>> >> Thanks,
>> >> Dan
>> >
>> > And then my phone decides to convert it to HTML, causing it to get
>> > dropped... ugh
>>
>> What does it take to get a response to the messages I send about this?
>> I felt like I had a pretty concrete idea about getting to the bottom of
>> the problem, and yet I'm being completely ignored... Do you simply have
>> no intention of resolving this issue? Is this the same way your paying
>> customers are treated? I have literally been trying to get the Fiber
>> Channel target working reliably for YEARS! It was November 2013, when
>> kernel 3.12 was released with an attempt at implementing VAAI, that
>> everything went from reliable and great to horribly broken. Since then
>> I have heard everything blamed for my problems - you said it was my
>> hosts, my storage server, my RAID cards, my disk drives, my FC cards,
>> my FC switches, ESXi... and so I replaced it all, even to the point of
>> building an entirely new storage server and buying new switches and
>> hosts - yeah, I've spent a LOT on this.
>> I then sat here and beta tested kernel after kernel, updated firmwares,
>> drivers, etc., spent countless hours trying different settings, and
>> worked with friends to get them trying this to see if they had the same
>> results (and they did). There has been a ton of finger pointing,
>> something that I, as a consultant in the field, do everything to
>> avoid... It never resolves anything, it just makes people lose
>> confidence in the person pointing the finger.
>>
>> My biggest question is - do the developers of this stuff even test it?
>> And I don't mean in simulations, I mean do you build out a test system
>> with Fiber Channel cards and ESXi (the only thing that uses VAAI) and
>> rigorously test this? I know the answer: you couldn't possibly be doing
>> this, otherwise you would see the problems first hand. I've even tried
>> offering access to my system to the developers so you could truly test
>> this, since I know hardware isn't cheap (actually that's not true,
>> Fiber Channel IS cheap and ESXi is free), but you never took me up on
>> that offer. I have also mentioned that I work for VMware, so I have
>> access to internal-only documentation and the VMware engineering team;
>> I could literally be your best resource to get this resolved, but
>> instead I'm being snubbed.
>>
>> This leads me to believe one thing - you don't care about users of the
>> open source LIO, you only care about the paying users of your Datera
>> products. This is exactly the situation everyone feared when it was
>> decided to have a commercial company be responsible for an open source
>> component of the Linux kernel. If you don't like my conclusion, do
>> something to change it; even something as simple as replying to my
>> email saying that you're looking at it would at least ease my feeling
>> of being ignored.
>>
>> Dan
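
For reference, the host- and target-side knobs mentioned in this thread
(disabling the VAAI/ATS heartbeat that David refers to, lowering the QLogic
queue depth that Nicholas asks about, and gathering extra logging ahead of a
firmware dump) map roughly to the commands below. This is only a hedged
sketch based on the stock ESXi advanced settings and qla2xxx module
parameters; it has not been verified against the exact setups described
above, option names can differ between ESXi builds and drivers, and the
queue-depth value of 32 is just an example.

    # On each ESXi host (ESXi shell): disable the VAAI offload primitives and
    # the ATS-only heartbeat (the "VAAI heartbeat"). Verify the exact option
    # names on your build with "esxcli system settings advanced list" first.
    esxcli system settings advanced set --int-value 0 --option /DataMover/HardwareAcceleratedMove
    esxcli system settings advanced set --int-value 0 --option /DataMover/HardwareAcceleratedInit
    esxcli system settings advanced set --int-value 0 --option /VMFS3/HardwareAcceleratedLocking
    esxcli system settings advanced set --int-value 0 --option /VMFS3/useATSForHBOnVMFS5

    # On each ESXi host: lower the QLogic HBA queue depth. This form is for
    # the legacy qla2xxx driver; with the newer qlnativefc driver, check the
    # equivalent parameter name with "esxcli system module parameters list
    # -m qlnativefc". A reboot is required for the change to take effect.
    esxcli system module parameters set -m qla2xxx -p "ql2xmaxqdepth=32"

    # On the LIO target server: turn up qla2xxx extended error logging while
    # reproducing, so there is kernel-log context to go with any firmware
    # dump. If the parameter is not writable at runtime on your kernel, set
    # "options qla2xxx ql2xextended_error_logging=0x7fffffff" in
    # /etc/modprobe.d/ and reload the driver instead.
    echo 0x7fffffff > /sys/module/qla2xxx/parameters/ql2xextended_error_logging

The ESXi advanced options take effect immediately and can be confirmed with
"esxcli system settings advanced list -o <option name>".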