On Sun, 2016-02-28 at 14:13 -0500, Dan Lane wrote:
> >> Runs fine for about a day, then simply stops responding. There is
> >> absolutely nothing in the /var/log/messages when this happens, and the
> >> service seems to still be running, but no servers can see the storage.
> >> Is there anywhere else I can look at logs? Is there a way to enable
> >> more verbose logging? Additionally, once this happens it is
> >> impossible to stop the service, even running a kill -9 on the process
> >> never succeeds, the only thing that can be done at this point is to
> >> reboot the target server.
> >> Oddly, my FC switch still sees the target server.
> >> Here is what "ps aux | grep target" shows for the process after it
> >> crashes and I try to stop it, note the "D"ie "uninterruptible sleep":
> >> root 17055 0.0 0.0 214848 15444 ? Ds 19:35 0:00
> >> /usr/bin/python3 /usr/bin/targetctl clear
> >>
> >
> > Not sure what you mean by 'crashing' here, but if your not using
> > v4.5-rc4+ or target-pending/4.4-stable I prepared for you a month ago,
> > then you going to keep hitting the same original bug.
> >
>
> Unfortunately I'm about to leave town for a few weeks, so I have very
> little time to look at this. That said, let's talk about this... I
> built the latest kernel using linux-next as well as the torvalds build
> git last night. Here are the commands I used (in case you see any
> problems).
>
> git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
> git remote add linux-next
> git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
> cd linux
> git remote add linux-next
> git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
> git fetch linux-next
> git fetch --tags linux-next
> cp /boot/config-4.3.4-300.fc23.x86_64 .config
> make oldconfig
> make -j8 bzImage; make -j8 modules; make -j8 modules_install; make -j8 install
>
> This resulted in a functioning 4.5rc5+ kernel.
> a matter of hours
> later the storage once again disappeared from my ESXi hosts. I
> understand there may be things I need to tweak on my hosts, but should
> those things cause LIO to stop responding from the target server?
> It's back to acting the same exact way as before (with the
> target-pending/4.4-stable from a month ago), I can't stop the service
> or kill the process.
>
> # uname -a
> Linux dracofiler.home.lan 4.5.0-rc5+ #2 SMP Sat Feb 27 15:22:25 EST
> 2016 x86_64 x86_64 x86_64 GNU/Linux
>

You don't need to keep updating the kernel.

As per the reply to David, you'll need to either explicitly disable the
ESX side ATS heartbeat to avoid this well-known ESX bug that affects
every target w/ VAAI as per VMWare's own -kb, or set emulate_caw=0 to
disable AtomicTestandSet altogether:

http://permalink.gmane.org/gmane.linux.scsi.target.devel/11574

To repeat, there are no target side changes to avoid this well-known ESX
5.5u2+ host bug.

You need to either disable ATS heartbeat on the ESX 5.5u2+ host side, or
disable COMPARE_AND_WRITE altogether.

> # ps aux | grep target
> root 2531 0.2 0.0 214848 15500 ? Ds 11:48 0:00
> /usr/bin/python3 /usr/bin/targetctl clear
>
> This is what I mean by "crashes", targetctl hangs here and there is
> absolutely nothing I've been able to do to get it to start responding
> again other than rebooting.
>

So it looks like targetcli-fb is able to reach uninterruptible sleep.

Send along your 'cat /proc/$PID/stack' output for the hung targetcli
task, and let's find out what -fb user-space is doing differently for
qla2xxx.
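
For reference, a rough sketch of the two target side steps mentioned
above: setting emulate_caw=0 on every backstore, and grabbing the kernel
stack of any task stuck in "D" state. It assumes the usual configfs
layout under /sys/kernel/config/target/core/$HBA/$DEV/attrib/ and a
kernel that exposes /proc/$PID/stack. The helper names and the
TARGET_CORE / PROC_ROOT variables are made up for illustration, so adapt
as needed and run as root.

```shell
#!/bin/sh
# Sketch only. TARGET_CORE and PROC_ROOT are hypothetical overrides so the
# helpers can be pointed at a scratch directory; defaults are the usual paths.
TARGET_CORE="${TARGET_CORE:-/sys/kernel/config/target/core}"
PROC_ROOT="${PROC_ROOT:-/proc}"

# Write 0 to emulate_caw for every configured backstore device.
disable_caw() {
    for attr in "$TARGET_CORE"/*/*/attrib/emulate_caw; do
        [ -e "$attr" ] || continue      # glob matched nothing
        echo 0 > "$attr" && echo "$attr -> $(cat "$attr")"
    done
}

# Print the PIDs of tasks in uninterruptible sleep (State: D).
list_d_state() {
    for status in "$PROC_ROOT"/[0-9]*/status; do
        [ -r "$status" ] || continue    # task may have exited meanwhile
        state=$(awk '/^State:/ {print $2; exit}' "$status" 2>/dev/null)
        if [ "$state" = "D" ]; then
            pid=${status%/status}
            echo "${pid##*/}"
        fi
    done
}

# Dump the kernel stack of each D-state task found above.
dump_d_stacks() {
    for pid in $(list_d_state); do
        echo "== PID $pid =="
        cat "$PROC_ROOT/$pid/stack"
    done
}
```

Sourcing the file and then running 'disable_caw' followed by
'dump_d_stacks' should be enough. Note the emulate_caw write is not
persistent unless the configuration is saved afterwards.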