Re: [RFC] relaxed barrier semantics

Chris Mason, on 08/05/2010 05:32 PM wrote:
> On Thu, Aug 05, 2010 at 05:11:56PM +0400, Vladislav Bolkhovitin wrote:
>> Chris Mason, on 08/02/2010 09:39 PM wrote:
>>> I regret putting the ordering into the original barrier code...it
>>> definitely did help reiserfs back in the day but it stinks of magic and
>>> voodoo.
>>
>> But if the ordering isn't in the common (block) code, how to
>> implement the "hardware offload" for ordering, i.e. ORDERED
>> commands, in an acceptable way?
>>
>> I believe, the decision was right, but the flags and magic requests
>> based interface (and, hence, implementation) was wrong. That's it
>> which stinks of magic and voodoo.
>
> The interface definitely has flaws.  We didn't expand it because James
> popped up with a long list of error handling problems.

Could you point me to the corresponding message, please? I can't find it in my archive.

> Basically how
> do the hardware and the kernel deal with a failed request at the start
> of the chain.  Somehow the easy way of failing them all turned out to be
> extremely difficult.

Have you considered not failing them all, but instead using the SCSI ACA facility to just suspend the queue, requeue the failed request and then restart processing? I might be missing something, but with this approach the recovery of failed requests should be quite simple and, most importantly, compact, hence easily audited. Something like the outline below. Sorry, since it's low-level recovery, it requires some deep SCSI knowledge to follow.

We need:

1. A low-level driver without an internal queue which doesn't mask the returned status and sense. At first look, many of the existing drivers more or less satisfy this requirement, including the drivers of direct interest to me: qla2xxx, iscsi and ib_srp.

2. A device supporting ORDERED commands as well as the ACA and UA_INTLCK facilities in QERR mode 0 (a sketch of how to check this from user space follows this list).
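
For illustration only (this check is not part of the recovery scheme itself), a minimal user-space sketch of how requirement (2) could be verified: it reads the current Control mode page (0x0A) via SG_IO and reports the QERR and UA_INTLCK_CTRL fields. QERR must be 0 for this approach; "/dev/sg0" is just an example device path.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <scsi/sg.h>

int main(int argc, char **argv)
{
	unsigned char cdb[6] = { 0x1a,		/* MODE SENSE(6) */
				 0x08,		/* DBD: no block descriptors */
				 0x0a,		/* PC=0 (current), page 0x0A */
				 0x00, 0xfc, 0x00 };
	unsigned char buf[252], sense[32];
	struct sg_io_hdr io;
	unsigned char *page;
	int fd, qerr, ua_intlck;

	fd = open(argc > 1 ? argv[1] : "/dev/sg0", O_RDWR);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	memset(&io, 0, sizeof(io));
	io.interface_id = 'S';
	io.cmdp = cdb;
	io.cmd_len = sizeof(cdb);
	io.dxferp = buf;
	io.dxfer_len = sizeof(buf);
	io.dxfer_direction = SG_DXFER_FROM_DEV;
	io.sbp = sense;
	io.mx_sb_len = sizeof(sense);
	io.timeout = 5000;			/* milliseconds */

	if (ioctl(fd, SG_IO, &io) < 0) {
		perror("SG_IO");
		return 1;
	}

	/* Mode parameter header (6) is 4 bytes, then the block descriptors
	 * (none, since DBD is set), then the Control mode page itself. */
	page = buf + 4 + buf[3];
	qerr = (page[3] >> 1) & 0x3;		/* byte 3, bits 2:1 */
	ua_intlck = (page[4] >> 4) & 0x3;	/* byte 4, bits 5:4 */

	printf("QERR = %d (%s), UA_INTLCK_CTRL = %d\n",
	       qerr, qerr == 0 ? "suitable" : "not suitable", ua_intlck);
	return 0;
}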

Assume we have N ORDERED requests queued to a device and one of them fails. Submission of new requests to the device would then be suspended and a recovery thread woken up.

Suppose we have a list of the requests queued to the device, in the order they were queued. The recovery thread would then need to deal with the following cases:

1. The failed command failed with CHECK CONDITION and is at the head of the queue. (The device has now established ACA and suspended its internal queue.) The command should be resent to the device as an ACA task and, after it finishes, ACA should be cleared. (The device would then restart its queue.) Submission of new requests to the device would also be resumed.

2. The failed command failed with CHECK CONDITION and is not at the head of the queue.

2.1. The failed command is the last in the queue. ACA should be cleared and the failed command simply restarted. Submission of new requests to the device would then also be resumed.

2.2. The failed command isn't the last in the queue. The recovery thread would then send TEST UNIT READY as an ACA task to make sure all in-flight commands have reached the device. It would then abort all the commands after the failed one using the ABORT TASK task management function. ACA would then be cleared, and the failed command as well as all the aborted commands would be resent to the device. Submission of new requests to the device would then also be resumed.

3. The failed command failed with a status other than CHECK CONDITION and is at the head of the queue.

3.1. The failed command is the only queued command. A TEST UNIT READY command should then be sent to the device to pick up the post-UA_INTLCK CHECK CONDITION and trigger ACA. ACA should then be cleared and the failed command restarted. Submission of new requests to the device would then also be resumed.

3.2. There are other queued commands. The recovery thread should then remember the failed command and exit. The next command would get the post-UA_INTLCK CHECK CONDITION and trigger ACA. Recovery would then proceed as in (1), except that the two failed commands would be restarted as ACA tasks before ACA is cleared.

4. The failed command is not at the head of the queue and failed with a status other than CHECK CONDITION. This can happen in case of a QUEUE FULL (TASK SET FULL) condition. This case would be handled as in cases (3.x), then (2.2).

That's all: simple, compact and easy to audit. A sketch of the resulting dispatch logic follows.
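
To make the case split above a bit more concrete, here is a compilable sketch of the dispatch logic only. All the helper names (send_tur(), send_aca_tur(), abort_tasks_after() and so on) are hypothetical stand-ins for the real driver/midlayer plumbing, stubbed out with printf() so the flow can be traced; this is an illustration of the scheme, not an implementation.

#include <stdbool.h>
#include <stdio.h>

enum failed_status { CHECK_CONDITION, OTHER_STATUS };

struct failed_cmd {
	int tag;
	enum failed_status status;
	bool at_queue_head;
	bool last_in_queue;
	bool only_queued_cmd;
};

/* Hypothetical stubs: a real implementation would issue the corresponding
 * SCSI commands and task management functions. */
static void resend_as_aca_task(int tag) { printf("resend tag %d as an ACA task\n", tag); }
static void clear_aca(void)             { printf("CLEAR ACA\n"); }
static void send_tur(void)              { printf("TEST UNIT READY (normal task)\n"); }
static void send_aca_tur(void)          { printf("TEST UNIT READY (ACA task)\n"); }
static void abort_tasks_after(int tag)  { printf("ABORT TASK for every command after tag %d\n", tag); }
static void resend_failed_and_aborted(int tag) { printf("resend tag %d plus any aborted commands\n", tag); }
static void resume_submission(void)     { printf("resume submission of new requests\n"); }
static void remember_and_exit(int tag)  { printf("remember tag %d, wait for the next CHECK CONDITION\n", tag); }

static void recover(const struct failed_cmd *c)
{
	if (c->status != CHECK_CONDITION) {
		if (c->only_queued_cmd) {
			/* Case 3.1: provoke the post-UA_INTLCK CHECK CONDITION
			 * to establish ACA, then clear it and restart. */
			send_tur();
			clear_aca();
			resend_failed_and_aborted(c->tag);
			resume_submission();
		} else {
			/* Cases 3.2 and 4: the next queued command will hit the
			 * post-UA_INTLCK CHECK CONDITION and re-enter recovery. */
			remember_and_exit(c->tag);
		}
		return;
	}

	if (c->at_queue_head) {
		/* Case 1: ACA is established; rerun the command under ACA,
		 * then release the queue. */
		resend_as_aca_task(c->tag);
		clear_aca();
	} else if (c->last_in_queue) {
		/* Case 2.1: nothing queued behind it; clear ACA and restart. */
		clear_aca();
		resend_failed_and_aborted(c->tag);
	} else {
		/* Case 2.2: drain in-flight commands, abort everything behind
		 * the failed one, then clear ACA and resend. */
		send_aca_tur();
		abort_tasks_after(c->tag);
		clear_aca();
		resend_failed_and_aborted(c->tag);
	}
	resume_submission();
}

int main(void)
{
	/* Trace case 2.2 as an example. */
	struct failed_cmd c = { .tag = 7, .status = CHECK_CONDITION,
				.at_queue_head = false, .last_in_queue = false,
				.only_queued_cmd = false };
	recover(&c);
	return 0;
}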

> Even if that part had been refined, I think trusting the ordering down
> to the lower layers was a doomed idea.  The list of ways it could go
> wrong is much much longer (and harder to debug) than the list of
> benefits.

It's hard to debug because it's currently an overloaded-flags nightmare. It isn't the idea of trusting the lower levels that is doomed; everybody trusts lower levels everywhere in the kernel. What is doomed is the idea of providing the requested functionality via a set of flags and artificial barrier requests with obscure side effects. Linux just needs a clear and _natural_ interface for that, like the one I proposed in http://marc.info/?l=linux-scsi&m=128077574815881&w=2. Yes, I am proposing that we slowly start thinking about moving to a new interface and implementation, out of the current hell. It's obvious that what Linux has now in this area is a dead end. The new flag Christoph is going to add makes it even worse.
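
Just to illustrate what I mean by a natural interface (this is NOT the interface from the link above, which I have not reproduced here; every name below is hypothetical), the caller would state the ordering dependency explicitly instead of encoding it in flags, and the lower layer would be free to map it onto an ORDERED tag where the hardware supports it or emulate it by draining otherwise:

#include <stdio.h>

struct bio { int id; };		/* stand-in for the real struct bio */

/* Hypothetical primitive: submit @bio such that the device may not start it
 * before @after has completed.  Here it only prints what it would do. */
static int submit_bio_ordered_after(struct bio *bio, struct bio *after)
{
	if (after)
		printf("bio %d ordered after bio %d (ORDERED tag or drain)\n",
		       bio->id, after->id);
	else
		printf("bio %d submitted with no ordering constraint\n", bio->id);
	return 0;
}

int main(void)
{
	struct bio journal = { .id = 1 }, commit = { .id = 2 };

	/* Typical journalling pattern: the commit record must not be started
	 * by the device before the journal data it covers. */
	submit_bio_ordered_after(&journal, NULL);
	submit_bio_ordered_after(&commit, &journal);
	return 0;
}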

> With all of that said, I did go ahead and benchmark real ordered tags
> extensively on a scsi drive in the initial implementation.  There was
> very little performance difference.

It isn't a surprise that you didn't see much difference with a local (Wide?) SCSI drive. Such drives sit on a low-latency link, are simple enough to have small internal latencies and dumb enough not to gain much from internal reordering. But how about external arrays? Or even clusters? Nowadays anybody can build such arrays and clusters from any Linux (or other *nix) box using any OSS SCSI target implementation, starting with SCST, which I have been developing. Such array/cluster devices use links with an order of magnitude higher latency, they are very sophisticated inside, so they have much bigger internal latencies as well as much bigger opportunities to optimize the I/O pattern by internal reordering. All the record numbers I've seen so far were reached with deep queues. For instance, the last SCST record (>500K 4K IOPS from a single target) was achieved with a queue depth of 128!

So, I believe, Linux must use that capability to get full storage performance and finally simplify its storage stack.

Vlad