On Mon, 27 May 2019 at 17:56, Pascal Van Leeuwen <pvanleeuwen@xxxxxxxxxxxxxxxx> wrote: > > > > I understand that as well. But that doesn't change the fact that the > > > application may be waiting for a loooooong (relatively speaking) time > > > for it's results. As latency through hardware may be several orders of > > > a magnitude larger than the time it actually takes to *process* the > > > request. So when used synchronously the HW may appear to work at a mere > > > fraction of its true performance. > > > > > > > Of course. Sometimes you care about that, and sometimes you don't. > > > > > And if your main interest is in that application, you may not care so > > > much about what the rest of the system does, even if it can use the > > > remaining bandwidth of the accelerator. > > > > > > > 100s of instances of that application, thread, etc could be running at > > the same time, and throughput may be more important than latency. > > > > > In which case it may be desirable *not* to use the accelerator for that > > > application at all due to *very* poor performance (for that application). > > > > > > Which would make even more cycles on the accelerator available to the > > > other applications in the system, so that knife cuts both ways ... > > > > > > > Single thread perfomance is only one metric, and it may not be the one > > you care about most. > > > > How relevant is the fact that there may be (other) situations where latency > is not relevant for someone being in a situation where it is relevant? > > I was just talking about that situation where it actually is relevant and > therefore you do *not* wish the hardware driver to be used. > > Ok, let's phrase it such that can be no further misunderstandings: > > If you want performance from a single-threaded single application that does > synchronous, blocking, crypto API calls, then you do not want those to end > up at our hardware. Or, very likely, pretty much any other hardware. > The cra_priority mechanism does not allow the driver to convey such a thing. > Agreed. But we came to this point in response to your assertion that a userland application can only make meaningful use of the hardware accelerator if it uses some kind of asynchronous API like AIO, and I tried to explain that this is not the case. > > > Adding tons of workarounds to drivers, for example, slows them down, makes > > them > > > use more CPU cycles and more power, and ultimately defeats the purpose of > > having > > > a hardware accelerator at all. That is actually my concern. > > > > If the workaround is in a driver and not on a hot path, we don't > > really care about the memory footprint. > > > These workarounds are on a hot path by definition, as they have to filter > out specific requests coming in, i.e. it affects every single request done > to the driver. As for memory footprint: that is still relevant for embedded > systems even today. Besides, memory footprint affects instruction cache hit > ratio and therefore, indirectly, performance as well. > Of course. But what appears on the actual hot path is a single 'cbz' instruction that is always predicted correctly, and the actual code lives somewhere else in the binary. That is why I said *memory* footprint not *cache* footprint, since it only affects the former and not the latter. > > > And as an aside, once workarounds have been implemented and proven to > > "work", the > > > underlying issue rarely makes it to the HW guys so we're stuck with it > > forever. > > > > > > > Well, the starting point of the argument was that you deliberately > > omitted handling of zero length inputs to save silicon area. > > > Where did I ever say that we omitted that to save silicon area? > You're putting words in my mouth (or fingers) now. I never said that, > that is not the reason at all. > Fair enough. But I did understand correctly that this was a deliberate decision, no? > > So the > > issue would already be known to the h/w guys, and they decided it was > > something they'd punt to the software instead. > > > NO. We never decided any such thing. We decided that it was not a relevant > use case that we needed to support at all. Neither in the hardware nor in > the driver. Our own generic OS driver does not contain any such workarounds. > In fact, based on this Linux driver thing we had a new internal discussion > on it and the outcome did not change: not a relevant use case for us. > RIght. So how does this this relate to your remark above that working workarounds prevent issues from being known to the h/w guys? > > > NO. Hardware is broken if it doesn't comply to its own specifications - > > > which *may* include references to industry standards it must comply with. > > > If I intentionally specify that zero length hashes are not supported, and > > > I don't pretend to comply with any industry standard that requires them, > > > then that's just a *limitation* of the hardware, most certainly not a bug. > > > > Fair enough. But if you want to integrate that h/w in a system that > > does aim to comply, it is up to the software to fix the impedance > > mismatch. > > > Comply with what exactly? You can't "comply" with algorithms ... you just > implement whatever subset makes sense and specify the constraints. You can > comply with protocol specifications, and that's what we do. None of those > requires zero length hashing, HMAC, cipher or AEAD operations. > Many algorithms are unbounded anyway and hardware is bounded by definition. > I'll ignore the remark about boundedness since it has no bearing whatsoever on this discussion. As for compliance, many of the zero length test vectors were sourced from FIPS or NIST documents, so i don't care what you call it, but it is a perfectly reasonable requirement that new implementations work as expected for test vectors that have been published along with the algorithm. Again, I am not saying your hardware should do this. I am only saying that, from a software engineering perspective, your driver is where we fix up the differences, not anywhere else. > > > Hardware necessarily *always* has limitations because of all kinds of > > > constraints: area, power, complexity. And even something as mundane as a > > > schedule constraint where you simply can't fit all desired features in the > > > desired schedule. Which is usually very solid due to timeslots being > > > planned in a fab etc. We don't have the luxury of extending our schedule > > > forever like SW guys tend to do ... we're very proud of our track record > > > of always meeting our promised schedules. Plus - silicon can't be patched, > > > so what's done is done and you have to live with it. For many years to > > > come, usually. > > > > > > > This is all pretty well understood. We all have different interests to > > balance against each other, which is why we are perfectly fine with > > handling some corner cases in the driver. What we are not prepared to > > do is let those corner cases leak into the core crypto layer as cases > > that require special handling. > > > Which can be avoided by not selecting a driver for an application it > does not support ... if the corner case is not exercised, then no harm is > done. No need for any "leaking" through the crypto layers. > True. So again, if you choose to support your hardware as part of a subsystem that does not have these requirements, I am perfectly fine with that. > > > > I know there is a gradient here going > > > > from hashes, AEADs to symmetric ciphers, but I think this applies to > > > > all of them. > > > > > > > > > Please keep in mind that existing hardware cannot be changed. So why > > > > > wasn't the API designed around the limitations of *existing* hardware? > > > > > > > > From a software point of view, adding special cases for zero length > > > > inputs amounts to what you are trying to avoid: using more 'silicon > > > > area'. > > > > > > > No, that's actually not the reason at all in this case. We're trying to > > > avoid significant extra complexity and effort on both the hardware itself > > > and the verification thereof. Silicon area is not even in the picture as > > > a concern for something as "small" as this. > > > > > > Adding zero length support to our hardware architecture is not a trivial > > > exercise. And then you have to weigh added complexity - =added risk, when > > > you talk about hardware with multi-million dollar mask sets in play - > > > against usefulness. Zero-length support was - and still is! - simply not > > > worth the added risk and effort. > > > > > > > Of course. That is why it is perfectly fine to handle this in your driver. > > > Perfectly fine for you, maybe, but not so much for me. > > Why: > Performance loss. Negligible > Driver complexity vs maintenance. Yes, but again, this complexity has to live *somewhere*, and we don't want it in the generic code. > Wasted effort implementing totally irrelevant cases. > I agree that it is unfortunate that we have to spend time on this. > > > And if you go that naive route, just fix everything in the driver, then > > > you simply end up with something terribly inefficient because all those > > > corner case checks end up in the fast path and eating up code space. > > > > > > > This is *exactly* the reason why we want this workaround in your > > driver, because if it is not in your driver, we will have to put it in > > generic code where it affects everybody. > > > > > For a someone claiming to "meet in the middle to compromise" you're > > > surely not compromising anything at all ... No offense. > > > > > > > None taken. I am really trying to work with you here, but changing > > core code to address the limitations of one particular h/w > > implementation is not something we do lightly. > > > Well, for one thing it is not "one particular h/w implementation". > As for the zero length thing, I believe it's almost all of them, based > on comments from other driver maintainers. I've only seen 1 comment saying > that the HW *did* support it. And that was a "maybe". > > And my main suggestion does not require any core code changes at all. > So what exactly are you proposing? Mind you, we cannot optimize this away, so we will have to add checks /somewhere/ that we are not calling into the crypto code with length values it doesn't support.