On Wed, Mar 1, 2017 at 11:29 AM, Milan Broz <gmazyland@xxxxxxxxx> wrote:
>
> On 03/01/2017 09:30 AM, Gilad Ben-Yossef wrote:
> > On Tue, Feb 28, 2017 at 11:05 PM, Milan Broz <gmazyland@xxxxxxxxx> wrote:
> >>
> >> On 02/22/2017 07:12 AM, Binoy Jayan wrote:
> >>>
> >>> I was wondering if this is near to be ready for submission (apart from
> >>> the testmgr.c changes) or I need to make some changes to make it
> >>> similar to the IPSec offload?
> >>
> >> I just tried this and, except that it registers the IV for every new device again, it works...
> >> (After a while you have many duplicate entries in /proc/crypto.)
> >>
> >> But I would like to see some summary of why such a big patch is needed in the first place.
> >> (During an internal discussion it seems that people are already lost in the mails and
> >> patches here, so Ondra promised me to send a summary mail here soon.)
> >>
> >> IIRC the initial problem was dmcrypt performance on some embedded
> >> crypto processors that are not able to cope with small crypto requests effectively.
> >>
> >> Do you have some real performance numbers that prove that such a patch is adequate?
> >>
> >> I would really like to see the performance issue fixed but I am really not sure
> >> this approach works for everyone. It would be better to avoid repeating this exercise later.
> >> IIRC Ondra's "bulk" mode, despite being rejected, shows that there is potential
> >> to speed things up even for crypt drivers that do not support their own IV generators.
> >
> > AFAIK the problem that we are trying to solve is that if the IV is
> > generated outside the crypto API domain, then you are forced to have an
> > invocation of the crypto API per each block, because you need to provide
> > the IV for each block.
> >
> > By putting the IV generation responsibility in the Crypto API we open
> > the way to do a single invocation of the crypto API for a whole sequence
> > of blocks.
>
> Sure, but this is theory. Does it really work on some hw already?
> Do you have performance measurements or comparison?

I'm working on upstreaming a driver for Arm CryptoCell that supports
this and working offline to get Binoy a board to test this with.
Alas, shipping crypto HW has its fair share of regulatory challenges... :-)

I can certainly understand if you don't want to take the patch until
we have results with dm-crypt itself, but the difference between 8
separate invocations of the engine for 512 bytes of XTS each and a
single invocation for 4KB is pretty big.

From what I know of HW engines I'd be surprised if this is in any way
unique to CryptoCell.
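To make that concrete, here is a rough sketch (illustration only, not
actual dm-crypt code; the function names, the sg_in/sg_out arrays and
the constants are made up for the example) of per-sector submission
with a caller-generated plain64 IV versus handing the whole 4KB over
in one request to a cipher implementation that derives the per-sector
IVs itself, which is the capability this patch set is trying to add:

/*
 * Sketch: per-sector requests with caller-supplied plain64 IVs vs. one
 * request covering a whole 4KB page. Assumes a synchronous tfm; the
 * async (-EINPROGRESS) completion handling that dm-crypt does is
 * omitted for brevity. sg_in/sg_out are assumed to be properly
 * initialized scatterlist tables, one 512B entry per sector.
 */
#include <crypto/skcipher.h>
#include <linux/scatterlist.h>
#include <linux/string.h>
#include <asm/unaligned.h>

#define FDE_SECTOR_SIZE		512
#define FDE_SECTORS_PER_PAGE	8	/* 4KB page */

/* Today: the caller computes the IV and issues one request per 512B sector. */
static int encrypt_per_sector(struct skcipher_request *req,
			      struct scatterlist *sg_in,
			      struct scatterlist *sg_out,
			      u64 first_sector, u8 *iv, unsigned int ivsize)
{
	unsigned int i;
	int ret;

	for (i = 0; i < FDE_SECTORS_PER_PAGE; i++) {
		memset(iv, 0, ivsize);
		put_unaligned_le64(first_sector + i, iv);	/* plain64 IV */

		skcipher_request_set_crypt(req, &sg_in[i], &sg_out[i],
					   FDE_SECTOR_SIZE, iv);
		ret = crypto_skcipher_encrypt(req);	/* 8 round trips per page */
		if (ret)
			return ret;
	}
	return 0;
}

/*
 * With IV generation inside the crypto API, the whole span can be handed
 * over at once and the implementation derives the IV for each 512B unit
 * internally (assumes an IV-generating cipher of the kind proposed here).
 */
static int encrypt_bulk(struct skcipher_request *req,
			struct scatterlist *sg_in,
			struct scatterlist *sg_out,
			u64 first_sector, u8 *iv, unsigned int ivsize)
{
	memset(iv, 0, ivsize);
	put_unaligned_le64(first_sector, iv);	/* seed: starting sector */

	skcipher_request_set_crypt(req, sg_in, sg_out,
				   FDE_SECTOR_SIZE * FDE_SECTORS_PER_PAGE, iv);
	return crypto_skcipher_encrypt(req);	/* one round trip per page */
}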
> > For software implementation of XTS this doesn't matter much - but for
> > hardware based XTS providers
>
> It is not only embedded crypto, we have some more reports in the past
> that 512B sectors are not ideal even for other systems.
> (IIRC it was also with AES-NI, which represents a really big group of users).

I never said anything about embedded :-)

It really is an observation about the overhead of context switches
between dm-crypt and whatever/wherever you handle crypto - be it an
off-CPU hardware engine or a bunch of parallel kernel threads running
on other cores. You really want to burst as much as possible.

> > This led some vendors to ship hacked up versions of dm-crypt to match
> > the specific crypto hardware they were using, or so I've heard at
> > least - didn't see the code myself.
>
> I saw a few versions of that. There was a very hacky way to provide
> request-based dmcrypt (see the old "Introduce the request handling for
> dm-crypt" thread on dm-devel). This is not an acceptable way but it
> definitely points to the same problem.
>
> > I believe Binoy is trying to address this in a generic, upstream-worthy
> > way instead.
>
> IIRC the problem is performance; if we can solve it in some generic way,
> good, but for now it seems to be a big change and we just hope it helps later...

I see what you're saying. We need numbers to back this up.

> > Anyway, you are only supposed to see a difference when using a
> > hardware based XTS provider algo that supports IV generation.
> >
> >> I like that the patch is now contained inside dmcrypt, but it still exposes IVs that
> >> are designed just for old, insecure, compatibility-only containers.
> >>
> >> I really do not think every compatible crap must be accessible through the crypto API.
> >> (I wrote the dmcrypt lrw and tcw compatibility IVs and I would never have done it this way
> >> if I had known it would be accessible outside of dmcrypt internals...)
> >> Even ESSIV is something that was born to fix predictive IVs (CBC watermarking
> >> attacks) for disk encryption only; there is no reason to expose it outside of disk encryption.
> >
> > The point is that you have more than one implementation of these
> > "compatible crap" IVs - the software implementation that you wrote and
> > potentially multiple hardware implementations - and putting this in the
> > crypto API domain is the way to abstract this so you use the one that
> > works best on your platform.
>
> For XTS you need just a simple linear IV. No problem with that, implementation
> in the crypto API and hw is trivial.
>
> But the compatible IVs (that provide compatibility with loopAES and very old TrueCrypt)
> should never ever be implemented again anywhere.
>
> Specifically, "tcw" is broken, insecure and provided here just to help people migrate
> from old TrueCrypt containers. Even TrueCrypt followers removed it from the codebase.
> (It is basically a combination of IV and a slight modification of CBC mode. All
> recent versions switched to XTS and plain IV.)
>
> So building an abstraction over something known to be broken and that is now intentionally
> isolated inside dmcrypt is, in my opinion, really not a good idea.

I don't think anyone is interested in these modes. How you support XTS
and ESSIV in a generic way without supporting these broken modes is not
something I'm clear on, though.

> But please do not get me wrong, I do not want to block any improvement.
>
> But it seems to me that this thread focused on creating a nice crypto API interface
> for FDE IVs instead of demonstrating that the proposed solution really solves
> the performance issue.
> And not only for your hw driver; maybe other systems could benefit from better
> processing of small requests as well.

Of course, the benefits at large need to outweigh the cost. But I
don't think functioning better when working on large bursts is in any
way special to specific HW.

Indeed, I wonder if we can show a benefit for just the cryptd use case.
I'll look into that.

Thanks,
Gilad

--
Gilad Ben-Yossef
Chief Coffee Drinker

"If you take a class in large-scale robotics, can you end up in a
situation where the homework eats your dog?"
 -- Jean-Baptiste Queru