Re: [RFC PATCH 0/8] Qualcomm Cloud AI 100 driver

Greg Kroah-Hartman <gregkh@xxxxxxxxxxxxxxxxxxx> · Wed, 20 May 2020 17:59:43 +0200

On Wed, May 20, 2020 at 08:48:13AM -0600, Jeffrey Hugo wrote:
> On 5/20/2020 2:34 AM, Daniel Vetter wrote:
> > On Wed, May 20, 2020 at 7:15 AM Greg Kroah-Hartman
> > <gregkh@xxxxxxxxxxxxxxxxxxx> wrote:
> > > 
> > > On Tue, May 19, 2020 at 10:41:15PM +0200, Daniel Vetter wrote:
> > > > On Tue, May 19, 2020 at 07:41:20PM +0200, Greg Kroah-Hartman wrote:
> > > > > On Tue, May 19, 2020 at 08:57:38AM -0600, Jeffrey Hugo wrote:
> > > > > > On 5/18/2020 11:08 PM, Dave Airlie wrote:
> > > > > > > On Fri, 15 May 2020 at 00:12, Jeffrey Hugo <jhugo@xxxxxxxxxxxxxx> wrote:
> > > > > > > > 
> > > > > > > > Introduction:
> > > > > > > > Qualcomm Cloud AI 100 is a PCIe adapter card which contains a dedicated
> > > > > > > > SoC ASIC for the purpose of efficently running Deep Learning inference
> > > > > > > > workloads in a data center environment.
> > > > > > > > 
> > > > > > > > The offical press release can be found at -
> > > > > > > > https://www.qualcomm.com/news/releases/2019/04/09/qualcomm-brings-power-efficient-artificial-intelligence-inference
> > > > > > > > 
> > > > > > > > The offical product website is -
> > > > > > > > https://www.qualcomm.com/products/datacenter-artificial-intelligence
> > > > > > > > 
> > > > > > > > At the time of the offical press release, numerious technology news sites
> > > > > > > > also covered the product.  Doing a search of your favorite site is likely
> > > > > > > > to find their coverage of it.
> > > > > > > > 
> > > > > > > > It is our goal to have the kernel driver for the product fully upstream.
> > > > > > > > The purpose of this RFC is to start that process.  We are still doing
> > > > > > > > development (see below), and thus not quite looking to gain acceptance quite
> > > > > > > > yet, but now that we have a working driver we beleive we are at the stage
> > > > > > > > where meaningful conversation with the community can occur.
> > > > > > > 
> > > > > > > 
> > > > > > > Hi Jeffery,
> > > > > > > 
> > > > > > > Just wondering what the userspace/testing plans for this driver.
> > > > > > > 
> > > > > > > This introduces a new user facing API for a device without pointers to
> > > > > > > users or tests for that API.
> > > > > > 
> > > > > > We have daily internal testing, although I don't expect you to take my word
> > > > > > for that.
> > > > > > 
> > > > > > I would like to get one of these devices into the hands of Linaro, so that
> > > > > > it can be put into KernelCI.  Similar to other Qualcomm products. I'm trying
> > > > > > to convince the powers that be to make this happen.
> > > > > > 
> > > > > > Regarding what the community could do on its own, everything but the Linux
> > > > > > driver is considered proprietary - that includes the on device firmware and
> > > > > > the entire userspace stack.  This is a decision above my pay grade.
> > > > > 
> > > > > Ok, that's a decision you are going to have to push upward on, as we
> > > > > really can't take this without a working, open, userspace.
> > > > 
> > > > Uh wut.
> > > > 
> > > > So the merge criteria for drivers/accel (atm still drivers/misc but I
> > > > thought that was interim until more drivers showed up) isn't actually
> > > > "totally-not-a-gpu accel driver without open source userspace".
> > > > 
> > > > Instead it's "totally-not-a-gpu accel driver without open source
> > > > userspace" _and_ you have to be best buddies with Greg. Or at least
> > > > not be on the naughty company list. Since for habanalabs all you
> > > > wanted is a few test cases to exercise the ioctls. Not the entire
> > > > userspace.
> > > 
> > > Also, to be fair, I have changed my mind after seeing the mess of
> > > complexity that these "ioctls for everyone!" type of pass-through
> > > these kinds of drivers are creating.  You were right, we need open
> > > userspace code in order to be able to properly evaluate and figure out
> > > what they are doing is right or not and be able to maintain things over
> > > time correctly.
> > > 
> > > So I was wrong, and you were right, my apologies for my previous
> > > stubbornness.
> > 
> > Awesome and don't worry, I'm pretty sure we've all been stubborn
> > occasionally :-)
> > 
> >  From a drivers/gpu pov I think still not quite there since we also
> > want to see the compiler for these programmable accelerator thingies.
> > But just having a fairly good consensus that "userspace library with
> > all the runtime stuff excluding compiler must be open" is a huge step
> > forward. Next step may be that we (kernel overall, drivers/gpu will
> > still ask for the full thing) have ISA docs for these programmable
> > things, so that we can also evaluate that aspect and gauge how many
> > security issues there might be. Plus have a fighting chance to fix up
> > the security leaks when (post smeltdown I don't really want to
> > consider this an if) someone finds a hole in the hw security wall. At
> > least in drivers/gpu we historically have a ton of drivers with
> > command checkers to validate what userspace wants to run on the
> > accelerator thingie. Both in cases where the hw was accidentally too
> > strict, and not strict enough.
> 
> I think this provides a pretty clear guidance on what you/the community are
> looking for, both now and possibly in the future.
> 
> Thank you.
> 
> From my perspective, it would be really nice if there was something like
> Mesa that was a/the standard for these sorts of accelerators.  Its somewhat
> the wild west, and we've struggled with it.

Put a first cut at such a thing out there and see how it goes!  Nothing
is preventing you from starting such a project, and it would be most
welcome as you have seen.

good luck,

greg k-h