Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

Marek Olšák <maraeo@xxxxxxxxx> · Thu, 17 Jun 2021 14:28:06 -0400

The kernel will know who should touch the implicit-sync semaphore next, and at the same time, a copy of every write request to the implicit-sync semaphore will be forwarded to the kernel for monitoring and bo_wait.
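To make the monitoring idea concrete, here is a minimal sketch of the bookkeeping the kernel could do for one implicit-sync semaphore. Everything in it is an assumption for illustration: the class name `ImplicitSyncMonitor`, its methods, and the use of plain strings as process identities are hypothetical, not an existing kernel or driver interface.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ImplicitSyncMonitor:
    """Hypothetical kernel-side view of one implicit-sync semaphore."""
    last_seq: int = 0                               # last sequence number handed out
    signaled: int = 0                               # last value accepted from the write log
    pending: deque = field(default_factory=deque)   # (seq, owner) still expected, in order
    offenders: list = field(default_factory=list)   # writers whose forwarded writes were rejected

    def assign(self, owner):
        """Hand out the next sequence number and remember who must write it."""
        self.last_seq += 1
        self.pending.append((self.last_seq, owner))
        return self.last_seq

    def next_writer(self):
        """Owner expected to touch the semaphore next, or None if idle."""
        return self.pending[0][1] if self.pending else None

    def observe_write(self, writer, value):
        """Check one forwarded copy of a write request against the expectation."""
        if not self.pending or self.pending[0] != (value, writer):
            self.offenders.append(writer)   # wrong writer or wrong value
            return False
        self.pending.popleft()
        self.signaled = value
        return True

    def is_idle(self):
        """bo_wait-style query: every assigned sequence number has been written."""
        return not self.pending
```

With two submissions assigned to "proc_a" and "proc_b", a forwarded write of 0xdeadbeef from "proc_b" is rejected and recorded in `offenders`, while `next_writer()` keeps answering "proc_a" until the expected value arrives from it.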

Syncobjs could either use the same monitored access as implicit sync or be completely unmonitored. We haven't decided yet.

Syncfiles could either use one of the above or wait for a syncobj to go idle before converting to a syncfile.

Marek

On Thu, Jun 17, 2021 at 12:48 PM Daniel Vetter <daniel@xxxxxxxx> wrote:
On Mon, Jun 14, 2021 at 07:13:00PM +0200, Christian König wrote:
> As long as we can figure out who touched a certain sync object last, that
> would indeed work, yes.

Don't you need to know who will touch it next, i.e. who is holding up your
fence? Or maybe I'm just again totally confused.
-Daniel

>
> Christian.
>
> On 14.06.21 at 19:10, Marek Olšák wrote:
> > The call to the hw scheduler has a limitation on the size of all
> > parameters combined. I think we can only pass a 32-bit sequence number
> > and a ~16-bit global (per-GPU) syncobj handle in one call and not much
> > else.
> >
> > The syncobj handle can be an element index in a global (per-GPU) syncobj
> > table, and it's read-only for all processes with the exception of the
> > signal command. Syncobjs can either have per-VMID write access flags for
> > the signal command (slow), or any process can write to any syncobj and
> > only rely on the kernel checking the write log (fast).
> >
> > In any case, we can execute the memory write in the queue engine and
> > only use the hw scheduler for logging, which would be perfect.
> >
> > Marek
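As a rough illustration of how little fits into such a call, the sketch below packs a 32-bit sequence number and a 16-bit syncobj table index into one 48-bit value. The field order, the exact 16-bit handle width (the message only says "~16-bit"), and the function names are assumptions made for this example, not the real hw scheduler interface.

```python
SEQ_BITS = 32      # sequence number width mentioned in the message
HANDLE_BITS = 16   # "~16-bit" handle; exactly 16 bits is an assumption


def pack_signal_args(handle, seq):
    """Pack a syncobj table index and a sequence number into one integer."""
    if not 0 <= handle < (1 << HANDLE_BITS):
        raise ValueError("syncobj handle does not fit in the handle field")
    if not 0 <= seq < (1 << SEQ_BITS):
        raise ValueError("sequence number does not fit in 32 bits")
    # Assumed layout: handle in the upper 16 bits, sequence number below it.
    return (handle << SEQ_BITS) | seq


def unpack_signal_args(word):
    """Inverse of pack_signal_args: returns (handle, seq)."""
    return word >> SEQ_BITS, word & ((1 << SEQ_BITS) - 1)
```

For example, `pack_signal_args(0x1234, 0xdeadbeef)` yields `0x1234deadbeef`, and a handle of 65536 or above is rejected because the global table index would no longer fit.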

> >
> > On Thu, Jun 10, 2021 at 12:33 PM Christian König
> > <ckoenig.leichtzumerken@xxxxxxxxx> wrote:
> >
> >     Hi guys,
> >
> >     maybe soften that a bit. Reading from the shared memory of the
> >     user fence is ok for everybody. What we need to take more care of
> >     is the writing side.
> >
> >     So my current thinking is that we allow read-only access, but
> >     writing a new sequence value needs to go through the scheduler/kernel.
> >
> >     So when the CPU wants to signal a timeline fence it needs to call
> >     an IOCTL. When the GPU wants to signal the timeline fence it needs
> >     to hand that off to the hardware scheduler.
> >
> >     If we lock up, the kernel can check with the hardware who did the
> >     last write and what value was written.
> >
> >     That, together with an IOCTL to give out sequence numbers for
> >     implicit sync to applications, should be sufficient for the kernel
> >     to track who is responsible if something bad happens.
> >
> >     In other words, when the hardware says that the shader wrote stuff
> >     like 0xdeadbeef, 0x0 or 0xffffffff into memory, we kill the process
> >     who did that.
> >
> >     If the hardware says that seq - 1 was written fine but seq is
> >     missing, then the kernel blames whoever was supposed to write seq.
> >
> >     Just piping the write through a privileged instance should be
> >     fine to make sure that we don't run into issues.
> >
> >     Christian.
> >
> >     On 10.06.21 at 17:59, Marek Olšák wrote:
> > >     Hi Daniel,
> > >
> > >     We just talked about this whole topic internally and we came
> > >     to the conclusion that the hardware needs to understand sync
> > >     object handles and have high-level wait and signal operations in
> > >     the command stream. Sync objects will be backed by memory, but
> > >     they won't be readable or writable by processes directly. The
> > >     hardware will log all accesses to sync objects and will send the
> > >     log to the kernel periodically. The kernel will identify
> > >     malicious behavior.
> > >
> > >     Example of a hardware command stream:
> > >     ...
> > >     ImplicitSyncWait(syncObjHandle, sequenceNumber); // the sequence number is assigned by the kernel
> > >     Draw();
> > >     ImplicitSyncSignalWhenDone(syncObjHandle);
> > >     ...
> > >
> > >     I'm afraid we have no other choice because of the TLB
> > >     invalidation overhead.
> > >
> > >     Marek
> > >
> > >
> > >     On Wed, Jun 9, 2021 at 2:31 PM Daniel Vetter <daniel@xxxxxxxx> wrote:
> > >
> > >         On Wed, Jun 09, 2021 at 03:58:26PM +0200, Christian König wrote:
> > >         > On 09.06.21 at 15:19, Daniel Vetter wrote:
> > >         > > [SNIP]
> > >         > > > Yeah, we call this the lightweight and the heavyweight tlb flush.
> > >         > > >
> > >         > > > The lightweight can be used when you are sure that you don't have any of the
> > >         > > > PTEs currently in flight in the 3D/DMA engine and you just need to
> > >         > > > invalidate the TLB.
> > >         > > >
> > >         > > > The heavyweight must be used when you need to invalidate the TLB *AND* make
> > >         > > > sure that no concurrent operation moves new stuff into the TLB.
> > >         > > >
> > >         > > > The problem is for this use case we have to use the heavyweight one.
> > >         > > Just for my own curiosity: So the lightweight flush is only for in-between
> > >         > > CS when you know access is idle? Or does that also not work if userspace
> > >         > > has a CS on a dma engine going at the same time because the TLBs aren't
> > >         > > isolated enough between engines?
> > >         >
> > >         > More or less correct, yes.
> > >         >
> > >         > The problem is a lightweight flush only invalidates the TLB, but doesn't
> > >         > take care of entries which have been handed out to the different engines.
> > >         >
> > >         > In other words what can happen is the following:
> > >         >
> > >         > 1. Shader asks TLB to resolve address X.
> > >         > 2. TLB looks into its cache and can't find address X so it asks the walker
> > >         >    to resolve.
> > >         > 3. Walker comes back with result for address X and TLB puts that into its
> > >         >    cache and gives it to Shader.
> > >         > 4. Shader starts doing some operation using result for address X.
> > >         > 5. You send lightweight TLB invalidate and TLB throws away cached values for
> > >         >    address X.
> > >         > 6. Shader happily still uses whatever the TLB gave to it in step 3 to
> > >         >    access address X.
> > >         >
> > >         > See it like the shader has its own 1-entry L0 TLB cache which is not
> > >         > affected by the lightweight flush.
> > >         >
> > >         > The heavyweight flush on the other hand sends out a broadcast signal to
> > >         > everybody and only comes back when we are sure that an address is not in use
> > >         > any more.
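The six steps above can be replayed with a toy model. This is only a sketch under stated assumptions: `ToyTLB` and its methods are invented for illustration, the per-engine `in_flight` entry stands in for the "1-entry L0 TLB cache" described above, and "waiting" in the heavyweight flush is simulated by retiring the engine immediately rather than blocking.

```python
class ToyTLB:
    """Toy model: a shared TLB cache plus translations already handed to engines."""

    def __init__(self, page_table):
        self.page_table = page_table   # virtual address -> physical address
        self.cache = {}                # translations cached in the TLB
        self.in_flight = {}            # engine -> (va, pa) it is still using

    def resolve(self, engine, va):
        """Steps 1-3: look in the cache, walk on a miss, hand the result out."""
        if va not in self.cache:
            self.cache[va] = self.page_table[va]   # page-table walk
        self.in_flight[engine] = (va, self.cache[va])
        return self.cache[va]

    def retire(self, engine):
        """The engine finished the operation that used its translation."""
        self.in_flight.pop(engine, None)

    def lightweight_flush(self, va):
        """Step 5: drop the cached entry only; handed-out copies are untouched."""
        self.cache.pop(va, None)

    def heavyweight_flush(self, va):
        """Drop the cached entry and wait for every engine still using va."""
        self.cache.pop(va, None)
        waited_on = [e for e, (v, _) in self.in_flight.items() if v == va]
        for engine in waited_on:
            self.retire(engine)   # stand-in for blocking until the engine is done
        return waited_on

    def stale_users(self, va):
        """Engines holding a translation for va that the page table no longer backs."""
        return [e for e, (v, pa) in self.in_flight.items()
                if v == va and self.page_table.get(v) != pa]
```

If address X is remapped after the shader resolved it, `lightweight_flush(X)` empties the cache but `stale_users(X)` still lists the shader, which is step 6. After `heavyweight_flush(X)` nobody holds the old translation any more.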

> > >
> > >         Ah makes sense. On Intel the shaders only operate in VA, everything goes
> > >         around as explicit async messages to IO blocks. So we don't have this, the
> > >         only difference in tlb flushes is between tlb flush in the IB and an mmio
> > >         one which is independent of anything currently being executed on an
> > >         engine.
> > >         -Daniel
> > >         --
> > >         Daniel Vetter
> > >         Software Engineer, Intel Corporation
> > >         http://blog.ffwll.ch
> > >
> >
>
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch