2012/4/25 Dave Airlie <airlied@xxxxxxxxx>:
> 2012/4/25 Christian König <deathsimple@xxxxxxxxxxx>:
>> On 21.04.2012 16:14, Jerome Glisse wrote:
>>>
>>> 2012/4/21 Christian König <deathsimple@xxxxxxxxxxx>:
>>>>
>>>> On 20.04.2012 01:47, Jerome Glisse wrote:
>>>>>
>>>>> 2012/4/19 Christian König <deathsimple@xxxxxxxxxxx>:
>>>>>>
>>>>>> This includes mostly fixes for multi ring lockups and GPU resets, but
>>>>>> it should generally improve the behavior of the kernel mode driver in
>>>>>> case something goes badly wrong.
>>>>>>
>>>>>> On the other hand it completely rewrites the IB pool and semaphore
>>>>>> handling, so I think there are still a couple of problems in it.
>>>>>>
>>>>>> The first four patches were already sent to the list, but the current
>>>>>> set depends on them so I resend them again.
>>>>>>
>>>>>> Cheers,
>>>>>> Christian.
>>>>>
>>>>> I did a quick review; it looks mostly good, but as it's sensitive code
>>>>> I would like to spend some time on it, probably next week. Note that I
>>>>> had some work in this area too: I mostly want to drop all the debugfs
>>>>> files related to this and add something new and more useful (basically
>>>>> something that allows you to read all the data needed to replay an IB
>>>>> that locks up). I was also looking into Dave's reset thread, and your
>>>>> solution of moving the reset into the ioctl return path sounds good
>>>>> too, but I need to convince myself that it encompasses all possible
>>>>> cases.
>>>>>
>>>>> Cheers,
>>>>> Jerome
>>>>>
>>>> After sleeping a night over it I already reworked the patch for improving
>>>> the SA performance, so please wait at least for v2 before taking a look
>>>> at it :)
>>>>
>>>> Regarding the debugging of lockups I had the following on my "in mind
>>>> todo" list:
>>>> 1. Rework the chip specific lockup detection code a bit more and probably
>>>>    clean it up a bit.
>>>> 2. Make the timeout a module parameter, because compute tasks sometimes
>>>>    block a ring for more than 10 seconds.
>>>> 3. Keep track of the actual RPTR offset a fence is emitted to.
>>>> 4. Keep track of all the BOs an IB is touching.
>>>> 5. Now if a lockup happens, start with the last successfully signaled
>>>>    fence and dump the ring content after that RPTR offset till the first
>>>>    not signaled fence.
>>>> 6. Then if this fence references an IB, dump its content and the BOs it
>>>>    is touching.
>>>> 7. Dump everything on the ring after that fence until you reach the RPTR
>>>>    of the next fence or the WPTR of the ring.
>>>> 8. If there is a next fence, repeat the whole thing at number 6.
>>>>
>>>> If I'm not completely wrong that should give you practically all the
>>>> information available, and we probably should put that behind another
>>>> module option, because we are going to spam syslog quite a lot here.
>>>> Feel free to add/modify the ideas on this list.
>>>>
>>>> Christian.
>>>
>>> What I have is similar. I am assuming only IBs trigger lockups; before
>>> each IB, emit the IB offset in the SA and the IB size to a scratch reg.
>>> For each IB keep a BO list. On lockup, allocate a big chunk of memory to
>>> copy the whole IB and all the BOs referenced by the IB (I am using my bof
>>> format as I already have userspace tools).
>>>
>>> Remove all the debugfs files. Just add a new one that gives you the first
>>> faulty IB. On read of this file the kernel frees the memory. The kernel
>>> should also free the memory after a while, or better, enable the lockup
>>> copy only if some kernel radeon option is enabled.
>>
>>
>> Just resent my current patchset to the mailing list. It's not as complete
>> as your solution, but seems to be a step in the right direction. So please
>> take a look at it.
>>
>> Being able to generate something like a "GPU crash dump" on lockup sounds
>> like something very valuable to me, but I'm not sure if debugfs files are
>> the right direction to go. Maybe something more like a module parameter
>> containing a directory, and if set we dump all the information available
>> (including BO content) in binary form (instead of the current human
>> readable form of the debugfs files).
>
> Do what the intel driver does: create a versioned binary debugfs file with
> all the error state in it for a lockup, store only one of these at a time,
> and run a userspace tool to dump it out into something you can upload, or
> just cat the file and upload it.
>
> You don't want the kernel writing to dirs on disk under any circumstances.
>

We have an internal binary format for dumping command streams and
associated buffers; we should probably use that so that we can better
take advantage of existing internal tools.

Alex

> Dave.
> _______________________________________________
> dri-devel mailing list
> dri-devel@xxxxxxxxxxxxxxxxxxxxx
> http://lists.freedesktop.org/mailman/listinfo/dri-devel
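
For reference, the dump walk from Christian's todo list earlier in the
thread (start at the last signaled fence, dump the ring up to each
fence that has not signaled, then dump the tail up to WPTR) could look
roughly like the userspace sketch below. This is only an illustration
under stated assumptions, not driver code: the names fence_rec,
dump_ring_range and dump_after_lockup are hypothetical and not radeon
API, ring_off is assumed to be the dword offset just past the commands
a fence covers, sequence number wraparound is ignored, and dumping the
IB and its BOs is stubbed out with a printf.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical, simplified bookkeeping; not a radeon driver structure. */
struct fence_rec {
    uint32_t seq;      /* fence sequence number */
    uint32_t ring_off; /* assumed: dword offset just past this fence's commands */
    int has_ib;        /* nonzero if the fence covers an indirect buffer */
};

/* Print ring dwords in [from, to), wrapping at ring_size. Returns the count. */
static size_t dump_ring_range(const uint32_t *ring, uint32_t ring_size,
                              uint32_t from, uint32_t to)
{
    size_t n = 0;

    assert(ring_size > 0 && from < ring_size && to < ring_size);
    for (uint32_t i = from; i != to; i = (i + 1) % ring_size) {
        printf("  ring[%u] = 0x%08x\n", (unsigned)i, (unsigned)ring[i]);
        n++;
    }
    return n;
}

/*
 * fences[] is ordered oldest to newest. Starting at the last signaled
 * fence, dump the ring segment leading up to every later (unsignaled)
 * fence, note where its IB and BOs would be dumped, then dump the tail
 * up to wptr. Returns the number of ring dwords dumped, or 0 if the
 * last signaled fence is not in the list.
 */
static size_t dump_after_lockup(const uint32_t *ring, uint32_t ring_size,
                                uint32_t wptr,
                                const struct fence_rec *fences, size_t nfences,
                                uint32_t last_signaled_seq)
{
    size_t total = 0;
    size_t i = 0;
    uint32_t pos;

    /* Locate the last fence the GPU reported as completed. */
    while (i < nfences && fences[i].seq != last_signaled_seq)
        i++;
    if (i == nfences)
        return 0;
    pos = fences[i].ring_off;

    for (i = i + 1; i < nfences; i++) {
        printf("ring content for unsignaled fence %u:\n",
               (unsigned)fences[i].seq);
        total += dump_ring_range(ring, ring_size, pos, fences[i].ring_off);
        if (fences[i].has_ib)
            printf("  (IB content and BO list for fence %u would go here)\n",
                   (unsigned)fences[i].seq);
        pos = fences[i].ring_off;
    }

    printf("ring tail up to WPTR:\n");
    total += dump_ring_range(ring, ring_size, pos, wptr);
    return total;
}
```

A real implementation would write into a preallocated binary buffer
rather than printing, which also fits the "one versioned binary file"
approach Dave describes.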