Re: Request for info on a big problems with nouveau driver

On Sat, Mar 16, 2019 at 12:33 PM Mauro Rossi <issor.oruam@xxxxxxxxx> wrote:
Hi Stéphane,

the good news is that Karol Herbst's patches are effectively mitigating
the GPU lockup.
It would really be a pity to lose and abandon the nouveau driver in android-x86,
while intel, radeon and amdgpu are working perfectly.

The Android GUI always reboots the same way when bringing back the main screen,
with the home button or the square menu button.

I've collected a log with drm.debug level 63
to see what is happening prior to "EGL-MAIN: DRI2: failed to create
screen" / EGL_NOT_INITIALIZED.
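
For readers wondering what "level 63" turns on: drm.debug is a bitmask, and 63 is 0x3f, i.e. the six lowest bits. A minimal sketch of the decoding follows. The bit names are taken from the kernel's include/drm/drm_print.h for kernels of roughly this era and should be treated as an assumption to check against the running kernel; `decode_drm_debug` is a hypothetical helper name.

```python
# Debug category bits, assumed to match include/drm/drm_print.h
# for kernels of this era (verify against the kernel in use).
DRM_DEBUG_BITS = {
    0x01: "CORE",
    0x02: "DRIVER",
    0x04: "KMS",
    0x08: "PRIME",
    0x10: "ATOMIC",
    0x20: "VBL",
}


def decode_drm_debug(mask):
    """Return the names of the debug categories enabled by `mask`."""
    return [name for bit, name in sorted(DRM_DEBUG_BITS.items()) if mask & bit]


print(decode_drm_debug(63))
```

With these assumptions, 63 enables every category listed in the table.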

The full log and tombstone are in the attachment;
could someone on the nouveau team decipher the errors?

That's a question more appropriate for the nouveau list, I am CCing the list.

Stéphane


The logs also show the DRM ioctl commands issued before the
DRI screen error.

Mauro

03-16 18:57:03.615     0     0 E         : 00a0 2 base507c_ntfy_set
03-16 18:57:03.615     0     0 E         : 00000060
03-16 18:57:03.615     0     0 E         : f0000000
03-16 18:57:03.615     0     0 E         : 0084 1 base907c_image_set
03-16 18:57:03.615     0     0 E         : 00000010
03-16 18:57:03.615     0     0 E         : 00c0 1 base907c_image_set
03-16 18:57:03.615     0     0 E         : fb0000fe
03-16 18:57:03.615     0     0 E         : 0400 5 base907c_image_set
03-16 18:57:03.615     0     0 E         : 00010000
03-16 18:57:03.615     0     0 E         : 00000000
03-16 18:57:03.615     0     0 E         : 04000500
03-16 18:57:03.615     0     0 E         : 00005004
03-16 18:57:03.615     0     0 E         : 0000cf00
03-16 18:57:03.615     0     0 E         : 0080 1 base507c_update
03-16 18:57:03.615     0     0 E         : 00000000
03-16 18:57:03.616  2729  4165 W EGL-MAIN: DRI2: failed to create dri screen
03-16 18:57:03.616  2729  4165 W EGL-MAIN: DRI2: failed to create screen
03-16 18:57:03.617  2729  4165 W libEGL  : eglInitialize(0xad3ab800) failed (EGL_NOT_INITIALIZED)
03-16 18:57:03.617  2729  4165 I system_server: android::hardware::configstore::V1_0::ISurfaceFlingerConfigs::hasWideColorDisplay retrieved: 0
03-16 18:57:03.617  2729  4165 I OpenGLRenderer: Initialized EGL, version 1.4
03-16 18:57:03.617  2729  4165 D OpenGLRenderer: Swap behavior 2
03-16 18:57:03.617  2729  4165 F OpenGLRenderer: Failed to choose config, error = EGL_NOT_INITIALIZED
--------- beginning of crash
03-16 18:57:03.617  2729  4165 F libc    : Fatal signal 6 (SIGABRT), code -6 in tid 4165 (RenderThread), pid 2729 (system_server)
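
When reading the full attached log, it may help to isolate what was logged just before the first EGL failure. A minimal sketch, assuming the logcat has been saved as a list of text lines; `lines_before` is a hypothetical helper, and the sample lines are copied from the excerpt above.

```python
def lines_before(log_lines, marker, count=5):
    """Return up to `count` lines preceding the first line containing `marker`."""
    for index, line in enumerate(log_lines):
        if marker in line:
            return log_lines[max(0, index - count):index]
    return []  # marker never appears


sample = [
    "03-16 18:57:03.615     0     0 E         : 0080 1 base507c_update",
    "03-16 18:57:03.615     0     0 E         : 00000000",
    "03-16 18:57:03.616  2729  4165 W EGL-MAIN: DRI2: failed to create dri screen",
]

for line in lines_before(sample, "failed to create dri screen", count=2):
    print(line)
```

This only slices the log by position; it does not show that the preceding kernel lines caused the failure.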

On Tue, Mar 5, 2019 at 8:55 AM Mauro Rossi <issor.oruam@xxxxxxxxx> wrote:
>
> Hi,
> one of the problems (the Play Store crash) was resolved with the following commit:
> http://git.osdn.net/view?p=android-x86/frameworks-base.git;a=commit;h=d488a6c2bbedc06fc22942555d0157e7bf09f135
>
> Now the remaining one, affecting the dEQP-EGL multithreading tests and
> RenderThread in general,
> has been traced in the attached logs.
>
> It seems to be a problem similar to "a second libEGL call failing": RenderThread
> fails to create the dri screen, and the process is then killed because
> Android's attempt to load an EGL config fails and is treated as fatal.
> We just need to find the root cause of the failure.
>
> In the logcat there is a clue of what is happening:
>
> --------- beginning of crash
> 03-04 20:50:56.762  1440  1440 E AndroidRuntime: FATAL EXCEPTION: main
> 03-04 20:50:56.762  1440  1440 E AndroidRuntime: Process: com.android.systemui, PID: 1440
> 03-04 20:50:56.762  1440  1440 E AndroidRuntime: java.lang.NullPointerException: Attempt to invoke virtual method 'android.graphics.GraphicBuffer android.graphics.Bitmap.createGraphicBufferHandle()' on a null object reference
> 03-04 20:50:56.762  1440  1440 E AndroidRuntime: at com.android.systemui.recents.views.RecentsTransitionHelper.drawViewIntoGraphicBuffer(RecentsTransitionHelper.java:436)
>
> Mauro
>
> On Tue, Mar 5, 2019 at 1:29 AM Stéphane Marchesin <marcheu@xxxxxxxxxxxx> wrote:
> >
> >
> >
> > On Sat, Mar 2, 2019 at 12:08 AM Mauro Rossi <issor.oruam@xxxxxxxxx> wrote:
> >>
> >> Hi Stéphane,
> >>
> >> On Fri, Mar 1, 2019 at 11:24 PM Stéphane Marchesin <marcheu@xxxxxxxxxxxx> wrote:
> >> >
> >> >
> >> >
> >> > On Fri, Mar 1, 2019 at 4:30 AM Mauro Rossi <issor.oruam@xxxxxxxxx> wrote:
> >> >>
> >> >> Hi Stéphane,
> >> >>
> >> >> thanks for responding
> >> >>
> >> >> On Thu, Feb 28, 2019 at 9:56 PM Stéphane Marchesin <marcheu@xxxxxxxxxxxx> wrote:
> >> >> >
> >> >> >
> >> >> >
> >> >> > On Tue, Feb 19, 2019 at 6:54 PM Tomasz Figa <tfiga@xxxxxxxxxxxx> wrote:
> >> >> >>
> >> >> >> Hi Mauro,
> >> >> >>
> >> >> >> Thanks for your query. I'm not very active in the graphics area
> >> >> >> anymore, but let me add +Stéphane Marchesin , who should know the
> >> >> >> best.
> >> >> >>
> >> >> >> Best regards,
> >> >> >> Tomasz
> >> >> >>
> >> >> >> On Wed, Feb 20, 2019 at 3:00 AM Mauro Rossi <issor.oruam@xxxxxxxxx> wrote:
> >> >> >> >
> >> >> >> > Hi Tomasz,
> >> >> >> >
> >> >> >> > I wanted to ask for some help, or even just some information about how
> >> >> >> > nouveau works with the chromeos minigbm stack, because we have big
> >> >> >> > issues with drm_gralloc and gbm_gralloc.
> >> >> >> >
> >> >> >> > The nouveau gallium driver does not support multithreading, and oreo-x86
> >> >> >> > has introduced additional RenderThread scenarios which cause
> >> >> >> > instability.
> >> >> >> >
> >> >> >> > dEQP-EGL multithreading tests are causing GUI restarts, even with the
> >> >> >> > latest Karol Herbst patches that add per gl context mutex locking and per
> >> >> >> > fence mutex locking.
> >> >> >> > He said there is an additional race condition that may require another
> >> >> >> > major rewrite,
> >> >> >> > but he did not mention which one.
> >> >> >> >
> >> >> >> > I wanted to ask you for some info, in case you have any, or for
> >> >> >> > suggestions on how to avoid the problem.
> >> >> >> >
> >> >> >> > 1) Are you aware of problems with nouveau MT on chromeos and how
> >> >> >> > they were avoided?
> >> >> >> > At the moment I can boot with minigbm, but the navigation bar and menu
> >> >> >> > bar are transparent and invisible, so I was not able to check whether
> >> >> >> > minigbm has the same problems we have.
> >> >> >> >
> >> >> >> > 2) We are so stuck with nouveau support that I was thinking of exploring
> >> >> >> > another angle:
> >> >> >> > is it possible to disable the additional threads in the android-x86 code base for Oreo?
> >> >> >> > Do you have colleagues who could give an indication of how to do it?
> >> >> >> >
> >> >> >
> >> >> >
> >> >> > Hi Mauro,
> >> >> >
> >> >> > We don't officially support nouveau on Chrome OS (there are no devices which use it). The nouveau minigbm driver was written to be able to develop Chrome for Chrome OS on top of a Linux workstation with an nvidia GPU. In particular, we have never started Android with that configuration.
> >> >> >
> >> >> > Can you give more details on issue 1, i.e. what is invisible? Last I looked Chrome was working. Are you certain this is related to threading?
> >> >> >
> >> >> > Stéphane
> >> >>
> >> >> [minigbm issue]
> >> >>
> >> >> The problem with minigbm came up after trying to use minigbm
> >> >> as it is in the Chrome OS stack (which supports running Android
> >> >> applications AFAIK).
> >> >>
> >> >> The stock minigbm was not ready to boot in android-x86; lambdadroid
> >> >> added dma fb support and I added some required formats (RGBA, RGBX,
> >> >> RGB565)
> >> >> to be able to boot:
> >> >> https://github.com/maurossi/minigbm/commits/minigbm_fb
> >> >>
> >> >> Using that version of minigbm with android-x86 (oreo-x86), I see
> >> >> that the Android GUI top bar, bottom menu bar, icons and cursor are
> >> >> invisible/not rendered,
> >> >> even though blind interaction is possible.
> >> >> Maybe I've done something wrong, because the drm format selection in
> >> >> minigbm is not as easy to understand as the drm_gralloc and gbm_gralloc
> >> >> ones.
> >> >
> >> >
> >> > Yeah as I said, we never ran any Android with the nouveau minigbm driver, not ARC++, even less Android, so I don't know.
> >> >
> >> >>
> >> >>
> >> >> The GUI transparency (or missing rendering) with minigbm does not seem
> >> >> related to multiple threads using the same GL context;
> >> >> however, the GPU lockups and the failures of dEQP-EGL multithreading tests,
> >> >> which also happen with drm_gralloc and gbm_gralloc, are certainly related.
> >> >>
> >> >> [MT issues]
> >> >>
> >> >> Since it is already known that nouveau lacks the MT support
> >> >> of other mesa drivers (i965, radeon, amdgpu),
> >> >> and Karol Herbst submitted patches to mesa-dev to bring "per gl
> >> >> context mutex" and "per fence mutex locking" to nouveau,
> >> >> I tried to run CTS dEQP-EGL with mesa GLES/EGL built with those patches.
> >> >> The result was that the dEQP-EGL multithreading tests failed, causing GUI
> >> >> reboots or PC restarts.
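
The general idea behind a "per context mutex" can be illustrated in a few lines. This is only a sketch of serializing access to shared state with one lock per context; it is not the Mesa patches, and `FakeContext`, `submit` and `worker` are hypothetical names with no counterpart in nouveau.

```python
import threading


class FakeContext:
    """Stand-in for a GL context; not a real Mesa or nouveau object."""

    def __init__(self):
        self.lock = threading.Lock()  # one mutex per context
        self.commands = 0             # shared state that must not race

    def submit(self):
        # Every entry point touching the context state takes the context lock,
        # so the read-modify-write below cannot interleave between threads.
        with self.lock:
            current = self.commands
            self.commands = current + 1


def worker(ctx, n):
    for _ in range(n):
        ctx.submit()


ctx = FakeContext()
threads = [threading.Thread(target=worker, args=(ctx, 1000)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(ctx.commands)  # 4 threads x 1000 submissions each
```

A lock like this only protects state owned by the context itself; state shared across contexts (fences, for example) needs its own locking, which is presumably why the patches mention a separate per fence mutex.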
> >> >>
> >> >
> >> > I am surprised by that; we have no problem with android on radeon which uses gallium which would have the same issues.
> >>
> >> We have no problem with radeon either,
> >> but for nouveau there is a history of GPU lockups with android-x86 that continues to this day.
> >> Ilia Mirkin confirmed in several different bugzilla tickets that
> >> nouveau does not react well to multiple worker threads on the same gl
> >> context.
> >
> >
> >
> > Hmm if you get GPU lockups, yes that's a different problem.
> >
> >
> >>
> >>
> >> In fact, with some prototype mutex locking patches we had a mitigation
> >> for android-x86 releases from lollipop-x86 to nougat-x86.
> >>
> >> Karol Herbst submitted patches to mesa-dev last December for that
> >> exact same problem;
> >> the patches are not yet upstreamed, so technically the problem is still there.
> >>
> >> The current use case is android-x86, but the next GUI using
> >> multiple threads will have problems too.
> >>
> >> >
> >> >
> >> >> When I contacted Karol Herbst, he told me that there may be one additional
> >> >> race condition, but he did not clarify which one.
> >> >>
> >> >> What about launching dEQP-EGL on a platform other than Android, e.g.
> >> >> EGL on Wayland? Is it possible to see whether the tests also fail on a Linux
> >> >> platform?
> >> >
> >> >
> >> > We use the surfaceless/null backend for deqp. We have upstreamed it, you should be able to use that also. Otherwise I have used the glx backend successfully as well on my desktop.
> >>
> >> Could it be that in your scenario there is only one thread per gl
> >> context at a time?
> >>
> >
> > In general, most of deqp is one GL context at a time, unless you run the parallel deqp stuff. So yes it would probably help. Similarly Chrome OS is running pretty much in a single GPU process, so we wouldn't see that problem either when running nouveau.
> >
> >
> >>
> >> >
> >> >
> >> >>
> >> >> Are there similar tests in piglit?
> >> >
> >> >
> >> > I'm not aware of any, but I stopped using piglit years ago.
> >> >
> >> >>
> >> >>
> >> >> [Other issue appeared with Android 8 Oreo hardware bitmaps]
> >> >>
> >> >> System UI and Play Store crashes happen after a successful
> >> >> android-x86 boot with drm_gralloc and gbm_gralloc;
> >> >> these crashes seem to be closely related to this path:
> >> >> CreateHardwareBitmap -> CreateBitmap -> Null Pointer Exception.
> >> >> CreateHardwareBitmap (introduced in Android Oreo),
> >> >
> >> >
> >> > Seems like you are missing dri extensions?
> >>
> >> Checking the logcat, the boot with nouveau has all the extensions that
> >> other drivers have,
> >> but it lists DRI_IMAGE twice. Is that bad?
> >>
> >> 02-02 10:35:37.176  2489  2489 D vndksupport: Loading /vendor/lib/egl/libGLES_mesa.so from current namespace instead of sphal namespace.
> >> 02-02 10:35:37.188  2489  2489 D libEGL  : loaded /vendor/lib/egl/libGLES_mesa.so
> >> 02-02 10:35:37.251  2489  2489 D vndksupport: Loading /vendor/lib/hw/gralloc.gbm.so from current namespace instead of sphal namespace.
> >> 02-02 10:35:37.253  2489  2489 I EGL-MAIN: found extension DRI_Core version 2
> >> 02-02 10:35:37.253  2489  2489 I EGL-MAIN: found extension DRI_IMAGE_DRIVER version 1
> >> 02-02 10:35:37.253  2489  2489 I EGL-MAIN: found extension DRI_ConfigOptions version 2
> >> 02-02 10:35:37.257  2489  2489 I EGL-MAIN: found extension DRI_TexBuffer version 2
> >> 02-02 10:35:37.257  2489  2489 I EGL-MAIN: found extension DRI2_Flush version 4
> >> 02-02 10:35:37.257  2489  2489 I EGL-MAIN: found extension DRI_IMAGE version 17
> >> 02-02 10:35:37.257  2489  2489 I EGL-MAIN: found extension DRI_IMAGE version 17
> >> 02-02 10:35:37.257  2489  2489 I EGL-MAIN: found extension DRI_RENDERER_QUERY version 1
> >> 02-02 10:35:37.257  2489  2489 I EGL-MAIN: found extension DRI_CONFIG_QUERY version 1
> >> 02-02 10:35:37.257  2489  2489 I EGL-MAIN: found extension DRI2_Fence version 2
> >> 02-02 10:35:37.257  2489  2489 I EGL-MAIN: found extension DRI2_Interop version 1
> >> 02-02 10:35:37.257  2489  2489 I EGL-MAIN: found extension DRI_NoError version 1
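
Comparing the extension list between drivers is easier with a small script than by eye. A minimal sketch, assuming each logcat entry sits on a single line (mail-wrapped lines rejoined first); `duplicate_extensions` is a hypothetical helper, and the sample lines are copied from the log above.

```python
import re
from collections import Counter

EXTENSION_RE = re.compile(r"found extension (\S+) version \d+")


def duplicate_extensions(log_lines):
    """Return {extension_name: count} for extensions reported more than once."""
    counts = Counter()
    for line in log_lines:
        match = EXTENSION_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return {name: n for name, n in counts.items() if n > 1}


sample = [
    "02-02 10:35:37.257  2489  2489 I EGL-MAIN: found extension DRI2_Flush version 4",
    "02-02 10:35:37.257  2489  2489 I EGL-MAIN: found extension DRI_IMAGE version 17",
    "02-02 10:35:37.257  2489  2489 I EGL-MAIN: found extension DRI_IMAGE version 17",
    "02-02 10:35:37.257  2489  2489 I EGL-MAIN: found extension DRI2_Fence version 2",
]

print(duplicate_extensions(sample))
```

The script only reports repeated names; whether a repeated DRI_IMAGE entry is harmful is a separate question it cannot answer.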
> >>
> >
> > Can you put it in gdb and see where the NULL crash is? One can only intuit about what's going on otherwise.
> >
> > Stéphane
> >
> >
> >>
> >> >
> >> > Stéphane
> >> >
> >> >
> >> >>
> >> >> uses only one copy
> >> >> of the bitmap instead of two. Are there any restrictions in nouveau with
> >> >> RGBA/RGBX, BGRA hardware bitmaps?
> >> >>
> >> >> Thanks in advance for any info or suggestions.
> >> >> I am available and ready to support testing/verification to see the
> >> >> MT and HardwareBitmap issues solved.
> >> >>
> >> >> Mauro
> >> >>
> >> >>
> >> >>
> >> >> >
> >> >> >
> >> >> >>
> >> >> >> > Mauro
_______________________________________________
Nouveau mailing list
Nouveau@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/nouveau
