I don't have time to go chasing this stuff any further on your behalf,
but it *does* smell to me like an icache management problem. Remember,
MIPS processors almost universally have split I/D caches and no
coherence support between them, so if you either (a) forget to do an
explicit D-cache write-back operation after copying to a page mapped
write-back that's going to be used as instructions/text, or (b) forget
to do an explicit I-cache invalidate when you re-use a page for
instructions that has been previously used for a different instruction
page, you will have problems, even without going into DMA I/O coherence
issues. If your problem were (b), though, you'd be seeing bad answers,
segmentation violations, bus errors, etc., at least as often as you'd
be seeing illegal instruction exceptions. So my money would be on (a).

The need for cache management is so fundamental to Linux for MIPS that
all the necessary general hooks have been there for years. If I were
you, I'd focus on the definitions of the primitives that you spotted in
c-r4k.c. Does the stuff in the JZ_RISC section correspond to the
assembly language flush sequence done in the Ingenic patch to head.S?
Are you sure that the JZ_RISC section is in fact the version of those
functions that's being built into your kernel?

Regards,

Kevin K.

Nils Faerber wrote:
> Hi Kevin!
>
> Kevin D. Kissell wrote:
>> The only thing that you've mentioned below that really makes me think
>> that you're looking at a kernel bug is the comment about things not
>> failing under GDB. But if *any* of the programs that are failing fail
>> under gdb, I'd want to know just what instruction is at the place
>> where they're taking a SIGILL. If gdb heisenbergs things too much,
>> then the basic brute force thing to do would be to instrument the
>> kernel itself to report on what happened, and what it sees at the
>> "bad instruction" address, using printk. If the memory value actually
>> looks like a legit instruction, it would confirm the hypothesis that
>> you've got an icache maintenance problem. I note that the Ingenic
>> patch has a "flushcaches" routine that has hardwired assumptions
>> about the cache organization. Could those be incorrect on the chip
>> you're using?
>
> Thanks for giving the issue some thought!
> Unfortunately, I have to admit that my GDB assumption was not all that
> correct :( After *a*lot* more tries I found an application that also
> fails inside GDB. With some more tries I can now confirm that
> applications fail at random points: it is not a single instruction
> that causes the fault.
> So I think your memory/cache issue theory sounds pretty interesting...
> I just had a look at the JZ4730 code (in arch/mips/jz4730/) and the
> only mention of a cache flush is in pm.c, which is only executed when
> going to sleep (i.e. CPU deep sleep, aka s2ram).
> arch/mips/mm/c-r4k.c also contains a JZ_RISC section for setting up
> cache options, and arch/mips/mm/tlbex.c has a special TLB case for the
> JZ. Those look promising! I can very well imagine cases where a wrong
> cache flush could cause such or similar problems.
>
>> Regards, and happy hunting,
>
> Happy? Maybe once I have found it.
> The annoying thing about this is that Ingenic is not very helpful. I
> have emailed them several times already asking for the full datasheet
> of the CPU, with no reply at all yet. The datasheet they have on their
> web page is just the brief with about 60 pages, which is not very
> helpful when you are looking for details like cache handling.
> So I will have to resort to experiments: trial and error.
>
> Thank you very much for your thoughts and ideas!
>
>> Kevin K.
>
> Cheers
>   nils faerber
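
For reference, cases (a) and (b) described at the top of this message
can be illustrated from user space. What follows is only a minimal
sketch, assuming GCC or Clang on Linux; install_code() is a
hypothetical helper name, not anything from the kernel or the Ingenic
patch. It copies instruction bytes into fresh memory and then asks the
compiler builtin __builtin___clear_cache() to do whatever D-cache
write-back and I-cache invalidate the target needs before the bytes are
executed. Inside the kernel the equivalent work is done by the cache
hooks in c-r4k.c, which this sketch does not touch.

```c
#define _DEFAULT_SOURCE          /* for MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/*
 * install_code() is a hypothetical helper for illustration only.
 * It copies `len` bytes of machine code into a new anonymous mapping
 * and returns a pointer to it, or NULL on failure.
 */
void *install_code(const void *src, size_t len)
{
    assert(src != NULL);
    if (len == 0)
        return NULL;

    /* Writable, cacheable memory: the copy below goes through the D-cache. */
    char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return NULL;

    memcpy(buf, src, len);

    /*
     * Case (a): the new bytes must be written back from the D-cache.
     * Case (b): stale I-cache lines for this range must be invalidated.
     * The builtin covers both on targets with split, non-coherent
     * caches and is a no-op where the hardware keeps them coherent.
     */
    __builtin___clear_cache(buf, buf + len);

    /* Only now make the region executable (and no longer writable). */
    if (mprotect(buf, len, PROT_READ | PROT_EXEC) != 0) {
        munmap(buf, len);
        return NULL;
    }
    return buf;
}
```

If the __builtin___clear_cache() call is left out, the code still
behaves correctly on machines with coherent I/D caches, which is why
this class of bug tends to show up only on targets like MIPS.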