suspend/resume memory corruption on Dell Latitude D830 -- help please

Hi,

Kernel: 2.6.24.3 x86_64
Apologies in advance: I cannot file a proper bug report because I do not know how to reproduce this outside of nvidia/X :(

I've written a small test program that allocates memory, initializes it, and prompts me to "suspend/resume and press enter to verify". When I suspend (to RAM, not disk), resume, and press enter, I sometimes see memory corruption. If the test does not see corruption, it frees the memory, reallocates, initializes, and prompts again. If the test sees corruption, it does not free/realloc; it just reinitializes the same buffer. (Code included at the end.)

So, once I see corruption, the memory is re-corrupted in exactly the same way every time I suspend/resume.

I've worked with Dell and have replaced the memory, system board, and processor. So I'm thinking there is a 99.99% probability that the problem is software (including BIOS).

For memory, I have two 2 GB DIMMs, and if both are installed,
I can run the same test under x86_64 Linux and 32-bit Vista and see exactly the same corruption signature:
    Two 16-bit values get corrupted:
        one at offset 0x09a gets changed to 0x0047
        one at offset 0x0a2 gets changed to 0x1200
(Actually, it is the exact same address until I exit the test and start over.)
It appears that the same physical page of memory is being written to every time.
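
To compare failures across runs, where the virtual address changes but the offset within the page should not, the in-page offset can be computed from each reported address. This is a minimal sketch, not part of the test program below; the helper name page_offset is hypothetical, and it assumes the offsets quoted above (0x09a, 0x0a2) are offsets within a page.

```c
#include <assert.h>
#include <stdint.h>             /* uintptr_t */
#include <unistd.h>             /* sysconf */

/* Hypothetical helper: return the offset of addr within its page.
 * Relies on the page size being a power of two. */
static unsigned long
page_offset( const void *addr )
{
    long page_size = sysconf( _SC_PAGESIZE );

    assert( page_size > 0 );
    return (unsigned long)( (uintptr_t)addr & (uintptr_t)(page_size - 1) );
}
```

Printing page_offset( ptr+ii ) next to the %p value in the verify loop would show whether two runs hit the same place in a page even when the addresses differ.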

So, it appears, so far, that the problem is not with Linux per se.

Some other facts:
  o  If I run with 2 GB under 32-bit Vista, the test seems to always pass.
  o  If I run with 2 GB under x86_64 Linux, I still get failures.
  o  There are a couple of other failure signatures, at least one of which is, again, exactly the same on Vista as on Linux. This one happens much less often, but involves at least one contiguous chunk of 152 bytes being corrupted. (I can supply the pattern if anyone is interested.)
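
A contiguous chunk such as the 152-byte one is easier to recognize if the verify pass groups consecutive bad dwords into runs instead of printing one line per dword. This is a minimal sketch under that assumption; the function name report_runs is hypothetical and is not part of the test program below. It reuses the same MEM_VAL fill pattern.

```c
#include <assert.h>
#include <stddef.h>             /* size_t */
#include <stdio.h>              /* fprintf */

#define MEM_VAL 0xdeadbeef

/* Hypothetical helper: scan n_dwords of mem, group consecutive mismatches
 * into runs, print one line per run to out (skipped if out is NULL),
 * and return the number of runs found. */
static size_t
report_runs( const unsigned int *mem, size_t n_dwords, FILE *out )
{
    size_t runs = 0;
    size_t ii = 0;

    assert( mem != NULL || n_dwords == 0 );
    while (ii < n_dwords)
    {
        size_t start;

        if (mem[ii] == MEM_VAL) { ii++; continue; }
        start = ii;
        while (ii < n_dwords && mem[ii] != MEM_VAL) ii++;
        if (out != NULL)
            fprintf( out, "corrupt run at %p: %lu bytes\n"
                    , (const void *)(mem+start)
                    , (unsigned long)((ii-start)*sizeof(*mem)) );
        runs++;
    }
    return runs;
}
```

Note that a run is measured in whole dwords, so a corrupted 16-bit value shows up as a 4-byte run, and a 152-byte chunk that is dword-aligned shows up as a single 152-byte run.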

Here's where I need help (as I do not know much about the real details of ACPI suspend/resume): Is the OS supposed to leave some memory alone for the ACPI BIOS to use? Since the problem happens with 2 GB under x86_64 Linux but not with 2 GB under Vista, could there be a problem with Linux?
Is this obviously a Dell BIOS problem?
I'm wondering if there is nvidia BIOS involved?
What other tests can I do? If I suspend/resume at the text console, I run into the blank video problem. (Is vbetool post,... the best/only way to re-init the video? The executable I have seems to have some problems.) Can I safely ignore the blanked video and do the testing from a serial console? What else could it be? Do both Linux and Vista have bugs?

What other info should I collect?

Apologies if the consensus is that this is not a linux-acpi devel issue :( However, I'm thinking that some people on this list have contacts with Dell BIOS people, and if it is a Dell BIOS issue, this would (either way) be the most efficient way to get to the real problem.

As a bit of background...
For the past 8 months, I've been getting intermittent crashes of the OS and/or applications. I thought there was a problem with the HW, so I ran lots of tests, mainly Dell's diagnostics. But since the crashes seemed to happen most often shortly after resume (within a few seconds), I decided to try to suspend/resume during a memory test. Dell's tests do not support this, so I had to come up with my own. I first used "lucifer" http://www.ibiblio.org/pub/linux/utils/lucifer-1.0.tar.gz
but have now written my own (included below).

Please, any discussion or comments would be helpful.

Thanks,
Ron

The program I'm using:
#include <stdio.h>              /* printf */
#include <stdlib.h>             /* strtoul */
#include <unistd.h>             /* sleep */

#define USAGE "\
  usage: %s <mem_bytes>\n\
example: %s 11000000000\n\
", argv[0], argv[0]

#define MEM_VAL 0xdeadbeef

static void
bell( int bell_cnt )
{
    if (bell_cnt <= 0) return;
    printf("\007"); fflush(stdout);
    while (--bell_cnt)
    {   sleep(1);
        printf("\007"); fflush(stdout);
    }
    return;
}

int
main(  int      argc
     , char     *argv[] )
{
        long unsigned int       mem_bytes;
        long unsigned int       mem_dwords;
        long unsigned int       ii;
        unsigned int            *ptr;
        char                    buf[80];
        int                     error_found=0;

    if (argc <= 1) { printf( USAGE );exit(0); }

    mem_bytes = strtoul( argv[1], 0, 0 );
    printf( "testing %lu bytes\n", mem_bytes );
    mem_dwords = mem_bytes/4;
    printf( "testing %lu dwords (32bit)\n", mem_dwords );

    while (1)
    {
        if (!error_found) ptr = (unsigned int *)malloc( mem_dwords*4 );
        if (ptr == NULL)
        {   printf("malloc failed (too much mem?)\n"); exit(1);
        }

        printf( "initializing the mem..." ); fflush(stdout);
        for (ii=0; ii<mem_dwords; ii++) *(ptr+ii) = MEM_VAL;
        printf( " done\n" );

        printf( "susp/resume/press enter to test\n" ); bell(1);
        /* stop on EOF or read error instead of looping forever */
        if (fgets( buf, sizeof(buf), stdin ) == NULL) break;

        printf( "verifying the mem..." ); fflush(stdout);
        for (ii=0; ii<mem_dwords; ii++)
        {   if (*(ptr+ii) != MEM_VAL)
            {   printf( "error at %p: found 0x%08x (should be 0x%08x)\n"
                       , (void *)(ptr+ii), *(ptr+ii), MEM_VAL ); /* %p expects void * */
                error_found = 1;
            }
        }
        printf( " done\n" );

        if (!error_found) { free( ptr ); ptr = 0; }
        else              bell(2);
    }
    return (0);
}   /* main */

