Re: Cant find out MCE reason (CPU 35 BANK 8)

Vladimir Budnev <vladimir.budnev@xxxxxxxxx> · Tue, 22 Mar 2011 10:59:29 -0400

2011/3/22  <m.roth@xxxxxxxxx>

Vladimir Budnev wrote:

> 2011/3/22 <m.roth@xxxxxxxxx>

>> Vladimir Budnev wrote:

>> > 2011/3/22 <m.roth@xxxxxxxxx>

>> >> Vladimir Budnev wrote:

>> >> > 2011/3/21 <m.roth@xxxxxxxxx>

>> >> >> Vladimir Budnev wrote:

>> >> >> >

>> >> >> > We are running, Centos 4.8 on SuperMicro SYS-6026T-3RF with

>> >> >> > 2xIntel Xeon E5630 and 8xKingston KVR1333D3D4R9S/4G

>> >> >> >

>> >> >> > For some time we have lots of MCE in mcelog and we cant find out

>> >> >> > the reason.

>> >> >>

>> >> >> The only thing that shows there (when it shows, since sometimes it

>> >> >> doesn't seem to) is a hardware error. You *WILL* be replacing

>> >> >> hardware, sometime soon, like yesterday.

>> >> <snip>

>> > We have 2 quad core proc, so 8 cpu. 1/8=0 Is it cpu-a1 slot or depends on

>> > situation? I hope we will find those bustards ourselvs but hint would

>> > be great.

>> >

>> > And one more thing i cant funderstand ... if there is,say, 8 "cpu

>> > numbers" per each memory module(in our situation), why we see only 4 numbers

>> > and not 8 e.g. 0,1,2,3,4,5,6,7 ?

>>

>> I'm now confused about a lot: originally, you mentioned 53 - 57, was it?

>> That doesn't add up, since you say you have 2 quad core processors, for

>> a total of 8 cpus, and each of those processors have 6 banks, which would

>> mean each processor should only see six (directly). Where I'm confused

>> is how you could have cores 32-35, or 53-whatsit, when you only have 8

>> cores in two processors.

>

>  2 cpu each 8 cores and HT support. So 16 at max i think. for such way is

> it  ok?

Huh? Above, you say "2 quad core proc" - that's 8 cores over two processor

chips. HT support doesn't figure into it; if you use dmidecode or lshw, I

believe it will show you 8 cores, not 16.

That was a typo, sorry. We have 2 CPUs with 4 cores each, so 8 cores in total.

>  I really lost the idea line with those cpu to memory bank mappings...

Each processor will directly see the DIMMs associated with it, so that the

banks associated with each processor will be what directly affects the

cores. So, if you see something like

Mar 20 05:01:35 <system name> kernel:  Northbridge Error, node 0, core: 5

(these processors are 8-core), it means that one of the DIMMs in bank 0,

0-3, is bad.

You should see

       __

      |_0|  0 1 2 3

                 __

                |_1|  0 1 2 3

or whatever on the m/b, so one of the top ones there is affected. Is that

any clearer?

First of all, big thanks for helping, Mark.

In your example everything is OK, but I am lost with what we have.
Previously we received messages like the one I posted in the first mail:

CPU 51 BANK 8 TSC 8511e3ca77dc 
MISC 274d587f00006141 ADDR 807044840 
STATUS cc0055000001009f MCGSTATU

And the CPU numbers were always the same. I really don't know why mcelog shows such numbers, but that's what we have: always BANK 8, with 32,33,34,45 and 50,51,52,53 in the CPU field.
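
For reference, the STATUS word in a record like the one above can be decoded by hand. The following is a minimal Python sketch, assuming the architectural IA32_MCi_STATUS bit layout described in the Intel Software Developer's Manual; the function names are made up for illustration and are not part of mcelog.

```python
# Flag bit positions in IA32_MCi_STATUS (architectural layout, assumed here).
FLAGS = {
    "VAL": 63,    # register holds a valid error
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # error was NOT corrected by hardware
    "EN": 60,     # reporting was enabled for this error
    "MISCV": 59,  # MISC register content is valid
    "ADDRV": 58,  # ADDR register content is valid
    "PCC": 57,    # processor context may be corrupt
}


def decode_status(status_hex):
    """Split a hex STATUS string into flags and error codes."""
    status = int(status_hex, 16)
    flags = {name: bool((status >> bit) & 1) for name, bit in FLAGS.items()}
    return {
        "flags": flags,
        "mca_code": status & 0xFFFF,             # bits 15:0
        "model_code": (status >> 16) & 0xFFFF,   # bits 31:16
    }


def is_memory_controller_error(mca_code):
    """True if the code matches 000F 0000 1MMM CCCC (F bit ignored)."""
    return (mca_code & 0xEF80) == 0x0080


if __name__ == "__main__":
    decoded = decode_status("cc0055000001009f")
    set_flags = [name for name, on in decoded["flags"].items() if on]
    print("flags set:", ", ".join(set_flags))
    print("MCA code: 0x%04x" % decoded["mca_code"])
    print("memory controller error:",
          is_memory_controller_error(decoded["mca_code"]))
```

Under that assumption, the STATUS value above has UC clear (a corrected error) and an MCA code in the memory controller class, which is consistent with the DIMM theory but does not by itself say which slot is involved.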

You convinced us that it is a DIMM problem, so we decided to do a little research, which I described earlier in the thread. During that we swapped DIMM modules between slots, so now we see BANK 8 with CPU 1,2,3 and 18,29,20,21. It really seems that those numbers are somehow connected with the RAM modules.
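
To compare the picture before and after a swap, it can help to count records per CPU/BANK pair rather than eyeballing the log. This is only a sketch: it assumes each record starts with a line of the form "CPU n BANK m", as in the excerpt above, and the second sample record is a placeholder, not real output from this machine.

```python
import re
from collections import Counter

# Matches the "CPU <n> BANK <m>" header that starts each mcelog record.
RECORD = re.compile(r"CPU\s+(\d+)\s+BANK\s+(\d+)")


def tally(log_text):
    """Count MCE records per (cpu, bank) pair in mcelog text output."""
    return Counter((int(cpu), int(bank))
                   for cpu, bank in RECORD.findall(log_text))


if __name__ == "__main__":
    # First record header is from this thread; the second is a placeholder.
    sample = (
        "CPU 51 BANK 8 TSC 8511e3ca77dc\n"
        "MISC 274d587f00006141 ADDR 807044840\n"
        "CPU 50 BANK 8 TSC 0\n"
    )
    for (cpu, bank), count in sorted(tally(sample).items()):
        print("cpu %d bank %d: %d record(s)" % (cpu, bank, count))
```

Running it on a log saved before a swap and on one saved after makes it easier to see whether the set of CPU numbers really moved with the module.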

But... as I said, we have the following slots:
   CPU1    cpu1-a1 cpu1-a2 cpu1-a3 cpu1-b1 cpu1-b2 cpu1-b3
   CPU2    cpu2-a1 cpu2-a2 cpu2-a3 cpu2-b1 cpu2-b2 cpu2-b3

The modules are placed like this (V = module installed):

+------+---------+---------+---------+---------+---------+---------+
|      |    V    |    V    |    V    |    V    |  free   |  free   |
+------+---------+---------+---------+---------+---------+---------+
| CPU1 | cpu1-a1 | cpu1-a2 | cpu1-a3 | cpu1-b1 | cpu1-b2 | cpu1-b3 |
+------+---------+---------+---------+---------+---------+---------+

+------+---------+---------+---------+---------+---------+---------+
|      |    V    |    V    |    V    |    V    |  free   |  free   |
+------+---------+---------+---------+---------+---------+---------+
| CPU2 | cpu2-a1 | cpu2-a2 | cpu2-a3 | cpu2-b1 | cpu2-b2 | cpu2-b3 |
+------+---------+---------+---------+---------+---------+---------+

There is definitely something going on with the memory banks, because swapping the modules changed the MCE messages, but what exactly? Or have I interpreted it all wrong?

_______________________________________________
CentOS mailing list
CentOS@xxxxxxxxxx
http://lists.centos.org/mailman/listinfo/centos