amdgpu with 8+ cards for GPU mining?

christian.koenig@xxxxxxx (Christian König) · Fri, 16 Feb 2018 20:51:28 +0100

Hi Joseph,

I think I've figured out why you run into problems with this hardware 
configuration: the BIOS doesn't assign enough memory address space to 
the PCI root hub for this to work.

There is a 2.5GB window below the 4GB limit, starting at address 
0x58000000:
> 58000000-f7ffffff : PCI Bus 0000:00

And a 64GB window above the 4GB limit, starting at address 0x2000000000:
> 2000000000-2fffffffff : PCI Bus 0000:00
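
As a quick cross-check of those two sizes, here is a minimal sketch that computes the size of a window from a /proc/iomem-style line. The helper name window_size is made up for this illustration; the two input lines are the ones quoted above.

```python
GIB = 1 << 30

def window_size(iomem_line):
    """Return the size in bytes of one /proc/iomem-style range line."""
    # Example line: "58000000-f7ffffff : PCI Bus 0000:00"
    address_range = iomem_line.split(":", 1)[0].strip()
    start_hex, end_hex = address_range.split("-")
    # The end address is inclusive, hence the +1.
    return int(end_hex, 16) - int(start_hex, 16) + 1

low = window_size("58000000-f7ffffff : PCI Bus 0000:00")
high = window_size("2000000000-2fffffffff : PCI Bus 0000:00")
print(low / GIB, high / GIB)  # prints: 2.5 64.0
```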

Now your Polaris 10 cards have either 8GB or 4GB installed on each board, 
and in addition to the installed memory we need 2MB per card for the 
doorbell BAR. Since the assignments can basically only be done in powers 
of two, we end up with a requirement of 16GB of address space for each 
8GB card and 8GB of address space for each 4GB card.
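
The arithmetic behind those numbers can be sketched as follows. This is a simplified model of the rule of thumb above, not what the kernel's resource allocator literally does (it aligns each BAR individually and can pack things differently); the function names are made up for this illustration.

```python
GIB = 1 << 30
MIB = 1 << 20
DOORBELL_BAR = 2 * MIB  # doorbell BAR size per card, as stated above

def next_power_of_two(n):
    """Smallest power of two that is >= n (for n >= 1)."""
    return 1 << (n - 1).bit_length()

def address_space_needed(vram_bytes):
    """Rule-of-thumb address space for one card: VRAM plus the
    doorbell BAR, rounded up to the next power of two."""
    return next_power_of_two(vram_bytes + DOORBELL_BAR)

for vram_gib in (8, 4):
    need = address_space_needed(vram_gib * GIB)
    print(f"{vram_gib}GB card -> {need // GIB}GB of address space")
```

For the 8GB card, 8GB + 2MB is just over a power of two, so the requirement jumps to 16GB; the same effect turns 4GB into 8GB.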

For compatibility reasons the cards only advertise a 256MB window for 
the video memory BAR to the BIOS on boot and we later try to resize that 
to the real size of the installed memory.

The first three cards are behind a common PCIe bridge, and since we can't 
reprogram the bridge without turning all of them off at once, this resize 
operation fails:
> [    1.496085] amdgpu 0000:04:00.0: BAR 0: no space for [mem size 0x200000000 64bit pref]
> [    1.496208] amdgpu 0000:04:00.0: BAR 0: failed to assign [mem size 0x200000000 64bit pref]
> [    1.496332] amdgpu 0000:04:00.0: BAR 2: no space for [mem size 0x00200000 64bit pref]
> [    1.496455] amdgpu 0000:04:00.0: BAR 2: failed to assign [mem size 0x00200000 64bit pref]
> [    1.496581] pcieport 0000:02:00.0: PCI bridge to [bus 03-0a]
> [    1.496686] pcieport 0000:02:00.0:   bridge window [io 0x7000-0x9fff]
> [    1.496795] pcieport 0000:02:00.0:   bridge window [mem 0xf7600000-0xf78fffff]
> [    1.496919] pcieport 0000:02:00.0:   bridge window [mem 0xa0000000-0xf01fffff 64bit pref]
> [    1.497112] pcieport 0000:03:01.0: PCI bridge to [bus 04]
> [    1.497216] pcieport 0000:03:01.0:   bridge window [io 0x9000-0x9fff]
> [    1.497325] pcieport 0000:03:01.0:   bridge window [mem 0xf7800000-0xf78fffff]
> [    1.497450] pcieport 0000:03:01.0:   bridge window [mem 0xe0000000-0xf01fffff 64bit pref]
> [    1.497594] [drm] Not enough PCI address space for a large BAR.
> [    1.508628] [drm] Detected VRAM RAM=8192M, BAR=256M

Fortunately the driver manages to fall back to the original 256MB 
configuration and continues with that. That is a bit sub-optimal, but 
still not a real problem.

For the remaining cards this operation succeeds and we can actually see 
that they are working fine with the new setup:
> [    8.419414] amdgpu 0000:0c:00.0: BAR 2: releasing [mem 0x2ff0000000-0x2ff01fffff 64bit pref]
> [    8.426969] amdgpu 0000:0c:00.0: BAR 0: releasing [mem 0x2fe0000000-0x2fefffffff 64bit pref]
> [    8.434531] pcieport 0000:00:1c.6: BAR 15: releasing [mem 0x2fe0000000-0x2ff01fffff 64bit pref]
> [    8.442219] pcieport 0000:00:1c.6: BAR 15: assigned [mem 0x2080000000-0x21ffffffff 64bit pref]
> [    8.449789] amdgpu 0000:0c:00.0: BAR 0: assigned [mem 0x2100000000-0x21ffffffff 64bit pref]
> [    8.457390] amdgpu 0000:0c:00.0: BAR 2: assigned [mem 0x2080000000-0x20801fffff 64bit pref]
> [    8.464981] pcieport 0000:00:1c.6: PCI bridge to [bus 0c]
> [    8.472505] pcieport 0000:00:1c.6:   bridge window [io 0xe000-0xefff]
> [    8.480066] pcieport 0000:00:1c.6:   bridge window [mem 0xf7d00000-0xf7dfffff]
> [    8.487530] pcieport 0000:00:1c.6:   bridge window [mem 0x2080000000-0x21ffffffff 64bit pref]
> [    8.495020] amdgpu 0000:0c:00.0: VRAM: 4096M 0x000000F400000000 - 0x000000F4FFFFFFFF (4096M used)
> [    8.502610] amdgpu 0000:0c:00.0: GTT: 256M 0x0000000000000000 - 0x000000000FFFFFFF
> [    8.510215] [drm] Detected VRAM RAM=4096M, BAR=4096M
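
If you want to check quickly how many cards got the full-size BAR, something like the following sketch should do. It only looks at the "[drm] Detected VRAM" lines in the format shown in the logs above; the function name bar_report is made up for this illustration. Note that these lines carry no PCI address, so to know which card is which you still need the surrounding amdgpu lines.

```python
import re

# Matches lines such as: [drm] Detected VRAM RAM=8192M, BAR=256M
VRAM_LINE = re.compile(r"\[drm\] Detected VRAM RAM=(\d+)M, BAR=(\d+)M")

def bar_report(dmesg_text):
    """Return one (vram_mib, bar_mib, resized) tuple per matching line.

    resized is True when the BAR covers all of the installed VRAM.
    """
    report = []
    for match in VRAM_LINE.finditer(dmesg_text):
        vram, bar = int(match.group(1)), int(match.group(2))
        report.append((vram, bar, bar >= vram))
    return report

sample = """\
[    1.508628] [drm] Detected VRAM RAM=8192M, BAR=256M
[    8.510215] [drm] Detected VRAM RAM=4096M, BAR=4096M
"""

for vram, bar, resized in bar_report(sample):
    status = "full BAR" if resized else "fell back to small BAR"
    print(f"VRAM {vram}M, BAR {bar}M: {status}")
```

On a running system the input would be the saved output of dmesg instead of the sample string.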

Now what I think happens when you insert the ninth card is that the BIOS 
fails to assign even this small 256MB window to the card, so the card 
becomes completely unusable.

To narrow down this issue further I need the output of "sudo lspci 
-vvvv" WITHOUT the amdgpu driver loaded when 9 cards are installed. Only 
this way can I inspect the values the BIOS programmed into the PCI BARs.
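
A minimal sketch of how that capture could be guarded against running with the driver loaded by accident, assuming it is run as root on Linux. The function names and the output file name are made up for this illustration; /proc/modules and the lspci invocation are the only real interfaces used.

```python
import subprocess

def module_loaded(proc_modules_text, name="amdgpu"):
    """True if `name` appears as a module name in /proc/modules content."""
    # Each non-empty line of /proc/modules starts with the module name.
    return any(
        line.split()[0] == name
        for line in proc_modules_text.splitlines()
        if line.strip()
    )

def capture_lspci(output_path="lspci-9cards.txt"):
    """Save `lspci -vvvv` output, refusing to run while amdgpu is loaded."""
    with open("/proc/modules") as modules:
        if module_loaded(modules.read()):
            raise RuntimeError("amdgpu is loaded; boot with it blacklisted first")
    result = subprocess.run(
        ["lspci", "-vvvv"], capture_output=True, text=True, check=True
    )
    with open(output_path, "w") as out:
        out.write(result.stdout)

# Usage (as root, with 9 cards installed and amdgpu blacklisted):
#     capture_lspci()
```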

In addition, please provide the dmesg output from the actual crash, e.g. 
with 9 cards and amdgpu loaded manually, and/or a crash log captured over 
the network.

Thanks in advance,
Christian.

On 16.02.2018 at 19:42, Christian König wrote:
> On 16.02.2018 at 19:17, Joseph Wang wrote:
>> Here are the logs for the eight card case.
>>
>> cc'ing the Mageia linux group since I'm using that distribution for 
>> development.
>>
>> Three questions:
>>
>> 1) (this might be for the mageia people) What's the easiest way of 
>> booting up the system without loading in the amdgpu module?
>
> Usually modprobe.blacklist=amdgpu should work independently of the 
> distribution.
>
> Christian.
>
>>
>> 2) What's the easiest way of generating a patch from the amd-gfx 
>> repository against the mainline kernel? The reason for this is that 
>> it's easier for me to do local configuration management if I generate 
>> rpms locally.
>>
>> 3) Also, right now I'm running a mix of software. I take the opencl 
>> legacy drivers from the rpm package and they work against amdgpu. 
>> The trouble is that they replace the mesa drivers and so I can't get 
>> opencl. I'd like to move on to ROCm but that involves a lot of 
>> configuration management.
>>
>> The good news is that I have a system with 8 gpu cards that works as 
>> a mining system.