Hi Alex, > No. For the current source code, I think the premap and no-op is not > working. > Indeed, we don't set bo->mem.bus.addr in amdgpu_ttm_io_mem_reserve() any more. Felix will probably want to fix that for the KFD branch. Anyway, as I said not unmapping the page tables is harmless compared to not releasing the memory backing it. So please just do as I told you and change the interruptible reservation to a non-interruptible. Regards, Christian. Am 20.04.2017 um 23:56 schrieb Xie, AlexBin: > > Hi Christian, > > > I read amdgpu_ttm_io_mem_reserve() and amdgpu_ttm_io_mem_free() and > relevant codes from amdgpu_vram_scratch_init and > amdgpu_vram_scratch_fini. > > > No. For the current source code, I think the premap and no-op is not > working. > > > I add printk to prove. You can see bo_kmap_type is an invalid value. > ioremap_wc is really called to map the IO range into vmalloc space. > > > ... > > Apr 20 16:31:18 axie-System-Product-Name kernel: [ 106.759623] > entering amdgpu_vram_scratch_init. > Apr 20 16:31:18 axie-System-Product-Name kernel: [ 106.759631] > scratch ioremap_wc > Apr 20 16:31:18 axie-System-Product-Name kernel: [ 106.759631] > bo_kmap_type = 129 > Apr 20 16:31:18 axie-System-Product-Name kernel: [ 106.759632] bus > address = (null) > Apr 20 16:31:18 axie-System-Product-Name kernel: [ 106.759632] > is_iomem = 1 > Apr 20 16:31:18 axie-System-Product-Name kernel: [ 106.759633] > leaving amdgpu_vram_scratch_init. > ... > > I don't have log on rmmod AMDGPU yet. There are quite some settings to > make that happen in my computer. > I think they are symemtric. Both mapping and unmapping are not no-op. > > Here is the printk patch for your reference. > > From 25f95239c23225008e4b59f30b9b5363fb321f94 Mon Sep 17 00:00:00 2001 > From: Alex Xie <AlexBin.Xie at amd.com> > Date: Thu, 20 Apr 2017 17:48:49 -0400 > Subject: [PATCH] A hack to trace premap and noop. > > Change-Id: I61fbbdbd82f62433e229b2e7e97be7a780ea8afa > Signed-off-by: Alex Xie <AlexBin.Xie at amd.com> > --- > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 22 ++++++++++++++++++++++ > drivers/gpu/drm/ttm/ttm_bo.c | 1 + > drivers/gpu/drm/ttm/ttm_bo_util.c | 29 > ++++++++++++++++++++++++++--- > include/drm/ttm/ttm_bo_api.h | 1 + > 4 files changed, 50 insertions(+), 3 deletions(-) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > index fbb4afb..a537ea1 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > @@ -313,6 +313,7 @@ static void amdgpu_block_invalid_wreg(struct > amdgpu_device *adev, > static int amdgpu_vram_scratch_init(struct amdgpu_device *adev) > { > int r; > +printk("entering amdgpu_vram_scratch_init."); > if (adev->vram_scratch.robj == NULL) { > r = amdgpu_bo_create(adev, AMDGPU_GPU_PAGE_SIZE, > @@ -340,16 +341,36 @@ static int amdgpu_vram_scratch_init(struct > amdgpu_device *adev) > amdgpu_bo_unpin(adev->vram_scratch.robj); > amdgpu_bo_unreserve(adev->vram_scratch.robj); > +/* Would like a printk to see if map / unmap is noop*/ > +adev->vram_scratch.robj->tbo.mem.bus.printk = true; > + > +if (adev->vram_scratch.robj->kmap.bo_kmap_type == ttm_bo_map_premapped) > +printk("amdgpu_vram_scratch premapped!"); > + > +printk("bo_kmap_type = %d", adev->vram_scratch.robj->kmap.bo_kmap_type); > +printk("bus address = %p", adev->vram_scratch.robj->tbo.mem.bus.addr); > +printk("is_iomem = %d", adev->vram_scratch.robj->tbo.mem.bus.is_iomem); > +printk("leaving amdgpu_vram_scratch_init."); > + > return r; > } > static void amdgpu_vram_scratch_fini(struct amdgpu_device *adev) > { > int r; > +printk("entering amdgpu_vram_scratch_fini."); > if (adev->vram_scratch.robj == NULL) { > return; > } > + > +if (adev->vram_scratch.robj->kmap.bo_kmap_type == ttm_bo_map_premapped) > +printk("amdgpu_vram_scratch premapped!"); > + > +printk("bo_kmap_type = %d", adev->vram_scratch.robj->kmap.bo_kmap_type); > +printk("bus address = %p", adev->vram_scratch.robj->tbo.mem.bus.addr); > +printk("is_iomem = %d", adev->vram_scratch.robj->tbo.mem.bus.is_iomem); > + > r = amdgpu_bo_reserve(adev->vram_scratch.robj, false); > if (likely(r == 0)) { > amdgpu_bo_kunmap(adev->vram_scratch.robj); > @@ -357,6 +378,7 @@ static void amdgpu_vram_scratch_fini(struct > amdgpu_device *adev) > amdgpu_bo_unreserve(adev->vram_scratch.robj); > } > amdgpu_bo_unref(&adev->vram_scratch.robj); > +printk("leaving amdgpu_vram_scratch_fini."); > } > /** > diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c > index 989b98b..9b05d54 100644 > --- a/drivers/gpu/drm/ttm/ttm_bo.c > +++ b/drivers/gpu/drm/ttm/ttm_bo.c > @@ -1178,6 +1178,7 @@ int ttm_bo_init_reserved(struct ttm_bo_device *bdev, > bo->mem.page_alignment = page_alignment; > bo->mem.bus.io_reserved_vm = false; > bo->mem.bus.io_reserved_count = 0; > +bo->mem.bus.printk = false; > bo->moving = NULL; > bo->mem.placement = (TTM_PL_FLAG_SYSTEM | TTM_PL_FLAG_CACHED); > bo->persistent_swap_storage = persistent_swap_storage; > diff --git a/drivers/gpu/drm/ttm/ttm_bo_util.c > b/drivers/gpu/drm/ttm/ttm_bo_util.c > index bf6e216..9d06952 100644 > --- a/drivers/gpu/drm/ttm/ttm_bo_util.c > +++ b/drivers/gpu/drm/ttm/ttm_bo_util.c > @@ -526,14 +526,24 @@ static int ttm_bo_ioremap(struct > ttm_buffer_object *bo, > if (bo->mem.bus.addr) { > map->bo_kmap_type = ttm_bo_map_premapped; > map->virtual = (void *)(((u8 *)bo->mem.bus.addr) + offset); > +if (bo->mem.bus.printk) > +printk ("scratch premapping"); > + > } else { > map->bo_kmap_type = ttm_bo_map_iomap; > -if (mem->placement & TTM_PL_FLAG_WC) > +if (mem->placement & TTM_PL_FLAG_WC) { > map->virtual = ioremap_wc(bo->mem.bus.base + bo->mem.bus.offset + offset, > size); > -else > +if (bo->mem.bus.printk) > +printk ("scratch ioremap_wc"); > + > +} > +else { > map->virtual = ioremap_nocache(bo->mem.bus.base + bo->mem.bus.offset + > offset, > size); > +if (bo->mem.bus.printk) > +printk ("scratch ioremap_nocache"); > +} > } > return (!map->virtual) ? -ENOMEM : 0; > } > @@ -618,21 +628,34 @@ void ttm_bo_kunmap(struct ttm_bo_kmap_obj *map) > struct ttm_mem_type_manager *man = > &bo->bdev->man[bo->mem.mem_type]; > -if (!map->virtual) > +if (!map->virtual) { > +if (bo->mem.bus.printk) > +printk ("scratch unmap return earlier"); > return; > +} > switch (map->bo_kmap_type) { > case ttm_bo_map_iomap: > +if (bo->mem.bus.printk) > +printk ("scratch iounmap"); > iounmap(map->virtual); > break; > case ttm_bo_map_vmap: > +if (bo->mem.bus.printk) > +printk ("scratch vunmap"); > vunmap(map->virtual); > break; > case ttm_bo_map_kmap: > +if (bo->mem.bus.printk) > +printk ("scratch kunmap"); > kunmap(map->page); > break; > case ttm_bo_map_premapped: > +if (bo->mem.bus.printk) > +printk ("scratch unmap ttm_bo_map_premapped"); > break; > default: > +if (bo->mem.bus.printk) > +printk ("scratch unmap bug"); > BUG(); > } > (void) ttm_mem_io_lock(man, false); > diff --git a/include/drm/ttm/ttm_bo_api.h b/include/drm/ttm/ttm_bo_api.h > index 2d0f63e..f74a554 100644 > --- a/include/drm/ttm/ttm_bo_api.h > +++ b/include/drm/ttm/ttm_bo_api.h > @@ -70,6 +70,7 @@ struct ttm_bus_placement { > boolis_iomem; > boolio_reserved_vm; > uint64_t io_reserved_count; > +bool printk; > }; > -- > 2.7.4 > > > > > Thanks, > > Alex Bin > > ------------------------------------------------------------------------ > *From:* Christian König <deathsimple at vodafone.de> > *Sent:* Thursday, April 20, 2017 4:43 AM > *To:* Xie, AlexBin; Zhou, David(ChunMing); amd-gfx at lists.freedesktop.org > *Subject:* Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO > Hi AlexBin, > >> Missing kunmap mapping in vmalloc will make kernel master page table >> incorrect. > That's what I tried to explain yesterday, but unfortunately didn't had > time to do so. There is not corruption of the kernel master page table > in this case! > > The call of ttm_bo_kunmap is completely optional, take a look at > amdgpu_ttm_io_mem_reserve() and amdgpu_ttm_io_mem_free(). > > The aperture is kept mapped into the page tables for the whole time > the driver is loaded. So this is a complete no-op and only done for > consistency. > >> It is good that you agree that there is no real world bad example >> caused by my patch. I will not discuss whether it is an improvement >> or not now to save time for both of us. >> > Great at least we can now agree to completely drop this patch. > > Thanks, > Christian. > > Am 19.04.2017 um 21:30 schrieb Xie, AlexBin: >> >> Hi Christian, >> >> >> Missing kunmap mapping in vmalloc will make kernel master page table >> incorrect. I would not call such issue as completely harmless. Please >> note that AMD graphic driver can run in 32 bit system. In 32 bit >> system, vmalloc address space is much smaller than size of most GPU VRAM. >> >> >> amdgpu_bo_free_kernel has same issue as amdgpu_vram_scratch_fini. 1. >> It calls amdgpu_bo_reserve interruptible too. 2. It misses kunmap >> when amdgpu_bo_reserve returns error too. As result, kernel master >> page table can become incorrect, or as you call it "completely >> harmless vmalloc space leaking". >> >> >> Because amdgpu_bo_free_kernel is used in more places, such as psp >> command submission, there will be bigger chance to have other usage >> where signal is not blocked. This will become a real bug. >> >> >> I am thinking that we may fix the issue completely when TTM releases >> BO. But that is a bigger change. >> >> >> It is good that you agree that there is no real world bad example >> caused by my patch. I will not discuss whether it is an improvement >> or not now to save time for both of us. >> >> >> Thanks, >> >> Alex Bin Xie >> >> >> ------------------------------------------------------------------------ >> *From:* Christian König <deathsimple at vodafone.de> >> *Sent:* Wednesday, April 19, 2017 7:50 AM >> *To:* Xie, AlexBin; Zhou, David(ChunMing); amd-gfx at lists.freedesktop.org >> *Subject:* Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO >>> Without correctly kunmap, page table is corrupted. Page entries >>> point to wrong memory locations. You might call it completely >>> harmless. But I think this is a severe bug. Leaking memory is better >>> than a corrupted page table. Think security. >> We are talking about the page tables for the vmalloc area in the >> kernel here, so no security problem. Leaking memory is much more >> problematic. >> >>> Would you provide any document and reference by saying" It is >>> impossible to receive a signal during module load/unload"? For >>> example, if the unload stuck in a lock, can CTRL+C stop the unload? >>> >> No, CTRL+C doesn't abort module load/unload. There have been patches >> to changes this a while ago, but IIRC it broke a whole bunch of >> driver relying on this. >> >>> What about there is some other return error? What about in future >>> somebody improve amdgpu_bo_reserve to return other errors, >>> then function amdgpu_vram_scratch_fini becomes buggy? >>> >> Yes, that is indeed an issue. For example -EDEADLK is possible as >> well. That's why I said we should use amdgpu_bo_free_kernel() instead. >> >>> While I am thinking whether there is a better way for the current >>> situation, would you give a real world example that my patch really >>> not working? Then we can address it. >>> >> I don't think there is because the driver can't receive a signal >> during load/unload, but the problem is rather that the patch doesn't >> improve the situation at all. >> >> Regards, >> Christian. >> >> Am 19.04.2017 um 13:37 schrieb Xie, AlexBin: >>> >>> Hi Christian, >>> >>> >>> Without correctly kunmap, page table is corrupted. Page entries >>> point to wrong memory locations. You might call it completely >>> harmless. But I think this is a severe bug. Leaking memory is better >>> than a corrupted page table. Think security. >>> >>> >>> Would you provide any document and reference by saying" It is >>> impossible to receive a signal during module load/unload"? For >>> example, if the unload stuck in a lock, can CTRL+C stop the unload? >>> >>> >>> If "It is impossible to receive a signal during module load/unload", >>> interruptible waiting is fine too, because function >>> amdgpu_bo_reserve will return successfully. >>> >>> >>> What about there is some other return error? What about in future >>> somebody improve amdgpu_bo_reserve to return other errors, >>> then function amdgpu_vram_scratch_fini becomes buggy? >>> >>> >>> While I am thinking whether there is a better way for the current >>> situation, would you give a real world example that my patch really >>> not working? Then we can address it. >>> >>> >>> Thanks, >>> >>> Alex Bin >>> >>> >>> ------------------------------------------------------------------------ >>> *From:* Christian König <deathsimple at vodafone.de> >>> *Sent:* Wednesday, April 19, 2017 2:35 AM >>> *To:* Xie, AlexBin; Zhou, David(ChunMing); amd-gfx at lists.freedesktop.org >>> *Subject:* Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO >>> Hi AlexBin, >>> >>> the answer is ttm_bo_kunmap isn't called at all and yes in the case >>> of an iomap we leak the address space reserved. >>> >>> But that is completely harmless on a 64bit system compared to >>> leaking the memory backing the address space. >>> >>> Using amdgpu_bo_free_kernel() instead of openly coding it here is >>> probably a good idea. >>> >>> Additional to that it's probably a good idea to set the no_intr flag >>> when reserving kernel BOs. It is impossible to receive a signal >>> during module load/unload, but it's probably better to document that >>> in the code as well. >>> >>> Regards, >>> Christian. >>> >>> Am 18.04.2017 um 20:54 schrieb Xie, AlexBin: >>>> Hi Christian, >>>> >>>> Have you found how/where/when? When you said "mapping will just be >>>> released a bit later on", you must know the answer. >>>> >>>> It is difficult to prove something does not exist. Anyway, I will >>>> give it a try to prove such "later on" does not exist. >>>> >>>> Function ttm_bo_kunmap is the only function to unmap. To prove >>>> this, search ttm_bo_map_iomap, only ttm_bo_kunmap use this enum to >>>> correctly kunmap. >>>> >>>> Function ttm_bo_kunmap is not called by ttm itself. This is a hint >>>> that all TTM delay delete mechanism or unref mechanism will NOT >>>> kunmap BO later on. >>>> >>>> Function ttm_bo_kunmap is called by AMDGPU function >>>> amdgpu_bo_kunmap and amdgpu_gem_prime_vunmap. >>>> >>>> Search AMDGPU for amdgpu_bo_kunmap. All matches do not kunmap for >>>> scratch VRAM BO. amdgpu_bo_free_kernel is a suspect but the answer >>>> is still NO. >>>> >>>> So all possibilities are searched. Did I miss anything? >>>> >>>> Thanks, >>>> Alex Bin Xie >>>> >>>> ------------------------------------------------------------------------ >>>> *From:* Xie, AlexBin >>>> *Sent:* Tuesday, April 18, 2017 2:04:33 PM >>>> *To:* Christian König; Zhou, David(ChunMing); >>>> amd-gfx at lists.freedesktop.org >>>> *Subject:* Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO >>>> >>>> Hi Christian, >>>> >>>> >>>> Would you point out where/when will kunmap happen for this BO when >>>> release? It must be somewhere in some function calls. >>>> >>>> >>>> I checked before I asked for review. But I did not see such obvious >>>> kunmap function call. >>>> >>>> >>>> If so, there should be a comment in function >>>> amdgpu_vram_scratch_fini to avoid future confusion. >>>> >>>> >>>> Thanks, >>>> Alex Bin Xie >>>> ------------------------------------------------------------------------ >>>> *From:* Christian König <deathsimple at vodafone.de> >>>> *Sent:* Tuesday, April 18, 2017 1:46 PM >>>> *To:* Xie, AlexBin; Zhou, David(ChunMing); >>>> amd-gfx at lists.freedesktop.org >>>> *Subject:* Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO >>>> Hi AlexBin, >>>> >>>> No, David is right. This is a very common coding pattern in the >>>> kernel module. >>>> >>>> Freeing up a BO while there still exists a kernel mapping is >>>> perfectly ok, the mapping will just be released a bit later on. >>>> >>>> So this code is actually perfectly ok and just an optimization, but >>>> your patch breaks it and creates a memory leak. >>>> >>>> Regards, >>>> Christian. >>>> >>>> Am 18.04.2017 um 17:17 schrieb Xie, AlexBin: >>>>> >>>>> Hi David, >>>>> >>>>> >>>>> When amdgpu_bo_reserve return errors, we cannot release the BO. >>>>> This is not a memory leak. General speaking, memory leak >>>>> is unnoticed and unintentional. >>>>> >>>>> >>>>> The caller of function amdgpu_vram_scratch_fini ignores the return >>>>> error value... >>>>> >>>>> >>>>> The "memory leak" is not caused by my patch. It is caused because >>>>> reserving BO fails. >>>>> >>>>> >>>>> This patch only aim to make function amdgpu_vram_scratch_fini >>>>> behave correctly. >>>>> >>>>> >>>>> To follow up, we can add a warning message when amdgpu_bo_reserve >>>>> error happens in a different patch. >>>>> >>>>> >>>>> If function call amdgpu_bo_reserve is changed to uninterruptible, >>>>> this changes driver behaviour. Without a substantial issue, I >>>>> would be cautious for such a change. >>>>> >>>>> >>>>> Thanks, >>>>> >>>>> Alex Bin Xie >>>>> >>>>> >>>>> ------------------------------------------------------------------------ >>>>> *From:* Zhou, David(ChunMing) >>>>> *Sent:* Monday, April 17, 2017 10:38 PM >>>>> *To:* Xie, AlexBin; amd-gfx at lists.freedesktop.org >>>>> *Subject:* Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO >>>>> >>>>> >>>>> On 2017å¹´04æ??17æ?¥ 22:54, Xie, AlexBin wrote: >>>>>> >>>>>> Hi David, >>>>>> >>>>>> >>>>>> Thanks for the comments. However, please have look at >>>>>> amdgpu_bo_reserve definition. >>>>>> >>>>>> static inline int amdgpu_bo_reserve(struct amdgpu_bo *bo, bool >>>>>> no_intr) >>>>>> >>>>> Ah, this is a wired wrapper for ttm_bo_reserve. >>>>> >>>>>> >>>>>> When we call this function like the following: >>>>>> >>>>>> r = amdgpu_bo_reserve(adev->vram_scratch.robj, false); >>>>>> The false means interruptible. >>>>>> >>>>>> >>>>>> On the other hand, when amdgpu_bo_reserve function return error, >>>>>> why do we unref BO without kunmap and unpin the BO? This is wrong >>>>>> implementation when amdgpu_bo_reserve return any error. >>>>>> >>>>> Yeah, I see your mean, it's in driver un-loading, How about >>>>> changing to no interruptible? Your patch will make a memleak if >>>>> bo_reserve fails, but it seems not matter. I have no strong >>>>> preference. >>>>> >>>>> Regards, >>>>> David Zhou >>>>>> >>>>>> >>>>>> Thanks, >>>>>> Alex Bin Xie >>>>>> >>>>>> ------------------------------------------------------------------------ >>>>>> *From:* Zhou, David(ChunMing) >>>>>> *Sent:* Friday, April 14, 2017 12:00 AM >>>>>> *To:* Xie, AlexBin; amd-gfx at lists.freedesktop.org >>>>>> *Subject:* Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO >>>>>> >>>>>> >>>>>> On 2017å¹´04æ??14æ?¥ 05:34, Alex Xie wrote: >>>>>> > According to comment of amdgpu_bo_reserve, amdgpu_bo_reserve >>>>>> > can return with -ERESTARTSYS. When this function was interrupted >>>>>> > by a signal, BO should not be unref. Otherwise the BO might be >>>>>> > released while is kmapped and pinned, or BO MIGHT be deref >>>>>> > multiple times, etc. >>>>>> r = amdgpu_bo_reserve(adev->vram_scratch.robj, false); >>>>>> we have specified interruptible to false, so -ERESTARTSYS isn't >>>>>> possible >>>>>> here. >>>>>> >>>>>> Thanks, >>>>>> David Zhou >>>>>> > >>>>>> > Change-Id: If76071a768950a0d3ad9d5da7fcae04881807621 >>>>>> > Signed-off-by: Alex Xie <AlexBin.Xie at amd.com> >>>>>> > --- >>>>>> > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 +- >>>>>> > 1 file changed, 1 insertion(+), 1 deletion(-) >>>>>> > >>>>>> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>>>>> > index 53996e3..1dcc2d1 100644 >>>>>> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>>>>> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>>>>> > @@ -355,8 +355,8 @@ static void amdgpu_vram_scratch_fini(struct >>>>>> amdgpu_device *adev) >>>>>> > amdgpu_bo_kunmap(adev->vram_scratch.robj); >>>>>> > amdgpu_bo_unpin(adev->vram_scratch.robj); >>>>>> > amdgpu_bo_unreserve(adev->vram_scratch.robj); >>>>>> > + amdgpu_bo_unref(&adev->vram_scratch.robj); >>>>>> > } >>>>>> > - amdgpu_bo_unref(&adev->vram_scratch.robj); >>>>>> > } >>>>>> > >>>>>> > /** >>>>>> >>>>> >>>>> >>>>> >>>>> _______________________________________________ >>>>> amd-gfx mailing list >>>>> amd-gfx at lists.freedesktop.org >>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx >>>> >>>> >>>> >>>> >>>> _______________________________________________ >>>> amd-gfx mailing list >>>> amd-gfx at lists.freedesktop.org >>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx >>> >>> >>> >>> >>> _______________________________________________ >>> amd-gfx mailing list >>> amd-gfx at lists.freedesktop.org >>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx >> >> > > > > _______________________________________________ > amd-gfx mailing list > amd-gfx at lists.freedesktop.org > https://lists.freedesktop.org/mailman/listinfo/amd-gfx -------------- next part -------------- An HTML attachment was scrubbed... URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20170421/c972ac83/attachment-0001.html>