On 2022-02-07 at 20:52:33 +0530, Hellstrom, Thomas wrote:
> On Mon, 2022-02-07 at 20:44 +0530, Ramalingam C wrote:
> > On 2022-02-07 at 20:25:42 +0530, Hellstrom, Thomas wrote:
> > > Hi, Ram,
> > >
> > > A couple of quick questions before starting a more detailed review:
> > >
> > > 1) Does this also support migrating of compressed data LMEM->LMEM?
> > > What about inter-tile?
> > Honestly, this series is mainly focused on eviction of lmem into smem
> > and restoration of the same.
> >
> > To cover migration, we need to handle this differently from eviction,
> > because when we migrate the compressed content we need to be able to
> > use it from the new placement. We can't keep the ccs data separately.
> >
> > Migration of lmem->smem needs decompression incorporated.
> > Migration of lmem_m->lmem_n needs to maintain the
> > compressed/decompressed state as it is.
> >
> > So we need to pass the information up to emit_copy to differentiate
> > eviction and migration.
> >
> > If you don't have any objection, I would like to take up the migration
> > once we have the eviction of lmem in place.
>
> Sure NP. I was thinking that in the final solution we might also need
> to think about the possibility that we might evict to another lmem
> region, although I figure that won't be enabled until we support
> multi-tile.

Yes, we need it for multi-tile enablement of XeHPSDV.
>
> >
> > >
> > > 2) Do we need to block faulting of compressed data in the fault
> > > handler as a follow-up patch?
> >
> > In case of evicted compressed data we don't need to treat it
> > differently from the evicted normal data. So I don't think this needs
> > special treatment. Sorry if I don't understand your question.
>
> My question wasn't directly related to eviction actually, but does
> user-space need to have mmap access to compressed data? If not, block
> it?

We shouldn't mmap the ccs data. As per my understanding, we should be
mmapping the obj size, which doesn't count the ttm_tt inflated size.
I will verify this part and, if needed, will prepare a change to exclude
the increased pages from the mmap range.

Ram.
>
> Thanks,
> Thomas
>
> >
> >
> > Ram
> > >
> > > /Thomas
> > >
> > >
> > > On Mon, 2022-02-07 at 15:07 +0530, Ramalingam C wrote:
> > > > When we are swapping out the local memory obj on flat-ccs capable platform,
> > > > we need to capture the ccs data too along with main memory and we need to
> > > > restore it when we are swapping in the content.
> > > >
> > > > Extracting and restoring the CCS data is done through a special cmd called
> > > > XY_CTRL_SURF_COPY_BLT
> > > >
> > > > Signed-off-by: Ramalingam C <ramalingam.c@xxxxxxxxx>
> > > > ---
> > > >  drivers/gpu/drm/i915/gt/intel_migrate.c | 283 +++++++++++-----------
> > > >  1 file changed, 155 insertions(+), 128 deletions(-)
> > > >
> > > > diff --git a/drivers/gpu/drm/i915/gt/intel_migrate.c b/drivers/gpu/drm/i915/gt/intel_migrate.c
> > > > index 5bdab0b3c735..e60ae6ff1847 100644
> > > > --- a/drivers/gpu/drm/i915/gt/intel_migrate.c
> > > > +++ b/drivers/gpu/drm/i915/gt/intel_migrate.c
> > > > @@ -449,14 +449,146 @@ static bool wa_1209644611_applies(int ver, u32 size)
> > > >  	return height % 4 == 3 && height <= 8;
> > > >  }
> > > >
> > > > +/**
> > > > + * DOC: Flat-CCS - Memory compression for Local memory
> > > > + *
> > > > + * On Xe-HP and later devices, we use dedicated compression control state (CCS)
> > > > + * stored in local memory for each surface, to support the 3D and media
> > > > + * compression formats.
> > > > + *
> > > > + * The memory required for the CCS of the entire local memory is 1/256 of the
> > > > + * local memory size. So before the kernel boot, the required memory is reserved
> > > > + * for the CCS data and a secure register will be programmed with the CCS base
> > > > + * address.
> > > > + *
> > > > + * Flat CCS data needs to be cleared when a lmem object is allocated.
> > > > + * And CCS data can be copied in and out of CCS region through
> > > > + * XY_CTRL_SURF_COPY_BLT. CPU can't access the CCS data directly.
> > > > + *
> > > > + * When we exhaust the lmem, if the object's placements support smem, then we can
> > > > + * directly decompress the compressed lmem object into smem and start using it
> > > > + * from smem itself.
> > > > + *
> > > > + * But when we need to swapout the compressed lmem object into a smem region
> > > > + * though objects' placement doesn't support smem, then we copy the lmem content
> > > > + * as it is into smem region along with ccs data (using XY_CTRL_SURF_COPY_BLT).
> > > > + * When the object is referred, lmem content will be swapped in along with
> > > > + * restoration of the CCS data (using XY_CTRL_SURF_COPY_BLT) at corresponding
> > > > + * location.
> > > > + *
> > > > + *
> > > > + * Flat-CCS Modifiers for different compression formats
> > > > + * ----------------------------------------------------
> > > > + *
> > > > + * I915_FORMAT_MOD_F_TILED_DG2_RC_CCS - used to indicate the buffers of Flat CCS
> > > > + * render compression formats. Though the general layout is same as
> > > > + * I915_FORMAT_MOD_Y_TILED_GEN12_RC_CCS, new hashing/compression algorithm is
> > > > + * used. Render compression uses 128 byte compression blocks
> > > > + *
> > > > + * I915_FORMAT_MOD_F_TILED_DG2_MC_CCS -used to indicate the buffers of Flat CCS
> > > > + * media compression formats. Though the general layout is same as
> > > > + * I915_FORMAT_MOD_Y_TILED_GEN12_MC_CCS, new hashing/compression algorithm is
> > > > + * used. Media compression uses 256 byte compression blocks.
> > > > + *
> > > > + * I915_FORMAT_MOD_F_TILED_DG2_RC_CCS_CC - used to indicate the buffers of Flat
> > > > + * CCS clear color render compression formats. Unified compression format for
> > > > + * clear color render compression. The general layout is a tiled layout using
> > > > + * 4Kb tiles i.e Tile4 layout.
> > > > + */
> > > > +
> > > > +static inline u32 *i915_flush_dw(u32 *cmd, u64 dst, u32 flags)
> > > > +{
> > > > +	/* Mask the 3 LSB to use the PPGTT address space */
> > > > +	*cmd++ = MI_FLUSH_DW | flags;
> > > > +	*cmd++ = lower_32_bits(dst);
> > > > +	*cmd++ = upper_32_bits(dst);
> > > > +
> > > > +	return cmd;
> > > > +}
> > > > +
> > > > +static u32 calc_ctrl_surf_instr_size(struct drm_i915_private *i915, int size)
> > > > +{
> > > > +	u32 num_cmds, num_blks, total_size;
> > > > +
> > > > +	if (!GET_CCS_SIZE(i915, size))
> > > > +		return 0;
> > > > +
> > > > +	/*
> > > > +	 * XY_CTRL_SURF_COPY_BLT transfers CCS in 256 byte
> > > > +	 * blocks. one XY_CTRL_SURF_COPY_BLT command can
> > > > +	 * transfer up to 1024 blocks.
> > > > +	 */
> > > > +	num_blks = GET_CCS_SIZE(i915, size);
> > > > +	num_cmds = (num_blks + (NUM_CCS_BLKS_PER_XFER - 1)) >> 10;
> > > > +	total_size = (XY_CTRL_SURF_INSTR_SIZE) * num_cmds;
> > > > +
> > > > +	/*
> > > > +	 * We need to add a flush before and after
> > > > +	 * XY_CTRL_SURF_COPY_BLT
> > > > +	 */
> > > > +	total_size += 2 * MI_FLUSH_DW_SIZE;
> > > > +	return total_size;
> > > > +}
> > > > +
> > > > +static u32 *_i915_ctrl_surf_copy_blt(u32 *cmd, u64 src_addr, u64 dst_addr,
> > > > +				     u8 src_mem_access, u8 dst_mem_access,
> > > > +				     int src_mocs, int dst_mocs,
> > > > +				     u16 num_ccs_blocks)
> > > > +{
> > > > +	int i = num_ccs_blocks;
> > > > +
> > > > +	/*
> > > > +	 * The XY_CTRL_SURF_COPY_BLT instruction is used to copy the CCS
> > > > +	 * data in and out of the CCS region.
> > > > +	 *
> > > > +	 * We can copy at most 1024 blocks of 256 bytes using one
> > > > +	 * XY_CTRL_SURF_COPY_BLT instruction.
> > > > +	 *
> > > > +	 * In case we need to copy more than 1024 blocks, we need to add
> > > > +	 * another instruction to the same batch buffer.
> > > > +	 *
> > > > +	 * 1024 blocks of 256 bytes of CCS represent a total 256KB of CCS.
> > > > +	 *
> > > > +	 * 256 KB of CCS represents 256 * 256 KB = 64 MB of LMEM.
> > > > +	 */
> > > > +	do {
> > > > +		/*
> > > > +		 * We use logical AND with 1023 since the size field
> > > > +		 * takes values which is in the range of 0 - 1023
> > > > +		 */
> > > > +		*cmd++ = ((XY_CTRL_SURF_COPY_BLT) |
> > > > +			  (src_mem_access << SRC_ACCESS_TYPE_SHIFT) |
> > > > +			  (dst_mem_access << DST_ACCESS_TYPE_SHIFT) |
> > > > +			  (((i - 1) & 1023) << CCS_SIZE_SHIFT));
> > > > +		*cmd++ = lower_32_bits(src_addr);
> > > > +		*cmd++ = ((upper_32_bits(src_addr) & 0xFFFF) |
> > > > +			  (src_mocs << XY_CTRL_SURF_MOCS_SHIFT));
> > > > +		*cmd++ = lower_32_bits(dst_addr);
> > > > +		*cmd++ = ((upper_32_bits(dst_addr) & 0xFFFF) |
> > > > +			  (dst_mocs << XY_CTRL_SURF_MOCS_SHIFT));
> > > > +		src_addr += SZ_64M;
> > > > +		dst_addr += SZ_64M;
> > > > +		i -= NUM_CCS_BLKS_PER_XFER;
> > > > +	} while (i > 0);
> > > > +
> > > > +	return cmd;
> > > > +}
> > > > +
> > > >  static int emit_copy(struct i915_request *rq,
> > > > -		     u32 dst_offset, u32 src_offset, int size)
> > > > +		     bool dst_is_lmem, u32 dst_offset,
> > > > +		     bool src_is_lmem, u32 src_offset, int size)
> > > >  {
> > > > +	struct drm_i915_private *i915 = rq->engine->i915;
> > > >  	const int ver = GRAPHICS_VER(rq->engine->i915);
> > > >  	u32 instance = rq->engine->instance;
> > > > +	u32 num_ccs_blks, ccs_ring_size;
> > > > +	u8 src_access, dst_access;
> > > >  	u32 *cs;
> > > >
> > > > -	cs = intel_ring_begin(rq, ver >= 8 ? 10 : 6);
> > > > +	ccs_ring_size = ((src_is_lmem || dst_is_lmem) && HAS_FLAT_CCS(i915)) ?
> > > > +			calc_ctrl_surf_instr_size(i915, size) : 0;
> > > > +
> > > > +	cs = intel_ring_begin(rq, ver >= 8 ? 10 + ccs_ring_size : 6);
> > > >  	if (IS_ERR(cs))
> > > >  		return PTR_ERR(cs);
> > > >
> > > > @@ -492,6 +624,25 @@ static int emit_copy(struct i915_request *rq,
> > > >  		*cs++ = src_offset;
> > > >  	}
> > > >
> > > > +	if (ccs_ring_size) {
> > > > +		/* TODO: Migration needs to be handled with resolve of compressed data */
> > > > +		num_ccs_blks = (GET_CCS_SIZE(i915, size) +
> > > > +				NUM_CCS_BYTES_PER_BLOCK - 1) >> 8;
> > > > +
> > > > +		src_access = !src_is_lmem && dst_is_lmem;
> > > > +		dst_access = !src_access;
> > > > +
> > > > +		if (src_access) /* Swapin of compressed data */
> > > > +			src_offset += size;
> > > > +		else
> > > > +			dst_offset += size;
> > > > +
> > > > +		cs = _i915_ctrl_surf_copy_blt(cs, src_offset, dst_offset,
> > > > +					      src_access, dst_access,
> > > > +					      1, 1, num_ccs_blks);
> > > > +		cs = i915_flush_dw(cs, dst_offset, MI_FLUSH_LLC | MI_FLUSH_CCS);
> > > > +	}
> > > > +
> > > >  	intel_ring_advance(rq, cs);
> > > >  	return 0;
> > > >  }
> > > > @@ -578,7 +729,8 @@ intel_context_migrate_copy(struct intel_context *ce,
> > > >  		if (err)
> > > >  			goto out_rq;
> > > >
> > > > -		err = emit_copy(rq, dst_offset, src_offset, len);
> > > > +		err = emit_copy(rq, dst_is_lmem, dst_offset,
> > > > +				src_is_lmem, src_offset, len);
> > > >
> > > >  		/* Arbitration is re-enabled between requests. */
> > > >  out_rq:
> > > > @@ -596,131 +748,6 @@ intel_context_migrate_copy(struct intel_context *ce,
> > > >  	return err;
> > > >  }
> > > >
> > > > -/**
> > > > - * DOC: Flat-CCS - Memory compression for Local memory
> > > > - *
> > > > - * On Xe-HP and later devices, we use dedicated compression control state (CCS)
> > > > - * stored in local memory for each surface, to support the 3D and media
> > > > - * compression formats.
> > > > - *
> > > > - * The memory required for the CCS of the entire local memory is 1/256 of the
> > > > - * local memory size. So before the kernel boot, the required memory is reserved
> > > > - * for the CCS data and a secure register will be programmed with the CCS base
> > > > - * address.
> > > > - *
> > > > - * Flat CCS data needs to be cleared when a lmem object is allocated.
> > > > - * And CCS data can be copied in and out of CCS region through
> > > > - * XY_CTRL_SURF_COPY_BLT. CPU can't access the CCS data directly.
> > > > - *
> > > > - * When we exhaust the lmem, if the object's placements support smem, then we can
> > > > - * directly decompress the compressed lmem object into smem and start using it
> > > > - * from smem itself.
> > > > - *
> > > > - * But when we need to swapout the compressed lmem object into a smem region
> > > > - * though objects' placement doesn't support smem, then we copy the lmem content
> > > > - * as it is into smem region along with ccs data (using XY_CTRL_SURF_COPY_BLT).
> > > > - * When the object is referred, lmem content will be swapped in along with
> > > > - * restoration of the CCS data (using XY_CTRL_SURF_COPY_BLT) at corresponding
> > > > - * location.
> > > > - *
> > > > - *
> > > > - * Flat-CCS Modifiers for different compression formats
> > > > - * ----------------------------------------------------
> > > > - *
> > > > - * I915_FORMAT_MOD_F_TILED_DG2_RC_CCS - used to indicate the buffers of Flat CCS
> > > > - * render compression formats. Though the general layout is same as
> > > > - * I915_FORMAT_MOD_Y_TILED_GEN12_RC_CCS, new hashing/compression algorithm is
> > > > - * used. Render compression uses 128 byte compression blocks
> > > > - *
> > > > - * I915_FORMAT_MOD_F_TILED_DG2_MC_CCS -used to indicate the buffers of Flat CCS
> > > > - * media compression formats. Though the general layout is same as
> > > > - * I915_FORMAT_MOD_Y_TILED_GEN12_MC_CCS, new hashing/compression algorithm is
> > > > - * used. Media compression uses 256 byte compression blocks.
> > > > - *
> > > > - * I915_FORMAT_MOD_F_TILED_DG2_RC_CCS_CC - used to indicate the buffers of Flat
> > > > - * CCS clear color render compression formats. Unified compression format for
> > > > - * clear color render compression. The general layout is a tiled layout using
> > > > - * 4Kb tiles i.e Tile4 layout.
> > > > - */
> > > > -
> > > > -static inline u32 *i915_flush_dw(u32 *cmd, u64 dst, u32 flags)
> > > > -{
> > > > -	/* Mask the 3 LSB to use the PPGTT address space */
> > > > -	*cmd++ = MI_FLUSH_DW | flags;
> > > > -	*cmd++ = lower_32_bits(dst);
> > > > -	*cmd++ = upper_32_bits(dst);
> > > > -
> > > > -	return cmd;
> > > > -}
> > > > -
> > > > -static u32 calc_ctrl_surf_instr_size(struct drm_i915_private *i915, int size)
> > > > -{
> > > > -	u32 num_cmds, num_blks, total_size;
> > > > -
> > > > -	if (!GET_CCS_SIZE(i915, size))
> > > > -		return 0;
> > > > -
> > > > -	/*
> > > > -	 * XY_CTRL_SURF_COPY_BLT transfers CCS in 256 byte
> > > > -	 * blocks. one XY_CTRL_SURF_COPY_BLT command can
> > > > -	 * transfer up to 1024 blocks.
> > > > -	 */
> > > > -	num_blks = GET_CCS_SIZE(i915, size);
> > > > -	num_cmds = (num_blks + (NUM_CCS_BLKS_PER_XFER - 1)) >> 10;
> > > > -	total_size = (XY_CTRL_SURF_INSTR_SIZE) * num_cmds;
> > > > -
> > > > -	/*
> > > > -	 * We need to add a flush before and after
> > > > -	 * XY_CTRL_SURF_COPY_BLT
> > > > -	 */
> > > > -	total_size += 2 * MI_FLUSH_DW_SIZE;
> > > > -	return total_size;
> > > > -}
> > > > -
> > > > -static u32 *_i915_ctrl_surf_copy_blt(u32 *cmd, u64 src_addr, u64 dst_addr,
> > > > -				     u8 src_mem_access, u8 dst_mem_access,
> > > > -				     int src_mocs, int dst_mocs,
> > > > -				     u16 num_ccs_blocks)
> > > > -{
> > > > -	int i = num_ccs_blocks;
> > > > -
> > > > -	/*
> > > > -	 * The XY_CTRL_SURF_COPY_BLT instruction is used to copy the CCS
> > > > -	 * data in and out of the CCS region.
> > > > -	 *
> > > > -	 * We can copy at most 1024 blocks of 256 bytes using one
> > > > -	 * XY_CTRL_SURF_COPY_BLT instruction.
> > > > -	 *
> > > > -	 * In case we need to copy more than 1024 blocks, we need to add
> > > > -	 * another instruction to the same batch buffer.
> > > > -	 *
> > > > -	 * 1024 blocks of 256 bytes of CCS represent a total 256KB of CCS.
> > > > -	 *
> > > > -	 * 256 KB of CCS represents 256 * 256 KB = 64 MB of LMEM.
> > > > -	 */
> > > > -	do {
> > > > -		/*
> > > > -		 * We use logical AND with 1023 since the size field
> > > > -		 * takes values which is in the range of 0 - 1023
> > > > -		 */
> > > > -		*cmd++ = ((XY_CTRL_SURF_COPY_BLT) |
> > > > -			  (src_mem_access << SRC_ACCESS_TYPE_SHIFT) |
> > > > -			  (dst_mem_access << DST_ACCESS_TYPE_SHIFT) |
> > > > -			  (((i - 1) & 1023) << CCS_SIZE_SHIFT));
> > > > -		*cmd++ = lower_32_bits(src_addr);
> > > > -		*cmd++ = ((upper_32_bits(src_addr) & 0xFFFF) |
> > > > -			  (src_mocs << XY_CTRL_SURF_MOCS_SHIFT));
> > > > -		*cmd++ = lower_32_bits(dst_addr);
> > > > -		*cmd++ = ((upper_32_bits(dst_addr) & 0xFFFF) |
> > > > -			  (dst_mocs << XY_CTRL_SURF_MOCS_SHIFT));
> > > > -		src_addr += SZ_64M;
> > > > -		dst_addr += SZ_64M;
> > > > -		i -= NUM_CCS_BLKS_PER_XFER;
> > > > -	} while (i > 0);
> > > > -
> > > > -	return cmd;
> > > > -}
> > > > -
> > > >  static int emit_clear(struct i915_request *rq,
> > > >  		      u64 offset,
> > > >  		      int size,
> > > >
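
For reference, the sizing arithmetic described in the patch comments (CCS is 1/256 of the lmem size, XY_CTRL_SURF_COPY_BLT moves 256-byte blocks, and one command covers at most 1024 blocks) can be sketched as below. This is a standalone user-space illustration, not the driver code: the macro and function names are hypothetical, and rounding up at every step is an assumption on my side rather than something the patch states.

```c
#include <assert.h>
#include <stdint.h>

/* Constants taken from the comments in the quoted patch; names are made up. */
#define CCS_RATIO           256u  /* 1 byte of CCS describes 256 bytes of lmem */
#define CCS_BYTES_PER_BLOCK 256u  /* XY_CTRL_SURF_COPY_BLT copies 256-byte blocks */
#define CCS_BLOCKS_PER_XFER 1024u /* one command covers at most 1024 blocks */

/* Hypothetical helper: integer division rounding up. */
static uint64_t div_round_up(uint64_t n, uint64_t d)
{
	assert(d != 0);
	return (n + d - 1) / d;
}

/* Bytes of CCS needed to describe lmem_bytes of main surface (assumed round-up). */
static uint64_t ccs_bytes(uint64_t lmem_bytes)
{
	return div_round_up(lmem_bytes, CCS_RATIO);
}

/* Number of 256-byte CCS blocks covering that CCS size. */
static uint64_t ccs_blocks(uint64_t lmem_bytes)
{
	return div_round_up(ccs_bytes(lmem_bytes), CCS_BYTES_PER_BLOCK);
}

/* Number of copy commands needed when each one moves up to 1024 blocks. */
static uint64_t ccs_copy_cmds(uint64_t lmem_bytes)
{
	return div_round_up(ccs_blocks(lmem_bytes), CCS_BLOCKS_PER_XFER);
}
```

With these assumptions, a 64 MB chunk needs 256 KB of CCS, i.e. 1024 blocks and a single command, which matches the "256 * 256 KB = 64 MB" note in the patch. One byte more than 64 MB spills into a second command.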