Re: [PATCH 11/24] xfs: create incore realtime group structures

Christoph Hellwig <hch@xxxxxxxxxxxxx> · Mon, 26 Aug 2024 21:44:42 -0700

On Mon, Aug 26, 2024 at 06:55:58PM -0700, Darrick J. Wong wrote:
> The thing I *don't* know is how will this affect hch's zoned device
> support -- he's mentioned that rtgroups will eventually have both a size
> and a "capacity" to keep the zones aligned to groups, or groups aligned
> to zones, I don't remember which.  I don't know if segmenting
> br_startblock for rt mappings makes things better or worse for that.

This should be fine.  The ZNS zone capacity feature, where zones have
a size (the LBA space allocated to them) and a capacity (the LBAs that
can actually be written to), is the hardware equivalent of this.
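
To illustrate the size vs capacity split, here is a minimal sketch.  All
names (demo_zone_group, demo_block_writable, demo_group_unusable) are
made up for this mail and are not taken from the patchset or the block
layer; it only assumes that the capacity never exceeds the size.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical illustration only, not an XFS or block layer structure. */
struct demo_zone_group {
	uint64_t start;		/* first block address of the group */
	uint64_t size;		/* address space reserved for the group */
	uint64_t capacity;	/* blocks that can actually be written */
};

/* A block is writable only if it falls inside the capacity. */
static bool demo_block_writable(const struct demo_zone_group *g, uint64_t bno)
{
	assert(g->capacity <= g->size);
	return bno >= g->start && bno - g->start < g->capacity;
}

/* Addresses between capacity and size exist but can never hold data. */
static uint64_t demo_group_unusable(const struct demo_zone_group *g)
{
	assert(g->capacity <= g->size);
	return g->size - g->capacity;
}
```

The gap returned by demo_group_unusable() is address space that is
reserved so group boundaries stay aligned, but that nothing can be
allocated from.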

> So ... would it theoretically make more sense to use an rhashtable here?
> Insofar as the only place that totally falls down is if you want to
> iterate tagged groups; and that's only done for AGs.

It also is an important part of garbage collection for zoned XFS, where
we'll use it on RTGs.

> > 
> > #define for_each_group(grp, gno, grpi)					\
> > 	(gno) = 0;							\
> > 	for ((grpi) = to_grpi((grpi), xfs_group_grab((grp), (gno)));	\
> > 	     (grpi) != NULL;						\
> > 	     (grpi) = to_grpi(grpi, xfs_group_next((grp), to_gi(grpi),	\
> > 					&(gno), (grp)->num_groups))
> > 
> > And now we essentially have common group infrastructure for
> > access, iteration, geometry and address verification purposes...
> 
> <nod> That's pretty much what I had drafted, albeit with different
> helper macros since I kept the for_each_{perag,rtgroup} things around
> for type safety.  Though I think for_each_perag just becomes:
> 
> #define for_each_perag(mp, agno, pag) \
> 	for_each_group((mp)->m_perags, (agno), (pag))
> 
> Right?

Btw, if we touch all of this anyway I'd drop the agno argument.
We can get the group number from the group struct (see my perag xarray
conversion series for an example where I'm doing this for the tagged
iteration).
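
Roughly what I mean, as a standalone sketch rather than the real thing:
the demo_* names are hypothetical, the groups live in a plain array, and
there is no reference counting, unlike the xfs_group_grab/xfs_group_next
helpers quoted above.  The point is only that the iterator needs nothing
but the group pointer, because the number lives in the group itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; the real structures look different. */
struct demo_group {
	uint32_t gno;			/* group number stored in the group */
};

struct demo_groups {
	struct demo_group *groups;	/* array indexed by group number */
	uint32_t num_groups;
};

static struct demo_group *demo_group_first(struct demo_groups *gs)
{
	return gs->num_groups ? &gs->groups[0] : NULL;
}

/* Advance using the number kept in the group, not a separate cursor. */
static struct demo_group *demo_group_next(struct demo_groups *gs,
					  struct demo_group *g)
{
	uint32_t next = g->gno + 1;

	assert(g->gno < gs->num_groups);
	return next < gs->num_groups ? &gs->groups[next] : NULL;
}

/* No group number argument: callers read (g)->gno when they need it. */
#define demo_for_each_group(gs, g)				\
	for ((g) = demo_group_first(gs);			\
	     (g) != NULL;					\
	     (g) = demo_group_next((gs), (g)))

static uint64_t demo_sum_group_numbers(struct demo_groups *gs)
{
	struct demo_group *g;
	uint64_t sum = 0;

	demo_for_each_group(gs, g)
		sum += g->gno;
	return sum;
}
```

A side effect is that the macro expands to a single for statement, so
it stays safe inside an unbraced if or else.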

> 
> The max rtgroup length is defined in blocks; the min is defined in rt
> extents.  I might want to bump up the minimum a bit, but I think
> Christoph should weigh in on that first -- I think his zns patchset
> currently assigns one rtgroup to each zone?  Because he was muttering
> about how 130,000x 256MB rtgroups really sucks.  Would it be very messy
> to have a minimum size of (say) 1GB?

Very messy.  I can live with a minimum of 256 MB, but no byte less :)
This is the size used by all shipping SMR hard drives.  For ZNS SSDs
there are samples with very small zone sizes that are basically open
channel devices in disguise - no sane person would want them and they
don't make sense to support in XFS as they require extensive erasure
encoding and error correction.  The ZNS drives with full data integrity
support have zone sizes and capacities way over 1GB and growing.
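
For a feel of how the group count scales with the minimum size, here is
a small back-of-the-envelope helper.  The names (demo_group_count,
DEMO_MIB, DEMO_GIB) are invented for this mail, and it assumes one
group per fixed-size chunk of the device with a partial group at the
end counted as a full one.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_MIB(n)	((uint64_t)(n) << 20)
#define DEMO_GIB(n)	((uint64_t)(n) << 30)

/*
 * Number of groups needed to cover dev_bytes when every group spans
 * group_bytes, rounding up for a trailing partial group.
 */
static uint64_t demo_group_count(uint64_t dev_bytes, uint64_t group_bytes)
{
	assert(group_bytes > 0);
	return dev_bytes / group_bytes + (dev_bytes % group_bytes != 0);
}
```

With these assumptions a device that needs 130,000 groups at 256 MB per
group would need 32,500 groups at 1 GB per group, i.e. a quarter as
many.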

> > and we hit those limits on 4kB block sizes at around 500,000 rtgs.
> > 
> > So do we need to support millions of rtgs? I'd say no....
> 
> ...but we might.  Christoph, how gnarly does zns support get if you have
> to be able to pack multiple SMR zones into a single rtgroup?

I thought about it, but it creates real accounting nightmares.  It's
not entirely undoable, but really messy.