On Fri, Sep 13, 2024 at 01:08:34PM +0100, John Garry wrote: > On 13/09/2024 12:26, Daniel Gomez wrote: > > On Fri, Sep 13, 2024 at 09:59:08AM +0100, John Garry wrote: > > > On 12/09/2024 21:48, Daniel Gomez via B4 Relay wrote: > > > > From: Daniel Gomez <da.gomez@xxxxxxxxxxx> > > > > > > > > Report block alignment in terms of LBA and size during block tracing for > > > > block_rq. Calculate alignment only for read/writes where the length is > > > > greater than 0. Otherwise, report 0 to indicate no alignment calculated. > > > > > > > > Suggested-by: Dave Chinner <dchinner@xxxxxxxxxx> > > > > Signed-off-by: Daniel Gomez <da.gomez@xxxxxxxxxxx> > > > > --- > > > > This patch introduces LBA and size alignment information for > > > > the block_rq tracepoints (block_rq{insert, issue, merge} and > > > > block_{io_start, io_done}). > > > > > > eh, shouldn't this belong in the description of the patch? > > > > Yes. I'll move this to the commit message. > > > > > > > > And I still don't know what we mean by alignment in this context. > > > > > > From looking at the code, it seems to be the max detected block size > > > granularity. For example, for a 64KB write at a 32KB offset, that would give > > > a 32KB "alignment". But a 64KB write at a 64KB offset would be "64KB" > > > alignment. While a 8KB write at 64KB offset would be 8KB "alignment". And a > > > 24KB write at offset 0 is a 8KB "alignment", as 8KB is the lowest power-of-2 > > note: I meant "8KB is the largest power-of-2" 8KB will be the largest unit at what a device can operate at, for that particular I/O. > > > > which is divisible into 24KB. Is this a correct understanding? > > > > That is correct. > > So maybe it's me, but I just find it odd to call this information > "alignment". To me, what you are looking for is largest block size > granularity. More suggestions are welcome. What about just I/O granularity? Does the term imply LBA and size? > > > Do you think adding examples like yours can help to explain > > this better? > > Below the same examples using fio with the trace output: > > > > > > sudo fio -bs=64k -size=64k -offset=32k -rw=write \ > > -direct=1 -filename=/dev/nvme0n1 -iodepth=1 -ioengine=sync -name=sync > > > > sudo fio -bs=64k -size=64k -offset=64k -rw=write \ > > -direct=1 -filename=/dev/nvme0n1 -iodepth=1 -ioengine=sync -name=sync > > > > sudo fio -bs=8k -size=8k -offset=64k -rw=write \ > > -direct=1 -filename=/dev/nvme0n1 -iodepth=1 -ioengine=sync -name=sync > > > > sudo fio -bs=24k -size=24k -offset=0k -rw=write \ > > -direct=1 -filename=/dev/nvme0n1 -iodepth=1 -ioengine=sync -name=sync > > > > fio-789 [000] ..... 4455.092003: block_rq_issue: 259,0 WS 65536 () 64 + 128 none,0,0 |32768| [fio] > > fio-801 [000] ..... 4455.474826: block_rq_issue: 259,0 WS 65536 () 128 + 128 none,0,0 |65536| [fio] > > fio-813 [000] ..... 4455.855143: block_rq_issue: 259,0 WS 8192 () 128 + 16 none,0,0 |8192| [fio] > > fio-825 [000] ..... 4456.235595: block_rq_issue: 259,0 WS 24576 () 0 + 48 none,0,0 |8192| [fio] > > > > > > Also, the motivation behind this is explained in the LBS RFC [1] and I should > > have included it here for context. I hope [1] and my description below helps to > > explain what alignment means and why is needed: > > > > [1] Subject: [RFC 00/23] Enable block size > page size in XFS > > https://urldefense.com/v3/__https://lore.kernel.org/lkml/20230915183848.1018717-1-kernel@xxxxxxxxxxxxxxxx/__;!!ACWV5N9M2RV99hQ!NoMpDxzuA5uKlv0RAWE5UtOQKOrNB2zv8PHmOLWxfGCEzw5WpyyvonfhcMi0REPjCgF8pgBvEO9kyhTPO8z1$ > > > > Tracing alignment information is important for high-capacity and QLC SSDs with > > Indirection Units greater than 4 KiB. These devices are still 4 KiB in Logical > > Block Size (LBS) but because they work at higher IUs, unaligned writes to the IU > > boundaries can imply in a read-modify-write (RMW). > > Yes, I get that this might be important to know. > > > > > The graph below is a representation of the device IU vs an I/O block aligned/ > > unaligned. > > > > |--- IU Boundaries ----| |-PS-| > > a) [====][====][====][====][····][····][····]-- > > | | > > b) [····][====][====][====][====][····][····]-- > > | | > > c) [====][====][====][====][····][====][====]-- > > is there meant to be a gap at page index #4? Sorry, that's a copy+paste error. c) can be ignored. > > > | | > > d) [····][····][====][====][····][····][····]-- d) is c) > > | | > > LBA 0 4 > > Key: > > [====] = I/O Block > > [····] = Memory in Page Size (PS) chunks > > > > a) I/O matches IU boundaries (LBA and block size). I/O is aligned. > > b) The size of the I/O matches the IU size but the I/O is not aligned to the > > IU boundaries. I/O is unaligned. > > c) I/O does not match in either size or LBA. I/O is unaligned. > > what about d)? Not aligned to IU, I assume. Yes, c) description is meant for d). So for clarity, the correct graph is: |--- IU Boundaries ----| |-PS-| a) [====][====][====][====][····][····][····]-- | | b) [····][====][====][====][====][····][····]-- | | c) [····][····][====][====][····][····][····]-- | | LBA 0 4 a) I/O matches IU boundaries (LBA and block size). I/O is aligned to IU boundaries. b) The size of the I/O matches the IU size but the I/O is not aligned to the IU boundaries. I/O is unaligned. c) I/O does not match in either size or LBA. I/O is unaligned. Using I/O granularity term: a) 16k I/O granularity b) 4k I/O granularity c) 8k I/O granularity > > > > > > > > > > > > > > The idea of reporting alignment in a tracepoint was first suggested in > > > > this thread [1] by Dave Chinner. Additionally, an eBPF-based equivalent > > > > tracing tool [2] was developed and used during LBS development, as > > > > mentioned in the patch series [3] and in [1]. > > > > > > > > With this addition, users can check block alignment directly through the > > > > block layer tracepoints without needing any additional tools. > > > > > > > > In case we have a use case, this can be extended to other tracepoints, > > > > such as complete and error. > > > > > > > > Another potential enhancement could be the integration of this > > > > information into blktrace. Would that be a feasible option to consider? > > > > > > > > [1] https://urldefense.com/v3/__https://lore.kernel.org/all/ZdvXAn1Q*2F*QX5sPQ@xxxxxxxxxxxxxxxxxxx/__;JSs!!ACWV5N9M2RV99hQ!P1ZM_E9uBSDLzz6M0dLc_vgEGWEY2HPBXJvEJLWp7w0l_G_r9Gvkm2kQiN586NSIH-JMx_YiCFy_6qdklHFY3pXtYsRb3aY$ > > > > [2] blkalgn tool written in eBPF/bcc: > > > > https://urldefense.com/v3/__https://github.com/dkruces/bcc/tree/lbs__;!!ACWV5N9M2RV99hQ!P1ZM_E9uBSDLzz6M0dLc_vgEGWEY2HPBXJvEJLWp7w0l_G_r9Gvkm2kQiN586NSIH-JMx_YiCFy_6qdklHFY3pXthE7cfng$ > > > > [3] https://urldefense.com/v3/__https://lore.kernel.org/all/20240822135018.1931258-1-kernel@xxxxxxxxxxxxxxxx/__;!!ACWV5N9M2RV99hQ!P1ZM_E9uBSDLzz6M0dLc_vgEGWEY2HPBXJvEJLWp7w0l_G_r9Gvkm2kQiN586NSIH-JMx_YiCFy_6qdklHFY3pXtqQ5uwAE$ > > > > --- > > > > block/blk-mq.c | 29 +++++++++++++++++++++++++++++ > > > > include/linux/blk-mq.h | 11 +++++++++++ > > > > include/linux/blkdev.h | 6 ++++++ > > > > include/trace/events/block.h | 7 +++++-- > > > > 4 files changed, 51 insertions(+), 2 deletions(-) > > > > > > > > diff --git a/block/blk-mq.c b/block/blk-mq.c > > > > index 831c5cf5d874..714452bc236b 100644 > > > > --- a/block/blk-mq.c > > > > +++ b/block/blk-mq.c > > > > @@ -4920,6 +4920,35 @@ int blk_rq_poll(struct request *rq, struct io_comp_batch *iob, > > > > } > > > > EXPORT_SYMBOL_GPL(blk_rq_poll); > > > > +u32 __blk_rq_lba_algn(struct request *req) > > > > > > why use "algn", and not "align"? > > > > > > "algn" is not a natural abbreviation of "alignment". > > > > That's okay with me, changing the var name to a more natural abbreviation. > > > > > > > > And why can't userspace figure this out? All the info is available already, > > > right? > > > > We are interested in the block alignment (LBA and size) at block device driver > > level, not userspace level. That is, everything that is going out from the block > > layer. Using the block tracing points currently available makes it block-driver > > generic. > > I am just saying that the information already present in the block trace > point can be used to get this "alignment" info, right? And userspace can do > the work of reading those trace events to find this "alignment" info. So, maybe this is better to have integrated in blktrace tool? > > Thanks, > John >