Re: [XFS SUMMIT] Version 3 log format

"Darrick J. Wong" <darrick.wong@xxxxxxxxxx> · Sun, 17 May 2020 21:00:10 -0700

On Mon, May 18, 2020 at 12:58:28PM +1000, Dave Chinner wrote:
> 
> Topic:	Version 3 log format
> 
> Scope:	Performance
> 	Removing sector size limits
> 	Large stripe unit log write alignment
> 
> Proposal:
> 
> The current v2 log format is an extension of the v1 format which was
> limited to 32kB in size. The size limitation was due to the way that
> the log format requires every basic block to be stamped with the LSN
> associated with the iclog that is being written.
> 
> This requirement stems from the fact that log recovery needed this
> LSN stamp to determine where the head and tail of the log lie, and
> whether the iclog was written completely. The implementation
> requires storing the data written to the first 32 bits of each
> sector of iclog data into a special array in the log header, and
> replacing the data with the cycle number of the current iclog write.
> When the log is replayed, before the iclog is read, the data is
> extracted from the iclog headers and written back over the cycle
> numbers so the transaction information is returned to its original
> state before decoding occurs.
> 
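
To make the stamping scheme described above concrete, here is a minimal
sketch of the save/stamp/restore round trip. It is an illustration under
assumptions, not the kernel code: the function names pack_iclog and
unpack_iclog are hypothetical, the payload is treated as a flat byte
string of 512-byte basic blocks, and the saved words are returned as a
Python list standing in for the array in the log header.

```python
import struct

BBSIZE = 512  # basic block (sector) size in bytes


def pack_iclog(data: bytes, cycle: int):
    """Stamp `cycle` over the first 32 bits of every basic block.

    Returns (saved, stamped): `saved` holds the original first word of
    each block (a stand-in for the header array), `stamped` is the
    payload as it would be written to disk.
    """
    assert len(data) % BBSIZE == 0, "payload must be whole basic blocks"
    saved = []
    out = bytearray(data)
    for off in range(0, len(out), BBSIZE):
        # Save the word we are about to overwrite, then stamp the cycle.
        saved.append(struct.unpack_from(">I", out, off)[0])
        struct.pack_into(">I", out, off, cycle)
    return saved, bytes(out)


def unpack_iclog(stamped: bytes, saved) -> bytes:
    """Recovery side: write the saved words back over the cycle stamps."""
    out = bytearray(stamped)
    for i, word in enumerate(saved):
        struct.pack_into(">I", out, i * BBSIZE, word)
    return bytes(out)


payload = bytes(range(256)) * 4            # two basic blocks of sample data
saved, on_disk = pack_iclog(payload, cycle=7)
restored = unpack_iclog(on_disk, saved)
```

Every block costs one saved word in the header, which is why the size of
the header array bounds the amount of iclog data one header can cover.
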
> For V2 logs, a set of extension headers was created, allowing
> another 7 basic blocks full of encoded data, which allows us to remap an
> extra 7 32kB segments of iclog data into the iclog header. This is
> where the 256kB iclog size limit comes from - it's 8 * 32kB
> segments.
> 
> As the iclogs get larger, this whole encoding scheme becomes more
> CPU expensive, and it largely limits what we can do with expanding
> iclogs. It also doesn't take into account how things have changed
> since v2 logs were first designed.
> 
> That is, we didn't have delayed logging. That meant iclogbuf IO was
> the limiting factor to commit rates, not CPU overhead. We now do
> commits that total up to 32MB of data, and we do that by cycling
> through it an iclogbuf at a time. As a result, CIL pushes are largely
> IO bound waiting for iclogbufs to complete IO. Larger iclogbufs here
> would make a substantial difference to performance when the CIL
> is full, resulting in less blocking and fewer cache flushes when
> writing iclogbufs.
> 
> The question is this: do we still need this cycle stamping in every
> single sector? If we don't need it, then a new format is much
> simpler than if we need basic block stamping.
> 
> From the perspective of determining if an iclog write was complete,
> we don't trust the cycle number entirely in log recovery anymore.
> Once we have the log head and the log tail, we do a CRC validation
> walk of the log to validate it. Hence we don't really need cycle
> data in the log data to validate writes were complete - the CRC will
> fail if an iclogbuf write is torn.
> 
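
A small sketch of why a checksum alone catches a torn write, with no
per-sector stamp involved. Assumptions: zlib.crc32 is used purely as a
stand-in checksum (the kernel uses CRC32c), the record layout of a
4-byte CRC followed by the payload is made up for illustration, and
write_record, record_is_intact and tear are hypothetical helper names.

```python
import zlib


def write_record(payload: bytes) -> bytes:
    """Prefix the payload with its checksum (illustrative layout only)."""
    crc = zlib.crc32(payload)
    return crc.to_bytes(4, "big") + payload


def record_is_intact(record: bytes) -> bool:
    """Recompute the checksum over the payload and compare with the stored one."""
    stored = int.from_bytes(record[:4], "big")
    return zlib.crc32(record[4:]) == stored


def tear(record: bytes, written: int) -> bytes:
    """Simulate a torn write: only the first `written` bytes reach the
    media; the remainder is stale content, modelled here as zeroes."""
    return record[:written] + bytes(len(record) - written)


record = write_record(b"A" * 2048)
torn = tear(record, 1024)
print(record_is_intact(record), record_is_intact(torn))
```

The complete record verifies and the half-written one does not, which is
the property the validation walk relies on.
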
> So that comes back to finding the head and tail of the log. This is
> done by doing a binary search of the log based on reading basic blocks
> and checking the cycle number in the basic block that was read. We
> really don't need to do this search via single sector IO; what we
> really want to find is the iclog header at the head and the tail of
> the log.
> 
> To do this, we could do a binary search based on the maximum
> supported iclogbuf size and scan the buffers that are read for
> iclog header magic numbers. There may be more than one in a buffer
> (e.g. head and tail in the same region), but that is an in-memory
> search rather than individual single sector IO. Once we've found an
> iclog header, we can read the LSN out of the header, and that tells
> us the cycle number of that commit. Hence we can do the binary
> search to find the head and tail of the log without needing to have the
> cycle number stamped into every sector.
> 
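
A rough sketch of that search, to check the idea hangs together. It is
heavily simplified and every name in it is hypothetical: the log is
modelled as nchunks buffers of CHUNK bytes (a stand-in for the maximum
iclogbuf size), a header is just a 4-byte magic followed by a 4-byte
cycle number, every chunk is assumed to contain at least one whole
header, and the magic is assumed never to appear in payload data. It
only narrows the head down to a chunk; a real implementation would then
refine within that chunk and handle headers straddling chunk boundaries.

```python
import struct

MAGIC = 0xFEEDBABE   # header magic used to spot iclog headers in a buffer
CHUNK = 4096         # stand-in for the maximum supported iclogbuf size


def make_chunk(cycle: int) -> bytes:
    """Build a chunk holding one simplified header at offset 0."""
    return struct.pack(">II", MAGIC, cycle) + bytes(CHUNK - 8)


def first_cycle(chunk: bytes):
    """In-memory scan for the header magic; return the cycle number of
    the first header found, or None if the chunk has no whole header."""
    pos = chunk.find(struct.pack(">I", MAGIC))
    if pos < 0 or pos + 8 > len(chunk):
        return None
    return struct.unpack_from(">I", chunk, pos + 4)[0]


def find_head_chunk(read_chunk, nchunks: int) -> int:
    """Binary search for the first chunk not written in the newest cycle.

    Chunks before the head carry the same cycle as chunk 0; chunks after
    it still carry the previous cycle. Returns nchunks if no chunk is
    older, i.e. the head is at the very end of the log.
    """
    newest = first_cycle(read_chunk(0))
    lo, hi = 0, nchunks  # chunk lo is newest; chunk hi (if any) is older
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if first_cycle(read_chunk(mid)) == newest:
            lo = mid
        else:
            hi = mid
    return hi


# Eight chunks: the first five written in cycle 5, the rest still cycle 4.
log = [make_chunk(5)] * 5 + [make_chunk(4)] * 3
head = find_head_chunk(lambda i: log[i], len(log))
```

Each probe is one large read plus a memory scan, so the number of IOs is
logarithmic in the number of chunks rather than in the number of sectors.
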
> IOWs, I don't see a reason we need to maintain the per-basic-block
> cycle stamp in the log format. Hence by removing it from the format
> we get rid of the need for the encoding tables, and we remove the
> limitation on log write size that we currently have.  Essentially we
> move entirely to a "validation by CRC" model for detecting
> torn/incomplete log writes, and that greatly reduces the complexity
> of log writing code.
> 
> It also allows us to use arbitrarily large log writes instead of
> fixed sizes, opening up further avenues for optimisation of both
> journal IO patterns and how we format items into the bios for
> dispatch. We already have log vector buffers that we hand off to the
> CIL checkpoint for async processing; it is not a huge stretch to
> consider mapping them directly into bios and using bio chaining to
> submit them rather than copying them into iclogbufs for submission
> (i.e. single copy logging rather than the double copy we do now).
> And for DAX hardware, we can directly map the journal....
> 
> But before we get to that, we really need a new log format that
> allows us to get away from the limitations of the existing "fixed
> size with encoding" log format.
> 
> Discussion:
> 	- does it work?
> 	- implications of a major incompat log format change
> 	- implications of larger "inflight" window in the journal
> 	  to match the "inflight" window the CIL has.

Giant flood of log items overwhelming the floppy disk(s) underlying the
fs? :P

> 	- other problems?
> 	- other potential optimisations a format change allows?

Will have to ponder this in the morning.

> 	- what else might we add to a log format change to solve
> 	  other recovery issues?

Make sure log recovery can be done on any platform?

--D

> 
> -- 
> Dave Chinner
> david@xxxxxxxxxxxxx