[XFS SUMMIT] Version 3 log format

Dave Chinner <david@xxxxxxxxxxxxx> · Mon, 18 May 2020 12:58:28 +1000

Topic:	Version 3 log format

Scope:	Performance
	Removing sector size limits
	Large stripe unit log write alignment

Proposal:

The current v2 log format is an extension of the v1 format which was
limited to 32kB in size. The size limitation was due to the way that
the log format requires every basic block to be stamped with the LSN
associated with the iclog that is being written.

This requirement stems from the fact that log recovery needed this
LSN stamp to determine where the head and tail of the log lies, and
whether the iclog was written completely. The implementation
requires storing the data written to the first 32 bits of each
sector of iclog data into a special array in the log header, and
replacing the data with the cycle number of the current iclog write.
When the log is replayed, the saved data is extracted from the
iclog headers and written back over the cycle numbers, so the
transaction information is returned to its original state before
decoding occurs.
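
To make the stamping scheme concrete, here is a minimal sketch of
the pack/unpack round trip. It is a simplified model, not the kernel
code: the function names, the in-memory "saved" list standing in for
the header array, and the 512 byte basic block size are assumptions
made for illustration.

```python
import struct

BBSIZE = 512       # assumed basic block (sector) size in bytes
CYCLE_FMT = ">I"   # one 32-bit big-endian word per basic block

def pack_cycle(data: bytes, cycle: int):
    """Stamp every basic block with the cycle number.

    Returns the stamped data plus the list of original 32-bit words,
    which models the array kept in the log header.
    """
    assert len(data) % BBSIZE == 0
    buf = bytearray(data)
    saved = []
    for off in range(0, len(buf), BBSIZE):
        saved.append(bytes(buf[off:off + 4]))       # keep the original word
        struct.pack_into(CYCLE_FMT, buf, off, cycle)  # overwrite with cycle
    return bytes(buf), saved

def unpack_cycle(stamped: bytes, saved):
    """Write the saved words back over the cycle stamps (recovery side)."""
    buf = bytearray(stamped)
    for i, word in enumerate(saved):
        buf[i * BBSIZE:i * BBSIZE + 4] = word
    return bytes(buf)

payload = bytes(range(256)) * 4            # two basic blocks of data
stamped, saved = pack_cycle(payload, 7)
restored = unpack_cycle(stamped, saved)
```

Every block costs one saved word in the header and one extra pass
over the data in each direction, which is the per-sector overhead
the rest of this proposal is about.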

For v2 logs, a set of extension headers was created, adding another
7 basic blocks full of encoded data. That allows us to remap an
extra 7 32kB segments of iclog data into the iclog header. This is
where the 256kB iclog size limit comes from - it's 8 * 32kB
segments.
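
The arithmetic behind that limit, as a short sketch. The 512 byte
basic block size and the constant names are assumptions for
illustration; the 32kB segment size and the 7 extension headers come
from the text above.

```python
BBSIZE = 512              # assumed basic block size in bytes
SEGMENT = 32 * 1024       # iclog data covered by one header
V2_EXT_HEADERS = 7        # extension headers added by the v2 format

# One saved 32-bit word is needed per basic block in a segment.
slots_per_header = SEGMENT // BBSIZE           # 64
table_bytes = slots_per_header * 4             # 256

# The original header plus the extension headers bound the iclog size.
max_iclog = (1 + V2_EXT_HEADERS) * SEGMENT     # 262144 bytes = 256kB
```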

As the iclogs get larger, this whole encoding scheme becomes more
CPU expensive, and it largely limits what we can do with expanding
iclogs. It also doesn't take into account how things have changed
since v2 logs were first designed.

That is, we didn't have delayed logging back then. That meant
iclogbuf IO was the limiting factor to commit rates, not CPU
overhead. We now do commits that total up to 32MB of data, and we do
that by cycling through it one iclogbuf at a time. As a result, CIL
pushes are largely
IO bound waiting for iclogbufs to complete IO. Larger iclogbufs here
would make a substantial difference to performance when the CIL
is full, resulting in less blocking and fewer cache flushes when
writing iclogbufs.

The question is this: do we still need this cycle stamping in every
single sector? If we don't need it, then a new format is much
simpler than if we need basic block stamping.

From the perspective of determining if an iclog write was complete,
we don't trust the cycle number entirely in log recovery anymore.
Once we have the log head and the log tail, we do a CRC validation
walk of the log to validate it. Hence we don't really need cycle
data in the log data to validate writes were complete - the CRC will
fail if an iclogbuf write is torn.
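
A minimal sketch of why a checksum alone catches a torn write. This
is a toy model: the record layout (a 4 byte checksum followed by the
payload) and the helper names are hypothetical, and zlib.crc32 is
used as a stand-in for whatever CRC the log actually uses.

```python
import struct
import zlib

def seal(payload: bytes) -> bytes:
    """Prefix the payload with its checksum (hypothetical layout)."""
    return struct.pack(">I", zlib.crc32(payload)) + payload

def is_complete(record: bytes) -> bool:
    """Recovery-side check: does the stored checksum match the data?"""
    (stored,) = struct.unpack_from(">I", record, 0)
    return stored == zlib.crc32(record[4:])

def tear(new: bytes, old: bytes, landed: int) -> bytes:
    """Model a torn write: only the first `landed` bytes reached disk,
    the remainder still holds whatever was there before."""
    return new[:landed] + old[landed:]

old = seal(b"A" * 2044)     # previous contents of this log region
new = seal(b"B" * 2044)     # the write that was interrupted
torn = tear(new, old, 1024)
```

Here is_complete(new) holds and is_complete(torn) does not, without
any per-sector stamp being consulted.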

So that comes back to finding the head and tail of the log. This is
done by doing a binary search of the log based on reading basic
blocks and checking the cycle number in the basic block that was
read. We
really don't need to do this search via single sector IO; what we
really want to find is the iclog header at the head and the tail of
the log.

To do this, we could do a binary search based on the maximum
supported iclogbuf size and scan the buffers that are read for
iclog header magic numbers. There may be more than one in a buffer
(e.g. head and tail in the same region), but that is an in-memory
search rather than individual single sector IO. Once we've found an
iclog header, we can read the LSN out of the header, and that tells
us the cycle number of that commit. Hence we can do the binary
search to find the head and tail of the log without needing to have
the cycle number stamped into every sector.
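
A rough sketch of that search, to show it hangs together. It is a
toy model under several assumptions: the log is a list of 512 byte
blocks, a record is at most MAX_REC blocks long, a header block
starts with a magic number followed by the cycle, payload never
happens to contain the magic, and the log has wrapped at most once.
All names and sizes are hypothetical, and it only locates the head.

```python
import struct

BLOCK = 512
MAGIC = 0xFEEDBABE    # header magic used by this toy model
MAX_REC = 8           # assumed maximum record size, in blocks

def make_log(records):
    """Build a log from (cycle, nblocks) records: header + payload."""
    log = []
    for cycle, nblocks in records:
        log.append(struct.pack(">II", MAGIC, cycle).ljust(BLOCK, b"\0"))
        log.extend([b"\xaa" * BLOCK] * (nblocks - 1))
    return log

def first_header(log, start):
    """One large read of MAX_REC blocks, scanned in memory for a header.
    Returns (block index, cycle) or None if the read holds no header."""
    for i in range(start, min(start + MAX_REC, len(log))):
        magic, cycle = struct.unpack_from(">II", log[i], 0)
        if magic == MAGIC:
            return i, cycle
    return None

def find_head(log):
    """Return the index of the first record written in an older cycle
    than the record at block 0, or len(log) if there is none."""
    newest = first_header(log, 0)[1]
    lo, hi = 0, len(log)   # the read at lo finds the newest cycle first
    while hi - lo > 1:
        mid = (lo + hi) // 2
        found = first_header(log, mid)
        if found and found[1] == newest:
            lo = mid       # still inside the most recent pass
        else:
            hi = mid       # older cycle (or no header): head is before
    found = first_header(log, hi)
    return found[0] if found else len(log)

log = make_log([(5, 3), (5, 8), (5, 2), (4, 6), (4, 8), (4, 1)])
head = find_head(log)      # block 13: first record still at cycle 4
```

Because headers are never more than MAX_REC blocks apart, every read
that starts inside the log body sees at least one header, so each
step of the search is one large IO plus an in-memory scan.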

IOWs, I don't see a reason we need to maintain the per-basic-block
cycle stamp in the log format. Hence by removing it from the format
we get rid of the need for the encoding tables, and we remove the
limitation on log write size that we currently have.  Essentially we
move entirely to a "validation by CRC" model for detecting
torn/incomplete log writes, and that greatly reduces the complexity
of log writing code.

It also allows us to use arbitrarily large log writes instead of
fixed sizes, opening up further avenues for optimisation of both
journal IO patterns and how we format items into the bios for
dispatch. We already have log vector buffers that we hand off to the
CIL checkpoint for async processing; it is not a huge stretch to
consider mapping them directly into bios and using bio chaining to
submit them rather than copying them into iclogbufs for submission
(i.e. single copy logging rather than the double copy we do now).
And for DAX hardware, we can directly map the journal....

But before we get to that, we really need a new log format that
allows us to get away from the limitations of the existing "fixed
size with encoding" log format.

Discussion:
	- does it work?
	- implications of a major incompat log format change
	- implications of larger "inflight" window in the journal
	  to match the "inflight" window the CIL has.
	- other problems?
	- other potential optimisations a format change allows?
	- what else might we add to a log format change to solve
	  other recovery issues?

-- 
Dave Chinner
david@xxxxxxxxxxxxx