Topic: Version 3 log format

Scope:
- Performance
- Removing sector size limits
- Large stripe unit log write alignment

Proposal:

The current v2 log format is an extension of the v1 format, which was limited to 32kB in size. The size limitation was due to the way the log format requires every basic block to be stamped with the LSN associated with the iclog that is being written. This requirement stems from the fact that log recovery needed this LSN stamp to determine where the head and tail of the log lie, and whether the iclog was written completely.

The implementation requires storing the data written to the first 32 bits of each sector of iclog data into a special array in the log header, and replacing the data with the cycle number of the current iclog write. When the log is replayed, the saved data is extracted from the iclog headers and written back over the cycle numbers so that the transaction information is returned to its original state before decoding occurs.

For V2 logs, a set of extension headers was created, allowing another 7 basic blocks full of encoded data, which allows us to remap an extra 7 32kB segments of iclog data into the iclog header. This is where the 256kB iclog size limit comes from - it's 8 * 32kB segments.

As the iclogs get larger, this whole encoding scheme becomes more CPU expensive, and it largely limits what we can do with expanding iclogs. It also doesn't take into account how things have changed since v2 logs were first designed.

That is, we didn't have delayed logging back then. That meant iclogbuf IO was the limiting factor to commit rates, not CPU overhead. We now do commits that total up to 32MB of data, and we write each of them out one iclogbuf at a time. As a result, CIL pushes are largely IO bound, waiting for iclogbufs to complete IO. Larger iclogbufs here would make a substantial difference to performance when the CIL is full, resulting in less blocking and fewer cache flushes when writing iclogbufs.
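To make the per-sector stamping described above concrete, here is a minimal Python sketch of the idea. It is not the kernel code: the function names are hypothetical, a 512 byte basic block and big-endian cycle words are assumed, and the saved words are returned as a plain list rather than being packed into real iclog header structures.

```python
import struct

BBSIZE = 512                 # assumed basic block (sector) size in bytes
SEGMENT = 32 * 1024          # iclog data covered by one header's table
SLOTS = SEGMENT // BBSIZE    # saved 32-bit words needed per 32kB segment


def stamp(data: bytes, cycle: int):
    """Overwrite the first 32 bits of every basic block with the cycle.

    Returns the stamped data plus the displaced words. In the format
    described above, those words live in the iclog header(s).
    """
    assert len(data) % BBSIZE == 0
    buf = bytearray(data)
    saved = []
    for off in range(0, len(buf), BBSIZE):
        saved.append(bytes(buf[off:off + 4]))     # remember original word
        struct.pack_into(">I", buf, off, cycle)   # replace it with the cycle
    return bytes(buf), saved


def unstamp(stamped: bytes, saved):
    """Recovery side: put the saved words back before decoding."""
    buf = bytearray(stamped)
    for i, word in enumerate(saved):
        buf[i * BBSIZE:i * BBSIZE + 4] = word
    return bytes(buf)


def headers_needed(iclog_bytes: int) -> int:
    """One header (base or extension) per 32kB segment, rounded up."""
    return -(-iclog_bytes // SEGMENT)
```

Both loops touch every sector of every iclog, which is the CPU cost that grows with iclog size. Under these assumptions, headers_needed(256 * 1024) is 8: the base header plus the 7 extension headers.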
The question is this: do we still need this cycle stamping in every single sector? If we don't need it, then a new format is much simpler than if we need basic block stamping.

From the perspective of determining if an iclog write was complete, we don't trust the cycle number entirely in log recovery anymore. Once we have the log head and the log tail, we do a CRC validation walk of the log to validate it. Hence we don't really need cycle data in the log data to validate that writes were complete - the CRC will fail if an iclogbuf write is torn.

So that comes back to finding the head and tail of the log. This is done by doing a binary search of the log based on reading basic blocks and checking the cycle number in the basic block that was read. We really don't need to do this search via single sector IO; what we really want to find is the iclog header at the head and the tail of the log.

To do this, we could do a binary search based on the maximum supported iclogbuf size and scan the buffers that are read for iclog header magic numbers. There may be more than one in a buffer (e.g. head and tail in the same region), but that is an in-memory search rather than individual single sector IO.

Once we've found an iclog header, we can read the LSN out of the header, and that tells us the cycle number of that commit. Hence we can do the binary search to find the head and tail of the log without needing to have the cycle number stamped into every sector.

IOWs, I don't see a reason we need to maintain the per-basic-block cycle stamp in the log format. Hence by removing it from the format we get rid of the need for the encoding tables, and we remove the limitation on log write size that we currently have.

Essentially we move entirely to a "validation by CRC" model for detecting torn/incomplete log writes, and that greatly reduces the complexity of log writing code.
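As a rough illustration of the search proposed above, here is a Python sketch that finds the log head using only iclog headers and CRCs. Everything about the layout is a simplifying assumption made for the sketch: the log is a list of 512 byte blocks, a header is one block holding (magic, cycle, payload length in blocks, payload CRC), block 0 holds a header of the current cycle, records do not wrap around the end of the log, and the search chunk is at least as large as the biggest record. Finding the tail is not shown.

```python
import struct
import zlib

BBSIZE = 512
MAGIC = 0xFEEDBABE              # header magic value used by this sketch
HDR = struct.Struct(">IIII")    # hypothetical: magic, cycle, nblocks, crc


def make_record(cycle, payload_blocks):
    """Build one record: a header block followed by its payload blocks."""
    payload = b"".join(payload_blocks)
    hdr = HDR.pack(MAGIC, cycle, len(payload_blocks), zlib.crc32(payload))
    return [hdr.ljust(BBSIZE, b"\0")] + list(payload_blocks)


def parse_header(block):
    """Return (cycle, nblocks, crc), or None if the block is not a header.

    Payload that happens to start with the magic would fool this; a real
    format would need to validate candidates, e.g. with a header CRC.
    """
    magic, cycle, nblocks, crc = HDR.unpack_from(block)
    return (cycle, nblocks, crc) if magic == MAGIC else None


def first_header(log, start, end):
    """In-memory scan of one chunk for the first header block."""
    for blk in range(start, min(end, len(log))):
        if parse_header(log[blk]) is not None:
            return blk
    return None


def find_head(log, chunk):
    """Return the block just past the last valid record of the current cycle."""
    cur = parse_header(log[0])[0]

    def is_current(c):
        blk = first_header(log, c * chunk, (c + 1) * chunk)
        return blk is not None and parse_header(log[blk])[0] == cur

    # Binary search over chunks, one large read per probe.
    # Invariant: chunk lo starts in the current cycle, chunk hi does not.
    lo, hi = 0, -(-len(log) // chunk)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_current(mid):
            lo = mid
        else:
            hi = mid

    # Walk records forward from the last current chunk, validating CRCs.
    blk = first_header(log, lo * chunk, (lo + 1) * chunk)
    while blk < len(log):
        hdr = parse_header(log[blk])
        if hdr is None or hdr[0] != cur:
            break
        _, nblocks, crc = hdr
        payload = b"".join(log[blk + 1:blk + 1 + nblocks])
        if len(payload) != nblocks * BBSIZE or zlib.crc32(payload) != crc:
            break               # torn or incomplete write
        blk += 1 + nblocks
    return blk
```

The number of reads is logarithmic in the number of chunks rather than in the number of sectors, and a torn write at the head is caught by the CRC check instead of by per-sector cycle numbers.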
Removing the per-sector stamp also allows us to use arbitrarily large log writes instead of fixed sizes, opening up further avenues for optimisation of both journal IO patterns and how we format items into the bios for dispatch. We already have log vector buffers that we hand off to the CIL checkpoint for async processing; it is not a huge stretch to consider mapping them directly into bios and using bio chaining to submit them rather than copying them into iclogbufs for submission (i.e. single copy logging rather than the double copy we do now). And for DAX hardware, we can directly map the journal...

But before we get to that, we really need a new log format that allows us to get away from the limitations of the existing "fixed size with encoding" log format.

Discussion:
- does it work?
- implications of a major incompat log format change
- implications of larger "inflight" window in the journal to match the "inflight" window the CIL has.
- other problems?
- other potential optimisations a format change allows?
- what else might we add to a log format change to solve other recovery issues?

--
Dave Chinner
david@xxxxxxxxxxxxx