Hi Varada,

Sorry, I missed your earlier message. Also adding ceph-devel cc.

On Thu, 26 May 2016, Varada Kari wrote:
> Hi Sage,
>
> I figured out some more gaps in my implementation, but the idea that is
> proposed is the same. Now I am thinking about adding some throttling and
> crash-recovery aspects to it.
>
> Sorry for the delay in responding; I have been pulled onto some other
> tasks. I am trying to get a basic implementation ready and will get back
> to you soon on this.
>
> Varada
>
> On Tuesday 17 May 2016 06:15 PM, Varada Kari wrote:
> > Hi Sage,
> >
> > Regarding the task to compact the log in BlueFS, I am thinking of the
> > following approach. Please correct me if I have missed anything in my
> > understanding. Once I have some more understanding of the approach and
> > some kind of implementation ready, I will send it to ceph-devel for
> > review of the proposal.
> >
> > Approach 1:
> >
> > Thinking of implementing a double-buffered system.
> >
> > 1. Inode 1 will serve as a log file. There will be active and passive
> >    shadow files, sized per bluefs_log_compact_min_size (16MB).
> > 2. BlueFS::_maybe_compact_log() will find out whether we have to
> >    compact the log by looking at the active log file.
> > 3. If it needs compaction, switch to the other log file and start
> >    merging the contents from the passive file into ino 1.
> > 4. Will create a compaction thread, so it can run in the background.
> >
> > Tasks to handle:
> >
> > 1. get_total and get_free have to consult both files along with the
> >    actual log file.
> > 2. After any crash in between, on reboot we have to merge the contents
> >    into the log file (ino 1) and replay the log before any decision is
> >    made.
> >
> > Thinking of adding fsck as part of the change to verify my logic.
> > Hope I am thinking along the correct lines about the problem.

A few comments, mostly in reverse...

There is no need for an explicit fsck function because all the metadata is
loaded into RAM (and can/should be verified) on mount. Most of the metadata
is implicitly validated at that point (e.g., when the allocator freelist is
fabricated based on what isn't explicitly referenced by fnodes). If there's
anything we can validate that we aren't, we should just add it there.

I don't quite understand the 'double-buffer' idea above. I don't think
there is any need to have a separate fnode for the compaction, because the
metadata is in RAM and will be implicitly thrown out if we crash and
restart (because no persisted fnode references the space we were using).

I think the process looks like this (there's a rough code sketch after the
list)...

1. Decide we need to compact. This can be a simple heuristic that compares
   the estimated space we need (number of fnodes * some average size) vs
   the actual log size. I think this is already in place.

2. Allocate a new extent to continue the log, and then log an event that
   jumps the log write position to the new extent. At this point, the old
   extent(s) won't be written to, and reflect everything we should compact.

[simple version]

3. While still holding the lock, encode a bufferlist that dumps all of the
   in-memory fnodes and names. This will become the new beginning of the
   log. The last event will jump to the log continuation extent from #2.

4. Drop the lock so that writes can continue.

5. Queue a write to a new extent for the new beginning of the log. Wait
   for it to commit and flush.

6. Retake the lock.

7. Update the log_fnode to splice in the new beginning. Queue an io to
   update the superblock. Drop the lock.

8. Wait for things to flush. Retake the lock, and release the old log
   beginning extents to the allocator for reuse.
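To make the locking choreography concrete, here is a rough, self-contained
sketch of steps 1-8. Everything in it (Extent, Fnode, LogCompactor, the
allocate/log_jump_to/write_and_flush helpers, the 4x compaction ratio) is a
hypothetical stand-in for illustration, not the actual BlueFS code:

// Sketch only: every type and helper below is a made-up stand-in for
// the corresponding BlueFS machinery.
#include <cstdint>
#include <map>
#include <mutex>
#include <string>
#include <vector>

struct Extent { uint64_t offset = 0, length = 0; };

struct Fnode {
  uint64_t ino = 0;
  uint64_t version = 0;
  std::vector<Extent> extents;
};

class LogCompactor {
  std::mutex lock;                      // stand-in for the single bluefs lock
  Fnode log_fnode{1, 1, {}};            // ino 1: the metadata log
  std::map<std::string, Fnode> fnodes;  // in-memory names -> fnodes
  uint64_t log_size = 1 << 26;          // current on-disk log size

  // Hypothetical low-level helpers, stubbed out so the sketch compiles.
  Extent allocate(uint64_t len) {       // grab space from the allocator
    static uint64_t pos = 0;
    Extent e{pos, len};
    pos += len;
    return e;
  }
  void release(const std::vector<Extent>&) {}   // return space to allocator
  void log_jump_to(const Extent&) {}            // log a write-position jump
  std::vector<uint8_t> encode_all_fnodes() { return {}; }
  void write_and_flush(const Extent&, const std::vector<uint8_t>&) {}
  void write_superblock() {}
  uint64_t estimated_compact_size() const { return fnodes.size() * 4096; }

public:
  // 1. Simple heuristic: the log is much bigger than a fresh dump of the
  //    in-memory metadata would be (#fnodes * some average encoded size).
  bool should_compact() const {
    return log_size > 4 * estimated_compact_size();  // ratio is arbitrary
  }

  void compact() {
    if (!should_compact())
      return;
    std::unique_lock<std::mutex> l(lock);

    // 2. Allocate a continuation extent and log a jump to it; the old
    //    extents are now frozen and reflect everything we should compact.
    Extent continuation = allocate(1 << 20);
    log_jump_to(continuation);

    // 3. Still under the lock, encode all in-memory fnodes and names;
    //    this becomes the new beginning of the log, and its last event
    //    jumps to the continuation extent from step 2.
    std::vector<uint8_t> new_head = encode_all_fnodes();
    Extent head = allocate(new_head.size());

    // 4. Drop the lock so that normal log writes can continue.
    l.unlock();

    // 5. Write the new log head; wait for it to commit and flush.
    write_and_flush(head, new_head);

    // 6./7. Retake the lock, splice the new head into the log fnode, and
    //       queue a superblock update pointing at it; then drop the lock.
    l.lock();
    std::vector<Extent> old_extents = std::move(log_fnode.extents);
    log_fnode.extents = {head, continuation};
    ++log_fnode.version;
    write_superblock();
    l.unlock();

    // 8. Once everything has flushed, retake the lock and release the old
    //    log extents to the allocator for reuse.
    l.lock();
    release(old_extents);
  }
};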
This does the log compaction in-memory with the lock held. I think it'll
be really complex to do it without the lock, so I'm inclined not to worry
about that at this point. I don't think it will stall bluefs for long--the
IO is generally the slow part.

This is a bit wonky in that we might have a superblock fnode that is newer
than what is in the log. I think we just need to update the _replay case so
that f->fnode only updates if the version is newer (see the sketch at the
end of this mail).

We might also have a case where the superblock fnode references the old
uncompacted log, and a new log event at the end updates the log fnode to
reference the new space. That's also okay for replay, *if* we force a
superblock update on replay. We can just do that unconditionally. (Mostly
it doesn't matter... i.e., the superblock might have log fnode v2 that has
one extent, and as the log grows we log fnodes that add new extents. Replay
completes and all is well even though the superblock's fnode is super
old--it's still enough to get started reading the log. But with compaction
we might deallocate and clobber what the old thing referenced... so just
updating the superblock after replay cleans up if we crashed before #8
happened above.)

That's what I've been thinking, at least--let me know if you have
another/better idea! Happy to talk over video chat if that helps.

sage
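For the replay change mentioned above, here is a rough sketch of the idea;
the names (Fnode, File, ReplayState, write_superblock) are hypothetical
stand-ins, not the real _replay code:

// Sketch only: apply a logged fnode update only if its version is newer
// than what is already in memory, then unconditionally rewrite the
// superblock once replay finishes.
#include <cstdint>
#include <map>
#include <memory>
#include <vector>

struct Fnode { uint64_t ino = 0; uint64_t version = 0; };

struct File { Fnode fnode; };
using FileRef = std::shared_ptr<File>;

struct ReplayState {
  std::map<uint64_t, FileRef> file_map;   // ino -> in-memory file
  Fnode superblock_log_fnode{1, 2};       // log fnode as stored on disk

  void write_superblock() {}              // stub: persist the superblock

  // One replayed log event carrying an fnode update.
  void apply_fnode_update(const Fnode& incoming) {
    FileRef& f = file_map[incoming.ino];
    if (!f)
      f = std::make_shared<File>();
    // The superblock fnode may be newer than an early event in the log,
    // so ignore anything stale instead of blindly overwriting.
    if (incoming.version > f->fnode.version)
      f->fnode = incoming;
  }

  void replay(const std::vector<Fnode>& events) {
    // Seed ino 1 from the superblock before walking the log.
    apply_fnode_update(superblock_log_fnode);
    for (const auto& ev : events)
      apply_fnode_update(ev);
    // Unconditional superblock refresh: if we crashed between compaction
    // steps 7 and 8 above, this keeps the superblock from pointing at old
    // log extents that may later be deallocated and clobbered.
    superblock_log_fnode = file_map[1]->fnode;
    write_superblock();
  }
};

The point of routing the superblock seed and the log events through the
same version-gated update is that it no longer matters which of the two is
newer; replay converges on the same state either way.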