Re: [ANNOUNCE] bcachefs!

Kent Overstreet <kent.overstreet@xxxxxxxxx> · Thu, 6 Aug 2015 16:59:09 -0700

On Thu, Aug 06, 2015 at 04:27:51PM -0700, Ming Lin wrote:
> On Thu, Aug 6, 2015 at 3:58 PM, Kent Overstreet
> <kent.overstreet@xxxxxxxxx> wrote:
> > On Tue, Jul 28, 2015 at 11:41:52AM -0700, Ming Lin wrote:
> >> On Fri, Jul 24, 2015 at 1:47 PM, Ming Lin <mlin@xxxxxxxxxx> wrote:
> >> >
> >> > And I want to learn how the btree node insert/delete/update happens on
> >> > disk. This may be too much detail. I'm going to write a small tool to dump
> >> > the file system. Then I could understand better the on disk btree
> >> > format.
> >>
> >> Here is my simple tool to dump parts of the on-disk format.
> >> http://www.minggr.net/cgit/cgit.cgi/bcache-tools/commit/?id=deb258e2
> >>
> >> It's not in good shape, but simple enough to learn the on-disk format.
> >
> > Hey! Sorry for taking so long to respond, just got my computer set up back in
> > Alaska.
> >
> > If you want to keep going with your tool, this might be a starting point for a
> > debugfs tool - which bcache definitely needs at some point.
> 
> Yes, that's my goal.
> I'll improve it once I get more familiar with bcachefs on-disk format.

I imagine the sanest thing to do will be to reuse some of the kernel-side code -
at the very least, the bkey packing code. That code is already pretty
self-contained, and it's very algorithmic - no point in redoing it, and no real
reason to do it differently.

If it makes things easier, we could probably shuffle code around a bit so that
perhaps bkey.c contains only code that can be easily compiled in userspace.

I'm not sure if there's any other significant code that you'd want to use in
userspace - possibly the mergesort code (i.e.
bch_extent_sort_fix_overlapping()), but that code is going to be harder to lift
out and compile in userspace without changes.

Journal replay is going to be another major issue. The btree isn't up to date
until you do journal replay, and bcache does journal replay with the same index
update path that it uses at runtime. That path modifies the btree, i.e. bcache
can't do journal replay without modifying what's on disk.

We don't want the userspace debugfs tool to be modifying the disk image, so the
method bcache uses is right out.

The method I had in mind: when you read the journal, keep that list of index
updates around in memory. Then, whenever you read or look at any given btree
node, iterate over all the keys in the journal replay list and apply only the
ones that fall within the current node. If the insertions don't fit into the
current node (i.e. a normal index update would have to split the node), just
grow the node in memory, since we're going to toss it out when we're done
instead of writing out our changes.
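
A minimal sketch of that in-memory overlay, in C, might look like the
following. Everything here is hypothetical and simplified for illustration:
struct jkey, struct mem_node and the node_*() helpers are made-up names, not
bcache code. Keys are plain (pos, val) pairs rather than packed bkeys, and the
overlap handling that extents would need is left out. It only shows the shape
of the idea: filter the journal list by the node's key range, apply updates in
journal order, and grow the in-memory node instead of splitting it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified key; real bkeys are packed and carry more fields. */
struct jkey {
	uint64_t pos;     /* position in the keyspace */
	uint64_t val;     /* stand-in for the key's value */
	bool     deleted; /* true if this journal entry is a deletion */
};

/* In-memory copy of one btree node covering [min, max]; keys sorted by pos. */
struct mem_node {
	uint64_t     min, max;
	struct jkey *keys;
	size_t       nr, cap;
};

/* Binary search: index of the first key with pos >= the given pos. */
static size_t node_search(const struct mem_node *n, uint64_t pos)
{
	size_t lo = 0, hi = n->nr;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (n->keys[mid].pos < pos)
			lo = mid + 1;
		else
			hi = mid;
	}
	return lo;
}

/* Apply one journal update to the node; returns -1 if allocation fails. */
static int node_apply(struct mem_node *n, const struct jkey *k)
{
	size_t i = node_search(n, k->pos);
	bool found = i < n->nr && n->keys[i].pos == k->pos;

	assert(i <= n->nr);

	if (k->deleted) {
		if (found) {
			memmove(&n->keys[i], &n->keys[i + 1],
				(n->nr - i - 1) * sizeof(*n->keys));
			n->nr--;
		}
		return 0;
	}

	if (found) {
		n->keys[i] = *k; /* newer journal entry overwrites */
		return 0;
	}

	/* No split: just grow the in-memory node, it is never written back. */
	if (n->nr == n->cap) {
		size_t cap = n->cap ? n->cap * 2 : 8;
		struct jkey *p = realloc(n->keys, cap * sizeof(*p));

		if (!p)
			return -1;
		n->keys = p;
		n->cap = cap;
	}

	memmove(&n->keys[i + 1], &n->keys[i], (n->nr - i) * sizeof(*n->keys));
	n->keys[i] = *k;
	n->nr++;
	return 0;
}

/* Replay, in journal order, only the entries that fall within this node. */
static int node_overlay_journal(struct mem_node *n,
				const struct jkey *journal, size_t nr)
{
	for (size_t j = 0; j < nr; j++) {
		if (journal[j].pos < n->min || journal[j].pos > n->max)
			continue;
		if (node_apply(n, &journal[j]))
			return -1;
	}
	return 0;
}
```

A dump tool built this way would call node_overlay_journal() on each node
right after reading it from disk, and free the node when done. Scanning the
whole journal list per node is linear in the list length; that seems
acceptable for a debugging tool, though sorting the list once would be an
obvious refinement.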