Hi Ryusuke, hi folks,

On Wed, Jun 05, 2013 at 09:21:30AM +0900, Ryusuke Konishi wrote:
> > I've picked up development, re-starting again, learning from earlier
> > mistakes. Has anything fundamental changed in the last few months? I've
> > taken in the changes for the volume name, the 2nd superblock and such,
> > but is there anything more?
>
> I think no big changes are made in these days. Most changes are bug
> fixes, or problem fixes.

That's reassuring and at the same time worrisome ;)

> > As for the adoption of NiLFS2 in Linux, I've seen mixed responses in
> > reviews. Most complain about the read/write speed. Since I don't have
> > a reasonable test setup and my implementation is not writing yet, what
> > are your thoughts about this? Is it a fundamental issue or is it more
> > due to the way it's implemented?
>
> Yes, it's not satisfactory in performance at all, at least for our
> NILFS2 implementation of Linux. Both read and write speed should be
> improved based on measurement. Unfortunately, I have no time these days
> to make an effort on this.
>
> As for performance, at least the following tasks are still remaining.

PS Sorry for the long mail, I think I started getting things thought out
for myself too :) Hopefully they are of some use!

> - fast inode allocator

IIRC there are bitmap blobs detailing free entries already, aren't there?
Why not create a run-length encoded cache of the free space found? I
know, run-length encoded list maintenance sucks and is very error prone,
but if you also keep a list of freed entries and merge the two lists from
time to time it can be a lot easier. It saves a lot of bit-grovelling to
get the (next) free spot.

> - fitrim support

I have no idea what this really is; a kind of trunc support? Removing
to-be-written-out blocks from the buffer cache/segment writer when
possible?

> - btree based directory implementation

Not needed, to be bluntly honest, and it would complicate stuff quite a
lot. As you might remember from my earlier posts, I've created a
hash-table based directory lookup that gets constructed on the initial
directory read-in, and it easily speeds up directory operations by a
factor of 10 up to 100,000 in some cases.

In the hash table only the pairs (hash, file offset) are recorded. On a
search, you hash the name you are looking for, look it up, and
iteratively get a number of file offsets (block aligned in this case) to
check. Just check those blocks to see if the name can be found there;
caching does the rest. It transforms directory lookups from O(n) to O(1).
Even small directories are accelerated on creation, since it isn't
O(1+2+3+4+5+6+7+8) but O(1+1+1+1+1+1+1+1), i.e. 8 steps instead of 36, a
factor of 4.5 already for just 8 files, effectively transforming even
directory creation from O(n^2) to O(n). (I've put a tiny sketch of the
hash table in the PS at the end of this mail.)

The speed-up will be less dramatic with block-aligned results for small
directories, but it would be larger if the entries returned were at their
exact file offsets. The problem with that is deleting entries, since for
some odd reason free space in a directory block has to be at the end of
the block, doesn't it? Not using block-aligned offsets would complicate
renaming/deletion.

> - revise gc algorithm

I have to say the current gc algorithm is quite wasteful in its disk
updating, yes. Since the number of live blocks per segment is recorded
already, why not go for the low-hanging fruit first and pick the sparsest
segments? In quieter times the less sparse segments get handled anyway.
The cleaner then won't just go round and round copying segments that are
hardly, if at all, touched.
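Victim selection for that could be as simple as the sketch below; this is
purely illustrative, the struct and field names are made up and do not
match the real cleaner code.

/*
 * Sketch of the "sparsest segments first" cleaner policy: sort the
 * candidate segments by live block count and clean the emptiest ones
 * first.  Names are made up for illustration only.
 */
#include <stdint.h>
#include <stdlib.h>

struct seg_usage {
        uint64_t segnum;        /* segment number */
        uint32_t nlive;         /* blocks still alive in it */
};

static int
cmp_nlive(const void *a, const void *b)
{
        const struct seg_usage *sa = a, *sb = b;

        if (sa->nlive != sb->nlive)
                return (sa->nlive < sb->nlive) ? -1 : 1;
        return 0;
}

/*
 * Pick the `nwanted' segments with the fewest live blocks as the next
 * cleaning batch; denser segments are left for quieter times.
 */
static size_t
select_victims(struct seg_usage *segs, size_t nsegs,
    uint64_t *victims, size_t nwanted)
{
        size_t i, n;

        qsort(segs, nsegs, sizeof(*segs), cmp_nlive);
        n = (nwanted < nsegs) ? nwanted : nsegs;
        for (i = 0; i < n; i++)
                victims[i] = segs[i].segnum;
        return n;
}

The nearly full segments then only get copied when there is nothing
sparser left to clean.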
If you want a real optimiser, why not look for excessively chopped-up
files and explicitly re-linearise them? That could be done in parallel to
the normal gc, or as an extra service when there is enough free space,
say as a nightly maintenance job ;)

> - fsync/osync optimization

I plan on creating a `synchronous' writing implementation and, on
auto-fsync, flushing all dirty buffers from the nodes to the segments. I
have an FS-wide rwlock for this: every operation takes a
multiple-claimable read lock on entry and releases it on exit, and the
segment creator optionally acquires the write lock. This assures that no
operations are active and that the data is consistent when the write lock
is granted. If a snapshot is due or requested, the segment writer takes
the write lock; otherwise a read lock will suffice. (There's a small
sketch of this in the PS at the end of this mail.)

> - direct io (write) support

Actually I have never seen a program use it, but isn't this more of a
fire-and-forget thing? I think it's handled outside the FS in *BSD, but I
have no idea.

> - improve pre-fetch

On a file/directory read-in my implementation issues the longest
contiguous stretch it can get, based on translating logical block numbers
to physical block numbers, up to 64 blocks or so. This gives quite a
bonus and is quite easy to implement. (Also sketched in the PS.) I don't
think it's worth reading across holes; the cost would be too high, and on
the next read one can search for and read in a chunk in one go again
anyway.

> - lock free log write; the current segment constructor locks r/w semaphore
> during write, and successive write requests to page cache are blocked.

Ah, your page cache and your segment constructor are intertwined. My plan
is to copy/COW blocks into the segment writer's memory space and let the
buffer cache drop its reference if it wants to. I plan to use a piggyback
snoop to avoid problems with multiple updates of a block and with having
to read blocks back in. The buffer cache itself can then operate
decoupled, and as long as it doesn't have a miss it doesn't depend on the
rw lock of the segment constructor for snooping.

> - etc

I really hope you can get some more time to perfect it. It would be a
shame if it were left in this state, especially since it has so much more
potential.

Sorry for the braindump :)

Oh, how is fsck_nilfs coming along? I plan on creating an fsck_nilfs of
my own as part of the development process, just as an assurance that I
don't mess up my btrees :-D That code is still... well, not that good ;)

With regards,
Reinoud
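PS: since I referred to them above, here are the sketches I mentioned.
First the directory hash table; purely illustrative, the names, table
size and hash function are made up and don't match my actual code.

/*
 * Sketch of the (hash, file offset) directory cache.  The table is built
 * once while the directory is read in; lookups then only touch the
 * blocks that can possibly contain the name.
 */
#include <stdint.h>
#include <stdlib.h>

#define DIRHASH_BUCKETS 256

struct dirhash_entry {
        uint32_t hash;                  /* full hash of the name */
        uint64_t blkoff;                /* directory block offset */
        struct dirhash_entry *next;
};

struct dirhash {
        /* assumed to start out zeroed, e.g. allocated with calloc() */
        struct dirhash_entry *bucket[DIRHASH_BUCKETS];
};

/* simple FNV-1a; any decent string hash will do */
static uint32_t
name_hash(const char *name, size_t len)
{
        uint32_t h = 2166136261u;

        while (len--)
                h = (h ^ (uint8_t)*name++) * 16777619u;
        return h;
}

/* remember in which block the freshly read-in entry lives */
static void
dirhash_add(struct dirhash *dh, const char *name, size_t len, uint64_t blkoff)
{
        uint32_t h = name_hash(name, len);
        struct dirhash_entry *e;

        if ((e = malloc(sizeof(*e))) == NULL)
                return;                 /* it's only a cache, safe to skip */
        e->hash = h;
        e->blkoff = blkoff;
        e->next = dh->bucket[h % DIRHASH_BUCKETS];
        dh->bucket[h % DIRHASH_BUCKETS] = e;
}

/*
 * Iteratively return candidate blocks for a name; only those blocks need
 * to be scanned for the real directory entry, caching does the rest.
 */
static struct dirhash_entry *
dirhash_lookup(struct dirhash *dh, const char *name, size_t len,
    struct dirhash_entry *prev)
{
        uint32_t h = name_hash(name, len);
        struct dirhash_entry *e;

        e = (prev != NULL) ? prev->next : dh->bucket[h % DIRHASH_BUCKETS];
        for (; e != NULL; e = e->next)
                if (e->hash == h)
                        return e;       /* candidate: go check that block */
        return NULL;
}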
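Second, the FS-wide rwlock around operations and the segment writer.
Pthreads are used here only to illustrate the idea; the real thing would
of course use the kernel's own locking primitives, and the names are made
up.

/*
 * Sketch of the FS-wide rwlock: every operation holds the read lock for
 * its duration, the segment writer takes the write lock only when a
 * snapshot/sync needs a consistent, quiescent state.
 */
#include <pthread.h>
#include <stdbool.h>

static pthread_rwlock_t fs_biglock = PTHREAD_RWLOCK_INITIALIZER;

/* every operation brackets its work with these two */
static void
fs_op_enter(void)
{
        pthread_rwlock_rdlock(&fs_biglock);     /* shared: many ops at once */
}

static void
fs_op_exit(void)
{
        pthread_rwlock_unlock(&fs_biglock);
}

/*
 * A normal flush only needs the read lock; when a snapshot (or sync) is
 * due the write lock guarantees no operations are in flight and the
 * state written out is consistent.
 */
static void
segment_writer_flush(bool snapshot_due)
{
        if (snapshot_due)
                pthread_rwlock_wrlock(&fs_biglock);
        else
                pthread_rwlock_rdlock(&fs_biglock);

        /* ... flush dirty buffers / construct and write the segment ... */

        pthread_rwlock_unlock(&fs_biglock);
}

Taking the write lock simply blocks new operations from entering until
the snapshot is on disk, which is exactly the consistency point needed.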
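And last the read-ahead of the longest contiguous stretch;
translate_blk() and issue_read() are just placeholders for the block map
lookup and the actual I/O, they are not real functions.

/*
 * Sketch of the read-ahead: translate logical block numbers to physical
 * ones, count how long the stretch stays contiguous (up to a cap), then
 * issue one big read for it.
 */
#include <stdbool.h>
#include <stdint.h>

#define PREFETCH_MAX    64      /* cap the stretch at 64 blocks or so */

/* placeholder: map a logical block to its physical block, false on a hole */
bool translate_blk(uint64_t lblk, uint64_t *pblk);
/* placeholder: read `count' blocks starting at physical block `pblk' */
void issue_read(uint64_t pblk, unsigned count);

void
prefetch_from(uint64_t lblk)
{
        uint64_t pblk, next;
        unsigned count;

        if (!translate_blk(lblk, &pblk))
                return;                 /* hole: nothing to prefetch */

        /* extend the run as long as the physical blocks stay adjacent */
        for (count = 1; count < PREFETCH_MAX; count++) {
                if (!translate_blk(lblk + count, &next))
                        break;          /* don't read across holes */
                if (next != pblk + count)
                        break;          /* physically discontiguous: stop */
        }

        issue_read(pblk, count);
}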