Re: Further work on reiser4: discard support and performance issues

On 11 February 2013 02:48:05 Edward Shishkin wrote:
> On 02/10/2013 07:20 AM, Иван Шаповалов wrote:
> > Hi Edward,
> 
> Hello Ivan.
> 
> > Sorry for the long silence...
> 
> NP, I just wanted to make sure that things move in the right
> direction (if any).
> 
> 
> > I've been extremely busy with real-life
> > things here - so just had no time even to walk through the code and
> > build up a list of questions (not to mention the actual development).
> > I guess I'll finally return in a week or so.
> 
> The same problem here: almost zero spare time in which I try to
> implement different transaction modes to have a pure journalling
> mode (where reiser4 partitions won't quickly accumulate external
> fragmentation) and pure COW (AKA "Write-Anywhere") mode, which is
> needed to implement snapshots; also COW would be an optimal mode
> for SSD drives.
> 
> > But here's what I currently think about discard implementation.
> > In filesystems like jfs, it is implemented pretty straightforwardly.
> > "Online" discard on block freeing is done through hooking into
> > function dbFree(), which marks the blocks as free in the _working_
> > allocation map. Batch discard via FITRIM ioctl is done through locking
> > the whole allocation group, allocating everything in it, trimming
> > these blocks and freeing them again.
> > 
> > For reiser4, I think it will translate into something like this:
> > With "online" discard, it would be better to discard the blocks at
> > transaction commit time (the time when the working bitmap is copied to the
> > persistent one... am I right?)
> 
> I am sorry, but I still don't know the TRIM/discard background well
> enough to make any decisions. I understand that a file system should
> issue some commands to "help" the hardware? What will those commands
> result in?

---- tl;dr area begin

TRIM is a command in the ATA protocol, operating on a sector range.
It tells the hardware (storage) that the given sector range is no longer
used and hence the data contained in it can be discarded/removed.
(Similar commands exist in several other protocols, like SCSI UNMAP and
SD ERASE; "discard" is the in-kernel abstraction over all such commands.)

The reason we need such a command for SSDs is that in flash memory an
"overwrite data" operation is actually an "erase + write data" and is much
more costly than just a "write data onto free space". Flash memory
is organized into pages (usually 4K), which are further grouped into blocks
(512K); and while a write is done per-page, an erase is done per-block
(so to update a single page, the controller must read the whole block into
cache, erase it, and rewrite all of its pages, merging in the updated one).

Modern controllers do internal block remapping to achieve "wear leveling"
(i.e. spreading wear across all blocks instead of continuously rewriting the
one block which is updated by the user), but they obviously need a pool of
free blocks for that; and in any case, writes to locations that the software
considers empty may still trigger a read-erase-write cycle.

So, the TRIM command notifies the controller that a block can be erased and
returned to the free pool. There is a restriction on the sector ranges given
to the command: they should represent whole blocks
(otherwise they are ignored, AFAIK).

So, from the software's point of view, an SSD-aware operation means:
1) putting whatever is likely to be updated simultaneously into the same
block (TRIM unit);
2) delaying writeback in the hope that more adjacent data will be written at
once;
3) notifying the storage, by issuing a TRIM command, when blocks are
logically freed.

(1) and (2) are largely my guesses (and out of scope anyway), while
(3) is common practice and is implemented at the storage driver, kernel and
filesystem layers.

---- tl;dr area end

So, we need to implement TRIM support in the filesystem (more precisely,
discard support, as we're working with the in-kernel abstraction).

About the implementation:
There is an API call, blkdev_issue_discard() [1], which does all the
work and is supposed to be called from the filesystem. The discard properties
are stored in struct queue_limits (and exposed to userspace via sysfs [3]).
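
For illustration, issuing a discard for a filesystem block range could look
roughly like this (a sketch only, not reiser4 code; discard_extent() is a
helper name I made up, everything else is the current block layer API):

#include <linux/fs.h>
#include <linux/blkdev.h>

static int discard_extent(struct super_block *sb, sector_t start_block,
			  sector_t len_blocks)
{
	struct block_device *bdev = sb->s_bdev;
	struct request_queue *q = bdev_get_queue(bdev);
	sector_t spb = sb->s_blocksize >> 9; /* 512-byte sectors per fs block */

	/* Nothing to do if the device does not support discard. */
	if (!blk_queue_discard(q))
		return 0;

	/* blkdev_issue_discard() operates on 512-byte sector units. */
	return blkdev_issue_discard(bdev, start_block * spb, len_blocks * spb,
				    GFP_NOFS, 0);
}

(GFP_NOFS because we'd presumably be calling this from commit context, where
recursing back into the filesystem for memory reclaim is not allowed.)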

And for the filesystem itself, there are generally two modes of discard
support [2]:
1) "Realtime" or online discard - the filesystem discards blocks as they are
deallocated (files being deleted, tree nodes being cut, etc.).
2) "Batch" discard - the filesystem discards all free blocks upon a user's
request (while mounted).
In the "batch" case, the request is signaled through a FITRIM ioctl on any file.

"Batch" mode:
Implementing it should be simple enough (if I'm making correct assumptions
about how reiser4 works): we can just lock the bitmap and walk through it,
issuing a discard for each sufficiently long run of free blocks.
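
Something like this sketch, where bitmap_is_free() stands in for whatever
accessor reiser4 actually provides (it is not a real reiser4 function) and
discard_extent() is the helper from the previous sketch; minlen would come
from the fstrim_range structure passed to the ioctl:

static int trim_all_free(struct super_block *sb, sector_t nr_blocks,
			 sector_t minlen)
{
	sector_t pos = 0;

	/* The caller is assumed to hold the bitmap lock for the walk. */
	while (pos < nr_blocks) {
		sector_t start, len;

		/* Skip used blocks. */
		while (pos < nr_blocks && !bitmap_is_free(sb, pos))
			pos++;
		start = pos;

		/* Measure the run of free blocks. */
		while (pos < nr_blocks && bitmap_is_free(sb, pos))
			pos++;
		len = pos - start;

		/* Discard the run if it is long enough. */
		if (len >= minlen) {
			int ret = discard_extent(sb, start, len);
			if (ret)
				return ret;
		}
	}
	return 0;
}

Userspace would reach this through the same path the fstrim(8) utility uses,
i.e. ioctl(fd, FITRIM, &range).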

"Realtime" mode:
It will be more complex, given that we have to do the actual work at
transaction commit time.
You are right about the slowness of bitmap comparison (yes, 32K bit
operations... I hadn't thought about it); we'll need to store the locations
to discard in some per-atom data structure.

Let's define a "minimal discard range" as a block range:
1) whose start is properly aligned,
2) whose size equals the discard granularity.
Both properties can be checked using data from struct queue_limits (the exact
algorithm can be derived from the code of blkdev_issue_discard()).
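
The check could look like this (again a sketch; it assumes the byte-based
discard_granularity and discard_alignment fields of struct queue_limits and,
for simplicity, that both are multiples of the 512-byte sector size):

static bool is_minimal_discard_range(struct request_queue *q,
				     sector_t sector, sector_t nr_sects)
{
	sector_t gran = q->limits.discard_granularity >> 9;
	sector_t align = q->limits.discard_alignment >> 9;

	if (!gran)
		return false;	/* device reports no discard granularity */

	/* 1) the start is properly aligned... */
	if (sector < align || (sector - align) % gran)
		return false;

	/* 2) ...and the size equals the discard granularity. */
	return nr_sects == gran;
}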

Actually, simply storing each deallocated range in the atom and then
iterating through the list at commit time would be suboptimal, for two
reasons:
- if a single deallocated range is smaller than the discard granularity, that
particular range won't be discarded even if it is surrounded by enough free
blocks to make up a minimal discard range;
- we wouldn't be able to merge small adjacent ranges into a range that is
long enough.

Solution (a rough sketch in code follows the list):
- record all deallocated ranges verbatim (in a list);
- at commit time, for each recorded range, find the minimal discard range(s)
which encompass it and check whether all of their blocks can be discarded
(i.e. are free);
- add each suitable minimal discard range to a locally-allocated tree (while
merging the added ranges);
- issue a discard for each resulting range.
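
In code, the commit-time pass might look roughly like this. Everything here
is an assumption rather than an existing reiser4 interface: the discard_list
field on the atom, and the round_down_to_unit(), unit_len(),
unit_is_all_free(), tree_insert_merged() and issue_discards_from_tree()
helpers are all hypothetical:

#include <linux/list.h>
#include <linux/rbtree.h>

struct discard_extent {
	struct list_head link;
	sector_t start;		/* first deallocated block */
	sector_t len;		/* number of blocks */
};

static void discard_atom_extents(struct super_block *sb, txn_atom *atom)
{
	struct rb_root tree = RB_ROOT;	/* locally-allocated tree */
	struct discard_extent *ext;

	list_for_each_entry(ext, &atom->discard_list, link) {
		/* Cover the recorded range with whole minimal discard
		 * units, rounding the start down to a unit boundary. */
		sector_t unit = round_down_to_unit(sb, ext->start);

		for (; unit < ext->start + ext->len; unit += unit_len(sb)) {
			/* Keep a unit only if every block in it is free. */
			if (unit_is_all_free(sb, unit))
				tree_insert_merged(&tree, unit, unit_len(sb));
		}
	}

	/* Walk the tree and issue one discard per merged range,
	 * e.g. via discard_extent() from the first sketch. */
	issue_discards_from_tree(sb, &tree);
}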

Hope this won't be too slow. BTW, the kernel sometimes seems to report a
wrong granularity: in my case it is reported as 512 bytes.


[1]: http://www.kernel.org/doc/htmldocs/kernel-api/API-blkdev-issue-discard.html

[2]: http://xfs.org/index.php/FITRIM/discard

[3]: http://www.kernel.org/doc/Documentation/ABI/testing/sysfs-block

> 
> 
> > by performing a comparison between the
> > old (on-disk) and new bitmaps, remembering all changed chunks and
> > issuing discard for them.
> 
> I'm afraid that comparing the bitmaps is expensive: it means
> 4K*8 = 32K comparisons per bitmap block.. Maybe it makes sense to
> accumulate the "difference" in special per-atom data structures
> (say, rb-trees)?
> 
> 
> > Also, the discard granularity can be higher
> > than the bitmap granularity. E. g. if we have a bitmap pattern like
> > "0010" and it changes to "0000", it would be better to issue a discard
> > for 4 blocks instead of just one.
> > 
> > And with FITRIM, we could just lock the bitmap and walk through it,
> > discarding all free chunks. Of course, it can only be done if locking
> > policy allows us to "just lock the bitmap"...
> > 
> > BTW, I'm afraid I don't understand what "a proposal" means. Is it a
> > kind of some official document - and if yes, who needs it?
> 
> Nothing official, this is a usual practice in groups that work
> remotely: someone sends a kind of roadmap. In the simplest case it
> can be a set of links where one can read about TRIM/discard.
> Maybe "proposal" sounds too official? :)
> 
> > For the other things: the freezing issue seems to be related to
> > fsync() indeed; the freeze rate decreased substantially when I stopped
> > using InnoDB as the MySQL backend. Some of them remained, seemingly
> > related to Dropbox (== concurrent reads and writes to the same file).
> 
> This is a known problem, I'll try to find Reiser's suggestions on how to
> resolve this..

Due to the transactional nature of the fs?

> 
> > And yes, I'll try to do the bisection as soon as enough free time
> > appears... Will a virtual machine be enough, or is it crucial that the
> > tests be performed on a real machine?
> 
> It can be remote, but it should be a real machine. BTW, where are you
> located?

I'm in Moscow (RU). Actually, I can do that on my primary PC - if those old
kernels are able to boot on a Sandy Bridge chipset.

BTW, the mirror at mirror.sit.wisc.edu is offline... I'll use
mirror.linux.org.au - and hope that the patches will apply to any of the
intermediate states.
What is the first known bad version?

Ivan.

> 
> Edward.
> 
> > Thanks,
> > Ivan.
> > 
> > 2013/2/10 Edward Shishkin<edward.shishkin@xxxxxxxxx>:
> >> Hi Ivan,
> >> 
> >> How is our TRIM/discard doing?
> >> Any questions, or everything is clear? :)
> >> 
> >> Edward.
> >> 
> >> On 01/17/2013 05:39 PM, Edward Shishkin wrote:
> >>> On 01/07/2013 02:42 AM, Ivan Shapovalov wrote:
> >>>> Hi again Edward,
> >>> 
> >>> Hello.
> >>> 
> >>>> Here's what I want to try to do with reiser4 in the meantime. I'd
> >>>> appreciate some hints on all that...
> >>>> 
> >>>> So, the first thing I'd like to implement is TRIM/discard support,
> >>>> both online (via -o discard) and in a separate FITRIM ioctl().
> >>>> That's just because I got an SSD two days ago and thus now have to
> >>>> use some discard-aware fs like ext4 in rootfs.
> >>> 
> >>> I think it would be nice for a start. Moreover, reiser4 still doesn't
> >>> have any setup optimal for SSDs.
> >>> 
> >>> Unfortunately I don't have a ready proposal for TRIM/discard support in
> >>> reiser4.
> >>> 
> >>> I have ready proposals for the following features (they can be rather
> >>> complicated for the beginners though):
> >>> 
> >>> 1) Repacker (On-line defragmentation);
> >>> 2) Support of different transaction models:
> >>> a. pure journalling;
> >>> b. pure COW (Copy-On-Write);
> >>> c. smart (the current "mixed" one);
> >>> d. no transaction support (for people with UPSs);
> >>> 3) Subvolumes (AKA "chunkfs");
> >>> 4) Snapshots.
> >>> 
> >>>> And then I want to do something with performance: sometimes during
> >>>> heavy I/O to a slow /home storage (especially when it's multithreaded)
> >>>> many processes, including the DE, just get stuck in "D" state and sit
> >>>> there for a minute or two with a load average of approx. 5.5 (on a
> >>>> hyperthreaded 2-core CPU).
> >>> 
> >>> and some process waits for fsync() completion?
> >>> 
> >>>> For the first, I can look into other filesystems' implementations,
> >>>> but I'll probably be unsure at which layer to put the actual discard
> >>>> call (in order not to break reiser4's transactional nature).
> >>> 
> >>> If you decide to proceed with TRIM/discard support, you will need to
> >>> prepare the proposal by yourself. Let's start with some background,
> >>> that is:
> >>> . clarify underlying reasons (specific for SSD geometry?) of
> >>> TRIM/discard support: why do we need such support on the file
> >>> system layer;
> >>> . review of existing hardware and software means for such support;
> >>> . etc..
> >>> 
> >>> And yes, it would be nice to review existing TRIM/discard support
> >>> implementations in other file systems (say, ext4).
> >>> 
> >>> Once we figure out, what bits of reiser4 you should understand
> >>> perfectly to implement TRIM/discard support, I'll provide you with
> >>> respective hints.
> >>> 
> >>>> And for the second, I just don't know why that happens. Can it be
> >>>> due to some r4-specific things/issues, or is it just the horribly
> >>>> slow random access speed of my hw?
> >>> 
> >>> Which hw? SSD?
> >>> 
> >>> I also remember complaints that umount (i.e. the final sync) takes 2-3,
> >>> or even more, minutes. It looks like in some cases reiser4 accumulates
> >>> too much dirty stuff..
> >>> 
> >>> It would be nice to periodically dump some info about atoms (current
> >>> number of all atoms, size of each atom, etc) to see the full picture of
> >>> their evolution during such freezing. I think, it makes sense to port
> >>> the old reiser4 profiling stuff, and populate it with more info (if
> >>> needed).
> >>> 
> >>> Also there is an oldest issue:
> >>> The following (old) benchmarks, created with the mongo(*) test suite,
> >>> show a 2x advantage of reiser4 over reiserfs(v3) in the CREATE phase
> >>> (let's consider only this phase for simplicity):
> >>> 
> >>> 
> >>> http://web.archive.org/web/20061113154648/http://www.namesys.com/benchmarks.html
> >>> 
> >>> 
> >>> I've made similar benchmarks with the latest 2.6 kernels (sorry, I lost
> >>> the results) and found that the advantage has disappeared (real time in
> >>> the CREATE phase is the same as reiserfs's, or even worse). It shouldn't
> >>> be so: it indicates that something wrong is going on.. I remember people
> >>> complained about the performance drop in reiser4 a long time ago, but I
> >>> didn't have a chance to investigate this.
> >>> 
> >>> The straightforward way to narrow down the problem changeset is to
> >>> bisect starting from 2.6.8-mm2, the archives can be found here:
> >>> http://mirror.sit.wisc.edu/pub/linux/kernel/people/akpm/patches/2.6/
> >>> http://ftp.icm.edu.pl/packages/linux-reiserfs/reiser4-for-2.6/
> >>> 
> >>> http://mirror.sit.wisc.edu/pub/linux/kernel/people/edward/reiser4/reiser4-for-2.6/
> >>> 
> >>> However, it can be rather painful and requires a separate machine.
> >>> 
> >>> Thanks,
> >>> Edward.
> >>> 
> >>> (*)
> >>> 
> >>> http://sourceforge.net/projects/reiser4/files/reiser4-utils/bench-stress-tools/