On 07/19/2013 03:52 PM, Bernd Schubert wrote:
Hello Ric, hi all,
On 07/12/2013 07:20 PM, Ric Wheeler wrote:
If you have topics that you would like to add, wait until the
instructions get posted at the link above. If you are impatient, feel
free to email me directly (but probably best to drop the broad mailing
lists from the reply).
Sorry, this will be a rather long introduction; the short conclusion is below.
Introduction to the meta-cache issue:
=====================================
For quite a while we have been redesigning our FhGFS storage layout to work
around meta-cache issues of the underlying file systems. However, there are
constraints, as data and meta-data are distributed across several
targets/servers. Other distributed file systems, such as Lustre and (I think)
cephfs, should have similar issues.
The main issue is that streaming reads/writes evict meta-data pages from the
page cache, which results in lots of directory-block reads when creating
files. FhGFS, Lustre and (I believe) cephfs use hash directories to store
object files. Access to files in these hash directories is rather random, and
with an increasing number of files, access to hash directory blocks/pages
also becomes entirely random. Streaming IO easily evicts these pages, which
results in high latencies when users perform file creates/deletes, as the
corresponding directory blocks have to be re-read from disk again and again.
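To illustrate why access to hash directories ends up random, here is a minimal
sketch. It is not the FhGFS, Lustre or cephfs scheme; the SHA-1 based mapping,
the function names and the object naming are assumptions made up for this
example. Only the directory count (16384) is taken from the test setup
described in this mail.

```python
import hashlib
from collections import Counter

NUM_HASH_DIRS = 16384  # directory count from the test setup in this mail


def hash_dir_for(object_id: str, num_dirs: int = NUM_HASH_DIRS) -> int:
    """Map an object ID to a hash-directory index (hypothetical scheme)."""
    digest = hashlib.sha1(object_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_dirs


def dirs_touched(object_ids):
    """Count how many creates land in each hash directory."""
    return Counter(hash_dir_for(oid) for oid in object_ids)


# 1000 consecutively named objects are spread over many different directories,
# so each create tends to need a different directory block.
touched = dirs_touched(f"obj-{i:08d}" for i in range(1000))
print(len(touched), "distinct hash directories touched by 1000 creates")
```

With a mapping like this, consecutive creates almost never hit the same
directory, so every create depends on a directory block that streaming IO may
already have pushed out of the page cache.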
Now one could argue that hash directories are a poor choice, and indeed we are
mostly solving that issue in FhGFS now (current stable release on the meta
side, upcoming release on the data/storage side).
However, given the problem of distributed meta-data and distributed data, we
have not yet found a way to entirely eliminate hash directories. For example,
recently one of our users created 80 million directories with one or two files
in each, and even with the new layout that would still be an
issue. It is even an issue with direct access on the underlying file system.
Of course, basically empty directories should be avoided altogether, but users
have their own way of doing IO.
Furthermore, the meta-cache vs. streaming-cache issue is not limited to
directory blocks; any cached meta-data is affected. Mel recently
wrote a few patches to improve meta-caching ("Obey mark_page_accessed hint
given by filesystems"), but at least for our directory-block issue that
doesn't seem to help.
Conclusion:
===========
From my point of view, there should be a small, but configurable, number of
pages reserved for meta-data only. If streaming IO were not able to evict
these pages, our and other file systems' meta-cache issues would probably be
solved entirely.
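As a toy illustration of that idea (not kernel code, and not a proposal for
the actual implementation), the sketch below compares a single shared LRU
cache with a cache that has a reserved meta-data partition. All sizes, names
and the workload (one random directory-block access followed by 50 streaming
pages) are assumptions chosen for this example, and for simplicity meta-data
pages live only in the reserved partition.

```python
import random
from collections import OrderedDict


class LRU:
    """Minimal LRU cache that only tracks which pages are resident."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()

    def access(self, key):
        """Touch a page; return True on a hit, False on a miss."""
        if key in self.pages:
            self.pages.move_to_end(key)
            return True
        self.pages[key] = None
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)  # evict the least recently used page
        return False


def meta_hit_rate(meta_cache, data_cache, meta_pages=500, rounds=10000,
                  stream_per_meta=50, seed=0):
    """Interleave random meta-data accesses with one-shot streaming pages."""
    rng = random.Random(seed)
    next_data_page = 0
    hits = 0
    for _ in range(rounds):
        if meta_cache.access(("meta", rng.randrange(meta_pages))):
            hits += 1
        for _ in range(stream_per_meta):
            data_cache.access(("data", next_data_page))  # never re-read
            next_data_page += 1
    return hits / rounds


TOTAL_PAGES = 4000     # assumed total cache size in pages
RESERVED_PAGES = 500   # assumed reserved meta-data pages

shared = LRU(TOTAL_PAGES)
shared_rate = meta_hit_rate(shared, shared)  # meta and data compete
reserved_rate = meta_hit_rate(LRU(RESERVED_PAGES),
                              LRU(TOTAL_PAGES - RESERVED_PAGES))

print(f"shared LRU meta hit rate:   {shared_rate:.2f}")
print(f"reserved pool meta hit rate: {reserved_rate:.2f}")
```

In the shared case the streaming pages push directory blocks out before they
are touched again; with the reserved partition only the initial cold misses
remain. A real implementation would of course have to deal with sizing,
reclaim under memory pressure and which pages count as meta-data.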
Example:
========
Just a very basic bonnie++ test with 60000 files on ext4 with inlined
data to reduce block and bitmap lookups and writes.
Entirely cached hash directories (16384), which are populated with about 16
million files, so about 1000 files per hash-dir.
Version  1.96       ------Sequential Create------ --------Random Create--------
fslab3              -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files:max:min        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
60:32:32             1702  14  2025  12  1332   4  1873  16  2047  13  1266   3
Latency                3874ms    6645ms    8659ms     505ms    7257ms    9627ms
1.96,1.96,fslab3,1,1374655110,,,,,,,,,,,,,,60,32,32,,,1702,14,2025,12,1332,4,1873,16,2047,13,1266,3,,,,,,,3874ms,6645ms,8659ms,505ms,7257ms,9627ms
Now after clients did some streaming IO:
Version  1.96       ------Sequential Create------ --------Random Create--------
fslab3              -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files:max:min        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
60:32:32              541   4  2343  16  2103   6   586   5  1947  13  1603   4
Latency                 190ms     166ms    3459ms    6762ms    6518ms    9185ms
With longer/more streaming this can go down to 25 creates/s. iostat and btrace
then show lots of meta-data reads, which correspond to directory-block reads.
Now after running 'find' over these hash directories to re-read all blocks:
Version  1.96       ------Sequential Create------ --------Random Create--------
fslab3              -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files:max:min        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
60:32:32             1878  16  2766  16  2464   7  1506  13  2054  13  1433   4
Latency                 349ms     164ms    1594ms    7730ms    6204ms    8112ms
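To put the three runs side by side, the create rates can be expressed relative
to the fully cached run. The snippet below only re-uses the creates/s numbers
from the bonnie++ output in this mail; the run labels and the helper name are
made up for this example.

```python
# Create rates (files/s) copied from the three bonnie++ runs in this mail.
runs = {
    "cached":       {"seq_create": 1702, "rand_create": 1873},
    "after stream": {"seq_create": 541,  "rand_create": 586},
    "after find":   {"seq_create": 1878, "rand_create": 1506},
}


def relative_to_baseline(runs, baseline="cached"):
    """Express every rate as a fraction of the baseline run's rate."""
    base = runs[baseline]
    return {
        name: {key: value / base[key] for key, value in rates.items()}
        for name, rates in runs.items()
    }


ratios = relative_to_baseline(runs)
for name, rates in ratios.items():
    print(f"{name:13s} seq {rates['seq_create']:.2f}x  "
          f"rand {rates['rand_create']:.2f}x")
```

After streaming IO, both sequential and random creates drop to roughly a third
of the cached rate, and re-reading the directory blocks with 'find' brings
them back close to the cached numbers.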
Would a dedicated meta-cache be a topic for discussion?
Thanks,
Bernd
Hi Bernd,
I think that sounds like an interesting idea to discuss - can you add a proposal
here:
http://www.linuxplumbersconf.org/2013/ocw/events/LPC2013/proposals
Thanks!
Ric