On Thu, May 31, 2018 at 1:06 PM, Gao Xiang <gaoxiang25@xxxxxxxxxx> wrote:
> Hi all,
>
> Read-only file systems are used in many cases, such as read-only storage media.
> We are now focusing on the Android device which several read-only partitions exist.
> Due to limited read-only solutions, a new read-only file system EROFS
> (Extendable Read-Only File System) is introduced.

In which sense is it extendable?

> As the other read-only file systems, several meta regions in generic file systems
> such as free space bitmap are omitted. But the difference is that EROFS focuses
> more on performance than purely on saving storage space as much as possible.
>
> Furthermore, we also add the compression support called z_erofs.
>
> Traditional file systems with the compression support use the fixed-sized input
> compression, the output compressed units could be arbitrary lengths.
> However, data is accessed in the block unit for block devices, which means
> (A) if the accessed compressed data is not buffered, some data read from
> the physical block cannot be further utilized, which is illustrated as follows:
>
>     ++-----------++-----------++          ++-----------++-----------++
>  ...||           ||           ||   ...    ||           ||           ||  ... original data
>     ++-----------++-----------++          ++-----------++-----------++
>      \                        /            \                        /
>        \                  /                  \                     /
>         \              /                       \                  /
>     ++---|-------++--|--------++          ++-----|----++--------|--++
>     ||xxx|       ||  |xxxxxxxx||   ...    ||xxxxx|    ||        |xx||  compressed data
>     ++---|-------++--|--------++          ++-----|----++--------|--++
>
> The shadow regions read from the block device but cannot be used for decompression.
>
> (B) If the compressed data is also buffered, it will increase the memory overhead.
> Because these are compressed data, it cannot be directly used, and we don't know
> when the corresponding compressed blocks are accessed, which is not friendly to
> the random read.
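
To check that I read (A) correctly, here is a minimal sketch of the read
overhead, assuming 4 KiB device blocks and compressed units packed back to
back. The compressed sizes are made-up numbers, not measurements, and
`blocks_for_extent` / `shadow_bytes` are hypothetical helper names:

```python
BLOCK = 4096  # assumed device block size in bytes


def blocks_for_extent(start, length, block=BLOCK):
    """Return (first_block, last_block) touched by the byte range."""
    first = start // block
    last = (start + length - 1) // block
    return first, last


def shadow_bytes(start, length, block=BLOCK):
    """Bytes read from the device that are not part of the extent."""
    first, last = blocks_for_extent(start, length, block)
    return (last - first + 1) * block - length


def layout(sizes):
    """Pack compressed units back to back and return their start offsets."""
    offsets, pos = [], 0
    for size in sizes:
        offsets.append(pos)
        pos += size
    return offsets


if __name__ == "__main__":
    # Hypothetical compressed sizes of consecutive fixed-size input units.
    compressed_sizes = [3000, 5000, 2500, 6100]
    for off, size in zip(layout(compressed_sizes), compressed_sizes):
        first, last = blocks_for_extent(off, size)
        print(f"unit at {off}: blocks {first}..{last}, "
              f"shadow bytes {shadow_bytes(off, size)}")
```

With these made-up sizes, reading the second unit (5000 bytes at offset 3000)
touches two blocks and drags in 3192 bytes that belong to its neighbours.
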
>
> In order to reduce the proportion of the data which cannot be directly decompressed,
> larger compressed sizes are preferred to be selected, which is also not friendly to
> the random read.
>
> Erofs implements the compression in a different approach, the details of which will
> be discussed in the next section.
>
> In brief, the following points summarize our design at a high level:
>
> 1) Use page-sized blocks so that there are no buffer heads.
>
> 2) By introducing a more general inline data / xattr, metadata and small data have
> the opportunity to be read with the inode metadata at the same time.
>
> 3) Introduce another shared xattr region in order to store the common xattrs (eg.
> selinux labels) or xattrs too large to be suitable for meta inline.
>
> 4) Metadata and data could be mixed by design, so it could be more flexible for mkfs
> to organize files and data.
>
> 5) instead of using the fixed-sized input compression, we put forward a new fixed
> output compression to make the full use of IO (which means all data from IO can be
> decompressed), reduce the read amplification, improve random read and keep the
> relatively lower compression ratios, illustrated as follows:
>
>
>          |---- varient-length extent ----|------ VLE ------|---  VLE ---|
>          /> clusterofs                   /> clusterofs     /> clusterofs /> clusterofs
>     ++---|-------++-----------++---------|-++-----------++-|---------++-|
>  ...||   |       ||           ||         | ||           || |         || | ... original data
>     ++---|-------++-----------++---------|-++-----------++-|---------++-|
>     ++->cluster<-++->cluster<-++->cluster<-++->cluster<-++->cluster<-++
>          size         size         size         size         size
>           \                            /                 /            /
>            \                       /               /            /
>             \                  /             /            /
>               ++-----------++-----------++-----------++
>           ... ||           ||           ||           || ... compressed clusters
>               ++-----------++-----------++-----------++
>               ++->cluster<-++->cluster<-++->cluster<-++
>                    size         size         size
>
> A cluster could have more than one blocks by design, but currently we only have the
> page-sized cluster implementation (page-sized fixed output compression can also have
> better compression ratio than fixed input compression).
>
> All compressed clusters have a fixed size but could be decompressed into extents with
> arbitrary lengths.
>
> In addition, if a buffered IO reads the following shadow region (x), we could make a more
> customized path (to replace generic_file_buffered_read) which only reads one compressed
> cluster and makes the partial page available.
>          /> clusterofs
>     ++---|-------++
>  ...||   | xxxx  ||   ...
>     ||---|-------||
>
> Some numbers using fixed output compression (VLE, cluster size = block size = 4k) on
> the server and Android phone (kirin970 platform):
>
> Server (magnetic disk):
>
> compression   EROFS seq read   EXT4 seq read   EROFS random read   EXT4 random read
>    ratio         bw[MB/s]         bw[MB/s]       bw[MB/s] (20%)      bw[MB/s] (20%)
>
>       4           480.3            502.5              69.8                11.1
>      10           472.3            503.3              56.4                10.0
>      15           457.6            495.3              47.0                10.9
>      26           401.5            511.2              34.7                11.1
>      35           389.1            512.5              28.0                11.0
>      48           375.4            496.5              23.2                10.6
>      53           370.2            512.0              21.8                11.0
>      66           349.2            512.0              19.0                11.4
>      76           310.5            497.3              17.3                11.6
>      85           301.2            512.0              16.0                11.0
>      94           292.7            496.5              14.6                11.1
>     100           538.9            512.0              11.4                10.8
>
> Kirin970 (A73 Big-core 2361Mhz, A53 little-core 0Mhz, DDR 1866Mhz):

What storage was used? An eMMC?

> compression   EROFS seq read   EXT4 seq read   EROFS random read   EXT4 random read
>    ratio         bw[MB/s]         bw[MB/s]       bw[MB/s] (20%)      bw[MB/s] (20%)
>
>       4           546.7            544.3             157.7                57.9
>      10           535.7            521.0             152.7                62.0
>      15           529.0            520.3             125.0                65.0
>      26           418.0            526.3              97.6                63.7
>      35           367.7            511.7              89.0                63.7
>      48           415.7            500.7              78.2                61.2
>      53           423.0            566.7              72.8                62.9
>      66           334.3            537.3              69.8                58.3
>      76           387.3            546.0              65.2                56.0
>      85           306.3            546.0              63.8                57.7
>      94           345.0            589.7              59.2                49.9
>     100           579.7            556.7              62.1                57.7

How does it compare to existing read only filesystems, such as squashfs?

--
Thanks,
//richard