Thanks, Ted! This information is very useful. We won't pursue testing
ext4 in no-journal mode further, since there is no problem e2fsck
cannot fix if data loss is tolerated.

I wanted to point you to an old paper of mine that has a similar goal
of performance at all costs: the No-Order File System
(http://research.cs.wisc.edu/adsl/Publications/nofs-fast12.pdf). It
doesn't use any FLUSH or FUA commands, and instead obtains consistency
from mutual agreement between file-system objects. It requires that we
be able to atomically write a "backpointer" with each disk block
(perhaps in an out-of-band area). I thought you might find it
interesting!
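To give a flavor of the mechanism, here is a minimal sketch of the
backpointer check (the types and field names are illustrative, not the
paper's actual structures; the one hard requirement is that a block
and its backpointer be written atomically):

/*
 * Sketch of backpointer-based consistency, in the spirit of NoFS.
 * Illustrative only: names and layout are not the paper's actual
 * structures.  The assumption is that a block and its backpointer
 * are written atomically (e.g., the backpointer lives in the
 * sector's out-of-band area).
 */
#include <stdbool.h>
#include <stdint.h>

struct backpointer {
        uint64_t owner_inode;   /* inode that claims this block    */
        uint64_t block_offset;  /* which logical block of the file */
};

struct disk_block {
        uint8_t data[4096];
        struct backpointer bp;  /* written atomically with data    */
};

struct inode {
        uint64_t ino;
        uint64_t block_ptrs[12]; /* forward pointers to blocks     */
};

/*
 * Mutual agreement check: the caller reached blk by following
 * ino->block_ptrs[off], and the block's backpointer must name that
 * same inode and offset.  If a crash persisted only one side of the
 * relationship, the check fails and the block is treated as unowned
 * -- no journal, FLUSH, or FUA is needed for consistency.
 */
static bool block_is_consistent(const struct inode *ino, uint64_t off,
                                const struct disk_block *blk)
{
        return blk->bp.owner_inode == ino->ino &&
               blk->bp.block_offset == off;
}

The check runs when a block is accessed, so blocks can reach disk in
any order; a half-persisted relationship simply fails the check.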
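Your point below about FUA writes and CACHE FLUSHes eating into the
drive's IOPS is easy to see even from user space. A rough sketch,
assuming Linux and treating fsync() as a stand-in for a device cache
flush (the file name and write count are arbitrary):

/*
 * Rough illustration of the cost of flushing: the same stream of
 * 4 KiB writes, with and without an fsync() (a stand-in for a device
 * CACHE FLUSH) after each one.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

static double run(const char *path, int nwrites, int do_flush)
{
        char buf[4096];
        struct timespec t0, t1;
        int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

        if (fd < 0) { perror("open"); exit(1); }
        memset(buf, 'x', sizeof(buf));

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < nwrites; i++) {
                if (write(fd, buf, sizeof(buf)) != sizeof(buf)) {
                        perror("write"); exit(1);
                }
                if (do_flush)
                        fsync(fd);  /* force the data out to media */
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        close(fd);
        return (t1.tv_sec - t0.tv_sec) +
               (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
        printf("no flush:   %.3f s\n", run("testfile", 1000, 0));
        printf("with flush: %.3f s\n", run("testfile", 1000, 1));
        return 0;
}

On an HDD with its write-back cache enabled, the flushing loop is
typically orders of magnitude slower, which is exactly the cost both
no_journal mode and NoFS set out to avoid.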
Thanks,
Vijay Chidambaram
http://www.cs.utexas.edu/~vijay/

On Mon, Apr 9, 2018 at 10:12 PM, Theodore Y. Ts'o <tytso@xxxxxxx> wrote:
> On Mon, Apr 09, 2018 at 07:08:13PM -0500, Jayashree Mohan wrote:
>> Hi,
>>
>> We stumbled upon what seems to be a bug that makes a "directory
>> unremovable" on ext4 when mounted with the no_journal option.
>
> Hi Jayashree,
>
> If you use no_journal mode, you **must** run e2fsck after a crash,
> and you have to be prepared for potential data loss after a crash.
> So no, this isn't a bug. The guarantees you have when you use
> no_journal and crash uncleanly are essentially limited to what POSIX
> specifies --- "the results are undefined".
>
>> The sequence of operations listed above is making dir Z unremovable
>> from dir Y, which seems like unexpected behavior. Could you provide
>> more details on the reason for such behavior? We understand we run
>> this on no_journal mode of ext4, but would like you to verify if
>> this behavior is acceptable.
>
> We use no_journal mode at Google, but we are prepared to effectively
> reinstall the root partition, and we are prepared to lose data on
> our data disks, after a crash. We are OK with this because all
> persistent data stored on machines is either data we are prepared to
> lose (e.g., cached data or easily reinstalled system software) or
> part of our cluster file system, where we use erasure codes to
> ensure that data in the cluster file system remains accessible even
> if (a) a disk dies completely, or (b) the entry router on the rack
> dies, denying the cluster file system access to all of the disks in
> that rack until the router can be repaired. So losing a file or a
> directory after running e2fsck after a crash is actually small beer
> compared to any number of other things that can happen to a disk.
>
> The goal for no_journal mode is performance at all costs, and we are
> prepared to sacrifice file system robustness after a crash. This
> means we aren't doing any kind of FUA writes or CACHE FLUSH
> operations, because those would compromise performance. (As a
> thought experiment, I would encourage you to try to design a file
> system that provides better guarantees without using FUA writes or
> CACHE FLUSH operations, and with the HDD's write-back cache
> enabled.)
>
> To understand why this is so important, I would recommend that you
> read the "Disks for Data Centers" paper[1]. There is also a lot of
> good stuff in the FAST 2016 keynote that isn't in the paper or the
> slides, so listening to the audio recording[2] is also something I
> strongly recommend for people who want to understand Google's
> approach to storage. (Before 2016, we had always considered this
> part of our "secret sauce" that we had never disclosed, since it is
> what gave us a huge storage TCO advantage over other companies.)
>
> [1] https://research.google.com/pubs/pub44830.html
> [2] https://www.usenix.org/node/194391
>
> Essentially, we are trying to use both of the baskets of value
> provided by each HDD. That is, we want to use nearly all of the byte
> capacity and all of the IOPS that an HDD can provide --- and FUA
> writes or CACHE FLUSHes significantly compromise the number of I/O
> operations the HDD can deliver. (More details about how we do this
> at the cluster level can be found in the PDSW 2017 keynote[3], but
> that goes well beyond the scope of what gets done on a single file
> system on a single HDD.)
>
> [3] http://www.pdsw.org/pdsw-discs17/slides/PDSW-DISCS-Google-Keynote.pdf
>
> Regards,
>
> - Ted
>
> P.S. This is not to say that the work you are doing with CrashMonkey
> et al. is useless; it's just not applicable to a cluster file system
> in a hyper-scale cloud environment. Robustness of local disk file
> systems after a crash is still important in applications such as
> Android and Chrome OS, for example. Note that we do *not* use
> no_journal mode in those environments. :-)