On Mon, Apr 09, 2018 at 07:08:13PM -0500, Jayashree Mohan wrote:
> Hi,
>
> We stumbled upon what seems to be a bug that makes a “directory
> unremovable”, on ext4 when mounted with no_journal option.

Hi Jayashree,

If you use no_journal mode, you **must** run e2fsck after a crash.  And you do have to potentially be ready for data loss after a crash.  So no, this isn't a bug.  The guarantees that you have when you use no_journal are essentially limited to what POSIX specifies when you crash uncleanly --- "the results are undefined".

> The sequence of operations listed above is making dir Z unremovable
> from dir Y, which seems like unexpected behavior. Could you provide
> more details on the reason for such behavior? We understand we run
> this on no_journal mode of ext4, but would like you to verify if this
> behavior is acceptable.

We use no_journal mode at Google, but we are prepared to effectively reinstall the root partition, and we are prepared to lose data on our data disks, after a crash.  We are OK with this because all persistent data stored on machines is data we are prepared to lose (e.g., cached data or easily reinstalled system software) or part of our cluster file system, where we use erasure codes to assure that data in the cluster file system can remain accessible even if (a) a disk dies completely, or (b) the entry router on the rack dies, denying access to all of the disks in a rack from the cluster file system until the router can be repaired.  So losing a file or a directory after running e2fsck after a crash is actually small beer compared to any number of other things that can happen to a disk.

The goal for no_journal mode is performance at all costs, and we are prepared to sacrifice file system robustness after a crash.  This means we aren't doing any kind of FUA writes or CACHE FLUSH operations, because those would compromise performance.  (As a thought experiment, I would encourage you to try to design a file system that would provide better guarantees without using FUA writes or CACHE FLUSH operations, and with the HDD's write-back cache enabled.)

To understand why this is so important, I would recommend that you read the "Disks for Data Centers" paper[1].  There is also a lot of good stuff in the FAST 2016 keynote[2] that isn't in the paper or the slides, so listening to the audio recording is also something I strongly recommend for people who want to understand Google's approach to storage.  (Before 2016, we had always considered this part of our "secret sauce" that we had never disclosed for the past decade, since it is what gave us a huge storage TCO advantage over other companies.)

[1] https://research.google.com/pubs/pub44830.html
[2] https://www.usenix.org/node/194391

Essentially, we are trying to use both of the baskets of value provided by each HDD.  That is, we want to use nearly all of the byte capacity and all of the IOPS that an HDD can provide --- and FUA writes or CACHE FLUSHES significantly compromise the number of I/O operations the HDD can provide.  (More details about how we do this at the cluster level can be found in the PDSW 2017 keynote[3], but it goes well beyond the scope of what gets done on a single file system on a single HDD.)

[3] http://www.pdsw.org/pdsw-discs17/slides/PDSW-DISCS-Google-Keynote.pdf

Regards,

						- Ted

P.S.  This is not to say that the work you are doing with Crashmonkey et al. is useless; it's just not applicable for a cluster file system in a hyper-scale cloud environment.
Local disk file system robustness after a crash is still important in applications such as Android and Chrome OS, for example.  Note that we do *not* use no_journal mode in those environments.  :-)
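
P.P.S.  For anyone who wants to double-check whether a given file system was created without a journal (and therefore needs a full e2fsck after any unclean shutdown), "dumpe2fs -h <device>" will show it in the feature list.  Below is a rough, illustrative C sketch that makes the same check by reading the superblock directly; the offsets and the HAS_JOURNAL flag value hard-coded here are assumptions for the purposes of the example (the authoritative definitions live in the e2fsprogs and kernel headers), so treat it as a sketch rather than a reference implementation.

/*
 * Hypothetical sketch (not from e2fsprogs): report whether an ext4
 * image or device has a journal.  Assumes the usual on-disk layout:
 * superblock at byte offset 1024, s_magic at offset 56 within it,
 * s_feature_compat at offset 92, COMPAT_HAS_JOURNAL == 0x0004.
 * For real work, use "dumpe2fs -h <device>" instead.
 */
#include <stdio.h>
#include <stdint.h>

#define SB_OFFSET          1024
#define SB_MAGIC_OFF       56      /* s_magic, little-endian u16 */
#define SB_COMPAT_OFF      92      /* s_feature_compat, little-endian u32 */
#define EXT_MAGIC          0xEF53
#define COMPAT_HAS_JOURNAL 0x0004

static uint32_t le32(const unsigned char *p)
{
	return p[0] | (p[1] << 8) | (p[2] << 16) | ((uint32_t)p[3] << 24);
}

int main(int argc, char **argv)
{
	unsigned char sb[1024];
	FILE *f;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <device-or-image>\n", argv[0]);
		return 1;
	}
	f = fopen(argv[1], "rb");
	if (!f || fseek(f, SB_OFFSET, SEEK_SET) != 0 ||
	    fread(sb, sizeof(sb), 1, f) != 1) {
		perror(argv[1]);
		return 1;
	}
	fclose(f);

	/* Magic number check: is this an ext2/3/4 superblock at all? */
	if ((sb[SB_MAGIC_OFF] | (sb[SB_MAGIC_OFF + 1] << 8)) != EXT_MAGIC) {
		fprintf(stderr, "%s: not an ext2/3/4 file system\n", argv[1]);
		return 1;
	}

	if (le32(sb + SB_COMPAT_OFF) & COMPAT_HAS_JOURNAL)
		printf("has_journal is set: normal journal recovery applies\n");
	else
		printf("no journal: run a full e2fsck after any unclean shutdown\n");
	return 0;
}

Running it against an image created with "mke2fs -t ext4 -O ^has_journal" should print the "no journal" line.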