Re: Ext3 Question re: Journal and data

Rick Stevens <ricks@xxxxxxxxxxxxxx> · Thu, 20 Apr 2017 11:04:05 -0700

On 04/19/2017 07:03 PM, JD wrote:
> 
> 
> On 04/19/2017 05:07 PM, Rick Stevens wrote:
>> On 04/19/2017 12:53 PM, JD wrote:
>>> On Tue, Apr 18, 2017 at 9:13 PM, Chris Murphy <lists@xxxxxxxxxxxxxxxxx> wrote:
>>>
>>>      On a journaled filesystem, data and journal only are committed with
>>>      sync(). You have to umount or remount readonly to get all filesystem
>>>      metadata to commit.
>>>
>>>      After sync() it's expected you can crash, and the filesystem will
>>>      be made consistent at next remount when the journal is replayed.
>>>
>>>      If anything tries to find files with data committed, journal
>>>      committed, but fs metadata not committed: such as GRUB or debug
>>>      tools, they will fail.
>>>
>>>      Another option is to freeze/unfreeze. That was originally an XFS
>>>      feature, but is now a generic capability. What I'm not totally sure
>>>      about offhand is whether the XFS user-space tools are what to use
>>>      for any filesystem; I'm pretty sure they are.
>>>
>>>
>>>      Chris Murphy
>>>
>>> Could you explain what the journal is holding: The User Data, the
>>> Metadata, or Both?
>>> If both, should not a sync clear the contents of the journal after the
>>> completion of a sync (assuming no other io operation was done after the
>>> sync)?
>>> If not both (i.e. ONLY metadata), then replaying the journal only
>>> preserves the metadata that describe the files (name, mode, ...etc).
>>>
>>> Another question: What if the FS is mounted with the SYNC option in
>>> fstab, such as:
>>> UUID=71af3828-c4cd-2d26-b1f7-8337def05b8c   /sdd1   ext3 sync,rw     0 0
>>>
>>> Would that cause immediate commit of the DATA, or would that cause the
>>> commit of the METADATA?
>> What the journal holds depends on how you mount the filesystem. The
>> default mode is called "ordered" mode, and data is written directly to
>> the filesystem BEFORE metadata is written to the journal. The journal is
>> flushed into the directory inodes and such periodically by the kernel
>> (or a sync call).
>>
>> In "journal" mode, ALL data (both raw data and metadata) is written to
>> the journal and committed to the filesystem periodically or by a sync
>> call. This is the slowest mode but guarantees consistency and also
>> requires the biggest journal (since you're also journalling data).
>>
>> In "writeback" mode, there's no guarantee when the raw data is written
>> to the filesystem (it could be after the metadata is put into the
>> journal). "writeback" is the fastest, but can cause old data to appear
>> after a crash and journal playback because the journal would have the
>> new metadata but the new data hadn't been written before the crash.
>> IMHO, the slight improvement in I/O using writeback mode isn't worth the
>> risk, but every application and environment is different.
>>
>> You can specify how often the journal flush occurs using the "commit"
>> option of the mount command. The default time is five seconds. You can
>> make it faster at the expense of more CPU time being sucked up by the
>> "jbd2" and "kdmflush" threads.
> 
> Since for the disks in question, io performance is not my primary goal,
> what do you think of the following mount options listed on URL
> https://unix.stackexchange.com/questions/78861/what-mount-option-to-use-for-ext3-file-system-to-minimise-data-loss-or-corruptio
> 
> 
> auto,exec,relatime,sync,barrier=1,commit=1,data=ordered,data_err=abort,noatime

Those are more or less the standard mount options, with the following
exceptions:

	1. A commit time of one second (default is five seconds)
	2. "noatime" (don't update access times for files)

"noatime" reduces I/O on the filesystem. I use it on network mounts
(NFS, CIFS, iSCSI) to cut latency and I/O load, but not so much on local
block devices (I don't think the performance boost is really worth it
there). The only thing that would make the data more secure is to use
"data=journal", in which case you may have to redo the filesystem to
get a bigger journal.
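
As a quick sanity check on an option string like the one you quoted,
here is a minimal Python sketch. It is my own illustration, not a
standard tool, and the names parse_mount_options and atime_conflicts are
made up. It only splits the string into key/value pairs and reports
whether more than one atime-related flag is present (the quoted string
has both "relatime" and "noatime").

```python
# Hypothetical helper names; this only inspects the text of an option
# string and does not touch any mounted filesystem.
ATIME_OPTIONS = {"atime", "noatime", "relatime", "strictatime"}


def parse_mount_options(option_string):
    """Split 'a,b=1,c' into {'a': None, 'b': '1', 'c': None}."""
    options = {}
    for item in option_string.split(","):
        item = item.strip()
        if not item:
            continue
        key, sep, value = item.partition("=")
        options[key] = value if sep else None
    return options


def atime_conflicts(options):
    """Return the atime-related flags present if more than one is given."""
    found = sorted(ATIME_OPTIONS & set(options))
    return found if len(found) > 1 else []


quoted = ("auto,exec,relatime,sync,barrier=1,commit=1,"
          "data=ordered,data_err=abort,noatime")
opts = parse_mount_options(quoted)
print("journal mode:", opts.get("data"))
print("commit interval (s):", opts.get("commit"))
print("atime flags given together:", atime_conflicts(opts))
```

Which of the two atime flags actually takes effect is something I would
confirm on the running system rather than assume from the string.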

If it's an ext4 filesystem, you could also add things like:

	journal_ioprio=2
	max_batch_time=10000

The first raises the priority of I/O operations submitted by kjournald2
(values are 0-7, with 0 being the highest priority; the default is 3).
The second reduces the time the disk subsystem waits to "batch"
operations from the default of 15 ms (15000) to 10 ms (10000); the
value is given in microseconds. Note that both of these changes will
make your system "busier".
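
If you do change these, it is worth checking what the kernel reports for
the mount afterwards. Below is a minimal Python sketch, assuming the
usual /proc/mounts layout (device, mount point, type, comma-separated
options, two numeric fields). The function name options_for_mountpoint
and the sample text are made up for illustration, and the kernel may
leave out options that are still at their default values.

```python
def options_for_mountpoint(mounts_text, mountpoint):
    """Return the option list for `mountpoint`, or None if it is absent."""
    result = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == mountpoint:
            # Keep the last match: later mounts shadow earlier ones.
            result = fields[3].split(",")
    return result


# Made-up sample in /proc/mounts format. On a live system you would read
# the real file instead, e.g. open("/proc/mounts").read().
SAMPLE = (
    "/dev/sdd1 /sdd1 ext4 rw,noatime,journal_ioprio=2,"
    "max_batch_time=10000,commit=1,data=ordered 0 0\n"
)

opts = options_for_mountpoint(SAMPLE, "/sdd1")
# max_batch_time is expressed in microseconds; convert for readability.
batch_us = next(int(o.split("=")[1]) for o in opts
                if o.startswith("max_batch_time="))
print("max_batch_time: %d us = %.0f ms" % (batch_us, batch_us / 1000))
```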

For most folk and workloads, the default settings (or those that you
specified above) are probably more than adequate. Tuning filesystems
and I/O reliability is more an art than a science (and many think it's
a black art as well) and depends on your needs, workloads and tasks.
----------------------------------------------------------------------
- Rick Stevens, Systems Engineer, AllDigital    ricks@xxxxxxxxxxxxxx -
- AIM/Skype: therps2        ICQ: 226437340           Yahoo: origrps2 -
-                                                                    -
-        Brain:  The organ with which we think that we think.        -
----------------------------------------------------------------------