On 4/12/19 9:15 AM, Dave Chinner wrote:
On Thu, Apr 04, 2019 at 11:18:22AM +0100, Luis Henriques wrote:
Dave Chinner <david@xxxxxxxxxxxxx> writes:
On Wed, Apr 03, 2019 at 02:19:11PM +0100, Luis Henriques wrote:
Nikolay Borisov <nborisov@xxxxxxxx> writes:
On 3.04.19 at 12:45, Luis Henriques wrote:
Dave Chinner <david@xxxxxxxxxxxxx> writes:
Makes no sense to me. xfs_io does a write() loop internally with
this pwrite command of 4kB writes - the default buffer size. If you
want xfs_io to loop doing 1MB sized pwrite() calls, then all you
need is this:
$XFS_IO_PROG -f -c "pwrite -w -B 1m 0 ${size}m" $file | _filter_xfs_io
Thank you for your review, Dave. I'll make sure the next revision of
these tests addresses all your comments... except for
this one.
The reason I'm using a loop for writing a file is due to the nature of
the (very!) loose definition of quotas in CephFS. Basically, clients
will likely write some amount of data over the configured limit because
the servers they are communicating with to write the data (the OSDs)
have no idea about the concept of quotas (or files even); the filesystem
view in the cluster is managed at a different level, with the help of
the MDS and the client itself.
So, the loop in this function is simply to allow the metadata associated
with the file to be updated while we're writing the file. If I use a
But the metadata will be modified while writing the file even with a
single invocation of xfs_io.
No, that's not true. It would be too expensive to keep the metadata
server updated while writing to a file. So, making sure there's
actually an open/close of the file (plus the fsync in pwrite) helps
make sure the metadata is flushed to the MDS.
/me sighs.
So you want:
	loop until ${size}MB written:
		write 1MB
		fsync
		  -> flush data to server
		  -> flush metadata to server
i.e. this one liner:
xfs_io -f -c "pwrite -D -B 1m 0 ${size}m" /path/to/file
Unfortunately, that doesn't do what I want either :-/
(and I guess you meant '-b 1m', not '-B 1m', right?)
Yes. But I definitely did mean "-D" so that RWF_DSYNC was used with
each 1MB write.
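
For reference, a minimal sketch of the one-liner with the buffer-size flag corrected to -b, as agreed above. The wrapper name write_dsync is hypothetical, and XFS_IO_PROG defaults to plain xfs_io here only so the snippet stands alone outside fstests.

```shell
# Hypothetical wrapper, for illustration only.
XFS_IO_PROG=${XFS_IO_PROG:-xfs_io}

# write_dsync <file> <size in MB>
# A single xfs_io invocation, hence a single open/close, writing
# sequential 1MB buffers from offset 0. -D requests RWF_DSYNC on
# every pwrite call.
write_dsync()
{
    $XFS_IO_PROG -f -c "pwrite -D -b 1m 0 ${2}m" "$1"
}
```

Something like "write_dsync /path/to/file 20" would then attempt a 20MB write in 1MB data-synced chunks.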
[ Zheng: please feel free to correct me if I'm saying something really
stupid below. ]
So, one of the key things in my loop is the open/close operations. When
a file is closed in cephfs the capabilities (that's ceph jargon for what
sort of operations a client is allowed to perform on an inode) will
likely be released and that's when the metadata server will get the
updated file size. Before that, the client is allowed to modify the
file size if it has acquired the capabilities for doing so.
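
The open-write-close pattern described above could look roughly like the sketch below. It is not the code from the patch under review: the function name write_in_chunks is hypothetical, and it assumes that one xfs_io invocation per 1MB chunk is what provides the open/close around each write.

```shell
# Hypothetical helper, for illustration only.
XFS_IO_PROG=${XFS_IO_PROG:-xfs_io}

# write_in_chunks <file> <size in MB>
# Each loop iteration starts a fresh xfs_io process, so every 1MB chunk
# gets its own open, write, sync and close.
write_in_chunks()
{
    local file=$1
    local size_mb=$2
    local i=0

    while [ "$i" -lt "$size_mb" ]; do
        # -f: create the file if needed
        # -w: sync the file once the write completes
        # -b 1m: 1MB buffer; write 1m at offset ${i}m
        $XFS_IO_PROG -f -c "pwrite -w -b 1m ${i}m 1m" "$file"
        i=$((i + 1))
    done
}
```

The close at the end of each iteration is the point where, per the explanation above, the capabilities are likely to be released and the MDS learns the new file size.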
So you are saying that O_DSYNC writes on ceph do not force file
size metadata changes to the metadata server to be made stable?
OTOH, a pwrite operation will eventually get the -EDQUOT even with the
one-liner above because the client itself will realize it has exceeded a
certain threshold set by the MDS and will eventually update the server
with the new file size.
Sure, but if the client crashes without having sent the updated file
size to the server as part of an extending O_DSYNC write, then how
is it recovered when the client reconnects to the server and
accesses the file again?
For a DSYNC write, the client has already written the data to the object
store. If the client crashes, the MDS will set the file to the 'recovering'
state and probe the file size by checking the object store. Access to the
file is blocked during recovery.
Regards
Yan, Zheng
However that won't happen at a deterministic
file size. For example, if quota is 10m and we're writing 20m, we may
get the error after writing 15m.
Does this make sense?
Only makes sense to me if O_DSYNC is ignored by the ceph client...
So, I guess I *could* use your one-liner in the test, but I would need
to slightly change the test logic -- I would need to write enough data
to the file to make sure I would get the -EDQUOT but I wouldn't be able
to actually check the file size as it will not be constant.
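
One way the changed test logic might cope with a non-constant size is to check a range instead of an exact value. The sketch below assumes the final size lands somewhere between the quota and the amount the test tried to write, as in the 10m/20m example above. That lower bound is an assumption, not a documented CephFS guarantee, and the helper name size_in_range is hypothetical.

```shell
# Hypothetical check, for illustration only.
# size_in_range <file> <min bytes> <max bytes>
# Succeeds when the size of <file> lies within [min, max].
size_in_range()
{
    local size

    [ -f "$1" ] || return 1
    # Arithmetic expansion strips any padding wc may add.
    size=$(($(wc -c < "$1")))
    [ "$size" -ge "$2" ] && [ "$size" -le "$3" ]
}
```

With a 10MB quota and a 20MB write, the test could call "size_in_range $file $((10 * 1024 * 1024)) $((20 * 1024 * 1024))" after confirming the write reported the quota error.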
Fundamentally, if you find yourself writing a loop around xfs_io to
break up a sequential IO stream into individual chunks, then you are
most likely doing something xfs_io can already do. And if xfs_io
cannot do it, then the right thing to do is to modify xfs_io to be
able to do it and then use xfs_io....
Got it! But I guess it wouldn't make sense to change xfs_io for this
specific scenario where I want several open-write-close cycles.
That's how individual NFS client writes appear to the filesystem under
the NFS server. I've previously considered adding an option in
xfs_io to mimic this open-write-close loop per buffer so it's easy
to exercise such behaviours, but never actually required it to
reproduce the problems I was chasing. So it's definitely something
that xfs_io /could/ do if necessary.
Cheers,
Dave.