"Yan, Zheng" <zyan@xxxxxxxxxx> writes: > On 4/12/19 9:15 AM, Dave Chinner wrote: >> On Thu, Apr 04, 2019 at 11:18:22AM +0100, Luis Henriques wrote: >>> Dave Chinner <david@xxxxxxxxxxxxx> writes: >>> >>>> On Wed, Apr 03, 2019 at 02:19:11PM +0100, Luis Henriques wrote: >>>>> Nikolay Borisov <nborisov@xxxxxxxx> writes: >>>>>> On 3.04.19 г. 12:45 ч., Luis Henriques wrote: >>>>>>> Dave Chinner <david@xxxxxxxxxxxxx> writes: >>>>>>>> Makes no sense to me. xfs_io does a write() loop internally with >>>>>>>> this pwrite command of 4kB writes - the default buffer size. If you >>>>>>>> want xfs_io to loop doing 1MB sized pwrite() calls, then all you >>>>>>>> need is this: >>>>>>>> >>>>>>>> $XFS_IO_PROG -f -c "pwrite -w -B 1m 0 ${size}m" $file | _filter_xfs_io >>>>>>>> >>>>>>> >>>>>>> Thank you for your review, Dave. I'll make sure the next revision of >>>>>>> these tests will include all your comments implemented... except for >>>>>>> this one. >>>>>>> >>>>>>> The reason I'm using a loop for writing a file is due to the nature of >>>>>>> the (very!) loose definition of quotas in CephFS. Basically, clients >>>>>>> will likely write some amount of data over the configured limit because >>>>>>> the servers they are communicating with to write the data (the OSDs) >>>>>>> have no idea about the concept of quotas (or files even); the filesystem >>>>>>> view in the cluster is managed at a different level, with the help of >>>>>>> the MDS and the client itself. >>>>>>> >>>>>>> So, the loop in this function is simply to allow the metadata associated >>>>>>> with the file to be updated while we're writing the file. If I use a >>>>>> >>>>>> But the metadata will be modified while writing the file even with a >>>>>> single invocation of xfs_io. >>>>> >>>>> No, that's not true. It would be too expensive to keep the metadata >>>>> server updated while writing to a file. So, making sure there's >>>>> actually an open/close to the file (plus the fsync in pwrite) helps >>>>> making sure the metadata is flushed into the MDS. >>>> >>>> /me sighs. >>>> >>>> So you want: >>>> >>>> loop until ${size}MB written: >>>> write 1MB >>>> fsync >>>> -> flush data to server >>>> -> flush metadata to server >>>> >>>> i.e. this one liner: >>>> >>>> xfs_io -f -c "pwrite -D -B 1m 0 ${size}m" /path/to/file >>> >>> Unfortunately, that doesn't do what I want either :-/ >>> (and I guess you meant '-b 1m', not '-B 1m', right?) >> >> Yes. But I definitely did mean "-D" so that RWF_DSYNC was used with >> each 1MB write. >> >>> [ Zheng: please feel free to correct me if I'm saying something really >>> stupid below. ] >>> >>> So, one of the key things in my loop is the open/close operations. When >>> a file is closed in cephfs the capabilities (that's ceph jargon for what >>> sort of operations a client is allowed to perform on an inode) will >>> likely be released and that's when the metadata server will get the >>> updated file size. Before that, the client is allowed to modify the >>> file size if it has acquired the capabilities for doing so. >> >> So you are saying that O_DSYNC writes on ceph do not force file >> size metadata changes to the metadata server to be made stable? >> >>> OTOH, a pwrite operation will eventually get the -EDQUOT even with the >>> one-liner above because the client itself will realize it has exceeded a >>> certain threshold set by the MDS and will eventually update the server >>> with the new file size. 
>>
>> Sure, but if the client crashes without having sent the updated file
>> size to the server as part of an extending O_DSYNC write, then how
>> is it recovered when the client reconnects to the server and
>> accesses the file again?
>
> For DSYNC write, client has already written data to object store. If client
> crashes, MDS will set file to 'recovering' state and probe file size by
> checking object store. Accessing the file is blocked during recovery.

Thank you for chiming in, Zheng.

> Regards
> Yan, Zheng
>
>>
>>> However that won't happen at a deterministic
>>> file size.  For example, if quota is 10m and we're writing 20m, we may
>>> get the error after writing 15m.
>>>
>>> Does this make sense?
>>
>> Only makes sense to me if O_DSYNC is ignored by the ceph client...
>>
>>> So, I guess I *could* use your one-liner in the test, but I would need
>>> to slightly change the test logic -- I would need to write enough data
>>> to the file to make sure I would get the -EDQUOT but I wouldn't be able
>>> to actually check the file size as it will not be constant.
>>>
>>>> Fundamentally, if you find yourself writing a loop around xfs_io to
>>>> break up a sequential IO stream into individual chunks, then you are
>>>> most likely doing something xfs_io can already do.  And if xfs_io
>>>> cannot do it, then the right thing to do is to modify xfs_io to be
>>>> able to do it and then use xfs_io....
>>>
>>> Got it!  But I guess it wouldn't make sense to change xfs_io for this
>>> specific scenario where I want several open-write-close cycles.
>>
>> That's how individual NFS client writes appear to the filesystem under
>> the NFS server. I've previously considered adding an option in
>> xfs_io to mimic this open-write-close loop per buffer so it's easy
>> to exercise such behaviours, but never actually required it to
>> reproduce the problems I was chasing. So it's definitely something
>> that xfs_io /could/ do if necessary.

Ok, since there seem to be other use cases for this, I agree it may be
worth adding that option then.  I'll see if I can come up with a patch
for that.

Cheers,
--
Luis
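
P.S. Just so we're all picturing the same thing: the idea would be to
collapse the open-write-close loop discussed above into a single
xfs_io invocation, something along the lines of the sketch below.
Note that the '-O' option is completely made up for the sake of
illustration -- nothing like it exists in xfs_io today -- and the
exact semantics would obviously be part of the patch discussion:

	# hypothetical '-O' flag (does NOT exist yet): re-open and close the
	# file around every buffer-sized write, mimicking the
	# open-write-close pattern discussed above
	$XFS_IO_PROG -f -c "pwrite -O -b 1m 0 ${size}m" $file | _filter_xfs_io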