On Mon, Jun 19, 2017 at 12:12 PM, Amir Goldstein <amir73il@xxxxxxxxx> wrote: > On Mon, Jun 19, 2017 at 12:06 PM, Miklos Szeredi <miklos@xxxxxxxxxx> wrote: >> On Mon, Jun 19, 2017 at 10:45 AM, Amir Goldstein <amir73il@xxxxxxxxx> wrote: >>> On Sat, Jun 10, 2017 at 1:56 AM, Amir Goldstein <amir73il@xxxxxxxxx> wrote: >>>> On Fri, Jun 9, 2017 at 4:14 PM, Miklos Szeredi <miklos@xxxxxxxxxx> wrote: >>>>> On Fri, Jun 9, 2017 at 1:49 PM, Miklos Szeredi <miklos@xxxxxxxxxx> wrote: >>>>>> On Fri, Jun 9, 2017 at 11:38 AM, Amir Goldstein <amir73il@xxxxxxxxx> wrote: >>>>> >>>>>> The challenge is in accounting the number of link-ups atomically with >>>>>> the link-up itself. No ideas off-hand. >>>>> >>>>> Following would work I think: >>>>> >>>>> - copy up to indexdir (it now has st_nlink == 1) >>>>> - set overlay.nlinkup to "1-2" >>>>> + the first number indicates the target number of linkups >>>>> + the second indicates the target st_nlink on the file >>>>> - fsync the file (hopefully this writes the xattrs as well as the data) >>>>> - link the file from indexdir to upper >>>>> - set overlay.nlinkup to "1" >>>>> + this indicates that the linkup was finished and number of linkups is 1. >>>>> >>>> >>>> Or like this: >>>> >>>> struct ovl_nlink_adjust { >>>> int nlink_diff; >>>> bool from_lower; >>>> } >>>> >>>> Init overlay inode: >>>> overlay_nlink := (from_lower ? lower_nlink : upper_nlink) + nlink_diff >>>> >>>> Copy/link up: >>>> - take ovl_inode lock >>>> - set overlay.nlink { (overlay_nlink - lower_link), true } >>>> - link index (won't change diff from lower inode nlink on fail/success) >>>> >>>> Unlink/link/rename: >>>> - take ovl_inode lock >>>> - set overlay.nlink { (overlay_nlink - upper_nlink), false } >>>> - unlink/link/rename (won't change diff from upper nlink on fail/success) >> >> What I don't like about this is how it ends up in different states >> depending on the last operation. How about moving "set overlay.nlink >> { (overlay_nlink - upper_nlink), false }" into the link-up region as >> well? >> > > Sure. that's not a problem. > But then I have 2 options in case get_inode finds diff from lower: > - either set it right a way to diff from upper > - or let the next copyup/nlink operation store the consistent state > > I guess you prefer the former? > > Note that I do have to change the unlink/rename/link operations > to grab the ovl_inode lock anyway, so at least w.r.t. code complexity > it is simple to add those ovl_set_nlink() calls. to unlink/rename/link. I don't really have a preference when to restore the "false" state. If it's simpler to do at unlink/rename/link time than at lookup time, then that's fine. >>>> I don't think fsync matters to this game. >>>> I'd be surprised if the nlink changing operations (link/unlink/rename) can be >>>> reordered with the setxattr operations on a power fail safe fs. >> >> Why not? I know data changes can be reordered around rename without >> an fsync, for example. >> > > Certainly, data changes are not journalled in common ext4/xfs configurations > and the data=ordered of ext4 guaranties only that data is flushed > before *related* > metadata is journalled and does not guaranty at all that data/metadata > are flushed > in order that they were changed. > > The thing with rename vs. data change is that rename doesn't necessarily > change the renamed inode, so journalling the rename operation doesn't > need to journal the inode itself and therefore, there are no *related* metadata > changes that require flushing the inode data. > > But unlike rename(taget, foo), the operations unlink(target), > link(target, foo) and > rename(foo, target) all change nlink of target inode and therefore must journal > the target inode, the same target inode whose xattr are being written before and > after the nlink change, and which also implies ctime change. > > So I'm pretty sure that reordering xattr writes with nlink change of > the target inode > are not possible with the traditional journalling file system, but I > don't know of any > written rules about this. > > Do you suggest that we call vfs_fsync() at the end of ovl_copy_up_inode() > after ovl_set_origin() instead of only at the end of ovl_copy_up_data()? > I am not at ease with the performance overhead this could generate, but > perhaps there is no reason for concern. Link-up is a rare corner case. I don't think we should worry about performance. Thanks, Miklos