On Sat, Aug 11, 2018 at 08:50:49AM -0400, Brian Foster wrote: > On Fri, Aug 10, 2018 at 12:36:51PM -0700, Darrick J. Wong wrote: > > On Fri, Aug 10, 2018 at 03:07:40PM -0400, Brian Foster wrote: > > > On Fri, Aug 10, 2018 at 08:39:44AM -0700, Darrick J. Wong wrote: > > > > On Fri, Aug 10, 2018 at 06:33:52AM -0400, Brian Foster wrote: > > > > > On Thu, Aug 09, 2018 at 08:59:59AM -0700, Darrick J. Wong wrote: > > > > > > On Thu, Aug 09, 2018 at 08:00:28AM -0400, Brian Foster wrote: > > > > > > > On Wed, Aug 08, 2018 at 03:42:32PM -0700, Darrick J. Wong wrote: > > > > > > > > On Wed, Aug 08, 2018 at 08:29:54AM -0400, Brian Foster wrote: > > > > > > > > > On Tue, Aug 07, 2018 at 04:34:58PM -0700, Darrick J. Wong wrote: > > > > > > > > > > On Fri, Aug 03, 2018 at 06:49:40AM -0400, Brian Foster wrote: > > > > > > > > > > > On Thu, Aug 02, 2018 at 12:22:05PM -0700, Darrick J. Wong wrote: > > > > > > > > > > > > On Thu, Aug 02, 2018 at 09:48:24AM -0400, Brian Foster wrote: > > > > > > > > > > > > > On Wed, Aug 01, 2018 at 11:28:45PM -0700, Darrick J. Wong wrote: > > > > > > > > > > > > > > On Wed, Aug 01, 2018 at 02:39:20PM -0400, Brian Foster wrote: > > > > > > > > > > > > > > > On Wed, Aug 01, 2018 at 09:23:16AM -0700, Darrick J. Wong wrote: > > > > > > > > > > > > > > > > On Wed, Aug 01, 2018 at 07:54:09AM -0400, Brian Foster wrote: > > > > > > > > > > > > > > > > > On Tue, Jul 31, 2018 at 03:01:25PM -0700, Darrick J. Wong wrote: > > > > > > > > > > > > > > > > > > On Tue, Jul 31, 2018 at 01:47:23PM -0400, Brian Foster wrote: > > > > > > > > > > > > > > > > > > > On Sun, Jul 29, 2018 at 10:48:21PM -0700, Darrick J. Wong wrote: > ... > > > > > > > > Hmm, ok, so to summarize, I see five options: > > > > > > > > 1) Pass in a dirfd, repair can internally openat(dirfd, O_TMPFILE...) > > > > however many files it needs. > > > > > > > > 2) Pass in a however many file fds we need and segment the space. > > > > > > > > 3) Pass in a single file fd. > > > > > > > > 4) Let the repair code create as many memfd files as it wants. > > > > > > > > 5) Let the repair code create one memfd file and segment the space. > > > > > > > > I'm pretty sure we don't want to support (2) because that just seems > > > > like a requirements communication nightmare and can burn up a lot of > > > > space in struct xfs_scrub_metadata. > > > > > > > > (3) and (5) are basically the same except for where the file comes from. > > > > For (3) we'd have to make sure the fd filesystem supports large sparse > > > > files (and presumably isn't the xfs we're trying to repair), which > > > > shouldn't be too difficult to probe. For (5) we know that tmpfs already > > > > supports large sparse files. Another difficulty might be that on 32-bit > > > > the page cache only supports offsets as high as (ULONG_MAX * PAGE_SIZE), > > > > though I suppose at this point we only need two files and 8TB should be > > > > enough for anyone. > > > > > > > > (I also think it's reasonable to consider not supporting online repair > > > > on a 32-bit system with a large filesystem...) > > > > > > > > In general, the "pass in a thing from userspace" variants come with the > > > > complication that we have to check the functionality of whatever gets > > > > passed in. On the plus side it likely unlocks access to a lot more > > > > storage than we could get with mem+swap. On the minus side someone > > > > passes in a fd to a drive-managed SMR on USB 2.0, and... > > > > > > > > (1) seems like it would maximize the kernel's flexibility to create as > > > > many (regular, non-sparse) files as it needs, but now we're calling > > > > do_sys_open and managing files ourselves, which might be avoided. > > > > > > > > (4) of course is what we do right now. :) > > > > > > > > Soooo... the simplest userspace interface (I think) is to allow > > > > userspace to pass in a single file fd. Scrub can reject it if it > > > > doesn't measure up (fs is the same, sparse not supported, high offsets > > > > not supported, etc.). If userspace doesn't pass in an fd then we create > > > > a memfd and use that instead. We end up with a hybrid between (3) and (5). > > > > > > > > > > That all sounds about right to me except I was thinking userspace would > > > do the memfd fallback of #5 rather than the kernel, just to keep the > > > policy out of the kernel as much as possible. Is there any major > > > advantage to doing it in the kernel? I guess it would slightly > > > complicate 'xfs_io -c repair' ... > > > > Hm. We'll have to use one of the reserved areas of struct > > xfs_scrub_metadata to pass in the file descriptor. If we create a new > > XFS_SCRUB_IFLAG_FD flag to indicate that we're passing in a file > > descriptor then either we lose compatibility with old kernels (because > > they reject unknown flags) or xfs_scrub will have to try a repair > > without a fd (to see if the kernel even cares) and retry if the repair > > fails with some prearranged error code that means "give me a swapfile, > > please". Alternately we simply require that the fd cannot be fd 0 since > > using stdin for swap space is a stupid idea anyways. > > > > I'm assuming that the kernel would have some basic checks on the fd to > ensure it's usable (seekable, large offsets, etc.), as you mentioned > previously. Of course. :) > With regard to xfs_scrub_metadata, it sounds like we need to deal with > that regardless if we want to support the ability to specify an external > file. Is the issue backwards compatibility with the interface as it > exists today..? Yes, my question is how hard do we try to maintain backwards compatibility with an ioctl that controls an EXPERIMENTAL feature that is disabled by default in Kconfig? :) > > Technically we're not supposed to have flag days, but otoh this is a > > xfs-only ioctl for a feature that's still experimental, so perhaps it's > > not crucial to maintain compatibility with old kernels where the feature > > is incomplete and experimental? > > > > In my mind, I kind of take the experimental status as all bits/interface > may explode and are otherwise subject to change or disappear. Perhaps > others feel differently, it does seem we've kind of hinted towards the > contrary recently with respect to the per-inode dax bits and then now in > this discussion, but IMO that's kind of an inherent risk of doing > incremental work on complex features upstream. > > I dunno, perhaps that's just a misunderstanding on my part. If so, I do > wonder if we should be a bit more cautious (in the future) about > exposing interfaces to experimental features (DEBUG mode only, for > example) for a period of time until the underlying mechanism is fleshed > out enough to establish confidence in the interface. That was my reason for hiding it all behind a 'default N' Kconfig option -- to limit the number of users to those who build their own kernels. > It's one thing if an experimental feature is shiny new and potentially > unstable at the time it is merged, but enough bits are there for > reviewers to understand the design and interface requirements. It's > another thing if the implementation is not yet complete, because then > it's obviously harder to surmise whether the interface is ultimately > sufficient. <nod> I decided that it if we left experimental warnings in dmesg and the xfs_scrub output and forced users to rebuild their kernel to turn on scrub/repair then it was reasonable that we could change the ioctl interface without worrying too much about backwards compatibility. I think it's fine to add a 's32 sm_fd' field that can't be zero and can be picked up by scrub or repair if they want access to more space. > This of course is all higher level discussion from how to handle scrub.. > > > Hmm. We could define the fd field with the requirement that fd > 0, and > > if the repair function requires an fd and one hasn't been provided, it > > can fail out with ENOMEM. If it doesn't need extra memory it can just > > ignore the contents of the fd field. xfs_scrub can then arrange to pass > > in mem fds or file fds or whatever. > > > > Is there a versioning mechanism to the interface? I thought we used that > approach (or planned to..) in other similar internal commands, so a > particular kernel could bump the version and appropriately decide how to > handle older versions. There's plenty of space in the structure that's all required to be zero, so we could easily add a u8 sm_version some day. The IFLAG bit I mentioned would be sufficient for the fd field. --D > > Brian > > > --D > > > > > Brian > > > > > > > --D > > > > > > > > > Brian > > > > > > > > > > > (In theory the xbitmap could also be converted to use the fixed record > > > > > > array, but in practice they haven't (yet) become large enough to warrant > > > > > > it, and there's currently no way to insert or delete records from the > > > > > > middle of the array.) > > > > > > > > > > > > > > > > > I'm not familiar with memfd. The manpage suggests it's ram backed, is it > > > > > > > > > > > swappable or something? > > > > > > > > > > > > > > > > > > > > It's supposed to be. The quick test I ran (allocate a memfd, write 1GB > > > > > > > > > > of junk to it on a VM with 400M of RAM) seemed to push about 980MB into > > > > > > > > > > the swap file. > > > > > > > > > > > > > > > > > > > > > > > > > > > > Ok. > > > > > > > > > > > > > > > > > > > > If so, that sounds a reasonable option provided the swap space > > > > > > > > > > > requirement can be made clear to users > > > > > > > > > > > > > > > > > > > > We can document it. I don't think it's any worse than xfs_repair being > > > > > > > > > > able to use up all the memory + swap... and since we're probably only > > > > > > > > > > going to be repairing one thing at a time, most likely scrub won't need > > > > > > > > > > as much memory. > > > > > > > > > > > > > > > > > > > > > > > > > > > > Right, but as noted below, my concerns with the xfs_repair comparison > > > > > > > > > are that 1.) the kernel generally has more of a limit on anonymous > > > > > > > > > memory allocations than userspace (i.e., not swappable AFAIU?) and 2.) > > > > > > > > > it's not clear how effectively running the system out of memory via the > > > > > > > > > kernel will behave from a failure perspective. > > > > > > > > > > > > > > > > > > IOW, xfs_repair can run the system out of memory but for the most part > > > > > > > > > that ends up being a simple problem for the system: OOM kill the bloated > > > > > > > > > xfs_repair process. For an online repair in a similar situation, I have > > > > > > > > > no idea what's going to happen. > > > > > > > > > > > > > > > > Back in the days of the huge linked lists the oom killer would target > > > > > > > > other proceses because it doesn't know that the online repair thread is > > > > > > > > sitting on a ton of pinned kernel memory... > > > > > > > > > > > > > > > > > > > > > > Makes sense, kind of what I'd expect... > > > > > > > > > > > > > > > > The hope is that the online repair hits -ENOMEM and unwinds, but ISTM > > > > > > > > > we'd still be at risk of other subsystems running into memory > > > > > > > > > allocation problems, filling up swap, the OOM killer going after > > > > > > > > > unrelated processes, etc. What if, for example, the OOM killer starts > > > > > > > > > picking off processes in service to a running online repair that > > > > > > > > > immediately consumes freed up memory until the system is borked? > > > > > > > > > > > > > > > > Yeah. One thing we /could/ do is register an oom notifier that would > > > > > > > > urge any running repair threads to bail out if they can. It seems to me > > > > > > > > that the oom killer blocks on the oom_notify_list chain, so our handler > > > > > > > > could wait until at least one thread exits before returning. > > > > > > > > > > > > > > > > > > > > > > Ok, something like that could be useful. I agree that we probably don't > > > > > > > need to go that far until the mechanism is nailed down and testing shows > > > > > > > that OOM is a problem. > > > > > > > > > > > > It already is a problem on my contrived "2TB hardlink/reflink farm fs" + > > > > > > "400M of RAM and no swap" scenario. Granted, pretty much every other > > > > > > xfs utility also blows out on that so I'm not sure how hard I really > > > > > > need to try... > > > > > > > > > > > > > > > I don't know how likely that is or if it really ends up much different > > > > > > > > > from the analogous xfs_repair situation. My only point right now is > > > > > > > > > that failure scenario is something we should explore for any solution > > > > > > > > > we ultimately consider because it may be an unexpected use case of the > > > > > > > > > underlying mechanism. > > > > > > > > > > > > > > > > Ideally, online repair would always be the victim since we know we have > > > > > > > > a reasonable fallback. At least for memfd, however, I think the only > > > > > > > > clues we have to decide the question "is this memfd getting in the way > > > > > > > > of other threads?" is either seeing ENOMEM, short writes, or getting > > > > > > > > kicked by an oom notification. Maybe that'll be enough? > > > > > > > > > > > > > > > > > > > > > > Hm, yeah. It may be challenging to track memfd usage as such. If > > > > > > > userspace has access to the fd on an OOM notification or whatever, it > > > > > > > might be able to do more accurate analysis based on an fstat() or > > > > > > > something. > > > > > > > > > > > > > > Related question... is the online repair sequence currently > > > > > > > interruptible, if xfs_scrub receives a fatal signal while pulling in > > > > > > > entries during an allocbt scan for example? > > > > > > > > > > > > It's interruptible (fatal signals only) during the scan phase, but once > > > > > > it starts logging metadata updates it will run all the way to > > > > > > completion. > > > > > > > > > > > > > > > (To the contrary, just using a cached file seems a natural fit from > > > > > > > > > that perspective.) > > > > > > > > > > > > > > > > Same here. > > > > > > > > > > > > > > > > > > > and the failure characteristics aren't more severe than for userspace. > > > > > > > > > > > An online repair that puts the broader system at risk of OOM as > > > > > > > > > > > opposed to predictably failing gracefully may not be the most useful > > > > > > > > > > > tool. > > > > > > > > > > > > > > > > > > > > Agreed. One huge downside of memfd seems to be the lack of a mechanism > > > > > > > > > > for the vm to push back on us if we successfully write all we need to > > > > > > > > > > the memfd but then other processes need some memory. Obviously, if the > > > > > > > > > > memfd write itself comes up short or fails then we dump the memfd and > > > > > > > > > > error back to userspace. We might simply have to free array memory > > > > > > > > > > while we iterate the records to minimize the time spent at peak memory > > > > > > > > > > usage. > > > > > > > > > > > > > > > > > > > > > > > > > > > > Hm, yeah. Some kind of fixed/relative size in-core memory pool approach > > > > > > > > > may simplify things because we could allocate it up front and know right > > > > > > > > > away whether we just don't have enough memory available to repair. > > > > > > > > > > > > > > > > Hmm. Apparently we actually /can/ call fallocate on memfd to grab all > > > > > > > > the pages at once, provided we have some guesstimate beforehand of how > > > > > > > > much space we think we'll need. > > > > > > > > > > > > > > > > So long as my earlier statement about the memory requirements being no > > > > > > > > more than the size of the btree leaves is actually true (I haven't > > > > > > > > rigorously tried to prove it), we need about (xrep_calc_ag_resblks() * > > > > > > > > blocksize) worth of space in the memfd file. Maybe we ask for 1.5x > > > > > > > > that and if we don't get it, we kill the memfd and exit. > > > > > > > > > > > > > > > > > > > > > > Indeed. It would be nice if we could do all of the file management bits > > > > > > > in userspace. > > > > > > > > > > > > Agreed, though no file management would be even better. :) > > > > > > > > > > > > --D > > > > > > > > > > > > > Brian > > > > > > > > > > > > > > > --D > > > > > > > > > > > > > > > > > > > > > > > > > > Brian > > > > > > > > > > > > > > > > > > > --D > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Brian > > > > > > > > > > > > > > > > > > > > > > > --D > > > > > > > > > > > > > > > > > > > > > > > > > Brian > > > > > > > > > > > > > > > > > > > > > > > > > > > --D > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Brian > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > --D > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Brian > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > + > > > > > > > > > > > > > > > > > > > > +done: > > > > > > > > > > > > > > > > > > > > + /* Free all the OWN_AG blocks that are not in the rmapbt/agfl. */ > > > > > > > > > > > > > > > > > > > > + xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_AG); > > > > > > > > > > > > > > > > > > > > + return xrep_reap_extents(sc, old_allocbt_blocks, &oinfo, > > > > > > > > > > > > > > > > > > > > + XFS_AG_RESV_NONE); > > > > > > > > > > > > > > > > > > > > +} > > > > > > > > > > > > > > > > > > > > + > > > > > > > > > > > > > > > > > > > ... > > > > > > > > > > > > > > > > > > > > diff --git a/fs/xfs/xfs_extent_busy.c b/fs/xfs/xfs_extent_busy.c > > > > > > > > > > > > > > > > > > > > index 0ed68379e551..82f99633a597 100644 > > > > > > > > > > > > > > > > > > > > --- a/fs/xfs/xfs_extent_busy.c > > > > > > > > > > > > > > > > > > > > +++ b/fs/xfs/xfs_extent_busy.c > > > > > > > > > > > > > > > > > > > > @@ -657,3 +657,17 @@ xfs_extent_busy_ag_cmp( > > > > > > > > > > > > > > > > > > > > diff = b1->bno - b2->bno; > > > > > > > > > > > > > > > > > > > > return diff; > > > > > > > > > > > > > > > > > > > > } > > > > > > > > > > > > > > > > > > > > + > > > > > > > > > > > > > > > > > > > > +/* Are there any busy extents in this AG? */ > > > > > > > > > > > > > > > > > > > > +bool > > > > > > > > > > > > > > > > > > > > +xfs_extent_busy_list_empty( > > > > > > > > > > > > > > > > > > > > + struct xfs_perag *pag) > > > > > > > > > > > > > > > > > > > > +{ > > > > > > > > > > > > > > > > > > > > + spin_lock(&pag->pagb_lock); > > > > > > > > > > > > > > > > > > > > + if (pag->pagb_tree.rb_node) { > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > RB_EMPTY_ROOT()? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Good suggestion, thank you! > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > --D > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Brian > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > + spin_unlock(&pag->pagb_lock); > > > > > > > > > > > > > > > > > > > > + return false; > > > > > > > > > > > > > > > > > > > > + } > > > > > > > > > > > > > > > > > > > > + spin_unlock(&pag->pagb_lock); > > > > > > > > > > > > > > > > > > > > + return true; > > > > > > > > > > > > > > > > > > > > +} > > > > > > > > > > > > > > > > > > > > diff --git a/fs/xfs/xfs_extent_busy.h b/fs/xfs/xfs_extent_busy.h > > > > > > > > > > > > > > > > > > > > index 990ab3891971..2f8c73c712c6 100644 > > > > > > > > > > > > > > > > > > > > --- a/fs/xfs/xfs_extent_busy.h > > > > > > > > > > > > > > > > > > > > +++ b/fs/xfs/xfs_extent_busy.h > > > > > > > > > > > > > > > > > > > > @@ -65,4 +65,6 @@ static inline void xfs_extent_busy_sort(struct list_head *list) > > > > > > > > > > > > > > > > > > > > list_sort(NULL, list, xfs_extent_busy_ag_cmp); > > > > > > > > > > > > > > > > > > > > } > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > +bool xfs_extent_busy_list_empty(struct xfs_perag *pag); > > > > > > > > > > > > > > > > > > > > + > > > > > > > > > > > > > > > > > > > > #endif /* __XFS_EXTENT_BUSY_H__ */ > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > > > > > > > > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > > > > > > > > > > > > > > > > > > > > the body of a message to majordomo@xxxxxxxxxxxxxxx > > > > > > > > > > > > > > > > > > > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > > > > > > > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > > > > > > > > > > > > > > > > > > > the body of a message to majordomo@xxxxxxxxxxxxxxx > > > > > > > > > > > > > > > > > > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > > > > > > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > > > > > > > > > > > > > > > > > > the body of a message to majordomo@xxxxxxxxxxxxxxx > > > > > > > > > > > > > > > > > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > > > > > > > > > > > > > > -- > > > > > > > > > > > > > > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > > > > > > > > > > > > > > > > > the body of a message to majordomo@xxxxxxxxxxxxxxx > > > > > > > > > > > > > > > > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > > > > > > > > > > > > -- > > > > > > > > > > > > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > > > > > > > > > > > > > > > the body of a message to majordomo@xxxxxxxxxxxxxxx > > > > > > > > > > > > > > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > > > > > > > > > > > -- > > > > > > > > > > > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > > > > > > > > > > > > > > the body of a message to majordomo@xxxxxxxxxxxxxxx > > > > > > > > > > > > > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > > > > > > > > > > -- > > > > > > > > > > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > > > > > > > > > > > > > the body of a message to majordomo@xxxxxxxxxxxxxxx > > > > > > > > > > > > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > > > > > > > > > -- > > > > > > > > > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > > > > > > > > > > > > the body of a message to majordomo@xxxxxxxxxxxxxxx > > > > > > > > > > > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > > > > > > > > -- > > > > > > > > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > > > > > > > > > > > the body of a message to majordomo@xxxxxxxxxxxxxxx > > > > > > > > > > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > > > > > > > -- > > > > > > > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > > > > > > > > > > the body of a message to majordomo@xxxxxxxxxxxxxxx > > > > > > > > > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > > > > > > -- > > > > > > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > > > > > > > > > the body of a message to majordomo@xxxxxxxxxxxxxxx > > > > > > > > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > > > > > -- > > > > > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > > > > > > > > the body of a message to majordomo@xxxxxxxxxxxxxxx > > > > > > > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > > > > -- > > > > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > > > > > > > the body of a message to majordomo@xxxxxxxxxxxxxxx > > > > > > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > > > -- > > > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > > > > > > the body of a message to majordomo@xxxxxxxxxxxxxxx > > > > > > More majordomo info at http://vger.kernel.org/majordomo-info.html