On Fri, Aug 10, 2018 at 08:39:44AM -0700, Darrick J. Wong wrote: > On Fri, Aug 10, 2018 at 06:33:52AM -0400, Brian Foster wrote: > > On Thu, Aug 09, 2018 at 08:59:59AM -0700, Darrick J. Wong wrote: > > > On Thu, Aug 09, 2018 at 08:00:28AM -0400, Brian Foster wrote: > > > > On Wed, Aug 08, 2018 at 03:42:32PM -0700, Darrick J. Wong wrote: > > > > > On Wed, Aug 08, 2018 at 08:29:54AM -0400, Brian Foster wrote: > > > > > > On Tue, Aug 07, 2018 at 04:34:58PM -0700, Darrick J. Wong wrote: > > > > > > > On Fri, Aug 03, 2018 at 06:49:40AM -0400, Brian Foster wrote: > > > > > > > > On Thu, Aug 02, 2018 at 12:22:05PM -0700, Darrick J. Wong wrote: > > > > > > > > > On Thu, Aug 02, 2018 at 09:48:24AM -0400, Brian Foster wrote: > > > > > > > > > > On Wed, Aug 01, 2018 at 11:28:45PM -0700, Darrick J. Wong wrote: > > > > > > > > > > > On Wed, Aug 01, 2018 at 02:39:20PM -0400, Brian Foster wrote: > > > > > > > > > > > > On Wed, Aug 01, 2018 at 09:23:16AM -0700, Darrick J. Wong wrote: > > > > > > > > > > > > > On Wed, Aug 01, 2018 at 07:54:09AM -0400, Brian Foster wrote: > > > > > > > > > > > > > > On Tue, Jul 31, 2018 at 03:01:25PM -0700, Darrick J. Wong wrote: > > > > > > > > > > > > > > > On Tue, Jul 31, 2018 at 01:47:23PM -0400, Brian Foster wrote: > > > > > > > > > > > > > > > > On Sun, Jul 29, 2018 at 10:48:21PM -0700, Darrick J. Wong wrote: > > ... > > > > > > What _seems_ beneficial about that approach is we get (potentially > > > > > > external) persistent backing and memory reclaim ability with the > > > > > > traditional memory allocation model. > > > > > > > > > > > > ISTM that if we used a regular file, we'd need to deal with the > > > > > > traditional file interface somehow or another (file read/pagecache > > > > > > lookup -> record ??). > > > > > > > > > > Yes, that's all neatly wrapped up in kernel_read() and kernel_write() so > > > > > all we need is a (struct file *). > > > > > > > > > > > We could repurpose some existing mechanism like the directory code or > > > > > > quota inode mechanism to use xfs buffers for that purpose, but I think > > > > > > that would require us to always use an internal inode. Allowing > > > > > > userspace to pass an fd/file passes that consideration on to the user, > > > > > > which might be more flexible. We could always warn about additional > > > > > > limitations if that fd happens to be based on the target fs. > > > > > > > > > > <nod> A second advantage of the struct file/kernel_{read,write} approach > > > > > is that we if we ever decide to let userspace pass in a fd, it's trivial > > > > > to feed that struct file to the kernel io routines instead of a memfd > > > > > one. > > > > > > > > > > > > > Yeah, I like this flexibility. In fact, I'm wondering why we wouldn't do > > > > something like this anyways. Could/should xfs_scrub be responsible for > > > > allocating a memfd and passing along the fd? Another advantage of doing > > > > that is whatever logic we may need to clean up old repair files or > > > > whatever is pushed to userspace. > > > > > > There are two ways we could do this -- one is to have the kernel manage > > > the memfd creation internally (like my patches do now); the other is for > > > xfs_scrub to pass in creat(O_TMPFILE). > > > > > > When repair fputs the file (or fdputs the fd if we switch to using > > > that), the kernel will perform the usual deletion of the zero-linkcount > > > zero-refcount file. We get all the "cleanup" for free by closing the > > > file. > > > > > > > Ok. FWIW, the latter approach where xfs_scrub creates a file and passes > > the fd along to the kernel seems preferable to me, but perhaps others > > have different opinions. We could accept a pathname from the user to > > create the file or otherwise attempt to allocate an memfd by default and > > pass that along. > > > > > One other potential complication is that a couple of the repair > > > functions need two memfds. The extended attribute repair creates a > > > fixed-record array for attr keys and an xblob to hold names and values; > > > each structure gets its own memfd. The refcount repair creates two > > > fixed-record arrays, one for refcount records and another to act as a > > > stack of rmaps to compute reference counts. > > > > > > > Hmm, I guess there's nothing stopping scrub from passing in two fds. > > Maybe it would make more sense for the userspace option to be a path > > basename or directory where scrub is allowed to create whatever scratch > > files it needs. > > > > That aside, is there any reason the repair mechanism couldn't emulate > > multiple files with a single fd via a magic offset delimeter or > > something? E.g., "file 1" starts at offset 0, "file 2" starts at offset > > 1TB, etc. (1TB is probably overkill, but you get the idea..). > > Hmm, ok, so to summarize, I see five options: > > 1) Pass in a dirfd, repair can internally openat(dirfd, O_TMPFILE...) > however many files it needs. > > 2) Pass in a however many file fds we need and segment the space. > > 3) Pass in a single file fd. > > 4) Let the repair code create as many memfd files as it wants. > > 5) Let the repair code create one memfd file and segment the space. > > I'm pretty sure we don't want to support (2) because that just seems > like a requirements communication nightmare and can burn up a lot of > space in struct xfs_scrub_metadata. > > (3) and (5) are basically the same except for where the file comes from. > For (3) we'd have to make sure the fd filesystem supports large sparse > files (and presumably isn't the xfs we're trying to repair), which > shouldn't be too difficult to probe. For (5) we know that tmpfs already > supports large sparse files. Another difficulty might be that on 32-bit > the page cache only supports offsets as high as (ULONG_MAX * PAGE_SIZE), > though I suppose at this point we only need two files and 8TB should be > enough for anyone. > > (I also think it's reasonable to consider not supporting online repair > on a 32-bit system with a large filesystem...) > > In general, the "pass in a thing from userspace" variants come with the > complication that we have to check the functionality of whatever gets > passed in. On the plus side it likely unlocks access to a lot more > storage than we could get with mem+swap. On the minus side someone > passes in a fd to a drive-managed SMR on USB 2.0, and... > > (1) seems like it would maximize the kernel's flexibility to create as > many (regular, non-sparse) files as it needs, but now we're calling > do_sys_open and managing files ourselves, which might be avoided. > > (4) of course is what we do right now. :) > > Soooo... the simplest userspace interface (I think) is to allow > userspace to pass in a single file fd. Scrub can reject it if it > doesn't measure up (fs is the same, sparse not supported, high offsets > not supported, etc.). If userspace doesn't pass in an fd then we create > a memfd and use that instead. We end up with a hybrid between (3) and (5). > That all sounds about right to me except I was thinking userspace would do the memfd fallback of #5 rather than the kernel, just to keep the policy out of the kernel as much as possible. Is there any major advantage to doing it in the kernel? I guess it would slightly complicate 'xfs_io -c repair' ... Brian > --D > > > Brian > > > > > (In theory the xbitmap could also be converted to use the fixed record > > > array, but in practice they haven't (yet) become large enough to warrant > > > it, and there's currently no way to insert or delete records from the > > > middle of the array.) > > > > > > > > > > > I'm not familiar with memfd. The manpage suggests it's ram backed, is it > > > > > > > > swappable or something? > > > > > > > > > > > > > > It's supposed to be. The quick test I ran (allocate a memfd, write 1GB > > > > > > > of junk to it on a VM with 400M of RAM) seemed to push about 980MB into > > > > > > > the swap file. > > > > > > > > > > > > > > > > > > > Ok. > > > > > > > > > > > > > > If so, that sounds a reasonable option provided the swap space > > > > > > > > requirement can be made clear to users > > > > > > > > > > > > > > We can document it. I don't think it's any worse than xfs_repair being > > > > > > > able to use up all the memory + swap... and since we're probably only > > > > > > > going to be repairing one thing at a time, most likely scrub won't need > > > > > > > as much memory. > > > > > > > > > > > > > > > > > > > Right, but as noted below, my concerns with the xfs_repair comparison > > > > > > are that 1.) the kernel generally has more of a limit on anonymous > > > > > > memory allocations than userspace (i.e., not swappable AFAIU?) and 2.) > > > > > > it's not clear how effectively running the system out of memory via the > > > > > > kernel will behave from a failure perspective. > > > > > > > > > > > > IOW, xfs_repair can run the system out of memory but for the most part > > > > > > that ends up being a simple problem for the system: OOM kill the bloated > > > > > > xfs_repair process. For an online repair in a similar situation, I have > > > > > > no idea what's going to happen. > > > > > > > > > > Back in the days of the huge linked lists the oom killer would target > > > > > other proceses because it doesn't know that the online repair thread is > > > > > sitting on a ton of pinned kernel memory... > > > > > > > > > > > > > Makes sense, kind of what I'd expect... > > > > > > > > > > The hope is that the online repair hits -ENOMEM and unwinds, but ISTM > > > > > > we'd still be at risk of other subsystems running into memory > > > > > > allocation problems, filling up swap, the OOM killer going after > > > > > > unrelated processes, etc. What if, for example, the OOM killer starts > > > > > > picking off processes in service to a running online repair that > > > > > > immediately consumes freed up memory until the system is borked? > > > > > > > > > > Yeah. One thing we /could/ do is register an oom notifier that would > > > > > urge any running repair threads to bail out if they can. It seems to me > > > > > that the oom killer blocks on the oom_notify_list chain, so our handler > > > > > could wait until at least one thread exits before returning. > > > > > > > > > > > > > Ok, something like that could be useful. I agree that we probably don't > > > > need to go that far until the mechanism is nailed down and testing shows > > > > that OOM is a problem. > > > > > > It already is a problem on my contrived "2TB hardlink/reflink farm fs" + > > > "400M of RAM and no swap" scenario. Granted, pretty much every other > > > xfs utility also blows out on that so I'm not sure how hard I really > > > need to try... > > > > > > > > > I don't know how likely that is or if it really ends up much different > > > > > > from the analogous xfs_repair situation. My only point right now is > > > > > > that failure scenario is something we should explore for any solution > > > > > > we ultimately consider because it may be an unexpected use case of the > > > > > > underlying mechanism. > > > > > > > > > > Ideally, online repair would always be the victim since we know we have > > > > > a reasonable fallback. At least for memfd, however, I think the only > > > > > clues we have to decide the question "is this memfd getting in the way > > > > > of other threads?" is either seeing ENOMEM, short writes, or getting > > > > > kicked by an oom notification. Maybe that'll be enough? > > > > > > > > > > > > > Hm, yeah. It may be challenging to track memfd usage as such. If > > > > userspace has access to the fd on an OOM notification or whatever, it > > > > might be able to do more accurate analysis based on an fstat() or > > > > something. > > > > > > > > Related question... is the online repair sequence currently > > > > interruptible, if xfs_scrub receives a fatal signal while pulling in > > > > entries during an allocbt scan for example? > > > > > > It's interruptible (fatal signals only) during the scan phase, but once > > > it starts logging metadata updates it will run all the way to > > > completion. > > > > > > > > > (To the contrary, just using a cached file seems a natural fit from > > > > > > that perspective.) > > > > > > > > > > Same here. > > > > > > > > > > > > > and the failure characteristics aren't more severe than for userspace. > > > > > > > > An online repair that puts the broader system at risk of OOM as > > > > > > > > opposed to predictably failing gracefully may not be the most useful > > > > > > > > tool. > > > > > > > > > > > > > > Agreed. One huge downside of memfd seems to be the lack of a mechanism > > > > > > > for the vm to push back on us if we successfully write all we need to > > > > > > > the memfd but then other processes need some memory. Obviously, if the > > > > > > > memfd write itself comes up short or fails then we dump the memfd and > > > > > > > error back to userspace. We might simply have to free array memory > > > > > > > while we iterate the records to minimize the time spent at peak memory > > > > > > > usage. > > > > > > > > > > > > > > > > > > > Hm, yeah. Some kind of fixed/relative size in-core memory pool approach > > > > > > may simplify things because we could allocate it up front and know right > > > > > > away whether we just don't have enough memory available to repair. > > > > > > > > > > Hmm. Apparently we actually /can/ call fallocate on memfd to grab all > > > > > the pages at once, provided we have some guesstimate beforehand of how > > > > > much space we think we'll need. > > > > > > > > > > So long as my earlier statement about the memory requirements being no > > > > > more than the size of the btree leaves is actually true (I haven't > > > > > rigorously tried to prove it), we need about (xrep_calc_ag_resblks() * > > > > > blocksize) worth of space in the memfd file. Maybe we ask for 1.5x > > > > > that and if we don't get it, we kill the memfd and exit. > > > > > > > > > > > > > Indeed. It would be nice if we could do all of the file management bits > > > > in userspace. > > > > > > Agreed, though no file management would be even better. :) > > > > > > --D > > > > > > > Brian > > > > > > > > > --D > > > > > > > > > > > > > > > > > Brian > > > > > > > > > > > > > --D > > > > > > > > > > > > > > > > > > > > > > > Brian > > > > > > > > > > > > > > > > > --D > > > > > > > > > > > > > > > > > > > Brian > > > > > > > > > > > > > > > > > > > > > --D > > > > > > > > > > > > > > > > > > > > > > > Brian > > > > > > > > > > > > > > > > > > > > > > > > > --D > > > > > > > > > > > > > > > > > > > > > > > > > > > Brian > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > + > > > > > > > > > > > > > > > > > +done: > > > > > > > > > > > > > > > > > + /* Free all the OWN_AG blocks that are not in the rmapbt/agfl. */ > > > > > > > > > > > > > > > > > + xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_AG); > > > > > > > > > > > > > > > > > + return xrep_reap_extents(sc, old_allocbt_blocks, &oinfo, > > > > > > > > > > > > > > > > > + XFS_AG_RESV_NONE); > > > > > > > > > > > > > > > > > +} > > > > > > > > > > > > > > > > > + > > > > > > > > > > > > > > > > ... > > > > > > > > > > > > > > > > > diff --git a/fs/xfs/xfs_extent_busy.c b/fs/xfs/xfs_extent_busy.c > > > > > > > > > > > > > > > > > index 0ed68379e551..82f99633a597 100644 > > > > > > > > > > > > > > > > > --- a/fs/xfs/xfs_extent_busy.c > > > > > > > > > > > > > > > > > +++ b/fs/xfs/xfs_extent_busy.c > > > > > > > > > > > > > > > > > @@ -657,3 +657,17 @@ xfs_extent_busy_ag_cmp( > > > > > > > > > > > > > > > > > diff = b1->bno - b2->bno; > > > > > > > > > > > > > > > > > return diff; > > > > > > > > > > > > > > > > > } > > > > > > > > > > > > > > > > > + > > > > > > > > > > > > > > > > > +/* Are there any busy extents in this AG? */ > > > > > > > > > > > > > > > > > +bool > > > > > > > > > > > > > > > > > +xfs_extent_busy_list_empty( > > > > > > > > > > > > > > > > > + struct xfs_perag *pag) > > > > > > > > > > > > > > > > > +{ > > > > > > > > > > > > > > > > > + spin_lock(&pag->pagb_lock); > > > > > > > > > > > > > > > > > + if (pag->pagb_tree.rb_node) { > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > RB_EMPTY_ROOT()? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Good suggestion, thank you! > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > --D > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Brian > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > + spin_unlock(&pag->pagb_lock); > > > > > > > > > > > > > > > > > + return false; > > > > > > > > > > > > > > > > > + } > > > > > > > > > > > > > > > > > + spin_unlock(&pag->pagb_lock); > > > > > > > > > > > > > > > > > + return true; > > > > > > > > > > > > > > > > > +} > > > > > > > > > > > > > > > > > diff --git a/fs/xfs/xfs_extent_busy.h b/fs/xfs/xfs_extent_busy.h > > > > > > > > > > > > > > > > > index 990ab3891971..2f8c73c712c6 100644 > > > > > > > > > > > > > > > > > --- a/fs/xfs/xfs_extent_busy.h > > > > > > > > > > > > > > > > > +++ b/fs/xfs/xfs_extent_busy.h > > > > > > > > > > > > > > > > > @@ -65,4 +65,6 @@ static inline void xfs_extent_busy_sort(struct list_head *list) > > > > > > > > > > > > > > > > > list_sort(NULL, list, xfs_extent_busy_ag_cmp); > > > > > > > > > > > > > > > > > } > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > +bool xfs_extent_busy_list_empty(struct xfs_perag *pag); > > > > > > > > > > > > > > > > > + > > > > > > > > > > > > > > > > > #endif /* __XFS_EXTENT_BUSY_H__ */ > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > > > > > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > > > > > > > > > > > > > > > > > the body of a message to majordomo@xxxxxxxxxxxxxxx > > > > > > > > > > > > > > > > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > > > > > > > > > > > > > -- > > > > > > > > > > > > > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > > > > > > > > > > > > > > > > the body of a message to majordomo@xxxxxxxxxxxxxxx > > > > > > > > > > > > > > > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > > > > > > > > > > > > -- > > > > > > > > > > > > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > > > > > > > > > > > > > > > the body of a message to majordomo@xxxxxxxxxxxxxxx > > > > > > > > > > > > > > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > > > > > > > > > > > -- > > > > > > > > > > > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > > > > > > > > > > > > > > the body of a message to majordomo@xxxxxxxxxxxxxxx > > > > > > > > > > > > > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > > > > > > > > > -- > > > > > > > > > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > > > > > > > > > > > > the body of a message to majordomo@xxxxxxxxxxxxxxx > > > > > > > > > > > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > > > > > > > > -- > > > > > > > > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > > > > > > > > > > > the body of a message to majordomo@xxxxxxxxxxxxxxx > > > > > > > > > > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > > > > > > > -- > > > > > > > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > > > > > > > > > > the body of a message to majordomo@xxxxxxxxxxxxxxx > > > > > > > > > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > > > > > > -- > > > > > > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > > > > > > > > > the body of a message to majordomo@xxxxxxxxxxxxxxx > > > > > > > > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > > > > > -- > > > > > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > > > > > > > > the body of a message to majordomo@xxxxxxxxxxxxxxx > > > > > > > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > > > > -- > > > > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > > > > > > > the body of a message to majordomo@xxxxxxxxxxxxxxx > > > > > > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > > > -- > > > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > > > > > > the body of a message to majordomo@xxxxxxxxxxxxxxx > > > > > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > > -- > > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > > > > > the body of a message to majordomo@xxxxxxxxxxxxxxx > > > > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > -- > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > > > > the body of a message to majordomo@xxxxxxxxxxxxxxx > > > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > -- > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > > > the body of a message to majordomo@xxxxxxxxxxxxxxx > > > More majordomo info at http://vger.kernel.org/majordomo-info.html