---- 在 星期五, 2020-10-16 12:39:10 Amir Goldstein <amir73il@xxxxxxxxx> 撰写 ---- > On Fri, Oct 16, 2020 at 4:56 AM Chengguang Xu <cgxu519@xxxxxxxxxxxx> wrote: > > > > ---- 在 星期五, 2020-10-16 00:02:42 Amir Goldstein <amir73il@xxxxxxxxx> 撰写 ---- > > > > > > > When an inode is writably mapped via ovarlayfs, you can flag the > > > > > > > overlay inode with "maybe-writably-mapped" and then remove > > > > > > > it from the maybe dirty list when the underlying inode is not dirty > > > > > > > AND its i_writecount is 0 (checked on write_inode() and release()). > > > > > > > > > > > > > > Actually, there is no reason to treat writably mapped inodes and > > > > > > > other dirty inodes differently - insert to suspect list on open for > > > > > > > write, remove from suspect list on last release() or write_inode() > > > > > > > when inode is no longer dirty and writable. > > > > > > > > I have to say inserting to suspect list cannot prevent dropping, > > > > thinking of the problem of previous approach that we write dirty upper > > > > inode with current->flags & PF_MEMALLOC while evicting clean overlay inode. > > > > > > > > > > Sorry, I don't understand what that means. > > > > This is the main problem of my previous patch set V10, evicting clean inode > > expects no write behavior but in the case of dirty upper inode we have to > > write out dirty data in this timing otherwise we will lose the connection with upper inode. > > > > My thinking was that the suspect list holds a reference to the overlay inode. > The question is can we always safely get rid of that reference and remove > from the suspect list when the inode is no longer "writable". Let's see... > > > > > > > > > > > > > > > > > > > > > > > > Did I miss anything? > > > > > > > > > > > > > > > > > > > If we dirty overlay inode that means we have launched writeback mechanism, > > > > > > so in this case, re-dirty overlay inode in time becomes important. > > > > > > > > > > > > > > > > My idea was to use the first call to ovl_sync_fs() with 'wait' false > > > > > to iterate the > > > > > maybe-dirty list and re-dirty overlay inodes whose upper is dirty. > > > > > > > > > > > > > I'm curious how we prevent dropping of clean overlay inode with dirty upper? > > > > Hold another reference during iput_final operation? in the drop_inode() or something > > > > else? > > > > > > No, just return 0 from ovl_drop_inode() and iput_final() will not evict(). > > > > It's not good, it only temporarily skips eviction, the inode in lru list > > will be evicted in some cases like drop cache or memory reclaim. etc. > > > > A solution for this is getting another reference in ->drop_inode so that > > the inode can escape from adding to lru list but this looks awkward and tricky. > > > > Right, that was nonsense. We need to rely on the reference held by the > suspect list. > > > > > > > > > > > > > > > > > Then in the second call to __sync_filesystem, sync_inodes_sb() will take > > > > > care of calling ovl_write_inode() for all the re-dirty inodes. > > > > > > > > > > In current code we sync ALL dirty upper fs inodes and we do not overlay > > > > > inodes with no reference in cache. > > > > > > > > > > The best code would sync only upper fs inodes dirtied by this overlay > > > > > and will be able to evict overlay inodes whose upper inodes are clean. > > > > > > > > > > The compromise code would sync only upper fs inodes dirtied by this overlay, > > > > > and would not evict overlay inodes as long as upper inodes are "open for write". > > > > > I think its a fine compromise considering the alternatives. > > > > > > > > > > Is this workable? > > > > > > > > > > > > > In your approach, the key point is how to prevent dropping overlay inode that has > > > > dirty upper and no reference but I don't understand well how to achieve it from > > > > your descriptions. > > > > > > > > > > > > > > Very well, I will try to explain with code: > > > > > > int ovl_inode_is_open_for_write(struct inode *inode) > > > { > > > struct inode *upper_inode = ovl_inode_upper(inode); > > > > > > return upper_inode && inode_is_open_for_write(upper_inode); > > > } > > > > > > void ovl_maybe_mark_inode_dirty(struct inode *inode) > > > { > > > struct inode *upper_inode = ovl_inode_upper(inode); > > > > > > if (upper_inode && upper_inode->i_state & I_DIRTY_ALL) > > > mark_inode_dirty(inode); > > > } > > > > > > int ovl_open(struct inode *inode, struct file *file) > > > { > > > ... > > > if (ovl_inode_is_open_for_write(file_inode(file))) > > > ovl_add_inode_to_suspect_list(inode); > > > > > > file->private_data = realfile; > > > > > > return 0; > > > } > > > > > > int ovl_release(struct inode *inode, struct file *file) > > > { > > > struct inode *inode = file_inode(file); > > > > > > if (ovl_inode_is_open_for_write(inode)) { > > > ovl_maybe_mark_inode_dirty(inode); > > > ovl_remove_inode_from_suspect_list(inode); > > > > I think in some cases removing from suspect_list will cause losing > > the connection with upper inode that has writable mmap. > > > > First of all I had a bug here. > Need to check for !ovl_inode_is_open_for_write(inode) after fput(). > > If the upper inode has a writable mmap, the upper inode would still > be "writable" (i_writecount held by the map realfile reference). > > So when closing the last overlay file reference while upper inode > writable maps still exist, the remaining issue is when to remove > the overlay inode from the suspect list and allow its eviction and > I did not mention that. > > I *think* that this condition should be safe in the regard that > after the condition is met, there is no way to dirty the upper inode > via overlayfs without going through ovl_open(). > Obviously, the test should be done with some list lock held. > > bool ovl_may_remove_from_suspect_list(struct inode *inode) > { > struct inode *upper_inode = ovl_inode_upper(inode); > > if (upper_inode && upper_inode->i_state & I_DIRTY_ALL) > return false; > > return !inode_is_open_for_write(upper_inode); > } > > Now remains the question of WHEN to check for removal > from the suspect list. > > The first place is in ovl_sync_fs() when iterating the suspect list, > inodes that meet the above criteria are "indefinitely clean" and > may be removed from the list at that timing. > > For eviction during memory pressure, overlay can register a > shrinker to do this garbage collection. Is shrinker being called > on drop_caches? I'm not sure. But we can also do periodic garbage > collection. > > In the end, it is not the common case and we just need the garbage > collection to mitigate DoS (or existing workload) that does many: > - open > - mmap(...PROT_WRITE, MAP_SHARED...) > - close > - munmap > That right, as you mentioned above releasing timing is important and that also means in worst case, overlayfs hold too much memory resource until umount. Maybe more proactive shrinker like dedicated workqueue thread of overlayfs instead of global shrinker will be more helpful. > > > > > > } > > > > > > fput(file->private_data); > > > > > > return 0; > > > } > > > > > > int ovl_drop_inode(struct inode *inode) > > > { > > > struct inode *upper_inode = ovl_inode_upper(inode); > > > > > > if (!upper_inode) > > > return 1; > > > if (upper_inode->i_state & I_DIRTY_ALL) > > > return 0; > > > > > > return !inode_is_open_for_write(upper_inode); > > > > Is this condition just for writable mmap? > > > > No, it's for all inodes that may be written from overlayfs (or > from maps created by overalyfs), but as you wrote, this > test is not needed in drop_inode(). > > > > > > } > > > > > > static int ovl_sync_fs(struct super_block *sb, int wait) > > > { > > > struct ovl_fs *ofs = sb->s_fs_info; > > > struct super_block *upper_sb; > > > int ret; > > > > > > if (!ovl_upper_mnt(ofs)) > > > return 0; > > > > > > /* > > > * Not called for sync(2) call or an emergency sync (SB_I_SKIP_SYNC). > > > * All the super blocks will be iterated, including upper_sb. > > > * > > > * If this is a syncfs(2) call, then we do need to call > > > * sync_filesystem() on upper_sb, but enough if we do it when being > > > * called with wait == 1. > > > */ > > > if (!wait) { > > > /* mark inodes on the suspect list dirty if thier > > > upper inode is dirty */ > > > ovl_mark_suspect_list_inodes_dirty(); > > > return 0; > > > } > > > ... > > > > > > > Why 2 rounds? it seems the simplest way is only syncing dirty upper inode > > on wait==1 just like my previous patch. > > > > We don't have to do it in 2 rounds, but as long as we have a suspect list, > we can use it to start writeback on wait==0 on all dirty upper inodes from > our list just like the caller intended. > > Do we need overlay sbi and mark_inode_dirty() and ovl_write_inode() at all? > I'm not sure. It feels like a good idea to use generic infrastructure as much as > possible. You should know better than me to answer this question. > IMO, if we maintain a special list like suspect-list to organize target overlay inodes, then we don't have to mark overlay inode dirty, marking overlay inode dirty will bring unnecessary complexity in this case and I think there is no special benefit. So this approach may achieve containerized syncfs for overlayfs and I've submitted patch set V12 follow this thinking but I think the complexity is still big obstacle to merge to upstream from feedback of Miklos. The new approach I posted the other day is really simple to understand compare to the previous solution, if we can optimize dirtiness notification part a bit more then I think it will be more acceptable. In the end, I need more feedback and If you guys all think the previous solution is the right way, then I can do more work based on our discussion above and post V13 later. Thanks, Chengguang