On Wed 16-05-18 12:12:00, Andreas Dilger wrote:
> On May 15, 2018, at 8:58 PM, Theodore Y. Ts'o <tytso@xxxxxxx> wrote:
> >
> > On Tue, May 15, 2018 at 10:06:23PM +0900, Wang Shilong wrote:
> >> From: Wang Shilong <wshilong@xxxxxxx>
> >>
> >> During our benchmarking, we found sometimes writing
> >> performances are not stable enough and there are some
> >> small read during write which could drop throughput(~30%).
> >
> > Out of curiosity, what sort of benchmarks are you doing?
>
> This is related to Lustre, though I can't comment on the specific
> benchmarks. We've had a variety of reports about this issue. The
> current workaround (hack) is "dumpe2fs /dev/XXX > /dev/null" at
> mount time and periodically at runtime via cron to keep the bitmaps
> loaded, but we wanted to get a better solution in place that works
> for everyone. Being able to load all of the bitmaps at mount time
> also helps get the server performance "up to speed" quickly, rather
> than on-demand loading of a few hundred MB of bitmaps over time.

So what I don't like about the explicit pinning is that only a few admins
will be able to get it right.

> >> It turned out that block bitmaps loading could make
> >> some latency here,also for a heavy fragmented filesystem,
> >> we might need load many bitmaps to find some free blocks.
> >>
> >> To improve above situation, we had a patch to load block
> >> bitmaps to memory and pin those bitmaps memory until umount
> >> or we release the memory on purpose, this could stable write
> >> performances and improve performances of a heavy fragmented
> >> filesystem.
> >
> > This is true, but I wonder how realistic this is on real production
> > systems. For a 1 TiB file system, pinning all of the block bitmaps
> > will require 32 megabytes of memory. Is that really realistic for
> > your use case?
>
> In the case of Lustre servers, they typically have 128GB of RAM or
> more, and are serving a few hundred TB of storage each these days,
> but do not have any other local users, so the ~6GB RAM usage isn't a
> huge problem if it improves the performance/consistency. The real
> issue is that the bitmaps do not get referenced often enough compared
> to the 10GB/s of data flowing through the server, so they are pushed
> out of memory too quickly.

OK, and that 10GB/s is mostly use-once data?

So I could imagine we cache free block information in a more efficient
format (something like you or Ted describe), provide a proper shrinker for
it (possibly biasing it to make reclaim less likely), and enable it
always... That way users don't have to configure it and we don't have to be
afraid of eating too much memory at the expense of something else.

								Honza
-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR
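
The memory figures quoted in this thread (32 megabytes for a 1 TiB file system, ~6GB for a few hundred TB) can be checked with a short calculation. The sketch below is only an estimate under assumptions the thread does not state: 4 KiB blocks, no bigalloc, and one block bitmap block per block group with one bit per block. The function name is hypothetical, and 200 TiB is picked as a stand-in for "a few hundred TB".

```python
# Rough estimate of the memory needed to keep every ext4 block bitmap resident.
# Assumptions (not from the thread): 4 KiB block size, no bigalloc,
# one bitmap block per block group, one bit per filesystem block.

KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB
TIB = 1024 * GIB


def block_bitmap_bytes(fs_bytes, block_size=4 * KIB):
    """Return the total size in bytes of all block bitmaps (hypothetical helper)."""
    blocks_per_group = block_size * 8             # bits in one bitmap block
    group_bytes = blocks_per_group * block_size   # 128 MiB with 4 KiB blocks
    groups = -(-fs_bytes // group_bytes)          # ceiling division
    return groups * block_size                    # one bitmap block per group


print(block_bitmap_bytes(1 * TIB) / MIB, "MiB of bitmaps for 1 TiB")
print(block_bitmap_bytes(200 * TIB) / GIB, "GiB of bitmaps for 200 TiB")
```

Under these assumptions the result scales linearly at 32 MiB of bitmaps per TiB of storage, which agrees with both numbers given above.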