On Thu 06-10-16 13:11:42, Dave Chinner wrote:
> On Wed, Oct 05, 2016 at 01:38:45PM +0200, Michal Hocko wrote:
> > On Wed 05-10-16 07:32:02, Dave Chinner wrote:
> > > On Tue, Oct 04, 2016 at 10:12:15AM +0200, Michal Hocko wrote:
> > > > From: Michal Hocko <mhocko@xxxxxxxx>
> > > >
> > > > compaction has been disabled for GFP_NOFS and GFP_NOIO requests since
> > > > the direct compaction was introduced by 56de7263fcf3 ("mm: compaction:
> > > > direct compact when a high-order allocation fails"). The main reason
> > > > is that the migration of page cache pages might recurse back to fs/io
> > > > layer and we could potentially deadlock. This is overly conservative
> > > > because all the anonymous memory is migrateable in the GFP_NOFS context
> > > > just fine. This might be a large portion of the memory in many/most
> > > > workloads.
> > > >
> > > > Remove the GFP_NOFS restriction and make sure that we skip all fs pages
> > > > (those with a mapping) while isolating pages to be migrated. We cannot
> > > > consider even clean fs pages because they might need a metadata update,
> > > > so only isolate pages without any mapping for nofs requests.
> > > >
> > > > The effect of this patch will probably be very limited in many/most
> > > > workloads because higher order GFP_NOFS requests are quite rare,
> > >
> > > You say they are rare only because you don't know how to trigger
> > > them easily. :/
> >
> > true
> >
> > > Try this:
> > >
> > > # mkfs.xfs -f -n size=64k <dev>
> > > # mount <dev> /mnt/scratch
> > > # time ./fs_mark -D 10000 -S0 -n 100000 -s 0 -L 32 \
> > >         -d /mnt/scratch/0 -d /mnt/scratch/1 \
> > >         -d /mnt/scratch/2 -d /mnt/scratch/3 \
> > >         -d /mnt/scratch/4 -d /mnt/scratch/5 \
> > >         -d /mnt/scratch/6 -d /mnt/scratch/7 \
> > >         -d /mnt/scratch/8 -d /mnt/scratch/9 \
> > >         -d /mnt/scratch/10 -d /mnt/scratch/11 \
> > >         -d /mnt/scratch/12 -d /mnt/scratch/13 \
> > >         -d /mnt/scratch/14 -d /mnt/scratch/15
> >
> > Does this simulate a standard or usual fs workload/configuration? I am
>
> Unfortunately, there was an era of cargo cult configuration tweaks
> in the Ceph community that has resulted in a large number of
> production machines with XFS filesystems configured this way. And a
> lot of them store large numbers of small files and run under
> significant sustained memory pressure.

I see

> I'm slowly working towards getting rid of these high order allocations
> and replacing them with the equivalent number of single page
> allocations, but I haven't got that (complex) change working yet.

Definitely a good plan!
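For reference, the core of the change being measured below is (a) no
longer treating a missing __GFP_FS as a reason to return
COMPACT_SKIPPED from direct compaction and (b) refusing to isolate any
page that still has a mapping when the request cannot get into the fs
layer. A minimal sketch of that isolation check - the helper name is
mine, purely for illustration, the patch itself open-codes the test
while isolating pages to be migrated:

	/*
	 * Illustration only: a caller without __GFP_FS must not isolate
	 * pages which belong to a mapping, because migrating them may
	 * require fs locks or metadata updates. Anonymous pages (no
	 * page_mapping()) can still be migrated safely.
	 */
	static bool can_isolate_for_nofs(struct compact_control *cc,
					 struct page *page)
	{
		if (!(cc->gfp_mask & __GFP_FS) && page_mapping(page))
			return false;	/* fs page, skip for nofs requests */
		return true;
	}

so anonymous memory stays compactable for such requests without any
risk of recursing into the filesystem.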
Anyway, I was playing with this in my virtual machine (4 CPUs, 512MB of
RAM split into two NUMA nodes), started on a freshly created fs after
boot with no other load in the guest. The performance numbers should be
taken with a grain of salt, though, because the host has 4 CPUs as well
and wasn't completely idle, but they should be good enough to give us
at least some picture.

This is what fs_mark told me:

Unpatched kernel:
#  Version 3.3, 16 thread(s) starting at Fri Oct  7 09:55:05 2016
#  Sync method: NO SYNC: Test does not issue sync() or fsync() calls.
#  Directories:  Time based hash between directories across 10000 subdirectories with 180 seconds per subdirectory.
#  File names: 40 bytes long, (16 initial bytes of time stamp with 24 random bytes at end of name)
#  Files info: size 0 bytes, written with an IO size of 16384 bytes per write
#  App overhead is time in microseconds spent in the test not doing file writing related system calls.
#  FSUse%        Count         Size    Files/sec     App Overhead
        1      1600000            0       4300.1         20745838
        3      3200000            0       4239.9         23849857
        5      4800000            0       4243.4         25939543
        6      6400000            0       4248.4         19514050
        8      8000000            0       4262.1         20796169
        9      9600000            0       4257.6         21288675
       11     11200000            0       4259.7         19375120
       13     12800000            0       4220.7         22734141
       14     14400000            0       4238.5         31936458
       16     16000000            0       4231.5         23409901
       18     17600000            0       4045.3         23577700
       19     19200000            0       2783.4         58299526
       21     20800000            0       2678.2         40616302
       23     22400000            0       2693.5         83973996

I hit Ctrl+C at this point because it just took too long. For me it was
much more interesting to see this in the log:

[ 2304.372647] XFS: fs_mark(3289) possible memory allocation deadlock size 65624 in kmem_alloc (mode:0x2408240)
[ 2304.443323] XFS: fs_mark(3285) possible memory allocation deadlock size 65728 in kmem_alloc (mode:0x2408240)
[ 4796.772477] XFS: fs_mark(3424) possible memory allocation deadlock size 46936 in kmem_alloc (mode:0x2408240)
[ 4796.775329] XFS: fs_mark(3423) possible memory allocation deadlock size 51416 in kmem_alloc (mode:0x2408240)
[ 4797.388808] XFS: fs_mark(3424) possible memory allocation deadlock size 65728 in kmem_alloc (mode:0x2408240)

It means that the allocation failed and xfs kept looping and retrying
it before it could move on.
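For context, that warning comes from XFS's kmem_alloc() retry loop: the
kmalloc() itself is allowed to fail, but XFS simply keeps retrying it
forever and only complains every 100 attempts. Roughly like this - a
simplified sketch of fs/xfs/kmem.c from around that time, the message
format in the log above differs slightly:

	void *
	kmem_alloc(size_t size, xfs_km_flags_t flags)
	{
		int	retries = 0;
		gfp_t	lflags = kmem_flags_convert(flags);
		void	*ptr;

		do {
			ptr = kmalloc(size, lflags);
			if (ptr || (flags & (KM_MAYFAIL|KM_NOSLEEP)))
				return ptr;
			/* warn every 100 failed attempts, then keep trying */
			if (!(++retries % 100))
				xfs_err(NULL,
		"possible memory allocation deadlock in %s (mode:0x%x)",
						__func__, lflags);
			congestion_wait(BLK_RW_ASYNC, HZ/50);
		} while (1);
	}

So each of those lines means the high order allocation had already
failed on the order of a hundred times while direct compaction wasn't
even attempted for it.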
I hooked into the page allocation tracepoint and collected all GFP_NOFS
allocations - the filter below matches requests which may enter direct
reclaim (0x400000 is __GFP_DIRECT_RECLAIM) but do not allow fs access
(0x80 is __GFP_FS):

echo '!(gfp_flags & 0x80) && (gfp_flags & 0x400000)' > $TRACE_MNT/events/kmem/mm_page_alloc/filter

and the order distribution looks as follows:

5287609 order=0
     37 order=1
1594905 order=2
3048439 order=3
6699207 order=4
  66645 order=5

so there is indeed a _huge_ high order GFP_NOFS pressure.

So I've tried with this patch applied (plus the follow-up fix pointed
out by Vlastimil). I am not very familiar with the benchmark so I am
not going to interpret those numbers, but they seem to be very similar
for a large part of the run:

#  Version 3.3, 16 thread(s) starting at Fri Oct  7 12:58:03 2016
#  Sync method: NO SYNC: Test does not issue sync() or fsync() calls.
#  Directories:  Time based hash between directories across 10000 subdirectories with 180 seconds per subdirectory.
#  File names: 40 bytes long, (16 initial bytes of time stamp with 24 random bytes at end of name)
#  Files info: size 0 bytes, written with an IO size of 16384 bytes per write
#  App overhead is time in microseconds spent in the test not doing file writing related system calls.
#  FSUse%        Count         Size    Files/sec     App Overhead
        1      1600000            0       4289.1         19243934
        3      3200000            0       4241.6         32828865
        5      4800000            0       4248.7         32884693
        6      6400000            0       4314.4         19608921
        8      8000000            0       4269.9         24953292
        9      9600000            0       4270.7         33235572
       11     11200000            0       4346.4         40817101
       13     12800000            0       4285.3         29972397
       14     14400000            0       4297.2         20539765
       16     16000000            0       4219.6         18596767
       18     17600000            0       4273.8         49611187
       19     19200000            0       4300.4         27944451
       21     20800000            0       4270.6         22324585
       22     22400000            0       4317.6         22650382
       24     24000000            0       4065.2         22297964

and, unlike the unpatched kernel, it doesn't seem to drop the Files/sec
numbers starting with Count=19200000. I also see only a single

[ 3063.815003] XFS: fs_mark(3272) possible memory allocation deadlock size 65624 in kmem_alloc (mode:0x2408240)

which happens much later during the load.

So I've checked the tracing output and compared the same time period of
the two tests (5730s - note that I've stopped the second run slightly
later and didn't let either of them finish). This is not precise, but
it should give us at least some idea of how many allocations we managed
to do in the same workload over the same time, so it shouldn't be
completely bogus.

Unpatched kernel
all orders	begin:44.718798 end:5774.618736 allocs:15019288
order > 0	begin:44.718798 end:5773.587195 allocs:10438610

Patched kernel
all orders	begin:64.612804 end:5794.193619 allocs:16110081	[107.2%]
order > 0	begin:64.612804 end:5794.193619 allocs:11741492	[112.5%]

which would suggest that diving into the compaction rather than backing
off and waiting for kcompactd to do the work for us was indeed a better
strategy and helped the throughput. All that with a rather limited
amount of anonymous memory, because there doesn't seem to be any large
source of such pages in this workload - quite the contrary AFAIU.

I haven't checked the latencies of those allocations because we
currently do not have a convenient way to measure them. I could measure
compaction latencies, but that wouldn't be the full picture, and the
above already seems to tell us that not restricting the compaction is a
good thing to do.

So unless I have grossly misinterpreted something here, I can add this
to the changelog to show the effect of the patch on a really heavy NOFS
high-order workload.
--
Michal Hocko
SUSE Labs