RE: XFS Preallocation

> -----Original Message-----
> From: Dave Chinner [mailto:david@xxxxxxxxxxxxx]
> Sent: Tuesday, February 01, 2011 12:04 AM
> To: Peter Vajgel
> Cc: Jef Fox; xfs@xxxxxxxxxxx
> Subject: Re: XFS Preallocation
> 
> On Tue, Feb 01, 2011 at 04:45:09AM +0000, Peter Vajgel wrote:
> > > Preallocation is the only option. Allowing preallocation without
> > > marking extents as unwritten opens a massive security hole (i.e.
> > > exposes stale data) so I say no to any request for addition of such
> > > functionality (and have for years).
> >
> > How about opening this option to at least root (root can already read
> > the device anyway)?.
> 
> # ls -l foo
> -rw-r--r-- 1 dave dave        0 Aug 16 10:44 foo
> #
> # prealloc_without_unwritten 0 1048576 foo
> # ls -l foo
> -rw-r--r-- 1 dave dave  1048576 Aug 16 10:44 foo
> #
> 
> Now user dave can read the stale data exposed by the root-only operation. Any
> combination of making the file available to a non-root user after a
> preallocation-without-unwritten-extents operation has this problem. IOWs, just
> making such a syscall "root only" doesn't solve the security problem.

Correct - if an admin made prealloc_without_unwritten runnable by any user then yes - but I would argue that such an admin should not have root privileges in the first place. VxFS has had this ability since version 1 and I don't remember a single customer complaint about it. Most of the time the feature was used by databases to preallocate large amounts of space, knowing that they wouldn't incur any overhead (even transactional) when doing direct I/O to the preallocated range. It could be that back then even the transactional overhead was significant enough that we wanted to eliminate it.
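
For context, a minimal sketch of the no-stale-data property under discussion, using the standard posix_fallocate interface (not the hypothetical prealloc_without_unwritten above): on XFS the preallocated extents are marked unwritten, so reads return zeros rather than whatever old data occupied those disk blocks. On filesystems without native fallocate support, glibc emulates it by writing zeros, with the same visible result.

```python
import os
import tempfile

# Preallocate 1 MiB and read it back. Because the extents are unwritten
# (or zero-filled by the glibc fallback), the read can never return
# stale disk contents.
size = 1 << 20
fd, path = tempfile.mkstemp()
try:
    os.posix_fallocate(fd, 0, size)       # allocate blocks, extend file size
    assert os.path.getsize(path) == size  # size reflects the allocation
    with open(path, "rb") as f:
        data = f.read()
    assert data == b"\x00" * size         # zeros, never stale data
finally:
    os.close(fd)
    os.unlink(path)
```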

> 
> To fix it, we have to require inodes have 0600 perms, owned by root, and cannot be
> chmod/chowned to anyone else, ever. At that point, we're requiring applications to run
> as root to use this functionality. Same requirement as fiemap + reading from the
> block device, which you can do right now without any kernel mods or filesystem hacks...
> 
> > There are cases when creating large
> > files without writing to them is important. A good example is testing
> > xfs overhead when doing a specific workload (like random
> > reads) to large files.
> 
> For testing it doesn't matter how long it takes you to write the file in the first place.

At the scale we operate it does. We have multiple variables, so the number of combinations is large. We have hit every possible hardware and software problem, and problem resolution can take months if it takes days to reproduce the problem. Hardware vendors (disk, controller, motherboard manufacturers) are much more responsive when you can reproduce a problem on the fly in seconds (especially in comparative benchmarking). The tests usually run only a couple of minutes. With 12x3TB (possibly multiplied by a factor of X with our new platform) it would be unacceptable to wait for the writes to finish.

> 
> > In this case we want to hit the disk on every request. Currently we
> > have a workaround (below) but official support would be preferable.
> 
> Officially, we _removed_ the unwritten=0 option from mkfs because of the security
> problems. Not to mention that it was never, ever tested...
> 
> >
> > --pv
> >
> >
> > # create_xfs_files
> >
> > dev=$1
> > mntpt=$2
> > dircount=$3
> > filecount=$4
> > size=$5
> >
> > # Umount.
> > umount $2
> >
> > # Create the fs.
> > mkfs -t xfs -f -d unwritten=0,su=256k,sw=10 -l su=256k -L "/hay" $dev
> 
> Which fails due to:
> 
> unknown option -d unwritten=0
> /* blocksize */         [-b log=n|size=num]
> /* data subvol */       [-d agcount=n,agsize=n,file,name=xxx,size=num,
>                             (sunit=value,swidth=value|su=num,sw=num),
>                             sectlog=n|sectsize=num .....

It still works for us, but we tend to be conservative about moving to new releases.
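
As a side note on what the "write versionnum 0xa4a4" in the script below actually does: it clears the superblock feature bit that says extents may be marked unwritten (XFS_SB_VERSION_EXTFLGBIT in xfs_format.h). The starting value 0xb4a4 here is an assumption for illustration, not taken from the script.

```python
# Feature bit for unwritten extents in the XFS superblock version field
# (value as defined in the kernel's xfs_format.h).
XFS_SB_VERSION_EXTFLGBIT = 0x1000

orig_versionnum = 0xB4A4  # hypothetical versionnum with unwritten extents on
patched = orig_versionnum & ~XFS_SB_VERSION_EXTFLGBIT
print(hex(patched))  # the value the script writes back
```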

> 
> > # Clear unwritten flag - current xfs ignores this flag
> > typeset -i agcount=$(xfs_db -c "sb" -c "print" $dev | grep agcount)
> > typeset -i i=0
> > while [[ $i != $agcount ]]
> > do
> >   xfs_db -x -c "sb $i" -c "write versionnum 0xa4a4" $dev
> >   i=i+1
> > done
> >
> > # Mount the filesystem.
> > mount -t xfs -o nobarrier,noatime,nodiratime,inode64,allocsize=1g $dev
> > $mntpt
> >
> > i=0
> > while [[ $i != $dircount ]]
> > do
> >   mkdir $mntpt/dir$i
> >   typeset -i j=0
> >   while [[ $j != $filecount ]]
> >   do
> >     file=$mntpt/dir$i/file$j
> >     xfs_io -f -c "resvsp 0 $size" $file
> >     inum=$(ls -i $file | awk '{print $1}')
> >     umount $mntpt
> >     xfs_db -x -c "inode $inum" -c "write core.size $size" $dev
> >     mount -t xfs -o nobarrier,noatime,nodiratime,inode64,allocsize=1g
> > $dev $mntpt
> 
> That's quite a hack to work around the EOF zeroing that extending the file size after
> allocating would do because the preallocated extents beyond EOF are not marked
> unwritten. Perhaps truncating the file first, then preallocating is what you want:
> 
> 	xfs_io -f -c "truncate $size" -c "resvsp 0 $size" $file


I think I had it in reverse before - allocate and then truncate - and the truncate got stuck in a loop (probably zeroing out the extents?), making the node unresponsive to the point that it was impossible to ssh to it. It eventually returned, but it took a while. That was about 3 years ago, though. If I get to it I'll try the other order.
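
A rough syscall-level analogue of the suggested ordering (set the size first, then preallocate), using posix_fallocate as a stand-in for xfs_io's resvsp - an illustrative sketch, not the xfs_io command itself. Because the size is already set before allocation, a later size extension never has to zero anything:

```python
import os
import tempfile

size = 1 << 20
fd, path = tempfile.mkstemp()
try:
    os.ftruncate(fd, size)           # like "truncate $size": set size first
    os.posix_fallocate(fd, 0, size)  # like "resvsp 0 $size": allocate blocks
    st = os.fstat(fd)
    assert st.st_size == size        # no post-allocation size bump needed
finally:
    os.close(fd)
    os.unlink(path)
```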

> 
> >     j=j+1
> >   done
> >   i=i+1
> > done
> 
> Regardless of all this, perhaps the most important point is that your proposed use of
> XFS is fundamentally unsupportable by the linux XFS
> community: you've got proprietary software on some external hardware writing to the
> disk without going through the linux XFS kernel code.
> You're basically in the same boat as people running proprietary kernel modules -
> unless you can prove the problem is not caused by your hw/sw or manual filesystem
> modifications, then it's a waste of our (limited) resources to even look at the problem.
> That generally comes down to being able to reproduce the problem on a vanilla kernel
> on a filesystem created with a supported mkfs....

Understood. That's why I limit this hack to testing only. I would never even dream of putting it into production. Although one could assume that if xfs_check/xfs_repair bless the filesystem before it's mounted you would be safe - but then you might be exposing yourself to bugs in xfs_check/xfs_repair that might have been overlooked, since this is not the usual way of using xfs.

Thank you,

Peter

> 
> Cheers,
> 
> Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs
