Hi,

On Wed, 2010-03-17 at 10:47 -0400, Jeff Sturm wrote:
> We are using GFS to store session files for our web application. I've
> spent some time exploring GFS performance and tuning the software for
> optimal latency on system calls (we control the software, and the core
> libraries are written in C). So I've been following related
> discussions of e.g. stat() performance with a great deal of interest.
>
> I hit a wall reducing latency of new file creation. Average create
> times are around 10ms and fluctuate from about 1ms up to 100ms or so.
> Here's an example:
>
> open("/tb2/session/localhost/1800/ac18c/379/905bbc40.ts", O_WRONLY|
> O_CREAT|O_EXCL, 0660) = 4 <0.015415>
>
> The parent directory of this file (379) was created on this node. Our
> session storage ensures that no two nodes will attempt to create files
> in the same directory. I'm also limiting the number of directories we
> have to create so there is about a 50:1 ratio of files to directories
> (mkdir performance on GFS is generally awful).

mkdir and open(O_CREAT) are pretty similar in terms of code paths.

> Here's a breakdown of the most common system calls made from my test
> harness:
>
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  91.26    0.046362         228       203           open
>   7.87    0.003998        1999         2           mkdir
>   0.59    0.000298           0       600         2 stat
>   0.19    0.000098           0       200           write
>   0.09    0.000045           0       302           read
>   0.00    0.000000           0       202           close
>
> Note this report doesn't show wall-clock time (I obtained it with
> strace -c). Roughly half the calls to open() are creating files, the
> rest open existing files.
>
> My questions:
>
> - What exactly happens during open()? I'm guessing that at least
> the journal is flushed to disk. Timings for open() are long and
> highly variable compared to other filesystems (e.g. ext3).
> The strace utility is limited to showing system calls from user
> space; it'd be interesting to see what I/O takes place in kernel
> space, but I don't have any way to do that (do I?). Am I network
> bound or I/O bound here? The latency looks suspiciously like disk
> seek times to me.

Let's take these one at a time... :-)

Firstly, during open there are two paths, depending on whether a file
is being created or not. If not, then the time taken is likely to be a
lot shorter, since there is less to do.

In the non-create case, open takes a shared glock, which implies not
only the dlm lock request but also a disk read in order to read in the
inode itself. This is true for both gfs and gfs2, and it's a bit of a
pain that the only reason gfs2 requires this is to make an O_LARGEFILE
test against the size of the inode.

The create case can potentially cause a lot of other I/O to occur.
Adding a directory entry usually takes only a short period of time,
since there is already space available in the directory. If the
directory has become full in some sense and needs to be expanded,
however, there can be I/O to allocate a directory leaf block and/or
hash table blocks and/or indirect blocks. This is in addition to the
block for the inode itself, and if selinux or acls are in use,
additional blocks may be allocated to contain their xattrs as well.

The blocks are allocated from resource groups, so a suitable resource
group must be found and locked in order to allow allocation. That
requires reading in the rgrp header (or more than one, if the fs is
nearly full and the rgrp has not got enough free blocks). GFS and GFS2
use slightly different algorithms for selecting a suitable resource
group, but the same principle of it being a bitmap with summary
information applies.

> - Is there a strategy I can use to return more quickly from open()
> on a GFS filesystem?

At the current time, probably not.
> - Before I spend time migrating to GFS2, is there any reason to
> believe GFS2 would perform significantly better here?

At the moment I suspect that there wouldn't be a huge difference in
this area, but we are intending to do some work to speed it up in
GFS2. That's one reason I started working on the xattr code a little
while back, since that is one of the things which was blocking a more
efficient create/open,

Steve.

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster