Re: FreeBSD port of GlusterFS racks up a lot of CPU usage

Rick Macklem <rmacklem@xxxxxxxxxxx> · Tue, 19 Jan 2016 17:51:31 -0500 (EST)

Raghavendra Gowdappa wrote:
> 
> 
> ----- Original Message -----
> > From: "Martin Simmons" <martin@xxxxxxxxxxxxx>
> > To: "Raghavendra Gowdappa" <rgowdapp@xxxxxxxxxx>
> > Cc: rmacklem@xxxxxxxxxxx, jdarcy@xxxxxxxxxx, raghavendra@xxxxxxxxxxx,
> > gluster-devel@xxxxxxxxxxx,
> > freebsd-fs@xxxxxxxxxxx, xhernandez@xxxxxxxxxx, jkh@xxxxxxxxxxxxx
> > Sent: Tuesday, January 19, 2016 3:35:07 PM
> > Subject: Re:  FreeBSD port of GlusterFS racks up a lot of
> > CPU usage
> > 
> > >>>>> On Tue, 19 Jan 2016 01:01:19 -0500 (EST), Raghavendra Gowdappa said:
> > > 
> > > ----- Original Message -----
> > > > From: "Rick Macklem" <rmacklem@xxxxxxxxxxx>
> > > > To: "Raghavendra Gowdappa" <rgowdapp@xxxxxxxxxx>
> > > > Cc: "Jeff Darcy" <jdarcy@xxxxxxxxxx>, "Raghavendra G"
> > > > <raghavendra@xxxxxxxxxxx>, "freebsd-fs"
> > > > <freebsd-fs@xxxxxxxxxxx>, "Hubbard Jordan" <jkh@xxxxxxxxxxxxx>, "Xavier
> > > > Hernandez" <xhernandez@xxxxxxxxxx>, "Gluster
> > > > Devel" <gluster-devel@xxxxxxxxxxx>
> > > > Sent: Tuesday, January 19, 2016 4:07:09 AM
> > > > Subject: Re:  FreeBSD port of GlusterFS racks up a lot
> > > > of
> > > > CPU usage
> > > > 
> > > > Raghavendra Gowdappa wrote:
> > > > > 
> > > > > 
> > > > > ----- Original Message -----
> > > > > > From: "Rick Macklem" <rmacklem@xxxxxxxxxxx>
> > > > > > To: "Jeff Darcy" <jdarcy@xxxxxxxxxx>
> > > > > > Cc: "Raghavendra G" <raghavendra@xxxxxxxxxxx>, "freebsd-fs"
> > > > > > <freebsd-fs@xxxxxxxxxxx>, "Hubbard Jordan"
> > > > > > <jkh@xxxxxxxxxxxxx>, "Xavier Hernandez" <xhernandez@xxxxxxxxxx>,
> > > > > > "Gluster
> > > > > > Devel" <gluster-devel@xxxxxxxxxxx>
> > > > > > Sent: Saturday, January 9, 2016 7:29:59 AM
> > > > > > Subject: Re:  FreeBSD port of GlusterFS racks up a
> > > > > > lot
> > > > > > of
> > > > > > CPU usage
> > > > > > 
> > > > > > Jeff Darcy wrote:
> > > > > > > > > I don't know anything about gluster's poll implementation so
> > > > > > > > > I
> > > > > > > > > may
> > > > > > > > > be totally wrong, but would it be possible to use an eventfd
> > > > > > > > > (or a
> > > > > > > > > pipe if eventfd is not supported) to signal the need to add
> > > > > > > > > more
> > > > > > > > > file descriptors to the poll call ?
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > The poll call should listen on this new fd. When we need to
> > > > > > > > > change
> > > > > > > > > the fd list, we should simply write to the eventfd or pipe
> > > > > > > > > from
> > > > > > > > > another thread.  This will cause the poll call to return and
> > > > > > > > > we
> > > > > > > > > will
> > > > > > > > > be able to change the fd list without having a short timeout
> > > > > > > > > nor
> > > > > > > > > having to decide on any trade-off.
> > > > > > > > 
> > > > > > > >
> > > > > > > > Thats a nice idea. Based on my understanding of why timeouts
> > > > > > > > are
> > > > > > > > being
> > > > > > > > used, this approach can work.
> > > > > > > 
> > > > > > > The own-thread code which preceded the current poll
> > > > > > > implementation
> > > > > > > did
> > > > > > > something similar, using a pipe fd to be woken up for new
> > > > > > > *outgoing*
> > > > > > > messages.  That code still exists, and might provide some insight
> > > > > > > into
> > > > > > > how to do this for the current poll code.
> > > > > > I took a look at event-poll.c and found something interesting...
> > > > > > - A pipe called "breaker" is already set up by
> > > > > > event_pool_new_poll()
> > > > > > and
> > > > > >   closed by event_pool_destroy_poll(), however it never gets used
> > > > > >   for
> > > > > >   anything.
> > > > > 
> > > > > I did a check on history, but couldn't find any information on why it
> > > > > was
> > > > > removed. Can you send this patch to http://review.gluster.org ? We
> > > > > can
> > > > > review and merge the patch over there. If you are not aware,
> > > > > development
> > > > > work flow can be found at:
> > > > > 
> > > > > http://www.gluster.org/community/documentation/index.php/Developers
> > > > > 
> > > > Actually, the patch turned out to be a flop. Sometimes a fuse mount
> > > > would
> > > > end
> > > > up with an empty file system with the patch. (I don't know why it was
> > > > broken,
> > > > but maybe the original author tan into issues as well?)
> > > 
> > > +static void
> > > +event_pool_changed (struct event_pool *event_pool)
> > > +{
> > > +
> > > +        /* Write a byte into the breaker pipe to wake up poll(). */
> > > +        if (event_pool->breaker[1] >= 0)
> > > +                write(event_pool->breaker[1], "X", 1);
> > > +}
> > > 
> > > breaker is set to non-blocking on both read and write ends. So, probably
> > > write might be failing sometimes with EAGAIN/EBUSY and thereby preventing
> > > the socket from being registered. Probably that might be the reason?
> > 
> > In what situations does write return EAGAIN/EBUSY for a pipe?  If it is
> > only
> > when the pipe's internal buffer is full, then there is still unread data
> > and
> > the reading end will wake up anyway, so there should be no need to write
> > more.
> 
> That's a valid point. I was kind of thinking out loud since I couldn't figure
> out what was wrong in earlier approach. I was particularly interested in
> whether write can fail because of any error. Since we were not checking the
> return value, a write failure can result in a socket not being registered.
> Though I couldn't think of any particular reason for failure, felt that its
> a loose end that can be tied up.
> 
> On a different note, I found a bug in current poll implementation which
> could've resulted in high cpu usage. I've attached the diff, which wouldn't
> do an array copy (of registered events to ufd) every "timeout" seconds and
> which I hope which will bring down cpu usage. The issue was
> event_pool->changed was never reset and hence every timeout seconds, we
> would interpret that event_pool has changed even though it has not.
> 
I saw this and was going to do what your patch does, but I think that the
ufd[] must be re-initialized before another poll call anyhow. (When poll()
times out, it does a "continue;" and skips processing the ufd[] list.)

The man pages don't clarify which of the fields "event, revent" are guaranteed
to remain unchanged when a timeout returns. As such, I would think re-initializing
the ufd[] is the safe way to do it.

Anyhow, this CPU usage isn't a big deal and mostly goes away when you change
the timeout from 1msec->10msec. (Checking for new fds 100 times/sec seems like
it should be often enough.)

> > 
> > 
> > >         if (event_pool->breaker[1] >= 0) {
> > >                  do {
> > >                         ret = write(event_pool->breaker[1], "X", 1);
> > >                  } while (ret != 1);
This was just to ensure the write succeeds, despite the non-blocking setting.
(In the code I was testing, I did have a log message call for when it didn't
 return 1 and I didn't see that message logged, from what I remember.)

To be honest, I am seeing a couple of much more serious problems that I need
to chase. (I reported the one about "rm -r" which somehow ends up with files
on both bricks even though I am using "plain distribute", so this doesn't make
much sense to me. The other is a case of a directory where the "ls" only shows
the files from one of the two bricks and can't find the rest of the files,
although they are on the other brick. I don't know why this happens for one
directory, but not others.)
The FreeBSD port is pretty green and the FreeBSD fuse module is, too.
I am planning on doing some testing using the NFSv3 server instead, to
see if these are fuse related.

Bottom line, I'm sorry I even mentioned this. It isn't a big issue and
is easily avoided by increasing the timeout to 10msec.

Anyhow, thanks for all the comments, rick

> > >         }
> > 
> > This will busy-wait, which doesn't look good.
> > 
> > 
> > > Also similar logic might be required while flushing out junk from read
> > > end
> > > too.
> > 
> > Failure to read everything should be benign because it will just cause poll
> > to
> > return again immediately next time.
> > 
> > 
> > > > 
> > > > Anyhow, I am now using the 3.7.6 event-poll.c code except that I have
> > > > increased
> > > > the timeout from 1msec->10msec. (Going from 1->5->10 didn't seem to
> > > > cause
> > > > a
> > > > problem, but I got slower test runs when I increased to 20msec, so I've
> > > > settled on
> > > > 10mses. This does reduce the CPU usage when the GlusterFS file systems
> > > > aren't
> > > > active.)
> > > > I will submit this one line change to your workflow if it continues to
> > > > test
> > > > ok.
> > > > 
> > > > Thanks for everyone's input, rick
> > > > 
> > > > > > 
> > > > > > So, I added a few lines of code that writes a byte to it whenever
> > > > > > the
> > > > > > list
> > > > > > of
> > > > > > file descriptors is changed and read when poll() returns, if its
> > > > > > revents
> > > > > > is
> > > > > > set.
> > > > > > I also changed the timeout to -1 (infinity) and it seems to work
> > > > > > for
> > > > > > a
> > > > > > trivial
> > > > > > test.
> > > > > > --> Btw, I also noticed the "changed" variable gets set to 1 on a
> > > > > > change,
> > > > > > but
> > > > > >     never reset to 0. I didn't change this, since it looks "racey".
> > > > > >     (ie.
> > > > > >     I
> > > > > >     think you could easily get a race between a thread that clears
> > > > > >     it
> > > > > >     and
> > > > > >     one
> > > > > >     that adds a new fd.)
> > > > > > 
> > > > > > A slightly safer version of the patch would set a long (100msec ??)
> > > > > > timeout
> > > > > > instead
> > > > > > of -1.
> > > > > > 
> > > > > > Anyhow, I've attached the patch in case anyone would like to try it
> > > > > > and
> > > > > > will
> > > > > > create a bug report for this after I've had more time to test it.
> > > > > > (I only use a couple of laptops, so my testing will be minimal.)
> > > > > > 
> > > > > > Thanks for all the help, rick
> > > > > > 
> > > > > > > _______________________________________________
> > > > > > > freebsd-fs@xxxxxxxxxxx mailing list
> > > > > > > https://lists.freebsd.org/mailman/listinfo/freebsd-fs
> > > > > > > To unsubscribe, send any mail to
> > > > > > > "freebsd-fs-unsubscribe@xxxxxxxxxxx"
> > > > > > > 
> > > > > > 
> > > > > 
> > > > 
> > > _______________________________________________
> > > freebsd-fs@xxxxxxxxxxx mailing list
> > > https://lists.freebsd.org/mailman/listinfo/freebsd-fs
> > > To unsubscribe, send any mail to "freebsd-fs-unsubscribe@xxxxxxxxxxx"
> > > 
> > 
> 
_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-devel