It took a week to happen again; I had hoped it was fixed, but alas it is not. Looking at the logged top output on the active MDS server, the load average was 0.00 the whole time and memory usage never changed much: it is using close to 100% of RAM and some swap, but since I changed memory.swappiness, swap usage hasn't gone up and has been slowly coming back down. Same symptoms: the mount on the client is unresponsive, and a cat of /sys/kernel/debug/ceph/*/mdsc showed a whole list of entries. A umount and remount seems to fix it.

On Fri, Aug 29, 2014 at 11:26 AM, James Devine <fxmulder at gmail.com> wrote:

> I am running active/standby and it didn't swap over to the standby. If I
> shut down the active server it swaps to the standby fine though. When there
> were issues, disk access would back up on the webstats servers and a cat of
> /sys/kernel/debug/ceph/*/mdsc would have a list of entries, whereas
> normally it would only list one or two if any. I have 4 cores and 2GB of
> RAM on the MDS machines. Watching it right now, it is using most of the RAM
> and some swap, although most of the active RAM is disk cache. I lowered
> the memory.swappiness value to see if that helps. I'm also logging top
> output in case it happens again.
>
>
> On Thu, Aug 28, 2014 at 8:22 PM, Yan, Zheng <ukernel at gmail.com> wrote:
>
>> On Fri, Aug 29, 2014 at 8:36 AM, James Devine <fxmulder at gmail.com> wrote:
>> >
>> > On Thu, Aug 28, 2014 at 1:30 PM, Gregory Farnum <greg at inktank.com> wrote:
>> >>
>> >> On Thu, Aug 28, 2014 at 10:36 AM, Brian C. Huffman
>> >> <bhuffman at etinternational.com> wrote:
>> >> > Is Ceph Filesystem ready for production servers?
>> >> >
>> >> > The documentation says it's not, but I don't see that mentioned
>> >> > anywhere else.
>> >> > http://ceph.com/docs/master/cephfs/
>> >>
>> >> Everybody has their own standards, but Red Hat isn't supporting it for
>> >> general production use at this time. If you're brave you could test it
>> >> under your workload for a while and see how it comes out; the known
>> >> issues are very much workload-dependent (or just general concerns over
>> >> polish).
>> >> -Greg
>> >> Software Engineer #42 @ http://inktank.com | http://ceph.com
>> >> _______________________________________________
>> >> ceph-users mailing list
>> >> ceph-users at lists.ceph.com
>> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >
>> >
>> >
>> > I've been testing it with our webstats since it gets live hits but isn't
>> > customer affecting. It seems the MDS server has problems every few days,
>> > requiring me to umount and remount the ceph disk to resolve. Not sure if
>> > the issue is resolved in development versions, but as of 0.80.5 we seem
>> > to be hitting it. I set the log verbosity to 20 so there are tons of
>> > logs, but it ends with:
>>
>> The cephfs client is supposed to be able to handle MDS takeover.
>> What symptom makes you umount and remount the cephfs?
>>
>> >
>> > 2014-08-24 07:10:19.682015 7f2b575e7700 10 mds.0.14 laggy, deferring client_request(client.92141:6795587 getattr pAsLsXsFs #10000026dc1)
>> > 2014-08-24 07:10:19.682021 7f2b575e7700 5 mds.0.14 is_laggy 19.324963 > 15 since last acked beacon
>> > 2014-08-24 07:10:20.358011 7f2b554e2700 10 mds.0.14 beacon_send up:active seq 127220 (currently up:active)
>> > 2014-08-24 07:10:21.515899 7f2b575e7700 5 mds.0.14 is_laggy 21.158841 > 15 since last acked beacon
>> > 2014-08-24 07:10:21.515912 7f2b575e7700 10 mds.0.14 laggy, deferring client_session(request_renewcaps seq 26766)
>> > 2014-08-24 07:10:21.515915 7f2b575e7700 5 mds.0.14 is_laggy 21.158857 > 15 since last acked beacon
>> > 2014-08-24 07:10:21.981148 7f2b575e7700 10 mds.0.snap check_osd_map need_to_purge={}
>> > 2014-08-24 07:10:21.981176 7f2b575e7700 5 mds.0.14 is_laggy 21.624117 > 15 since last acked beacon
>> > 2014-08-24 07:10:23.170528 7f2b575e7700 5 mds.0.14 handle_mds_map epoch 93 from mon.0
>> > 2014-08-24 07:10:23.175367 7f2b532d5700 0 -- 10.251.188.124:6800/985 >> 10.251.188.118:0/2461578479 pipe(0x5588a80 sd=23 :6800 s=2 pgs=91 cs=1 l=0 c=0x2cbfb20).fault with nothing to send, going to standby
>> > 2014-08-24 07:10:23.175376 7f2b533d6700 0 -- 10.251.188.124:6800/985 >> 10.251.188.55:0/306923677 pipe(0x5588d00 sd=22 :6800 s=2 pgs=7 cs=1 l=0 c=0x2cbf700).fault with nothing to send, going to standby
>> > 2014-08-24 07:10:23.175380 7f2b531d4700 0 -- 10.251.188.124:6800/985 >> 10.251.188.31:0/2854230502 pipe(0x5589480 sd=24 :6800 s=2 pgs=881 cs=1 l=0 c=0x2cbfde0).fault with nothing to send, going to standby
>> > 2014-08-24 07:10:23.175438 7f2b534d7700 0 -- 10.251.188.124:6800/985 >> 10.251.188.68:0/2928927296 pipe(0x5588800 sd=21 :6800 s=2 pgs=7 cs=1 l=0 c=0x2cbf5a0).fault with nothing to send, going to standby
>> > 2014-08-24 07:10:23.184201 7f2b575e7700 10 mds.0.14 my compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data}
>> > 2014-08-24 07:10:23.184255 7f2b575e7700 10 mds.0.14 mdsmap compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap}
>> > 2014-08-24 07:10:23.184264 7f2b575e7700 10 mds.-1.-1 map says i am 10.251.188.124:6800/985 mds.-1.-1 state down:dne
>> > 2014-08-24 07:10:23.184275 7f2b575e7700 10 mds.-1.-1 peer mds gid 94665 removed from map
>> > 2014-08-24 07:10:23.184282 7f2b575e7700 1 mds.-1.-1 handle_mds_map i (10.251.188.124:6800/985) dne in the mdsmap, respawning myself
>> > 2014-08-24 07:10:23.184284 7f2b575e7700 1 mds.-1.-1 respawn
>> > 2014-08-24 07:10:23.184286 7f2b575e7700 1 mds.-1.-1 e: '/usr/bin/ceph-mds'
>> > 2014-08-24 07:10:23.184288 7f2b575e7700 1 mds.-1.-1 0: '/usr/bin/ceph-mds'
>> > 2014-08-24 07:10:23.184289 7f2b575e7700 1 mds.-1.-1 1: '-i'
>> > 2014-08-24 07:10:23.184290 7f2b575e7700 1 mds.-1.-1 2: 'ceph-cluster1-mds2'
>> > 2014-08-24 07:10:23.184291 7f2b575e7700 1 mds.-1.-1 3: '--pid-file'
>> > 2014-08-24 07:10:23.184292 7f2b575e7700 1 mds.-1.-1 4: '/var/run/ceph/mds.ceph-cluster1-mds2.pid'
>> > 2014-08-24 07:10:23.184293 7f2b575e7700 1 mds.-1.-1 5: '-c'
>> > 2014-08-24 07:10:23.184294 7f2b575e7700 1 mds.-1.-1 6: '/etc/ceph/ceph.conf'
>> > 2014-08-24 07:10:23.184295 7f2b575e7700 1 mds.-1.-1 7: '--cluster'
>> > 2014-08-24 07:10:23.184296 7f2b575e7700 1 mds.-1.-1 8: 'ceph'
>> > 2014-08-24 07:10:23.274640 7f2b575e7700 1 mds.-1.-1 exe_path /usr/bin/ceph-mds
>> > 2014-08-24 07:10:23.606875 7f4c55abb800 0 ceph version 0.80.5 (38b73c67d375a2552d8ed67843c8a65c2c0feba6), process ceph-mds, pid 987
>> > 2014-08-24 07:10:49.024862 7f4c506ad700 1 mds.-1.0 handle_mds_map standby
>> > 2014-08-24 07:10:49.199676 7f4c506ad700 0 mds.-1.0 handle_mds_beacon no longer laggy
>> > 2014-08-24 07:10:50.215240 7f4c506ad700 1 mds.-1.0 handle_mds_map standby
>> > 2014-08-24 07:10:51.290407 7f4c506ad700 1 mds.-1.0 handle_mds_map standby
>> >
>> >
>>
>> Did you use active/standby MDS setup? Did the MDS use lots of memory
>> before it crashed?
>>
>> Regards
>> Yan, Zheng
>>
>> >
>> > _______________________________________________
>> > ceph-users mailing list
>> > ceph-users at lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >
>>
>
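For anyone who hits the same symptoms later, here is a rough sketch of the client-side checks and workarounds discussed in this thread. The mount point, monitor address, and values are illustrative assumptions, not details from the original posts.

    # Count in-flight MDS requests on the client; a long, growing list while
    # the mount is hung matches the symptom described in this thread.
    cat /sys/kernel/debug/ceph/*/mdsc | wc -l

    # Workaround used above: unmount and remount the cephfs kernel client
    # (assumes it is mounted at /mnt/cephfs from a monitor at 10.0.0.1).
    umount -f /mnt/cephfs
    mount -t ceph 10.0.0.1:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret

    # Reduce the MDS host's tendency to swap. The thread mentions lowering
    # memory.swappiness (the per-cgroup knob); the system-wide equivalent is:
    sysctl -w vm.swappiness=10

    # Raise MDS log verbosity to 20 (the "log verbosity" mentioned above);
    # one way is via the [mds] section of ceph.conf, then restart the daemon:
    #   [mds]
    #   debug mds = 20

None of this addresses the underlying laggy-beacon problem; it only helps confirm the symptom and recover the client while the MDS issue is investigated.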