----- Message from Martin B Nielsen <martin@xxxxxxxxxxx> ---------
Date: Mon, 31 Mar 2014 15:55:24 +0200
From: Martin B Nielsen <martin@xxxxxxxxxxx>
Subject: Re: MDS debugging
To: Kenneth Waegeman <Kenneth.Waegeman@xxxxxxxx>
Cc: ceph-users <ceph-users@xxxxxxxxxxxxxx>
Hi,
I can see you're running mon, mds and osd on the same server.
That's true: we have 3 servers, each running a monitor and OSDs.
Also, from a quick glance you're using around 13GB resident memory.
If your system only has 16GB, I'm guessing you're swapping (or close to it) by
now. How much memory does the system hold?
32GB on each of the 3 hosts, no swap used.
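For reference, this is roughly how I checked that (a minimal sketch; the exact output columns depend on the procps version installed):

# total/used/free memory and swap on each host
free -m

# ongoing swap activity: the si/so columns should stay at 0
vmstat 5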
Also, how busy are the disks? Or is it primarily CPU-bound? Do you have many
processes waiting for run time, or a high interrupt count?
I already stopped the sync and rebooted the cluster; at first sight the
performance is OK again, so I am going to re-run the test and report back on
the interrupt counts and waiting processes (the sketch just below is what I
plan to capture).
I did look at the disks while the problem was occurring, and they weren't busy
at all at that moment.
On the server with the active mds there was a lot of CPU usage from ceph-mds;
on the other hosts there were a lot of migration/x kernel threads with high
CPU usage.
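For the re-run I plan to capture something along these lines (a rough sketch, not a polished script; iostat and pidstat assume sysstat is installed):

# per-disk utilisation and await times, to see whether the OSD disks are actually busy
iostat -x 5

# run queue length (r), blocked tasks (b) and interrupts/context switches (in/cs)
vmstat 5

# which devices/CPUs the interrupts are landing on
cat /proc/interrupts

# per-process CPU usage, to see whether ceph-mds or the migration/x threads dominate
pidstat -u 5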
Thanks a lot!
Kenneth
/Martin
On Mon, Mar 31, 2014 at 1:49 PM, Kenneth Waegeman <Kenneth.Waegeman@xxxxxxxx> wrote:
Hi all,
Before the weekend we started some copy tests over ceph-fuse. Initially this
went OK, but then the performance started dropping gradually. Things are going
very slowly now:
2014-03-31 13:36:37.047423 mon.0 [INF] pgmap v265871: 1300 pgs: 1300 active+clean; 19872 GB data, 59953 GB used, 74117 GB / 130 TB avail; 44747 kB/s rd, 216 kB/s wr, 10 op/s
2014-03-31 13:36:38.049286 mon.0 [INF] pgmap v265872: 1300 pgs: 1300 active+clean; 19872 GB data, 59953 GB used, 74117 GB / 130 TB avail; 4069 B/s rd, 363 kB/s wr, 24 op/s
2014-03-31 13:36:39.057680 mon.0 [INF] pgmap v265873: 1300 pgs: 1300 active+clean; 19872 GB data, 59953 GB used, 74117 GB / 130 TB avail; 5092 B/s rd, 151 kB/s wr, 22 op/s
2014-03-31 13:36:40.075718 mon.0 [INF] pgmap v265874: 1300 pgs: 1300 active+clean; 19872 GB data, 59953 GB used, 74117 GB / 130 TB avail; 25961 B/s rd, 1527 B/s wr, 10 op/s
2014-03-31 13:36:41.087764 mon.0 [INF] pgmap v265875: 1300 pgs: 1300 active+clean; 19872 GB data, 59953 GB used, 74117 GB / 130 TB avail; 71574 kB/s rd, 4564 B/s wr, 17 op/s
2014-03-31 13:36:42.109200 mon.0 [INF] pgmap v265876: 1300 pgs: 1300 active+clean; 19872 GB data, 59953 GB used, 74117 GB / 130 TB avail; 71238 kB/s rd, 3534 B/s wr, 9 op/s
2014-03-31 13:36:43.128113 mon.0 [INF] pgmap v265877: 1300 pgs: 1300 active+clean; 19872 GB data, 59953 GB used, 74117 GB / 130 TB avail; 4022 B/s rd, 116 kB/s wr, 24 op/s
2014-03-31 13:36:44.143382 mon.0 [INF] pgmap v265878: 1300 pgs: 1300 active+clean; 19872 GB data, 59953 GB used, 74117 GB / 130 TB avail; 8030 B/s rd, 117 kB/s wr, 29 op/s
2014-03-31 13:36:45.160405 mon.0 [INF] pgmap v265879: 1300 pgs: 1300 active+clean; 19872 GB data, 59953 GB used, 74117 GB / 130 TB avail; 7049 B/s rd, 4531 B/s wr, 9 op/s
ceph-mds seems very busy, and so does just one of the OSDs!
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
54279 root 20 0 8561m 7.5g 4408 S 105.6 23.8 3202:05 ceph-mds
50242 root 20 0 1378m 373m 6452 S 0.7 1.2 523:38.77 ceph-osd
49446 root 18 -2 10644 356 320 S 0.0 0.0 0:00.00 udevd
49444 root 18 -2 10644 428 320 S 0.0 0.0 0:00.00 udevd
49319 root 20 0 1444m 405m 5684 S 0.0 1.3 513:41.13 ceph-osd
48452 root 20 0 1365m 364m 5636 S 0.0 1.1 551:52.31 ceph-osd
47641 root 20 0 1567m 388m 5880 S 0.0 1.2 754:50.60 ceph-osd
46811 root 20 0 1441m 393m 8256 S 0.0 1.2 603:11.26 ceph-osd
46028 root 20 0 1594m 398m 6156 S 0.0 1.2 657:22.16 ceph-osd
45275 root 20 0 1545m 510m 9920 S 18.9 1.6 943:11.99 ceph-osd
44532 root 20 0 1509m 395m 7380 S 0.0 1.2 665:30.66 ceph-osd
43835 root 20 0 1397m 384m 8292 S 0.0 1.2 466:35.47 ceph-osd
43146 root 20 0 1412m 393m 5884 S 0.0 1.2 506:42.07 ceph-osd
42496 root 20 0 1389m 364m 5292 S 0.0 1.1 522:37.70 ceph-osd
41863 root 20 0 1504m 393m 5864 S 0.0 1.2 462:58.11 ceph-osd
39035 root 20 0 918m 694m 3396 S 3.3 2.2 55:53.59 ceph-mon
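To see what the mds is actually spending its time on, I am thinking of pulling its perf counters over the admin socket, roughly like this (the mds id 'a' below is just a placeholder and the socket path is the default one; both may differ on other setups):

# dump the mds performance counters (request counts, cache size, journal state)
ceph --admin-daemon /var/run/ceph/ceph-mds.a.asok perf dump

# the daemon's running config, to double-check cache and debug settings
ceph --admin-daemon /var/run/ceph/ceph-mds.a.asok config show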
Does this look familiar to someone?
How can we debug this further?
I have already set the mds debug level to 5. The log shows a lot of 'lookup'
entries, but I can't see any reported warnings or errors.
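For completeness, this is roughly how I raised the debug level, and what I intend to grep for next ('mds.a' and the log path are placeholders for our actual mds name and the default log location):

# raise mds logging at runtime (no restart needed), and lower it again afterwards
ceph tell mds.a injectargs '--debug-mds 5'
ceph tell mds.a injectargs '--debug-mds 1'

# look for anything beyond the 'lookup' noise
grep -iE 'slow|warn|err' /var/log/ceph/ceph-mds.a.log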
Thanks!
Kind regards,
Kenneth
----- End message from Martin B Nielsen <martin@xxxxxxxxxxx> -----
--
Kind regards,
Kenneth Waegeman
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com