On Mon, 2015-11-30 at 18:58 -0600, Mark Nelson wrote:
> Hi Laurent,
>
> Wow, that's excessive! I'd see if anyone else has any tricks first, but
> if nothing else helps, running an OSD through valgrind with massif will
> probably help pinpoint what's going on. Have you tweaked the recovery
> tunables at all?
>
> Oh, forgot to ask, any core dumps?
>
> Mark

Hi Mark,

The only options we've touched are:

  "osd_max_backfills": "1",
  "osd_recovery_max_active": "1",
  "osd_recovery_op_priority": "1",

plus all the "noxxx" flags visible in the ceph -s output below. Do you have any other options in mind that we could tweak?

We have no core dump yet; Mehdi is trying a heap dump on some of the OSDs. Looking at "ceph tell osd.X heap stats", there is almost nothing in the freelists: nearly all of the memory is "in use by application". Here is an OSD using 19 GB of resident RAM for a 2 TB disk (top and df output):

25118 root      20   0 19.030g 0.014t   2568 S  38.2 46.4   8:43.55 /usr/bin/ceph-osd --cluster=ceph -i 2 -f
/dev/sdb1       1.9T  1.6T  271G  86% /var/lib/ceph/osd/ceph-2

root@g3:~# ceph tell osd.2 heap stats
osd.2 tcmalloc heap stats:------------------------------------------------
MALLOC:    18568498424 (17708.3 MiB) Bytes in use by application
MALLOC: +    189464576 (  180.7 MiB) Bytes in page heap freelist
MALLOC: +    210782296 (  201.0 MiB) Bytes in central cache freelist
MALLOC: +      4416048 (    4.2 MiB) Bytes in transfer cache freelist
MALLOC: +     29157504 (   27.8 MiB) Bytes in thread cache freelists
MALLOC: +     80457888 (   76.7 MiB) Bytes in malloc metadata
MALLOC:   ------------
MALLOC: =  19082776736 (18198.8 MiB) Actual memory used (physical + swap)
MALLOC: +       393216 (    0.4 MiB) Bytes released to OS (aka unmapped)
MALLOC:   ------------
MALLOC: =  19083169952 (18199.1 MiB) Virtual address space used
MALLOC:
MALLOC:        1150671 Spans in use
MALLOC:            342 Thread heaps in use
MALLOC:           8192 Tcmalloc page size
------------------------------------------------
Call ReleaseFreeMemory() to release freelist memory to the OS (via madvise()).
Bytes released to the OS take up virtual address space but no physical memory.
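Since nearly everything is "in use by application" rather than sitting in a freelist, asking tcmalloc to give memory back is unlikely to help much here, but it is a cheap check, and the built-in tcmalloc heap profiler is roughly what Mehdi is trying for the dump. For reference, the commands involved are along these lines (where the profile files end up, and what they contain, depends on the build; we assume they land in the OSD log directory):

  ceph tell osd.2 heap release          # have tcmalloc call ReleaseFreeMemory() via madvise()
  ceph tell osd.2 heap start_profiler   # start sampling allocations
  ceph tell osd.2 heap dump             # write a .heap profile file
  ceph tell osd.2 heap stop_profiler    # stop the profiler

The resulting .heap files can then be fed to google-pprof together with the ceph-osd binary to see which call sites are holding the memory.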
The log since the latest restart of the 19 GB OSD is below, followed by ceph -s and ceph osd tree (the cluster is rarely below two OSDs down; it oscillates between 2 and 10 down).

Sincerely,

Laurent

2015-12-01 08:32:13.928973 7f0554016900  0 ceph version 0.94.5-164-gbf9e1b6 (bf9e1b6307692fdc50465a64590d83e3d7015c9d), process ceph-osd, pid 25118
2015-12-01 08:32:13.939219 7f0554016900  0 filestore(/var/lib/ceph/osd/ceph-2) backend xfs (magic 0x58465342)
2015-12-01 08:32:14.071146 7f0554016900  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-2) detect_features: FIEMAP ioctl is supported and appears to work
2015-12-01 08:32:14.071163 7f0554016900  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-2) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option
2015-12-01 08:32:14.186866 7f0554016900  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-2) detect_features: syncfs(2) syscall fully supported (by glibc and kernel)
2015-12-01 08:32:14.186996 7f0554016900  0 xfsfilestorebackend(/var/lib/ceph/osd/ceph-2) detect_feature: extsize is supported and kernel 3.19.0-32-generic >= 3.5
2015-12-01 08:32:15.649088 7f0554016900  0 filestore(/var/lib/ceph/osd/ceph-2) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled
2015-12-01 08:32:29.476440 7f0554016900  1 journal _open /var/lib/ceph/osd/ceph-2/journal fd 19: 5367660544 bytes, block size 4096 bytes, directio = 1, aio = 1
2015-12-01 08:32:29.540794 7f0554016900  1 journal _open /var/lib/ceph/osd/ceph-2/journal fd 19: 5367660544 bytes, block size 4096 bytes, directio = 1, aio = 1
2015-12-01 08:32:29.597359 7f0554016900  0 <cls> cls/hello/cls_hello.cc:271: loading cls_hello
2015-12-01 08:32:29.623116 7f0554016900  0 osd.2 214588 crush map has features 2303210029056, adjusting msgr requires for clients
2015-12-01 08:32:29.623132 7f0554016900  0 osd.2 214588 crush map has features 2578087936000 was 8705, adjusting msgr requires for mons
2015-12-01 08:32:29.623138 7f0554016900  0 osd.2 214588 crush map has features 2578087936000, adjusting msgr requires for osds
2015-12-01 08:32:29.623158 7f0554016900  0 osd.2 214588 load_pgs
2015-12-01 08:34:44.563581 7f0554016900  0 osd.2 214588 load_pgs opened 1496 pgs
2015-12-01 08:34:44.564545 7f0546afe700  0 -- 192.168.99.253:6811/25118 >> :/0 pipe(0xad4e000 sd=27 :6811 s=0 pgs=0 cs=0 l=0 c=0xb6730000).accept failed to getpeername (107) Transport endpoint is not connected
2015-12-01 08:34:44.564600 7f05466bf700  0 -- 192.168.99.253:6811/25118 >> :/0 pipe(0x893c000 sd=32 :6811 s=0 pgs=0 cs=0 l=0 c=0xb6730420).accept failed to getpeername (107) Transport endpoint is not connected
2015-12-01 08:34:44.577168 7f0554016900 -1 osd.2 214588 log_to_monitors {default=true}
2015-12-01 08:34:44.579147 7f052b4ff700  0 osd.2 214588 ignoring osdmap until we have initialized
2015-12-01 08:34:44.579810 7f052b4ff700  0 osd.2 214588 ignoring osdmap until we have initialized
2015-12-01 08:34:48.054636 7f05469fd700  0 -- 192.168.99.253:6811/25118 >> 192.168.99.243:6802/8218 pipe(0x6c82000 sd=30 :38619 s=1 pgs=0 cs=0 l=0 c=0xb6730b00).connect claims to be 192.168.99.243:6802/14163 not 192.168.99.243:6802/8218 - wrong node!
2015-12-01 08:34:48.054691 7f05469fd700  0 -- 192.168.99.253:6811/25118 >> 192.168.99.243:6802/8218 pipe(0x6c82000 sd=30 :38619 s=1 pgs=0 cs=0 l=0 c=0xb6730b00).fault with nothing to send, going to standby
2015-12-01 08:34:48.056487 7f053b9fa700  0 -- 192.168.99.253:6811/25118 >> 192.168.99.251:6802/31468 pipe(0x681a000 sd=32 :0 s=1 pgs=0 cs=0 l=0 c=0xb6730dc0).fault with nothing to send, going to standby
2015-12-01 08:34:49.148364 7f0544b56700  0 -- 192.168.99.253:6811/25118 >> 192.168.99.238:6816/1031547 pipe(0x653d000 sd=37 :0 s=1 pgs=0 cs=0 l=0 c=0xb6732ec0).fault with nothing to send, going to standby
2015-12-01 08:34:49.149146 7f053c52f700  0 -- 192.168.99.253:6811/25118 >> 192.168.99.245:6809/17836 pipe(0x8489000 sd=38 :56560 s=1 pgs=0 cs=0 l=0 c=0xb6733020).connect claims to be 192.168.99.245:6809/29285 not 192.168.99.245:6809/17836 - wrong node!
2015-12-01 08:34:49.149228 7f053c52f700  0 -- 192.168.99.253:6811/25118 >> 192.168.99.245:6809/17836 pipe(0x8489000 sd=38 :56560 s=1 pgs=0 cs=0 l=0 c=0xb6733020).fault with nothing to send, going to standby
2015-12-01 08:34:49.151638 7f0546afe700  0 -- 192.168.99.253:6811/25118 >> 192.168.99.251:6805/2025832 pipe(0x9397000 sd=39 :0 s=1 pgs=0 cs=0 l=0 c=0xb88fbfa0).fault with nothing to send, going to standby
2015-12-01 08:34:49.152571 7f05466bf700  0 -- 192.168.99.253:6811/25118 >> 192.168.99.244:6807/17661 pipe(0x9b0a000 sd=40 :0 s=1 pgs=0 cs=0 l=0 c=0xb88fcc00).fault with nothing to send, going to standby
2015-12-01 08:34:49.153592 7f053c630700  0 -- 192.168.99.253:6811/25118 >> 192.168.99.243:6811/2013633 pipe(0x6379000 sd=41 :59009 s=1 pgs=0 cs=0 l=0 c=0xb88fd860).connect claims to be 192.168.99.243:6811/14225 not 192.168.99.243:6811/2013633 - wrong node!
2015-12-01 08:34:49.153639 7f053c630700  0 -- 192.168.99.253:6811/25118 >> 192.168.99.243:6811/2013633 pipe(0x6379000 sd=41 :59009 s=1 pgs=0 cs=0 l=0 c=0xb88fd860).fault with nothing to send, going to standby
2015-12-01 08:34:49.171878 7f053b8f9700  0 -- 192.168.99.253:6811/25118 >> 192.168.99.238:6801/31478 pipe(0x55d7e000 sd=43 :0 s=1 pgs=0 cs=0 l=0 c=0xb895b080).fault with nothing to send, going to standby
2015-12-01 08:34:49.176018 7f053b5e8700  0 -- 192.168.99.253:6811/25118 >> 192.168.99.237:6813/12186 pipe(0xb09f000 sd=48 :0 s=1 pgs=0 cs=0 l=0 c=0xb895d440).fault with nothing to send, going to standby
2015-12-01 08:34:49.183555 7f053b4e7700  0 -- 192.168.99.253:6811/25118 >> 192.168.99.247:6803/1013467 pipe(0xafc7000 sd=49 :0 s=1 pgs=0 cs=0 l=0 c=0xb8981b20).fault with nothing to send, going to standby
2015-12-01 08:34:49.200305 7f053afcd700  0 -- 192.168.99.253:6811/25118 >> 192.168.99.238:6809/1031512 pipe(0x994a000 sd=92 :0 s=1 pgs=0 cs=0 l=0 c=0xb891c940).fault with nothing to send, going to standby
2015-12-01 08:34:49.230606 7f053aecc700  0 -- 192.168.99.253:6811/25118 >> 192.168.99.237:6805/1012083 pipe(0x57530000 sd=67 :0 s=1 pgs=0 cs=0 l=0 c=0xb8a2b700).fault with nothing to send, going to standby
2015-12-01 08:34:49.243292 7f053adcb700  0 -- 192.168.99.253:6811/25118 >> 192.168.99.237:6809/12149 pipe(0x87ec000 sd=119 :0 s=1 pgs=0 cs=0 l=0 c=0xb8a29a20).fault with nothing to send, going to standby
2015-12-01 08:34:49.245789 7f053acca700  0 -- 192.168.99.253:6811/25118 >> 192.168.99.247:6805/9657 pipe(0x7a27000 sd=145 :0 s=1 pgs=0 cs=0 l=0 c=0xb8a367e0).fault with nothing to send, going to standby
2015-12-01 08:34:49.253996 7f053abc9700  0 -- 192.168.99.253:6811/25118 >> 192.168.99.237:6820/1012116 pipe(0xafc2000 sd=146 :0 s=1 pgs=0 cs=0 l=0 c=0xb898a6e0).fault with nothing to send, going to standby
2015-12-01 08:34:49.758783 7f053aac8700  0 -- 192.168.99.253:6811/25118 >> 192.168.99.240:6801/20428 pipe(0x9595000 sd=100 :0 s=1 pgs=0 cs=0 l=0 c=0x272e3600).fault with nothing to send, going to standby
2015-12-01 08:34:59.347751 7f0554016900  0 osd.2 214588 done with init, starting boot process
2015-12-01 08:36:23.792638 7f053b9fa700  0 -- 192.168.99.253:6811/25118 >> 192.168.99.252:6808/9305 pipe(0xa18b000 sd=30 :45593 s=2 pgs=4644 cs=1 l=0 c=0x57699860).fault, initiating reconnect
2015-12-01 08:36:23.792864 7f05469fd700  0 -- 192.168.99.253:6811/25118 >> 192.168.99.252:6808/9305 pipe(0xa18b000 sd=30 :45593 s=1 pgs=4644 cs=2 l=0 c=0x57699860).fault
2015-12-01 08:36:23.795104 7f05138b0700  0 -- 192.168.99.253:0/25118 >> 192.168.99.252:6809/9305 pipe(0xb7b1d000 sd=133 :0 s=1 pgs=0 cs=0 l=1 c=0x38b475a0).fault
2015-12-01 08:36:23.795363 7f05137af700  0 -- 192.168.99.253:0/25118 >> 192.168.99.252:6810/9305 pipe(0x14a9f0000 sd=137 :0 s=1 pgs=0 cs=0 l=1 c=0x38b47de0).fault
2015-12-01 08:36:39.271805 7f054958e700 -1 osd.2 215024 heartbeat_check: no reply from osd.1 since back 2015-12-01 08:36:18.990748 front 2015-12-01 08:36:18.990748 (cutoff 2015-12-01 08:36:19.271805)
2015-12-01 08:36:40.096838 7f051acde700 -1 osd.2 215025 heartbeat_check: no reply from osd.1 since back 2015-12-01 08:36:18.990748 front 2015-12-01 08:36:18.990748 (cutoff 2015-12-01 08:36:20.096837)
2015-12-01 08:36:40.273052 7f054958e700 -1 osd.2 215025 heartbeat_check: no reply from osd.1 since back 2015-12-01 08:36:18.990748 front 2015-12-01 08:36:18.990748 (cutoff 2015-12-01 08:36:20.273051)

# ceph -s
    cluster 1fe74663-8dfa-486c-bb80-3bd94c90c967
     health HEALTH_WARN
            210 pgs backfill
            7 pgs backfilling
            408 pgs degraded
            3831 pgs down
            7901 pgs peering
            5 pgs recovering
            28 pgs recovery_wait
            111 pgs stale
            251 pgs stuck degraded
            8620 pgs stuck inactive
            3 pgs stuck stale
            8975 pgs stuck unclean
            218 pgs stuck undersized
            366 pgs undersized
            267 requests are blocked > 32 sec
            recovery 195827/78807007 objects degraded (0.248%)
            recovery 340087/78807007 objects misplaced (0.432%)
            recovery 30993/12140933 unfound (0.255%)
            1/49 in osds are down
            noout,nobackfill,norebalance,norecover,noscrub,nodeep-scrub,notieragent flag(s) set
     monmap e8: 3 mons at {g1=192.168.99.251:6789/0,g2=192.168.99.252:6789/0,g3=192.168.99.253:6789/0}
            election epoch 1374, quorum 0,1,2 g1,g2,g3
     osdmap e215938: 50 osds: 48 up, 49 in; 1826 remapped pgs
            flags noout,nobackfill,norebalance,norecover,noscrub,nodeep-scrub,notieragent
      pgmap v35383395: 11264 pgs, 6 pools, 39368 GB data, 11856 kobjects
            66149 GB used, 30846 GB / 96995 GB avail
            195827/78807007 objects degraded (0.248%)
            340087/78807007 objects misplaced (0.432%)
            30993/12140933 unfound (0.255%)
                3675 down+peering
                2868 peering
                2282 active+clean
                1169 remapped+peering
                 382 activating
                 179 activating+remapped
                 101 active+remapped+wait_backfill
                  95 down+remapped+peering
                  93 active+undersized+degraded+remapped+wait_backfill
                  88 activating+undersized+degraded+remapped
                  78 active+undersized+degraded
                  60 activating+undersized+degraded
                  56 stale+down+peering
                  21 stale+remapped+peering
                  20 active+recovery_wait+degraded
                  16 undersized+degraded+remapped+wait_backfill+peered
                  12 stale+peering
                  10 undersized+degraded+peered
                   9 activating+degraded
                   7 activating+degraded+remapped
                   6 stale+activating
                   5 stale+down+remapped+peering
                   4 active+recovering+degraded
                   3 active+recovery_wait+undersized+degraded+remapped
                   3 active+undersized+degraded+remapped+backfilling
                   3 stale+active+undersized+degraded
                   3 stale+activating+undersized+degraded
                   2 stale+activating+undersized+degraded+remapped
                   2 undersized+degraded+remapped+backfilling+peered
                   2 active+remapped+backfilling
                   2 active+recovery_wait+degraded+remapped
                   2 active+recovery_wait+undersized+degraded
                   1 recovering+undersized+degraded+peered
                   1 recovery_wait+undersized+degraded+remapped+peered
                   1 stale+activating+remapped
                   1 remapped
                   1 stale+active+clean
                   1 stale+undersized+degraded+peered
  recovery io 357 MB/s, 94 objects/s
  client io 5206 MB/s rd, 3488 MB/s wr, 78940 op/s

# ceph osd tree
ID  WEIGHT   TYPE NAME          UP/DOWN REWEIGHT PRIMARY-AFFINITY
 -5  4.17094 root ssd
 -6  0.40300     host g1-ssd
 42  0.40300         osd.42          up  1.00000          1.00000
 -7  0.43599     host g2-ssd
  4  0.21399         osd.4           up  1.00000          1.00000
  7  0.22198         osd.7           up  1.00000          1.00000
 -8  0.22198     host g3-ssd
  6  0.22198         osd.6           up  1.00000          1.00000
-10  0.40900     host n7-ssd
 14  0.40900         osd.14          up  1.00000          1.00000
-11  0.20900     host stri-ssd
 12  0.20900         osd.12          up  1.00000          1.00000
-14  0.20799     host g4-ssd
 19  0.20799         osd.19          up  1.00000          1.00000
-16  0.20900     host g5-ssd
  5  0.20900         osd.5           up  1.00000          1.00000
-17  0.20900     host g6-ssd
 10  0.20900         osd.10          up  1.00000          1.00000
-19  0.43100     host g8-ssd
 31  0.20900         osd.31          up  1.00000          1.00000
 32  0.22198         osd.32          up  1.00000          1.00000
-21  0.53899     host g9-ssd
 34  0.44899         osd.34          up  1.00000          1.00000
 36  0.09000         osd.36          up  1.00000          1.00000
-23  0.09000     host g10-ssd
 40  0.09000         osd.40          up  1.00000          1.00000
-26  0.40300     host g11-ssd
 50  0.40300         osd.50          up  1.00000          1.00000
-27  0.40300     host g12-ssd
 49  0.40300         osd.49          up  1.00000          1.00000
 -1 94.10190 root default
 -2  6.25699     host g1
  0  3.57599         osd.0           up  1.00000          1.00000
 20  2.68100         osd.20          up  1.00000          1.00000
 -3  5.36099     host g2
  1  3.57599         osd.1           up  1.00000          1.00000
 21  1.78499         osd.21          up  1.00000          1.00000
 -4  5.38899     host g3
 16  3.57599         osd.16          up  1.00000          1.00000
  2  1.81299         osd.2           up  1.00000          1.00000
 -9  6.25099     host n7
  8  2.68100         osd.8           up  1.00000          1.00000
 27  1.78499         osd.27          up  1.00000          1.00000
 29  1.78499         osd.29          up  1.00000          1.00000
-12  5.36200     host stri
 15  2.68100         osd.15          up  1.00000          1.00000
 13  2.68100         osd.13          up  1.00000          1.00000
-13 14.30399     host g4
 17  3.57599         osd.17          up  1.00000          1.00000
 23  3.57599         osd.23          up  1.00000          1.00000
 22  3.57599         osd.22        down        0          1.00000
 18  3.57599         osd.18          up  1.00000          1.00000
-15  8.04300     host g5
 25  2.68100         osd.25          up  1.00000          1.00000
 24  2.68100         osd.24          up  1.00000          1.00000
 26  2.68100         osd.26        down  1.00000          1.00000
-18  6.25699     host g6
 28  2.68100         osd.28          up  1.00000          1.00000
  9  3.57599         osd.9           up  1.00000          1.00000
-20  7.18799     host g8
 30  4.50699         osd.30          up  1.00000          1.00000
 41  2.68100         osd.41          up  1.00000          1.00000
-22  7.15199     host g9
 35  3.57599         osd.35          up  1.00000          1.00000
 33  3.57599         osd.33          up  1.00000          1.00000
-24  8.93799     host g10
 39  3.57599         osd.39          up  1.00000          1.00000
 38  2.68100         osd.38          up  1.00000          1.00000
 37  2.68100         osd.37          up  1.00000          1.00000
-25  5.43900     host g11
  3  1.81299         osd.3           up  1.00000          1.00000
 43  1.81299         osd.43          up  1.00000          1.00000
 44  1.81299         osd.44          up  1.00000          1.00000
-28  8.16100     host g12
 45  1.81299         osd.45          up  1.00000          1.00000
 46  1.81299         osd.46          up  1.00000          1.00000
 47  2.72198         osd.47          up  1.00000          1.00000
 48  1.81299         osd.48          up  1.00000          1.00000

> On 11/30/2015 06:52 PM, Laurent GUERBY wrote:
> > Hi,
> >
> > We lost a disk today in our ceph cluster, so we added a new machine with
> > 4 disks to replace the capacity, and we activated the straw1 tunable too
> > (we also tried straw2 but we quickly backed out of that change).
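To be precise about the tunable changes mentioned above, they were approximately the following; I'm quoting the commands from memory, so treat the exact invocations as approximate:

  ceph osd crush set-tunable straw_calc_version 1   # the straw1 calculation fix we activated and kept
  ceph osd crush tunables hammer                    # enables straw2; this is what we tried and reverted

As far as we understand, straw_calc_version mainly changes how straw bucket weights are computed the next time the map is edited, while switching the tunable profile remaps data immediately, which is presumably part of what kicked off all this movement.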
> >
> > During recovery, OSDs started crashing on all of our machines,
> > the issue being OSD RAM usage that goes very high, e.g.:
> >
> > 24078 root      20   0 27.784g 0.026t  10888 S   5.9 84.9  16:23.63 /usr/bin/ceph-osd --cluster=ceph -i 41 -f
> > /dev/sda1       2.7T  2.2T  514G  82% /var/lib/ceph/osd/ceph-41
> >
> > That's about 8 GB of resident RAM per TB of disk, way above
> > the ~2-4 GB of RAM per TB we provisioned.
> >
> > We rebuilt 0.94.5 with the three memory-related commits below, but
> > it didn't change anything.
> >
> > Right now our cluster is unable to fully restart and recover with the
> > machines and RAM we have been working with for the past year.
> >
> > Any idea on what to look for?
> >
> > Thanks in advance,
> >
> > Sincerely,
> >
> > Laurent
> >
> > commit 296bec72649884447b59e785c345c53994df9e09
> > Author: xiexingguo <258156334@xxxxxx>
> > Date:   Mon Oct 26 18:38:01 2015 +0800
> >
> >     FileStore: potential memory leak if _fgetattrs fails
> >
> >     Memory leak happens if _fgetattrs encounters some error and simply
> >     returns.
> >     Fixes: #13597
> >     Signed-off-by: xie xingguo <xie.xingguo@xxxxxxxxxx>
> >
> >     (cherry picked from commit ace7dd096b58a88e25ce16f011aed09269f2a2b4)
> >
> > commit 16aa14ab0208df568e64e2a4f7fe7692eaf6b469
> > Author: Xinze Chi <xmdxcxz@xxxxxxxxx>
> > Date:   Sun Aug 2 18:36:40 2015 +0800
> >
> >     bug fix: osd: do not cache unused buffer in attrs
> >
> >     attrs only reference the origin bufferlist (decoded from an MOSDPGPush or
> >     ECSubReadReply message) whose size is much greater than the attrs needed
> >     in recovery. If the obc caches it (get_obc may cache the attr), the whole
> >     origin bufferlist will not be freed until the obc is evicted from the obc
> >     cache, so rebuild the bufferlist before caching it.
> >
> >     Fixes: #12565
> >     Signed-off-by: Ning Yao <zay11022@xxxxxxxxx>
> >     Signed-off-by: Xinze Chi <xmdxcxz@xxxxxxxxx>
> >
> >     (cherry picked from commit c5895d3fad9da0ab7f05f134c49e22795d5c61f3)
> >
> > commit 51ea1ca7f4a7763bfeb110957cd8a6f33b8a1422
> > Author: xiexingguo <258156334@xxxxxx>
> > Date:   Thu Oct 29 20:04:11 2015 +0800
> >
> >     Objecter: pool_op callback may hang forever.
> >
> >     pool_op callback may hang forever due to an osdmap update during reply
> >     handling.
> >     Fixes: #13642
> >     Signed-off-by: xie xingguo <xie.xingguo@xxxxxxxxxx>
> >
> >     (cherry picked from commit 00c6fa9e31975a935ed2bb33a099e2b4f02ad7f2)

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com