On Sat, Jan 26, 2013 at 12:41 PM, Andrey Korolyov <andrey@xxxxxxx> wrote:
> On Sat, Jan 26, 2013 at 3:40 AM, Sam Lang <sam.lang@xxxxxxxxxxx> wrote:
>> On Fri, Jan 25, 2013 at 10:07 AM, Andrey Korolyov <andrey@xxxxxxx> wrote:
>>> Sorry, I wrote too little yesterday because I was sleepy.
>>> This is obviously cache pressure, since dropping caches made these
>>> errors disappear for a long period. I'm not very familiar with kernel
>>> memory mechanisms, but shouldn't the kernel first try to allocate
>>> memory on the second node, if that is not prohibited by the process's
>>> cpuset, and only then report an allocation failure (as far as can be
>>> seen, only node 0 is involved in the failures)? I really have no idea
>>> where NUMA awareness might come into play for the osd daemons.
>>
>> Hi Andrey,
>>
>> You said that the allocation failure doesn't occur if you flush
>> caches, but the kernel should evict pages from the cache as needed so
>> that the osd can allocate more memory (unless they're dirty, and it
>> doesn't look like you have many dirty pages in this case). It looks
>> like you have plenty of reclaimable pages as well. Does the osd
>> remain running after that error occurs?
>
> Yes, it keeps running flawlessly without even a change in the osdmap,
> but unfortunately logging wasn't turned on at that moment. As soon as I
> finish the massive test for the ``suicide timeout'' bug I'll check your
> idea with dd and also rerun the test below with ``debug osd = 20''.
>
> My thought is that the kernel has ready-to-be-freed memory on node1,
> but for some strange reason the osd process tries to reserve pages from
> node0 (where it obviously allocated its memory at start, since node1's
> memory only starts at high addresses above 32G), and the kernel then
> refuses to free cache on that specific node (it's quite murky, at least
> to me, why the kernel does not simply invalidate some buffers, even if
> they are more preferable to keep in RAM than the tail of the LRU ones).
>
> Allocation looks like the following on most nodes:
> MemTotal:       66081396 kB
> MemFree:          278216 kB
> Buffers:           15040 kB
> Cached:         62422368 kB
> SwapCached:            0 kB
> Active:          2063908 kB
> Inactive:       60876892 kB
> Active(anon):     509784 kB
> Inactive(anon):       56 kB
> Active(file):    1554124 kB
> Inactive(file): 60876836 kB
>
> OSD-node free memory, with two osd processes on each node (libvirt
> prints the ``Free'' field there):
>
> 0:       207500 KiB
> 1:        72332 KiB
> --------------------
> Total:   279832 KiB
>
> 0:       208528 KiB
> 1:        80692 KiB
> --------------------
> Total:   289220 KiB
>
> Since it is known that the kernel reserves more memory on the node with
> higher memory pressure, this seems legitimate - the osd processes work
> mostly with node 0's memory, so there is a bigger gap there than on
> node 1, which holds almost nothing but fs cache.
>
> Ahem, once the same trace was produced on an almost empty node by a
> qemu process (which was actually pinned to a specific NUMA node), so it
> seems this is generally a scheduler/mm bug, not directly related to the
> osd processes. In other words, the smaller the share of memory that is
> actually RSS, the higher the probability of such an allocation failure.
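
For completeness, this is roughly how I flush caches and look at the
per-node counters instead of the aggregate /proc/meminfo (plain
sysfs/numactl, nothing cluster-specific; the sysctl knobs at the end are
just the usual suspects for such allocation failures, I have not tuned
them yet):

  # flush page cache, dentries and inodes ("dropping caches" above)
  sync && echo 3 > /proc/sys/vm/drop_caches

  # per-node view of free memory and page cache, node0 vs node1
  numactl --hardware
  grep -E 'MemFree|FilePages|Dirty' /sys/devices/system/node/node*/meminfo

  # common knobs for allocation failures under cache pressure; not yet
  # confirmed whether touching them helps in this case
  sysctl vm.zone_reclaim_mode vm.min_free_kbytes
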
I have printed timestamps of failure events on selected nodes, just for
reference: http://xdel.ru/downloads/ceph-log/allocation-failure/stat.txt

>>
>> I wonder if you see the same error if you do a long write-intensive
>> workload on the local disk for the osd in question, maybe
>> dd if=/dev/zero of=/data/osd.0/foo
>>
>> -sam
>>
>>
>>>
>>> On Fri, Jan 25, 2013 at 2:42 AM, Andrey Korolyov <andrey@xxxxxxx> wrote:
>>>> Hi,
>>>>
>>>> Those traces happen only under constant heavy writes and seem to be
>>>> quite rare. OSD processes do not consume more memory after this
>>>> event and the peaks are not distinguishable by monitoring. I was
>>>> able to catch it during four hours of constant writes on the cluster.
>>>>
>>>> http://xdel.ru/downloads/ceph-log/allocation-failure/
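
P.S. For the dd test above, I plan to run something along these lines on
the data disk of the affected osd and watch the per-node counters while
it runs (block size and count are only an example, not what is actually
configured on this cluster):

  # long sequential write through the page cache on the osd data disk;
  # fdatasync at the end so everything is really pushed to the disk
  dd if=/dev/zero of=/data/osd.0/foo bs=4M count=100000 conv=fdatasync

That should keep the fs cache on the osd node full long enough to tell
whether the failure is tied to the osd itself or just to cache pressure
on node0.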