On Sat, Jan 26, 2013 at 3:40 AM, Sam Lang <sam.lang@xxxxxxxxxxx> wrote:
> On Fri, Jan 25, 2013 at 10:07 AM, Andrey Korolyov <andrey@xxxxxxx> wrote:
>> Sorry, I wrote too little yesterday because I was sleepy. That's
>> obviously cache pressure, since dropping caches made these errors
>> disappear for a long period. I'm not very familiar with kernel memory
>> mechanisms, but shouldn't the kernel first try to allocate memory on
>> the second node, if that isn't prohibited by the process's cpuset, and
>> only then report an allocation failure (as can be seen, only node 0 is
>> involved in the failures)? I really have no idea how NUMA awareness
>> comes into play for the osd daemons.
>
> Hi Andrey,
>
> You said that the allocation failure doesn't occur if you flush
> caches, but the kernel should evict pages from the cache as needed so
> that the osd can allocate more memory (unless they're dirty, but it
> doesn't look like you have many dirty pages in this case). It looks
> like you have plenty of reclaimable pages as well. Does the osd
> remain running after that error occurs?

Yes, it keeps running flawlessly, without even a single bit changing in
the osdmap, but unfortunately logging wasn't turned on at that moment.
As soon as I finish the massive test for the ``suicide timeout'' bug,
I'll check your idea with dd and also rerun the test below with
``debug osd = 20''.

My thought is that the kernel has ready-to-be-freed memory on node 1,
but for some strange reason the osd process tries to reserve pages on
node 0 (where it obviously allocated memory on start, since node 1's
memory starts only at high addresses, above 32G), and the kernel then
refuses to free cache on that specific node (it's quite murky, at least
to me, why the kernel does not just invalidate some buffers, even ones
more deserving to stay in RAM than those at the tail of the LRU).

Allocation looks like the following on most of the nodes:

MemTotal:       66081396 kB
MemFree:          278216 kB
Buffers:           15040 kB
Cached:         62422368 kB
SwapCached:            0 kB
Active:          2063908 kB
Inactive:       60876892 kB
Active(anon):     509784 kB
Inactive(anon):       56 kB
Active(file):    1554124 kB
Inactive(file): 60876836 kB

OSD-node free memory, with two osd processes on each node (libvirt
prints the ``Free'' field there):

0:            207500 KiB
1:             72332 KiB
------------------------
Total:        279832 KiB

0:            208528 KiB
1:             80692 KiB
------------------------
Total:        289220 KiB

Since the kernel is known to reserve more memory on the node with
higher memory pressure, this looks consistent: the osd processes work
mostly with node 0's memory, so there is a bigger gap there than on
node 1, which holds almost nothing but fs cache.
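
For what it's worth, the per-node state can also be checked straight
from sysfs, without going through libvirt; a quick sketch (the exact
field set in the per-node meminfo varies a bit between kernel
versions):

    # print free memory and page-cache counters for every NUMA node
    for n in /sys/devices/system/node/node*; do
        echo "== ${n##*/} =="
        grep -E 'MemFree|FilePages|Inactive\(file\)' "$n/meminfo"
    done

numastat -m (from the numactl package) gives a similar per-node
breakdown. If node 0 exhaustion really is the cause, one untested idea
would be to start the daemons with interleaved allocation, e.g.
"numactl --interleave=all ceph-osd -i 0", so their anonymous memory is
spread across both nodes instead of piling up on node 0.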
> I wonder if you see the same error if you do a long write-intensive
> workload on the local disk for the osd in question, maybe
> dd if=/dev/zero of=/data/osd.0/foo
>
> -sam
>
>
>>
>> On Fri, Jan 25, 2013 at 2:42 AM, Andrey Korolyov <andrey@xxxxxxx> wrote:
>>> Hi,
>>>
>>> Those traces happen only under constant heavy writes and seem to be
>>> quite rare. OSD processes do not consume more memory after this
>>> event, and the peaks are not distinguishable in monitoring. I was
>>> able to catch it during four hours of constant writes on the
>>> cluster.
>>>
>>> http://xdel.ru/downloads/ceph-log/allocation-failure/
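
For the dd run, I'll probably use something along these lines; the
block size, count and sync flag are just my choice, not from Sam's
mail:

    # sustained sequential write against the osd's data disk;
    # 64G of zeroes goes through the page cache, which should create
    # the same cache pressure as the cluster test; conv=fdatasync
    # forces the data to the disk before dd exits
    dd if=/dev/zero of=/data/osd.0/foo bs=1M count=65536 conv=fdatasync

If that alone triggers the allocation failure on node 0, it would rule
out the ceph messaging path and point at plain page-cache reclaim.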