On 04/16/2012 05:53 PM, Idan Kedar wrote: > _ore_get_io_state is supposed to allocate a struct ore_io_state, which is > variable length. Setting aside the uglification argument which gave birth > to the patch "pnfs-obj: Uglify objio_segment allocation for the sake of > the principle :-(", the above function checks if the memory it needs to > allocate is bigger than a page, and if so, separates it into two > allocations. why is this done? > > from include/linux/slab.h: >> /* >> * The largest kmalloc size supported by the slab allocators is >> * 32 megabyte (2^25) or the maximum allocatable page order if that is >> * less than 32 MB. >> * >> * WARNING: Its not easy to increase this value since the allocators have >> * to do various tricks to work around compiler limitations in order to >> * ensure proper constant folding. >> */ >> #define KMALLOC_SHIFT_HIGH ((MAX_ORDER + PAGE_SHIFT - 1) <= 25 ? \ >> (MAX_ORDER + PAGE_SHIFT - 1) : 25) >> >> #define KMALLOC_MAX_SIZE (1UL << KMALLOC_SHIFT_HIGH) >> #define KMALLOC_MAX_ORDER (KMALLOC_SHIFT_HIGH - PAGE_SHIFT) > > If kmalloc can allocate 32MB, why check one page, like so? > > from fs/exofs/ore.c: >> struct __alloc_all_io_state { >> struct ore_io_state ios; >> struct ore_per_dev_state per_dev[numdevs]; >> union { >> struct osd_sg_entry sglist[sgs_per_dev * numdevs]; >> struct page *pages[num_par_pages]; >> }; >> } *_aios; >> >> if (likely(sizeof(*_aios) <= PAGE_SIZE)) { >> _aios = kzalloc(sizeof(*_aios), GFP_KERNEL); >> if (unlikely(!_aios)) { >> EXOFS_DBGMSG("Failed kzalloc bytes=%zd\n", >> sizeof(*_aios)); >> *pios = NULL; >> return -ENOMEM; >> } >> pages = num_par_pages ? _aios->pages : NULL; >> sgilist = sgs_per_dev ? _aios->sglist : NULL; >> ios = &_aios->ios; >> } else { >> struct __alloc_small_io_state { >> struct ore_io_state ios; >> struct ore_per_dev_state per_dev[numdevs]; >> } *_aio_small; >> union __extra_part { >> struct osd_sg_entry sglist[sgs_per_dev * numdevs]; >> struct page *pages[num_par_pages]; >> } *extra_part; >> _aio_small = kzalloc(sizeof(*_aio_small), GFP_KERNEL); >> if (unlikely(!_aio_small)) { >> EXOFS_DBGMSG("Failed alloc first part bytes=%zd\n", >> sizeof(*_aio_small)); >> *pios = NULL; >> return -ENOMEM; >> } >> extra_part = kzalloc(sizeof(*extra_part), GFP_KERNEL); >> if (unlikely(!extra_part)) { >> EXOFS_DBGMSG("Failed alloc second part bytes=%zd\n", >> sizeof(*extra_part)); >> kfree(_aio_small); >> *pios = NULL; >> return -ENOMEM; >> } >> pages = num_par_pages ? extra_part->pages : NULL; >> sgilist = sgs_per_dev ? extra_part->sglist : NULL; >> /* In this case the per_dev[0].sgilist holds the pointer to >> * be freed >> */ >> ios = &_aio_small->ios; >> ios->extra_part_alloc = true; >> } > > Is there any point to check if the memory is greater than 32MB? > > In theory it can allocate 32MB, in slab. I'm not sure about slob and slub. But in practice contiguous physical pages allocation tends to fail very fast on a system that was up a couple of hours. So we avoid it as plage. Past testing with tables bigger than PAGE_SIZE on the IO path gave catastrophic results. (Again once the system is up for a while and had a chance to fragment physical address space) The all Kernel point of the use of sg-lists is so not to allocate contiguous physical pages and to not have to use virtual-memory. This is done all over the Kernel. MAX_BIO_SIZE max-sg-table ... (BTW I saw this mail by chance. If you direct it to me I see it for sure) Cheers Boaz -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html