On Thu, Jul 14, 2016 at 12:29 PM, Mark Nelson <mark.a.nelson@xxxxxxxxx> wrote:
>
>
> On 07/14/2016 10:35 AM, Karl Cronburg wrote:
>>
>> Hello all,
>>
>> I have written a new version of fiologparser and could use some feedback
>> from anyone who uses the tool regularly. It is on the fiologparser branch
>> of my fork of fio:
>>
>> https://github.com/cronburg/fio/blob/fiologparser/tools/fiologparser.py
>>
>> It uses numpy and cython to speed up the data parsing and percentile
>> calculations, while decreasing the memory footprint and making it
>> scalable.
>
>
> I was really trying to avoid the extra dependencies. :/ That's why Ben's
> last commit was reworked so we could avoid adding numpy as a dependency in
> the fio distribution (not that fio itself would need it, but it would be
> really swell if none of the tools needed anything other than stock python).
>
> How much improvement are you seeing and where is the speedup coming from? I

All the speedup comes from using dedicated parsing libraries (pandas),
numpy arrays (with vector math) instead of python lists, and streaming
smaller portions of the data into memory at a time. On a single 39 MB
latency log file it takes ~a tenth of a second per interval to calculate
percentiles for each interval, whereas the version in wip-interval (with
a slight modification to stop a NoneType error) takes 5+ seconds per
interval.

> suspect we can get similar gains without needing to bring it in. For now
> though I really think we need to worry about correctness rather than speed.

I would think someone running python on a large data set would be more
than willing to install something as widely used and supported as numpy.
If anything, numpy gives us better correctness guarantees and less
technical debt. Who else is using / wants the percentiles fiologparser
is calculating?
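A minimal sketch of the chunked-parsing approach described above - streaming a latency log through pandas in chunks and using numpy's vectorized percentile routine per interval. The column names, chunk size, and helper name here are illustrative (based on fio's standard `time, value, direction, blocksize` log layout), not fiologparser's actual code:

```python
# Sketch: stream a fio latency log in chunks with pandas and compute
# per-interval percentiles with numpy, instead of accumulating python
# lists. Column names and chunk size are illustrative assumptions.
import io

import numpy as np
import pandas as pd

# Tiny in-memory stand-in for a fio latency log (time_ms, value, dir, bs).
LOG = io.StringIO(
    "500, 120, 0, 4096\n"
    "1500, 130, 0, 4096\n"
    "2500, 110, 0, 4096\n"
    "3500, 150, 0, 4096\n"
)

def interval_percentiles(log, interval_ms=2000, pcts=(50, 90, 99)):
    """Yield (interval_index, percentile_array) for each time interval."""
    cols = ["time_ms", "value", "dir", "bs"]
    for chunk in pd.read_csv(log, names=cols, chunksize=100000,
                             skipinitialspace=True):
        # Bucket samples by completion-time interval, then hand each
        # bucket to numpy's vectorized percentile routine.
        buckets = chunk["time_ms"] // interval_ms
        for idx, grp in chunk.groupby(buckets):
            yield idx, np.percentile(grp["value"].to_numpy(), pcts)

for idx, p in interval_percentiles(LOG):
    print(idx, p)
```

Note this sketch ignores intervals that straddle a chunk boundary; a real implementation would have to carry partial intervals over to the next chunk.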
If enough users are put off by the dependencies, I understand - the data
scientist in me is just really sad that the exact thing numpy was built
for is being re-engineered.

> As such, does this version use a similar technique as what's in
> wip-interval?

Yes - weighted percentiles are calculated using the fraction of a sample
falling in the given interval.

>
> https://github.com/markhpc/fio/commits/wip-interval
>
> If so, we should probably merge that first and make sure we have test cases
> for it to make sure the idea is sound. If this is something new, it would
> be good to showcase that it's as good or better than wip-interval. I could
> easily be convinced that it is, but let's demonstrate it.

Will do - I'll make some test cases to see where wip-interval differs
from this implementation. If anything, I figure it's good to maintain
both for now to see where our thinking differs - just looking at your
code I wasn't even sure what numbers should be calculated, but now that
I've implemented weighted latency percentiles myself I have a better
idea of what we want.

-Karl-

>
> Mark
>
>
>>
>> -Karl Cronburg-
>> --
>> To unsubscribe from this list: send the line "unsubscribe fio" in
>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>
>
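The weighted-percentile idea described in the thread - each sample weighted by the fraction of its duration that falls inside the interval, with percentiles read off the weighted CDF - can be sketched as follows. The function names and signatures are illustrative, not fiologparser's actual API:

```python
# Sketch of weighted percentiles over interval-overlapping samples.
# Names here are illustrative, not fiologparser's actual API.
import numpy as np

def weighted_percentile(values, weights, pct):
    """Percentile of `values` where each value carries `weights` mass."""
    order = np.argsort(values)
    v = np.asarray(values, dtype=float)[order]
    w = np.asarray(weights, dtype=float)[order]
    cdf = np.cumsum(w) / np.sum(w)  # weighted CDF in (0, 1]
    # First value whose cumulative weight reaches the requested fraction.
    return float(v[np.searchsorted(cdf, pct / 100.0)])

def interval_weight(start, end, ival_start, ival_end):
    """Fraction of a sample spanning [start, end) that overlaps
    the interval [ival_start, ival_end)."""
    overlap = max(0.0, min(end, ival_end) - max(start, ival_start))
    return overlap / (end - start)

# A sample covering [1.0, 3.0) contributes half its weight to [2.0, 4.0).
print(interval_weight(1.0, 3.0, 2.0, 4.0))  # -> 0.5
print(weighted_percentile([10, 20, 30], [0.5, 1.0, 0.25], 50))  # -> 20.0
```

Per the discussion above, a per-interval percentile would then be `weighted_percentile` applied to all samples overlapping that interval, with `interval_weight` supplying each sample's weight.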