Iñaki Ucar <iucar@xxxxxxxxxxxxxxxxx> writes:

> So here you are exclusively talking about BLIS, because other
> libraries implement BLAS + some parts of LAPACK. I may agree that
> doing the latter is a bad idea given that the reference implementation
> offers two distinct libraries. But that's the landscape we have
> unfortunately.

No, I'm talking more generally. That's just an example.

> Being able to switch LAPACK independently while avoiding symbol
> collisions is an interesting feature that could be added in FlexiBLAS
> (at least I'm gonna propose it upstream). That's certainly not in the
> feature set right now, but it's not a possibility either without
> FlexiBLAS. E.g., many applications in Fedora are linked against
> OpenBLAS serial. How would you change BLAS and LAPACK in those?

It's not just a possibility, it just works with dynamic libraries. The fact that things are linked against a specific library rather than effectively a generic ABI is the problem, but I can define a shim to subvert them, the same as I did for OpenBLAS versus ATLAS, Rblas, etc., and the same as I'd have to do for FlexiBLAS if I wanted something different.

Now, there are cases where there's the same API provided by different implementations which have a different ABI, MPI being the obvious example in this area. Some sort of multiplexing may be appropriate there -- I'm not sure. I've had trouble making the implementation work.

> It's very easy to say "there will be hoops to jump through". That is
> insufficient experience. I don't have any evidence suggesting that.
> But I acknowledge that software is very heterogeneous out there and
> sometimes you find unexpected problems. Those problems cannot be
> uncovered unless someone tries to make this happen. Following the same
> logic, many (most?) Change Proposals have insufficient experience.

I'm offering the experience of doing all this work from various different points of view.
I don't off-hand remember instances where particular problems have occurred, because I've done quite a lot of this. Isn't engineering experience valuable? No-one seems to be offering counter-experience to evaluate trade-offs, just assertions about what the requirements are.

>> I'm unclear on exactly how flexiblas currently works, but the above
>> seems to be saying that I must have a limited choice of BLAS and LAPACK
>> anyway. I can flip an environment variable already to subvert other
>> BLASs with BLIS. (I considered dynamically loading different
>> implementations when I tackled this initially.) Setting the environment
>> other than explicitly in a batch script isn't necessarily easy or robust
>> anyway, as opposed to ldconfig on sets of nodes.
>
> We have a more limited choice right now. See my comment above about
> applications linked against OpenBLAS, or ATLAS. With FlexiBLAS, there
> are less limitations and more flexibility (and I believe the specific
> use case you have in mind for BLIS could be discussed upstream).

I don't see how that's true. It's clearly more flexible to be able to swap in different implementations of an interface more-or-less trivially -- a one-liner -- than to have a fixed implementation with limited choice. Please realize that there's more to research computing than what's packaged in Fedora, and there's more than x86 or single-threaded stuff. (More of it could be there if it wasn't such a dispiriting business.)

>> See above about an independent choice of BLAS and LAPACK. What happens
>> if I want to use BLIS functions and link against a library that's using
>> OpenBLAS and don't know about whatever magic I need in the environment
>> to make it work?
>
> I'm not sure I'm following you anymore. You may also want to use 3
> functions from OpenBLAS serial, 4 functions from OpenBLAS OpenMP, 1
> from ATLAS, 5 from BLIS and the rest from whatever other library. But
> it doesn't scale.

That wouldn't make sense, and I'm worried by the suggestion.
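To make the "de facto ABI" point concrete, here is a minimal sketch (not code from any real BLAS) of what a drop-in implementation has to provide for one routine: the Fortran-style symbol `daxpy_` computing y := alpha*x + y, with every argument passed by pointer. It assumes the common LP64 convention (Fortran INTEGER as a 32-bit C `int`) and trailing-underscore name mangling; ILP64 builds and other toolchains differ. Any shared library exporting this symbol with this signature, installed under the soname the caller was linked against, can be substituted by the dynamic linker without relinking the caller.

```c
/* Naive reference-style DAXPY using the Fortran BLAS calling convention:
 *   y := alpha*x + y
 * Illustrative sketch only. Assumes LP64 (INTEGER == int) and a
 * trailing-underscore symbol name; a real library would be optimized. */
void daxpy_(const int *n, const double *alpha,
            const double *x, const int *incx,
            double *y, const int *incy)
{
    int nn = *n;
    if (nn <= 0 || *alpha == 0.0)
        return;                     /* nothing to do, as in reference BLAS */

    /* Negative increments start from the far end of the vector,
     * matching the reference BLAS convention (0-based here). */
    int ix = (*incx < 0) ? (1 - nn) * *incx : 0;
    int iy = (*incy < 0) ? (1 - nn) * *incy : 0;

    for (int i = 0; i < nn; i++) {
        y[iy] += *alpha * x[ix];
        ix += *incx;
        iy += *incy;
    }
}
```

Built as a shared object and preloaded (for example with `LD_PRELOAD`), this would take precedence over the `daxpy_` in whatever BLAS the program was linked against. A tracing shim works the same way: it exports the same signature, logs the call, and forwards to the next definition found with `dlsym(RTLD_NEXT, "daxpy_")`.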
> That's obviously a dramatization. My point is that it's almost
> impossible to cover all the very specialised HPC use cases out there.
> But I would argue that, if there's a way to cover them all, that could
> be achieved by adding features to FlexiBLAS, because it's the most
> general way to solve the issue of the implementation disparity in the
> BLAS/LAPACK landscape: just exposing a complete API and then
> internally rewiring those calls to the appropriate libraries given
> some configuration.

It isn't the most general way to replace things at runtime. The most general way is to substitute different implementations of the ABI with dynamic linking. Note there's a de facto ABI, not just an API. Suppose I want to speed up R by replacing the serial BLAS with a parallel one; that's fine, as in the reference I posted. You're saying I shouldn't have that choice because you're going to define what's serial and parallel in that case. By this logic, I apparently also shouldn't be able to substitute a shim to do tracing, or a malloc implementation to do profiling/debugging. You can't even distribute all the interesting implementations -- specifically CUDA, cf. AMD's stuff (hooray!).

I'd really like to hear from people with experience of doing this stuff seriously for a user population on heterogeneous hardware over a number of years.

>> So what's the basis for choosing the default?
>> https://github.com/flame/blis/blob/80e6c10b72d50863b4b64d79f784df7befedfcd1/docs/Performance.md
>> for instance? An OpenMP default that performs serially if you set
>> OMP_NUM_THREADS and OMP_PROC_BIND doesn't seem a sane default. Again
>> this seems to be saying I can switch things at will, despite what it
>> says above, and that the mechanism is robust.
>
> We support a number of architectures in which BLIS doesn't perform
> well, so I think we would agree that this rules out BLIS as a default.
> Then, both @jussilehtola and the authors of FlexiBLAS independently
> agree that the OpenMP version of OpenBLAS would be the best default.

I haven't seen measurements to support these statements. Where are they? It's not obvious from the measurements in the reference I posted that OpenBLAS is generally better than BLIS. But I'm not arguing for a fixed choice, because in my experience the best choice has been time- and (micro-)architecture-dependent as well as threading-dependent. What users get on specific nodes of a system should be determined by the system manager's installation, not by fragile environment variables. ldconfig provides that, even in a single image used over stateless nodes, assuming you have something like oneSIS to manage it. (I'm aware that's not something you'd be using Fedora for.) I haven't heard any real argument why that's the wrong thing to do, when it's proven robust in my use.

> For the record, I copy&paste below the author's assessment:
>
> | * OpenBLAS / Serial performance comparable to BLIS and MKL, requires
> |   compilation with the USE_LOCKING=1 flag set to be safe in
> |   multithreaded applications.
> | * OpenBLAS-PThread, fast, flexible and good automatic threading, lets
> |   OpenMP applications crash if BLAS is called from the inside of an
> |   OpenMP parallel region, preferred version on Debian/Ubuntu. Not usable
> |   on ppc64le architectures.

[Debian's choice of pthreads looked like a mistake to cover a specific case, the last I remember. I don't think it should be a default.]

> | * OpenBLAS-OpenMP. Around 2% slower than the Pthread version but
> |   less complicated with multithreaded applications, works on all
> |   platforms, good automatic threading. Can get in trouble with StarPU
> |   parallel applications, but I do not know of any application outside
> |   research that uses this.
>

OpenBLAS threading seems to have been a continual source of problems, and if you do the obvious thing, you get serial performance, as I said.
It's not clear to me from measurements that it performs better than BLIS with OpenMP. Again, I don't want to argue specifically for or against any one implementation.

> | So I prefer setting OpenBLAS-OpenMP as default on all systems I have
> |   under my control and colleagues do the same with their machines.
>
> You are arguing from a narrow niche.

Yes, if multiple aspects of research computing support using rpm packaging count as narrow, but that's the home of this stuff. (I realize linear algebra gets everywhere.) I'm specifically trying not to preclude things, because applications differ, and over the years research support has typically turned up a counter-example to the assertions I see.

> And it may be possible that I'm
> not following you correctly and it's my fault, but from your comments,
> my impression is that you didn't take the time to read how FlexiBLAS
> works (there are links in the change proposal; the documentation is
> extensive, and there are links in their home page to their papers, to
> a very informative presentation...).

I think I understand enough about how it works, though I only studied the previous version, and I did read a paper. Having no development history doesn't help. As I said, I considered implementing something like that; I was hoping ELF features would help more than they do, but there was a better way. I've also investigated a case where multiplexing might be appropriate, and another, like BLAS, where you can use the ABI, so this isn't entirely from ignorance. I just haven't seen anything to persuade me that a more general, simpler, more robust approach is wrong.
_______________________________________________
devel mailing list -- devel@xxxxxxxxxxxxxxxxxxxxxxx
To unsubscribe send an email to devel-leave@xxxxxxxxxxxxxxxxxxxxxxx
Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: https://lists.fedoraproject.org/archives/list/devel@xxxxxxxxxxxxxxxxxxxxxxx