Table 1: Experimental frameworks

Environment A:
Hyper-threading disabled. L1/L2 = 32/256 KB per core; L3 = 20 MB per socket; bandwidth 59.7 GB/s. Compiler: Intel ICC 16.0.0, options -O3 -xHost -fp-model double -fp-model strict -funroll-all-loops. Libraries: Intel OpenMP.

Environment B (Intel Xeon Phi accelerator):
Processor: Intel Xeon Phi 7120, 60 cores, 4 threads per core. L1/L2 = 32/512 KB per core; bandwidth 352 GB/s. Compiler: Intel ICC 16.0.0, options -O3 -mmic -fp-model double -fp-model strict -funroll-all-loops. Libraries: Intel OpenMP, Intel MKL 11.3.

Environment C (Occigen supercomputer):
Processor: Intel Xeon E5-2690 v3 (12 cores per socket). L3 = 30 MB per socket; bandwidth 68 GB/s. Compiler: Intel ICC 15.0.0, options -O3 -xHost -fp-model double -fp-model strict -funroll-all-loops. Libraries: Intel OpenMP.

Environment D:
Processor: Intel Core i7-4500U, 2 cores, 4 threads. L1/L2 = 32/256 KB per core; L3 = 4 MB shared; bandwidth 25.6 GB/s. Compiler: Intel ICC 16.0.0, options -O3 -xHost -fp-model double -fp-model strict -funroll-all-loops. Libraries: Intel OpenMP.
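As a concrete illustration of the build configuration listed in Table 1, the compile line below is a sketch for environments A, C and D (environment B, the Xeon Phi, swaps -xHost for -mmic). The source and binary names and the -qopenmp flag are illustrative assumptions, not taken from the table:

```shell
# Sketch of the Table 1 build line (environments A, C, D).
# -fp-model double -fp-model strict keeps IEEE double semantics and
# disables value-unsafe floating-point optimizations, which matters
# when measuring reproducible summation.
icc -O3 -xHost -fp-model double -fp-model strict \
    -funroll-all-loops -qopenmp \
    -o rblas_bench rblas_bench.c
```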

A Reference Implementation for Extended and Mixed Precision BLAS Library, 2017.

Basic Linear Algebra Subprograms (BLAS), 2017.

J. D. Bruguera and T. Lang, Floating-point fused multiply-add: reduced latency for floating-point addition, 17th IEEE Symposium on Computer Arithmetic (ARITH'05), pp. 42-51, 2005.

D. Chapp, T. Johnston, and M. Taufer, On the Need for Reproducible Numerical Accuracy through Intelligent Runtime Selection of Reduction Algorithms at the Extreme Scale, 2015 IEEE International Conference on Cluster Computing, pp.166-175, 2015.

C. Chohra, P. Langlois, and D. Parello, In: SCAN: Scientific Computing, Computer Arithmetic and Validated Numerics, 2014.

C. Chohra, P. Langlois, and D. Parello, In: SYNASC: Symbolic and Numeric Algorithms for Scientific Computing, 2016.

C. Chohra, P. Langlois, and D. Parello, Reproducible, Accurately Rounded and Efficient BLAS, REPPAR: Reproducibility in Parallel Computing, 2016.
URL : https://hal.archives-ouvertes.fr/lirmm-01280324

S. Collange, D. Defour, S. Graillat, and R. Iakymchuk, Numerical Reproducibility for the Parallel Reduction on Multi- and Many-core Architectures, In: Parallel Computing, 2015.

T. J. Dekker, A Floating-Point Technique for Extending the Available Precision, In: Numer. Math., vol. 18, pp. 224-242, 1971.

J. W. Demmel and H. D. Nguyen, Fast Reproducible Floating-Point Summation, Proc. 21st IEEE Symposium on Computer Arithmetic, 2013.

J. W. Demmel and H. D. Nguyen, Toward Hardware Support for Reproducible Floating-Point Computation, 2014.

J. W. Demmel and H. D. Nguyen, Parallel Reproducible Summation, In: IEEE Transactions on Computers, vol. 64, issue 7, pp. 2060-2070, 2015.

J. Demmel, P. Ahrens, and H. Nguyen, Efficient Reproducible Floating Point Summation and BLAS, 2016.

J. Demmel, Y. Hida, W. Kahan, X. S. Li, et al., Error bounds from extra-precise iterative refinement, In: ACM Transactions on Mathematical Software (TOMS), vol. 32, issue 2, pp. 325-351, 2006.

MKL_DYNAMIC, 2016.

A. Edelman, Eigenvalues and Condition Numbers of Random Matrices, In: SIAM Journal on Matrix Analysis and Applications, vol.9, pp.543-560, 1988.

ExBLAS: Exact BLAS, 2015.

G. H. Golub and C. F. Van Loan, Matrix Computations, second edition, 1989.

S. Graillat, C. Lauter, P. Tang, N. Yamanaka, and S. Oishi, Efficient Calculations of Faithfully Rounded L2-Norms of n-Vectors, In: ACM Trans. Math. Softw, vol.41, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01511120

J. Hao, Using KMP_AFFINITY to Create OpenMP* Thread Mapping to OS proc IDs

Y. He and C. H. Q. Ding, Using Accurate Arithmetics to Improve Numerical Reproducibility and Stability in Parallel Applications, Proceedings of the 14th International Conference on Supercomputing (ICS '00), pp. 225-234, 2000.

N. J. Higham, Accuracy and Stability of Numerical Algorithms

J. D. Hogg, A Fast Dense Triangular Solve in CUDA, In: SIAM Journal on Scientific Computing, vol.35, pp.303-322, 2013.

IEEE Task P754, IEEE 754-2008: Standard for Floating-Point Arithmetic, IEEE, 2008.

R. Iakymchuk, D. Defour, S. Collange, and S. Graillat, Reproducible Triangular Solvers for High-Performance Computing, 2015 12th International Conference on Information Technology-New Generations, pp.353-358, 2015.
URL : https://hal.archives-ouvertes.fr/lirmm-01206371

Intel, New Microarchitecture for 4th Gen Intel Core Processor Platforms, Tech. rep.

Intel, Intel® 64 and IA-32 Architectures Optimization Reference Manual.

Intel® Math Kernel Library (Intel® MKL) | Intel® Software.

Intrinsics, 2016.

Introduction to Conditional Numerical Reproducibility (CNR).

D. E. Knuth, The Art of Computer Programming, vol. 2: Seminumerical Algorithms, third edition, Addison-Wesley, Reading, MA, 1998.

U. Kulisch and V. Snyder, The Exact Dot Product As Basic Tool for Long Interval Arithmetic, In: Computing, vol.91, pp.307-313, 2011.

P. Langlois, R. Nheili, and C. Denis, Recovering Numerical Reproducibility in Hydrodynamic Simulations, 2016 IEEE 23rd Symposium on Computer Arithmetic (ARITH), pp. 63-70, 2016.
URL : https://hal.archives-ouvertes.fr/lirmm-01274671

X. S. Li, J. W. Demmel, D. H. Bailey, G. Henry, Y. Hida, et al., Design, Implementation and Testing of Extended and Mixed Precision BLAS, In: ACM Transactions on Mathematical Software, vol. 28, issue 2, pp. 152-205, 2002.

R. Lucas, J. Ang, K. Bergman, S. Borkar, W. Carlson, et al., Top ten exascale research challenges, DOE ASCAC Subcommittee Report, 2014.

M. Vlăduţiu, On the Design of Floating Point Units for Interval Arithmetic, 2007.

J.-M. Muller, N. Brisebarre, F. de Dinechin, C.-P. Jeannerod, V. Lefèvre, G. Melquiond, et al., Handbook of Floating-Point Arithmetic, Birkhäuser Boston, 2010.
URL : https://hal.archives-ouvertes.fr/ensl-00379167

R. M. Neal, Fast exact summation using small and large superaccumulators, 2015.

H. D. Nguyen, J. Demmel, and P. Ahrens, ReproBLAS: Reproducible BLAS.

S. F. Oberman, Design Issues in High Performance Floating Point Arithmetic Units, 1997.

T. Ogita, S. M. Rump, and S. Oishi, Accurate sum and dot product, In: SIAM J. Sci. Comput, vol.26, pp.1955-1988, 2005.

M. L. Overton, Numerical Computing with IEEE Floating Point Arithmetic: Including One Theorem, One Rule of Thumb, and One Hundred and One Exercises, Society for Industrial and Applied Mathematics, 2001.

K. Ozaki, T. Ogita, S. Oishi, and S. M. Rump, Generalization of error-free transformation for matrix multiplication and its application, In: Nonlinear Theory and Its Applications, IEICE, pp. 2-11.

D. M. Priest, Algorithms for Arbitrary Precision Floating Point Arithmetic, Proc. 10th IEEE Symposium on Computer Arithmetic, pp.132-143, 1991.

D. M. Priest, On Properties of Floating Point Arithmetics: Numerical Stability and the Cost of Accurate Computations, Ph.D. thesis, University of California, Berkeley, 1992.

R. Rahman, Intel® Xeon Phi™ Coprocessor Architecture and Tools: The Guide for Application Developers, 2013.

R. Reed, An Introduction to the Intel® Xeon Phi™ Coprocessor, 2013.

R. W. Robey, J. M. Robey, and R. Aulwes, In search of numerical consistency in parallel programming, In: Parallel Computing, vol. 37, 2011.

S. M. Rump, Ultimately fast accurate summation, In: SIAM J. Sci. Comput, vol.31, pp.3466-3502, 2009.

S. M. Rump, , 2014.

S. M. Rump, T. Ogita, and S. Oishi, Accurate floating-point summation, Part I: Faithful rounding, In: SIAM J. Sci. Comput., vol. 31, issue 1, pp. 189-224, 2008.

S. M. Rump, T. Ogita, and S. Oishi, Accurate floating-point summation, Part II: Sign, K-fold faithful and rounding to nearest, In: SIAM J. Sci. Comput., vol. 31, pp. 1269-1302, 2008.

A. H. Sameh and R. P. Brent, Solving Triangular Systems on a Parallel Computer, In: SIAM Journal on Numerical Analysis, vol. 14, pp. 1101-1113, 1977.

J. R. Shewchuk, Adaptive precision floating-point arithmetic and fast robust geometric predicates, ACM Symposium on Computational Geometry, 1996.

R. D. Skeel, Scaling for Numerical Stability in Gaussian Elimination, In: J. ACM, vol.26, issue.3, pp.494-526, 1979.

Specifying Code Branches | Intel® Software - Intel® Developer Zone.

P. H. Sterbenz, Floating-Point Computation, Prentice-Hall, 1974.

M. Taufer, O. Padron, P. Saponaro, and S. Patel, Improving numerical reproducibility and stability in large-scale numerical simulations on GPUs, 2010 IEEE International Symposium on Parallel Distributed Processing (IPDPS), pp.1-9, 2010.

The GNU MPFR Library, 2004.

The OpenMP API Specification for Parallel Programming, 2017.

Threading Building Blocks, 2017.

R. Todd, Run-to-Run Numerical Reproducibility with the Intel® Math Kernel Library and Intel® Composer XE 2013, 2013.

O. Villa, D. Chavarria-miranda, V. Gurumoorthi, A. Márquez, and S. Krishnamoorthy, Effects of floating-point non-associativity on numerical computations on massively multithreaded systems, Proceedings of Cray User Group Meeting (CUG), 2009.

D. Viswanath and L. N. Trefethen, Condition Numbers of Random Triangular Matrices, In: SIAM Journal on Matrix Analysis and Applications, vol. 19, pp. 564-581, 1998. DOI: 10.1137/S0895479896312869.

N. Yamanaka, T. Ogita, S. M. Rump, and S. Oishi, A parallel algorithm for accurate dot product, In: Parallel Comput., vol. 34, issue 6-8, 2008.

Y. Zhu and W. B. Hayes, Correct rounding and hybrid approach to exact floating-point summation, In: SIAM J. Sci. Comput., vol. 31, 2009.

Y. Zhu and W. B. Hayes, Algorithm 908: Online Exact Summation of Floating-Point Streams, In: ACM Trans. Math. Software, vol.37, p.13, 2010.

Y. Zhu, J. Yong, and G. Zheng, A New Distillation Algorithm for Floating-Point Summation, In: SIAM Journal on Scientific Computing, vol.26, pp.2066-2078, 2005.
URL : https://hal.archives-ouvertes.fr/inria-00517618

CUDA Toolkit Documentation - NVIDIA Documentation.