Table 1: Experimental frameworks

Hyper-threading disabled. L1/L2 = 32/256 KB per core; L3 = 20 MB per socket; bandwidth 59.7 GB/s.
Compiler: Intel ICC 16.0.0. Options: -O3 -xHost -fp-model double -fp-model strict -funroll-all-loops.
Libraries: Intel OpenMP 5.

Environment B (Intel Xeon Phi accelerator)
Processor: Intel Xeon Phi 7120 accelerator, 60 cores, 4 threads per core. L1/L2 = 32/512 KB per core; bandwidth 352 GB/s.
Compiler: Intel ICC 16.0.0. Options: -O3 -mmic -fp-model double -fp-model strict -funroll-all-loops.
Libraries: Intel OpenMP 5, Intel MKL 11.3.

Environment C (Occigen supercomputer)
Processor: Intel Xeon E5-2690 v3 (12 cores per socket). L3 = 30 MB per socket; bandwidth 68 GB/s.
Compiler: Intel ICC 15.0.0. Options: -O3 -xHost -fp-model double -fp-model strict -funroll-all-loops.
Libraries: Intel OpenMP 5.

Environment D
Processor: Intel Core i7-4500U, 2 cores, 4 threads. L1/L2 = 32/256 KB per core; L3 = 4 MB, shared; bandwidth 25.6 GB/s.
Compiler: Intel ICC 16.0.0. Options: -O3 -xHost -fp-model double -fp-model strict -funroll-all-loops.
Libraries: Intel OpenMP 5.
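The options listed in Table 1 can be assembled into a full compile line. The sketch below is illustrative only: the source and binary file names are assumptions, and -qopenmp (the ICC flag that links the Intel OpenMP runtime) is added because the table lists Intel OpenMP among the libraries.

```shell
# Hypothetical host compile line (Environments A, C, D; file names assumed):
icc -O3 -xHost -fp-model double -fp-model strict -funroll-all-loops \
    -qopenmp -o bench bench.c

# For the Xeon Phi (Environment B), -xHost is replaced by -mmic,
# which cross-compiles natively for the coprocessor:
icc -O3 -mmic -fp-model double -fp-model strict -funroll-all-loops \
    -qopenmp -o bench.mic bench.c
```

Note that -fp-model strict disables value-unsafe floating-point optimizations (reassociation, FMA contraction by default), which is what makes the per-run results comparable across these platforms.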