Rules

The idea of the benchmark is to solve a system of linear equations to 64-bit floating-point accuracy by (a) computing a mixed-precision factorization of a matrix (an LU decomposition) and an approximate solution from the low-precision factors, and then (b) using an iterative method (such as GMRES) in 64-bit precision, starting from the approximate low-precision solution, to compute a final solution with the accuracy one would have achieved by LU decomposition in 64-bit floating-point arithmetic. The low-precision LU factors should be used as a preconditioner in the iterative algorithm. The lowest allowed precision is 16-bit floating point (available in the IEEE 754 standard since its 2008 revision). No theoretical analysis is available for solving linear systems with general square matrices in precisions of fewer than 16 bits. The restriction disallowing precisions below 16-bit floating point applies to results submitted after the June 2025 release of the HPL-MxP list.
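
As a rough, non-authoritative illustration of these two phases, here is a NumPy sketch in which 32-bit arithmetic stands in for "lower precision", a simple unpivoted LU loop stands in for the optimized factorization, and plain preconditioned iterative refinement stands in for the GMRES iteration the benchmark actually calls for; every function name here is hypothetical.

```python
import numpy as np

def lu_nopivot(A):
    """Unpivoted LU factorization (Doolittle); assumes A is diagonally dominant."""
    n = A.shape[0]
    LU = A.copy()
    for k in range(n - 1):
        LU[k+1:, k] /= LU[k, k]                                 # column of L
        LU[k+1:, k+1:] -= np.outer(LU[k+1:, k], LU[k, k+1:])    # Schur update
    return LU

def lu_solve(LU, b):
    """Solve (L U) x = b using the packed factors from lu_nopivot."""
    n = LU.shape[0]
    y = b.copy()
    for i in range(n):                 # forward solve with unit-diagonal L
        y[i] -= LU[i, :i] @ y[:i]
    x = y
    for i in range(n - 1, -1, -1):     # backward solve with U
        x[i] = (x[i] - LU[i, i+1:] @ x[i+1:]) / LU[i, i]
    return x

def mixed_precision_solve(A, b, max_iters=50, tol=1e-14):
    # Phase (a): factor a low-precision copy (fp32 here emulates "lower precision").
    LU_lo = lu_nopivot(A.astype(np.float32))
    x = lu_solve(LU_lo, b.astype(np.float32)).astype(np.float64)
    # Phase (b): refine in fp64, using the low-precision factors as a preconditioner.
    # (The benchmark calls for GMRES; simple iterative refinement is shown for brevity.)
    for _ in range(max_iters):
        r = b - A @ x                                  # fp64 residual
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            break
        x += lu_solve(LU_lo, r.astype(np.float32)).astype(np.float64)
    return x
```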

The benchmark should use the HPL benchmark harness with a modified matrix generator. The modified generator should produce a non-symmetric matrix that can be factored in a numerically stable fashion without requiring pivoting. See the Publications page for example generators, such as synthetic matrix products or mild scaling of the diagonal. The property of strict diagonal dominance may be used for testing purposes only. For example, explicitly setting each diagonal entry to the sum of the magnitudes of the off-diagonal elements in its row forces the matrix to be row diagonally dominant.
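
As a hypothetical illustration of that diagonal-dominance trick (suitable for testing only, and not a substitute for the generators referenced on the Publications page), one could fill a matrix with random off-diagonal entries and overwrite the diagonal with the row sums of absolute values:

```python
import numpy as np

def generate_test_matrix(n, seed=0):
    """Random non-symmetric matrix made row diagonally dominant (testing only)."""
    rng = np.random.default_rng(seed)
    A = rng.uniform(-0.5, 0.5, size=(n, n))
    # Set each diagonal entry to the sum of magnitudes of its row's off-diagonal
    # elements, which makes the matrix row diagonally dominant and lets an
    # unpivoted LU factorization proceed stably.
    np.fill_diagonal(A, 0.0)
    np.fill_diagonal(A, np.abs(A).sum(axis=1))
    return A
```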

In an attempt to obtain uniformity across all computers in performance reporting, the algorithm used in solving the low-precision system of equations in the benchmark procedure must numerically conform to an LU factorization. In particular, the operation count for the algorithm must be 2/3 n^3 + O(n^2) floating-point operations even though double-precision arithmetic is not required.

The HPL harness computes a backward error: ||Ax - b|| / (||A|| ||x|| + ||b||) / (n * ε), where ε is the machine precision in 64-bit floating-point arithmetic (on machines implementing IEEE 754 this is 2^-53) and n is the size of the problem. There is no restriction on the problem size n. However, if the backward error is greater than 16, the run is invalid.
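
A minimal sketch of this residual check follows; the use of the infinity norm and the helper name are assumptions on top of the formula above.

```python
import numpy as np

def hpl_backward_error(A, x, b):
    """Scaled residual ||Ax - b|| / ((||A|| ||x|| + ||b||) * n * eps).

    The infinity norm is assumed here; eps = 2**-53 is the 64-bit machine precision.
    """
    n = A.shape[0]
    eps = 2.0 ** -53
    r = np.linalg.norm(A @ x - b, np.inf)
    scale = (np.linalg.norm(A, np.inf) * np.linalg.norm(x, np.inf)
             + np.linalg.norm(b, np.inf)) * n * eps
    return r / scale

# A run is valid only if hpl_backward_error(A, x, b) <= 16.
```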

The implementation is allowed to do balancing and scaling to bring the numbers within the range of the floating-point format used for computing, but the time required for any such preprocessing must be included in the time to solution.
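
One possible (purely illustrative) preprocessing step is a single global scaling so that the matrix entries sit comfortably inside the fp16 range; the function below is a hypothetical sketch, and under the rule above its runtime would count toward the time to solution.

```python
import numpy as np

def scale_for_fp16(A, b, target=1.0):
    """Scale the system so that max|A| == target, well inside the fp16 range.

    Scaling A and b by the same factor s leaves the solution unchanged,
    since (s*A) x = (s*b) has the same x as A x = b.
    """
    s = target / np.abs(A).max()
    return s * A, s * b
```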

The factorization can use mixed precision during its construction; e.g., the panel factorization and triangular solves can be done in 32-bit arithmetic, and the Schur complement (a matrix-matrix multiplication of the form A22 <- A22 - A21 * inverse(A11) * A12) can be computed in 16-bit arithmetic with 32-bit accumulation.
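
The following NumPy fragment emulates that kind of mixed-precision Schur update by rounding the panel pieces to float16 and accumulating their product in float32; it is only an emulation for illustration, not how a tuned implementation (which would use hardware mixed-precision units) performs the update.

```python
import numpy as np

def schur_update_fp16(A22, L21, U12):
    """Emulated low-precision Schur complement update: A22 <- A22 - L21 @ U12.

    L21 and U12 stand for the already-factored panel pieces (A21 * inverse(U11)
    and inverse(L11) * A12); they are rounded to fp16 and the product is
    accumulated in fp32.
    """
    L21_h = L21.astype(np.float16)
    U12_h = U12.astype(np.float16)
    # Cast back up so the multiply-accumulate happens in float32.
    update = L21_h.astype(np.float32) @ U12_h.astype(np.float32)
    return A22.astype(np.float32) - update
```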

The computation rate is based on the time to solve the problem: factor the matrix in lower precision, perhaps balance the matrix to prevent overflow, and perform GMRES (or another iterative method) in 64-bit floating-point arithmetic using the LU factors as a preconditioner. If the iterative method takes more than 50 iterations, it should trigger a failure and the run is invalid.

In computing a rate of execution, 2/3 n^3 + 3/2 n^2 operations (2/3 n^3 - 1/2 n^2 accounts for the LU factorization and 2 n^2 for the subsequent forward- and back-solves) will be divided by the total time to solution to arrive at the rate of operations per second.
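
In code, the rate calculation amounts to the following small helper (the function name and the use of seconds are assumptions):

```python
def hpl_mxp_rate(n, time_to_solution_s):
    """Operations-per-second rate: (2/3 n^3 + 3/2 n^2) / time.

    2/3 n^3 - 1/2 n^2 counts the LU factorization and 2 n^2 the two triangular
    solves, regardless of the precisions actually used in the run.
    """
    flops = (2.0 / 3.0) * n**3 + 1.5 * n**2
    return flops / time_to_solution_s
```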

As part of the submission of results we expect the submitter to provide a detailed explanation of the algorithm used in the submission.

We have provided a reference implementation for the purpose of showing how the benchmark could be implemented. We do not expect it to be used as-is when running the benchmark; optimizations should be applied to achieve higher performance than the reference implementation can achieve. The reference implementation can be found on Bitbucket.