Precision and accuracy constraints on floating point
Clause [7] [expr]
Christophe de Dinechin

Created on 2000-07-31.00:00:00; last changed 49 months ago


Date: 2015-09-15.00:00:00

Proposed resolution (September, 2015):

Change 6.8.2 [basic.fundamental] paragraph 8 as follows:

There are three floating point types: float, double, and long double. The type double provides at least as much precision as float, and the type long double provides at least as much precision as double. The set of values of the type float is a subset of the set of values of the type double; the set of values of the type double is a subset of the set of values of the type long double. The value representation of floating-point types is implementation-defined. [Note: This International Standard imposes no requirements on the accuracy of floating-point operations; see also _N4606_.18.3.2 [limits]. —end note] Integral and floating types are collectively called arithmetic types. Specializations of the standard library template std::numeric_limits (17.3 [support.limits]) shall specify the maximum and minimum values of each arithmetic type for an implementation.
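
As an illustration only, the ordering and subset requirements in this paragraph can be probed through std::numeric_limits. The sketch below assumes that all three types share the same radix, so that numeric_limits<T>::digits is a fair proxy for "precision"; the Standard does not define precision in those terms. The helper name round_trips_through_double is hypothetical and is not part of the proposed wording.

```cpp
#include <limits>

// Assumption: equal radix across the three types, so comparing mantissa
// digit counts reflects "at least as much precision as".
static_assert(std::numeric_limits<float>::radix == std::numeric_limits<double>::radix,
              "sketch assumes float and double share a radix");
static_assert(std::numeric_limits<float>::digits <= std::numeric_limits<double>::digits,
              "double provides at least as much precision as float");
static_assert(std::numeric_limits<double>::digits <= std::numeric_limits<long double>::digits,
              "long double provides at least as much precision as double");

// Hypothetical helper: because every float value is also a double value,
// widening to double and narrowing back should reproduce the input.
// (NaN inputs compare unequal to themselves and would report false.)
bool round_trips_through_double(float f) {
    double widened = f;                       // exact: float values are a subset of double values
    return static_cast<float>(widened) == f;  // narrowing an exactly representable value is exact
}
```

A caller might pass values such as 0.1f or std::numeric_limits<float>::max(). The check says nothing about the accuracy of arithmetic operations, which the note leaves unconstrained.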
Date: 2016-02-15.00:00:00

[Adopted at the February, 2016 meeting.]

It is not clear what constraints are placed on a floating point implementation by the wording of the Standard. For instance, is an implementation permitted to generate a "fused multiply-add" instruction if the result would be different from what would be obtained by performing the operations separately? To what extent does the "as-if" rule allow the kinds of optimizations (e.g., loop unrolling) performed by FORTRAN compilers?
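
To make the fused multiply-add question concrete, the following sketch contrasts a separately rounded multiply and add with std::fma, which rounds only once. It assumes IEEE 754 binary64 double with round-to-nearest-even and no value-changing optimization flags; the function names mul_then_add and fused are hypothetical. With a = 1 + 2^-27, b = 1 - 2^-27 and c = -1, the exact product is 1 - 2^-54, which rounds to 1.0 as a double, so under these assumptions the two functions are expected to return different results.

```cpp
#include <cmath>

// Two roundings: the product is rounded to double, then the sum is rounded.
double mul_then_add(double a, double b, double c) {
    volatile double product = a * b;  // volatile store forces the product to be rounded to double
    return product + c;
}

// One rounding: std::fma evaluates a * b + c as if to infinite precision,
// then rounds the final result once.
double fused(double a, double b, double c) {
    return std::fma(a, b, c);
}

// Sample inputs (assumption: IEEE 754 binary64, round-to-nearest-even).
double sample_separate() {
    const double e = std::ldexp(1.0, -27);         // 2^-27
    return mul_then_add(1.0 + e, 1.0 - e, -1.0);   // product rounds to 1.0 before the add
}

double sample_fused() {
    const double e = std::ldexp(1.0, -27);         // 2^-27
    return fused(1.0 + e, 1.0 - e, -1.0);          // keeps the -2^-54 term of the exact product
}
```

If a compiler silently contracts the first form into the second, the observable result changes, which is the kind of transformation the issue asks about.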

Date                 User   Action  Args
2017-02-06 00:00:00  admin  set     status: tentatively ready -> cd4
2015-11-10 00:00:00  admin  set     messages: + msg5593
2015-11-10 00:00:00  admin  set     status: open -> tentatively ready
2000-07-31 00:00:00  admin  create