SuperSum, In Defense of Floating Point Arithmetic (2024)

Floating point arithmetic doesn't get the respect it deserves. Many people consider it mysterious, fuzzy, unpredictable. These misgivings often occur in discussion of vector sums. Our provocatively named SuperSum is intended to calm these fears.

Contents

  • Ledgers
  • Checksums
  • Order
  • Speed
  • Three Sums
  • Toy Example
  • Second Test
  • Conclusion

Ledgers

A ledger is a list of transactions in an account. Auditing the ledger involves comparing the total of the items with the change in the account balance.

If v is a vector of transaction amounts, then we need to compute

sum(v)

If this sum is equal to the change in the balance, then it is reasonable to assume that the ledger is correct. If not, the ledger must be examined line-by-line.

Checksums

Have you ever used a checksum for a file transfer? One checksum is computed before the file is sent. After the file has been sent over a questionable channel, a second checksum is computed on the receiving end. If the two checksums agree, the transmission was probably okay. If not, the file must be sent again.

Order

Floating point addition is not associative. This means

(a + b) + c

is not necessarily the same as

a + (b + c).

So the order of doing a computation is important.

Computers are deterministic devices. If the same computation is done in the same order more than once, the results must be the same. If you get different sums from different runs on a fixed computer, then it must be because the order has been changed.

Speed

In recent years, we have made the built-in function sum(x) faster by parallelizing it. The input vector is broken into several pieces, the sum of each piece is computed separately and simultaneously, then the partial sums are combined to give the final result. If the number and size of the pieces varies from run to run, the order varies and consequently the computed sums may vary.

Three Sums

Here are three replacements for nansum(x), the version of sum(x) that skips over NaNs and infs.

simplesum

Avoid the effect of blocking in the built-in sum(x).

function s = simplesum(x) % simplesum. s = simplesum(x), for vector(x). % Force sequential order for sum(x). % Skips NaNs and infs.
 s = 0; for k = 1:length(x) if isfinite(x(k)) s = s + x(k); end endend

KahanSum

This is the Kahan summation algorithm. The sum is accumulated in two words, the more significant s and the correction c. If you were able to form s + c exactly, it would be more accurate than s alone.

function s = KahanSum(x) % KahanSum. s = KahanSum(x), for vector(x). % More accurate, but slower, than sum(x). % Skips NaNs and infs. % https://en.wikipedia.org/wiki/Kahan_summation_algorithm.
 s = 0; % sum c = 0; % correction for k = 1:length(x) if isfinite(x(k)) y = x(k) - c; t = s + y; c = (t - s) - y; s = t; end endend

SuperSum

I gave it a pretentious name to grab attention. Use extended precision, with enough digits to hold any MATLAB double.

function s = SuperSum(x) % SuperSum. s = SuperSum(x), for vector(x). % Symbolic Math Toolbox extended precision. % Skips NaNs and infs. % % 632 decimal digits holds every IEEE-754 double. % 632 = ceil(log10(realmax) - log10(eps*realmin)); % din = digits(632); % Remember current setting s = double(sum(sym(x(isfinite(x)),'D'))); digits(din) % Restoreend

Toy Example

A test case here at MathWorks, known by a French-Canadian name that translates to "toy example", has a vector e2 of length 4243 and values that range from -3.3e7 to 6.8e9.

When running tests involving floating point numbers it is a good idea to set the output format to hex so we can see every last bit.

format hexload jeuTest e2x = e2;
[nansum(x) simplesum(x) KahanSum(x) SuperSum(x)]
ans = 423785e8206150e2 423785e8206150e0 423785e8206150e1 423785e8206150e1

For jeuTest, Kahan summation gets the same result as SuperSum, while nansum and simplesum differ in the last bit or two.

Second Test

The vector length is only three, but the third term completely cancels the first, and the second term rises from obscurity. In this situation, KahanSum is no more accurate than sum.

format hexx = [1 1e-14 -1]'
[nansum(x) simplesum(x) KahanSum(x) SuperSum(x)]
x = 3ff0000000000000 3d06849b86a12b9b bff0000000000000
ans = 3d06800000000000 3d06800000000000 3d06800000000000 3d06849b86a12b9b

Conclusion

I will leave careful timing for another day. Let's just say that in situations like jeuTest, KahanSum is probably all you need. It is usually more accurate than sum, and not much slower.

However, for complete reliability, use SuperSum. There is no substitute for the right answer.


Published with MATLAB® R2024a

SuperSum, In Defense of Floating Point Arithmetic (2024)

FAQs

What is a NaN and what is its significance in floating-point arithmetic? ›

NaN: A floating point value meaning “Not a Number”. Support for NaN is provided for interoperability with other devices or systems that produce NaN values. Positive Infinity or Negative Infinity: Floating point value, sometimes represented as Infinity, –Infinity, INF, or –INF.

What is the problem with floating-point arithmetic? ›

Unfortunately, most decimal fractions cannot be represented exactly as binary fractions. A consequence is that, in general, the decimal floating-point numbers you enter are only approximated by the binary floating-point numbers actually stored in the machine.

What are the floating-point arithmetic exceptions? ›

IEEE 754 defines five basic types of floating point exceptions: invalid operation, division by zero, overflow, underflow and inexact. The first three (invalid, division, and overflow) are sometimes collectively called common exceptions. These exceptions can seldom be ignored when they occur.

How do you check if a floating-point is NaN? ›

NaN is unordered: it is not equal to, greater than, or less than anything, including itself. x == x is false if the value of x is NaN. You can use this to test whether a value is NaN or not, but the recommended way to test for NaN is with the isnan function (see Floating-Point Number Classification Functions).

What is NaN used for? ›

[JavaScript] - What is NaN and why is it used in JavaScript? NaN stands for Not a Number, it is a value in JavaScript used to represent an undefined or unrepresentable value.

Why floating point arithmetic is not always 100% accurate? ›

There are two reasons why a real number might not be exactly representable as a floating-point number. The most common situation is illustrated by the decimal number 0.1. Although it has a finite decimal representation, in binary it has an infinite repeating representation.

Why is floating point arithmetic slow? ›

The logic for handling floating point arithmetic is different from (and much more complex than) what is needed to deal with integers. The ALU is built to handle simple integer operations like adding two's complement inputs, bit shifting, etc.

Why is floating point bad? ›

Floating point numbers in a computer are usually binary floating point, which can't represent the exact value of fractions other than integer multiples of powers of two. It's like how one third in decimal is 0.333333… but in binary, one tenth is also a repeating number.

Do calculators use floating-point arithmetic? ›

Most calculators work in decimal β = 10 to make life more intuitive, but various software calculators work in binary (see xcalc above). Microsoft Excel is a particular issue. It tries hard to give the illusion of having floating point with 15 decimal digits, but internally it uses 64-bit floating point.

How do you resolve a floating-point exception? ›

Floating-point exception subroutines
  1. Change the execution state of the process.
  2. Enable the signaling of exceptions.
  3. Disable exceptions or clear flags.
  4. Determine which exceptions caused the signal.
  5. Test the exception sticky flags.

Why is floating-point arithmetic not exact in JavaScript? ›

That's because all numbers in JavaScript are double precision floating-point numbers in line with the IEEE 754 standard. This format allows for the representation of numbers in the approximate range of 5 x 2^-324 to 1.79 * 2^308 inclusive of fractions.

What are the disadvantages of floating-point arithmetic? ›

One of the main disadvantages of floating-point arithmetic is that it is not exact. Because the fraction and the exponent are stored in a limited number of bits, some real numbers cannot be represented exactly, and have to be rounded or truncated.

Is float arithmetic faster than double? ›

At worst, floats can have roughly the same instruction latency as doubles, but at best, memory transfers are twice as fast, SIMD instructions can compute twice as many floats as doubles, and special functions like sqrt and exp require fewer iterations to reach ULP accuracy.

Why use fixed-point instead of floating-point? ›

Precision Needs: If your application requires a consistent precision, particularly for handling numbers around a specific scale, fixed-point might be the better choice. It offers uniform precision, which is beneficial for financial calculations or fixed-scale measurements.

What is float NaN? ›

NaN (Not a Number): NaN represents missing or undefined data in Python. It is typically encountered while performing mathematical operations that result in an undefined or nonsensical value. NaN is a floating-point value represented by the float('nan') object in Python.

What does NaN mean in statistics? ›

Easily the strangest thing about floating-point numbers is the floating-point value "NaN". Short for "Not a Number", even its name is a paradox. Only floating-point values can be NaN, meaning that from a type-system point of view, only numbers can be "not a number".

What does my NaN mean? ›

Definitions of nan. noun. your grandmother. type of: gran, grandma, grandmother, grannie, granny, nanna.

What is the function of NaN? ›

JavaScript's number feature enables you to represent numeric and floating numbers. JavaScript number has a particular value called NaN function, which means for Not-a-Number. NaN is an attribute of the global javascript object. The global object is used to show the window object in web browsers.

References

Top Articles
Gatlinburg Trolley Schedule 2022
Ah, ca ira! – Wikisource
Woodward Avenue (M-1) - Automotive Heritage Trail - National Scenic Byway Foundation
Kansas City Kansas Public Schools Educational Audiology Externship in Kansas City, KS for KCK public Schools
Overnight Cleaner Jobs
What Happened To Dr Ray On Dr Pol
Shorthand: The Write Way to Speed Up Communication
Rubfinder
Matthew Rotuno Johnson
Blue Beetle Showtimes Near Regal Swamp Fox
Programmieren (kinder)leicht gemacht – mit Scratch! - fobizz
Hijab Hookup Trendy
Tcu Jaggaer
Carolina Aguilar Facebook
Check From Po Box 1111 Charlotte Nc 28201
Rachel Griffin Bikini
Driving Directions To Bed Bath & Beyond
Nine Perfect Strangers (Miniserie, 2021)
Satisfactory: How to Make Efficient Factories (Tips, Tricks, & Strategies)
Uta Kinesiology Advising
Sef2 Lewis Structure
Project Reeducation Gamcore
Www.craigslist.com Austin Tx
Craigslist Wilkes Barre Pa Pets
Copper Pint Chaska
Motorcycle Blue Book Value Honda
Tinyzonehd
Our 10 Best Selfcleaningcatlitterbox in the US - September 2024
897 W Valley Blvd
Alternatieven - Acteamo - WebCatalog
The Monitor Recent Obituaries: All Of The Monitor's Recent Obituaries
Craigslistodessa
Fairwinds Shred Fest 2023
L'alternativa - co*cktail Bar On The Pier
After Transmigrating, The Fat Wife Made A Comeback! Chapter 2209 – Chapter 2209: Love at First Sight - Novel Cool
Fox And Friends Mega Morning Deals July 2022
Stolen Touches Neva Altaj Read Online Free
Everstart Jump Starter Manual Pdf
Cars And Trucks Facebook
Murphy Funeral Home & Florist Inc. Obituaries
Diana Lolalytics
Crystal Mcbooty
Tds Wifi Outage
Myfxbook Historical Data
Home Auctions - Real Estate Auctions
Vindy.com Obituaries
Hk Jockey Club Result
Wood River, IL Homes for Sale & Real Estate
Oefenpakket & Hoorcolleges Diagnostiek | WorldSupporter
Legs Gifs
Clock Batteries Perhaps Crossword Clue
Tìm x , y , z :a, \(\frac{x+z+1}{x}=\frac{z+x+2}{y}=\frac{x+y-3}{z}=\)\(\frac{1}{x+y+z}\)b, 10x = 6y và \(2x^2\)\(-\) \(...
Latest Posts
Article information

Author: Manual Maggio

Last Updated:

Views: 5817

Rating: 4.9 / 5 (49 voted)

Reviews: 88% of readers found this page helpful

Author information

Name: Manual Maggio

Birthday: 1998-01-20

Address: 359 Kelvin Stream, Lake Eldonview, MT 33517-1242

Phone: +577037762465

Job: Product Hospitality Supervisor

Hobby: Gardening, Web surfing, Video gaming, Amateur radio, Flag Football, Reading, Table tennis

Introduction: My name is Manual Maggio, I am a thankful, tender, adventurous, delightful, fantastic, proud, graceful person who loves writing and wants to share my knowledge and understanding with you.