<!doctype html> <html> <head> <meta charset="utf-8" /> <meta name="viewport" content="width=device-width, initial-scale=1.0" /> <title>Floating Point Representation — Imath Documentation</title> <link rel="stylesheet" href="_static/pygments.css" type="text/css" /> <link rel="stylesheet" href="_static/bizstyle.css" type="text/css" /> <script id="documentation_options" data-url_root="./" src="_static/documentation_options.js"></script> <script src="_static/jquery.js"></script> <script src="_static/underscore.js"></script> <script src="_static/doctools.js"></script> <script async="async" src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.7/latest.js?config=TeX-AMS-MML_HTMLorMML"></script> <script src="_static/bizstyle.js"></script> <link rel="index" title="Index" href="genindex.html" /> <link rel="search" title="Search" href="search.html" /> <link rel="next" title="Box" href="classes/Box.html" /> <link rel="prev" title="half-float Conversion Configuration Options" href="half_conversion.html" /> <meta name="viewport" content="width=device-width,initial-scale=1.0" /> <!--[if lt IE 9]> <script src="_static/css3-mediaqueries.js"></script> <![endif]--> </head><body> <div class="related" role="navigation" aria-label="related navigation"> <h3>Navigation</h3> <ul> <li class="right" style="margin-right: 10px"> <a href="genindex.html" title="General Index" accesskey="I">index</a></li> <li class="right" > <a href="classes/Box.html" title="Box" accesskey="N">next</a> |</li> <li class="right" > <a href="half_conversion.html" title="half-float Conversion Configuration Options" accesskey="P">previous</a> |</li> <li class="nav-item nav-item-0"><a href="index.html">Imath</a> »</li> <li class="nav-item nav-item-this"><a href="">Floating Point Representation</a></li> </ul> </div> <div class="document"> <div class="documentwrapper"> <div class="bodywrapper"> <div class="body" role="main"> <div class="section" id="floating-point-representation"> <h1>Floating Point Representation<a 
class="headerlink" href="#floating-point-representation" title="Permalink to this headline">¶</a></h1>
<p><strong>Representation of a 32-bit float:</strong></p>
<p>We assume that a float, f, is an IEEE 754 single-precision floating point number, whose bits are arranged as follows:
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="mi">31</span> <span class="p">(</span><span class="n">msb</span><span class="p">)</span>
<span class="o">|</span>
<span class="o">|</span> <span class="mi">30</span>     <span class="mi">23</span>
<span class="o">|</span> <span class="o">|</span>      <span class="o">|</span>
<span class="o">|</span> <span class="o">|</span>      <span class="o">|</span> <span class="mi">22</span>                    <span class="mi">0</span> <span class="p">(</span><span class="n">lsb</span><span class="p">)</span>
<span class="o">|</span> <span class="o">|</span>      <span class="o">|</span> <span class="o">|</span>                     <span class="o">|</span>
<span class="n">X</span> <span class="n">XXXXXXXX</span> <span class="n">XXXXXXXXXXXXXXXXXXXXXXX</span>
<span class="n">s</span> <span class="n">e</span>        <span class="n">m</span>
</pre></div>
</div>
s is the sign bit, e is the exponent and m is the significand.</p>
<p>If e is between 1 and 254, f is a normalized number:
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span>        <span class="n">s</span>    <span class="n">e</span><span class="o">-</span><span class="mi">127</span>
<span class="n">f</span> <span class="o">=</span> <span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span>  <span class="o">*</span> <span class="mi">2</span>      <span class="o">*</span> <span class="mf">1.</span><span class="n">m</span>
</pre></div>
</div>
If e is 0, and m is not zero, f is a denormalized number:
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span>        <span class="n">s</span>    <span class="o">-</span><span class="mi">126</span>
<span class="n">f</span> <span class="o">=</span> <span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span>  <span class="o">*</span> <span class="mi">2</span>      <span class="o">*</span> <span class="mf">0.</span><span class="n">m</span>
</pre></div>
</div>
If e and m are both zero, f is zero:
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="n">f</span> <span class="o">=</span> <span class="mf">0.0</span>
</pre></div>
</div>
If e is 255, f is an infinity if m is zero, and “not a number” (NAN) otherwise.</p>
<p>Examples:
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="mi">0</span> <span class="mi">00000000</span> <span class="mi">00000000000000000000000</span> <span class="o">=</span> <span class="mf">0.0</span>
<span class="mi">0</span> <span class="mi">01111110</span> <span class="mi">00000000000000000000000</span> <span class="o">=</span> <span class="mf">0.5</span>
<span class="mi">0</span> <span class="mi">01111111</span> <span class="mi">00000000000000000000000</span> <span class="o">=</span> <span class="mf">1.0</span>
<span class="mi">0</span> <span class="mi">10000000</span> <span class="mi">00000000000000000000000</span> <span class="o">=</span> <span class="mf">2.0</span>
<span class="mi">0</span> <span class="mi">10000000</span> <span class="mi">10000000000000000000000</span> <span class="o">=</span> <span class="mf">3.0</span>
<span class="mi">1</span> <span class="mi">10000101</span> <span class="mi">11110000010000000000000</span> <span class="o">=</span> <span class="o">-</span><span class="mf">124.0625</span>
<span class="mi">0</span> <span class="mi">11111111</span> <span class="mi">00000000000000000000000</span> <span class="o">=</span> <span class="o">+</span><span class="n">infinity</span>
<span class="mi">1</span> <span class="mi">11111111</span> <span class="mi">00000000000000000000000</span> <span class="o">=</span> <span class="o">-</span><span class="n">infinity</span>
<span class="mi">0</span> <span class="mi">11111111</span> <span class="mi">10000000000000000000000</span> <span class="o">=</span> <span class="n">NAN</span>
<span class="mi">1</span> <span class="mi">11111111</span> <span class="mi">11111111111111111111111</span> <span class="o">=</span> <span class="n">NAN</span>
</pre></div>
</div>
<strong>Representation of a 16-bit half:</strong></p>
<p>Here is the bit layout for a half number, h:
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="mi">15</span> <span class="p">(</span><span class="n">msb</span><span class="p">)</span>
<span class="o">|</span>
<span class="o">|</span> <span class="mi">14</span>  <span class="mi">10</span>
<span class="o">|</span> <span class="o">|</span>   <span class="o">|</span>
<span class="o">|</span> <span class="o">|</span>   <span class="o">|</span> <span class="mi">9</span>        <span class="mi">0</span> <span class="p">(</span><span class="n">lsb</span><span class="p">)</span>
<span class="o">|</span> <span class="o">|</span>   <span class="o">|</span> <span class="o">|</span>        <span class="o">|</span>
<span class="n">X</span> <span class="n">XXXXX</span> <span class="n">XXXXXXXXXX</span>
<span class="n">s</span> <span class="n">e</span>     <span class="n">m</span>
</pre></div>
</div>
s is the sign bit, e is the exponent and m is the significand.</p>
<p>If e is between 1 and 30, h is a normalized number:
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span>        <span class="n">s</span>    <span class="n">e</span><span class="o">-</span><span class="mi">15</span>
<span class="n">h</span> <span class="o">=</span> <span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span>  <span class="o">*</span> <span class="mi">2</span>     <span class="o">*</span> <span class="mf">1.</span><span class="n">m</span>
</pre></div>
</div>
If e is 0, and m is not zero, h is a denormalized number:
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span>        <span class="n">s</span>    <span class="o">-</span><span class="mi">14</span>
<span class="n">h</span> <span class="o">=</span> <span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span>  <span class="o">*</span> <span class="mi">2</span>     <span class="o">*</span> <span class="mf">0.</span><span class="n">m</span>
</pre></div>
</div>
If e and m are both zero, h is zero:
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="n">h</span> <span class="o">=</span> <span class="mf">0.0</span>
</pre></div>
</div>
If e is 31, h is an infinity if m is zero, and “not a number” (NAN) otherwise.</p>
<p>Examples:
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="mi">0</span> <span class="mi">00000</span> <span class="mi">0000000000</span> <span class="o">=</span> <span class="mf">0.0</span>
<span class="mi">0</span> <span class="mi">01110</span> <span class="mi">0000000000</span> <span class="o">=</span> <span class="mf">0.5</span>
<span class="mi">0</span> <span class="mi">01111</span> <span class="mi">0000000000</span> <span class="o">=</span> <span class="mf">1.0</span>
<span class="mi">0</span> <span class="mi">10000</span> <span class="mi">0000000000</span> <span class="o">=</span> <span class="mf">2.0</span>
<span class="mi">0</span> <span class="mi">10000</span> <span class="mi">1000000000</span> <span class="o">=</span> <span class="mf">3.0</span>
<span class="mi">1</span> <span class="mi">10101</span> <span class="mi">1111000001</span> <span class="o">=</span> <span class="o">-</span><span class="mf">124.0625</span>
<span class="mi">0</span> <span class="mi">11111</span> <span class="mi">0000000000</span> <span class="o">=</span> <span class="o">+</span><span class="n">infinity</span>
<span class="mi">1</span> <span class="mi">11111</span> <span class="mi">0000000000</span> <span class="o">=</span> <span class="o">-</span><span class="n">infinity</span>
<span class="mi">0</span> <span class="mi">11111</span> <span class="mi">1000000000</span> <span class="o">=</span> <span class="n">NAN</span>
<span class="mi">1</span> <span class="mi">11111</span> <span class="mi">1111111111</span> <span class="o">=</span> <span class="n">NAN</span>
</pre></div>
</div>
<strong>Conversion via Lookup Table:</strong></p>
<p>By default, conversion from half to float uses a lookup table. There are only 65,536 different half numbers; each of these numbers has been converted and stored in a table pointed to by the <code class="docutils literal notranslate"><span class="pre">imath_half_to_float_table</span></code> pointer.</p>
<p>Prior to Imath v3.1, conversion from float to half was accomplished with the help of an exponent lookup table, but this has been replaced with explicit bit shifting.</p>
<p><strong>Conversion via Hardware:</strong></p>
<p>As of Imath v3.1, the conversion routines use F16C SSE instructions whenever they are present and enabled by compiler flags.</p>
<p><strong>Conversion via Bit-Shifting:</strong></p>
<p>If F16C SSE instructions are not available, conversion can be accomplished by a bit-shifting algorithm.
For half-to-float conversion, this is generally slower than the lookup table, but it may be preferable when memory limits preclude storing the 65,536-entry lookup table.</p>
<p>The lookup table symbol is included in the compilation even if <code class="docutils literal notranslate"><span class="pre">IMATH_HALF_USE_LOOKUP_TABLE</span></code> is false, because application code using the exported <code class="docutils literal notranslate"><span class="pre">half.h</span></code> may choose to enable the use of the table.</p>
<p>An implementation can eliminate the table from compilation by defining the <code class="docutils literal notranslate"><span class="pre">IMATH_HALF_NO_LOOKUP_TABLE</span></code> preprocessor symbol. Simply add:
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="c1">#define IMATH_HALF_NO_LOOKUP_TABLE</span>
</pre></div>
</div>
before including <code class="docutils literal notranslate"><span class="pre">half.h</span></code>, or define the symbol on the compile command line.</p>
<p>Furthermore, an implementation that wishes to receive <code class="docutils literal notranslate"><span class="pre">FE_OVERFLOW</span></code> and <code class="docutils literal notranslate"><span class="pre">FE_UNDERFLOW</span></code> floating point exceptions when converting float to half with the bit-shifting algorithm can define the preprocessor symbol <code class="docutils literal notranslate"><span class="pre">IMATH_HALF_ENABLE_FP_EXCEPTIONS</span></code> before including <code class="docutils literal notranslate"><span class="pre">half.h</span></code>:
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="c1">#define IMATH_HALF_ENABLE_FP_EXCEPTIONS</span>
</pre></div>
</div>
<strong>Conversion Performance Comparison:</strong></p>
<p>In tests on a Core i9, the timings are approximately:</p>
<p>half-to-float:<ul class="simple"> <li><p>table: 0.71 ns / call</p></li> <li><p>no table: 1.06
ns / call</p></li> <li><p>f16c: 0.45 ns / call</p></li> </ul> </p> <p>float-to-half:<ul class="simple"> <li><p>original: 5.2 ns / call</p></li> <li><p>no exp table + opt: 1.27 ns / call</p></li> <li><p>f16c: 0.45 ns / call</p></li> </ul> </p> <p><strong>Note:</strong> the timings above depend on the distribution of the float values being converted. </p> </div> <div class="clearer"></div> </div> </div> </div> <div class="sphinxsidebar" role="navigation" aria-label="main navigation"> <div class="sphinxsidebarwrapper"> <p class="logo"><a href="index.html"> <img class="logo" src="_static/imath-logo-blue.png" alt="Logo"/> </a></p> <h4>Previous topic</h4> <p class="topless"><a href="half_conversion.html" title="previous chapter">half-float Conversion Configuration Options</a></p> <h4>Next topic</h4> <p class="topless"><a href="classes/Box.html" title="next chapter">Box</a></p> <div role="note" aria-label="source link"> <h3>This Page</h3> <ul class="this-page-menu"> <li><a href="_sources/float.rst.txt" rel="nofollow">Show Source</a></li> </ul> </div> <div id="searchbox" style="display: none" role="search"> <h3 id="searchlabel">Quick search</h3> <div class="searchformwrapper"> <form class="search" action="search.html" method="get"> <input type="text" name="q" aria-labelledby="searchlabel" /> <input type="submit" value="Go" /> </form> </div> </div> <script>$('#searchbox').show(0);</script> </div> </div> <div class="clearer"></div> </div> <div class="related" role="navigation" aria-label="related navigation"> <h3>Navigation</h3> <ul> <li class="right" style="margin-right: 10px"> <a href="genindex.html" title="General Index" >index</a></li> <li class="right" > <a href="classes/Box.html" title="Box" >next</a> |</li> <li class="right" > <a href="half_conversion.html" title="half-float Conversion Configuration Options" >previous</a> |</li> <li class="nav-item nav-item-0"><a href="index.html">Imath</a> »</li> <li class="nav-item nav-item-this"><a href="">Floating Point
Representation</a></li> </ul> </div> <div class="footer" role="contentinfo"> © Copyright 2021, Contributors to the OpenEXR Project. Created using <a href="https://www.sphinx-doc.org/">Sphinx</a> 3.4.3. </div> </body> </html>