z=z^2+c

Die Mandelbrot Menge in C++ (alles außer SIMD)

Sven Johannsen
C++ User Group NRW
Düsseldorf 2016-01-20

Disclaimer

Normalerweise stellen wir an dieser Stelle irgendwas zum Thema C++ vor und zeigen an Code-Beispielen was man damit machen kann.

Dieser Vortrag zeigt einen fertigen Algorithmus aus der C++-Brille.

Durch den gewählten Algorithmus ist der Vortrag etwas "Fließkomma"-lastig. Die Themen Vererbung, Polymorphie und Smart-Pointer fehlen ganz und Templates werden nur am Rand erwähnt.

Ich bitte dies zu entschuldigen.

Inhalt

Der erste Kontakt
Mandelbrot Kata
Der Algorithmus in C++
Mathematik der Komplexen Zahlen in C++
Optimierungen
Datentypen
Multithreading
Alternativen in C++
SIMD
Fragen / Code-guck

The Fractal Geometry of Nature

(from Benoid B. Mandelbrot, 1983)

The Fractal Geometry of Nature

(from Benoid B. Mandelbrot, 1983)

Ende der 80er tauchten tolle Bilder auf, die mit einer einfachen Formel erzeugt wurden.

z=z^2+c

Mit dieser Formel und "quadratic dynamics" sollten Computer diese Bilder erzeugen können.

Ich hatte keine Idee wie das gehen sollte!

Computer Kurzweil III

(Spektrum der Wissenschaft, Sonderheft 8, Computer Kurzweil III, 1989)

Ein erster Lichtblick:

n <- 0
WHILE n < 100 und Betrag(z) < 2
  z = z^2 + c
  n <- n + 1
Färbe den Aufpunkt

für Programmiersprachen, die mit Komplexen Zahlen umgehen können.

Computer Kurzweil III

Variante für andere Programmiersprachen

n <- 0
WHILE n < 100 und x^2 + y^2  < 4
  xx  <- x^2 - y^2 + a
  y <- 2xy + b
  x <- xx
  n <- n+1
Färbe den Aufpunkt

Der Algorithmus enthält die Variablen x und y, die jedoch nichts mit den Bildschirm Koordinaten zu tun haben; n wird mit 0 initialisiert, x, y und xx nicht.

(Die Transformation der Bildschirmkoordinaten zu a und b wird getrennt beschrieben.)

Ohne FPU

Pascal Quelltext von 1991. (Quelle: wahrscheinlich c't?)
Großartig für AT286 ohne 287 Coprozessor

{$B-,R-,I-,S-,X+}
USES Graph,Crt;

CONST
      MaxTiefe    =  100;      
      Ymin        = -1200;
      Ymax        =  1200;
      Xmin        = -2000;
      Xmax        =  650;
      Norm        =  1000;

VAR
      MaxX,MaxY,HalbY : INTEGER;

FUNCTION iterate(nx, ny : LONGINT) : LONGINT;

VAR
    tiefe   : INTEGER;
    X,Y,Xalt,
    Xp,Yp   : LONGINT;

BEGIN
    Xp := ((Xmax - Xmin) * nx) DIV (MaxX -1) + Xmin;
    Yp := ((Ymax - Ymin) * ny) DIV (MaxY -1) + Ymin;
    tiefe := 0;
    x := 0;
    y := 0;
    WHILE (tiefe < MaxTiefe) AND ((X * X + Y * Y) < 4000000) DO BEGIN
            Xalt := x;
            X := (X * X - Y * Y) DIV NORM +Xp;
            Y := (2 * Xalt * Y) DIV NORM +Yp;
            INC(tiefe);
      END;
      iterate := tiefe;
END;

VAR
      Gd, Gm        : INTEGER;
      nx,ny,i       : INTEGER;

BEGIN
      Gd := VGA;
      Gm := VGAHi;
      InitGraph(Gd,Gm,'\ps\tp\bgi');
      MaxX := GetMaxX;
      MaxY := GetMaxY;
      HalbY := MaxY DIV 2;
      nx := 0;
      WHILE nx < MaxX DO BEGIN
            ny := 0;
            WHILE ny < HalbY DO BEGIN
                  i := Iterate(nx,ny) MOD 16 + 1;
                  PutPixel(nX,nY,i);
                  PutPixel(nX,MaxY-nY-4,i);
                  INC(ny);
            END;
            INC(nx);
            IF KeyPressed THEN nx := MaxX;
      END;
      ReadKey;
      CloseGraph;
END.

Das Mandelbrot Kata

''Setze den Mandelbrot Algorithmus in der neuen Programmiersprache um.''

(Mein persönliches Kata, für den ersten Kontakt mit einer neuen Programmiersprache)

Mandelbrot Kata

ascii art

Mandelbrot Kata: lua

start = os.clock()

disp_width = 80
disp_height  = 25
scale_real = 3.0 / disp_width
scale_imag = 3.0 / disp_height
line = ""

for j = 1, disp_height do
  line = ""
  for i = 1, disp_width  do
    x = (i * scale_real) - 1.8
    y = (j * scale_imag) - 1.5
    u = 0.0
    v = 0.0
    u2 = 0.0
    v2 = 0.0
    iter = 0

    while (u2 + v2 < 4.0) and (iter < 100) do
      v = 2 * v * u + y
      u = u2 - v2 + x
      u2 = u*u
      v2 = v*v
      iter = iter + 1
    end

    if iter < 10 then c = ' '
    elseif iter < 20 then c = '.'
    elseif iter < 30 then c = 'o'
    elseif iter < 50 then c = '*'
    elseif iter < 70 then c = 'x'
    elseif iter < 90 then c = 'X'
    then c = '@'
    end

    line = line .. c

  end
  print(line)
  line = ""

end

ende = os.clock()
print("Dauer:" .. ende - start .. "sek.")

Ziel des Mandelbrot Katas

Lernziele:

Variablen / Typen
Schleifen
Verzweigungen
Einfach Ausgaben
Laufzeitumgebungsaufrufe (Zeit)
(Funktionen)

Nebenwirkungen

Wenn man die gleiche Aufgabe mit verschieden Programmiersprachen löst, kann man diese vergleichen.

Sprache	Version	i5-5200U (Linux)
python	(3.4.3)	104.8 KPixel/s
lua	(5.2.3)	133.5 KPixel/s
nodejs	(0.10.25)	11.0 MPixel/s
java	(1.7.0.79)	12.9 MPixel/s
C++	(GCC 4.8.5)	13.1 MPixel/s

Alles Beispiele sind mit dem gleichen Algorithmus erstellt und damit vergleichbar.

C#: 12.6 MPixel/s

Nur bei C++ muss sichergestellt werden, dass die Testroutine nicht "weg-"optimiert wird.

Die folgenden Beispiele wurden anders erstellt.

Intel(R) Core(TM) i5-5200U CPU @ 2.20GHz

Android (Nexus 4)

Mandelbrot demo: (performance only) Dauer: 11125.2ms; 1.28716MPixel/s Perf. test: Performance Demo (different float types) iter. duration [ms] [pixel/s] name 46, 1013.8563, 1.8149, std::complex float 36, 1026.0339, 1.4035, std::complex double 36, 1016.6336, 1.4164, std::complex long double 49, 1012.2082, 1.9364, float (1) 36, 1011.9640, 1.4230, double (1) 36, 1004.9138, 1.4330, long double (1) 56, 1011.5977, 2.2143, float (2) 42, 1007.6606, 1.6672, double (2) 42, 1003.9066, 1.6735, long double (2) 11, 1062.8414, 0.4140, int 2, 1176.7130, 0.0680, int64 2, 1176.5298, 0.0680, long long 48, 1007.4470, 1.9058, complex_test::Complex float 36, 1008.9119, 1.4273, complex_test::Complex double 36, 1009.6750, 1.4262, complex_test::Complex long double Perf. test: compare while loop vs. do loop test: default range [-1.8,1.2; -1.5,1.5] iter. duration [ms] [pixel/s] name 57, 1015.2907, 2.2457, Do loop (float) 43, 1018.4648, 1.6888, Do loop (double) 54, 1005.8599, 2.1474, While loop (float) 41, 1005.4021, 1.6312, While loop (double) 57, 1014.9245, 2.2465, For loop (float) 43, 1015.8706, 1.6931, For loop (double) Perf. test: compare while loop vs. do loop test: 100 iterations iter. duration [ms] [pixel/s] name 12, 1045.8111, 0.4590, Do loop (float) 9, 1079.6582, 0.3334, Do loop (double) 11, 1007.4775, 0.4367, While loop (float) 9, 1098.8250, 0.3276, While loop (double) 12, 1047.2761, 0.4583, For loop (float) 9, 1069.8001, 0.3365, For loop (double) Perf. test: compare while loop vs. do loop test: 1 iteration iter. duration [ms] [pixel/s] name 100, 135.6325, 29.4914, Do-Loop (float) 100, 162.1547, 24.6678, Do-Loop (double) 100, 154.2500, 25.9319, While loop (float) 100, 186.0522, 21.4993, While loop (double) 100, 135.7241, 29.4716, For loop (float) 100, 162.0937, 24.6771, For loop (double) Perf. test: Unroll Performance Demo iter. duration [ms] [pixel/s] name 56, 1002.2890, 2.2349, (float) 54, 1015.3517, 2.1273, unroll 5 (float) 55, 1010.3159, 2.1775, unroll 10 (float) 51, 1005.9210, 2.0280, unroll full (float) 42, 1005.5852, 1.6707, (double) 40, 1013.8868, 1.5781, unroll 5 (double) 41, 1001.6176, 1.6374, unroll 10 (double) 42, 1018.5869, 1.6493, unroll full (double) Perf. test: Performance Demo (parallel OpenMP vs. TBB vs. std::future) iter. duration [ms] [pixel/s] name 56, 1016.1453, 2.2044, Sequential (float) 42, 1010.9873, 1.6617, Sequential (double) 100, 538.5014, 7.4280, OpenMP (float) 100, 587.9444, 6.8034, OpenMP (double) 73, 1016.4200, 2.8728, Simple future (float) 32, 1012.9712, 1.2636, Simple future (double) Perf. test: Performance Demo(different initializations) iter. duration [ms] [pixel/s] name 48, 1004.3034, 1.9118, float 2 (int literal) 42, 1064.9779, 1.5775, double 2 (int literal) 56, 1013.3679, 2.2105, float 2.f (float literal) 42, 1011.9029, 1.6602, double 2.f (float literal) 39, 1004.0287, 1.5537, float 2.0 (double literal) 42, 1006.0736, 1.6699, double 2.0 (double literal) 56, 1007.5080, 2.2233, float template type cast 42, 1004.1508, 1.6731, double template type cast

Computerkurzweil in C++

Eine Implementierung in C++
(basierend auf dem Artikel von 1989)

n <- 0
WHILE n < 100 und Betrag(z) < 2
  z = z^2 + c
  n <- n + 1
Faerbe den Aufpunkt

int pixel(double re, double im)
{
  std::complex<double> z{ 0. }, c{ re, im };
  int n{ 0 };
  while (n < 100 && std::abs(z) < 2.0) {
    z = z * z + c;
    ++n;
  }
  return n;
}

Berechne ein ganzes Bild

Bildschirm-Koordinaten in eine Komplexe Zahlenebene transponieren.

    const int disp_width = 179, disp_height = 80;
    const double start_real = -1.8, start_imag = -1.5;
    const double scale_real = 3.0 / disp_width;
    const double scale_imag = 3.0 / disp_height;

    for (int j = 1; j < disp_height; ++j) {
      for (int i = 1; i < disp_width; ++i) {
        const double re = start_real + i * scale_real;
        const double im = start_imag + j * scale_imag;
        const int iter = pixel(re, im);
        cout << transform(iter);
      }
      cout << '\n';
    }

Bildschirmauflösung 178x79
Wertebereich: [-1.8..1.2; -1.5..1.5] 

i,j: Bildschirmkoordinaten
re,im: Korrdinaten im einer Komplexen Ebende

Transform

Das Ergebnis muss in "Farben" umgewandelt werden.

  char transform(int iter)
  {
    if (iter < 10) return ' ';
    if (iter < 20) return '.';
    if (iter < 30) return 'o';
    if (iter < 50) return '*';
    if (iter < 70) return 'x';
    if (iter < 90) return 'X';
    return '@';
  }

Demo

Projekt: demo

Datei: SimpleExample.h

(check test.cpp)

Demo

ascii art

Ohne komplexe Zahlen

2 alternative Lösungen ohne std::complex<T>

   // (1 von 1989)
   float u{ 0 }, v{ 0 };
   int iter{ 0 };

   while (u*u + v*v < 4.f && iter < 100) {
     const float uu = v*v - u*u + re;
     u = 2.f * u*v + im;
     v = uu;
     ++iter;
   }

   // (2 intel, ...)
   float u{ 0 }, v{ 0 }, u2{ 0 }, v2{ 0 };
   int iter{ 0 };

   while (u2 + v2 < 4.f && iter < 100) {
     v = 2.f * v * u + im;
     u = u2 - v2 + re;
     u2 = u*u;
     v2 = v*v;
     ++iter;
   }

Vergleich: std::complex vs. float

Windows 10 - Visual Studio (64bit)
Intel(R) Core(TM) i5-5200U CPU @ 2.20GHz

Datentype	VS2013	VS2015
complex<float>	1.9 MPixel/s	10.8 MPixel/s
complex<double>	8.0 MPixel/s	8.0 MPixel/s
float (1)	12.8 MPixel/s	12.3 MPixel/s
double (1)	12.8 MPixel/s	12.3 MPixel/s
float (2)	12.3 MPixel/s	12.6 MPixel/s
double (2)	12.8 MPixel/s	12.5 MPixel/s

Windows 7 - Visual Studio (64bit)
Intel(R) Core(TM) i7-2860 CPU @ 2.50GHz

Datentype	VS2013	VS2015
complex<float>	MPixel/s	MPixel/s
complex<double>	MPixel/s	MPixel/s
float (1)	MPixel/s	MPixel/s
double (1)	MPixel/s	MPixel/s
float (2)	MPixel/s	MPixel/s
double (2)	MPixel/s	MPixel/s

Windows 7 - GCC 5.1.0
Intel(R) Core(TM) i7-2860 CPU @ 2.50GHz

Datentype	-O3	-O3 -ffast-math
complex<float>	MPixel/s	MPixel/s
complex<double>	MPixel/s	MPixel/s
float (1)	MPixel/s	MPixel/s
double (1)	MPixel/s	MPixel/s
float (2)	MPixel/s	MPixel/s
double (2)	MPixel/s	MPixel/s

Vergleich: std::complex vs. float

Mint Linux 17.2 - GCC 4.8.5
Intel(R) Core(TM) i5-5200U CPU @ 2.20GHz

Datentype	-O3	-O3 -ffast-math
complex<float>	2.3 MPixel/s	12.9 MPixel/s
complex<double>	6.6 MPixel/s	13.0 MPixel/s
float (1)	14.1 MPixel/s	12.6 MPixel/s
double (1)	14.0 MPixel/s	12.5 MPixel/s
float (2)	13.5 MPixel/s	14.1 MPixel/s
double (2)	13.4 MPixel/s	14.1 MPixel/s

GCC 5.3.0

Clang 3.6

Dauer: 1135.35ms; 12.6128MPixel/s Perf. test: Performance Demo (different float types) iter. duration [ms] [pixel/s] name 10, 1036.1633, 9.6510, std::complex float 11, 1089.4020, 10.0973, std::complex double 6, 1093.3975, 5.4875, std::complex long double 13, 1062.8206, 12.2316, float (1) 13, 1062.3402, 12.2371, double (1) 9, 1107.4328, 8.1269, long double (1) 13, 1021.9822, 12.7204, float (2) 13, 1021.8687, 12.7218, double (2) 9, 1027.6332, 8.7580, long double (2) 10, 1096.5380, 9.1196, int 9, 1063.7199, 8.4609, int64 9, 1068.2356, 8.4251, long long 11, 1067.9092, 10.3005, mandel::Complex float 10, 1104.2594, 9.0558, mandel::Complex double 8, 1068.8192, 7.4849, mandel::Complex long double

Mathematik der Komplexen Zahlen

für C++-Programmierer

Vereinfachte Klasse für Komplexe Zahlen

template<typename FLOAT_TYPE>
struct Complex {

    typedef FLOAT_TYPE value_type;

    FLOAT_TYPE re_;
    FLOAT_TYPE im_;

    FLOAT_TYPE real() const { return re_; }
    FLOAT_TYPE imag() const { return im_; }
};

Complex<double> z = { 0. }, c = { re, im };

Operatoren

Mathematische Operatoren

template<typename FLOAT_TYPE>
inline Complex<FLOAT_TYPE> operator+(const Complex<FLOAT_TYPE>& l, 
                                     const Complex<FLOAT_TYPE>& r)
{
  return Complex<FLOAT_TYPE>{l.re_ + r.re_, 
                             l.im_ + r.im_ };
}

template<typename FLOAT_TYPE>
inline Complex<FLOAT_TYPE> operator*(const Complex<FLOAT_TYPE>& l, 
                                     const Complex<FLOAT_TYPE>& r)
{
  return Complex<FLOAT_TYPE>{ l.re_ * r.re_ - l.im_ * r.im_, 
                              l.re_ * r.im_ + l.im_ * r.re_ };
}

z = z * z + c;

std::abs

Absoluter Betrag.

template<typename T>
T abs(const Complex<T>& z)
{
  return sqrt(z.re_*z.re_ + z.im_*z.im_);
}

while (abs(z) < 2.0 && iter < 100) {
...
}

Beispiel Rechnungen

mit dem Mandelbrot Algorithmus (z = z * z + c)

3 kleine Beispiele für:

0, 0i
-2, 0i
-1, 0i

(Koordinaten auf der Komplexen Zahlenebene)

Berechne (0, 0i)

Startwert c = (0, 0i)

Complex z{ 0 }, c{ 0.0, 0.0 };
while (std::abs(z) < 2.0 && iter < 100) {
  z = z * z + c;
  ++iter;
}

z = (0, 0i)*(0, 0i) + (0, 0i) = (0, 0i)
bei allen Berechnungen

=> Endlosschleife ohne die Abbruchbedingung iter<100

Berechne (-2, 0i)

Startwert c = (-2, 0i)

Complex z{ 0 }, c{ -2.0, 0.0 };
while (std::abs(z) < 2.0 && iter < 100) {
  z = z * z + c;
  ++iter;
}

z = (0, 0i)*(0, 0i) + (-2, 0i) = (-2, 0i)
abs(-2, 0i) = 2

abs(z) < 2.0 ist im ersten Durchgang erfüllt.

iter == 1

Pythagoras als Kreisformel

In der ersten Iteration erhält z den Wert von c.

Die Kreisformel definiert einen Grenzwert um die Mandelbrotmenge:

Berechne (-1, 0i)

z' = z*z + (-1, 0i)
z^2 = z*z

iter	z	z^2	c	z'
1	( 0.0, 0.0i)	(0.0, 0.0i)	(-1.0, 0.0i)	(-1.0, 0.0i)
2	(-1.0, 0.0i)	(1.0, 0.0i)	(-1.0, 0.0i)	( 0.0, 0.0i)
3	( 0.0, 0.0i)	(0.0, 0.0i)	(-1.0, 0.0i)	(-1.0, 0.0i)
4	(-1.0, 0.0i)	(1.0, 0.0i)	(-1.0, 0.0i)	( 0.0, 0.0i)

Der Wert von z wechselt zwischen (0, 0i) und (-1, 0i) und erreicht nie abs(z) >= 2.

=> iter == 100

Berechne die Anzahl der Iterationen für den Startwert c gleich (-1, 0i)

iter	z.re	z^2.re	c.re	z`.re
1	0.0	0.0	-1.0	-1.0
2	-1.0	1.0	-1.0	0.0
3	0.0	0.0	-1.0	-1.0
4	-1.0	1.0	-1.0	0.0

Regionen des Chaos

Blau: 1 Iteration
Rot: 100 Iterationen

Grenzwerte

while (std::abs(z) < 2.0 && iter < 100) {
  z = z * z + c;
  ++iter;
}

Was passiert, wenn man die Konstanten 100 (maximale Iterationen) und 2.0 (Escape Radius) anpasst?

Siehe Verzeichnisse:

d:\develop\git\texte\mandelbrot\source\processing\Mandelbrot 
d:\develop\git\texte\mandelbrot\source\processing\Mandelbrot_border

Mehr Informationen

http://mathworld.wolfram.com/MandelbrotSet.html

https://de.wikipedia.org/wiki/Mandelbrot-Menge

https://en.wikipedia.org/wiki/Mandelbrot_set

Optimierungen

Neben den schönen Bildern gilt die Hauptfrage bei der Mandelbrotmengen-Berechnen nach der Performance. Wie kann man die Bilder möglichst schnell berechnen?

while Schleife
loop unroll
Literale
Schritte berechnen
Aus den Schleifen herausnehmen
teure Operationen (e.g. sqrt, abs)

Optimierung der Schleife

    ...
    std::complex z{0.}, c{x,y};
    int iter{0};    

    while ((std::abs(z) < 2.0) && (iter < 100)) {
        z = z * z + c;
        ++iter;
    }
    ...

while -Abfrage ist immer true beim ersten Test!

Optimierung der Schleife

// `while` - loop
int pixel(double x, double y) 
{
    std::complex z{0.}, c{x,y};
    int iter{0};    
    while ((std::abs(z) < 2.0) && (iter < 100)) {
        z = z * z + c;
        ++iter;
    }        
    return iter;
}

// `do` - loop
int pixel(double x, double y) 
{
    std::complex z{0.}, c{x,y};
    int iter{0};    
    do {
        z = z * z + c;
        ++iter;
    } while ((std::abs(z) < 2.0) && (iter < 100));

    return iter;
}

Mint Linux 17.2 - GCC 4.8.5
Intel(R) Core(TM) i5-5200U CPU @ 2.20GHz

Datentype	-O3	-O3 -ffast-math
do loop float	13.1 MPixel/s	13.0 MPixel/s
do loop double	12.6 MPixel/s	12.6 MPixel/s
while loop float	12.3 MPixel/s	12.6 MPixel/s
while loop double	12.0 MPixel/s	12.6 MPixel/s
sqrt float	9.1 MPixel/s	10.3 MPixel/s
sqrt double	6.3 MPixel/s	6.5 MPixel/s

Optimierung der Schleife

Wie kann die Optimierung gemessen werden?

Blauer Bereich
- 1 statt 2 Abfragen
Roter Bereich
- 100 statt 101 Abfragen

Optimierung der Schleife

Datentype	VS2015	1 Iter.	GCC 4.8	1 Iter.
do/while (float)	12.5	344.6	13.4	253.2
do/while (double)	12.3	346.3	13.4	254.4
while (float)	12.8	349.8	13.4	253.8
while (double)	12.7	348.1	13.4	253.5
for (float)	12.3	353.4	13.5	253.1
for (double)	12.5	349.7	13.4	253.6

Alle Werte in MPixel/s
GCC 4.8.5 auf Mint Linux 13.2, VS2015 unter Windows 10

Clang 3.6

Perf. test: compare while loop vs. do loop test: default range [-1.8,1.2; -1.5,1.5] iter. duration [ms] [pixel/s] name 13, 1009.7863, 12.8740, Do loop (float) 13, 1006.6267, 12.9144, Do loop (double) 13, 1053.1475, 12.3439, While loop (float) 13, 1052.5842, 12.3506, While loop (double) 13, 1043.4124, 12.4591, For loop (float) 13, 1044.0583, 12.4514, For loop (double) Perf. test: compare while loop vs. do loop test: 1 iteration iter. duration [ms] [pixel/s] name 100, 315.4493, 317.0081, Do-Loop (float) 100, 313.6972, 318.7787, Do-Loop (double) 100, 316.5951, 315.8608, While loop (float) 100, 313.3868, 319.0945, While loop (double) 100, 317.2940, 315.1651, For loop (float) 100, 314.1955, 318.2732, For loop (double)

Schleifen ausrollen

Was bringt es Schleifen auszurollen? (loop unrolling)

Demo

Loop unrolling

Compiler/data type	no	5	10	full
GCC 4.8.5 (float)	13.5	13.6	13.7	13.2
GCC 4.8.5 (double)	13.5	13.6	13.6	13.2
VS2013 (float)	12.3	13.1	13.1	13.1
VS2013 (double)	12.8	13.1	13.1	12.7
VS2015 (float)	12.5	13.1	13.2	13.1
VS2015 (double)	12.6	13.1	13.2	12.6

Alle Werte in MPixel/s
GCC auf Mint Linux 13.2, VS2015 unter Windows 10

Literale

  template<typename FLOAT_TYPE, int MAX_ITER>
  struct Pixel
  ...
    do {
        v = 2.0 * v * u + im;
        u = u2 - v2 + re;
        u2 = u*u;
        v2 = v*v;
        ++iter;
      } while (u2 + v2 < 4.0 && iter < MAX_ITER);
  ...

Welches Literal soll für "2.0" verwendet werden?

"2"
"2.f"
"2.0"

Benchmark

Initialisierung	GCC 4.8.5	VS2015
float 2	372 ns	654 ns
double 2	371 ns	585 ns
float 2.f	371 ns	530 ns
double 2.f	372 ns	414 ns
float 2.0	638 ns	623 ns
double 2.0	372 ns	414 ns
float template type	372 ns	418 ns
double template type	372 ns	415 ns

Raus aus der Schleife

Alles aus der Schleife heraus ziehen was möglich ist!

for (int j = 1; j < disp_height; ++j) {
  for (int i = 1; i < disp_width; ++i) {
    const double re = i * scale_real - 1.8;
    const double im = j * scale_imag - 1.5; // (A)
    const int iter = pixel(re, im);
    ...
  }
}

for (int j = 1; j < disp_height; ++j) {
  const double im = j * scale_imag - 1.5;  // (A)
  for (int i = 1; i < disp_width; ++i) {
    const double re = i * scale_real - 1.8; 
    const int iter = pixel(re, im);
    ...
  }
}

Schritt Berechnung

Startwert vs. Addition

for (int j = 1; j < disp_height; ++j) {
  const double im = y1 + j * scale_imag; // (1)
  for (int i = 1; i < disp_width; ++i) {
    const double re = x1 + i * scale_real; // (2)
    const int iter = pixel(re, im);
    ...
  }
}

double im = y1; // (1a)
for (int j = 1; j < disp_height; ++j) {
  im += scale_imag; // (1b)
  double re = x1;   // (2a)
  for (int i = 1; i < disp_width; ++i) {
    re += scale_real; // (2b)
    const int iter = pixel(re, im);
    ...
  }
}

Schritt Berechnung

Von Startwert neu berechnen: Der Fehler bleibt konstant / in der gleichen Größenordnung.

for(int j = 0; ...) {
    const float im = j * scale_imag + starty;
    ...
}

Bei der Addition summieren sich die numerischen Fehler auf!

float im = starty;
for(int j = 0; ...) {
    im += scale_imag;
    ...
}

Schritt Berechnung

Die Addition ist schneller als die Index Berechnung
Die Addition summiert Fehler auf.
- Probleme bei der Unit tests und textueller Darstellung

Vermeide (teure) Operationen

z.B. abs() bzw. sqrt():

do {
...
} while ((std::abs(z) < 2.0) && (iter < 100));

do {
...
} while ((z.imag() * z.imag() + z.real() * z.real() < 4.0) && (iter < 100));

Vermeide (teure) Operationen

Benchmark	GCC 4.8.5	VS2015
BM_cmplx_abs	29 ns	77 ns
BM_cmplx_hypot	13 ns	9 ns
BM_cmplx_sqrt	13 ns	6 ns
BM_cmplx_square	8 ns	3 ns
BM_cmplx_empty	7 ns	3 ns
BM_float_hypot	19 ns	9 ns
BM_float_sqrt	13 ns	7 ns
BM_float_square	8 ns	3 ns
BM_float_empty	7 ns	3 ns

Vermeide (teure) Operationen

std::sqrt(re*re+im*im) >= 2.0 ist merklich teuer als re*re+im*im >= 4.0

std::hypot() ist teurer als std::sqrt() wegen zusätzlicher Checks.

std::abs(std::complex) ist teurer als std::hypot() wegen zusätzlicher NaN, Unendlich und Stabilitäts-Checks.

Datentypen

Vergleiche verschiedene Datentypen

Data type	GCC	VS2015
float	13.5 MPixel/s	12.6 MPixel/s
double	13.4 MPixel/s	12.5 MPixel/s
long double	8.5 MPixel/s	12.5 MPixel/s
int32_t	8.1 MPixel/s	8.0 MPixel/s
int64_t	8.2 MPixel/s	8.1 MPixel/s
complex<float>	2.3 MPixel/s	10.8 MPixel/s
complex<double>	6.6 MPixel/s	8.0 MPixel/s

Warum ist `int` so langsam?

Wieso ist int langsamer als Fließkomma-Typen?

Warum ist `int` so langsam?

int ist nicht das Problem - es is die Division-Korrektur!

Datentyp	64 bit	32 bit
float	12.6 MPixel/s	12.5 MPixel/s
double	12.5 MPixel/s	12.9 MPixel/s
int 32 bit	8.0 MPixel/s	7.5 MPixel/s
int 64 bit	8.1 MPixel/s	1.2 MPixel/s
float (div)	6.0 MPixel/s	6.0 MPixel/s
double (div)	5.3 MPixel/s	5.4 MPixel/s
float (mul)	9.6 MPixel/s	9.5 MPixel/s
double (mul)	9.5 MPixel/s	9.3 MPixel/s

Warum ist `long double` so langsam (auf Linux)?

Der GCC erzeugt für alle Datentypen schnelleren Code - außer für long double. Warum?

Warum ist `long double` so langsam (auf Linux)?

Antwort A:

Linux: sizeof(long double) == 16

Windows (Visual Studio) sizeof(long double) == 8

Kein SIMD für `long double`

Antwort B:

SSE unterstützt nur 4 und 8 Byte Fließkommatypen.

long double muss auf der FPU gerechnet werden (Linux).

Die SSE Recheneinheiten sind schneller als die FPU.

Ist std::complex langsam?

http://www.lomont.org/Math/Papers/2011/Intro%20to%20Intel%20AVX-Final.pdf

Kein Problem - AVX ist viel schneller

Nutze GCC mit -ffast-math

Digging into the generated assembler code showed that std::norm (|z|2) computed the absolute value (|z|) in a way avoiding overflow, and then squared the result. This problem could be fixed by the compile flag -ffast-math.

https://www.c-plusplus.net/forum/271840-full

-fcx-limited-range

When enabled, this option states that a range reduction step is not needed when performing complex division. Also, there is no checking whether the result of a complex multiplication or division is "NaN + I*NaN", with an attempt to rescue the situation in that case. The default is -fno-cx-limited-range, but is enabled by -ffast-math.

This option controls the default setting of the ISO C99 "CX_LIMITED_RANGE" pragma. Nevertheless, the option applies to all languages.

Habe ähnliche Infos hier gefunden: http://old.nabble.com/per ....... -versions-td27187415.html Anscheinend war std::complex bem GCC 4.2 und früher noch nicht so lahm und ab 4.3 hamse std::complex "konform" implementiert, was nicht mehr so schnell ist, wie vorher -- zumindest nicht, wenn man sich ffast-math spart.

http://old.nabble.com/performance-question-with-std::complex%3Cfloat%3E-in-new-g%2B%2B-versions-td27187415.html

Why would a computational scientist need to implement their own version of std::complex? http://scicomp.stackexchange.com/questions/1998/why-would-a-computational-scientist-need-to-implement-their-own-version-of-std

The -ffast-math compile flag makes the behavior of NaN completely unpredictable for newer versions of gcc (even isnan is affected). The only workaround seems to be to avoid any occurrence of NaN in the program, which defeats the purpose for the existence of NaN.

https://gcc.gnu.org/ml/gcc-help/2006-03/msg00011.html

http://benchmarksgame.alioth.debian.org/

Complex types

Data type	GCC 4.8.5	-ffast-math	VS 2015
float	13.3	14.1	12.6
double	13.2	14.1	12.5
std::complex (float)	2.3	12.9	10.8
std::complex (double)	6.6	13.0	8.0
mandel::complex (float)	13.5	12.9	10.8
mandel::complex (double)	13.2	12.9	10.7

Alle Werte in MPixel/s

Data type	GCC 4.8.5	-ffast-math
float	12.5	13.2
double	12.4	12.8
long double	7.8	7.6
std::complex (float)	1.6	12.0
std::complex (double)	5.6	11.8
std::complex (long double)	0.7	7.8
mandel::complex (float)	12.3	12.0
mandel::complex (double)	12.1	11.8
mandel::complex (long double)	7.7	7.8

Zusammenfassung

Es gibt keine Zaubertricks um den Compiler um Längen zu schlagen.
- Loop unroll, ...
Es gibt viele Möglichkeiten den Compiler auszubremsen.
- Float-Literal, ...
Die Wahl des richtigen Datentyps ist entscheidend.
- siehe SIMD
Eine Klasse / Bibliothek mit einem anderen Schwerpunkt kann das Programm wesentlich verlangsamen
- std::complex / NaN-Prüfungen

Multithreading

Siehe:

beide: Herb Sutter.

OpenMP
Intel TBB
Performance Vergleich / Analyse

OpenMP

Äußere Schleife parallelisieren:

#pragma omp parallel for
for (int j = 0; j < height; ++j) {
    const FLOAT_TYPE y = (j * dy) + y1;
    for (int i = 0; i < width; ++i) {
        const FLOAT_TYPE x = (i * dx) + x1;
        auto iter = pixel(x, y);
        ...
    }
}

#pragma omp OpenMP Befehl
parallel for Parallele for-Schleife, muss vor einer for Anweisung stehen

Intel TBB

Äußere Schleife durch tbb::parallel_for ersetzen.

tbb::parallel_for(int(0), int(height), [&](int j) {
    const FLOAT_TYPE y = (j * dy) + y1;
    for (int i = 0; i < width; ++i) {
        const FLOAT_TYPE x = (i * dx) + x1;
        const auto iter = pixel(x, y);
        ...
    }
});

Der Schleifen-body (die innere Schleife) wird als Lambda formuliert.

Intel TBB vs OpenMP

OpenMP

#pragma omp parallel for
for (int j = 0; j < height; ++j) {
    ...
}

Intel TBB

tbb::parallel_for(int(0), int(height), [&](int j) {
    ...
});

Der innere Teil (...) ist in beiden Varianten gleich.

Performance Vergleich

Methode	float	double
Sequentiell	12.6 MPixel/s	12.6 MPixel/s
OpenMP	22.8 MPixel/s	23.1 MPixel/s
Intel TBB	39.2 MPixel/s	39.3 MPixel/s
std::futures	37.3 MPixel/s	37.5 MPixel/s

Visual Studio 2015.

Erste Analyse

2 Ergebnisse:

Die Beschleunigung ist nicht Factor 4 (12.6 x 4 = 50.4)
OpenMP ist scheinbar viel langsamer als TBB/std::futures

Multi-Core Archithektur

Was bedeutet Hyper-Threading?
Was bedeutet Intel Turbo Boost?

Hyper-Threading Tests

Intel Core i7-2860QM 2.5 GHz 4 Kerne, HyperThreading (8 Hardware-Threads)

-	Ohne HT	Mit HT
Ohne TB	9.0 35.5 (x3.94)	9.0 58.9 (x6.54)
Mit TB	12.7 (x1.41) 46.5 (x3.66)	12.7 (x1.41) 76.8 (x6.04)

HT: Hyper-Threading
TB: Intel Turbo Boost
Single threaded vs TBB

Tests auf diesem Rechner, da per BIOS HT und TB ausschaltbar sind.

No HT, no speed step, no turbo boost

Perf. test: Performance Demo (parallel OpenMP vs. TBB vs. std::future) iter. duration [ms] [pixel/s] name 8, 1069.8365, 7.4778, Sequential (float) 8, 1047.2794, 7.6388, Sequential (double) 33, 1023.9131, 32.2293, OpenMP (float) 31, 1011.1410, 30.6584, OpenMP (double) 32, 1028.4542, 31.1147, TBB (float) 32, 1009.7982, 31.6895, TBB (double) 32, 1002.1610, 31.9310, Simple future (float) 32, 1004.4192, 31.8592, Simple future (double) Press any key to continue . . .

no HT, speed step, turbo boost

Perf. test: Performance Demo (parallel OpenMP vs. TBB vs. std::future) iter. duration [ms] [pixel/s] name 11, 1006.9079, 10.9245, Sequential (float) 12, 1067.0528, 11.2459, Sequential (double) 42, 1008.6629, 41.6393, OpenMP (float) 43, 1014.1528, 42.3999, OpenMP (double) 42, 1008.2893, 41.6547, TBB (float) 43, 1012.2705, 42.4788, TBB (double) 44, 1020.8661, 43.1007, Simple future (float) 43, 1008.8386, 42.6233, Simple future (double)

HT, no speed step, no turbo boost

Perf. test: Performance Demo (parallel OpenMP vs. TBB vs. std::future) iter. duration [ms] [pixel/s] name 8, 1055.4881, 7.5794, Sequential (float) 8, 1035.3977, 7.7265, Sequential (double) 34, 1061.4410, 32.0319, OpenMP (float) 43, 1015.7815, 42.3319, OpenMP (double) 53, 1001.6983, 52.9101, TBB (float) 54, 1017.5747, 53.0674, TBB (double) 53, 1009.9883, 52.4759, Simple future (float) 53, 1010.9087, 52.4281, Simple future (double)

HT, speed step, turbo boost

Perf. test: Performance Demo (parallel OpenMP vs. TBB vs. std::future) iter. duration [ms] [pixel/s] name 11, 1008.4739, 10.9076, Sequential (float) 12, 1069.3368, 11.2219, Sequential (double) 69, 1005.8782, 68.5968, OpenMP (float) 69, 1011.5980, 68.2089, OpenMP (double) 68, 1005.0018, 67.6616, TBB (float) 71, 1012.3903, 70.1311, TBB (double) 71, 1009.6205, 70.3235, Simple future (float) 72, 1013.4835, 71.0421, Simple future (double)

Warum ist OpenMP so langsam

OpenMP verwendet als default(*) einen statischen Scheduler, dazu wird der Bereich in gleich große Blöcke aufgeteilt.

Der dynamischen Scheduler verteilt die Arbeit "gleichmäßig" auf die Kerne.

#pragma omp parallel for schedule(dynamic)

(*) unspezifiziert. (aber alle mir bekannten Compiler nutzen den statischen Scheduler als default Scheduler)

Warum ist OpenMP so langsam

regions of the chaos

Performance Vergleich

Methode	float	double
Sequentiell	12.6 MPixel/s	12.6 MPixel/s
OpenMP static	22.8 MPixel/s	23.1 MPixel/s
OMP dynamic	40.6 MPixel/s	39.6 MPixel/s
Intel TBB	39.2 MPixel/s	39.3 MPixel/s
std::futures	37.3 MPixel/s	37.5 MPixel/s

Benchmarks and Multithreading

Demo

Erzeugen des Thread pools
Temperaturabhängige Messungen

Mehr C++

Rekursion
Ranges

Rekursion

Schleifen und Rekursion sind das Gleiche - Man kann sie einfach ineinander überführen.

aus der Sicht eines Mathematikers

Kann man den Mandelbrot Mengen Algorithmus effizient rekursiv programmieren?

Ein erster Versuch

int pixel_detail(int iter, COMPLEX_TYPE z, 
                 const COMPLEX_TYPE c, const int max_iter) 
{
  if (iter < max_iter) {
    z = z * z + c;
    if ((z.imag() * z.imag() + z.real() * z.real() < FLOAT_TYPE(4.0)))
      return pixel_detail(iter + 1, z, c, max_iter);
  }
  return iter;
}

Da war doch noch was Endrekursion bzw. tail recursive optimization.

Endrekursion

Eine rekursive Funktion f ist endrekursiv (englisch tail recursive; auch endständig rekursiv, iterativ rekursiv, repetitiv rekursiv), wenn der rekursive Funktionsaufruf die letzte Aktion zur Berechnung von f ist. Vorteil dieser Funktionsdefinition ist, dass kein zusätzlicher Speicherplatz zur Verwaltung der Rekursion benötigt wird.

Source Wikipedia: https://de.wikipedia.org/wiki/Endrekursion

Endrekursion

Rekursive Funktionen sollten vom Compiler effektiv optimiert werden, wenn:

der rekursive Aufruf in der letzten Zeile erfolgt und
alle lokalen Variablen als Parameter übergeben werden.

Tail Recursive Optimization

int pixel_detail(int iter, COMPLEX_TYPE z, 
                 const COMPLEX_TYPE c, const int max_iter) 
{
  if (iter == max_iter) 
    return max_iter;
  z = z * z + c;
  if ((z.imag() * z.imag() + z.real() * z.real() >= FLOAT_TYPE(4.0)))
    return iter;

  return inner(iter + 1, z, c, max_iter);
}

Keine lokalen Variablen,
die letzte Zeile enthält den rekursiven Aufruf.

Compiler-Vergleich

-	GCC	VS2015
complex (float)	13.3 MPixel/s	10.8 MPixel/s
complex (double)	13.2 MPixel/s	10.7 MPixel/s
cmplx tail rec. (float)	13.4 MPixel/s	2.3 MPixel/s
cmplx tail rec. (double)	13.3 MPixel/s	5.2 MPixel/s
float rec.	6.2 MPixel/s	5.6 MPixel/s
double rec.	6.2 MPixel/s	5.7 MPixel/s
float tail rec.	6.2 MPixel/s	5.7 MPixel/s
double tail rec.	6.2 MPixel/s	5.7 MPixel/s

mandel::Complex statt std::complex

Zusammenfassung

tail recursive optimization

Nette Idee - kann funktionieren - muss aber nicht

Keine Schleifen

Man soll in C++ keine Schleifen selbst programmieren, sondern (STL-)Algorithmen verwenden!

Für den Mandelbrot Algorithmus ist das schwierig, da die Anzahl der Iterationen nicht vorhersehbar ist.

Aber die for-Schleifen über alle Pixel (den Bildschirm) sollte funktionieren.

Berechne ein ganzes Bild (Wdh.)

Überarbeitetes Beispiel:

    const int disp_width = 179, disp_height = 80;
    const double start_real = -1.8, start_imag = -1.5;
    const double scale_real = 3.0 / disp_width;
    const double scale_imag = 3.0 / disp_height;

    for (int j = 1; j < disp_height; ++j) {
      for (int i = 1; i < disp_width; ++i) {
        const double re = start_real + i * scale_real;
        const double im = start_imag + j * scale_imag;
        const int iter = pixel(re, im);
        cout << transform(iter);
      }
      cout << '\n';
    }

Bildschirmauflösung 178x79
Wertebereich: [-1.8..1.2; -1.5..1.5] 

i,j: Bildschirmkoordinaten
re,im: Korrdinaten im einer Komplexen Ebende

Range über alle Pixel

Randbedingungen

Ausschnitt der Komplexen Zahlenebene.
Ein Pixel ist ein Wert der Komplexen Zahlenebene.
Der Ergebnis wird in einem std::vector gespeichert.

(logische) Erweiterung von boost::range::iterator_range bzw. irange

template<class Integer, class StepSize>
iterator_range< range_detail::integer_iterator_with_step<Integer, StepSize> >
irange(Integer first, Integer last, StepSize step_size);

parallel stl / array_view

2D Integer range from array_view proposal

std::experimental::D4087::bounds<2> bnd{ width, height };
std::experimental::parallel::for_each(
    std::experimental::parallel::par, std::begin(bnd), std::end(bnd), 
        [&](std::experimental::D4087::index<2> idx) {

        const auto i = idx[0];
        const auto j = idx[1];
        const double re = start_real + i * scale_real;
        const double im = start_imag + j * scale_imag;
        const auto iter = pixel(val.real(), val.imag());
        const color = transform(iter); // num iterationen -> color
        ...
}

Parallel STL on CodePlex bzw. das array_view proposal:
Multidimensional bounds, offset and array_view, revision 7

Beispiel

ComplexRange2D<complex<double>> range{ -1.0, 1.0, -2.0, 2.0, 3, 5 };
for (const auto v : range) {
    cout << v << " ";
}
cout << endl;

Wertebereich: [-1.0..1.0; -2.0..2.0]
Bildschirmauflösung 3x5 Pixel

(-1,-2) (0,-2) (1,-2) (-1,-1) (0,-1) (1,-1) (-1,0) (0,0) (1,0) (-1,1) (0,1) (1,1) (-1,2) (0,2) (1,2)

Mandelbrot ala `boost::irange`

using cmplx = complex<double>;
Color t(int iteration);  // num iterationen -> color
vector<Color> picture(width*height);
...
ComplexRange2D<cmplx> range(
    start_real, end_real,  
    start_imag, end_imag, 
    width, height);

transform(begin(range), end(range), begin(picture), 
    [&pixel, &t](cmplx val) {
        const auto iter = pixel(val.real(), val.imag());
        return t(iter);
});

Alles ausser SIMD

Warum kein SIMD?

SIMD ist zu wichtig (für die Mandelbrot-Berechnung) um in einem Unterabschnitt abgehandelt zu werden.

Teaser:

Rust ist schneller als C++
benchmarkgame, mandelbrot
Benchmark

The Computer Language Benchmarks Game

http://benchmarksgame.alioth.debian.org/u64q/performance.php?test=mandelbrot

Rang	Language	secs	KB	gz
1.0	Rust	5.10	65,732	868
1.1	C++ g++	5.82	33,964	726
1.2	C gcc	5.92	32,464	694

SIMD Performance Vergleich

Methode	float	double
Sequentiell	12.6 MPixel/s	12.5 MPixel/s
SSE	37.6 MPixel/s	20.3 MPixel/s
AVX*	17.4 MPixel/s	9.0 MPixel/s
AVX2	63,5 MPixel/s	37.4 MPixel/s
TBB + SSE	91.2 MPixel/s	54.2 MPixel/s
TBB + AVX*	55.9 MPixel/s	29.7 MPixel/s
TBB + AVX2	145.0 MPixel/s	88.9 MPixel/s

Visual Studio 2015 | i5-5200U CPU @ 2.20GHz
c++ vector class by Agner Fog
(*) Ergebnis unerwartet

Fragen

[Code-Guck] -> Bei Bedarf

This is the end

Door, 1967

Benchmark	GCC 4.8.5	VS2015
BM_index_calc	43858 ns	29953 ns
BM_index_add	10581 ns	23200 ns

z=z^2+c

Disclaimer

Inhalt

The Fractal Geometry of Nature

The Fractal Geometry of Nature

Computer Kurzweil III

Computer Kurzweil III

Ohne FPU

Das Mandelbrot Kata

Mandelbrot Kata

Mandelbrot Kata: lua

Ziel des Mandelbrot Katas

Nebenwirkungen

Computerkurzweil in C++

Berechne ein ganzes Bild

Transform

Demo

Demo

Ohne komplexe Zahlen

Vergleich: std::complex vs. float

Vergleich: std::complex vs. float

Mathematik der Komplexen Zahlen

für C++-Programmierer

Operatoren

std::abs

Beispiel Rechnungen

Berechne (0, 0i)

Berechne (-2, 0i)

Pythagoras als Kreisformel

Berechne (-1, 0i)

Regionen des Chaos

Grenzwerte

Mehr Informationen

Optimierungen

Optimierung der Schleife

Optimierung der Schleife

Optimierung der Schleife

Optimierung der Schleife

Schleifen ausrollen

Demo

Loop unrolling

Literale

Benchmark

Raus aus der Schleife

Schritt Berechnung

Schritt Berechnung

Schritt Berechnung

Vermeide (teure) Operationen

Vermeide (teure) Operationen

Vermeide (teure) Operationen

Datentypen

Warum ist int so langsam?

Warum ist int so langsam?

Warum ist long double so langsam (auf Linux)?

Warum ist long double so langsam (auf Linux)?

Antwort A:

Kein SIMD für long double

Antwort B:

Ist std::complex langsam?

Complex types

Zusammenfassung

Multithreading

OpenMP

Intel TBB

Intel TBB vs OpenMP

OpenMP

Intel TBB

Performance Vergleich

Erste Analyse

Multi-Core Archithektur

Hyper-Threading Tests

Warum ist OpenMP so langsam

Warum ist OpenMP so langsam

Performance Vergleich

Benchmarks and Multithreading

Demo

Mehr C++

Rekursion

Ein erster Versuch

Endrekursion

Warum ist `int` so langsam?

Warum ist `int` so langsam?

Warum ist `long double` so langsam (auf Linux)?

Warum ist `long double` so langsam (auf Linux)?

Kein SIMD für `long double`

Mandelbrot ala `boost::irange`