stammtisch 06.05.2020

2020-05-07 14:54:28 +02:00 · 2020-05-07 14:54:28 +02:00 · 9e5eed71ff
commit 9e5eed71ff
parent c55a1cab4b
5 changed files with 904 additions and 1 deletions
--- a/post/CCC_Why_what_is_SIMD.md
+++ b/post/CCC_Why_what_is_SIMD.md
@ -0,0 +1,259 @@
+# ARMv7-l SIMD and using NEON
+
+## Motivation for SIMD
+
+<img src="pics/Adaline.jpg">
+
+[(WikiCommons:Adaline)](https://commons.wikimedia.org/wiki/File:Adaline.jpg)
+
+
+$\hat{y} = f(W\times \hat{x}+B)$
+
+For a time series:
+
+$X = \left[\hat{x_0},\hat{x_1},\dots,\hat{x_n}\right]$
+
+We get:
+
+$Y = f(W\times X+B)$
+
+### That is what TensorFlow, NumPy and lots of others are good at ...
+
+## Matrix Multiplication
+
+For each element of the resulting matrix, a scalar product of a specific row of the matrix W and a specific column of the matrix X is required.
+
+<img src="pics/mxm.png">
+
+Now that takes a while:
+
+Simple approach in C
+
+~~~
+   for (c = 0; c < m; c++) {
+      for (d = 0; d < q; d++) {
+        for (k = 0; k < p; k++) {
+          sum = sum + first[c][k]*second[k][d];
+        }
+ 
+        multiply[c][d] = sum;
+        sum = 0;
+      }
+    }
+~~~
+
+For matrices of shape (100,100) x (100,1000) that requires 100*100*1000 = 10 million multiply-add operations (an operation count, not a rate in FLOPS). What can be done to optimize the speed?
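
The operation count can be checked with a small sketch. The function name `naive_matmul` and the flat array layout (column index running fastest) are assumptions made for this illustration, not part of the original example. It performs the same triple loop as above and returns the number of multiply-add steps it executed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: c (m x q) = a (m x p) * b (p x q), all stored as
 * flat arrays with the column index running fastest.
 * Returns the number of multiply-add steps that were executed. */
long naive_matmul(const float *a, const float *b, float *c,
                  int m, int p, int q) {
  long madds = 0;
  assert(a != NULL && b != NULL && c != NULL);
  for (int i = 0; i < m; i++) {
    for (int j = 0; j < q; j++) {
      float sum = 0.0f;
      for (int k = 0; k < p; k++) {
        sum += a[i * p + k] * b[k * q + j];
        madds++;
      }
      c[i * q + j] = sum;
    }
  }
  return madds;
}
```

The returned count is always m*p*q, so for m = 100, p = 100 and q = 1000 it is 10,000,000.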
+
+## Techniques to optimize the calculation
+
+### Only splitting the Matrix
+
+Two reasons:
+1. optimize cache usage (**not today**)
+2. using **SIMD power**
+
+<img src="pics/matrix.png">
+
+
+# SIMD (single instruction multiple data)
+
+### Just a few words on inline assembler in C (or C++)
+
+For assembler examples see the [GCC-Inline-Assembly-HOWTO](https://www.ibiblio.org/gferg/ldp/GCC-Inline-Assembly-HOWTO.html).
+
+The simplest one, which works on x86:
+
+~~~
+#include <stdio.h>
+
+int main(void)
+{
+        int foo = 10, bar = 15;
+        asm volatile ("addl  %%ebx,%%eax"
+                      :"=a"(foo)
+                      :"a"(foo), "b"(bar));
+        printf("foo+bar=%d\n", foo);
+        return 0;
+}
+~~~
+
+From the gcc manual
+~~~
+asm asm-qualifiers ( AssemblerTemplate 
+                 : OutputOperands 
+                 [ : InputOperands
+                 [ : Clobbers ] ])
+~~~
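
To make the four parts of the template concrete, here is a minimal sketch that labels them in a real statement. The function names `add_asm` and `add_asm_demo` are hypothetical. The sketch assumes GCC or Clang; on targets other than x86 and 32 bit ARM it falls back to plain C, so it only illustrates the syntax and is not meant as a portable technique.

```c
#include <assert.h>

/* Hypothetical helper: returns a + b. On known targets the addition is
 * done by one inline-assembler instruction, otherwise by plain C.
 * __asm__ is used instead of asm so that it also compiles with -std=c99. */
int add_asm(int a, int b) {
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
  __asm__ volatile ("addl %1, %0"   /* AssemblerTemplate: a = a + b   */
                    : "+r"(a)       /* OutputOperands: a, read+write  */
                    : "r"(b)        /* InputOperands: b, read only    */
                    : "cc");        /* Clobbers: the condition flags  */
#elif defined(__GNUC__) && defined(__arm__)
  __asm__ volatile ("add %0, %0, %1"
                    : "+r"(a)
                    : "r"(b));
#else
  a = a + b;                        /* fallback without assembler */
#endif
  return a;
}

/* Usage, mirroring the foo/bar example above. */
int add_asm_demo(void) {
  int foo = 10, bar = 15;
  foo = add_asm(foo, bar);
  assert(foo == 25);
  return foo;
}
```

The `"+r"` constraint marks an operand that is both read and written, which avoids listing the same variable once as output and once as input.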
+
+## My sources
+
+All points are taken from the
+[NEON™ Version: 1.0 Programmer’s Guide](https://static.docs.arm.com/den0018/a/DEN0018A_neon_programmers_guide_en.pdf)
+
+- general idea behind SIMD (not talking about MIMD)
+- ARM NEON comparison with others (1.2 - pp 1-4)
+- the instruction timing is not clear; like all calculations, it depends mainly on data fetching time.
+- Fundamentals of NEON technology (1.4 - pp 1-10)
+  - 1.4.1 Registers q, d, s
+  - 1.4.2 Datatypes
+
+
+
+
+### What is it
+
+With a single instruction a vector (or other structures) can be calculated in parallel.
+
+One assembler instruction multiplies/adds a whole vector of 4 floats; 4 x 4 patches are processed this way:
+
+<img src="pics/matrix-simd.png">
+
+Each of the 9 patches requires 4 x 4 = 16 SIMD multiply/add instructions (`vmla.f32`), compared to 4 x 4 x 4 = 64 scalar operations.
+
+## Remark about this document
+
+This study is only meant to give a better understanding of the SIMD instructions and the SIMD performance of
+the ARMv7-A architecture (the core actually used is a Cortex-A53, but the OS supports only the 32 bit
+alternative).
+
+## Documents and Sources
+
+[ARM ® and Thumb ® -2 Instruction Set](http://infocenter.arm.com/help/topic/com.arm.doc.qrc0001m/QRC0001_UAL.pdf)
+
+[ARM Architecture Reference Manual ARMv7-A and ARMv7-R](https://static.docs.arm.com/ddi0406/c/DDI0406C_C_arm_architecture_reference_manual.pdf)
+
+The 
+[NEON™ Version: 1.0 Programmer’s Guide](https://static.docs.arm.com/den0018/a/DEN0018A_neon_programmers_guide_en.pdf) provides all the information required to really do SIMD on ARMv7-A and ARMv7-R. 
+The document explains the register structure of the single, double and 128 bit registers as well as the instructions. 
+
+Besides other examples (Swapping color channel, FIR, 
+cross product), 
+there is also an example for matrix matrix multiplication. 
+
+The example examined here is based on this document and the 4 x 4 matrix multiplication given (chapter 7.1, pp. 115.)
+
+## About the example: my_sgemm
+
+The matrix-matrix multiplication calculates patches of 4 x 4 at a time; the rest of the
+calculation is straightforward.
+
+~~~
+for (i ...)
+  for (j ...)
+     for (k ...)
+~~~
+The inner loop calls the optimized 4 x 4 multiplication.
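
The loop structure can be sketched in portable C. This is only an illustration under stated assumptions: `patch_4x4` is a hypothetical scalar stand-in for the optimized kernel, `blocked_matmul` is a hypothetical driver, and the strides are given in elements rather than in bytes.

```c
#include <assert.h>

/* Scalar stand-in for the optimized kernel: out (4x4) += a (4x4) * b (4x4).
 * lda, ldb and ldo are row strides in elements (hypothetical parameters). */
void patch_4x4(const float *a, int lda, const float *b, int ldb,
               float *out, int ldo) {
  for (int r = 0; r < 4; r++)
    for (int c = 0; c < 4; c++)
      for (int t = 0; t < 4; t++)
        out[r * ldo + c] += a[r * lda + t] * b[t * ldb + c];
}

/* out (n x q) += a (n x p) * b (p x q). n, p and q must be multiples of 4
 * and out must be initialized by the caller, for example to zero. */
void blocked_matmul(const float *a, const float *b, float *out,
                    int n, int p, int q) {
  assert(n % 4 == 0 && p % 4 == 0 && q % 4 == 0);
  for (int i = 0; i < n; i += 4)        /* patch row of the output    */
    for (int j = 0; j < q; j += 4)      /* patch column of the output */
      for (int k = 0; k < p; k += 4)    /* accumulate over the patches */
        patch_4x4(&a[i * p + k], p, &b[k * q + j], q, &out[i * q + j], q);
}
```

Because every patch product is accumulated into the output, the result equals the plain triple loop as long as the output starts at zero.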
+
+
+## Shape of the matrices
+
+All matrices are stored as flat C arrays in which the column index runs fastest (row-major order; called column-based here). matrix_a is regular and matrix_b is transposed. (Therefore, all scalar products
+of columns [B] with rows of [A] are column $\times$ column multiplications.)
+
+With matrix_b already holding the transposed matrix $B^\mathsf{T}$, the calculation performed is
+
+$C = A \times B^\mathsf{T} + C$
+
+Assuming the matrix A contains n rows and m columns, then
+the element A[i,j] has the index i * m + j in the C array representing the matrix.
+If we want to extract a patch out of the matrix,
+A[k:k+4,l:l+4], the four rows of the patch can be located as follows:
+- the first row starts at k*m+l
+- the next row starts at some offset o = m-4 after the end of the previous patch row.
+- same for the third and fourth rows.
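
The index arithmetic can be written down as a short sketch. The helper name `extract_patch_4x4` is hypothetical; it walks the flat array exactly as described, reading four consecutive elements and then skipping o = m-4 elements. The assembler code uses the same idea with offsets in bytes, which is why the C driver passes 4*(n_cols-4).

```c
#include <assert.h>

/* Hypothetical helper: copy the patch A[k:k+4, l:l+4] of a flat matrix
 * with m columns (element A[i,j] at index i*m + j) into patch[16]. */
void extract_patch_4x4(const float *mat, int m, int k, int l,
                       float patch[16]) {
  assert(m >= 4);
  int idx = k * m + l;                 /* first row starts at k*m + l */
  const int o = m - 4;                 /* gap up to the next patch row */
  for (int r = 0; r < 4; r++) {
    for (int c = 0; c < 4; c++)
      patch[4 * r + c] = mat[idx++];   /* four consecutive elements */
    idx += o;                          /* skip the rest of the matrix row */
  }
}
```

For a 12 x 12 matrix filled with 0, 1, 2, ... the patch at k = 4, l = 8 starts with the element 4*12+8 = 56.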
+
+## The assembler SIMD part for the 4 x 4 multiplication
+
+Purpose of the 4x4 matrix multiplication: it multiplies small 4 x 4 patches of some large
+column-based matrices. Important to know: matrix_a is regular,
+matrix_b is transposed.
+
+~~~
+static inline void my_sgemm_4x4(float *matrix_a, float *matrix_b,
+                                float *output,
+                                int off_a, int off_b, int off_o ) {
+  /** \code */
+  asm volatile (
+    "# Start manual code \n\t"
+    "# Matrix Multiplication \n\n\t"
+~~~
+Macro section
+This macro performs the actual multiplication. It accumulates one output row (res_q) from four floats of matrix_a and the four vectors of matrix_b held in q8 - q11. The four matrix_a floats are passed as col0_d and col1_d (two 64 bit d registers, which together form one 128 bit q register); each vmla.f32 multiplies one of
+q8-q11 by one of these scalars and adds the product to res_q.
+~~~
+    ".macro  mul_col_f32 res_q, col0_d, col1_d\n\t"
+    "vmla.f32 \\res_q, q8, \\col0_d[0]\n\t"
+    "vmla.f32 \\res_q, q9, \\col0_d[1]\n\t"
+    "vmla.f32 \\res_q, q10, \\col1_d[0]\n\t"
+    "vmla.f32 \\res_q, q11, \\col1_d[1]\n\t"
+    ".endm\n\n\t"
+~~~
+End macro section
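
What the macro computes can be mirrored in scalar C. This is a sketch for understanding only: `mul_col_f32_scalar` is a hypothetical name, and the inner loop stands for the four lanes that a single `vmla.f32` processes at once.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar sketch of mul_col_f32. res plays the role of res_q, the rows
 * b[0..3], b[4..7], b[8..11], b[12..15] the roles of q8..q11, and a[0..3]
 * the four scalars col0_d[0], col0_d[1], col1_d[0], col1_d[1]. */
void mul_col_f32_scalar(float res[4], const float b[16], const float a[4]) {
  assert(res != NULL && b != NULL && a != NULL);
  for (int k = 0; k < 4; k++)              /* one vmla.f32 per k */
    for (int lane = 0; lane < 4; lane++)   /* the lanes run in parallel on NEON */
      res[lane] += b[4 * k + lane] * a[k];
}
```

Four calls of this function, one per output row, correspond to the four macro invocations in the assembler code.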
+
+Start loading the 128 bit registers with 4 single-precision floats each. q12-q15 are loaded first
+with the current state of the output.
+
+After each register is loaded, some
+offset has to be added to the pointer, since the next row of the patch starts at some offset. The same
+mechanism applies to all matrices.
+
+
+load current state of output -> q12 - q15
+~~~
+    "vld1.32 {q12}, [%6]!\n\t"
+    "add %6, %6, %5\n\t"        /* add some offset until start of next row */
+    "vld1.32 {q13}, [%6]!\n\t"
+    "add %6, %6, %5\n\t"
+    "vld1.32 {q14}, [%6]!\n\t"
+    "add %6, %6, %5\n\t"
+    "vld1.32 {q15}, [%6]!\n\t"
+~~~
+load matrix_b (transposed!) -> q8 - q11
+~~~
+    "vld1.32 {q8}, [%2]!\n\t"
+    "add %2, %2, %4\n\t"
+    "vld1.32 {q9}, [%2]!\n\t"
+    "add %2, %2, %4\n\t"
+    "vld1.32 {q10}, [%2]!\n\t"
+    "add %2, %2, %4\n\t"
+    "vld1.32 {q11}, [%2]!\n\t"
+~~~
+load matrix_a -> q0 - q3
+~~~
+    "vld1.32 {q0}, [%1]!\n\t"
+    "add %1, %1, %3\n\t"
+    "vld1.32 {q1}, [%1]!\n\t"
+    "add %1,%1, %3\n\t"
+    "vld1.32 {q2}, [%1]!\n\t"
+    "add %1, %1, %3\n\t"
+    "vld1.32 {q3}, [%1]!\n\t"
+~~~
+End load registers
+
+Start doing the actual matrix multiplication as defined in macro
+~~~  
+    "mul_col_f32 q12, d0, d1\n\t"
+    "mul_col_f32 q13, d2, d3\n\t"
+    "mul_col_f32 q14, d4, d5\n\t"
+    "mul_col_f32 q15, d6, d7\n\n\t"
+ ~~~
+store the result [q12 - q15] into output
+ ~~~
+    "vst1.32 {q12}, [%0]!\n\t"
+    "add %0, %0, %5\n\t"
+    "vst1.32 {q13}, [%0]!\n\t"
+    "add %0, %0, %5\n\t"
+    "vst1.32 {q14}, [%0]!\n\t"
+    "add %0, %0, %5\n\t"
+    "vst1.32 {q15}, [%0]!\n\t"
+~~~
+start argument section of inline assembler
+~~~
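+    :"+r"((long) output)
+    :"r"(&matrix_a[0]),"r"(&matrix_b[0]),"r"(off_a),"r"(off_b),
+     "r"(off_o),"r"(&output[0])
+    /* clobbers: NEON registers used above and the memory written via %0 */
+    :"q0","q1","q2","q3","q8","q9","q10","q11",
+     "q12","q13","q14","q15","memory");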
+  /** \endcode */
+  return;
+}
+~~~
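
For comparison, the same 4 x 4 kernel can be sketched with NEON intrinsics instead of inline assembler. This is not the implementation used in this talk: the function name `sgemm_4x4_sketch` and the stride parameters (in floats, not bytes) are hypothetical. It assumes a compiler that provides `arm_neon.h`; on other targets a scalar fallback with the same result is compiled.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* out (4x4) += a (4x4) * b (4x4). lda, ldb and ldo are row strides in
 * floats (hypothetical parameters; the original passes byte offsets). */
void sgemm_4x4_sketch(const float *a, const float *b, float *out,
                      size_t lda, size_t ldb, size_t ldo) {
  assert(a != NULL && b != NULL && out != NULL);
#if defined(__ARM_NEON)
  /* the four rows of the b patch, as q8 - q11 in the assembler version */
  float32x4_t b0 = vld1q_f32(b);
  float32x4_t b1 = vld1q_f32(b + ldb);
  float32x4_t b2 = vld1q_f32(b + 2 * ldb);
  float32x4_t b3 = vld1q_f32(b + 3 * ldb);
  for (int i = 0; i < 4; i++) {
    float32x4_t acc = vld1q_f32(out + i * ldo);  /* current output row */
    float32x4_t ar = vld1q_f32(a + i * lda);     /* four floats of a   */
    acc = vmlaq_lane_f32(acc, b0, vget_low_f32(ar), 0);
    acc = vmlaq_lane_f32(acc, b1, vget_low_f32(ar), 1);
    acc = vmlaq_lane_f32(acc, b2, vget_high_f32(ar), 0);
    acc = vmlaq_lane_f32(acc, b3, vget_high_f32(ar), 1);
    vst1q_f32(out + i * ldo, acc);
  }
#else
  /* scalar fallback: same arithmetic, one element at a time */
  for (int i = 0; i < 4; i++)
    for (int k = 0; k < 4; k++)
      for (int j = 0; j < 4; j++)
        out[i * ldo + j] += a[i * lda + k] * b[k * ldb + j];
#endif
}
```

With intrinsics the compiler does the register allocation, so no operand or clobber lists have to be maintained by hand.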
+
--- a/post/matrix_matrix.c
+++ b/post/matrix_matrix.c
@ -0,0 +1,300 @@
+#include <stdio.h>
+#include <string.h>
+#include <math.h>
+#include <sys/time.h>
+
+/**
+ *  gcc -o mm -g -march=armv7 -mfpu=neon-vfpv4 matrix_matrix.c
+ *  or cross compile
+ *  arm-linux-gnueabihf-gcc -o mm -g -march=armv7 -mfpu=neon-vfpv4 matrix_matrix.c
+ *
+ * test with a = np.arange(4*n*4*m).reshape(4*n,4*m)
+ * a.dot(a.T)
+ * 
+ * this file contains all the elements to show ARMv7-l usage of 
+ * SIMD architecture
+ *
+ */
+
+void simple_transpose( float *, int, int, float *);
+static inline void my_sgemm_4x4(float *, float *, float *,
+				int, int, int );
+
+/**
+ * helper routine (only for documentation and demonstration)
+ * 
+ * transpose some matrix matrix_a (size n_rows, n_cols) 
+ * into matrix output
+ *
+ */
+
+void simple_transpose( float *matrix_a, int n_rows_a, int n_cols_a,
+		       float *output ) {
+  for (int ra = 0; ra < n_rows_a; ra++)
+    for (int ca = 0; ca < n_cols_a; ca++ )
+      output[n_rows_a*ca+ra] = matrix_a[n_cols_a*ra+ca];
+      //output[ra][ca] = matrix_a[ca][ra];
+  return;
+}
+
+
+/**
+ * Kernel function including the optimization and the 
+ * 4 x 4 multiplication of 4 x 4 fragments of the large column-based
+ * matrices matrix_a and matrix_b
+ *
+ * arguments:
+ * - (float *) matrix_a: start of a 4 x 4 patch of matrix_a,
+ * - (float *) matrix_b: start of a 4 x 4 patch of matrix_b,
+ * - (float *) output: start of the 4 x 4 result patch (updated in place)
+ * - (int) off_a,b,o: offset in bytes between the last patch element of
+ *         one row and the first patch element of the next row of
+ *         matrix_a, matrix_b and output
+ * 
+ * details documented here ![Using_SIMD](/home/eduard/work/wikiwhat/doc/Using_SIMD.md)
+ */
+
+static inline void my_sgemm_4x4(float *matrix_a, float *matrix_b,
+				float *output,
+				int off_a, int off_b, int off_o ) {
+  /** \code */
+  asm volatile (
+    "# Start manual code \n\t"
+    "# Matrix Multiplication \n\n\t"
+    /* Macro section */
+    ".macro  mul_col_f32 res_q, col0_d, col1_d\n\t"
+    "vmla.f32 \\res_q, q8, \\col0_d[0]\n\t"
+    "vmla.f32 \\res_q, q9, \\col0_d[1]\n\t"
+    "vmla.f32 \\res_q, q10, \\col1_d[0]\n\t"
+    "vmla.f32 \\res_q, q11, \\col1_d[1]\n\t"
+    ".endm\n\n\t"
+    /* end macro section */
+    /* load current state of output -> q12 - q15 */
+    "vld1.32 {q12}, [%6]!\n\t"
+    "add %6, %6, %5\n\t"        /* add some offset until start of next row */
+    "vld1.32 {q13}, [%6]!\n\t"
+    "add %6, %6, %5\n\t"
+    "vld1.32 {q14}, [%6]!\n\t"
+    "add %6, %6, %5\n\t"
+    "vld1.32 {q15}, [%6]!\n\t"
+    /* load matrix_b (transposed!) -> q8 - q11 */
+    "vld1.32 {q8}, [%2]!\n\t"
+    "add %2, %2, %4\n\t"
+    "vld1.32 {q9}, [%2]!\n\t"   
+    "add %2, %2, %4\n\t"
+    "vld1.32 {q10}, [%2]!\n\t"   
+    "add %2, %2, %4\n\t"
+    "vld1.32 {q11}, [%2]!\n\t"   
+    /* load matrix_a -> q0 - q3 */
+    "vld1.32 {q0}, [%1]!\n\t"   
+    "add %1, %1, %3\n\t"
+    "vld1.32 {q1}, [%1]!\n\t"   
+    "add %1,%1, %3\n\t"
+    "vld1.32 {q2}, [%1]!\n\t"   
+    "add %1, %1, %3\n\t"
+    "vld1.32 {q3}, [%1]!\n\t"
+    /* end load registers
+     * start doing the actual matrix multiplication as defined in macro */
+    "mul_col_f32 q12, d0, d1\n\t"
+    "mul_col_f32 q13, d2, d3\n\t"
+    "mul_col_f32 q14, d4, d5\n\t"
+    "mul_col_f32 q15, d6, d7\n\n\t"
+    /* store the result [q12 - q15] into output */
+    "vst1.32 {q12}, [%0]!\n\t"
+    "add %0, %0, %5\n\t"
+    "vst1.32 {q13}, [%0]!\n\t"
+    "add %0, %0, %5\n\t"
+    "vst1.32 {q14}, [%0]!\n\t"
+    "add %0, %0, %5\n\t"
+    "vst1.32 {q15}, [%0]!\n\t"
+    /* start argument section of inline assembler */
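+    :"+r"((long) output)
+    :"r"(&matrix_a[0]),"r"(&matrix_b[0]),"r"(off_a),"r"(off_b),
+     "r"(off_o),"r"(&output[0])
+    /* clobbers: NEON registers used above and the memory written via %0 */
+    :"q0","q1","q2","q3","q8","q9","q10","q11",
+     "q12","q13","q14","q15","memory");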
+  /** \endcode */
+  return;
+}
+
+/**
+ * matrix matrix multiplication of some matrix_a and some matrix_b
+ * (works only for sizes that are multiples of 4: 4*n x 4*m)
+ * the order is column based and output = output + a x b,
+ * where main() passes b = a.transpose()
+ *
+ * the multiplication is based on the patch-wise standard multiplication
+ * algorithm, each patch of size 4 x 4
+ */
+
+void my_sgemm(float *matrix_a, int n_rows_a, int n_cols_a,
+	      float *matrix_b, int n_rows_b, int n_cols_b,
+	      float *output ) {
+  int offset_a = 4*(n_cols_a-4);
+  int offset_b = 4*(n_cols_b-4);
+  for(int i=0;i<n_rows_a;i = i+4 ) {
+    for(int j=0;j<n_cols_b;j = j+4 ) {    
+      for(int k=0;k<n_cols_a;k = k+4 ) {    
+	my_sgemm_4x4(&matrix_a[n_cols_a*i+k],
+		     &matrix_b[n_cols_b*k+j], 
+		     &output[n_cols_b*i+j],
+		     offset_a, offset_b, offset_b);
+      }    
+    }
+  }
+  return;
+}
+
+/**
+ * standard algorithm for matrix matrix multiplication
+ * output = a.dot(b); main() passes b = a.transpose()
+ * - arguments:
+ *  - (float *) a: column-based matrix of size n_rows_a x n_cols_a
+ *  - (int)     n_rows_a, n_cols_a: size of matrix a
+ *  - (float *) b: column-based matrix of size n_rows_b x n_cols_b
+ *  - (int)     n_rows_b, n_cols_b: size of matrix b
+ *  - (float *) output: column-based matrix = a.dot(b)
+ * - return: void
+ */
+
+void simple_mm( float *a, int n_rows_a, int n_cols_a,
+		float *b, int n_rows_b, int n_cols_b,
+		float *output ) {
+  for(int i=0;i<n_rows_a;i++)    
+    for(int j=0;j<n_cols_b;j++) {    
+      output[n_cols_b*i+j]=0;    
+      for(int k=0;k<n_cols_a;k++) {    
+	output[n_cols_b*i+j]+=a[n_cols_a*i+k]*b[n_cols_b*k+j];    
+      }    
+    }    
+  return;
+}
+
+/**
+ * int main():
+ *     calls the simple and the optimized function and compares their speed
+ *     size defined by macros (N_ROWS_x, N_COLS_x: 4 * n, 4 * m)
+ *     Matrix defined as:
+ *     matrix_a = np.arange(N_ROWS_A*N_COLS_A).reshape(N_ROWS_A,N_COLS_A)
+ *     matrix_b = matrix_a.T
+ *     1. call optimized my_sgemm
+ *     2. call simple_mm
+ *     compare times, check identity
+ *     print results
+ * size of example matrix 
+ */
+
+#define N_COLS_A 256
+#define N_ROWS_A 256
+#define N_COLS_B N_ROWS_A
+#define N_ROWS_B N_COLS_A
+
+int main() {
+  float matrix_a[N_ROWS_A][N_COLS_A];
+  float matrix_b[N_ROWS_B][N_COLS_B];
+  float matrix_aa[N_ROWS_A][N_COLS_A];
+  float matrix_bb[N_ROWS_B][N_COLS_B];
+  float buffer[N_ROWS_A][N_COLS_B];
+  float reference[N_ROWS_A][N_COLS_B];
+  struct timeval t1, t2, t3;
+  long int durationf, durations;
+
+  /**
+   * matrix_a = np.arange(N_ROWS_A*N_COLS_A).reshape(N_ROWS_A,N_COLS_A)
+   */
+
+  for (int ra = 0; ra < N_ROWS_A; ra++) {
+     for (int ca = 0; ca < N_COLS_A; ca++) {
+       matrix_a[ra][ca] = N_COLS_A*ra+ca;
+       matrix_aa[ra][ca] = N_COLS_A*ra+ca;
+     }
+  }
+  
+  /**
+   * calculate matrix_b as matrix_a.T 
+   *
+   */
+  
+  simple_transpose(&matrix_a[0][0], N_ROWS_A, N_COLS_A,
+  		   &matrix_b[0][0]);
+
+  simple_transpose(&matrix_aa[0][0], N_ROWS_A, N_COLS_A,
+  		   &matrix_bb[0][0]);
+
+  /**
+   * set the outputs buffer (output of my_sgemm) and reference (simple_mm)
+   * to zero
+   */
+  
+  for (int ra = 0; ra < N_ROWS_A; ra++)
+    for (int cb = 0; cb < N_COLS_B; cb++) {
+      buffer[ra][cb] = 0.0;
+      reference[ra][cb] = 0.0;
+    }
+  
+  /**
+   * 1. set timer to t1 (start of optimized algorithm)
+   * 2. call optimized algorithm
+   */
+  
+  gettimeofday(&t1, NULL);
+
+  my_sgemm(&matrix_aa[0][0], N_ROWS_A, N_COLS_A,
+  	   &matrix_bb[0][0], N_ROWS_B, N_COLS_B,
+  	   &buffer[0][0]);
+  
+  /**
+   * 3. set timer to t2 (end of optimized and start of simple algorithm)
+   * 4. call simple algorithm
+   */
+  
+  gettimeofday(&t2, NULL);
+  
+  simple_mm(&matrix_a[0][0], N_ROWS_A, N_COLS_A,
+	    &matrix_b[0][0], N_ROWS_B, N_COLS_B,
+	    &reference[0][0]);
+  
+  /**
+   * 5. set timer to t3 (end of simple algorithm)
+   */
+  
+  gettimeofday(&t3, NULL);
+
+  /**
+   * calculate durations for optimized and simple algorithm
+   */
+  
+  durationf = 1e6*(t2.tv_sec - t1.tv_sec)+(t2.tv_usec - t1.tv_usec);
+  durations = 1e6*(t3.tv_sec - t2.tv_sec)+(t3.tv_usec - t2.tv_usec);
+
+  /**
+   * output (6 x 6 patch of the result of both algorithms)
+   */
+  
+  printf("my_sgemm\n");
+  for (int ra=0; ra<6; ra++ ) {
+    for (int cb=0; cb<6; cb++ ) printf("%.2e ", buffer[ra][cb]);
+    printf("\n");
+  }
+  printf("reference\n");
+  for (int ra=0; ra<6; ra++ ) {
+    for (int cb=0; cb<6; cb++ ) printf("%.2e ", reference[ra][cb]);
+    printf("\n");
+  }
+
+  /**
+   * calculate the squared error, summed over all elements (printed as MSE)
+   */
+  
+  float mse = 0.0F;
+  for (int ra=0; ra<N_ROWS_A; ra++ ) {
+    for (int cb=0; cb<N_COLS_B; cb++ ) {
+      mse += (reference[ra][cb]-buffer[ra][cb]) *
+	(reference[ra][cb]-buffer[ra][cb]);
+    }
+  }
+
+  /**
+   * print mse and ratio of times optimized_time / simple_time
+   */
+  
+  printf("MSE: %.5f [durationrate f/s %.5f]\n",mse,
+	 (float)durationf/(float)durations);
+  return 0;
+}
--- a/post/motivation_matrix_mult.ipynb
+++ b/post/motivation_matrix_mult.ipynb
@ -0,0 +1,318 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Motivation\n",
+    "## How to solve matrix-matrix multiplication"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import numpy as np\n",
+    "import pandas as pd"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Simple approach in C\n",
+    "\n",
+    "~~~\n",
+    "   for (c = 0; c < m; c++) {\n",
+    "      for (d = 0; d < q; d++) {\n",
+    "        for (k = 0; k < p; k++) {\n",
+    "          sum = sum + first[c][k]*second[k][d];\n",
+    "        }\n",
+    " \n",
+    "        multiply[c][d] = sum;\n",
+    "        sum = 0;\n",
+    "      }\n",
+    "    }\n",
+    "~~~"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def mymatrixmult(A,B):\n",
+    "    y = np.zeros((A.shape[0], B.shape[1]))\n",
+    "    for i in range(A.shape[0]):\n",
+    "        for j in range(B.shape[1]):\n",
+    "            for k in range(A.shape[1]):\n",
+    "                 y[i][j] += A[i][k]*B[k][j]\n",
+    "    return y\n",
+    "                \n",
+    "m = np.arange(40000).reshape(200,200)\n",
+    "m1 = m/np.average(m)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Takes a while (200**3) = 8 million multiply-adds"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "array([[6.61708085e-03, 1.65675784e-02, 2.65180759e-02, ...,\n",
+       "        1.96686509e+00, 1.97681559e+00, 1.98676609e+00],\n",
+       "       [1.65675784e-02, 4.65190759e-02, 7.64705735e-02, ...,\n",
+       "        5.91701260e+00, 5.94696409e+00, 5.97691559e+00],\n",
+       "       [2.65180759e-02, 7.64705735e-02, 1.26423071e-01, ...,\n",
+       "        9.86716010e+00, 9.91711260e+00, 9.96706510e+00],\n",
+       "       ...,\n",
+       "       [1.96686509e+00, 5.91701260e+00, 9.86716010e+00, ...,\n",
+       "        7.80145924e+02, 7.84096071e+02, 7.88046219e+02],\n",
+       "       [1.97681559e+00, 5.94696409e+00, 9.91711260e+00, ...,\n",
+       "        7.84096071e+02, 7.88066220e+02, 7.92036368e+02],\n",
+       "       [1.98676609e+00, 5.97691559e+00, 9.96706510e+00, ...,\n",
+       "        7.88046219e+02, 7.92036368e+02, 7.96026518e+02]])"
+      ]
+     },
+     "execution_count": 3,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "mymatrixmult(m1, m1.T)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "array([[6.61708085e-03, 1.65675784e-02, 2.65180759e-02, ...,\n",
+       "        1.96686509e+00, 1.97681559e+00, 1.98676609e+00],\n",
+       "       [1.65675784e-02, 4.65190759e-02, 7.64705735e-02, ...,\n",
+       "        5.91701260e+00, 5.94696409e+00, 5.97691559e+00],\n",
+       "       [2.65180759e-02, 7.64705735e-02, 1.26423071e-01, ...,\n",
+       "        9.86716010e+00, 9.91711260e+00, 9.96706510e+00],\n",
+       "       ...,\n",
+       "       [1.96686509e+00, 5.91701260e+00, 9.86716010e+00, ...,\n",
+       "        7.80145924e+02, 7.84096071e+02, 7.88046219e+02],\n",
+       "       [1.97681559e+00, 5.94696409e+00, 9.91711260e+00, ...,\n",
+       "        7.84096071e+02, 7.88066220e+02, 7.92036368e+02],\n",
+       "       [1.98676609e+00, 5.97691559e+00, 9.96706510e+00, ...,\n",
+       "        7.88046219e+02, 7.92036368e+02, 7.96026518e+02]])"
+      ]
+     },
+     "execution_count": 4,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "m1.dot(m1.T)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Now numpy: (2000**3) = 8 billion multiply-adds"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "M = np.arange(4000000).reshape(2000,2000)   \n",
+    "M1 = M/np.average(M) "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "array([[6.66167083e-04, 1.66566758e-03, 2.66516808e-03, ...,\n",
+       "        1.99666867e+00, 1.99766817e+00, 1.99866767e+00],\n",
+       "       [1.66566758e-03, 4.66516908e-03, 7.66467058e-03, ...,\n",
+       "        5.99167016e+00, 5.99466966e+00, 5.99766917e+00],\n",
+       "       [2.66516808e-03, 7.66467058e-03, 1.26641731e-02, ...,\n",
+       "        9.98667166e+00, 9.99167116e+00, 9.99667067e+00],\n",
+       "       ...,\n",
+       "       [1.99666867e+00, 5.99167016e+00, 9.98667166e+00, ...,\n",
+       "        7.98001466e+03, 7.98400966e+03, 7.98800466e+03],\n",
+       "       [1.99766817e+00, 5.99466966e+00, 9.99167116e+00, ...,\n",
+       "        7.98400966e+03, 7.98800666e+03, 7.99200366e+03],\n",
+       "       [1.99866767e+00, 5.99766917e+00, 9.99667067e+00, ...,\n",
+       "        7.98800466e+03, 7.99200366e+03, 7.99600267e+03]])"
+      ]
+     },
+     "execution_count": 7,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "M1.dot(M1.T)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 15,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "array([[  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11],\n",
+       "       [ 12,  13,  14,  15,  16,  17,  18,  19,  20,  21,  22,  23],\n",
+       "       [ 24,  25,  26,  27,  28,  29,  30,  31,  32,  33,  34,  35],\n",
+       "       [ 36,  37,  38,  39,  40,  41,  42,  43,  44,  45,  46,  47],\n",
+       "       [ 48,  49,  50,  51,  52,  53,  54,  55,  56,  57,  58,  59],\n",
+       "       [ 60,  61,  62,  63,  64,  65,  66,  67,  68,  69,  70,  71],\n",
+       "       [ 72,  73,  74,  75,  76,  77,  78,  79,  80,  81,  82,  83],\n",
+       "       [ 84,  85,  86,  87,  88,  89,  90,  91,  92,  93,  94,  95],\n",
+       "       [ 96,  97,  98,  99, 100, 101, 102, 103, 104, 105, 106, 107],\n",
+       "       [108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119],\n",
+       "       [120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131],\n",
+       "       [132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143]])"
+      ]
+     },
+     "execution_count": 15,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "n = 12\n",
+    "np.arange(n*n).reshape(n,n)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Splitting the matrix\n",
+    "\n",
+    "Two reasons:\n",
+    "1. optimize cache usage\n",
+    "2. using SIMD power\n",
+    "\n",
+    "<img src=\"pics/matrix.png\">"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## ?\n",
+    "\n",
+    "No idea, about the following: $y = tanh(M)$"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "y = np.tanh(M1.dot(M1.T))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 24,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "(1000, 1000)"
+      ]
+     },
+     "execution_count": 24,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "y.shape"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 51,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "16.0"
+      ]
+     },
+     "execution_count": 51,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "((4*128)**3)*16/((128)**3*64)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# The important part"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.6.9"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
--- a/post/stammtisch20200506.md
+++ b/post/stammtisch20200506.md
@ -0,0 +1,26 @@
+---
+date: "2020-05-06"
+tages: ["chaostreff", "veranstaltung", "protokoll"]
+title: "Short minutes 06.05.2020"
+---
+
+
+Chaostreff-relevant points:
+
+1. The Chaostreff now takes place regularly on the following dates:
+    1. 1st Wednesday of the month, from 19:42, location: Jitsi
+    2. 3rd Tuesday of the month, from 19:42, location: Jitsi
+
+    Jitsi link: https://meet.jit.si/chaoslb
+
+
+Some topics with and without links:
+
+1. There was an exciting talk by fritzthekit/eduard on SIMD and blazingly fast matrix computations on a
+   Raspberry Pi. Links to the [Jupyter notebook](motivation_matrix_mult.ipynb), the [C code](matrix_matrix.c) and the [presentation as Markdown](CCC_Why_what_is_SIMD.md)
+2. Go for small places: [TinyGo](https://tinygo.org/)
+3. Zoom's data protection assessments were a topic (they are now somehow half GDPR-compliant after all)
+4. ...
+
+
+The next Stammtisch is on 19.05.2020 with a talk by Harvey.