Merge pull request #3 from steffenfritz/master

merging upstream master
2020-05-20 19:15:30 +02:00 · 2020-05-20 19:15:30 +02:00 · 2a7ff83875
commit 2a7ff83875
parent c997a10e6a a6b8dd501e
10 changed files with 939 additions and 3 deletions
--- a/_index.md
+++ b/_index.md
@ -13,7 +13,7 @@ Der Chaostreff Ludwigsburg ist ein lockeres Zusammentreffen von Hackern, die sic
 #### When and where do you meet?
 Because of Covid-19, the next in-person meeting has not been scheduled yet. For now you can find us at https://matrix.complb.de
-(#chaostreff:matrix.complb.de) and every Wednesday on Jitsi (https://complb.de/covid1920200315/)
+(#chaostreff:matrix.complb.de) and twice a month on Jitsi (https://complb.de/stammtisch20200506/)
 You can also reach us by mail at chaostreff AT complb DOT de
--- a/post/CCC_Why_what_is_SIMD.md
+++ b/post/CCC_Why_what_is_SIMD.md
@ -0,0 +1,259 @@
 # ARMv7-l SIMD and using NEON
 ## Motivation for SIMD
 <img src="pics/Adaline.jpg">
 [(WikiCommons:Adaline)](https://commons.wikimedia.org/wiki/File:Adaline.jpg)
 $\hat{y} = f(W\times \hat{x}+B)$
 For a time series:
 $X = \left[\hat{x_0},\hat{x_1},\dots,\hat{x_n}\right]$
 We get:
 $Y = f(W\times X+B)$
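The formula above can be sketched in plain C for a single input vector. This is only an illustration: the function names `step` and `forward` are hypothetical, and the threshold activation is an assumption standing in for $f$, which the text does not specify.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical activation: a simple threshold, standing in for f. */
static float step(float v) { return v >= 0.0f ? 1.0f : -1.0f; }

/* y = f(W x + b) for one input vector x.
 * W is a flat array with n_out rows and n_in columns (index i * n_in + j). */
static void forward(const float *W, const float *b, const float *x,
                    float *y, size_t n_out, size_t n_in) {
  assert(W && b && x && y);
  for (size_t i = 0; i < n_out; i++) {
    float acc = b[i];
    for (size_t j = 0; j < n_in; j++)
      acc += W[i * n_in + j] * x[j];   /* scalar product of row i with x */
    y[i] = step(acc);
  }
}
```

For a time series, `forward` would be called once per column of $X$, which is exactly the matrix-matrix product discussed next.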
 ### That is what TensorFlow, NumPy and lots of others are good at ...
 ## Matrix Multiplication
 For each element of the resulting matrix, a scalar product of a specific row of the matrix W and a specific column of the matrix X is required.
 <img src="pics/mxm.png">
 Now that takes a while:
 Simple approach in C
 ~~~
   for (c = 0; c < m; c++) {
      for (d = 0; d < q; d++) {
        for (k = 0; k < p; k++) {
          sum = sum + first[c][k]*second[k][d];
        }
        multiply[c][d] = sum;
        sum = 0;
      }
    }
 ~~~
 For matrices of shape (100,100) x (100,1000) that requires 100*100*1000 = 10 million multiply-add operations. What can be done to optimize the speed?
 ## Techniques to optimize the calculation
 ### Only splitting the Matrix
 Two reasons:
 1. optimize cache usage (**not today**)
 2. using **SIMD power**
 <img src="pics/matrix.png">
 # SIMD (single instruction multiple data)
 ### Just a few words on inline assembler in C (or C++)
 For assembler examples see the [GCC-Inline-Assembly-HOWTO](https://www.ibiblio.org/gferg/ldp/GCC-Inline-Assembly-HOWTO.html).
 The simplest one, which works on x86:
 ~~~
 #include <stdio.h>
 int main(void)
 {
        int foo = 10, bar = 15;
        asm volatile ("addl  %%ebx,%%eax"
                      :"=a"(foo)
                      :"a"(foo), "b"(bar));
        printf("foo+bar=%d\n", foo);
        return 0;
 }
 ~~~
 From the gcc manual
 ~~~
 asm asm-qualifiers ( AssemblerTemplate 
                 : OutputOperands 
                 [ : InputOperands
                 [ : Clobbers ] ])
 ~~~
 ## My sources
 All points are taken from the
 [NEON TM Version: 1.0 Programmer’s Guide](https://static.docs.arm.com/den0018/a/DEN0018A_neon_programmers_guide_en.pdf)
 - general idea behind SIMD (not talking about MIMD)
 - ARM NEON comparison with others (1.2 - pp 1-4)
 - the instruction timing is not clear; like all calculations, it depends mainly on the data fetching time.
 - Fundamentals of NEON technology (1.4 - pp 1-10)
  - 1.4.1 Registers q, d, s
  - 1.4.2 Datatypes
 ### What is it
 With a single instruction a vector (or another structure) can be processed in parallel.
 One assembler instruction performs a multiply/add on a whole vector of a 4 x 4 patch:
 <img src="pics/matrix-simd.png">
 Each of the 9 patches requires 4 x 4 = 16 SIMD instructions of type vmla.f32 (multiply/add), compared to 4 x 4 x 4 = 64 scalar operations.
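The count of 16 instructions per patch can be checked with a small portable emulation. This is a sketch, not NEON code: the type `vec4`, the counter `simd_ops` and the functions `mla_scalar` and `patch_mul` are hypothetical names, and each call of `mla_scalar` stands in for one vector multiply/add.

```c
#include <assert.h>

typedef struct { float lane[4]; } vec4;   /* stands in for one 128-bit register */

static int simd_ops = 0;                  /* counts emulated vector instructions */

/* Emulates one vector multiply/add by a scalar: acc += v * s on all 4 lanes. */
static vec4 mla_scalar(vec4 acc, vec4 v, float s) {
  for (int i = 0; i < 4; i++) acc.lane[i] += v.lane[i] * s;
  simd_ops++;
  return acc;
}

/* out += a * b for 4 x 4 patches; each patch is an array of 4 row vectors. */
static void patch_mul(const vec4 a[4], const vec4 b[4], vec4 out[4]) {
  assert(a && b && out);
  for (int r = 0; r < 4; r++)
    for (int k = 0; k < 4; k++)
      out[r] = mla_scalar(out[r], b[k], a[r].lane[k]);
}
```

One call of `patch_mul` increments `simd_ops` by 16, while the inner loop of `mla_scalar` runs 64 scalar multiply/adds in total.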
 ## Remark about this document
 This study is only meant to give a better understanding of the SIMD instructions and the SIMD performance of
 the ARMv7-A architecture (the core actually used is a Cortex-A53, but the OS supports only the 32-bit
 mode).
 ## Documents and Sources
 [ARM ® and Thumb ® -2 Instruction Set](http://infocenter.arm.com/help/topic/com.arm.doc.qrc0001m/QRC0001_UAL.pdf)
 [ARM Architecture Reference Manual ARMv7-A and ARMv7-R](https://static.docs.arm.com/ddi0406/c/DDI0406C_C_arm_architecture_reference_manual.pdf)
 The
 [NEON TM Version: 1.0 Programmer’s Guide](https://static.docs.arm.com/den0018/a/DEN0018A_neon_programmers_guide_en.pdf) provides all the information required to really do SIMD on ARMv7-A and ARMv7-R.
 The document explains the register structure of the single, double and 128-bit registers as well as the instructions.
 Besides other examples (swapping color channels, FIR,
 cross product),
 there is also an example for matrix-matrix multiplication.
 The example examined here is based on this document and the 4 x 4 matrix multiplication given there (chapter 7.1, pp. 115).
 ## About the example: my_sgemm
 The matrix-matrix multiplication calculates patches of 4 x 4 at a time; the rest of the
 calculation is straightforward.
 ~~~
 for (i ...)
  for (j ...)
     for (k ...)
 ~~~
 The inner loop calls the optimized 4 x 4 multiplication.
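The three loops can be sketched in portable C as follows. This is a minimal illustration under assumptions: `patch_4x4` is a hypothetical plain C stand-in for the optimized kernel, `blocked_mm` is a hypothetical name, and both matrices are taken as stored (no extra transposition), with row lengths given in elements.

```c
#include <assert.h>

/* Plain C stand-in for the optimized kernel: out += a * b for one 4 x 4 patch.
 * lda, ldb, ldo are the row lengths (in elements) of the full matrices. */
static void patch_4x4(const float *a, const float *b, float *out,
                      int lda, int ldb, int ldo) {
  for (int r = 0; r < 4; r++)
    for (int k = 0; k < 4; k++)
      for (int c = 0; c < 4; c++)
        out[r * ldo + c] += a[r * lda + k] * b[k * ldb + c];
}

/* out (n x q) += a (n x p) * b (p x q); all sizes must be multiples of 4. */
static void blocked_mm(const float *a, const float *b, float *out,
                       int n, int p, int q) {
  assert(n % 4 == 0 && p % 4 == 0 && q % 4 == 0);
  for (int i = 0; i < n; i += 4)          /* patch row of the output */
    for (int j = 0; j < q; j += 4)        /* patch column of the output */
      for (int k = 0; k < p; k += 4)      /* accumulate along the shared dimension */
        patch_4x4(&a[i * p + k], &b[k * q + j], &out[i * q + j], p, q, q);
}
```

The output has to be set to zero before the call, because every patch multiplication accumulates into it.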
 ## Shape of the matrices
 All matrices in C are column-based, i.e. stored as flat arrays in which the column index runs fastest. matrix_a is regular and matrix_b is transposed. (Therefore, all scalar products
 of columns [B] with rows of [A] are column $\times$ column multiplications.)
 The calculation is performing
 $C = A \times B^\mathsf{T} + C$
 Assuming the matrix A contains n rows and m columns,
 the element A[i,j] has the index i * m + j in the C array representing the matrix.
 If we want to extract a patch out of the matrix,
 A[k:k+4,l:l+4], the four rows of the patch can be located as follows:
 - the first row starts at k*m+l
 - the next row starts after skipping an offset o = m-4 from the end of the previous patch row
 - the same holds for the third and fourth rows.
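The index arithmetic of the list above can be sketched as a small copy routine. The function name `extract_patch` and the dense 16-element target array are assumptions made only for this illustration.

```c
#include <assert.h>

/* Copy the patch A[k:k+4, l:l+4] of an n x m matrix into a dense 4 x 4 array. */
static void extract_patch(const float *A, int n, int m, int k, int l,
                          float patch[16]) {
  assert(k + 4 <= n && l + 4 <= m);
  int idx = k * m + l;          /* the first row of the patch starts here */
  int o = m - 4;                /* gap between the end of one patch row and the next */
  for (int r = 0; r < 4; r++) {
    for (int c = 0; c < 4; c++)
      patch[4 * r + c] = A[idx++];
    idx += o;                   /* skip the remainder of the matrix row */
  }
}
```

The assembler version works the same way, except that it advances byte addresses, so its offsets are 4 * (m - 4) for 4-byte floats.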
 ## The assembler SIMD part for the 4 x 4 multiplication
 Purpose of the 4 x 4 matrix multiplication: it multiplies a small 4 x 4 patch of some large
 column-based matrices. Important to know: matrix_a is regular,
 matrix_b is transposed.
 ~~~
 static inline void my_sgemm_4x4(float *matrix_a, float *matrix_b,
                                float *output,
                                int off_a, int off_b, int off_o ) {
  /** \code */
  asm volatile (
    "# Start manual code \n\t"
    "# Matrix Multiplication \n\n\t"
 ~~~
 Macro section
 This macro performs the actual multiplication. It accumulates one output row from four consecutive elements of matrix_a and the four vectors of matrix_b (q8 - q11). The elements of matrix_a are passed in col0_d and col1_d (two 64-bit d registers, which together form one 128-bit q register); the vectors loaded from the transposed matrix_b are stored in
 q8-q11. res_q holds the resulting output row.
 ~~~
    ".macro  mul_col_f32 res_q, col0_d, col1_d\n\t"
    "vmla.f32 \\res_q, q8, \\col0_d[0]\n\t"
    "vmla.f32 \\res_q, q9, \\col0_d[1]\n\t"
    "vmla.f32 \\res_q, q10, \\col1_d[0]\n\t"
    "vmla.f32 \\res_q, q11, \\col1_d[1]\n\t"
    ".endm\n\n\t"
 ~~~
 End macro section
 Start loading the 128-bit registers with 4 single-precision floats each. q12-q15 are first loaded
 with the current state of the output.
 After each register is loaded, some
 offset has to be added, since the next row starts with some offset. The same
 mechanism applies to all matrices.
 Load current state of output -> q12 - q15
 ~~~
    "vld1.32 {q12}, [%6]!\n\t"
    "add %6, %6, %5\n\t"        /* add some offset until start of next row */
    "vld1.32 {q13}, [%6]!\n\t"
    "add %6, %6, %5\n\t"
    "vld1.32 {q14}, [%6]!\n\t"
    "add %6, %6, %5\n\t"
    "vld1.32 {q15}, [%6]!\n\t"
 ~~~
 Load matrix_b (transposed!) -> q8 - q11
 ~~~
    "vld1.32 {q8}, [%2]!\n\t"
    "add %2, %2, %4\n\t"
    "vld1.32 {q9}, [%2]!\n\t"
    "add %2, %2, %4\n\t"
    "vld1.32 {q10}, [%2]!\n\t"
    "add %2, %2, %4\n\t"
    "vld1.32 {q11}, [%2]!\n\t"
 ~~~
 load matrix_a -> q0 - q3
 ~~~
    "vld1.32 {q0}, [%1]!\n\t"
    "add %1, %1, %3\n\t"
    "vld1.32 {q1}, [%1]!\n\t"
    "add %1,%1, %3\n\t"
    "vld1.32 {q2}, [%1]!\n\t"
    "add %1, %1, %3\n\t"
    "vld1.32 {q3}, [%1]!\n\t"
 ~~~
 End load registers
 Start doing the actual matrix multiplication as defined in macro
 ~~~  
    "mul_col_f32 q12, d0, d1\n\t"
    "mul_col_f32 q13, d2, d3\n\t"
    "mul_col_f32 q14, d4, d5\n\t"
    "mul_col_f32 q15, d6, d7\n\n\t"
 ~~~
 store the result [q12 - q15] into output
 ~~~
    "vst1.32 {q12}, [%0]!\n\t"
    "add %0, %0, %5\n\t"
    "vst1.32 {q13}, [%0]!\n\t"
    "add %0, %0, %5\n\t"
    "vst1.32 {q14}, [%0]!\n\t"
    "add %0, %0, %5\n\t"
    "vst1.32 {q15}, [%0]!\n\t"
 ~~~
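 start argument section of inline assembler
 ~~~
    :"+r"((long) output)
    :"r"(&matrix_a[0]),"r"(&matrix_b[0]),"r"(off_a),"r"(off_b),
     "r"(off_o),"r"(&output[0])
    /* clobber list: the NEON registers and memory modified by the code above */
    :"q0","q1","q2","q3","q8","q9","q10","q11",
     "q12","q13","q14","q15","memory");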
  /** \endcode */
  return;
 }
 ~~~
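For comparison, the same 4 x 4 patch update can be sketched with the NEON intrinsics from `arm_neon.h` instead of inline assembler. This is an illustrative alternative, not the code used in the talk: the function name `sgemm_4x4_sketch` is hypothetical, the row lengths `lda`, `ldb`, `ldo` are given in elements instead of byte offsets, and a plain C fallback is included so the sketch also compiles on machines without NEON.

```c
#include <assert.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* out += a * b for one 4 x 4 patch; lda, ldb, ldo are row lengths in elements. */
static void sgemm_4x4_sketch(const float *a, const float *b, float *out,
                             int lda, int ldb, int ldo) {
  assert(a && b && out);
#if defined(__ARM_NEON)
  /* rows of the b patch, like q8 - q11 in the assembler version */
  float32x4_t b0 = vld1q_f32(b);
  float32x4_t b1 = vld1q_f32(b + ldb);
  float32x4_t b2 = vld1q_f32(b + 2 * ldb);
  float32x4_t b3 = vld1q_f32(b + 3 * ldb);
  for (int r = 0; r < 4; r++) {
    float32x4_t ar  = vld1q_f32(a + r * lda);     /* one row of the a patch */
    float32x4_t acc = vld1q_f32(out + r * ldo);   /* current output row */
    /* same four steps as the macro mul_col_f32 */
    acc = vmlaq_lane_f32(acc, b0, vget_low_f32(ar), 0);
    acc = vmlaq_lane_f32(acc, b1, vget_low_f32(ar), 1);
    acc = vmlaq_lane_f32(acc, b2, vget_high_f32(ar), 0);
    acc = vmlaq_lane_f32(acc, b3, vget_high_f32(ar), 1);
    vst1q_f32(out + r * ldo, acc);
  }
#else
  /* portable fallback with the same result, for machines without NEON */
  for (int r = 0; r < 4; r++)
    for (int k = 0; k < 4; k++)
      for (int c = 0; c < 4; c++)
        out[r * ldo + c] += a[r * lda + k] * b[k * ldb + c];
#endif
}
```

With intrinsics the compiler takes care of register allocation, so no operand or clobber lists are needed; the price is less control over the exact instruction sequence.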
--- a/post/CompLB-Kramski-Home-Recording-20200519_v01.2.pdf
+++ b/post/CompLB-Kramski-Home-Recording-20200519_v01.2.pdf
--- a/post/Logo_Chaostreff_LB.zip
+++ b/post/Logo_Chaostreff_LB.zip
--- a/post/logo.md
+++ b/post/logo.md
@ -0,0 +1,8 @@
 ---
 date: "2020-05-02"
 tages: ["chaostreff", "logo", "misc"]
 title: "Logo package Chaostreff LB"
 ---
 The logo zip is available here: [Download](https://complb.de/logo/Logo_Chaostreff_LB.zip)
--- a/post/matrix_matrix.c
+++ b/post/matrix_matrix.c
@ -0,0 +1,300 @@
 #include <stdio.h>
 #include <string.h>
 #include <math.h>
 #include <sys/time.h>
 /**
 *  gcc -o mm -g -march=armv7 -mfpu=neon-vfpv4 matrix_matrix.c
 *  or cross compile
 *  arm-linux-gnueabihf-gcc -o mm -g -march=armv7 -mfpu=neon-vfpv4 matrix_matrix.c
 *
 * test with a = np.arange(4*n*4*m).reshape(4*n,4*m)
 * a.dot(a.T)
 * 
 * this file contains all the elements to show the ARMv7-l usage of 
 * the SIMD architecture
 *
 */
 void simple_transpose( float *, int, int, float *);
 static inline void my_sgemm_4x4(float *, float *, float *,
 				int, int, int );
 /**
 * helper routine (only for documentation and demonstration)
 * 
 * transpose some matrix matrix_a (size n_rows, n_cols) 
 * to matrix output
 *
 */
 void simple_transpose( float *matrix_a, int n_rows_a, int n_cols_a,
 		       float *output ) {
  for (int ra = 0; ra < n_rows_a; ra++)
    for (int ca = 0; ca < n_cols_a; ca++ )
      output[n_rows_a*ca+ra] = matrix_a[n_cols_a*ra+ca];
      //output[ra][ca] = matrix_a[ca][ra];
  return;
 }
 /**
 * Kernel function including the optimization and the 
 * 4 x 4 multiplication of 4 x 4 fragments of the large column-based
 * matrices matrix_a and matrix_b
 *
 * arguments:
 * - (float *) matrix_a: square matrix of size 4 x 4,
 * - (float *) matrix_b: square matrix of size 4 x 4,  
 * - (float *) output: 4 x 4 result (return value)
 * - (int) off_a,b,o: offset in bytes between the last patch element of one row
 *         and the first patch element of the next row of matrix_a, matrix_b and output 
 * 
 * details documented here ![Using_SIMD](/home/eduard/work/wikiwhat/doc/Using_SIMD.md)
 */
 static inline void my_sgemm_4x4(float *matrix_a, float *matrix_b,
 				float *output,
 				int off_a, int off_b, int off_o ) {
  /** \code */
  asm volatile (
    "# Start manual code \n\t"
    "# Matrix Multiplication \n\n\t"
    /* Macro section */
    ".macro  mul_col_f32 res_q, col0_d, col1_d\n\t"
    "vmla.f32 \\res_q, q8, \\col0_d[0]\n\t"
    "vmla.f32 \\res_q, q9, \\col0_d[1]\n\t"
    "vmla.f32 \\res_q, q10, \\col1_d[0]\n\t"
    "vmla.f32 \\res_q, q11, \\col1_d[1]\n\t"
    ".endm\n\n\t"
    /* end macro section */
    /* load current state of output -> q12 - q15 */
    "vld1.32 {q12}, [%6]!\n\t"
    "add %6, %6, %5\n\t"        /* add some offset until start of next row */
    "vld1.32 {q13}, [%6]!\n\t"
    "add %6, %6, %5\n\t"
    "vld1.32 {q14}, [%6]!\n\t"
    "add %6, %6, %5\n\t"
    "vld1.32 {q15}, [%6]!\n\t"
    /* load matrix_b (transposed!) -> q8 - q11 */
    "vld1.32 {q8}, [%2]!\n\t"
    "add %2, %2, %4\n\t"
    "vld1.32 {q9}, [%2]!\n\t"   
    "add %2, %2, %4\n\t"
    "vld1.32 {q10}, [%2]!\n\t"   
    "add %2, %2, %4\n\t"
    "vld1.32 {q11}, [%2]!\n\t"   
    /* load matrix_a -> q0 - q3 */
    "vld1.32 {q0}, [%1]!\n\t"   
    "add %1, %1, %3\n\t"
    "vld1.32 {q1}, [%1]!\n\t"   
    "add %1,%1, %3\n\t"
    "vld1.32 {q2}, [%1]!\n\t"   
    "add %1, %1, %3\n\t"
    "vld1.32 {q3}, [%1]!\n\t"
    /* end load registers
     * start doing the actual matrix multiplication as defined in macro */
    "mul_col_f32 q12, d0, d1\n\t"
    "mul_col_f32 q13, d2, d3\n\t"
    "mul_col_f32 q14, d4, d5\n\t"
    "mul_col_f32 q15, d6, d7\n\n\t"
    /* store the result [q12 - q15] into output */
    "vst1.32 {q12}, [%0]!\n\t"
    "add %0, %0, %5\n\t"
    "vst1.32 {q13}, [%0]!\n\t"
    "add %0, %0, %5\n\t"
    "vst1.32 {q14}, [%0]!\n\t"
    "add %0, %0, %5\n\t"
    "vst1.32 {q15}, [%0]!\n\t"
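    /* start argument section of inline assembler */
    :"+r"((long) output)
    :"r"(&matrix_a[0]),"r"(&matrix_b[0]),"r"(off_a),"r"(off_b),
     "r"(off_o),"r"(&output[0])
    /* clobber list: the NEON registers and memory modified by the code above */
    :"q0","q1","q2","q3","q8","q9","q10","q11",
     "q12","q13","q14","q15","memory");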
  /** \endcode */
  return;
 }
 /**
 * matrix matrix multiplication of some matrix_a and some matrix_b
 * (works only for size 4*n x 4*m)
 * the order is column based and output = a x b.transpose()
 *
 * the multiplication based on patch-wise standard multiplication algorithm
 * each patch of size 4 x 4
 */
 void my_sgemm(float *matrix_a, int n_rows_a, int n_cols_a,
 	      float *matrix_b, int n_rows_b, int n_cols_b,
 	      float *output ) {
  int offset_a = 4*(n_cols_a-4);
  int offset_b = 4*(n_cols_b-4);
  for(int i=0;i<n_rows_a;i = i+4 ) {
    for(int j=0;j<n_cols_b;j = j+4 ) {    
      for(int k=0;k<n_cols_a;k = k+4 ) {    
 	my_sgemm_4x4(&matrix_a[n_cols_a*i+k],
 		     &matrix_b[n_cols_b*k+j], 
 		     &output[n_cols_b*i+j],
 		     offset_a, offset_b, offset_b);
      }    
    }
  }
  return;
 }
 /**
 * standard algorithm for matrix matrix multiplication
 * output = matrix_a.dot(matrix_bb.transpose())
 * - arguments:
 *  - (float *) a: column-based matrix of size n_rows_a x n_cols_a
 *  - (int)     n_rows_a, n_cols_a: size of matrix a
 *  - (float *) b: column-based matrix of size n_rows_b x n_cols_b
 *  - (int)     n_rows_b, n_cols_b: size of matrix b
 *  - (float *) output: column-based matrix = a.dot(b.T)
 * - return: void
 */
 void simple_mm( float *a, int n_rows_a, int n_cols_a,
 		float *b, int n_rows_b, int n_cols_b,
 		float *output ) {
  for(int i=0;i<n_rows_a;i++)    
    for(int j=0;j<n_cols_b;j++) {    
      output[n_cols_b*i+j]=0;    
      for(int k=0;k<n_cols_a;k++) {    
 	output[n_cols_b*i+j]+=a[n_cols_a*i+k]*b[n_cols_b*k+j];    
      }    
    }    
  return;
 }
 /**
 * int main():
 *     calls simple and optimized function and compare speed
 *     size defined by macros (N_ROWS_x, N_COLS_x: 4 * n, 4 * m)
 *     Matrix defined as:
 *     matrix_a = np.arange(N_ROWS_A*N_COLS_A).reshape(N_ROWS_A,N_COLS_A)
 *     matrix_b = a.T
 *     1. call optimized my_sgemm
 *     2. call simple_mm
 *     compare times, check identity
 *     print results
 * size of example matrix 
 */
 #define N_COLS_A 256
 #define N_ROWS_A 256
 #define N_COLS_B N_ROWS_A
 #define N_ROWS_B N_COLS_A
 int main() {
  float matrix_a[N_ROWS_A][N_COLS_A];
  float matrix_b[N_ROWS_B][N_COLS_B];
  float matrix_aa[N_ROWS_A][N_COLS_A];
  float matrix_bb[N_ROWS_B][N_COLS_B];
  float buffer[N_ROWS_A][N_COLS_B];
  float reference[N_ROWS_A][N_COLS_B];
  struct timeval t1, t2, t3;
  long int durationf, durations;
  /**
   * matrix_a = np.arange(N_ROWS_A*N_COLS_A).reshape(N_ROWS_A,N_COLS_A)
   */
  for (int ra = 0; ra < N_ROWS_A; ra++) {
     for (int ca = 0; ca < N_COLS_A; ca++) {
       matrix_a[ra][ca] = N_COLS_A*ra+ca;
       matrix_aa[ra][ca] = N_COLS_A*ra+ca;
     }
  }
  /**
   * calculate matrix_b as matrix_a.T 
   *
   */
  simple_transpose(&matrix_a[0][0], N_ROWS_A, N_COLS_A,
  		   &matrix_b[0][0]);
  simple_transpose(&matrix_aa[0][0], N_ROWS_A, N_COLS_A,
  		   &matrix_bb[0][0]);
  /**
   * set the outputs buffer (output of my_sgemm) and reference (simple_mm)
   * to zero
   */
  for (int ra = 0; ra < N_ROWS_A; ra++)
    for (int cb = 0; cb < N_COLS_B; cb++) {
      buffer[ra][cb] = 0.0;
      reference[ra][cb] = 0.0;
    }
  /**
   * 1. set timer to t1 (start of optimized algorithm)
   * 2. call optimized algorithm
   */
  gettimeofday(&t1, NULL);
  my_sgemm(&matrix_aa[0][0], N_ROWS_A, N_COLS_A,
  	   &matrix_bb[0][0], N_ROWS_B, N_COLS_B,
  	   &buffer[0][0]);
  /**
   * 3. set timer to t2 (end of optimized and start of simple algorithm)
   * 4. call simple algorithm
   */
  gettimeofday(&t2, NULL);
  simple_mm(&matrix_a[0][0], N_ROWS_A, N_COLS_A,
 	    &matrix_b[0][0], N_ROWS_B, N_COLS_B,
 	    &reference[0][0]);
  /**
   * 5. set timer to t3 (end of simple algorithm)
   */
  gettimeofday(&t3, NULL);
  /**
   * calculate durations for optimized and simple algorithm
   */
  durationf = 1e6*(t2.tv_sec - t1.tv_sec)+(t2.tv_usec - t1.tv_usec);
  durations = 1e6*(t3.tv_sec - t2.tv_sec)+(t3.tv_usec - t2.tv_usec);
  /**
   * output a 6 x 6 patch of the result (both algorithms)
   */
  printf("my_sgemm\n");
  for (int ra=0; ra<6; ra++ ) {
    for (int cb=0; cb<6; cb++ ) printf("%.2e ", buffer[ra][cb]);
    printf("\n");
  }
  printf("reference\n");
  for (int ra=0; ra<6; ra++ ) {
    for (int cb=0; cb<6; cb++ ) printf("%.2e ", reference[ra][cb]);
    printf("\n");
  }
  /**
   * calculate the squared error (summed over all elements, not divided by the element count)
   */
  float mse = 0.0F;
  for (int ra=0; ra<N_ROWS_A; ra++ ) {
    for (int cb=0; cb<N_COLS_B; cb++ ) {
      mse += (reference[ra][cb]-buffer[ra][cb]) *
 	(reference[ra][cb]-buffer[ra][cb]);
    }
  }
  /**
   * print mse and ratio of times optimized_time / simple_time
   */
  printf("MSE: %.5f [durationrate f/s %.5f]\n",mse,
 	 (float)durationf/(float)durations);
  return 0;
 }
--- a/post/motivation_matrix_mult.ipynb
+++ b/post/motivation_matrix_mult.ipynb
@ -0,0 +1,318 @@
 {
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Motivation\n",
    "## How to solve matrix-matrix multiplication"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Simple approach in C\n",
    "\n",
    "~~~\n",
    "   for (c = 0; c < m; c++) {\n",
    "      for (d = 0; d < q; d++) {\n",
    "        for (k = 0; k < p; k++) {\n",
    "          sum = sum + first[c][k]*second[k][d];\n",
    "        }\n",
    " \n",
    "        multiply[c][d] = sum;\n",
    "        sum = 0;\n",
    "      }\n",
    "    }\n",
    "~~~"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "def mymatrixmult(A,B):\n",
    "    y = np.zeros((A.shape[0], B.shape[1]))\n",
    "    for i in range(A.shape[0]):\n",
    "        for j in range(B.shape[1]):\n",
    "            for k in range(A.shape[0]):\n",
    "                 y[i][j] += A[i][k]*B[k][j]\n",
    "    return y\n",
    "                \n",
    "m = np.arange(40000).reshape(200,200)\n",
    "m1 = m/np.average(m)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Takes a while (200**3) = 8 MFLOPS"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[6.61708085e-03, 1.65675784e-02, 2.65180759e-02, ...,\n",
       "        1.96686509e+00, 1.97681559e+00, 1.98676609e+00],\n",
       "       [1.65675784e-02, 4.65190759e-02, 7.64705735e-02, ...,\n",
       "        5.91701260e+00, 5.94696409e+00, 5.97691559e+00],\n",
       "       [2.65180759e-02, 7.64705735e-02, 1.26423071e-01, ...,\n",
       "        9.86716010e+00, 9.91711260e+00, 9.96706510e+00],\n",
       "       ...,\n",
       "       [1.96686509e+00, 5.91701260e+00, 9.86716010e+00, ...,\n",
       "        7.80145924e+02, 7.84096071e+02, 7.88046219e+02],\n",
       "       [1.97681559e+00, 5.94696409e+00, 9.91711260e+00, ...,\n",
       "        7.84096071e+02, 7.88066220e+02, 7.92036368e+02],\n",
       "       [1.98676609e+00, 5.97691559e+00, 9.96706510e+00, ...,\n",
       "        7.88046219e+02, 7.92036368e+02, 7.96026518e+02]])"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "mymatrixmult(m1, m1.T)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[6.61708085e-03, 1.65675784e-02, 2.65180759e-02, ...,\n",
       "        1.96686509e+00, 1.97681559e+00, 1.98676609e+00],\n",
       "       [1.65675784e-02, 4.65190759e-02, 7.64705735e-02, ...,\n",
       "        5.91701260e+00, 5.94696409e+00, 5.97691559e+00],\n",
       "       [2.65180759e-02, 7.64705735e-02, 1.26423071e-01, ...,\n",
       "        9.86716010e+00, 9.91711260e+00, 9.96706510e+00],\n",
       "       ...,\n",
       "       [1.96686509e+00, 5.91701260e+00, 9.86716010e+00, ...,\n",
       "        7.80145924e+02, 7.84096071e+02, 7.88046219e+02],\n",
       "       [1.97681559e+00, 5.94696409e+00, 9.91711260e+00, ...,\n",
       "        7.84096071e+02, 7.88066220e+02, 7.92036368e+02],\n",
       "       [1.98676609e+00, 5.97691559e+00, 9.96706510e+00, ...,\n",
       "        7.88046219e+02, 7.92036368e+02, 7.96026518e+02]])"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "m1.dot(m1.T)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Now numpy: (2000**3) = 8 GFLOPS"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "M = np.arange(4000000).reshape(2000,2000)   \n",
    "M1 = M/np.average(M) "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[6.66167083e-04, 1.66566758e-03, 2.66516808e-03, ...,\n",
       "        1.99666867e+00, 1.99766817e+00, 1.99866767e+00],\n",
       "       [1.66566758e-03, 4.66516908e-03, 7.66467058e-03, ...,\n",
       "        5.99167016e+00, 5.99466966e+00, 5.99766917e+00],\n",
       "       [2.66516808e-03, 7.66467058e-03, 1.26641731e-02, ...,\n",
       "        9.98667166e+00, 9.99167116e+00, 9.99667067e+00],\n",
       "       ...,\n",
       "       [1.99666867e+00, 5.99167016e+00, 9.98667166e+00, ...,\n",
       "        7.98001466e+03, 7.98400966e+03, 7.98800466e+03],\n",
       "       [1.99766817e+00, 5.99466966e+00, 9.99167116e+00, ...,\n",
       "        7.98400966e+03, 7.98800666e+03, 7.99200366e+03],\n",
       "       [1.99866767e+00, 5.99766917e+00, 9.99667067e+00, ...,\n",
       "        7.98800466e+03, 7.99200366e+03, 7.99600267e+03]])"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "M1.dot(M1.T)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11],\n",
       "       [ 12,  13,  14,  15,  16,  17,  18,  19,  20,  21,  22,  23],\n",
       "       [ 24,  25,  26,  27,  28,  29,  30,  31,  32,  33,  34,  35],\n",
       "       [ 36,  37,  38,  39,  40,  41,  42,  43,  44,  45,  46,  47],\n",
       "       [ 48,  49,  50,  51,  52,  53,  54,  55,  56,  57,  58,  59],\n",
       "       [ 60,  61,  62,  63,  64,  65,  66,  67,  68,  69,  70,  71],\n",
       "       [ 72,  73,  74,  75,  76,  77,  78,  79,  80,  81,  82,  83],\n",
       "       [ 84,  85,  86,  87,  88,  89,  90,  91,  92,  93,  94,  95],\n",
       "       [ 96,  97,  98,  99, 100, 101, 102, 103, 104, 105, 106, 107],\n",
       "       [108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119],\n",
       "       [120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131],\n",
       "       [132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143]])"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "n = 12\n",
    "np.arange(n*n).reshape(n,n)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Splitting the matrix\n",
    "\n",
    "Two reasons:\n",
    "1. optimize cache usage\n",
    "2. using SIMD power\n",
    "\n",
    "<img src=\"pics/matrix.png\">"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## ?\n",
    "\n",
    "No idea, about the following: $y = tanh(M)$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "y = np.tanh(M1.dot(M1.T))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(1000, 1000)"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "y.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "16.0"
      ]
     },
     "execution_count": 51,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "((4*128)**3)*16/((128)**3*64)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# das wichtig"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
 }
--- a/post/stammtisch20200506.md
+++ b/post/stammtisch20200506.md
@ -0,0 +1,27 @@
 ---
 date: "2020-05-06"
 tages: ["chaostreff", "veranstaltung", "protokoll"]
 title: "Short minutes 06.05.2020"
 ---
 Points relevant to the Chaostreff:
 1. The Chaostreff now takes place regularly on the following dates:
    1. 1st Wednesday of the month, from 19:42, location: Jitsi
    2. 3rd Tuesday of the month, from 19:42, location: Jitsi
    Jitsi link: https://meet.jit.si/chaoslb
 Some topics with and without links:
 1. There was an exciting talk by fritzthekit/eduard on SIMD and blazingly fast matrix calculations on a
   Raspberry Pi. Links to the [Jupyter notebook](motivation_matrix_mult.ipynb), the [C code](matrix_matrix.c) and the [presentation as Markdown](CCC_Why_what_is_SIMD.md)
 2. Go for small places: [TinyGo](https://tinygo.org/)
 3. Zoom's data protection assessments were a topic (it is now somehow half GDPR-compliant after all)
 4. Iridium Browser [Link](https://iridiumbrowser.de/)
 5. ...
 The next Stammtisch is on 19.05.2020 with a talk by Harvey.
--- a/post/stammtisch20200519.md
+++ b/post/stammtisch20200519.md
@ -0,0 +1,24 @@
 ---
 date: "2020-05-19"
 tages: ["chaostreff", "veranstaltung"]
 title: "Short minutes 19.05.2020"
 ---
 Points relevant to the Chaostreff:
 1. A joint evening with the CCCS would be nice for once. We are starting to look for a date.
 2. Where can the repo of the Chaostreff website be migrated to? Under discussion are the Gitea instance of the CCCS and Codeberg.
 Some topics with and without links:
 1. Harvey's talk was well attended with 10 people and, as always, very informative and entertaining [Slides](CompLB-Kramski-Home-Recording-20200519_v01.2.pdf)
 2. GoTo-Meeting is used by data protection officers as a video conferencing system: [Link](https://www.gotomeeting.com/de-de)
 3. Gitea as a Git web frontend: [Link](https://gitea.io/en-us/)
 4. Codeberg as a GitHub alternative: [Link](https://codeberg.org/)
 5. gotop as a fancy top/htop alternative; it is no longer maintained, but there is a fork and a continuation in Rust: [Link](https://github.com/cjbassi/gotop)
 6. Online voting at the Greens: which technology do parties, e.g. the Greens, use to implement online votes?
 7. Multigeiger: [Link](https://github.com/ecocurious2/MultiGeiger)
 8. ...
 The next Chaostreff takes place on 03.06.2020.
 Stay healthy and hack things!
--- a/vas.md
+++ b/vas.md
@ -12,5 +12,5 @@ ansehen.
 1. 01.04.2020 - Harvey/Heinz: _Shell-Health-Check_ or _Wie ich (wieder) lernte, die Shell zu lieben_ [Slides](https://complb.de/stammtisch20200401/CompLB-Kramski-Shell-Check-20200325_v02.pdf)
 2. 22.04.2020 - ampoff/Steffen: _Die National Software Reference Library_ [Slides](https://complb.de/stammtisch20200422/nsrl_short.pdf)
 3. 06.05.2020 - fritzthekit - _SIMD und neuronale Netze_
-4. 20.05.2020 - Harvey/Heinz: _Mit Spielfilm-Mitschnitten gegen den Stream schwimmen_
+4. 19.05.2020 - Harvey/Heinz: _Mit Spielfilm-Mitschnitten gegen den Stream schwimmen_ [Slides](https://complb.de/stammtisch2020519/CompLB-Kramski-Home-Recording-20200519_v01.2.pdf)
-5. t.b.a.     - ampoff/Steffen: _Ansible und AWX_ or _einenSchmissigenTitelFinden_
+5. 16.06.2020 - ampoff/Steffen: _Ansible und AWX_ or _Langweilige Tasks automatisieren, mehr Zeit für alles andere_