diff --git a/post/CCC_Why_what_is_SIMD.md b/post/CCC_Why_what_is_SIMD.md
new file mode 100644
index 0000000..db3b32b
--- /dev/null
+++ b/post/CCC_Why_what_is_SIMD.md
@@ -0,0 +1,259 @@
+# ARMv7-l SIMD and using NEON
+
+## Motivation for SIMD
+
+
+
+[(WikiCommons:Adaline)](https://commons.wikimedia.org/wiki/File:Adaline.jpg)
+
+
+$\hat{y} = f(W\times \hat{x}+B)$
+
+For a time series:
+
+$X = \left[\hat{x_0},\hat{x_1},\dots,\hat{x_n}\right]$
+
+We get:
+
+$Y = f(W\times X+B)$
+
+### That is what Thensorflow, numpy and lots of others are good about ...
+
+## Matrix Mutlipication
+
+Eor each element in the resulting matrix a scalar product of a specific column of the matrix W and a specifc row of matrix X is required.
+
+
+
+Now that takes a while:
+
+Simple approach in C
+
+~~~
+ for (c = 0; c < m; c++) {
+ for (d = 0; d < q; d++) {
+ for (k = 0; k < p; k++) {
+ sum = sum + first[c][k]*second[k][d];
+ }
+
+ multiply[c][d] = sum;
+ sum = 0;
+ }
+ }
+~~~
+
+That requires for matrixs (100,100) x (100,1000) 100*100*1000 = 10 MFLOPS. What can be done to optimize the speed?
+
+## Techniques to optimize the calculation
+
+### Only splitting the Matrix
+
+Two reasons:
+1. optimize cache usage (**not today**)
+2. using **SIMD power**
+
+
+
+
+# SIMD (single instruction multiple data)
+
+### Just a few words to Inlining Assembler in C (or C++)
+
+Assembler [examples see](https://www.ibiblio.org/gferg/ldp/GCC-Inline-Assembly-HOWTO.html)
+
+The most simple one, works on x86:
+
+~~~
+#include
+
+int main(void)
+{
+ int foo = 10, bar = 15;
+ asm volatile ("addl %%ebx,%%eax"
+ :"=a"(foo)
+ :"a"(foo), "b"(bar));
+ printf("foo+bar=%d\n", foo);
+ return 0;
+}
+~~~
+
+From the gcc manual
+~~~
+asm asm-qualifiers ( AssemblerTemplate
+ : OutputOperands
+ [ : InputOperands
+ [ : Clobbers ] ])
+~~~
+
+## My sources
+
+All points are from the link above The
+[NEON TM Version: 1.0 Programmer’s Guide](https://static.docs.arm.com/den0018/a/DEN0018A_neon_programmers_guide_en.pdf)
+
+- general idea behind SIMD (not talking about MIMD)
+- ARM NEON comparision with others (1.2 - pp 1-4)
+- the instruction timing is not clear - depends as all calculations mainly on data fetching time.
+- Fundamentals of NEON technology (1.4 - pp 1-10)
+ - 1.4.1 Registers q, d, s
+ - 1.4.2 Datatypes
+
+
+
+
+### What is it
+
+With a single instruction a vector (or other structurs) can be calculated in parallel.
+
+One assembler instruction multi/adds vectors of 4x4:
+
+
+
+Each of the 9 patches requires 4 x 4 = 16 SIMD instruction (compared to 4 x 4 x 4 = 64 ops ) fmla.f32. (multipy/Add)
+
+## Remark about this document
+
+This study is only for a better understanding of the SIMD instructions and SIMD performance of
+the ARMV7-A core (actually this one is a CORTEX-A53, but the OS supports only the 32 bit
+alternative.)
+
+## Documents and Sources
+
+[ARM ® and Thumb ® -2 Instruction Set](http://infocenter.arm.com/help/topic/com.arm.doc.qrc0001m/QRC0001_UAL.pdf)
+
+[ARM Architecture Reference Manual ARMv7-A and ARMv7-R](https://static.docs.arm.com/ddi0406/c/DDI0406C_C_arm_architecture_reference_manual.pdf)
+
+The
+[NEON TM Version: 1.0 Programmer’s Guide](https://static.docs.arm.com/den0018/a/DEN0018A_neon_programmers_guide_en.pdf) provides all the information required to realy do SIMD on ARMV7-A and R.
+The document explains the register structure of the single, double and 128 bit registers as well as the instructions.
+
+Besides other examples (Swapping color channel, FIR,
+cross product),
+there is also an example for matrix matrix multiplication.
+
+The example examined here is based on this document and the 4 x 4 matrix multiplication given (chapter 7.1, pp. 115.)
+
+## About the example: my_sgemm
+
+The matrix matrix multiplication calculates patches of 4 x 4 at one time the rest of the
+calculation is straight forward.
+
+~~~
+for (i ...)
+ for (j ...)
+ for (k ...)
+~~~
+the inner loop calls the optimized 4 x 4 multiplication.
+
+
+## Shape of the matrixes
+
+All matrixes in C are column-based. matrix_a is regular and matrix_b is transposed. (Therefore, all scalar products
+of columns [B] with rows of [A] are column $\times$ column multipilications.)
+
+The calculation is performing
+
+$C = A \times B^\mathsf{T} + C$
+
+Assuming the matrix A contains n rows and m columns, then
+the element A[i,j] has in the c-array representing the matrix the index i * m + j.
+If we want to extract a patch out of the matrix:
+A[k:k+4,l:l+4], the for rows of the matrix could be calculated by,
+- first row starts at k*m+l
+- the next row starts with some offset o = m-4.
+- same for the thrid and forth rows.
+
+## The assembler SIMD part for the 4 x 4 multiplication
+
+Purpose of the 4x4 matrix multiplication: It multiplies of a small 4 x 4 patch of some large
+colom-based matrixes, important to know: matrix_a is regular,
+matrix_b is transposed.
+
+~~~
+static inline void my_sgemm_4x4(float *matrix_a, float *matrix_b,
+ float *output,
+ int off_a, int off_b, int off_o ) {
+ /** \code */
+ asm volatile (
+ "# Start manual code \n\t"
+ "# Matrix Multiplication \n\n\t"
+~~~
+Macro section
+This macro performs the actual multiplication. It provides the output row for one column of matrix_a and the matrix_b (q8 - q11). The rows are stored in col0 and col1 (which corresponts to two 128 bit registers), the colums are stored in
+q8-q11. res_q gives the resulting output row.
+~~~
+ ".macro mul_col_f32 res_q, col0_d, col1_d\n\t"
+ "vmla.f32 \\res_q, q8, \\col0_d[0]\n\t"
+ "vmla.f32 \\res_q, q9, \\col0_d[1]\n\t"
+ "vmla.f32 \\res_q, q10, \\col1_d[0]\n\t"
+ "vmla.f32 \\res_q, q11, \\col1_d[1]\n\t"
+ ".endm\n\n\t"
+~~~
+End macro section
+
+Start loading the 128 registers with 4 single floats. q12-q15 are first loaded
+with the current state of the output.
+
+After each register is loaded some
+offset has to be added, since the next row starts with some offset. The same
+mechanismus applies to all matrixes.
+
+
+load current state of output -> q12 - q15 */
+~~~
+ "vld1.32 {q12}, [%6]!\n\t"
+ "add %6, %6, %5\n\t" /* add some offset until start of next row */
+ "vld1.32 {q13}, [%6]!\n\t"
+ "add %6, %6, %5\n\t"
+ "vld1.32 {q14}, [%6]!\n\t"
+ "add %6, %6, %5\n\t"
+ "vld1.32 {q15}, [%6]!\n\t"
+~~~
+load matrix_b (transposed!) -> q8 - q11 */
+~~~
+ "vld1.32 {q8}, [%2]!\n\t"
+ "add %2, %2, %4\n\t"
+ "vld1.32 {q9}, [%2]!\n\t"
+ "add %2, %2, %4\n\t"
+ "vld1.32 {q10}, [%2]!\n\t"
+ "add %2, %2, %4\n\t"
+ "vld1.32 {q11}, [%2]!\n\t"
+~~~
+load matrix_a -> q0 - q3
+~~~
+ "vld1.32 {q0}, [%1]!\n\t"
+ "add %1, %1, %3\n\t"
+ "vld1.32 {q1}, [%1]!\n\t"
+ "add %1,%1, %3\n\t"
+ "vld1.32 {q2}, [%1]!\n\t"
+ "add %1, %1, %3\n\t"
+ "vld1.32 {q3}, [%1]!\n\t"
+~~~
+End load registers
+
+Start doing the actual matrix multiplication as defined in macro
+~~~
+ "mul_col_f32 q12, d0, d1\n\t"
+ "mul_col_f32 q13, d2, d3\n\t"
+ "mul_col_f32 q14, d4, d5\n\t"
+ "mul_col_f32 q15, d6, d7\n\n\t"
+ ~~~
+store the result [q12 - 115] into output
+ ~~~
+ "vst1.32 {q12}, [%0]!\n\t"
+ "add %0, %0, %5\n\t"
+ "vst1.32 {q13}, [%0]!\n\t"
+ "add %0, %0, %5\n\t"
+ "vst1.32 {q14}, [%0]!\n\t"
+ "add %0, %0, %5\n\t"
+ "vst1.32 {q15}, [%0]!\n\t"
+~~~
+start argument section of inline assembler
+~~~
+ :"+r"((long) output)
+ :"r"(&matrix_a[0]),"r"(&matrix_b[0]),"r"(off_a),"r"(off_b),
+ "r"(off_o),"r"(&output[0]));
+ /** \endcode */
+ return;
+}
+~~~
+
diff --git a/post/matrix_matrix.c b/post/matrix_matrix.c
new file mode 100644
index 0000000..134c244
--- /dev/null
+++ b/post/matrix_matrix.c
@@ -0,0 +1,300 @@
+#include
+#include
+#include
+#include
+
+/**
+ * gcc -o mm -g -march=armv7 -mfpu=neon-vfpv4 matrix_matrix.c
+ * or cross compile
+ * arm-linux-gnueabihf-gcc -o mm -g -march=armv7 -mfpu=neon-vfpv4 matrix_matrix.c
+ *
+ * test with a = np.arange(4*n*m + 4m).reshape(4*n,4*m)
+ * a.dot(a.T)
+ *
+ * this file contains all the elements to show ARM7VL usage of
+ * SIMD architecture
+ *
+ */
+
+void simple_transpose( float *, int, int, float *);
+static inline void my_sgemm_4x4(float *, float *, float *,
+ int, int, int );
+
+/**
+ * help routine (only for documentation demonstration
+ *
+ * transpose some matrix matrix_a (size n_rows, n_cols)
+ * to matrix output
+ *
+ */
+
+void simple_transpose( float *matrix_a, int n_rows_a, int n_cols_a,
+ float *output ) {
+ for (int ra = 0; ra < n_rows_a; ra++)
+ for (int ca = 0; ca < n_cols_a; ca++ )
+ output[n_rows_a*ca+ra] = matrix_a[n_cols_a*ra+ca];
+ //output[ra][ca] = matrix_a[ca][ra];
+ return;
+}
+
+
+/**
+ * Kernal function including the optimization and the
+ * 4 x 4 multiplication of 4 x4 fragments of large column based
+ * matrixes matrix_a and matrix_b
+ *
+ * arguments:
+ * - (float *) matrix_a: square matrix of size 4 x 4,
+ * - (float *) matrix_b: square matrix of size 4 x 4,
+ * - (float *) output: 4 x 4 result (return value)
+ * - (int) off_a,b,o: offset between to elements last element of one row
+ * and 1 element of next row matrix_a, matrix_b and output
+ *
+ * details documented here 
+ */
+
+static inline void my_sgemm_4x4(float *matrix_a, float *matrix_b,
+ float *output,
+ int off_a, int off_b, int off_o ) {
+ /** \code */
+ asm volatile (
+ "# Start manual code \n\t"
+ "# Matrix Multiplication \n\n\t"
+ /* Maco section */
+ ".macro mul_col_f32 res_q, col0_d, col1_d\n\t"
+ "vmla.f32 \\res_q, q8, \\col0_d[0]\n\t"
+ "vmla.f32 \\res_q, q9, \\col0_d[1]\n\t"
+ "vmla.f32 \\res_q, q10, \\col1_d[0]\n\t"
+ "vmla.f32 \\res_q, q11, \\col1_d[1]\n\t"
+ ".endm\n\n\t"
+ /* end macro section */
+ /* load current state of output -> q12 - q15 */
+ "vld1.32 {q12}, [%6]!\n\t"
+ "add %6, %6, %5\n\t" /* add some offset until start of next row */
+ "vld1.32 {q13}, [%6]!\n\t"
+ "add %6, %6, %5\n\t"
+ "vld1.32 {q14}, [%6]!\n\t"
+ "add %6, %6, %5\n\t"
+ "vld1.32 {q15}, [%6]!\n\t"
+ /* load matrix_b (transposed!) -> q8 - q11 */
+ "vld1.32 {q8}, [%2]!\n\t"
+ "add %2, %2, %4\n\t"
+ "vld1.32 {q9}, [%2]!\n\t"
+ "add %2, %2, %4\n\t"
+ "vld1.32 {q10}, [%2]!\n\t"
+ "add %2, %2, %4\n\t"
+ "vld1.32 {q11}, [%2]!\n\t"
+ /* load matrix_a -> q0 - q3 */
+ "vld1.32 {q0}, [%1]!\n\t"
+ "add %1, %1, %3\n\t"
+ "vld1.32 {q1}, [%1]!\n\t"
+ "add %1,%1, %3\n\t"
+ "vld1.32 {q2}, [%1]!\n\t"
+ "add %1, %1, %3\n\t"
+ "vld1.32 {q3}, [%1]!\n\t"
+ /* end load registers
+ * start doing the actual matrix multiplication as defined in macro */
+ "mul_col_f32 q12, d0, d1\n\t"
+ "mul_col_f32 q13, d2, d3\n\t"
+ "mul_col_f32 q14, d4, d5\n\t"
+ "mul_col_f32 q15, d6, d7\n\n\t"
+ /* store the result [q12 - 115] into output */
+ "vst1.32 {q12}, [%0]!\n\t"
+ "add %0, %0, %5\n\t"
+ "vst1.32 {q13}, [%0]!\n\t"
+ "add %0, %0, %5\n\t"
+ "vst1.32 {q14}, [%0]!\n\t"
+ "add %0, %0, %5\n\t"
+ "vst1.32 {q15}, [%0]!\n\t"
+ /* start argument section of inline assembler */
+ :"+r"((long) output)
+ :"r"(&matrix_a[0]),"r"(&matrix_b[0]),"r"(off_a),"r"(off_b),
+ "r"(off_o),"r"(&output[0]));
+ /** \endcode */
+ return;
+}
+
+/**
+ * matrix matrix multiplication of some matrix_a and some matrix_b
+ * (works only for size 4*n x 4*m)
+ * the order is column based and output = a x b.transpose()
+ *
+ * the multiplication based on patch-wise standard multiplication algorithm
+ * each patch of size 4 x 4
+ */
+
+void my_sgemm(float *matrix_a, int n_rows_a, int n_cols_a,
+ float *matrix_b, int n_rows_b, int n_cols_b,
+ float *output ) {
+ int offset_a = 4*(n_cols_a-4);
+ int offset_b = 4*(n_cols_b-4);
+ for(int i=0;i"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## ?\n",
+ "\n",
+ "No idea, about the following: $y = tanh(M)$"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "y = np.tanh(M1.dot(M1.T))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 24,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "(1000, 1000)"
+ ]
+ },
+ "execution_count": 24,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "y.shape"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 51,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "16.0"
+ ]
+ },
+ "execution_count": 51,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "((4*128)**3)*16/((128)**3*64)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# das wichtig"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.6.9"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
diff --git a/post/stammtisch20200506.md b/post/stammtisch20200506.md
new file mode 100644
index 0000000..a7a0086
--- /dev/null
+++ b/post/stammtisch20200506.md
@@ -0,0 +1,26 @@
+---
+date: "2020-05-06"
+tages: ["chaostreff", "veranstaltung", "protokoll"]
+title: "Kurzprotokoll 06.05.2020"
+---
+
+
+Chaostreff-relevante Punkte:
+
+1. Der Chaostreff findet nun regelmäßig an folgenden Terminen statt:
+ 1. 1. Mittwoch im Monat, ab 19:42, Ort: Jitsi
+ 2. 3. Dienstag im Monat, ab 19:42, Ort: Jitsi
+
+ Jitsi-Link: https://meet.jit.si/chaoslb
+
+
+Einige Themen mit und ohne Links:
+
+1. Es gab einen spannenden Vortrag von fritzthekit/eduard zu SIMD und sauschnellen Matrixberechnungen auf einem
+ Raspberry Pi. Links zum [Jupyter-Notebook](motivation_matrix_mult.ipynb), [C-Code](matrix_matrix.c) und zur [Präsentation als Markdown](CCC_Why_what_is_SIMD.md)
+2. Go für kleine Plätze [TinyGo](https://tinygo.org/)
+3. Zooms Datenschutzbeurteilungen waren ein Thema (die jetzt doch irgendwie halb DSGVO-konform sind)
+3. ...
+
+
+Nächster Stammtisch ist am 19.05.2020 mit einem Vortrag von Harvey.
diff --git a/vas.md b/vas.md
index 89fc4b7..d0a9466 100644
--- a/vas.md
+++ b/vas.md
@@ -12,5 +12,5 @@ ansehen.
1. 01.04.2020 - Harvey/Heinz: _Shell-Health-Check_ oder _Wie ich (wieder) lernte, die Shell zu lieben_ [Folien](https://complb.de/stammtisch20200401/CompLB-Kramski-Shell-Check-20200325_v02.pdf)
2. 22.04.2020 - ampoff/Steffen: _Die National Software Reference Library_ [Folien](https://complb.de/stammtisch20200422/nsrl_short.pdf)
3. 06.05.2020 - fritzthekit - _SIMD und neuronale Netze_
-4. 20.05.2020 - Harvey/Heinz: _Mit Spielfilm-Mitschnitten gegen den Stream schwimmen_
+4. 19.05.2020 - Harvey/Heinz: _Mit Spielfilm-Mitschnitten gegen den Stream schwimmen_
5. t.b.a. - ampoff/Steffen: _Ansible und AWX_ oder _einenSchmissigenTitelFinden_