stammtisch 06.05.2020

Steffen Fritz 2020-05-07 14:54:28 +02:00
parent c55a1cab4b
commit 9e5eed71ff
5 changed files with 904 additions and 1 deletions


@ -0,0 +1,259 @@
# ARMv7-l SIMD and using NEON
## Motivation for SIMD
<img src="pics/Adaline.jpg">
[(WikiCommons:Adaline)](https://commons.wikimedia.org/wiki/File:Adaline.jpg)
$\hat{y} = f(W\times \hat{x}+B)$
For a time series:
$X = \left[\hat{x_0},\hat{x_1},\dots,\hat{x_n}\right]$
We get:
$Y = f(W\times X+B)$
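The formula above can be sketched in plain C. This is a minimal sketch, assuming a row-major weight array and a ReLU as the activation $f$; the names `dense_forward` and `relu` are illustrative assumptions and do not come from the talk.

```c
#include <stddef.h>

/* Example activation f (an assumption for this sketch). */
float relu(float v)
{
    return v > 0.0f ? v : 0.0f;
}

/* y = f(W * x + b)
 * w: n_out x n_in weights, stored row by row in a flat array
 * x: n_in inputs, b: n_out biases, y: n_out outputs */
void dense_forward(const float *w, const float *x, const float *b,
                   float *y, size_t n_out, size_t n_in,
                   float (*f)(float))
{
    for (size_t i = 0; i < n_out; i++) {
        float sum = b[i];                 /* start with the bias */
        for (size_t j = 0; j < n_in; j++)
            sum += w[i * n_in + j] * x[j]; /* scalar product of row i and x */
        y[i] = f(sum);
    }
}
```

For a time series, the same function would be called once per input vector, which is exactly the matrix product $W \times X$ computed column by column.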
### That is what TensorFlow, NumPy and many others are good at ...
## Matrix Multiplication
For each element of the resulting matrix, a scalar product of a specific row of matrix W and a specific column of matrix X is required.
<img src="pics/mxm.png">
Now that takes a while:
Simple approach in C
~~~
/* first is m x p, second is p x q, multiply is m x q;
   sum must be 0 before the loops start */
for (c = 0; c < m; c++) {
    for (d = 0; d < q; d++) {
        for (k = 0; k < p; k++) {
            sum = sum + first[c][k]*second[k][d];
        }
        multiply[c][d] = sum;
        sum = 0;
    }
}
~~~
For matrices of shape (100,100) x (100,1000) that requires 100*100*1000 = 10^7 multiply-add operations (10 MFLOP). What can be done to optimize the speed?
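To make the operation count concrete, here is a small self-contained sketch of the same triple loop on flat arrays that also counts its multiply-adds. The function name `naive_matmul` and the flat row-by-row layout are assumptions made for this illustration.

```c
#include <stddef.h>

/* Naive product C = A * B on flat arrays stored row by row:
 * A is m x p, B is p x q, C is m x q.
 * Returns the number of multiply-add operations performed. */
unsigned long naive_matmul(const float *a, const float *b, float *c,
                           size_t m, size_t p, size_t q)
{
    unsigned long ops = 0;
    for (size_t i = 0; i < m; i++) {
        for (size_t j = 0; j < q; j++) {
            float sum = 0.0f;
            for (size_t k = 0; k < p; k++) {
                sum += a[i * p + k] * b[k * q + j];
                ops++;                    /* one multiply-add */
            }
            c[i * q + j] = sum;
        }
    }
    return ops;
}
```

The return value is always m * p * q, so for m = 100, p = 100, q = 1000 it is 10 000 000.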
## Techniques to optimize the calculation
### Only splitting the Matrix
Two reasons:
1. optimize cache usage (**not today**)
2. using **SIMD power**
<img src="pics/matrix.png">
# SIMD (single instruction multiple data)
### Just a few words about inline assembler in C (or C++)
For assembler examples, see the [GCC-Inline-Assembly-HOWTO](https://www.ibiblio.org/gferg/ldp/GCC-Inline-Assembly-HOWTO.html).
The simplest one, which works on x86:
~~~
#include <stdio.h>
int main(void)
{
    int foo = 10, bar = 15;
    /* eax = eax + ebx; foo lives in eax ("a"), bar in ebx ("b") */
    asm volatile ("addl %%ebx,%%eax"
                  :"=a"(foo)
                  :"a"(foo), "b"(bar));
    printf("foo+bar=%d\n", foo);
    return 0;
}
~~~
From the GCC manual:
~~~
asm asm-qualifiers ( AssemblerTemplate
: OutputOperands
[ : InputOperands
[ : Clobbers ] ])
~~~
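The four parts of the syntax above can be tried out without any target-specific instruction. The following is a sketch, assuming GCC or Clang: the assembler template is empty, so it compiles on x86 and ARM alike, and only the operand and clobber sections do any work. The function name `opaque_value` is made up for this example.

```c
/* An empty assembler template with one read-write operand and a
 * "memory" clobber. The compiler must assume that x is read and
 * rewritten by the asm statement, so it cannot constant-fold
 * across it, although no instruction is emitted. */
int opaque_value(int x)
{
    __asm__ volatile (""          /* AssemblerTemplate: no instructions */
                      : "+r"(x)   /* OutputOperands: x is read and written */
                      :           /* InputOperands: none */
                      : "memory"  /* Clobbers */
                      );
    return x;
}
```

The constraint `"+r"` marks an operand that the assembler code both reads and modifies; registers changed by the code but not listed as operands belong in the clobber list.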
## My sources
All points above are from the link above. The following points are from the
[NEON™ Version: 1.0 Programmer's Guide](https://static.docs.arm.com/den0018/a/DEN0018A_neon_programmers_guide_en.pdf):
- general idea behind SIMD (not talking about MIMD)
- ARM NEON comparison with others (1.2 - pp 1-4)
- the instruction timing is not clear; as with all calculations, it depends mainly on data fetch time.
- Fundamentals of NEON technology (1.4 - pp 1-10)
  - 1.4.1 Registers q, d, s
  - 1.4.2 Datatypes
### What is it
With a single instruction, a vector (or another structure) can be processed in parallel.
One assembler instruction multiplies and accumulates a whole vector of four floats, the building block of the 4 x 4 patch product:
<img src="pics/matrix-simd.png">
Each of the 9 patches requires 4 x 4 = 16 SIMD multiply-add instructions (`vmla.f32`), compared to 4 x 4 x 4 = 64 scalar multiply-add operations.
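The count of 16 steps per patch can be checked with a scalar emulation. This is only a sketch in portable C: `vmla4` stands in for one 4-wide multiply-add instruction, and both function names are assumptions made for this illustration.

```c
/* Scalar emulation of one 4-wide multiply-add step:
 * acc[0..3] += vec[0..3] * scalar.
 * On NEON this is a single instruction; here it is a loop. */
static void vmla4(float acc[4], const float vec[4], float scalar)
{
    for (int lane = 0; lane < 4; lane++)
        acc[lane] += vec[lane] * scalar;
}

/* c += a * b for 4 x 4 patches stored row by row.
 * Returns the number of emulated SIMD steps. */
int patch_mul_4x4(const float a[16], const float b[16], float c[16])
{
    int simd_steps = 0;
    for (int row = 0; row < 4; row++) {
        for (int k = 0; k < 4; k++) {
            /* output row += (row k of b) * a[row][k] */
            vmla4(&c[4 * row], &b[4 * k], a[4 * row + k]);
            simd_steps++;
        }
    }
    return simd_steps;
}
```

Each call of `vmla4` performs 4 scalar multiply-adds, so the 16 steps cover the 64 scalar operations of a 4 x 4 patch product.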
## Remark about this document
This study only aims at a better understanding of the SIMD instructions and the SIMD performance of
the ARMv7-A core (the actual core is a Cortex-A53, but the OS supports only the 32-bit
mode).
## Documents and Sources
[ARM® and Thumb®-2 Instruction Set](http://infocenter.arm.com/help/topic/com.arm.doc.qrc0001m/QRC0001_UAL.pdf)
[ARM Architecture Reference Manual ARMv7-A and ARMv7-R](https://static.docs.arm.com/ddi0406/c/DDI0406C_C_arm_architecture_reference_manual.pdf)
The
[NEON™ Version: 1.0 Programmer's Guide](https://static.docs.arm.com/den0018/a/DEN0018A_neon_programmers_guide_en.pdf) provides all the information required to really do SIMD on ARMv7-A and ARMv7-R.
The document explains the register structure of the single, double and 128-bit registers as well as the instructions.
Besides other examples (swapping color channels, FIR, cross product),
there is also an example of matrix-matrix multiplication.
The example examined here is based on this document and the 4 x 4 matrix multiplication given there (chapter 7.1, pp. 115).
## About the example: my_sgemm
The matrix-matrix multiplication calculates patches of 4 x 4 at a time; the rest of the
calculation is straightforward.
~~~
for (i ...)
for (j ...)
for (k ...)
~~~
The inner loop calls the optimized 4 x 4 multiplication.
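The loop structure can be sketched in portable C as follows. This is a sketch under assumptions: `kernel_4x4` is a plain C stand-in for the optimized assembler kernel, and both function names and the parameters `lda`, `ldb`, `ldc` (row lengths in elements) are made up for this illustration.

```c
#include <stddef.h>

/* Portable stand-in for the optimized kernel: c += a * b on one
 * 4 x 4 patch. lda, ldb, ldc are the row lengths (in elements) of
 * the full matrices the patches are cut from. */
static void kernel_4x4(const float *a, size_t lda,
                       const float *b, size_t ldb,
                       float *c, size_t ldc)
{
    for (size_t r = 0; r < 4; r++)
        for (size_t k = 0; k < 4; k++)
            for (size_t col = 0; col < 4; col++)
                c[r * ldc + col] += a[r * lda + k] * b[k * ldb + col];
}

/* c += a * b, where a is n x p and b is p x q, stored row by row.
 * n, p and q must be multiples of 4; c must be initialized. */
void tiled_matmul(const float *a, const float *b, float *c,
                  size_t n, size_t p, size_t q)
{
    for (size_t i = 0; i < n; i += 4)
        for (size_t j = 0; j < q; j += 4)
            for (size_t k = 0; k < p; k += 4)
                kernel_4x4(&a[i * p + k], p,
                           &b[k * q + j], q,
                           &c[i * q + j], q);
}
```

Because the kernel accumulates into `c`, the output has to be set to zero before the first call.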
## Shape of the matrices
All matrices are stored in C as flat arrays, row after row (row-major order, called column-based in the code). matrix_a is regular and matrix_b is transposed. (Therefore, all scalar products
of columns [B] with rows of [A] are column $\times$ column multiplications.)
The calculation performed is
$C = A \times B^\mathsf{T} + C$
Assuming matrix A has n rows and m columns,
element A[i,j] has the index i * m + j in the C array representing the matrix.
If we want to extract a patch out of the matrix,
A[k:k+4,l:l+4], the start of its four rows can be calculated as follows:
- the first row starts at k*m+l
- the next row starts an offset of o = m-4 elements after the end of the previous patch row
- the same holds for the third and fourth rows.
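The index arithmetic for a patch can be written down as a short sketch. The function names are assumptions made for this illustration; the byte offset follows the `4*(n_cols-4)` expression used by the C code of this talk.

```c
#include <stddef.h>

/* Index of element A[i][j] in a flat array with m columns per row. */
size_t flat_index(size_t i, size_t j, size_t m)
{
    return i * m + j;
}

/* Fill start[0..3] with the indices of the first element of each
 * row of the patch A[k:k+4, l:l+4]. */
void patch_row_starts(size_t k, size_t l, size_t m, size_t start[4])
{
    for (size_t r = 0; r < 4; r++)
        start[r] = flat_index(k + r, l, m);
}

/* Gap in bytes between the end of one patch row and the start of
 * the next one: (m - 4) elements of type float. */
size_t row_gap_bytes(size_t m)
{
    return sizeof(float) * (m - 4);
}
```

For m = 12, k = 4 and l = 8 the rows start at 56, 68, 80 and 92; the end of the first patch row is at 60, and 60 + (12 - 4) = 68 is the start of the next one.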
## The assembler SIMD part for the 4 x 4 multiplication
Purpose of the 4 x 4 matrix multiplication: it multiplies a small 4 x 4 patch of some large
column-based matrices. Important to know: matrix_a is regular,
matrix_b is transposed.
~~~
static inline void my_sgemm_4x4(float *matrix_a, float *matrix_b,
float *output,
int off_a, int off_b, int off_o ) {
/** \code */
asm volatile (
"# Start manual code \n\t"
"# Matrix Multiplication \n\n\t"
~~~
Macro section
This macro performs the actual multiplication. It provides one output row from one row of the matrix_a patch and the matrix_b patch (q8 - q11). The row of matrix_a is passed in col0_d and col1_d (two 64-bit d registers, i.e. the two halves of one 128-bit q register); the four rows of the matrix_b patch are stored in
q8-q11. res_q gives the resulting output row.
~~~
".macro mul_col_f32 res_q, col0_d, col1_d\n\t"
"vmla.f32 \\res_q, q8, \\col0_d[0]\n\t"
"vmla.f32 \\res_q, q9, \\col0_d[1]\n\t"
"vmla.f32 \\res_q, q10, \\col1_d[0]\n\t"
"vmla.f32 \\res_q, q11, \\col1_d[1]\n\t"
".endm\n\n\t"
~~~
End macro section
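For comparison, the same four multiply-accumulate steps can be sketched with NEON intrinsics from `arm_neon.h` instead of an assembler macro. This is a sketch under assumptions, not the code of the talk: the function name `mul_row_4` and the parameter `ldb` (row length of matrix_b in elements) are made up, and a plain C fallback is included so that the block also compiles on machines without NEON.

```c
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>

/* res[0..3] += sum over k of b_patch[k][0..3] * a_row[k] */
void mul_row_4(float *res, const float *a_row,
               const float *b_patch, size_t ldb)
{
    float32x4_t acc = vld1q_f32(res);      /* like q12..q15 */
    float32x4_t a   = vld1q_f32(a_row);    /* like q0..q3   */
    /* each line corresponds to one vmla.f32 of the macro */
    acc = vmlaq_lane_f32(acc, vld1q_f32(b_patch + 0 * ldb), vget_low_f32(a), 0);
    acc = vmlaq_lane_f32(acc, vld1q_f32(b_patch + 1 * ldb), vget_low_f32(a), 1);
    acc = vmlaq_lane_f32(acc, vld1q_f32(b_patch + 2 * ldb), vget_high_f32(a), 0);
    acc = vmlaq_lane_f32(acc, vld1q_f32(b_patch + 3 * ldb), vget_high_f32(a), 1);
    vst1q_f32(res, acc);
}
#else
/* Portable fallback with the same result, one lane at a time. */
void mul_row_4(float *res, const float *a_row,
               const float *b_patch, size_t ldb)
{
    for (size_t k = 0; k < 4; k++)
        for (size_t lane = 0; lane < 4; lane++)
            res[lane] += b_patch[k * ldb + lane] * a_row[k];
}
#endif
```

With intrinsics the compiler chooses the registers itself, so no operand or clobber list is needed.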
Start loading the 128-bit registers with 4 single-precision floats each. q12-q15 are first loaded
with the current state of the output.
After each register is loaded, some
offset has to be added, since the next row starts with some offset. The same
mechanism applies to all matrices.
Load current state of output -> q12 - q15:
~~~
"vld1.32 {q12}, [%6]!\n\t"
"add %6, %6, %5\n\t" /* add some offset until start of next row */
"vld1.32 {q13}, [%6]!\n\t"
"add %6, %6, %5\n\t"
"vld1.32 {q14}, [%6]!\n\t"
"add %6, %6, %5\n\t"
"vld1.32 {q15}, [%6]!\n\t"
~~~
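The pointer walk of these loads can be emulated in portable C. This is a sketch, assuming the offset is given in bytes as in the assembler code; the function name `load_patch` is made up for this illustration.

```c
#include <stddef.h>
#include <string.h>

/* Copy a 4 x 4 patch into dst the way the loads above walk memory:
 * read 4 floats, advance the pointer by those 16 bytes (the "!"
 * write-back), then add the byte offset to reach the next row. */
void load_patch(const float *src, size_t off_bytes, float dst[16])
{
    const char *p = (const char *)src;
    for (int row = 0; row < 4; row++) {
        memcpy(&dst[4 * row], p, 4 * sizeof(float)); /* vld1.32 {qN}, [p] */
        p += 4 * sizeof(float);                      /* post-increment "!" */
        if (row < 3)
            p += off_bytes;                          /* add p, p, off */
    }
}
```

For a matrix with 8 columns the offset is 4 * (8 - 4) = 16 bytes, so each load starts 32 bytes after the previous one.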
Load matrix_b (transposed!) -> q8 - q11:
~~~
"vld1.32 {q8}, [%2]!\n\t"
"add %2, %2, %4\n\t"
"vld1.32 {q9}, [%2]!\n\t"
"add %2, %2, %4\n\t"
"vld1.32 {q10}, [%2]!\n\t"
"add %2, %2, %4\n\t"
"vld1.32 {q11}, [%2]!\n\t"
~~~
load matrix_a -> q0 - q3
~~~
"vld1.32 {q0}, [%1]!\n\t"
"add %1, %1, %3\n\t"
"vld1.32 {q1}, [%1]!\n\t"
"add %1,%1, %3\n\t"
"vld1.32 {q2}, [%1]!\n\t"
"add %1, %1, %3\n\t"
"vld1.32 {q3}, [%1]!\n\t"
~~~
End load registers
Start doing the actual matrix multiplication as defined in macro
~~~
"mul_col_f32 q12, d0, d1\n\t"
"mul_col_f32 q13, d2, d3\n\t"
"mul_col_f32 q14, d4, d5\n\t"
"mul_col_f32 q15, d6, d7\n\n\t"
~~~
Store the result (q12 - q15) into output:
~~~
"vst1.32 {q12}, [%0]!\n\t"
"add %0, %0, %5\n\t"
"vst1.32 {q13}, [%0]!\n\t"
"add %0, %0, %5\n\t"
"vst1.32 {q14}, [%0]!\n\t"
"add %0, %0, %5\n\t"
"vst1.32 {q15}, [%0]!\n\t"
~~~
start argument section of inline assembler
~~~
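:"+r"(output)
:"r"(&matrix_a[0]),"r"(&matrix_b[0]),"r"(off_a),"r"(off_b),
"r"(off_o),"r"(&output[0])
/* registers and memory changed by the code above; note that %1, %2
 * and %6 are advanced in place although they are declared as inputs */
:"q0","q1","q2","q3","q8","q9","q10","q11",
"q12","q13","q14","q15","memory");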
/** \endcode */
return;
}
~~~

post/matrix_matrix.c

@ -0,0 +1,300 @@
#include <stdio.h>
#include <string.h>
#include <math.h>
#include <sys/time.h>
/**
* gcc -o mm -g -march=armv7 -mfpu=neon-vfpv4 matrix_matrix.c
* or cross compile
* arm-linux-gnueabihf-gcc -o mm -g -march=armv7 -mfpu=neon-vfpv4 matrix_matrix.c
*
* test with a = np.arange(4*n*4*m).reshape(4*n,4*m)
* a.dot(a.T)
*
* this file contains all the elements to show ARMv7-L usage of
* SIMD architecture
*
*/
void simple_transpose( float *, int, int, float *);
static inline void my_sgemm_4x4(float *, float *, float *,
int, int, int );
/**
* helper routine (only for documentation and demonstration)
*
* transpose some matrix matrix_a (size n_rows, n_cols)
* to matrix output
*
*/
void simple_transpose( float *matrix_a, int n_rows_a, int n_cols_a,
float *output ) {
for (int ra = 0; ra < n_rows_a; ra++)
for (int ca = 0; ca < n_cols_a; ca++ )
output[n_rows_a*ca+ra] = matrix_a[n_cols_a*ra+ca];
//output[ca][ra] = matrix_a[ra][ca];
return;
}
/**
* Kernel function including the optimization and the
* 4 x 4 multiplication of 4 x 4 fragments of the large column-based
* matrices matrix_a and matrix_b
*
* arguments:
* - (float *) matrix_a: pointer to the first element of a 4 x 4 patch,
* - (float *) matrix_b: pointer to the first element of a 4 x 4 patch,
* - (float *) output: 4 x 4 result (return value)
* - (int) off_a,b,o: byte offset from the end of one patch row
* to the first element of the next row of matrix_a, matrix_b and output
*
* details documented here ![Using_SIMD](/home/eduard/work/wikiwhat/doc/Using_SIMD.md)
*/
static inline void my_sgemm_4x4(float *matrix_a, float *matrix_b,
float *output,
int off_a, int off_b, int off_o ) {
/** \code */
asm volatile (
"# Start manual code \n\t"
"# Matrix Multiplication \n\n\t"
/* Macro section */
".macro mul_col_f32 res_q, col0_d, col1_d\n\t"
"vmla.f32 \\res_q, q8, \\col0_d[0]\n\t"
"vmla.f32 \\res_q, q9, \\col0_d[1]\n\t"
"vmla.f32 \\res_q, q10, \\col1_d[0]\n\t"
"vmla.f32 \\res_q, q11, \\col1_d[1]\n\t"
".endm\n\n\t"
/* end macro section */
/* load current state of output -> q12 - q15 */
"vld1.32 {q12}, [%6]!\n\t"
"add %6, %6, %5\n\t" /* add some offset until start of next row */
"vld1.32 {q13}, [%6]!\n\t"
"add %6, %6, %5\n\t"
"vld1.32 {q14}, [%6]!\n\t"
"add %6, %6, %5\n\t"
"vld1.32 {q15}, [%6]!\n\t"
/* load matrix_b (transposed!) -> q8 - q11 */
"vld1.32 {q8}, [%2]!\n\t"
"add %2, %2, %4\n\t"
"vld1.32 {q9}, [%2]!\n\t"
"add %2, %2, %4\n\t"
"vld1.32 {q10}, [%2]!\n\t"
"add %2, %2, %4\n\t"
"vld1.32 {q11}, [%2]!\n\t"
/* load matrix_a -> q0 - q3 */
"vld1.32 {q0}, [%1]!\n\t"
"add %1, %1, %3\n\t"
"vld1.32 {q1}, [%1]!\n\t"
"add %1,%1, %3\n\t"
"vld1.32 {q2}, [%1]!\n\t"
"add %1, %1, %3\n\t"
"vld1.32 {q3}, [%1]!\n\t"
/* end load registers
* start doing the actual matrix multiplication as defined in macro */
"mul_col_f32 q12, d0, d1\n\t"
"mul_col_f32 q13, d2, d3\n\t"
"mul_col_f32 q14, d4, d5\n\t"
"mul_col_f32 q15, d6, d7\n\n\t"
/* store the result [q12 - q15] into output */
"vst1.32 {q12}, [%0]!\n\t"
"add %0, %0, %5\n\t"
"vst1.32 {q13}, [%0]!\n\t"
"add %0, %0, %5\n\t"
"vst1.32 {q14}, [%0]!\n\t"
"add %0, %0, %5\n\t"
"vst1.32 {q15}, [%0]!\n\t"
/* start argument section of inline assembler */
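:"+r"(output)
:"r"(&matrix_a[0]),"r"(&matrix_b[0]),"r"(off_a),"r"(off_b),
"r"(off_o),"r"(&output[0])
/* registers and memory changed by the code above; note that %1, %2
 * and %6 are advanced in place although they are declared as inputs */
:"q0","q1","q2","q3","q8","q9","q10","q11",
"q12","q13","q14","q15","memory");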
/** \endcode */
return;
}
/**
* matrix matrix multiplication of some matrix_a and some matrix_b
* (works only for size 4*n x 4*m)
* the order is column based and output = a x b.transpose()
*
* the multiplication based on patch-wise standard multiplication algorithm
* each patch of size 4 x 4
*/
void my_sgemm(float *matrix_a, int n_rows_a, int n_cols_a,
float *matrix_b, int n_rows_b, int n_cols_b,
float *output ) {
int offset_a = 4*(n_cols_a-4);
int offset_b = 4*(n_cols_b-4);
for(int i=0;i<n_rows_a;i = i+4 ) {
for(int j=0;j<n_cols_b;j = j+4 ) {
for(int k=0;k<n_cols_a;k = k+4 ) {
my_sgemm_4x4(&matrix_a[n_cols_a*i+k],
&matrix_b[n_cols_b*k+j],
&output[n_cols_b*i+j],
offset_a, offset_b, offset_b);
}
}
}
return;
}
/**
* standard algorithm for matrix matrix multiplication
* output = a.dot(b), where b is passed already transposed (in main: b = a.T)
* - arguments:
* - (float *) a: column-based matrix of size n_rows_a x n_cols_a
* - (int) n_rows_a, n_cols_a: size of matrix a
* - (float *) b: column-based matrix of size n_rows_b x n_cols_b
* - (int) n_rows_b, n_cols_b: size of matrix b
* - (float *) output: column-based matrix of size n_rows_a x n_cols_b
* - return: void
*/
void simple_mm( float *a, int n_rows_a, int n_cols_a,
float *b, int n_rows_b, int n_cols_b,
float *output ) {
for(int i=0;i<n_rows_a;i++)
for(int j=0;j<n_cols_b;j++) {
output[n_cols_b*i+j]=0;
for(int k=0;k<n_cols_a;k++) {
output[n_cols_b*i+j]+=a[n_cols_a*i+k]*b[n_cols_b*k+j];
}
}
return;
}
/**
* int main():
* calls simple and optimized function and compare speed
* size defined by macros (N_ROWS_x, N_COLS_B: 4 * n, 4 * m)
* Matrix defined as:
* matrix_a = np.arange(N_ROWS_A*N_COLS_B).reshape(N_ROWS_A,N_COLS_A)
* matrix_b = a.T
* 1. call optimized my_sgemm
* 2. call simple_mm
* compare times, check identity
* print results
* size of example matrix
*/
#define N_COLS_A 256
#define N_ROWS_A 256
#define N_COLS_B N_ROWS_A
#define N_ROWS_B N_COLS_A
int main() {
float matrix_a[N_ROWS_A][N_COLS_A];
float matrix_b[N_ROWS_B][N_COLS_B];
float matrix_aa[N_ROWS_A][N_COLS_A];
float matrix_bb[N_ROWS_B][N_COLS_B];
float buffer[N_ROWS_A][N_COLS_B];
float reference[N_ROWS_A][N_COLS_B];
struct timeval t1, t2, t3;
long int durationf, durations;
/**
* matrix_a = np.arange(N_ROWS_A*N_COLS_B).reshape(N_ROWS_A,N_COLS_A)
*/
for (int ra = 0; ra < N_ROWS_A; ra++) {
for (int ca = 0; ca < N_COLS_A; ca++) {
matrix_a[ra][ca] = N_COLS_A*ra+ca;
matrix_aa[ra][ca] = N_COLS_A*ra+ca;
}
}
/**
* calculate matrix_b as matrix_a.T
*
*/
simple_transpose(&matrix_a[0][0], N_ROWS_A, N_COLS_A,
&matrix_b[0][0]);
simple_transpose(&matrix_aa[0][0], N_ROWS_A, N_COLS_A,
&matrix_bb[0][0]);
/**
* set the outputs buffer (output of my_sgemm) and reference (simple_mm)
* to zero
*/
for (int ra = 0; ra < N_ROWS_A; ra++)
for (int cb = 0; cb < N_COLS_B; cb++) {
buffer[ra][cb] = 0.0;
reference[ra][cb] = 0.0;
}
/**
* 1. set timer to t1 (start of optimized algorithm)
* 2. call optimized algorithm
*/
gettimeofday(&t1, NULL);
my_sgemm(&matrix_aa[0][0], N_ROWS_A, N_COLS_A,
&matrix_bb[0][0], N_ROWS_B, N_COLS_B,
&buffer[0][0]);
/**
* 3. set timer to t2 (end of optimized and start of simple algorithm)
* 4. call simple algorithm
*/
gettimeofday(&t2, NULL);
simple_mm(&matrix_a[0][0], N_ROWS_A, N_COLS_A,
&matrix_b[0][0], N_ROWS_B, N_COLS_B,
&reference[0][0]);
/**
* 5. set timer to t3 (end of simple algorithm)
*/
gettimeofday(&t3, NULL);
/**
* calculate durations for optimized and simple algorithm
*/
durationf = 1e6*(t2.tv_sec - t1.tv_sec)+(t2.tv_usec - t1.tv_usec);
durations = 1e6*(t3.tv_sec - t2.tv_sec)+(t3.tv_usec - t2.tv_usec);
/**
* output (6 x 6 patch of result (both algorithm)
*/
printf("my_sgemm\n");
for (int ra=0; ra<6; ra++ ) {
for (int cb=0; cb<6; cb++ ) printf("%.2e ", buffer[ra][cb]);
printf("\n");
}
printf("reference\n");
for (int ra=0; ra<6; ra++ ) {
for (int cb=0; cb<6; cb++ ) printf("%.2e ", reference[ra][cb]);
printf("\n");
}
/**
* calculate mean square error
*/
float mse = 0.0F;
for (int ra=0; ra<N_ROWS_A; ra++ ) {
for (int cb=0; cb<N_COLS_B; cb++ ) {
mse += (reference[ra][cb]-buffer[ra][cb]) *
(reference[ra][cb]-buffer[ra][cb]);
}
}
/* divide the sum of squared errors by the number of elements */
mse /= (float)(N_ROWS_A*N_COLS_B);
/**
* print mse and ratio of times optimized_time / simple_time
*/
printf("MSE: %.5f [durationrate f/s %.5f]\n",mse,
(float)durationf/(float)durations);
return 0;
}


@ -0,0 +1,318 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Motivation\n",
"## How to solve matrix-matrix multiplication"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Simple approach in C\n",
"\n",
"~~~\n",
" for (c = 0; c < m; c++) {\n",
" for (d = 0; d < q; d++) {\n",
" for (k = 0; k < p; k++) {\n",
" sum = sum + first[c][k]*second[k][d];\n",
" }\n",
" \n",
" multiply[c][d] = sum;\n",
" sum = 0;\n",
" }\n",
" }\n",
"~~~"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"def mymatrixmult(A,B):\n",
"    y = np.zeros((A.shape[0], B.shape[1]))\n",
"    for i in range(A.shape[0]):\n",
"        for j in range(B.shape[1]):\n",
"            for k in range(A.shape[1]):\n",
"                y[i][j] += A[i][k]*B[k][j]\n",
"    return y\n",
" \n",
"m = np.arange(40000).reshape(200,200)\n",
"m1 = m/np.average(m)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Takes a while: 200**3 = 8 million multiply-adds (8 MFLOP)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[6.61708085e-03, 1.65675784e-02, 2.65180759e-02, ...,\n",
" 1.96686509e+00, 1.97681559e+00, 1.98676609e+00],\n",
" [1.65675784e-02, 4.65190759e-02, 7.64705735e-02, ...,\n",
" 5.91701260e+00, 5.94696409e+00, 5.97691559e+00],\n",
" [2.65180759e-02, 7.64705735e-02, 1.26423071e-01, ...,\n",
" 9.86716010e+00, 9.91711260e+00, 9.96706510e+00],\n",
" ...,\n",
" [1.96686509e+00, 5.91701260e+00, 9.86716010e+00, ...,\n",
" 7.80145924e+02, 7.84096071e+02, 7.88046219e+02],\n",
" [1.97681559e+00, 5.94696409e+00, 9.91711260e+00, ...,\n",
" 7.84096071e+02, 7.88066220e+02, 7.92036368e+02],\n",
" [1.98676609e+00, 5.97691559e+00, 9.96706510e+00, ...,\n",
" 7.88046219e+02, 7.92036368e+02, 7.96026518e+02]])"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"mymatrixmult(m1, m1.T)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[6.61708085e-03, 1.65675784e-02, 2.65180759e-02, ...,\n",
" 1.96686509e+00, 1.97681559e+00, 1.98676609e+00],\n",
" [1.65675784e-02, 4.65190759e-02, 7.64705735e-02, ...,\n",
" 5.91701260e+00, 5.94696409e+00, 5.97691559e+00],\n",
" [2.65180759e-02, 7.64705735e-02, 1.26423071e-01, ...,\n",
" 9.86716010e+00, 9.91711260e+00, 9.96706510e+00],\n",
" ...,\n",
" [1.96686509e+00, 5.91701260e+00, 9.86716010e+00, ...,\n",
" 7.80145924e+02, 7.84096071e+02, 7.88046219e+02],\n",
" [1.97681559e+00, 5.94696409e+00, 9.91711260e+00, ...,\n",
" 7.84096071e+02, 7.88066220e+02, 7.92036368e+02],\n",
" [1.98676609e+00, 5.97691559e+00, 9.96706510e+00, ...,\n",
" 7.88046219e+02, 7.92036368e+02, 7.96026518e+02]])"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"m1.dot(m1.T)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Now NumPy: 2000**3 = 8 billion multiply-adds (8 GFLOP)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"M = np.arange(4000000).reshape(2000,2000) \n",
"M1 = M/np.average(M) "
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[6.66167083e-04, 1.66566758e-03, 2.66516808e-03, ...,\n",
" 1.99666867e+00, 1.99766817e+00, 1.99866767e+00],\n",
" [1.66566758e-03, 4.66516908e-03, 7.66467058e-03, ...,\n",
" 5.99167016e+00, 5.99466966e+00, 5.99766917e+00],\n",
" [2.66516808e-03, 7.66467058e-03, 1.26641731e-02, ...,\n",
" 9.98667166e+00, 9.99167116e+00, 9.99667067e+00],\n",
" ...,\n",
" [1.99666867e+00, 5.99167016e+00, 9.98667166e+00, ...,\n",
" 7.98001466e+03, 7.98400966e+03, 7.98800466e+03],\n",
" [1.99766817e+00, 5.99466966e+00, 9.99167116e+00, ...,\n",
" 7.98400966e+03, 7.98800666e+03, 7.99200366e+03],\n",
" [1.99866767e+00, 5.99766917e+00, 9.99667067e+00, ...,\n",
" 7.98800466e+03, 7.99200366e+03, 7.99600267e+03]])"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"M1.dot(M1.T)"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],\n",
" [ 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23],\n",
" [ 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35],\n",
" [ 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47],\n",
" [ 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59],\n",
" [ 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71],\n",
" [ 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83],\n",
" [ 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95],\n",
" [ 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107],\n",
" [108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119],\n",
" [120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131],\n",
" [132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143]])"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"n = 12\n",
"np.arange(n*n).reshape(n,n)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Splitting the matrix\n",
"\n",
"Two reasons:\n",
"1. optimize cache usage\n",
"2. using SIMD power\n",
"\n",
"<img src=\"pics/matrix.png\">"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## ?\n",
"\n",
"No idea about the following: $y = tanh(M)$"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"y = np.tanh(M1.dot(M1.T))"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(1000, 1000)"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"y.shape"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"16.0"
]
},
"execution_count": 51,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"((4*128)**3)*16/((128)**3*64)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# The important part"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.9"
}
},
"nbformat": 4,
"nbformat_minor": 4
}


@ -0,0 +1,26 @@
---
date: "2020-05-06"
tags: ["chaostreff", "veranstaltung", "protokoll"]
title: "Brief minutes 06.05.2020"
---
Points relevant to the Chaostreff:
1. The Chaostreff now takes place regularly on the following dates:
   1. First Wednesday of the month, from 19:42, location: Jitsi
   2. Third Tuesday of the month, from 19:42, location: Jitsi
Jitsi link: https://meet.jit.si/chaoslb
Some topics, with and without links:
1. There was an exciting talk by fritzthekit/eduard on SIMD and blazingly fast matrix computations on a
Raspberry Pi. Links to the [Jupyter notebook](motivation_matrix_mult.ipynb), the [C code](matrix_matrix.c) and the [presentation as Markdown](CCC_Why_what_is_SIMD.md)
2. Go for small places: [TinyGo](https://tinygo.org/)
3. Zoom's data protection assessments were a topic (which are now somehow half GDPR-compliant after all)
4. ...
The next Stammtisch is on 19.05.2020, with a talk by Harvey.