diff --git a/_index.md b/_index.md
index acfad29..c5f58f3 100644
--- a/_index.md
+++ b/_index.md
@@ -13,7 +13,7 @@ Der Chaostreff Ludwigsburg ist ein lockeres Zusammentreffen von Hackern, die sic
 #### When and where do you meet?
 The next real-life date is not yet fixed due to Covid-19. At the moment you can find us on https://matrix.complb.de
-(#chaostreff:matrix.complb.de) and every Wednesday on Jitsi (https://complb.de/covid1920200315/)
+(#chaostreff:matrix.complb.de) and twice a month on Jitsi (https://complb.de/stammtisch20200506/)
 
 You can also reach us by mail via chaostreff AT complb PUNKt de
diff --git a/post/CCC_Why_what_is_SIMD.md b/post/CCC_Why_what_is_SIMD.md
new file mode 100644
index 0000000..db3b32b
--- /dev/null
+++ b/post/CCC_Why_what_is_SIMD.md
@@ -0,0 +1,259 @@
+# ARMv7 (armv7l) SIMD and using NEON
+
+## Motivation for SIMD
+
+[(WikiCommons:Adaline)](https://commons.wikimedia.org/wiki/File:Adaline.jpg)
+
+$\hat{y} = f(W\times \hat{x}+B)$
+
+For a time series
+
+$X = \left[\hat{x_0},\hat{x_1},\dots,\hat{x_n}\right]$
+
+we get
+
+$Y = f(W\times X+B)$
+
+### That is what Tensorflow, numpy and lots of others are good at ...
+
+## Matrix Multiplication
+
+For each element of the resulting matrix, a scalar product of a specific row of matrix W and a specific column of matrix X is required.
+
+Now that takes a while. A simple approach in C:
+
+~~~
+    /* multiply = first x second, shapes (m,p) x (p,q) */
+    for (c = 0; c < m; c++) {
+        for (d = 0; d < q; d++) {
+            sum = 0;
+            for (k = 0; k < p; k++) {
+                sum = sum + first[c][k]*second[k][d];
+            }
+            multiply[c][d] = sum;
+        }
+    }
+~~~
+
+For matrices of shape (100,100) x (100,1000) this requires 100*100*1000 = 10^7, i.e. about 10 million multiply-add operations. What can be done to optimize the speed?
+
+## Techniques to optimize the calculation
+
+### Splitting the matrix
+
+Two reasons:
+1. optimize cache usage (**not today**)
+2. using **SIMD power**
+
+# SIMD (single instruction multiple data)
+
+### A few words on inline assembler in C (or C++)
+
+For assembler examples see the [GCC Inline Assembly HOWTO](https://www.ibiblio.org/gferg/ldp/GCC-Inline-Assembly-HOWTO.html).
+
+The simplest one, which works on x86:
+
+~~~
+#include <stdio.h>
+
+int main(void)
+{
+    int foo = 10, bar = 15;
+    asm volatile ("addl %%ebx,%%eax"
+                  :"=a"(foo)
+                  :"a"(foo), "b"(bar));
+    printf("foo+bar=%d\n", foo);
+    return 0;
+}
+~~~
+
+From the gcc manual:
+~~~
+asm asm-qualifiers ( AssemblerTemplate
+                     : OutputOperands
+                     [ : InputOperands
+                     [ : Clobbers ] ])
+~~~
+
+## My sources
+
+All points below are taken from the
+[NEON Version 1.0 Programmer's Guide](https://static.docs.arm.com/den0018/a/DEN0018A_neon_programmers_guide_en.pdf):
+
+- general idea behind SIMD (not talking about MIMD)
+- comparison of ARM NEON with others (1.2, pp. 1-4)
+- the instruction timing is not clear; as in most calculations, it depends mainly on data-fetching time
+- fundamentals of NEON technology (1.4, pp. 1-10)
+  - 1.4.1 Registers q, d, s
+  - 1.4.2 Datatypes
+
+### What is it
+
+With a single instruction, a whole vector (or another structure) is processed in parallel.
+
+One assembler instruction, vmla.f32 (multiply/add; fmla.f32 on AArch64), multiply-accumulates vectors of 4 floats at once.
+
+Each of the 9 patches requires 4 x 4 = 16 SIMD instructions (compared to 4 x 4 x 4 = 64 scalar operations).
+
+## Remark about this document
+
+This study is only meant to give a better understanding of the SIMD instructions and the SIMD performance of
+the ARMv7-A core (the chip is actually a Cortex-A53, but the OS supports only the 32-bit
+instruction set).
+
+## Documents and Sources
+
+[ARM and Thumb-2 Instruction Set](http://infocenter.arm.com/help/topic/com.arm.doc.qrc0001m/QRC0001_UAL.pdf)
+
+[ARM Architecture Reference Manual ARMv7-A and ARMv7-R](https://static.docs.arm.com/ddi0406/c/DDI0406C_C_arm_architecture_reference_manual.pdf)
+
+The
+[NEON Version 1.0 Programmer's Guide](https://static.docs.arm.com/den0018/a/DEN0018A_neon_programmers_guide_en.pdf) provides all the information required to really do SIMD on ARMv7-A and -R.
+The document explains the structure of the single, double and 128-bit registers as well as the instructions.
+
+Besides other examples (swapping color channels, FIR filters,
+cross products),
+there is also an example for matrix-matrix multiplication.
+
+The example examined here is based on this document and the 4 x 4 matrix multiplication given there (chapter 7.1, pp. 115).
+
+## About the example: my_sgemm
+
+The matrix-matrix multiplication calculates patches of 4 x 4 at a time; the rest of the
+calculation is straightforward:
+
+~~~
+for (i ...)
+  for (j ...)
+    for (k ...)
+~~~
+The inner loop calls the optimized 4 x 4 multiplication.
+
+## Shape of the matrixes
+
+All matrices in C are stored row by row (row-major). matrix_a is regular and matrix_b is transposed. (Therefore, all scalar products
+of columns of [B] with rows of [A] become row $\times$ row multiplications over contiguous memory.)
+
+The calculation performs
+
+$C = A \times B^\mathsf{T} + C$
+
+Assuming the matrix A contains n rows and m columns, then
+the element A[i,j] has index i * m + j in the C array representing the matrix.
+If we want to extract a patch A[k:k+4,l:l+4] out of the matrix, the four rows of the patch can be located as follows:
+- the first row starts at k*m+l
+- after reading the 4 floats of one row, the next row starts after an offset of o = m-4 elements
+- the same holds for the third and fourth rows.
+
+## The assembler SIMD part for the 4 x 4 multiplication
+
+Purpose of my_sgemm_4x4: it multiplies a small 4 x 4 patch of some large
+row-major matrices. Important to know: matrix_a is regular,
+matrix_b is transposed.
+
+~~~
+static inline void my_sgemm_4x4(float *matrix_a, float *matrix_b,
+                                float *output,
+                                int off_a, int off_b, int off_o ) {
+    /** \code */
+    asm volatile (
+        "# Start manual code \n\t"
+        "# Matrix Multiplication \n\n\t"
+~~~
+Macro section.
+This macro performs the actual multiplication. It accumulates one output row from one row of matrix_a and the four rows of matrix_b. The row of matrix_a is passed in col0_d and col1_d (two 64-bit d registers, which together correspond to one 128-bit q register), the rows of matrix_b are held in
+q8 - q11. res_q receives the resulting output row.
+~~~
+        ".macro mul_col_f32 res_q, col0_d, col1_d\n\t"
+        "vmla.f32 \\res_q, q8, \\col0_d[0]\n\t"
+        "vmla.f32 \\res_q, q9, \\col0_d[1]\n\t"
+        "vmla.f32 \\res_q, q10, \\col1_d[0]\n\t"
+        "vmla.f32 \\res_q, q11, \\col1_d[1]\n\t"
+        ".endm\n\n\t"
+~~~
+End macro section.
+
+Now the 128-bit registers are loaded with 4 single floats each. q12 - q15 are first loaded
+with the current state of the output.
+
+After each register is loaded, the row offset
+has to be added, since the next row starts only after that offset. The same
+mechanism applies to all matrices.
+
+Load the current state of output -> q12 - q15:
+~~~
+        "vld1.32 {q12}, [%6]!\n\t"
+        "add %6, %6, %5\n\t" /* add the offset up to the start of the next row */
+        "vld1.32 {q13}, [%6]!\n\t"
+        "add %6, %6, %5\n\t"
+        "vld1.32 {q14}, [%6]!\n\t"
+        "add %6, %6, %5\n\t"
+        "vld1.32 {q15}, [%6]!\n\t"
+~~~
+Load matrix_b (transposed!) -> q8 - q11:
+~~~
+        "vld1.32 {q8}, [%2]!\n\t"
+        "add %2, %2, %4\n\t"
+        "vld1.32 {q9}, [%2]!\n\t"
+        "add %2, %2, %4\n\t"
+        "vld1.32 {q10}, [%2]!\n\t"
+        "add %2, %2, %4\n\t"
+        "vld1.32 {q11}, [%2]!\n\t"
+~~~
+Load matrix_a -> q0 - q3:
+~~~
+        "vld1.32 {q0}, [%1]!\n\t"
+        "add %1, %1, %3\n\t"
+        "vld1.32 {q1}, [%1]!\n\t"
+        "add %1, %1, %3\n\t"
+        "vld1.32 {q2}, [%1]!\n\t"
+        "add %1, %1, %3\n\t"
+        "vld1.32 {q3}, [%1]!\n\t"
+~~~
+End of the load section.
+
+Now do the actual matrix multiplication as defined in the macro:
+~~~
+        "mul_col_f32 q12, d0, d1\n\t"
+        "mul_col_f32 q13, d2, d3\n\t"
+        "mul_col_f32 q14, d4, d5\n\t"
+        "mul_col_f32 q15, d6, d7\n\n\t"
+~~~
+Store the result (q12 - q15) into output:
+~~~
+        "vst1.32 {q12}, [%0]!\n\t"
+        "add %0, %0, %5\n\t"
+        "vst1.32 {q13}, [%0]!\n\t"
+        "add %0, %0, %5\n\t"
+        "vst1.32 {q14}, [%0]!\n\t"
+        "add %0, %0, %5\n\t"
+        "vst1.32 {q15}, [%0]!\n\t"
+~~~
+The argument section of the inline assembler:
+~~~
+        :"+r"(output)
+        :"r"(&matrix_a[0]),"r"(&matrix_b[0]),"r"(off_a),"r"(off_b),
+         "r"(off_o),"r"(&output[0]));
+    /** \endcode */
+    return;
+}
+~~~
diff --git a/post/CompLB-Kramski-Home-Recording-20200519_v01.2.pdf b/post/CompLB-Kramski-Home-Recording-20200519_v01.2.pdf
new file mode 100644
index 0000000..3b8f5d9
Binary files /dev/null and b/post/CompLB-Kramski-Home-Recording-20200519_v01.2.pdf differ
diff --git a/post/Logo_Chaostreff_LB.zip b/post/Logo_Chaostreff_LB.zip
new file mode 100644
index 0000000..7a20534
Binary files /dev/null and b/post/Logo_Chaostreff_LB.zip differ
diff --git a/post/logo.md b/post/logo.md
new file mode 100644
index 0000000..cfe3e98
--- /dev/null
+++ b/post/logo.md
@@ -0,0 +1,8 @@
+---
+date: "2020-05-02"
+tags: ["chaostreff", "logo", "misc"]
+title: "Logo-Paket Chaostreff LB"
+---
+
+The logo zip is available here: [Download](https://complb.de/logo/Logo_Chaostreff_LB.zip)
+
diff --git a/post/matrix_matrix.c b/post/matrix_matrix.c
new file mode 100644
index 0000000..134c244
--- /dev/null
+++ b/post/matrix_matrix.c
@@ -0,0 +1,300 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <time.h>
+
+/**
+ * gcc -o mm -g -march=armv7-a -mfpu=neon-vfpv4 matrix_matrix.c
+ * or cross compile
+ * arm-linux-gnueabihf-gcc -o mm -g -march=armv7-a -mfpu=neon-vfpv4 matrix_matrix.c
+ *
+ * test with a = np.arange(4*n*m + 4*m).reshape(4*n, 4*m)
+ *           a.dot(a.T)
+ *
+ * this file contains all the elements to show the usage of the
+ * SIMD architecture on armv7l
+ */
+
+void simple_transpose( float *, int, int, float *);
+static inline void my_sgemm_4x4(float *, float *, float *,
+                                int, int, int );
+
+/**
+ * helper routine (only for documentation/demonstration)
+ *
+ * transposes some matrix matrix_a (size n_rows, n_cols)
+ * into matrix output
+ */
+
+void simple_transpose( float *matrix_a, int n_rows_a, int n_cols_a,
+                       float *output ) {
+    for (int ra = 0; ra < n_rows_a; ra++)
+        for (int ca = 0; ca < n_cols_a; ca++ )
+            output[n_rows_a*ca+ra] = matrix_a[n_cols_a*ra+ca];
+            //output[ca][ra] = matrix_a[ra][ca];
+    return;
+}
+
+/**
+ * Kernel function containing the optimization: the
+ * multiplication of 4 x 4 patches of large row-major
+ * matrices matrix_a and matrix_b
+ *
+ * arguments:
+ * - (float *) matrix_a: pointer to a 4 x 4 patch of the first matrix,
+ * - (float *) matrix_b: pointer to a 4 x 4 patch of the second (transposed) matrix,
+ * - (float *) output: 4 x 4 result (return value)
+ * - (int) off_a,b,o: offset in bytes between the last element of one row
+ *   and the first element of the next row of matrix_a, matrix_b and output
+ *
+ * details documented here ![Using_SIMD](/home/eduard/work/wikiwhat/doc/Using_SIMD.md)
+ */
+
+static inline void my_sgemm_4x4(float *matrix_a, float *matrix_b,
+                                float *output,
+                                int off_a, int off_b, int off_o ) {
+    /** \code */
+    asm volatile (
+        "# Start manual code \n\t"
+        "# Matrix Multiplication \n\n\t"
+        /* Macro section */
+        ".macro mul_col_f32 res_q, col0_d, col1_d\n\t"
+        "vmla.f32 \\res_q, q8, \\col0_d[0]\n\t"
+        "vmla.f32 \\res_q, q9, \\col0_d[1]\n\t"
+        "vmla.f32 \\res_q, q10, \\col1_d[0]\n\t"
+        "vmla.f32 \\res_q, q11, \\col1_d[1]\n\t"
+        ".endm\n\n\t"
+        /* end macro section */
+        /* load current state of output -> q12 - q15 */
+        "vld1.32 {q12}, [%6]!\n\t"
+        "add %6, %6, %5\n\t" /* add the offset up to the start of the next row */
+        "vld1.32 {q13}, [%6]!\n\t"
+        "add %6, %6, %5\n\t"
+        "vld1.32 {q14}, [%6]!\n\t"
+        "add %6, %6, %5\n\t"
+        "vld1.32 {q15}, [%6]!\n\t"
+        /* load matrix_b (transposed!) -> q8 - q11 */
+        "vld1.32 {q8}, [%2]!\n\t"
+        "add %2, %2, %4\n\t"
+        "vld1.32 {q9}, [%2]!\n\t"
+        "add %2, %2, %4\n\t"
+        "vld1.32 {q10}, [%2]!\n\t"
+        "add %2, %2, %4\n\t"
+        "vld1.32 {q11}, [%2]!\n\t"
+        /* load matrix_a -> q0 - q3 */
+        "vld1.32 {q0}, [%1]!\n\t"
+        "add %1, %1, %3\n\t"
+        "vld1.32 {q1}, [%1]!\n\t"
+        "add %1, %1, %3\n\t"
+        "vld1.32 {q2}, [%1]!\n\t"
+        "add %1, %1, %3\n\t"
+        "vld1.32 {q3}, [%1]!\n\t"
+        /* end load registers
+         * start doing the actual matrix multiplication as defined in macro */
+        "mul_col_f32 q12, d0, d1\n\t"
+        "mul_col_f32 q13, d2, d3\n\t"
+        "mul_col_f32 q14, d4, d5\n\t"
+        "mul_col_f32 q15, d6, d7\n\n\t"
+        /* store the result (q12 - q15) into output */
+        "vst1.32 {q12}, [%0]!\n\t"
+        "add %0, %0, %5\n\t"
+        "vst1.32 {q13}, [%0]!\n\t"
+        "add %0, %0, %5\n\t"
+        "vst1.32 {q14}, [%0]!\n\t"
+        "add %0, %0, %5\n\t"
+        "vst1.32 {q15}, [%0]!\n\t"
+        /* argument section of the inline assembler */
+        :"+r"(output)
+        :"r"(&matrix_a[0]),"r"(&matrix_b[0]),"r"(off_a),"r"(off_b),
+         "r"(off_o),"r"(&output[0]));
+    /** \endcode */
+    return;
+}
+
+/**
+ * matrix-matrix multiplication of some matrix_a and some matrix_b
+ * (works only for sizes 4*n x 4*m)
+ * the layout is row-major and output = a x b.transpose()
+ *
+ * the multiplication is based on the patch-wise standard multiplication
+ * algorithm, each patch of size 4 x 4
+ */
+
+void my_sgemm(float *matrix_a, int n_rows_a, int n_cols_a,
+              float *matrix_b, int n_rows_b, int n_cols_b,
+              float *output ) {
+    int offset_a = 4*(n_cols_a-4);
+    int offset_b = 4*(n_cols_b-4);
+    for(int i=0;i
+   "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## ?\n",
+    "\n",
+    "No idea about the following: $y = tanh(M)$"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "y = np.tanh(M1.dot(M1.T))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 24,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "(1000, 1000)"
+      ]
+     },
+     "execution_count": 24,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "y.shape"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 51,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "16.0"
+      ]
+     },
+     "execution_count": 51,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "((4*128)**3)*16/((128)**3*64)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# the important part"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.6.9"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
diff --git a/post/stammtisch20200506.md b/post/stammtisch20200506.md
new file mode 100644
index 0000000..37ac45b
--- /dev/null
+++ b/post/stammtisch20200506.md
@@ -0,0 +1,27 @@
+---
+date: "2020-05-06"
+tags: ["chaostreff", "veranstaltung", "protokoll"]
+title: "Kurzprotokoll 06.05.2020"
+---
+
+Chaostreff-relevant points:
+
+1. The Chaostreff now takes place regularly on the following dates:
+   1. 1st Wednesday of the month, from 19:42, location: Jitsi
+   2. 3rd Tuesday of the month, from 19:42, location: Jitsi
+
+   Jitsi link: https://meet.jit.si/chaoslb
+
+Some topics, with and without links:
+
+1. There was an exciting talk by fritzthekit/eduard on SIMD and blazingly fast matrix calculations on a
+   Raspberry Pi. Links to the [Jupyter notebook](motivation_matrix_mult.ipynb), the [C code](matrix_matrix.c) and the [slides as Markdown](CCC_Why_what_is_SIMD.md)
+2. Go for small places: [TinyGo](https://tinygo.org/)
+3. Zoom's privacy record was a topic (it is now somehow half GDPR-compliant after all)
+4. Iridium browser: [link](https://iridiumbrowser.de/)
+5. ...
+
+The next Stammtisch is on 19.05.2020, with a talk by Harvey.
diff --git a/post/stammtisch20200519.md b/post/stammtisch20200519.md
new file mode 100644
index 0000000..8fdc2c0
--- /dev/null
+++ b/post/stammtisch20200519.md
@@ -0,0 +1,24 @@
+---
+date: "2020-05-19"
+tags: ["chaostreff", "veranstaltung"]
+title: "Kurzprotokoll 19.05.2020"
+---
+
+Chaostreff-relevant points:
+1. A joint evening with the CCCS would be nice as well. We are starting to look for a date.
+2. Where can the repository of the Chaostreff website be migrated to? Under discussion are the CCCS Gitea instance and Codeberg.
+
+Some topics, with and without links:
+1. Harvey's talk was well attended with 10 people and, as always, very informative and entertaining: [slides](CompLB-Kramski-Home-Recording-20200519_v01.2.pdf)
+2. GoTo-Meeting is used by data-protection officers as a video-conferencing system: [link](https://www.gotomeeting.com/de-de)
+3. Gitea as a git web frontend: [link](https://gitea.io/en-us/)
+4. Codeberg as a GitHub alternative: [link](https://codeberg.org/)
+5. gotop as a fancy top/htop alternative; it is no longer maintained, but there is a fork and a continuation in Rust: [link](https://github.com/cjbassi/gotop)
+6. Online voting at the Greens: which technology do parties, e.g. the Greens, use for online votes?
+7. Multigeiger: [link](https://github.com/ecocurious2/MultiGeiger)
+8. ...
+
+The next Chaostreff takes place on 03.06.2020.
+
+Stay healthy and hack things!
diff --git a/vas.md b/vas.md
index 89fc4b7..f6d36f2 100644
--- a/vas.md
+++ b/vas.md
@@ -12,5 +12,5 @@ ansehen.
 1. 01.04.2020 - Harvey/Heinz: _Shell-Health-Check_ or _How I learned to love the shell (again)_ [slides](https://complb.de/stammtisch20200401/CompLB-Kramski-Shell-Check-20200325_v02.pdf)
 2. 22.04.2020 - ampoff/Steffen: _The National Software Reference Library_ [slides](https://complb.de/stammtisch20200422/nsrl_short.pdf)
 3. 06.05.2020 - fritzthekit - _SIMD and neural networks_
-4. 20.05.2020 - Harvey/Heinz: _Swimming against the stream with feature-film recordings_
-5. t.b.a. - ampoff/Steffen: _Ansible and AWX_ or _findACatchyTitle_
+4. 19.05.2020 - Harvey/Heinz: _Swimming against the stream with feature-film recordings_ [slides](https://complb.de/stammtisch20200519/CompLB-Kramski-Home-Recording-20200519_v01.2.pdf)
+5. 16.06.2020 - ampoff/Steffen: _Ansible and AWX_ or _Automating boring tasks, more time for everything else_