conferences/openpower2021.mdwn

   1 # OpenPOWER Summit 2021
   2
   3 Links
   4
   5 * Full Schedule: <https://cfp.openpower.foundation/summit2021/schedule/>
   6 * Call for Papers <https://cfp.openpower.foundation/summit2021/cfp>
   7 * Talk review <https://cfp.openpower.foundation/summit2021/talk/review/CA7XEWT9ZKMJ3D7NRXXEK9SYPXBAHPCD>
   8 * YouTube conference playlist <https://www.youtube.com/playlist?list=PLEqfbaomKgQrYjscb-2cQt_S1v_xbg9Cq>
   9 * Event page <https://events.linuxfoundation.org/openpower-summit-north-america/>
  10 * Talk page <https://cfp.openpower.foundation/summit2021/talk/NWMQTE/>
  11 * 2021-10-28, 13:00–13:45, RoomB <https://zoom.us/j/99048202175>
  12 * Slides <https://ftp.libre-soc.org/openpower_2021.pdf>
  13 * Talk Preview <https://www.youtube.com/watch?v=NpmbUfgiuFE>
  14 * SVP64 REMAP <https://libre-soc.org/openpower/sv/remap/>
  15 * fastdctlee.py <https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/fastdctlee.py;hb=HEAD>
  16 * Matrix schedule generator, remapyield.py <https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/remapyield.py;hb=HEAD>
  17 * DCT schedule generator, remap_dct_yield.py <https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/remap_dct_yield.py;hb=HEAD>
  18 * FFT schedule generator, remap_fft_yield.py <https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/remap_fft_yield.py;hb=HEAD>
  19 * Matrix unit test, test_caller_svp64_matrix.py <https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_matrix.py;hb=HEAD>
  20
  21 # Notes from the talk
  22
  23 ```
  24 those are all "in-place" (i.e. you use the register file to complete the entire operation, no LD/STs needed in the middle)
  25 it's a ridiculously-long list! https://en.wikipedia.org/wiki/Discrete_cosine_transform#Applications
  26 https://en.wikipedia.org/wiki/Discrete_Fourier_transform_(general)
  27 the NTT wiki page is here https://en.wikipedia.org/wiki/Discrete_Fourier_transform_(general)#Number-theoretic_transform
  28 like, not having 4-wide SIMD and only using 3 of the SIMD lanes
  29 https://arxiv.org/abs/2002.10143#
  30 fascinating paper
  31 that's down to not having to do branches
  32 because the zero-overhead loop doesn't even need a branch instruction
  33 no predication in VSX, either.
  34 it's a rather unfortunate dichotomy, here
  35 which according to the "strict" definition of "Custom Extension" would be in OPCODE 22
  36
  37 https://libre-soc.org/openpower/sv/overview/
  38 fascinatingly this was exactly what Peter Hsu (architect of the MIPS R8000) came up with back around 1994-5!
  39 unfortunately, the only reason they didn't go ahead with it was because they hadn't worked out Multi-Issue Out-of-Order Execution at the time
  40 so couldn't fully exploit the idea
  41 each REMAP can actually be applied to more than one register if required
  42 which is used (shown later) in the 5-operand (draft) instructions
  43 you _could_ do this but you have to have a massive number of Reservation Stations
  44 (an In-Order system would be hosed)
  45 so with this trick you get multiple pipelined FMACs outstanding
  46 the hope is that by the time the inner for-loop has completed, you can do another (partial) FMAC on the same register
  47 i meant, you rotate (not transpose) :)
  48 the matrix data is in order 0 1 2 3
  49 but REMAP can access it in 0 2 1 3
  50 or invert the X-dimension
  51 1 2 0 3
  52
  53 and that is basically the values of the matrix "rotated" :)
  54 https://libre-soc.org/openpower/sv/remap/
  55 Aspex was bought out by Ericsson, so the only information available on it now is papers by Argy Krikelis
  56 and the other co-designers
  57 https://www.researchgate.net/profile/Argy-Krikelis
  58 here's the source for that matrix unit test
  59 https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_matrix.py;hb=HEAD
  60 to experiment with Matrix "Schedules" this is a simple stand-alone program
  61 https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/remapyield.py;hb=HEAD
  62
  63 so you can have data appear to be re-ordered *in-place* (register numbers):
  64 r0 r3 r6
  65 r1 r4 r7
  66 r2 r5 r8
  67 we cut down the MP3 FFMPEG main loop from 450 instructions down to *only 100*.
  68 it was stunning, totally unexpected
  69 ohh dear. FFT. this was hellishly complicated :) took about 2 months to do both DCT and FFT
  70 that 5-operand draft instruction is crucial to do DCT and FFT in-place
  71 if you don't want to do in-place, you can get away with the "normal" approach of using a temp scalar variable (and 3-4 instructions)
  72 but, that kinda defeats the object of the exercise :)
  73 https://www.ti.com/lit/an/sprabb6b/sprabb6b.pdf
  74 TMS320 FFT
  75 standard library for the Hexagon
  76 https://developer.qualcomm.com/forum/qdn-forums/software/hexagon-dsp-sdk/audio-capiappi/33010
  77 definite "wow" on the number of VLIW uOps for Hexagon
  78
  79 https://developer.qualcomm.com/download/hexagon/hexagon-dsp-architecture.pdf
  80 https://www.nayuki.io/res/fast-discrete-cosine-transform-algorithms/lee-new-algo-discrete-cosine-transform.pdf
  81 the original paper by Byeong Gi Lee. 1984!
  82 here's the stand-alone program which can generate the triple-loop schedules
  83 https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/remap_fft_yield.py;h=422c2187867bba75c5a33d395e74d2d1081199d1;hb=0b7eb1cc2b6f1b820a54e668724f1e00967e85f3
  84 whoops i meant "add it to X[0]" :)
  85 https://www.nayuki.io/page/free-small-fft-in-multiple-languages
  93 really cool set of implementations of FFT
  94 this was mind-bending :)
  95 of course, if you are not doing in-place, it doesn't matter
  96 but when you don't do in-place, you end up using *double the number of registers* which is how a lot of implementations of FFT work. sigh
  97 that puts pressure on the regfile, which is a critical resource in 3D and Video applications
  98 power consumption ends up going through the roof if you have to "spill"
  99 the full unit test(s) for SVP64 FFT remap are here https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_fft.py;hb=HEAD
 100 this is where the bit about mapping any one of the 3 REMAPs to *five* possible target registers is needed
 101 which is why svremap takes so many operands :)
 102 https://opencores.org/projects/hwlu
 103 it's in VHDL.
 104 the paper on ZOLC is fascinating https://www.researchgate.net/publication/224647569_A_portable_specification_of_zero-overhead_looping_control_hardware_applied_to_embedded_processors
 105 and, like the Snitch core, has absolutely stunning reductions in instruction count (and power consumption)
 106 reverse-order
 107 0123
 108 7654
 109 for DCT
 110 where FFT is
 111 0123
 112 4567
 113 i was amazed by this elegant algorithm
 114 from looking at the numbers
 115 here's the source for a stand-alone program to create DCT schedules
 116 https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/remap_dct_yield.py;hb=HEAD
 117 i use it to auto-generate the SVG DCT diagrams used in this talk :)
 118 https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/dct_butterfly_svg.py;hb=HEAD
 119 https://www.youtube.com/watch?v=fn2KJvWyBKg
 120 trying to explain it without a slide, sigh :)
 121 it's in the video
 122 here's the unit test for draft svp64 dct
 123 https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_dct.py;hb=HEAD
 124 i meant, 25% :)
 125 they added a transpose-matrix instruction to turn 3x4 into 4x3
 126 it might have been NEON that ARM added that to, rather than MALI
 127
 128 Andrey:
 129 First time I heard about REP instruction (my knowledge of x86 is like a drop in the ocean), so perhaps a link might be useful:
 130 https://www.aldeid.com/wiki/X86-assembly/Instructions/rep
 131
 132 ```
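The re-ordering mentioned in the notes (data stored as 0 1 2 3 but accessed as 0 2 1 3, and the r0 r3 r6 / r1 r4 r7 / r2 r5 r8 register view) can be sketched as a small index generator. This is only an illustration of the idea, not the code in remapyield.py: the function name `transposed_order` and its parameters are hypothetical, and a row-major layout is assumed.

```python
def transposed_order(xdim, ydim):
    """Yield flat offsets that walk a row-major block column by column.

    Hypothetical helper for illustration: the data is assumed to be stored
    with xdim elements per row and ydim rows, and is never moved. Only the
    order in which the offsets are visited changes.
    """
    for x in range(xdim):
        for y in range(ydim):
            yield y * xdim + x


# 2x2 block stored as 0 1 2 3, visited as 0 2 1 3
print(list(transposed_order(2, 2)))
# 3x3 block: offsets 0 3 6, then 1 4 7, then 2 5 8
print(list(transposed_order(3, 3)))
```

The point of the sketch is that nothing is copied: the "transposed" view exists only in the sequence of offsets, which is the sense in which the notes call the re-ordering in-place.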
 133
 134 # Abstract
 135
 136 *Draft SVP64 in-place Matrix Multiply and FFT / DCT for OpenPOWER*
 137
 138 Advanced Cray-style Vectors, named SVP64, are being developed for the
 139 Power ISA as a Draft Extension for submission to the new OpenPOWER ISA
 140 Working Group.  Whilst in-place Matrix Multiply was planned for a much
 141 later advanced version of SVP64, an investigation into putting FFMPEG's
 142 MP3 CODEC inner loop into Vectorised Assembler resulted in such a large
 143 drop in code size (over 4x reduction) that it warranted priority
 144 investigation.
 145
 146 Discrete Cosine Transform (DCT), Discrete Fourier Transform (DFT)
 147 and Number-Theoretic Transform (NTT) form the basis of too many
 148 high-priority algorithms to count.  Normal SIMD Processors and even
 149 normal Vector Processors have a hard time dealing with them: inspecting
 150 FFMPEG's source code reveals that heavily optimised inline assembler (no
 151 loops, just hundreds to thousands of lines of assembler) is not uncommon.
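One reason these transforms are awkward for fixed-width SIMD is the pairing pattern of the butterflies. The notes from the talk give the first-stage pairing as 0123 against 4567 for FFT and 0123 against 7654 (reverse order) for DCT. A minimal sketch of just that pairing follows, assuming a radix-2 decimation-in-frequency FFT and Lee's DCT; the function name `first_stage_pairs` is hypothetical and no arithmetic is performed.

```python
def first_stage_pairs(n, kind):
    """Return the index pairs combined in the first radix-2 stage.

    Hypothetical helper for illustration only. "fft" pairs element i with
    i + n/2, "dct" pairs element i with n - 1 - i (reverse order).
    """
    if n < 2 or n % 2:
        raise ValueError("n must be even and at least 2")
    half = n // 2
    if kind == "fft":
        return [(i, i + half) for i in range(half)]
    if kind == "dct":
        return [(i, n - 1 - i) for i in range(half)]
    raise ValueError(f"unknown kind: {kind!r}")


print(first_stage_pairs(8, "fft"))  # second column runs 4 5 6 7
print(first_stage_pairs(8, "dct"))  # second column runs 7 6 5 4
```

Later stages repeat the same kind of pairing on progressively smaller halves, so the stride changes at every stage, which is hard to express with contiguous SIMD loads.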
 152
 153 The focus of this NLnet-sponsored research is therefore to create enhancements
 154 to SVP64 to be able to cover DFT, DCT, NTT and Matrix-Multiply entirely
 155 in-place.  In-place is crucially important for many applications (3D, Video)
 156 to keep power consumption down by avoiding register spill as well as L1/L2
 157 cache strip-mining.  General-purpose RADIX-2 DCT and complex DFT will be
 158 shown and explained, as well as the in-place Matrix Multiply, which
 159 requires neither transposing nor register spill for Matrices of any
 160 size (including non-power-of-two) up to 128 FMACs.  The basics of SVP64,
 161 covered in the Overview [1], will also be briefly described.
 162
 163 [1] https://libre-soc.org/openpower/sv/overview/
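As a rough illustration of the Matrix Multiply idea in the abstract, the triple loop can be flattened into a schedule of (result, A, B) offsets that drives a single stream of multiply-accumulates over flat arrays, with no transposed copy of either input. This is a sketch under stated assumptions, not the SVP64 REMAP implementation: the names `matmul_schedule` and `matmul_flat` are hypothetical, plain Python lists stand in for the register file, and the loop order (inner dimension outermost) is chosen here only so that consecutive multiply-accumulates target different result elements.

```python
def matmul_schedule(rows_a, inner, cols_b):
    """Yield (c, a, b) flat row-major offsets for C[i][j] += A[i][k] * B[k][j].

    Hypothetical helper: k is the outermost loop, so back-to-back entries
    never accumulate into the same result element.
    """
    for k in range(inner):
        for i in range(rows_a):
            for j in range(cols_b):
                yield (i * cols_b + j, i * inner + k, k * cols_b + j)


def matmul_flat(a, b, rows_a, inner, cols_b):
    """Multiply two flat row-major matrices using only the schedule."""
    c = [0] * (rows_a * cols_b)
    for ci, ai, bi in matmul_schedule(rows_a, inner, cols_b):
        c[ci] += a[ai] * b[bi]  # one multiply-accumulate per schedule entry
    return c


a = [1, 2, 3, 4, 5, 6]     # 2x3, non-power-of-two on purpose
b = [7, 8, 9, 10, 11, 12]  # 3x2
print(matmul_flat(a, b, 2, 3, 2))
```

The 2x3 by 3x2 example uses 12 multiply-accumulates and reads both inputs where they already sit, which is the property the abstract describes as not requiring transposing.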