# OpenPOWER Summit 2021

Links

* Full Schedule: <https://cfp.openpower.foundation/summit2021/schedule/>
* <https://cfp.openpower.foundation/summit2021/cfp>
* <https://cfp.openpower.foundation/summit2021/talk/review/CA7XEWT9ZKMJ3D7NRXXEK9SYPXBAHPCD>
* Youtube Conference playlist <https://www.youtube.com/playlist?list=PLEqfbaomKgQrYjscb-2cQt_S1v_xbg9Cq>
* <https://events.linuxfoundation.org/openpower-summit-north-america/>
* <https://cfp.openpower.foundation/summit2021/talk/NWMQTE/>
* 2021-10-28, 13:00–13:45, RoomB <https://zoom.us/j/99048202175>
* Slides <https://ftp.libre-soc.org/openpower_2021.pdf>
* Talk Preview <https://www.youtube.com/watch?v=NpmbUfgiuFE>
* SVP64 REMAP <https://libre-soc.org/openpower/sv/remap/>
* <https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/fastdctlee.py;hb=HEAD>
* <https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/remapyield.py;hb=HEAD>
* <https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/remap_dct_yield.py;hb=HEAD>
* <https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/remap_fft_yield.py;hb=HEAD>
* <https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_matrix.py;hb=HEAD>

# Notes from the talk

```
those are all "in-place" (i.e. you use the register file to complete the entire operation, no LD/STs needed in the middle) 
it's a ridiculously-long list! https://en.wikipedia.org/wiki/Discrete_cosine_transform#Applications 
https://en.wikipedia.org/wiki/Discrete_Fourier_transform_(general) 
the NTT wiki page is here https://en.wikipedia.org/wiki/Discrete_Fourier_transform_(general)#Number-theoretic_transform 
like, not having 4-wide SIMD and only using 3 of the SIMD lanes 
https://arxiv.org/abs/2002.10143# 
fascinating paper 
that's down to not having to do branches 
because the zero-overhead loop doesn't even need a branch instruction 
no predication in VSX, either. 
it's a rather unfortunate dichotomy, here 
which according to the "strict" definition of "Custom Extension" would be in OPCODE 22 

https://libre-soc.org/openpower/sv/overview/ 
fascinatingly this was exactly what Peter Hsu (architect of the MIPS R8000) came up with back around 1994-5! 
unfortunately, the only reason they didn't go ahead with it was because they hadn't worked out Multi-Issue Out-of-Order Execution at the time 
so couldn't fully exploit the idea 
each REMAP can actually be applied to more than one register if required 
which is used (shown later) in the 5-operand (draft) instructions 
you _could_ do this but you have to have a massive number of Reservation Stations 
(an In-Order system would be hosed) 
so with this trick you get multiple pipelined FMACs outstanding 
the hope is that by the time the inner for-loop has completed, you can do another (partial) FMAC on the same register 
i meant, you rotate (not transpose) :) 
the matrix data is in order 0 1 2 3 
but REMAP can access it in 0 2 1 3 
or invert the X-dimension 
1 2 0 3 

and that is basically the values of the matrix "rotated" :) 
https://libre-soc.org/openpower/sv/remap/ 
Aspex was bought out by Ericsson, so the only information available on it now is papers by Argy Krikelis 
and the other co-designers 
https://www.researchgate.net/profile/Argy-Krikelis 
here's the source for that matrix unit test 
https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_matrix.py;hb=HEAD 
to experiment with Matrix "Schedules" this is a simple stand-alone program 
https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/remapyield.py;hb=HEAD 

so you can have data appear to be re-ordered *in-place* (register numbers): 
r0 r3 r6 
r1 r4 r7 
r2 r5 r8 
we cut down the MP3 FFMPEG main loop from 450 instructions down to *only 100*. 
it was stunning, totally unexpected 
ohh dear. FFT. this was hellishly complicated :) took about 2 months to do both DCT and FFT 
that 5-operand draft instruction is crucial to do DCT and FFT in-place 
if you don't want to do in-place, you can get away with the "normal" approach of using a temp scalar variable (and 3-4 instructions) 
but, that kiinda defeats the object of the exercise :) 
https://www.ti.com/lit/an/sprabb6b/sprabb6b.pdf 
TMS320 FFT 
standard library for the nexagon 
https://developer.qualcomm.com/forum/qdn-forums/software/hexagon-dsp-sdk/audio-capiappi/33010 
definite "wow" on the number of VLIW uOps for Hexagon

https://developer.qualcomm.com/download/hexagon/hexagon-dsp-architecture.pdf 
https://www.nayuki.io/res/fast-discrete-cosine-transform-algorithms/lee-new-algo-discrete-cosine-transform.pdf 
the original paper by Byeong Gi Lee. 1984! 
here's the stand-alone program which can generate the triple-loop schedules 
https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/remap_fft_yield.py;h=422c2187867bba75c5a33d395e74d2d1081199d1;hb=0b7eb1cc2b6f1b820a54e668724f1e00967e85f3 
whoops i meant "add it to X[0]" :) 
https://www.nayuki.io/page/free-small-fft-in-multiple-languages 
https://developer.qualcomm.com/download/hexagon/hexagon-dsp-architecture.pdf 
https://www.nayuki.io/res/fast-discrete-cosine-transform-algorithms/lee-new-algo-discrete-cosine-transform.pdf 
the original paper by Byeong Gi Lee. 1984! 
here's the stand-alone program which can generate the triple-loop schedules 
https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/remap_fft_yield.py;h=422c2187867bba75c5a33d395e74d2d1081199d1;hb=0b7eb1cc2b6f1b820a54e668724f1e00967e85f3 
whoops i meant "add it to X[0]" :) 
https://www.nayuki.io/page/free-small-fft-in-multiple-languages 
really cool set of implementations of FFT 
this was mind-bending :) 
of course, if you are not doing in-place, it doesn't matter 
but when you don't do in-place, you end up using *double the number of registers* which is how a lot of implementations of FFT work. sigh 
that puts pressure on the regfile, which is a critical resource in 3D and Video applications 
power consumption ends up going through the roof if you have to "spill" 
the full unit test(s) for SVP64 FFT remap are here https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_fft.py;hb=HEAD 
this is where the bit about mapping any one of the 3 REMAPs to *five* possible target registers is needed 
which is why svremap takes so many operands :) 
https://opencores.org/projects/hwlu 
it's in VHDL. 
the paper on ZOLC is fascinating https://www.researchgate.net/publication/224647569_A_portable_specification_of_zero-overhead_looping_control_hardware_applied_to_embedded_processors 
and, like the Snitch core, has absolutely stunning reductions in instruction count (and power consumption) 
reverse-order 
0123 
7654 
for DCT 
where FFT is 
0123 
4567 
i was amazed by this elegant algorithm 
from looking at the numbers 
here's the source for a stand-alone program to create DCT schedules 
https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/remap_dct_yield.py;hb=HEAD 
i use it to auto-generate the SVG DCT diagrams used in this talk :) 
https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/dct_butterfly_svg.py;hb=HEAD 
https://www.youtube.com/watch?v=fn2KJvWyBKg 
trying to explain it without a slide, sigh :) 
it's in the video 
here's the unit test for draft svp64 dct 
https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_dct.py;hb=HEAD 
i meant, 25% :) 
they added a transpose-matrix instruction to turn 3x4 into 4x3 
it might have been NEON that ARM added that to, rather than MALI 

Andrey:
First time I heard about REP instruction (my knowledge of x86 is like a drop in the ocean), so perhaps a link might be useful:
https://www.aldeid.com/wiki/X86-assembly/Instructions/rep 

```

# Abstract

*Draft SVP64 in-place Matrix Multiply and FFT / DCT for OpenPOWER*

Advanced Cray-style Vectors are being developed for the Power ISA, as a
Draft Extension for submission to the new OpenPOWER ISA Working Group,
named SVP64.  Whilst in-place Matrix Multiply was planned for a much
later advanced version of SVP64, an investigation into putting FFMPEG's
MP3 CODEC inner loop into Vectorised Assembler resulted in such a large
drop in code size (over 4x reduction) that it warranted priority
investigation.

Discrete Cosine Transform (DCT), Discrete Fourier Transform (DFT)
and Number-Theory Transform (NTT) form the basis of too numerous
high-priority algorithms to count.  Normal SIMD Processors and even
normal Vector Processors have a hard time dealing with them: inspecting
FFMPEG's source code reveals that heavily optimised inline assembler (no
loops, just hundreds to thousands of lines of assembler) is not uncommon.

The focus of this NLnet-sponsored research is therefore to create enhancements
to SVP64 to be able to cover DFT, DCT, NTT and Matrix-Multiply entirely
in-place.  In-place is crucially important for many applications (3D, Video)
to keep power consumption down by avoiding register spill as well as L1/L2
cache strip-mining.  General-purpose RADIX-2 DCT and complex DFT will be
shown and explained, as well as the in-place Matrix Multiply which does
not require transposing or register spill for any sized Matrices
(including non-power-two) up to 128 FMACs.  The basics of SVP64, covered
in the Overview [1], will also be briefly described.

[1] https://libre-soc.org/openpower/sv/overview/