\section{Summary}
The proposed \acs{SV} is a Scalable Vector Specification for a hardware for-loop \textbf{that
ONLY uses scalar instructions}.

\begin{itemize}
\item The Power \acs{ISA} v3.1 Specification is not altered in any way.
v3.1 Code-compatibility is guaranteed.
\item Does not require sacrificing 32-bit Major Opcodes.
\item Does not require adding duplicate instructions
(popcnt, popcntw, popcntd, vpopcntb, vpopcnth, vpopcntw, vpopcntd).
\item Fully abstracted: does not create Micro-architectural dependencies
(no fixed "Lane" size); one binary works across all existing
\textit{and future} implementations.
\item Specifically designed to be easily implemented
on top of an existing Micro-architecture (especially
Superscalar Out-of-Order Multi-issue) without
disruptive full architectural redesigns.
\item Divided into Compliancy Levels to suit differing needs.
\item At the highest Compliancy Level requires only five instructions
(SVE2 requires approximately 9,000, \acs{AVX-512} around 10,000, and \acs{RVV}
around 300).
\item Predication, an often-requested feature, is added cleanly
(without modifying the v3.1 Power ISA).
\item In-register arbitrary-sized Matrix Multiply is achieved in three
instructions (without adding any v3.1 Power ISA instructions).
\item Full \acs{DCT} and \acs{FFT} RADIX2 Triple-loops are achieved with
a dramatically reduced instruction count, and power consumption is expected
to be greatly reduced. Such capability is normally found only in high-end
\acs{VLIW} \acs{DSP}s (TI MSP, Qualcomm Hexagon).
\item Fail-First Load/Store allows a Vectorised high-performance
strncpy to be implemented in around 14
instructions (hand-optimised \acs{VSX} assembler is 240).
\item The inner loop of MP3 is implemented in under 100 instructions
(gcc produces 450 for the same function on POWER9).
\end{itemize}

All areas investigated so far have consistently shown reductions in executable
size, which as outlined in \cite{SIMD_HARM} brings an indirect reduction in
power consumption due to reduced I-Cache/TLB pressure and to the Issue stage
remaining idle for longer periods.

Simple-V has been specifically and carefully crafted to respect
the Power ISA's Supercomputing pedigree.

\begin{figure}[hb]
\centering
\includegraphics[width=0.6\linewidth]{power_pipelines.png}
\caption{Showing how SV fits in between Decode and Issue}
\label{fig:power_pipelines}
\end{figure}

\pagebreak

\subsection{What is SIMD?}

\acs{SIMD} is a way of partitioning existing \acs{CPU}
registers of 64-bit length into smaller 8-, 16-, or 32-bit pieces
\cite{SIMD_HARM}\cite{SIMD_HPC}.
These partitions can then be operated on simultaneously, with the initial
values and the results being stored as entire 64-bit registers (\acs{SWAR}).
The SIMD instruction opcode
includes the data width and the operation to perform.
\par
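The partitioning idea above can be illustrated with a small software sketch
(a hypothetical emulation of the \acs{SWAR} technique, not part of any ISA):
four 16-bit lanes packed into one 64-bit value are added with a single 64-bit
operation, with masking to stop carries rippling across lane boundaries.

```python
# Software sketch of SWAR: add four 16-bit lanes held in one 64-bit word.
# Hypothetical illustration only; real SIMD hardware does this in one ALU op.

def pack16(lanes):
    """Pack four 16-bit values into a single 64-bit integer."""
    word = 0
    for i, v in enumerate(lanes):
        word |= (v & 0xFFFF) << (16 * i)
    return word

def unpack16(word):
    """Split a 64-bit integer back into four 16-bit values."""
    return [(word >> (16 * i)) & 0xFFFF for i in range(4)]

def swar_add16(a, b):
    """Lane-wise 16-bit add: clear the top bit of each lane so no carry
    can ripple into the neighbouring lane, then fix the top bits up."""
    H = 0x8000_8000_8000_8000          # top bit of every 16-bit lane
    low = (a & ~H) + (b & ~H)          # add without inter-lane carry
    return low ^ ((a ^ b) & H)         # restore top bits (mod 2^16 per lane)

a = pack16([1, 2, 3, 0xFFFF])
b = pack16([10, 20, 30, 1])
print(unpack16(swar_add16(a, b)))      # [11, 22, 33, 0]
```

Note that the final lane wraps around modulo $2^{16}$ without disturbing its
neighbour, which is exactly the lane isolation a SIMD ALU provides in hardware.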

\begin{figure}[hb]
\centering
\includegraphics[width=0.6\linewidth]{simd_axb}
\caption{SIMD multiplication}
\label{fig:simd_axb}
\end{figure}

This method can have a huge advantage for rapid processing of
vector-type data (image/video, physics simulations, cryptography,
etc.) \cite{SIMD_WASM},
and thus on paper is very attractive compared to
scalar-only instructions.
\textit{As long as the data width fits the workload, everything is fine}.
\par

\subsection{Shortfalls of SIMD}
SIMD registers are of a fixed length, and thus to achieve greater
performance CPU architects typically increase the width of registers
(to 128-, 256-, 512-bit etc.) to provide more partitions.\par Additionally,
binary compatibility is an important feature, and thus each doubling
of SIMD register width also expands the instruction set. The number of
instructions quickly balloons: IA-32, for example, has expanded
from 80 to around 1,400 instructions since
the 1970s \cite{SIMD_HARM}.\par

Five-digit Opcode proliferation (10,000 instructions) is overwhelming.
The following are just some of the reasons why SIMD is unsustainable as
the number of instructions increases:
\begin{itemize}
\item Hardware design, ASIC routing etc.
\item Compiler design
\item Documentation of the ISA
\item Manual coding and optimisation
\item Time to support the platform
\item Compliance Suite development and testing
\item Protracted Variable-Length encoding (x86) severely compromises
Multi-issue decoding
\end{itemize}
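The combinatorial nature of the explosion can be sketched numerically (the
counts below are purely illustrative assumptions, not figures taken from any
vendor manual): because every operation tends to be duplicated per element
width, per register-width generation, and per type variant, opcode count
grows multiplicatively rather than additively.

```python
# Illustrative sketch of SIMD opcode proliferation (hypothetical counts):
# each base operation is typically duplicated for every element width,
# every register-width generation, and every type variant.

base_operations = 50            # add, sub, mul, min, max, shifts, ...
element_widths  = 4             # 8-, 16-, 32-, 64-bit lanes
register_widths = 3             # e.g. 128-, 256-, 512-bit generations
type_variants   = 3             # signed, unsigned, saturating

total = base_operations * element_widths * register_widths * type_variants
print(total)                    # 1800 distinct opcodes from only 50 operations
```

A Scalable Vector ISA collapses the element-width and register-width axes
into runtime context, leaving only the base operations.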

\subsection{Scalable Vector Architectures}
An older alternative for exploiting data parallelism exists: vector
architectures. Vector CPUs collect operands from main memory and
store them in large, sequential vector registers.\par

A simple vector processor might operate on one element at a time;
however, as the element operations are usually independent,
a processor could be made to compute all of the vector's
elements simultaneously, taking advantage of multiple pipelines.\par

Typically, today's vector processors can execute two, four, or eight
64-bit elements per clock cycle \cite{SIMD_HARM}.
Such processors can also deal in hardware with fringe cases where the vector
length is not a multiple of the number of elements. The element data width
is variable (just as in SIMD), but it is the \textit{number} of elements being
variable under the control of a "setvl" instruction that makes Vector ISAs
"Scalable".
\par
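The "setvl" mechanism can be modelled as a strip-mining loop (a software
sketch of the classic Cray/RVV-style pattern; the names and semantics here
are simplifying assumptions, not a literal rendering of any one ISA): the
hardware reports how many elements it will process each iteration, so the
fringe case where the total is not a multiple of the hardware width is
handled automatically.

```python
# Software model of a "setvl" strip-mining loop (simplified assumption of
# the Cray/RVV-style mechanism; not a literal rendering of any one ISA).

MAX_VL = 8                          # elements the hardware can do per pass

def setvl(remaining):
    """Hardware grants min(remaining, MAX_VL) elements this iteration."""
    return min(remaining, MAX_VL)

def vector_add(a, b):
    result = []
    n = len(a)
    i = 0
    while i < n:
        vl = setvl(n - i)           # fringe case handled automatically:
        for e in range(vl):         # the final pass may have vl < MAX_VL
            result.append(a[i + e] + b[i + e])
        i += vl
    return result

# 10 elements with MAX_VL = 8: one full pass of 8, one fringe pass of 2
print(vector_add(list(range(10)), [1] * 10))   # [1, 2, 3, ..., 10]
```

The same binary runs unmodified on hardware with a different MAX\_VL, which
is the essence of Scalability.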

\acs{RVV} supports a VL of up to $2^{16}$ or $65536$ bits,
which can fit 1024 64-bit words \cite{riscv-v-spec}.
The Cray-1 had 8 Vector Registers with up to 64 elements (64-bit each).
An early draft of RVV supported overlaying the Vector Registers onto the
Floating Point registers, similar to \acs{MMX}.

\begin{figure}[hb]
\centering
\includegraphics[width=0.6\linewidth]{cray_vector_regs}
\caption{Cray Vector registers: 8 registers, 64 elements each}
\label{fig:cray_vector_regs}
\end{figure}

Simple-V's "Vector" Registers are specifically designed to fit on top of
the Scalar (GPR, FPR) register files, which are extended from the default
of 32 to 128 entries in the high-end Compliancy Levels. This is a primary
reason why Simple-V can be added on top of an existing Scalar ISA, and
\textit{in particular} why there is no need to add Vector Registers or
Vector instructions.

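A minimal software model may help picture this (the layout below is an
illustrative assumption, not the normative SVP64 element-mapping rules): a
vector of N elements simply occupies consecutive entries of the ordinary
scalar register file, and narrower element widths pack several elements into
each 64-bit entry.

```python
# Illustrative model of vectors living in the scalar register file
# (layout simplified; not the normative SVP64 element-mapping rules).

GPRS = 128                     # high-end Compliancy Level register count
regfile = bytearray(GPRS * 8)  # 128 x 64-bit scalar registers, as bytes

def write_vector(start_reg, values, width):
    """Store `values` of `width` bytes each, starting at scalar register
    `start_reg`; narrower widths pack several elements per 64-bit entry."""
    base = start_reg * 8
    for i, v in enumerate(values):
        regfile[base + i * width : base + (i + 1) * width] = \
            v.to_bytes(width, "little")

def read_vector(start_reg, count, width):
    """Read `count` elements of `width` bytes from the scalar file."""
    base = start_reg * 8
    return [int.from_bytes(regfile[base + i * width : base + (i + 1) * width],
                           "little") for i in range(count)]

# Eight 16-bit elements occupy just two 64-bit scalar registers (r4, r5)
write_vector(4, [10, 20, 30, 40, 50, 60, 70, 80], width=2)
print(read_vector(4, 8, width=2))   # [10, 20, 30, 40, 50, 60, 70, 80]
```

Because the storage is just the scalar register file, scalar and vector code
can freely exchange data with no transfer instructions.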
\begin{figure}[hb]
\centering
\includegraphics[width=0.6\linewidth]{svp64_regs.png}
\caption{Three instructions, same vector length, different element widths}
\label{fig:svp64_regs}
\end{figure}

\subsection{Simple Vectorisation}
\acs{SV} is a Scalable Vector ISA designed for hybrid workloads (CPU, GPU,
VPU, 3D). It includes features normally found only on Cray-style Supercomputers
(Cray-1, NEC SX-Aurora) and on GPUs, yet keeps to a strict uniform RISC
paradigm, leveraging a scalar ISA by means of "Prefixing".
\textbf{No dedicated vector opcodes exist in SV, at all}.
SVP64 uses 25\% of the Power ISA v3.1 64-bit Prefix space (EXT001) to create
the SV Vectorisation Context for the 32-bit Scalar Suffix.

\vspace{10pt}
Main design principles:
\begin{itemize}
\item Introduced by implementing on top of the existing Power ISA
\item Effectively a \textbf{hardware for-loop}: pauses the main PC,
issues multiple scalar operations
\item Preserves underlying scalar execution dependencies as if
the for-loop had been expanded into actual scalar instructions
("preserving Program Order")
\item Augments existing instructions by adding "tags", providing
Vectorisation "context" rather than adding new opcodes
\item Does not modify or deviate from the underlying scalar
Power ISA unless there is a significant performance boost or other
advantage in the vector space
\item Aimed at Supercomputing: avoids creating significant
\textit{sequential dependency hazards}, allowing \textbf{high
performance multi-issue superscalar microarchitectures} to be
leveraged
\end{itemize}
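The hardware for-loop principle above can be modelled in a few lines (a
behavioural sketch under simplifying assumptions: register numbers step by
one per element and predication is omitted; the real SVP64 element-stepping
rules are richer):

```python
# Behavioural sketch of the SV "hardware for-loop" (simplified: register
# numbers step by one per element; real SVP64 element-stepping is richer).

regs = [0] * 128                       # scalar GPR file (high-end level)

def scalar_add(rt, ra, rb):
    """The ordinary, unmodified scalar instruction."""
    regs[rt] = (regs[ra] + regs[rb]) & 0xFFFF_FFFF_FFFF_FFFF

def sv_add(rt, ra, rb, vl):
    """Prefixed version: the PC pauses while the hardware issues VL
    scalar adds in Program Order, exactly as if they had been written
    out as VL separate scalar instructions."""
    for i in range(vl):
        scalar_add(rt + i, ra + i, rb + i)

regs[8:12]  = [1, 2, 3, 4]             # vector A in r8-r11
regs[16:20] = [10, 20, 30, 40]         # vector B in r16-r19
sv_add(24, 8, 16, vl=4)                # one prefixed add, VL=4
print(regs[24:28])                     # [11, 22, 33, 44]
```

Because each element operation is an ordinary scalar add, all existing
scalar dependency-tracking hardware applies unchanged.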

Advantages include:
\begin{itemize}
\item Easy to create a first (and sometimes only) implementation
as a literal for-loop in hardware, simulators, and compilers.
\item Obliterates SIMD opcode proliferation
($O(N^6)$) as well as the need for dedicated Vectorisation
ISAs. No more separate vector instructions.
\item Reduces maintenance overhead (no separate Vector instructions).
Adding any new Scalar instruction
\textit{automatically adds a Vectorised version of the same}.
\item Easier for compilers, coders, and documentation.
\end{itemize}