# Requirements Specification

This document contains the Requirements Specification for the Libre RISC-V
micro-architectural design.  It shall meet the target of 5-6 32-bit GFLOPs,
150 M-Pixels/sec, 30 Million Triangles/sec, and minimum video decode
capability of 720p @ 30fps to a 1920x1080 framebuffer, in under 2.5 watts
at an 800 MHz clock rate.  Exceeding this target is acceptable if the
power budget is not exceeded.  Exceeding this target "just because we can"
is also acceptable, as long as it does not disrupt meeting the minimum
performance and power requirements.

# General Architectural Design Principle

The general design base is to utilise an augmented and enhanced variant
of the original CDC 6600 scoreboard system.  It is not well-known that
the 6600 includes operand forwarding and register renaming.  Precise
exceptions, precise in-order commit, branch speculation, "nameless"
registers (results detected that need not be written because they have
been overwritten by another instruction), predication and vectorisation
will all be added by overloading write hazards.

An overview of the design is as follows:

* 3D and Video primitives (operations) will only be added as strictly
  necessary to achieve the minimum power and performance target.
* Identified so far is a 4xFP32 ARGB Quad to 1xINT32 ARGB pixel
  conversion opcode (part of the Vulkan API).  It will write directly
  to a separate "tile buffer" (SRAM), not to the integer register
  file.  The instruction will be scalar and will be inherently and
  automatically parallelised by SV, just like all other scalar opcodes.
* xBitManip opcodes will be required to deal with VPU workloads.
* The register files will be stratified into 4-way 2R1W banks,
  with *separate* and distinct byte-level write-enable lines on all four
  bytes of all four banks.
* 6600-style scoreboards will be augmented with "shadow" wires
  and write hazard capability on exceptions, branch speculation,
  LD/ST and predication.
* Each "shadow" capability of each type will be provided by a separate
  Function Unit.  For example, if it is to be possible to roll
  ahead through two speculative branches, then two **separate**
  Branch-speculative Function Units will be required: each will
  hold its own separate and distinct "shadow" (Go-Die wire) and
  write-hazard over the instructions that depend on the branch.
* Likewise for predication: there will be separate Function Units
  waiting for each predication mask register, each of which shall place
  a "hold" on the Function Units that depend on it until the register
  used as a predicate mask has been read and decoded.
  Bits in the mask that are "zero" will result in "Go-Die" signals being
  sent to the Function Units previously (speculatively) allocated for that
  (now cancelled) element operation.  Bits that are "1" will cancel
  their Write-Hazard and allow the Function Unit to proceed with that
  element's operation.
* The 6600 "Q-Table" that records, for each register, the last Function
  Unit (in instruction issue order) that is to write its result to that
  register, shall be augmented with "history" capability that
  assists in "rollback" of "nameless" registers, should an exception
  or interrupt occur. "History" is simply a (short) queue (stack)
  that preserves, in instruction-issue order, a record of the previous
  Function Unit(s) that targeted each register as a destination.
* Function Units will have both src and destination Reservation
  Stations (latches) in order to buffer incoming and outgoing data.
  This is to make best use of (limited) inter-Function-Unit bus bandwidth.
* Crossbar Routing from the Register File will be on the **source**
  registers **only**: Function Units will route **directly** to,
  and be hard-wired to, one of four register banks.
* Additional "Operand Forwarding" crossbar(s) will be added that
  **bypass** the register file entirely, to be used exclusively
  for registers that have specifically been identified as "nameless".
* Function Units will be the *front-end* to **shared** pipelined
  concurrent ALUs.  The input src registers will come from the
  latches associated with the Function Unit, and the ALU will put the
  result **back** into the destination latch associated with that
  **same** Function Unit.
* **Pairs** of 32-bit Function Units will handle 64-bit operations,
  with the 32-bit src Reservation Stations (latches) "teaming up"
  to store 64-bit src register values, and likewise the 32-bit
  destination latches for the same (paired) Function Units.
* 32-bit Function Units will handle 8 and 16 bit operations in
  cases where batches of operations may be (easily, conveniently)
  allocated to a 32-bit-wide SIMD-style (predicated) ALU.
* Additional 8-bit Function Units (in groups of 4) will handle
  8-bit operations as well as pair up to handle 16-bit operations
  in cases where neither 8 nor 16 bit operations can be (conveniently,
  easily) allocated to parallel (SIMD-like) ALUs.  This is to handle
  corner-cases and to avoid jamming up the 32-bit Function Units with
  single-byte operations (resulting in only 25% utilisation).
* Allocation of an operation to a 32-bit ALU will block the
  corresponding 8/16-bit Function Unit(s) for that register, and vice-versa.
  8/16-bit operations will however **not** block the remaining
  (unallocated) bytes of the same register from being utilised.
* Spectre timing attacks will be dealt with by ensuring that there
  are no side-channels between cores in the usual ways (no shared
  DIV unit, correct use of L1 cache).  In addition, there will be
  a "Speculation Fence" instruction (or hint) that will
  reset the internal state to a known quiescent state.  This involves
  cancellation of all speculation, cancellation of "nameless" registers,
  committing outstanding register writes to the register file, and
  cancelling all Function Units waiting for read hazards.  This shall
  be done automatically on any exceptions or interrupts.

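The predication item in the list above can be illustrated with a minimal Python sketch.  It only models the decision made once the predicate mask has been read: which speculatively allocated element operations receive "Go-Die" and which have their Write-Hazard cancelled.  The function name `resolve_predicate` and the convention that element *n* corresponds to bit *n* of the mask are assumptions made for illustration, not part of this specification.

```python
def resolve_predicate(mask: int, num_elements: int):
    """Split element indices into (go_die, proceed) lists.

    Assumption: bit n of `mask` controls element n.
    A 0 bit sends "Go-Die" to that element's Function Unit;
    a 1 bit cancels its Write-Hazard so the operation may proceed.
    """
    go_die, proceed = [], []
    for element in range(num_elements):
        if (mask >> element) & 1:
            proceed.append(element)
        else:
            go_die.append(element)
    return go_die, proceed


# Four speculatively allocated elements, mask 0b1010:
go_die, proceed = resolve_predicate(0b1010, 4)
print("Go-Die:", go_die, "proceed:", proceed)
```

In hardware both outcomes would be signalled in parallel on separate wires; the loop here is purely for readability.
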
# Register File

There shall be two 127-entry 64-bit register files: one for floating-point,
the other for integer operations.  Each shall have byte-level write-enable
lines, and shall be divided into 4-way 2R1W banks that are split into
odd-even register numbers and further split into hi-32 and lo-32 bits.

In this way, 2 simultaneous 64-bit operations may write to the register
file (as long as the destinations have odd and even numbers), or 4
simultaneous 32-bit operations likewise.  Byte-level write-enable is
provided so that writes may be performed down to the 16-bit and 8-bit level
without requiring additional reads.

Additionally, if a read is requested for a register that is currently
being written, the written value shall be "passed through" on the same
cycle, such that the register file may effectively be used as an
"Operand Forwarding" Channel.

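The banking scheme can be sketched as a small behavioural model in Python.  This is a sketch under stated assumptions rather than the actual design: the bank numbering in `bank_of`, the use of an 8-bit `byte_en` mask (bits 0-3 for lo-32, bits 4-7 for hi-32), and all class and function names are hypothetical.  It shows when several writes can share a cycle (one write port per bank) and how byte-level write-enable avoids a read-modify-write.

```python
NUM_REGS = 127  # entries per register file, as stated in the text


def bank_of(regnum: int, hi: bool) -> int:
    """Assumed numbering: odd/even register selects bit 1, hi/lo-32 bit 0."""
    return ((regnum & 1) << 1) | int(hi)


def banks_touched(regnum: int, byte_en: int) -> set:
    """Banks written by one operation, given its 8-bit byte-enable mask."""
    banks = set()
    if byte_en & 0x0F:
        banks.add(bank_of(regnum, hi=False))
    if byte_en & 0xF0:
        banks.add(bank_of(regnum, hi=True))
    return banks


def can_write_same_cycle(writes) -> bool:
    """True if no bank is written twice (each bank has a single write port)."""
    used = set()
    for regnum, byte_en in writes:
        banks = banks_touched(regnum, byte_en)
        if used & banks:
            return False
        used |= banks
    return True


class RegFile:
    """64-bit registers stored as 8 little-endian bytes each."""

    def __init__(self):
        self.regs = [bytearray(8) for _ in range(NUM_REGS)]

    def write(self, regnum: int, value: int, byte_en: int) -> None:
        # Only bytes whose enable bit is set are updated: no prior read needed.
        data = value.to_bytes(8, "little")
        for i in range(8):
            if (byte_en >> i) & 1:
                self.regs[regnum][i] = data[i]

    def read(self, regnum: int) -> int:
        return int.from_bytes(self.regs[regnum], "little")


rf = RegFile()
rf.write(5, 0x1122334455667788, 0xFF)  # full 64-bit write
rf.write(5, 0xAA, 0x01)                # 8-bit write to the lowest byte only
print(hex(rf.read(5)))
print(can_write_same_cycle([(2, 0xFF), (3, 0xFF)]))  # even + odd destinations
print(can_write_same_cycle([(2, 0xFF), (4, 0xFF)]))  # two even destinations
```

The same-cycle "pass through" of a value being written is not modelled here.
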
# Function Units

## Commit Phase (instruction order preservation)

# 6600 Scoreboards

6600 Scoreboards are usually viewed as incomplete: the inability to
perform register renaming and to provide precise exceptions are two of
the perceived flaws.  These flaws do not in fact exist, although showing
this takes some explaining.

## Q-Table (FU to Register Lookup)

The Q-Table is a lookup table that records (in binary form in the
original 6600, however unary bit-wise form - N Function Unit bits
and M register bits - can be recommended) the last Function Unit
that, in instruction issue order, is to write to any given
register.

However, to support "nameless" registers, the Q-Table shall support
*multiple* (historical) entries, recording the history of the
*previous* Function Unit(s) that were to write to each register.
When historic entries exist (non-empty), the following shall occur:

* All Function Units with historic entries shall **not** commit
  their values to the register file, even if they are free to do so.
* All Function Units with historic entries shall hold a "write hazard"
  against their dependencies that are waiting for that "nameless" result.
* When a dependent Function Unit has cleared all possibility of an
  Exception being raised, it shall **drop** the write hazard on the
  "nameless" source.
* If a "nameless" Function Unit needs to generate an Exception, it
  does so in the standard way (see "Exceptions"), **however**,
  in doing so it will also result in a **roll back** of the Q-Table for
  **all and any** cancelled Function Units, to *previous* (historic)
  Q-Table values for the relevant destination registers.  Once
  rolled back, the Function Unit must store its result in the register
  file, prior to permitting the Exception to proceed.
* Likewise, if a dependent Function Unit has to generate an exception,
  and its source Function Units are "nameless", the "nameless"
  Function Units must also "roll back", store their results, and
  finally permit the Exception to trigger.
* Likewise, all other "nameless" results must also be "rolled back",
  except that, unlike the Function Units triggering the exception, they
  may roll back to the newest "nameless" historical Q-Table entry
  (if they have not already been cancelled by the FU triggering the
  exception).

Bear in mind that exceptions (like all operations that are ready to
commit) may only occur in-order (following a FU-to-FU "link" bit),
and may only occur if the Function Unit is entirely free of write hazards.

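As a rough illustration of the "history" idea, the following Python sketch keeps, for each register, a short stack of Function Unit numbers in issue order.  The newest entry is the conventional Q-Table value; older entries are the "nameless" writers.  The class `QTable` and its method names are hypothetical, and the sketch covers only the bookkeeping, not the write-hazard or commit signalling described above.

```python
from collections import defaultdict


class QTable:
    """Per-register history of destination Function Units, newest last."""

    def __init__(self):
        self.history = defaultdict(list)

    def issue(self, reg: int, fu: int) -> None:
        """Record `fu` as the latest (in issue order) writer of `reg`."""
        self.history[reg].append(fu)

    def latest_writer(self, reg: int):
        """The classic 6600 Q-Table value: last FU that is to write `reg`."""
        entries = self.history[reg]
        return entries[-1] if entries else None

    def nameless(self, reg: int) -> list:
        """FUs whose result for `reg` has been superseded by a later writer."""
        return self.history[reg][:-1]

    def rollback(self, cancelled: set) -> None:
        """Remove cancelled FUs so each register reverts to a previous writer."""
        for reg in self.history:
            self.history[reg] = [
                fu for fu in self.history[reg] if fu not in cancelled
            ]


q = QTable()
q.issue(reg=3, fu=0)
q.issue(reg=3, fu=1)
q.issue(reg=3, fu=2)
print(q.latest_writer(3), q.nameless(3))

# FU 2 is cancelled (for example by an exception raised earlier in issue order):
q.rollback({2})
print(q.latest_writer(3), q.nameless(3))
```

After the rollback, FU 1 is once again the register's current writer and would have to store its result in the register file before the exception proceeds.
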
## FU-to-FU Dependency Matrix

The Function-Unit to Function-Unit Dependency Matrix expresses the
read and write hazards - dependencies - between Function Units.

## Branch Speculation

Branch speculation is done by preventing instructions from becoming
"writeable" until the Branch Unit knows the outcome of the branch.
This is done with the addition of "Shadow" lines, as shown below:

This image is reproduced with kind permission, Copyright (C) Mitch Alsup
[[!img shadow_issue_flipflops.png]]

Note that there are multiple "Shadow" signals, coming not just from Branch
Speculation but also from predication and exception shadows.

On a "Failed" signal, the instruction is told to "Go Die".  This is
passed to the Computation Unit as well.  When all "Success" signals
are raised, the instruction is permitted to enter "Writeable".

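The "Success" / "Failed" behaviour can be sketched as a small state machine in Python.  This is a behavioural illustration only: the class `ShadowedInstruction`, the state names, and the string labels for the shadow sources are hypothetical, and the real design uses flip-flops and wires as in the diagram rather than sets.

```python
from enum import Enum


class State(Enum):
    SHADOWED = "shadowed"    # issued, but prevented from writing
    WRITEABLE = "writeable"  # every shadow has been released
    DEAD = "dead"            # told to "Go Die"


class ShadowedInstruction:
    def __init__(self, shadows):
        # Shadow sources (branch, predicate, exception) still holding us.
        self.pending = set(shadows)
        self.state = State.SHADOWED if self.pending else State.WRITEABLE

    def success(self, source) -> None:
        """A shadow source resolved in our favour: release that shadow."""
        if self.state is State.DEAD:
            return
        self.pending.discard(source)
        if not self.pending:
            self.state = State.WRITEABLE

    def fail(self, source) -> None:
        """A shadow source that still holds us failed: "Go Die"."""
        if source in self.pending:
            self.state = State.DEAD


insn = ShadowedInstruction({"branch0", "predicate1"})
insn.success("branch0")
print(insn.state)        # still shadowed by the predicate
insn.success("predicate1")
print(insn.state)        # all "Success" signals raised

other = ShadowedInstruction({"branch0"})
other.fail("branch0")
print(other.state)
```
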
## Exceptions

Exceptions shall be handled by each instruction that *may* throw an
exception having and holding a "Shadow" wire over all dependent
Function Units, in exactly the same way as Branch Speculation.
Likewise, dependent instructions are prohibited from
entering the "Writeable" state.

Dependent downstream instructions, if the exception is thrown,
shall have the "Failed" bit ASSERTED (by the Function Unit throwing
the exception) such that the downstream dependent instruction is told
to "Go Die".

If the point is reached at which the instruction knows that the
Exception cannot possibly occur, the "Success" signal is raised
instead, thus cancelling the "hold" over dependent downstream
instructions - again in exactly the same way as Branch Speculation
"Success".

Exceptions may **only** be actually raised if they are at the front of
the instruction queue, i.e. if they are free of write hazards.
See the section on the "Function Unit Commit" phase, as the Function Units
have a "link bit" that preserves the instruction issue order, which
must also be respected.

# Spectre-style timing mitigation

Spectre-style timing attacks are defined by one instruction issue
affecting the completion time of past **and future** instructions.
The key insight for mitigating such attacks is to note that
arbitrary untrusted instructions must not be permitted to affect
trusted instructions.  Consequently, as long as there is a firebreak
(a "Fence") between trusted and untrusted, timing attacks can be
held off.

Two instructions ("hints") shall therefore be added:

* One that stops speculation, multi-issue and any out-of-order
  resource allocation for a minimum of 16 instructions.
* Another that **cancels** all speculation and reservations,
  cancels "nameless" registers, and waits for and ensures that all
  outstanding instructions have completed and committed, before
  permitting the processor to continue further.

The latter shall occur unconditionally, without requiring a special
instruction to be called, on ECALL as well as on all exceptions and
interrupts.

# ALU design

There is a separate pipelined ALU for fdiv/fsqrt/frsqrt/idiv/irem
that is possibly shared between 2 or 4 cores.

The main ALUs are each a unified ALU for i8-i64/f16-f64 where the
ALU is split into lanes with separate instructions for each 32-bit half.
So, the multiplier should be capable of 64-bit fmadd, 2x32-bit fmadd,
4x16-bit fmadd, 1x32-bit fmadd + 2x16-bit fmadd (in either order), and all
(8/16/32/64) sizes of integer mul/mulhsu/mulh/mulhu in 2 groups of 32 bits.
We can implement fmul using fmadd with 0 (make sure that we get the right
sign bit for 0 for all rounding modes).

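The warning about the sign bit of zero can be demonstrated in Python, which uses IEEE 754 doubles in the default round-to-nearest mode.  The sketch stands in for the final addition of an fmadd by adding a zero to an already-computed product; the helper name `is_negative_zero` is hypothetical.  It shows that adding +0 to a product of -0 loses the sign, whereas adding -0 preserves it.  Python does not expose other rounding modes, so the round-toward-negative case is not exercised here; under IEEE 754 an exact-zero sum of opposite-signed operands is -0 in that mode, which is why the choice of addend has to take the rounding mode into account.

```python
import math


def is_negative_zero(x: float) -> bool:
    """True only for -0.0 (copysign exposes the sign bit of a zero)."""
    return x == 0.0 and math.copysign(1.0, x) < 0


neg_product = -0.0 * 1.0  # an exact product of -0
pos_product = 0.0 * 1.0   # an exact product of +0

# Adding +0: the sign of a negative-zero product is lost.
print(is_negative_zero(neg_product + 0.0))

# Adding -0: both signs survive in round-to-nearest.
print(is_negative_zero(neg_product + -0.0))
print(is_negative_zero(pos_product + -0.0))
```
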
# Rowhammer Mitigation

* <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-March/000699.html>
* <https://arxiv.org/pdf/1903.00446.pdf>