1 # Requirements Specification
2
3 This document contains the Requirements Specification for the Libre RISC-V
4 micro-architectural design. It shall meet the target of 5-6 32-bit GFLOPs,
5 150 M-Pixels/sec, 30 Million Triangles/sec, and minimum video decode
6 capability of 720p @ 30fps to a 1920x1080 framebuffer, in under 2.5 watts
7 at an 800 MHz clock rate. Exceeding this target is acceptable if the
8 power budget is not exceeded. Exceeding this target "just because we can"
9 is also acceptable, as long as it does not disrupt meeting the minimum
10 performance and power requirements.
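As a rough sanity check, the per-second targets above can be converted into per-clock-cycle figures at the stated 800 MHz clock. This is only an arithmetic sketch: the helper name `per_cycle` is hypothetical, and "720p" is assumed to mean 1280x720 pixels per frame.

```python
CLOCK_HZ = 800e6  # 800 MHz clock rate from the target above


def per_cycle(rate_per_second, clock_hz=CLOCK_HZ):
    """Convert a per-second throughput target into a per-clock-cycle figure."""
    return rate_per_second / clock_hz


flops_low = per_cycle(5e9)     # lower GFLOPs target
flops_high = per_cycle(6e9)    # upper GFLOPs target
pixels = per_cycle(150e6)      # 150 M-Pixels/sec
triangles = per_cycle(30e6)    # 30 Million Triangles/sec

# Assumption: "720p" is 1280x720 pixels per frame, decoded at 30 frames/sec.
decode_pixels_per_second = 1280 * 720 * 30

print(flops_low, flops_high, pixels, triangles, decode_pixels_per_second)
```

Under these assumptions the FLOPs target works out to between 6.25 and 7.5 32-bit floating-point operations per clock cycle.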
11
12 # General Architectural Design Principle
13
14 The general design base is to utilise an augmented and enhanced variant
15 of the original CDC 6600 scoreboard system. It is not well-known that
16 the 6600 includes operand forwarding and register renaming. Precise
17 exceptions, precise in-order commit, branch speculation, "nameless"
18 registers (results detected that need not be written because they have
19 been overwritten by another instruction), predication and vectorisation
20 will all be added by overloading write hazards.
21
22 An overview of the design is as follows:
23
24 * 3D and Video primitives (operations) will only be added as strictly
25 necessary to achieve the minimum power and performance target.
26 * Identified so far is a 4xFP32 ARGB Quad to 1xINT32 ARGB pixel
27 conversion opcode (part of the Vulkan API). It will write directly
28 to a separate "tile buffer" (SRAM), not to the integer register
29 file. The instruction will be scalar and will be inherently and
30 automatically parallelised by SV, just like all other scalar opcodes.
31 * xBitManip opcodes will be required to deal with VPU workloads.
32 * The register files will be stratified into 4-way 2R1W banks,
33 with *separate* and distinct byte-level write-enable lines on all four
34 bytes of all four banks.
35 * 6600-style scoreboards will be augmented with "shadow" wires
36 and write hazard capability on exceptions, branch speculation,
37 LD/ST and predication.
38 * Each "shadow" capability of each type will be provided by a separate
39 Function Unit. For example if there is to exist the possibility of rolling
40 ahead through two speculative branches, then two **separate**
41 Branch-speculative Function Units will be required: each will
42 hold its own separate and distinct "shadow" (Go-Die wire) and
43 write-hazard over the instructions that depend on that branch.
44 * Likewise for predication, which shall place a "hold" on
45 the Function Units that depend on it until the register used
46 as a predicate mask has been read and decoded, there will be
47 separate Function Units waiting for each predication mask register.
48 Bits in the mask that are "zero" will result in "Go-Die" signals being
49 sent to the Function Units previously (speculatively) allocated for that
50 (now cancelled) element operation. Bits that are "1" will cancel
51 their Write-Hazard and allow the Function Unit to proceed with that
52 element's operation.
53 * The 6600 "Q-Table" that records, for each register, the last Function
54 Unit (in instruction issue order) that is to write its result to that
55 register, shall be augmented with "history" capability that aids and
56 assists in "rollback" of "nameless" registers, should an exception
57 or interrupt occur. "History" is simply a (short) queue (stack)
58 that preserves, in instruction-issue order, a record of the previous
59 Function Unit(s) that targeted each register as a destination.
60 * Function Units will have both src and destination Reservation
61 Stations (latches) in order to buffer incoming and outgoing data.
62 This is to make best use of (limited) inter-Function-Unit bus bandwidth.
63 * Crossbar Routing from the Register File will be on the **source**
64 registers **only**: Function Units will route **directly** to,
65 and be hard-wired to, one of the four register banks.
66 * Additional "Operand Forwarding" crossbar(s) will be added that
67 **bypass** the register file entirely, to be used exclusively
68 for registers that have specifically been identified as "nameless".
69 * Function Units will be the *front-end* to **shared** pipelined
70 concurrent ALUs. The input src registers will come from the
71 latches associated with the Function Unit, and will put the
72 result **back** into the destination latch associated with that
73 **same** Function Unit.
74 * **Pairs** of 32-bit Function Units will handle 64-bit operations,
75 with the 32-bit src Reservation Stations (latches) "teaming up"
76 to store 64-bit src register values, and likewise the 32-bit
77 destination latches for the same (paired) Function Units.
78 * 32-bit Function Units will handle 8 and 16 bit operations in
79 cases where batches of operations may be (easily, conveniently)
80 allocated to a 32-bit-wide SIMD-style (predicated) ALU.
81 * Additional 8-bit Function Units (in groups of 4) will handle
82 8-bit operations as well as pair up to handle 16-bit operations
83 in cases where neither 8 nor 16 bit operations can be (conveniently,
84 easily) allocated to parallel (SIMD-like) ALUs. This is to handle
85 corner-cases and to avoid jamming up the 32-bit Function Units with single-byte
86 operations (resulting in only 25% utilisation).
87 * Allocation of an operation to a 32-bit ALU will block the
88 corresponding 8/16-bit Function Unit(s) for that register, and vice-versa.
89 8/16-bit operations will however **not** block the remaining
90 (unallocated) bytes of the same register from being utilised.
91 * Spectre timing attacks will be dealt with by ensuring that there
92 are no side-channels between cores in the usual ways (no shared
93 DIV unit, correct use of L1 cache). In addition, there will be
94 a "Speculation Fence" instruction (or hint) that will
95 reset the internal state to a known quiescent state. This involves
96 cancellation of all speculation, cancellation of "nameless" registers,
97 committing outstanding register writes to the register file, and
98 cancelling all Function Units waiting for read hazards. This is to
99 be done automatically on any exception or interrupt.
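The predication rule in the overview (zero bits send "Go-Die", one bits release the write hazard) can be sketched as follows. This is an illustrative software model only, not the hardware design: the function name `resolve_predicate` and the Function Unit names are hypothetical, and it assumes that bit *i* of the mask governs element *i*.

```python
def resolve_predicate(mask, allocated_fus):
    """Split speculatively allocated Function Units by predicate bit.

    mask          -- integer predicate mask; bit i governs element i (assumption)
    allocated_fus -- list of FU names; index i holds element i's operation
    Returns (proceed, go_die): FUs whose write hazard is released, and FUs
    that receive a "Go-Die" signal.
    """
    proceed, go_die = [], []
    for element, fu in enumerate(allocated_fus):
        if (mask >> element) & 1:
            proceed.append(fu)   # bit is 1: cancel the write hazard, carry on
        else:
            go_die.append(fu)    # bit is 0: element operation is cancelled
    return proceed, go_die


proceed, go_die = resolve_predicate(0b0101, ["FU0", "FU1", "FU2", "FU3"])
print(proceed, go_die)
```

Until the mask register has been read and decoded, neither list is known, which is why the dependent Function Units are held.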
100
101 # Register File
102
103 There shall be two 127-entry 64-bit register files: one for floating-point,
104 the other for integer operations. Each shall have byte-level write-enable
105 lines, and shall be divided into 4-way 2R1W banks that are split into
106 odd-even register numbers and further split into hi-32 and lo-32 bits.
107
108 In this way, 2 simultaneous 64-bit operations may write to the register
109 file (as long as the destinations have odd and even numbers), or 4
110 simultaneous 32-bit operations likewise. Byte-level write-enable is
111 so that writes may be performed down to the 16-bit and 8-bit level
112 without requiring additional reads.
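A minimal sketch of how a write might map onto the four banks and their byte-level write-enable lines is given below. The names `bank_writes` and `conflict` are hypothetical, and the model assumes that a bank's single write port is fully occupied by any write to it in a given cycle, regardless of which bytes are enabled.

```python
def bank_writes(regnum, byte_offset, width_bytes):
    """Return {(parity, half): write_enable_mask} for one register write.

    parity -- regnum & 1 (0 = even-numbered banks, 1 = odd-numbered banks)
    half   -- 0 for the lo-32 bank, 1 for the hi-32 bank
    mask   -- 4-bit byte-level write-enable for that 32-bit bank
    """
    if byte_offset < 0 or width_bytes < 1 or byte_offset + width_bytes > 8:
        raise ValueError("write must fit within one 64-bit register")
    # One enable bit per byte of the 64-bit register.
    enables = ((1 << width_bytes) - 1) << byte_offset
    writes = {}
    for half in (0, 1):
        mask = (enables >> (4 * half)) & 0xF
        if mask:
            writes[(regnum & 1, half)] = mask
    return writes


def conflict(a, b):
    """Assumption: two writes collide if they need the same bank in one cycle."""
    return bool(set(a) & set(b))


even64 = bank_writes(6, 0, 8)   # 64-bit write to an even register
odd64 = bank_writes(7, 0, 8)    # 64-bit write to an odd register
byte5 = bank_writes(3, 5, 1)    # single byte 5 of an odd register
print(even64, odd64, byte5, conflict(even64, odd64))
```

The two 64-bit writes use disjoint banks, so they may proceed in the same cycle, while the single-byte write touches only one enable line of one bank and needs no read-modify-write.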
113
114 Additionally, if a read is requested for a register that is currently
115 being written, the written value shall be "passed through" on the same
116 cycle, such that the register file may effectively be used as an
117 "Operand Forwarding" Channel.
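The "pass through" behaviour can be modelled in a few lines. This is a behavioural sketch under simplifying assumptions (one write per cycle, no banking); the class name `PassThroughRegFile` and its `cycle` method are hypothetical.

```python
class PassThroughRegFile:
    """Toy register file in which a same-cycle write is visible to reads."""

    def __init__(self, nregs):
        self.regs = [0] * nregs

    def cycle(self, reads, write=None):
        """Model one clock cycle.

        reads -- list of register numbers to read
        write -- (regnum, value) tuple, or None if nothing is written
        Returns the list of values seen by the reads in this cycle.
        """
        results = []
        for regnum in reads:
            if write is not None and write[0] == regnum:
                results.append(write[1])          # value is "passed through"
            else:
                results.append(self.regs[regnum])
        if write is not None:
            self.regs[write[0]] = write[1]        # write lands at end of cycle
        return results


rf = PassThroughRegFile(8)
same_cycle = rf.cycle([3], write=(3, 42))   # read sees the value being written
next_cycle = rf.cycle([3])                  # value is now stored
print(same_cycle, next_cycle)
```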
118
119 # Function Units
120
121 ## Commit Phase (instruction order preservation)
122
123 # 6600 Scoreboards
124
125 6600 Scoreboards are usually viewed as incomplete: an inability to perform
126 register renaming and a lack of precise exceptions are two of the perceived
127 flaws. These flaws do not exist; however, explaining why takes some effort.
128
129 ## Q-Table (FU to Register Lookup)
130
131 The Q-Table is a lookup table that records the last Function Unit
132 that, in instruction issue order, is to write to any given register.
133 The original 6600 stored this in binary form; a unary bit-wise
134 form (N Function Unit bits by M register bits) is recommended
135 instead.
136
137 However, to support "nameless" registers, the Q-Table shall support
138 *multiple* (historical) entries, recording the history of the
139 *previous* Function Unit that was to write to each register.
140 When historic entries exist (non-empty), the following shall occur:
141
142 * All Function Units with historic entries shall **not** commit
143 their values to the register file, even if they are free to do so.
144 * All Function Units with historic entries shall hold a "write hazard"
145 against their dependencies that are waiting for that "nameless" result.
146 * When a dependent Function Unit has cleared all possibility of an
147 Exception being raised, it shall **drop** the write hazard on the
148 "nameless" source.
149 * If a "nameless" Function Unit needs to generate an Exception, it
150 does so in the standard way (see "Exceptions"), **however**,
151 in doing so it will also result in a **roll back** of the Q-Table for
152 **all and any** cancelled Function Units, to *previous* (historic)
153 Q-Table values for the relevant destination registers. Once
154 rolled back, the Function Unit must store its result in the register
155 file, prior to permitting the Exception to proceed.
156 * Likewise, if a dependent Function Unit has to generate an exception,
157 and its source Function Units are "nameless", the "nameless"
158 Function Units must also "roll back", store their results, and
159 finally permit the Exception to trigger.
160 * Likewise, all other "nameless" results must also be "rolled back",
161 except unlike the Function Units triggering the exception they may
162 roll back to the newest "nameless" historical Q-Table entry
163 (if they have not already been cancelled by the FU triggering the
164 exception).
165
166 Bear in mind that exceptions (like all operations that are ready to
167 commit) may only occur in-order (following a FU-to-FU "link" bit),
168 and may only occur if the Function Unit is entirely free of write hazards.
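The "history" idea can be illustrated with a small software model in which each register holds a short stack of pending writers. This is a sketch only: the class `QTable` and its method names are hypothetical, it ignores the unary bit-wise encoding, and it does not model the write hazards or the FU-to-FU "link" bit.

```python
class QTable:
    """Per-register record of pending writers, oldest first, newest last."""

    def __init__(self, nregs):
        self.entries = [[] for _ in range(nregs)]

    def issue(self, reg, fu):
        """Record fu as the newest writer of reg; older writers become history."""
        self.entries[reg].append(fu)

    def current(self, reg):
        """Last FU (in issue order) that is to write reg, or None."""
        return self.entries[reg][-1] if self.entries[reg] else None

    def nameless(self, reg):
        """Historic writers of reg: results superseded by a later writer."""
        return self.entries[reg][:-1]

    def rollback(self, reg, cancelled):
        """Drop cancelled FUs so the newest surviving writer becomes current."""
        self.entries[reg] = [fu for fu in self.entries[reg] if fu not in cancelled]
        return self.current(reg)


q = QTable(4)
q.issue(1, "FU_A")            # FU_A is to write r1
q.issue(1, "FU_B")            # FU_B overwrites r1: FU_A's result is "nameless"
before = (q.current(1), q.nameless(1))
restored = q.rollback(1, {"FU_B"})   # FU_B cancelled by an exception
after = (q.current(1), q.nameless(1))
print(before, restored, after)
```

After the rollback, `FU_A` is once again the writer of record for the register, so in this model it would have to store its result in the register file before the exception proceeds.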
169
170 ## FU-to-FU Dependency Matrix
171
172 The Function-Unit to Function-Unit Dependency Matrix expresses the
173 read and write hazards - dependencies - between Function Units.
174
175 ## Branch Speculation
176
177 Branch speculation is done by preventing instructions from becoming
178 "writeable" until the Branch Unit knows whether the branch has resolved.
179 This is done with the addition of "Shadow" lines, as shown below:
180
181 This image reproduced with kind permission, Copyright (C) Mitch Alsup
182 [[!img shadow_issue_flipflops.png]]
183
184 Note that there are multiple "Shadow" signals, coming not just from Branch
185 Speculation but also from predication and exception shadows.
186
187 On a "Failed" signal, the instruction is told to "Go Die". This is
188 passed to the Computation Unit as well. When all "Success" signals
189 are raised the instruction is permitted to enter "Writeable".
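The combined effect of several "Shadow" signals on one instruction can be sketched as a small state model. The class `ShadowedInstruction` and the shadow names are hypothetical; the sketch assumes that a single "Failed" from any shadow still held is enough to kill the instruction, and that "Writeable" requires every shadow to have signalled "Success".

```python
class ShadowedInstruction:
    """Tracks the shadows held over one instruction."""

    def __init__(self, shadows):
        self.pending = set(shadows)   # shadows that have not yet resolved
        self.dead = False             # set once the instruction is told "Go Die"

    def success(self, shadow):
        """The shadowing unit resolved in this instruction's favour."""
        self.pending.discard(shadow)

    def fail(self, shadow):
        """The shadowing unit raised "Failed": the instruction must "Go Die"."""
        if shadow in self.pending:
            self.dead = True

    @property
    def writeable(self):
        return not self.dead and not self.pending


insn = ShadowedInstruction({"branch0", "predicate1"})
insn.success("branch0")
partly_released = insn.writeable      # predicate1 still holds its shadow
insn.success("predicate1")
fully_released = insn.writeable

doomed = ShadowedInstruction({"branch0", "exception2"})
doomed.fail("branch0")
print(partly_released, fully_released, doomed.dead, doomed.writeable)
```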
190
191 ## Exceptions
192
193 Exceptions shall be handled by each instruction that *may* throw an
194 exception having and holding a "Shadow" wire over all dependent
195 Function Units, in exactly the same way as Branch Speculation.
196 Likewise, dependent instructions are prevented and prohibited from
197 entering the "Writeable" state.
198
199 Dependent downstream instructions, if the exception is thrown,
200 shall have the "Failed" bit ASSERTED (by the Function Unit throwing
201 the exception) such that the down-stream dependent instruction is told
202 to "Go Die".
203
204 If the point is reached at which the instruction knows that the
205 Exception cannot possibly occur, the "Success" signal is raised
206 instead, thus cancelling the "hold" over dependent downstream
207 instructions - again in exactly the same way as Branch Speculation
208 "Success".
209
210 Exceptions may **only** be actually raised if they are at the front of
211 the instruction queue, i.e. if they are free of write hazards.
212 See the "Commit Phase" section under "Function Units", as the Function Units
213 have a "link bit" that preserves the instruction issue order, which
214 must also be respected.
215
216 # Spectre-style timing mitigation
217
218 Spectre-style timing attacks are defined by one instruction issue
219 affecting the completion time of past **and future** instructions.
220 The key insight to mitigation against such attacks is to note that
221 arbitrary untrusted instructions must not be permitted to affect
222 trusted instructions. Consequently, as long as there is a firebreak
223 (a "Fence") between trusted and untrusted, timing attacks can be
224 held off.
225
226 Two instructions ("hints") shall therefore be added:
227
228 * One that stops speculation, multi-issue and any out-of-order
229 resource allocation for a minimum of 16 instructions.
230 * Another that **cancels** all speculation and reservations,
231 cancels "nameless" registers, waits for and ensures that all
232 outstanding instructions have completed and committed, before
233 permitting the processor to continue further.
234
235 This latter shall occur unconditionally without requiring a special
236 instruction to be called, on ECALL as well as all exceptions and
237 interrupts.
238
239 # ALU design
240
241 There is a separate pipelined ALU for fdiv/fsqrt/frsqrt/idiv/irem
242 that is possibly shared between 2 or 4 cores.
243
244 The main ALUs are each a unified ALU for i8-i64/f16-f64 where the
245 ALU is split into lanes with separate instructions for each 32-bit half.
246 So, the multiplier should be capable of 64-bit fmadd, 2x32-bit fmadd,
247 4x16-bit fmadd, 1x32-bit fmadd + 2x16-bit fmadd (in either order), and all
248 (8/16/32/64) sizes of integer mul/mulhsu/mulh/mulhu in 2 groups of 32-bits.
249 fmul can be implemented as fmadd with a zero addend, provided the sign of
250 that zero gives the right result sign bit in every rounding mode.
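The sign-of-zero concern can be demonstrated in software for the round-to-nearest case. The sketch below uses an unfused multiply followed by an add, which is not a true fmadd but behaves identically when the addend is a zero; the helper names are hypothetical, and it assumes Python floats are IEEE 754 binary64 with round-to-nearest-even, as on mainstream platforms.

```python
import math


def fmul_via_add_zero(a, b, addend):
    """Emulate fmul as (a * b) + addend, where addend is +0.0 or -0.0."""
    return (a * b) + addend


def sign(x):
    """Return +1.0 or -1.0 according to the sign bit, including for zeros."""
    return math.copysign(1.0, x)


# Under round-to-nearest, an addend of -0.0 preserves the product's sign...
neg_ok = sign(fmul_via_add_zero(-1.0, 0.0, -0.0))   # product is -0.0
pos_ok = sign(fmul_via_add_zero(1.0, 0.0, -0.0))    # product is +0.0
# ...whereas an addend of +0.0 turns a -0.0 product into +0.0.
neg_wrong = sign(fmul_via_add_zero(-1.0, 0.0, 0.0))

print(neg_ok, pos_ok, neg_wrong)
```

IEEE 754 specifies that an exact zero sum of opposite-signed operands is +0 in every rounding mode except round-toward-negative, where it is -0. The addend's sign therefore cannot be fixed to one value for all rounding modes, which is what the caveat above guards against.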
251
252 # Rowhammer Mitigation
253
254 * <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-March/000699.html>
255 * <https://arxiv.org/pdf/1903.00446.pdf>