1
2 # Simple-V (Parallelism Extension Proposal) Specification
3
4 * Copyright (C) 2017, 2018, 2019 Luke Kenneth Casson Leighton
5 * Status: DRAFTv0.6.1
6 * Last edited: 10 sep 2019
7 * Ancillary resource: [[opcodes]]
8 * Ancillary resource: [[sv_prefix_proposal]]
9 * Ancillary resource: [[abridged_spec]]
10 * Ancillary resource: [[vblock_format]]
11 * Ancillary resource: [[appendix]]
12
13 Authors/Contributors:
14
15 * Luke Kenneth Casson Leighton
16 * Allen Baum
17 * Bruce Hoult
18 * comp.arch
19 * Jacob Bachmeyer
20 * Guy Lemurieux
21 * Jacob Lifshay
22 * Terje Mathisen
23 * The RISC-V Founders, without whom this all would not be possible.
24
25 [[!toc ]]
26
27 # Summary and Background: Rationale
28
29 Simple-V is a uniform parallelism API for RISC-V hardware that has several
30 unplanned side-effects including code-size reduction, expansion of
31 HINT space and more. The reason for
32 creating it is to provide a manageable way to turn a pre-existing design
33 into a parallel one, in a step-by-step incremental fashion, without adding any new opcodes, thus allowing
34 the implementor to focus on adding hardware where it is needed and necessary.
35 The primary target is for mobile-class 3D GPUs and VPUs, with secondary
36 goals being to reduce executable size (by extending the effectiveness of RV opcodes, RVC in particular) and reduce context-switch latency.
37
38 Critically: **No new instructions are added**. The parallelism (if any
39 is implemented) is implicitly added by tagging *standard* scalar registers
40 for redirection. When such a tagged register is used in any instruction,
41 it indicates that the PC shall **not** be incremented; instead a loop
42 is activated where *multiple* instructions are issued to the pipeline
43 (as determined by a length CSR), with contiguously incrementing register
44 numbers starting from the tagged register. When the last "element"
45 has been reached, only then is the PC permitted to move on. Thus
46 Simple-V effectively sits (slots) *in between* the instruction decode phase
47 and the ALU(s).
48
The barrier to entry with SV is therefore very low. The minimum
compliant implementation is software-emulation (traps), requiring
only the CSRs and CSR tables, and that an exception be thrown if an
instruction's registers are detected to have been tagged. The looping
that would otherwise be done in hardware is thus carried out in software
instead. Whilst much slower, it is "compliant" with the SV specification,
and may be suited for implementation in RV32E and also in situations
where the implementor wishes to focus on certain aspects of SV without
investing unnecessary time and resources in the silicon, whilst still
conforming strictly with the API. A good area to punt to software would
be the polymorphic element width capability, for example.
60
61 Hardware Parallelism, if any, is therefore added at the implementor's
62 discretion to turn what would otherwise be a sequential loop into a
63 parallel one.
64
65 To emphasise that clearly: Simple-V (SV) is *not*:
66
67 * A SIMD system
68 * A SIMT system
69 * A Vectorisation Microarchitecture
70 * A microarchitecture of any specific kind
71 * A mandatory parallel processor microarchitecture of any kind
72 * A supercomputer extension
73
74 SV does **not** tell implementors how or even if they should implement
75 parallelism: it is a hardware "API" (Application Programming Interface)
76 that, if implemented, presents a uniform and consistent way to *express*
77 parallelism, at the same time leaving the choice of if, how, how much,
78 when and whether to parallelise operations **entirely to the implementor**.
79
80 # Basic Operation
81
82 The principle of SV is as follows:
83
84 * Standard RV instructions are "prefixed" (extended) through a 48/64
85 bit format (single instruction option) or a variable
86 length VLIW-like prefix (multi or "grouped" option).
87 * The prefix(es) indicate which registers are "tagged" as
88 "vectorised". Predicates can also be added, and element widths
89 overridden on any src or dest register.
90 * A "Vector Length" CSR is set, indicating the span of any future
91 "parallel" operations.
92 * If any operation (a **scalar** standard RV opcode) uses a register
93 that has been so "marked" ("tagged"), a hardware "macro-unrolling loop"
94 is activated, of length VL, that effectively issues **multiple**
95 identical instructions using contiguous sequentially-incrementing
96 register numbers, based on the "tags".
97 * **Whether they be executed sequentially or in parallel or a
98 mixture of both or punted to software-emulation in a trap handler
99 is entirely up to the implementor**.
100
101 In this way an entire scalar algorithm may be vectorised with
102 the minimum of modification to the hardware and to compiler toolchains.
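
The "macro-unrolling loop" described above can be sketched in a few lines of Python. This is an illustrative model only, not the specification's own pseudocode: the function name `expand` and the `tagged` set (standing in for the prefix or VBLOCK tag information) are assumptions made for the example, and predication, element widths and scalar-destination special cases are all left out.

```python
def expand(opcode, rd, rs1, rs2, vl, tagged):
    """Expand one scalar instruction into the element operations SV would issue.

    `tagged` is a hypothetical set of register numbers marked as vectors.
    """
    if not ({rd, rs1, rs2} & tagged):
        return [(opcode, rd, rs1, rs2)]      # nothing tagged: plain scalar op
    ops = []
    for i in range(vl):
        # Tagged registers step through contiguous register numbers;
        # untagged registers stay fixed (they remain scalars).
        ops.append((opcode,
                    rd + i if rd in tagged else rd,
                    rs1 + i if rs1 in tagged else rs1,
                    rs2 + i if rs2 in tagged else rs2))
    return ops
```

For instance, `expand("add", 8, 16, 5, 3, tagged={8, 16})` yields three `add` operations on (x8, x16, x5), (x9, x17, x5) and (x10, x18, x5). Whether those are then executed sequentially or in parallel is left to the implementor.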
103
To reiterate: **There are *no* new opcodes**. The scheme works *entirely*
on hidden context that augments *scalar* RISC-V instructions.
106
107 # CSRs <a name="csrs"></a>
108
An optional "reshaping" CSR key-value table, which remaps from a 1D
linear shape to 2D or 3D (including full transposition), is covered in
the optional [[remap]] section.

There are six additional CSRs, available in any privilege level:
113
114 * MVL (the Maximum Vector Length)
115 * VL (sets which scalar register is to be the Vector Length)
116 * SUBVL (effectively a kind of SIMD)
117 * STATE (containing copies of MVL, VL and SUBVL as well as context information)
118 * SVPSTATE (state information for SVPrefix)
119 * PCVBLK (the current operation being executed within a VBLOCK Group)
120
121 For User Mode there are the following CSRs:
122
123 * uePCVBLK (a copy of the sub-execution Program Counter, that is relative
124 to the start of the current VBLOCK Group, set on a trap).
125 * ueSTATE (useful for saving and restoring during context switch,
126 and for providing fast transitions)
127 * ueSVPSTATE when SVPrefix is implemented
128 Note: ueSVPSTATE is mirrored in the top 32 bits of ueSTATE.
129
130 There are also three additional CSRs for Supervisor-Mode:
131
132 * sePCVBLK
133 * seSTATE (which contains seSVPSTATE)
134 * seSVPSTATE
135
136 And likewise for M-Mode:
137
138 * mePCVBLK
139 * meSTATE (which contains meSVPSTATE)
140 * meSVPSTATE
141
142 The u/m/s CSRs are treated and handled exactly like their (x)epc
143 equivalents. On entry to or exit from a privilege level, the contents
144 of its (x)eSTATE are swapped with STATE.
145
146 Thus for example, a User Mode trap will end up swapping STATE and ueSTATE
147 (on both entry and exit), allowing User Mode traps to have their own
148 Vectorisation Context set up, separated from and unaffected by normal
149 user applications. If an M Mode trap occurs in the middle of the U Mode
150 trap, STATE is swapped with meSTATE, and restored on exit: the U Mode
151 trap continues unaware that the M Mode trap even occurred.
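
The swap-on-entry, swap-on-exit behaviour can be modelled as below. This is a minimal sketch under stated assumptions: the `Hart` class, its method names and the dictionary keyed by privilege letter are all hypothetical, and only the STATE word is modelled (not PCVBLK or SVPSTATE).

```python
class Hart:
    """Toy model of STATE and its per-privilege (x)eSTATE shadows."""

    def __init__(self):
        self.STATE = 0
        self.xeSTATE = {"u": 0, "s": 0, "m": 0}   # ueSTATE, seSTATE, meSTATE

    def _swap(self, level):
        self.STATE, self.xeSTATE[level] = self.xeSTATE[level], self.STATE

    def trap_enter(self, level):
        # On entry, the handler's own context becomes the live STATE.
        self._swap(level)

    def trap_exit(self, level):
        # On exit, the same swap restores the interrupted context.
        self._swap(level)
```

Because a swap is its own inverse, a nested M-Mode trap taken inside a U-Mode trap leaves the U-Mode handler's context exactly as it was once the M-Mode trap returns.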
152
153 Likewise, Supervisor Mode may perform context-switches, safe in the
154 knowledge that its Vectorisation State is unaffected by User Mode.
155
156 The access pattern for these groups of CSRs in each mode follows the
157 same pattern for other CSRs that have M-Mode and S-Mode "mirrors":
158
* In M-Mode, the S-Mode and U-Mode CSRs are separate and distinct.
* In S-Mode, accessing and changing the M-Mode CSRs is transparently
  identical to changing the S-Mode CSRs. Accessing and changing the
  U-Mode CSRs is permitted.
* In U-Mode, accessing and changing the M-Mode and S-Mode CSRs
  is prohibited.
166
167 An interesting side effect of SV STATE being separate and distinct in S
168 Mode is that Vectorised saving of an entire register file to the stack
169 is a single instruction (through accidental provision of LOAD-MULTI
170 semantics). If the SVPrefix P64-LD-type format is used, LOAD-MULTI may
171 even be done with a single standalone 64 bit opcode (P64 may set up SVPSTATE.SUBVL,
172 SVPSTATE.VL and SVPSTATE.MVL from an immediate field, to cover the full regfile). It can
173 even be predicated, which opens up some very interesting possibilities.
174
(x)ePCVBLK CSRs must be treated exactly like their corresponding (x)epc
equivalents. See the VBLOCK section for details.
177
178 ## MAXVECTORLENGTH (MVL) <a name="mvl" />
179
180 MAXVECTORLENGTH is the same concept as MVL in RVV, except that it
181 is variable length and may be dynamically set. MVL is
182 however limited to the regfile bitwidth XLEN (1-32 for RV32,
183 1-64 for RV64 and so on).
184
185 The reason for setting this limit is so that predication registers, when
186 marked as such, may fit into a single register as opposed to fanning
187 out over several registers. This keeps the hardware implementation a
188 little simpler.
189
190 The other important factor to note is that the actual MVL is internally
191 stored **offset by one**, so that it can fit into only 6 bits (for RV64)
192 and still cover a range up to XLEN bits. Attempts to set MVL to zero will
193 return an exception. This is expressed more clearly in the "pseudocode"
194 section, where there are subtle differences between CSRRW and CSRRWI.
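
The offset-by-one storage can be illustrated as follows. This is a sketch only: the helper names `encode_mvl` and `decode_mvl` are invented for the example, and raising `ValueError` stands in for whatever exception the hardware would raise.

```python
XLEN = 64  # assumed RV64 for this example


def encode_mvl(mvl, xlen=XLEN):
    """Return the 6-bit field that stores a requested MVL of 1..XLEN."""
    if not 1 <= mvl <= xlen:
        raise ValueError("MVL must be between 1 and XLEN")
    return mvl - 1          # stored offset by one


def decode_mvl(field):
    """Recover the actual MVL from the stored 6-bit field."""
    return (field & 0x3F) + 1
```

With this encoding the six bits 0b000000 to 0b111111 cover MVL values 1 to 64, which is why a value of zero cannot be represented.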
195
196 ## Vector Length (VL) <a name="vl" />
197
198 VL is very different from RVV's VL. It contains the scalar register *number* that is to be treated as the Vector Length. It is a sub-field of STATE. When set to zero (x0) VL (vectorisation) is disabled.
199
200 Implementations realistically should keep a cached copy of the register pointed to by VL in the instruction issue and decode phases. Out of Order Engines must then, if it is not x0, add this register to Vectorised instruction Dependency Checking as an additional read/write hazard as appropriate.
201
202 Setting VL via this CSR is very unusual. It should not normally be needed except when [[specification/sv.setvl]] is not implemented. Note that unlike in sv.setvl, setting VL does not change the contents of the scalar register that it points to, although if the scalar register's contents are not within the range of MVL at the time that VL is set, an illegal instruction exception must be raised.
203
204 ## SUBVL - Sub Vector Length
205
206 This is a "group by quantity" that effectively asks each iteration
207 of the hardware loop to load SUBVL elements of width elwidth at a
208 time. Effectively, SUBVL is like a SIMD multiplier: instead of just 1
209 operation issued, SUBVL operations are issued.
210
211 Another way to view SUBVL is that each element in the VL length vector is
212 now SUBVL times elwidth bits in length and now comprises SUBVL discrete
213 sub operations. This can be viewed as an inner SUBVL hardware for-loop within a VL hardware for-loop in effect,
214 with the sub-element increased every time in the innermost loop. This
215 is best illustrated in the (simplified) pseudocode example, in the
216 [[appendix]].
217
218 The primary use case for SUBVL is for 3D FP Vectors. A Vector of 3D
219 coordinates X,Y,Z for example may be loaded and multiplied then stored, per
220 VL element iteration, rather than having to set VL to three times larger.
221
222 Setting this CSR to 0 must raise an exception. Setting it to a value
223 greater than 4 likewise. To see the relationship with STATE, see below.
224
225 The main effect of SUBVL is that predication bits are applied per
226 **group**, rather than by individual element.
227
228 This saves a not insignificant number of instructions when handling 3D
229 vectors, as otherwise a much longer predicate mask would have to be set
230 up with regularly-repeated bit patterns.
231
232 See SUBVL Pseudocode illustration in the [[appendix]], for details.
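
As a rough companion to the appendix pseudocode, the nesting of the SUBVL loop inside the VL loop, with one predicate bit per group, might look like this. The function name and the flat element index `i * subvl + s` are assumptions made for illustration, and non-zeroing predication is assumed.

```python
def subvl_element_ops(vl, subvl, predicate):
    """List the flat element indices that would be operated on."""
    ops = []
    for i in range(vl):                  # outer VL hardware loop
        if not (predicate >> i) & 1:
            continue                     # one predicate bit masks the whole group
        for s in range(subvl):           # inner SUBVL hardware loop
            ops.append(i * subvl + s)
    return ops
```

With VL=4, SUBVL=3 and a predicate of 0b0101, groups 0 and 2 are processed, giving elements 0, 1, 2 and 6, 7, 8. Without SUBVL the same effect would need VL=12 and a twelve-bit mask of repeated patterns.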
233
234 ## STATE
235
236 out of date, see <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-June/001896.html>
237
238 This is a standard CSR that contains sufficient information for a
239 full context save/restore. It contains (and permits setting of):
240
241 * MVL
242 * VL
243 * destoffs - the destination element offset of the current parallel
244 instruction being executed
245 * srcoffs - for twin-predication, the source element offset as well.
246 * SUBVL
* svdestoffs - the subvector destination element offset of the current
  parallel instruction being executed (the dsvoffs field in the table below)

Interestingly, STATE may hypothetically also be modified to make the
immediately-following instruction skip a certain number of elements,
by adjusting destoffs and srcoffs (and the subvector offsets as well).

Setting destoffs and srcoffs is realistically intended for saving state
so that exceptions (page faults in particular) may be serviced and the
hardware-loop that was being executed at the time of the trap, from
User-Mode (or Supervisor-Mode), may be returned to and continued from
exactly where it left off. This works because the User-Mode STATE is
neither changed nor used whilst in M-Mode or S-Mode (which is entirely
why M-Mode and S-Mode have their own STATE CSRs, meSTATE and seSTATE).
262
263 The format of the STATE CSR is as follows:

| (31..28) | (27..26) | (25..24) | (23..18) | (17..12) | (11..6) | (5..0) |
| -------- | -------- | -------- | -------- | -------- | ------- | ------ |
| rsvd     | dsvoffs  | subvl    | destoffs | srcoffs  | vl      | maxvl  |
268
269 Legal values of vl are between 0 and 31.
270
271 The relationship between SUBVL and the subvl field is:
272
273 | SUBVL | (25..24) |
274 | ----- | -------- |
275 | 1 | 0b00 |
276 | 2 | 0b01 |
277 | 3 | 0b10 |
278 | 4 | 0b11 |
279
280 When setting this CSR, the following characteristics will be enforced:
281
282 * **MAXVL** will be truncated (after offset) to be within the range 1 to XLEN
283 * **VL** must be set to a scalar register between 0 and 31.
284 * **SUBVL** which sets a SIMD-like quantity, has only 4 values so there
285 are no changes needed
286 * **srcoffs** will be truncated to be within the range 0 to VL-1
287 * **destoffs** will be truncated to be within the range 0 to VL-1
288 * **dsvoffs** will be truncated to be within the range 0 to SUBVL-1
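
The field layout in the table above can be exercised with a small pack/unpack helper. This is a sketch, not part of the specification: the `FIELDS` dictionary and function names are invented here, the helpers work on *raw* field values (so a subvl field of 0b10 means SUBVL=3), and the truncation rules listed above are not modelled (out-of-range values are simply rejected).

```python
FIELDS = {            # name: (lowest bit, width in bits), taken from the table
    "maxvl":    (0, 6),
    "vl":       (6, 6),
    "srcoffs":  (12, 6),
    "destoffs": (18, 6),
    "subvl":    (24, 2),
    "dsvoffs":  (26, 2),
}


def pack_state(**fields):
    """Combine raw field values into a 32-bit STATE word."""
    word = 0
    for name, value in fields.items():
        low, width = FIELDS[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        word |= value << low
    return word


def unpack_state(word):
    """Split a STATE word back into its raw fields."""
    return {name: (word >> low) & ((1 << width) - 1)
            for name, (low, width) in FIELDS.items()}
```

A context switch then only needs to save and restore the single packed word.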
289
290 NOTE: if the following instruction is not a twin predicated instruction,
291 and destoffs or dsvoffs has been set to non-zero, subsequent execution
292 behaviour is undefined. **USE WITH CARE**.
293
294 NOTE: sub-vector looping does not require a twin-predicate corresponding
295 index, because sub-vectors use the *main* (VL) loop predicate bit.
296
When SVPrefix is implemented, it can have its own VL, MVL and SUBVL, as well as element offsets. SVPSTATE.VL acts slightly differently in that it is no longer a pointer to a scalar register but is an actual value, just like RVV's VL.

The format of SVPSTATE, which fits into *both* the top bits of STATE and also into a separate CSR, is as follows:

| (31..28) | (27..26) | (25..24) | (23..18) | (17..12) | (11..6) | (5..0) |
| -------- | -------- | -------- | -------- | -------- | ------- | ------ |
| rsvd     | dsvoffs  | subvl    | destoffs | srcoffs  | vl      | maxvl  |
304
305 ### Hardware rules for when to increment STATE offsets
306
307 The offsets inside STATE are like the indices in a loop, except
308 in hardware. They are also partially (conceptually) similar to a
309 "sub-execution Program Counter". As such, and to allow proper context
310 switching and to define correct exception behaviour, the following rules
311 must be observed:
312
313 * When the VL CSR is set, srcoffs and destoffs are reset to zero.
314 * Each instruction that contains a "tagged" register shall start
315 execution at the *current* value of srcoffs (and destoffs in the case
316 of twin predication)
317 * Unpredicated bits (in nonzeroing mode) shall cause the element operation
318 to skip, incrementing the srcoffs (or destoffs)
319 * On execution of an element operation, Exceptions shall **NOT** cause
320 srcoffs or destoffs to increment.
321 * On completion of the full Vector Loop (srcoffs = VL-1 or destoffs =
322 VL-1 after the last element is executed), both srcoffs and destoffs
323 shall be reset to zero.
324
325 This latter is why srcoffs and destoffs may be stored as values from
326 0 to XLEN-1 in the STATE CSR, because as loop indices they refer to
327 elements. srcoffs and destoffs never need to be set to VL: their maximum
328 operating values are limited to 0 to VL-1.
329
330 The same corresponding rules apply to SUBVL, svsrcoffs and svdestoffs.
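
The rules above make a faulting hardware loop resumable. A minimal Python model, assuming a single offset (srcoffs only, no twin predication), no predicate skipping and VL of at least 1, is given below. `ElementFault` and `run_vector_loop` are hypothetical names for this sketch.

```python
class ElementFault(Exception):
    """Stands in for a trap raised by one element operation."""


def run_vector_loop(state, vl, element_op):
    """Run element operations from state['srcoffs'] up to VL-1."""
    i = state["srcoffs"]          # start at the *current* offset
    while True:
        element_op(i)             # may raise: srcoffs is NOT incremented
        if i == vl - 1:
            state["srcoffs"] = 0  # loop complete: reset to zero
            return
        i += 1
        state["srcoffs"] = i      # stored offset never reaches VL
```

If `element_op` raises at element 2, `state["srcoffs"]` is left at 2. After the trap handler has dealt with the cause, calling `run_vector_loop` again with the same `state` re-executes element 2 and carries on to the end.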
331
332 ## MVL and VL Pseudocode
333
334 The pseudo-code for get and set of VL and MVL use the following internal
335 functions as follows:

    set_mvl_csr(value, rd):
        regs[rd] = STATE.MVL               # old value, as for a normal CSR
        STATE.MVL = MIN(value, STATE.MVL)

    get_mvl_csr(rd):
        regs[rd] = STATE.MVL

    set_vl_csr(value, rd):
        STATE.VL = rd
        return STATE.VL

    get_vl_csr(rd):
        return STATE.VL
349
350 Note that where setting MVL behaves as a normal CSR (returns the old
351 value), unlike standard CSR behaviour, setting VL will return the **new**
352 value of VL **not** the old one.
353
354 For CSRRWI, the range of the immediate is restricted to 5 bits. In order to
355 maximise the effectiveness, an immediate of 0 is used to set VL=1,
356 an immediate of 1 is used to set VL=2 and so on:

    CSRRWI_Set_MVL(value):
        set_mvl_csr(value+1, x0)

    CSRRWI_Set_VL(value):
        set_vl_csr(value+1, x0)
363
364 However for CSRRW the following pseudocode is used for MVL and VL,
365 where setting the value to zero will cause an exception to be raised.
366 The reason is that if VL or MVL are set to zero, the STATE CSR is
367 not capable of storing that value.

    CSRRW_Set_MVL(rs1, rd):
        value = regs[rs1]
        if value == 0 or value > XLEN:
            raise Exception
        set_mvl_csr(value, rd)

    CSRRW_Set_VL(rs1, rd):
        value = regs[rs1]
        if value == 0 or value > XLEN:
            raise Exception
        set_vl_csr(value, rd)
380
381 In this way, when CSRRW is utilised with a loop variable, the value
382 that goes into VL (and into the destination register) may be used
383 in an instruction-minimal fashion:

    CSRvect1 = {type: F, key: a3, val: a3, elwidth: dflt}
    CSRvect2 = {type: F, key: a7, val: a7, elwidth: dflt}
    CSRRWI MVL, 3          # sets MVL == **4** (not 3)
    j zerotest             # in case loop counter a0 already 0
    loop:
    CSRRW VL, t0, a0       # vl = t0 = min(mvl, a0)
    ld     a3, a1          # load 4 registers a3-6 from x
    slli   t1, t0, 3       # t1 = vl * 8 (in bytes)
    ld     a7, a2          # load 4 registers a7-10 from y
    add    a1, a1, t1      # increment pointer to x by vl*8
    fmadd  a7, a3, fa0, a7 # v1 += v0 * fa0 (y = a * x + y)
    sub    a0, a0, t0      # n -= vl (t0)
    st     a7, a2          # store 4 registers a7-10 to y
    add    a2, a2, t1      # increment pointer to y by vl*8
    zerotest:
    bnez   a0, loop        # repeat if n != 0
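
For readers less familiar with strip-mining, the assembly loop above computes the same thing as the following Python reference model. It is a sketch for comparison only: `daxpy_stripmined` is a made-up name, Python lists stand in for memory, and the inner `for` loop plays the role of the hardware loop over VL elements.

```python
def daxpy_stripmined(a, x, y, mvl=4):
    """Compute y = a*x + y in chunks of at most MVL elements."""
    n = len(x)
    pos = 0
    while n != 0:                      # zerotest / bnez a0, loop
        vl = min(mvl, n)               # CSRRW VL, t0, a0
        for i in range(vl):            # hardware loop over the tagged registers
            y[pos + i] += x[pos + i] * a   # fmadd a7, a3, fa0, a7
        pos += vl                      # pointer increments by vl elements
        n -= vl                        # sub a0, a0, t0
    return y
```

With five elements and MVL=4, the loop body runs twice: once with VL=4 and once with VL=1, with no separate clean-up code for the tail.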
401
With the STATE CSR, just like with CSRRWI, in order to maximise the
utilisation of the limited bitspace, "000000" in binary represents
VL==1, "000001" represents VL==2 and so on (likewise for MVL):

    CSRRW_Set_SV_STATE(rs1, rd):
        value = regs[rs1]
        get_state_csr(rd)                        # rd receives the old STATE
        STATE.MVL = set_mvl_csr(value[5:0]+1)    # maxvl: bits 5..0
        STATE.VL = set_vl_csr(value[11:6]+1)     # vl: bits 11..6
        STATE.destoffs = value[23:18]            # destoffs: bits 23..18
        STATE.srcoffs = value[17:12]             # srcoffs: bits 17..12

    get_state_csr(rd):
        regs[rd] = (STATE.MVL-1) | (STATE.VL-1)<<6 | (STATE.srcoffs)<<12 |
                   (STATE.destoffs)<<18
        return regs[rd]
418
419 In both cases, whilst CSR read of VL and MVL return the exact values
420 of VL and MVL respectively, reading and writing the STATE CSR returns
421 those values **minus one**. This is absolutely critical to implement
422 if the STATE CSR is to be used for fast context-switching.
423
424 ## VL, MVL and SUBVL instruction aliases
425
426 This table contains pseudo-assembly instruction aliases. Note the
427 subtraction of 1 from the CSRRWI pseudo variants, to compensate for the
428 reduced range of the 5 bit immediate.
429
430 | alias | CSR |
431 | - | - |
432 | SETVL rd, rs | CSRRW VL, rd, rs |
433 | SETVLi rd, #n | CSRRWI VL, rd, #n-1 |
434 | GETVL rd | CSRRW VL, rd, x0 |
435 | SETMVL rd, rs | CSRRW MVL, rd, rs |
436 | SETMVLi rd, #n | CSRRWI MVL,rd, #n-1 |
437 | GETMVL rd | CSRRW MVL, rd, x0 |
438
Note: CSRRC and the other bit-setting CSR instructions may still be used; they are however not particularly useful (very obscure).
440
441 ## Register key-value (CAM) table <a name="regcsrtable" />
442
443 *NOTE: in prior versions of SV, this table used to be writable and
444 accessible via CSRs. It is now stored in the VBLOCK instruction format. Note
445 that this table does *not* get applied to the SVPrefix P48/64 format,
446 only to scalar opcodes*
447
448 The purpose of the Register table is three-fold:
449
450 * To mark integer and floating-point registers as requiring "redirection"
451 if it is ever used as a source or destination in any given operation.
452 This involves a level of indirection through a 5-to-7-bit lookup table,
453 such that **unmodified** operands with 5 bits (3 for some RVC ops) may
454 access up to **128** registers.
455 * To indicate whether, after redirection through the lookup table, the
456 register is a vector (or remains a scalar).
457 * To over-ride the implicit or explicit bitwidth that the operation would
458 normally give the register.
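
The three roles of the Register table can be pictured as a single lookup. The sketch below is illustrative only: the `RegEntry` field names and the dictionary keyed by (is_int, regnum) are assumptions for the example and do not reproduce the actual 8 bit or 16 bit table formats.

```python
from typing import NamedTuple


class RegEntry(NamedTuple):
    regidx: int    # redirected (up to 7 bit) register number
    isvec: bool    # vector after redirection, or still a scalar
    vew: int       # element width override (0 = default)


def redirect(table, is_int, regnum):
    """Look up a 5 bit operand; return (actual register, isvec, vew)."""
    entry = table.get((is_int, regnum))
    if entry is None:
        return regnum, False, 0      # no entry: unmodified scalar register
    return entry.regidx, entry.isvec, entry.vew
```

Given `table = {(True, 8): RegEntry(100, True, 0b10)}`, an integer operand naming x8 is redirected to register 100 as a vector, whilst x9 (or floating-point register 8) is left untouched.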
459
Note: clearly, if an RVC operation uses a 3 bit spec'd register (x8-x15)
and the Register table contains entries that only refer to registers
x1-x7 or x16-x31, such operations will *never* activate the VL hardware
loop!

If however the (16 bit) Register table does contain such an entry (x8-x15
or x2 in the case of LWSP), that src or dest reg may be redirected
anywhere in the *full* 128 register range. Thus, RVC becomes far more
powerful and has many more opportunities to reduce code size than in
Standard RV32/RV64 executables.
470
471 [[!inline raw="yes" pages="simple_v_extension/reg_table_format" ]]
472
473 i/f is set to "1" to indicate that the redirection/tag entry is to
474 be applied to integer registers; 0 indicates that it is relevant to
475 floating-point registers.
476
477 The 8 bit format is used for a much more compact expression. "isvec"
478 is implicit and, similar to [[sv_prefix_proposal]], the target vector
479 is "regnum<<2", implicitly. Contrast this with the 16-bit format where
480 the target vector is *explicitly* named in bits 8 to 14, and bit 15 may
481 optionally set "scalar" mode.
482
483 Note that whilst SVPrefix adds one extra bit to each of rd, rs1 etc.,
484 and thus the "vector" mode need only shift the (6 bit) regnum by 1 to
485 get the actual (7 bit) register number to use, there is not enough space
486 in the 8 bit format (only 5 bits for regnum) so "regnum<<2" is required.
487
488 vew has the following meanings, indicating that the instruction's
489 operand size is "over-ridden" in a polymorphic fashion:
490
491 | vew | bitwidth |
492 | --- | ------------------- |
493 | 00 | default (XLEN/FLEN) |
494 | 01 | 8 bit |
495 | 10 | 16 bit |
496 | 11 | 32 bit |
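
Decoding vew is a simple table lookup, sketched below. The second helper goes a little further than the text here and assumes that overridden elements are packed contiguously into XLEN-wide registers; both function names are invented for the example.

```python
def element_width(vew, default_width):
    """Map the 2 bit vew field to an element width in bits."""
    widths = {0b00: default_width, 0b01: 8, 0b10: 16, 0b11: 32}
    return widths[vew]


def registers_spanned(vl, vew, xlen=64):
    """Registers needed for VL elements, ASSUMING contiguous packing."""
    bits = vl * element_width(vew, xlen)
    return -(-bits // xlen)          # ceiling division
```

Under that packing assumption, eight 8 bit elements fit in a single RV64 register, whereas nine spill over into a second one.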
497
498 As the above table is a CAM (key-value store) it may be appropriate
499 (faster, implementation-wise) to expand it as follows:
500
501 [[!inline raw="yes" pages="simple_v_extension/reg_table" ]]
502
503 ## Predication Table <a name="predication_csr_table"></a>
504
505 *NOTE: in prior versions of SV, this table used to be writable and
506 accessible via CSRs. It is now stored in the VBLOCK instruction format.
507 The table does **not** apply to SVPrefix opcodes*
508
509 The Predication Table is a key-value store indicating whether, if a
510 given destination register (integer or floating-point) is referred to
511 in an instruction, it is to be predicated. Like the Register table, it
512 is an indirect lookup that allows the RV opcodes to not need modification.
513
514 It is particularly important to note
515 that the *actual* register used can be *different* from the one that is
516 in the instruction, due to the redirection through the lookup table.
517
518 * regidx is the register that in combination with the
519 i/f flag, if that integer or floating-point register is referred to in a
520 (standard RV) instruction results in the lookup table being referenced
521 to find the predication mask to use for this operation.
522 * predidx is the *actual* (full, 7 bit) register to be used for the
523 predication mask.
524 * inv indicates that the predication mask bits are to be inverted
525 prior to use *without* actually modifying the contents of the
526 register from which those bits originated.
527 * zeroing is either 1 or 0, and if set to 1, the operation must
528 place zeros in any element position where the predication mask is
529 set to zero. If zeroing is set to 0, unpredicated elements *must*
530 be left alone. Some microarchitectures may choose to interpret
531 this as skipping the operation entirely. Others which wish to
532 stick more closely to a SIMD architecture may choose instead to
533 interpret unpredicated elements as an internal "copy element"
534 operation (which would be necessary in SIMD microarchitectures
535 that perform register-renaming)
536 * ffirst is a special mode that stops sequential element processing when
537 a data-dependent condition occurs, whether a trap or a conditional test.
538 The handling of each (trap or conditional test) is slightly different:
539 see Instruction sections for further details
540
541 [[!inline raw="yes" pages="simple_v_extension/pred_table_format" ]]
542
The 8 bit format is a compact and less expressive variant of the full
16 bit format. Using the 8 bit format is very different: the predicate
register to use is implicit, and numbering begins implicitly from x9. The
regnum is still used to "activate" predication, in the same fashion as
described above.
548
549 The 16 bit Predication CSR Table is a key-value store, so
550 implementation-wise it will be faster to turn the table around (maintain
551 topologically equivalent state). Opportunities then exist to access
552 registers in unary form instead of binary, saving gates and power by
553 only activating "redirection" with a single AND gate, instead of
554 multiple multi-bit XORs (a CAM):
555
556 [[!inline raw="yes" pages="simple_v_extension/pred_table" ]]
557
558 So when an operation is to be predicated, it is the internal state that
559 is used. In Section 6.4.2 of Hwacha's Manual (EECS-2015-262) the following
560 pseudo-code for operations is given, where p is the explicit (direct)
561 reference to the predication register to be used:

    for (int i=0; i<vl; ++i)
        if ([!]preg[p][i])
            (d ? vreg[rd][i] : sreg[rd]) =
                iop(s1 ? vreg[rs1][i] : sreg[rs1],
                    s2 ? vreg[rs2][i] : sreg[rs2]); // for insts with 2 inputs

This instead becomes an *indirect* reference using the *internal* state
table generated from the Predication CSR key-value store, which is used
as follows.

    if type(iop) == INT:
        preg = int_pred_reg[rd]
    else:
        preg = fp_pred_reg[rd]

    predicate, zeroing = get_pred_val(type(iop) == INT, rd)
    for (int i=0; i<vl; ++i)
        if (predicate & (1<<i))   // bitwise test of predicate bit i
            result = iop(s1 ? regfile[rs1+i] : regfile[rs1],
                         s2 ? regfile[rs2+i] : regfile[rs2]);
            (d ? regfile[rd+i] : regfile[rd]) = result
            if preg.ffirst and result == 0:
                VL = i # result was zero, end loop early, return VL
                return
        else if (zeroing)
            (d ? regfile[rd+i] : regfile[rd]) = 0
589
590 Note:
591
* d, s1 and s2 are booleans indicating whether destination,
  source1 and source2 are vector or scalar
* key-value CSR-redirection of rd, rs1 and rs2 has NOT been included
  above, for clarity. rd, rs1 and rs2 must all ALSO go through
  register-level redirection (from the Register table) if they are
  vectors.
* fail-on-first mode stops execution early whenever an operation
  returns a zero value. Floating-point results count both
  positive-zero and negative-zero as "fail".
601
602 If written as a function, obtaining the predication mask (and whether
603 zeroing takes place) may be done as follows:
604
605 [[!inline raw="yes" pages="simple_v_extension/get_pred_value" ]]
606
607 Note here, critically, that **only** if the register is marked
608 in its **register** table entry as being "active" does the testing
609 proceed further to check if the **predicate** table entry is
610 also active.
611
Note also that this is in direct contrast to branch operations
for the storage of comparisons: in these specific circumstances
the requirement for there to be an active *register* entry
is removed.
616
617 ## Fail-on-First Mode <a name="ffirst-mode"></a>
618
619 ffirst is a special data-dependent predicate mode. There are two
620 variants: one is for faults: typically for LOAD/STORE operations,
621 which may encounter end of page faults during a series of operations.
622 The other variant is comparisons such as FEQ (or the augmented behaviour
623 of Branch), and any operation that returns a result of zero (whether
624 integer or floating-point). In the FP case, this includes negative-zero.
625
ffirst interacts with zeroing and non-zeroing predication. In non-zeroing
mode, masked-out operations are simply excluded from testing (can never
fail). However for fail-comparisons (not faults) in zeroing mode, the
result will be zero: this *always* "fails", thus on the very first
masked-out element ffirst will always terminate.
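
The interaction between ffirst and the two predication modes, for the comparison variant only, can be sketched as below. `ffirst_compare` is a hypothetical helper operating on a list of already-computed element results, and it follows the prose of this section rather than any particular pseudocode.

```python
def ffirst_compare(results, predicate, zeroing):
    """Return the truncated VL: the index of the first "failing" element."""
    for i, result in enumerate(results):
        active = (predicate >> i) & 1
        if not active:
            if zeroing:
                return i      # a zeroed result is zero, so it always "fails"
            continue          # non-zeroing: masked-out element is never tested
        if result == 0:       # also true for floating-point -0.0
            return i
    return len(results)       # nothing failed: VL is unchanged
```

For results `[5, 7, 3, 9]` and predicate 0b1011, non-zeroing mode runs to completion (element 2 is skipped), whereas zeroing mode stops at element 2.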
631
Note that ffirst mode works because the execution order must "appear" to
be in "program order". An in-order architecture must execute the element
operations in sequence, whilst an out-of-order architecture must *commit*
the element operations in sequence and cancel speculatively-executed
ones (giving the appearance of in-order execution).
637
638 Note also, that if ffirst mode is needed without predication, a special
639 "always-on" Predicate Table Entry may be constructed by setting
640 inverse-on and using x0 as the predicate register. This
641 will have the effect of creating a mask of all ones, allowing ffirst
642 to be set.
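
The "always-on" trick works because x0 always reads as zero, so inverting it gives all ones. A short illustration, with `predicate_mask` as an invented helper and the register file modelled as a plain list:

```python
def predicate_mask(regfile, predidx, inv, vl):
    """Return the VL-bit predicate mask taken from register predidx."""
    mask = regfile[predidx]
    if inv:
        mask = ~mask                   # inverted on use; the register is untouched
    return mask & ((1 << vl) - 1)      # keep only the low VL bits
```

With `regfile[0] == 0`, `predicate_mask(regfile, 0, True, 8)` gives 0xFF: every element is enabled, and ffirst can then be set on that table entry.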
643
See [[appendix]] for more details on fail-on-first modes, as well as
pseudo-code.
646
647 ## REMAP and SHAPE CSRs <a name="remap" />
648
649 See optional [[remap]] section.
650
651 # Instruction Execution Order
652
653 Simple-V behaves as if it is a hardware-level "macro expansion system",
654 substituting and expanding a single instruction into multiple sequential
655 instructions with contiguous and sequentially-incrementing registers.
656 As such, it does **not** modify - or specify - the behaviour and semantics of
657 the execution order: that may be deduced from the **existing** RV
658 specification in each and every case.
659
660 So for example if a particular micro-architecture permits out-of-order
661 execution, and it is augmented with Simple-V, then wherever instructions
662 may be out-of-order then so may the "post-expansion" SV ones.
663
664 If on the other hand there are memory guarantees which specifically
665 prevent and prohibit certain instructions from being re-ordered
666 (such as the Atomicity Axiom, or FENCE constraints), then clearly
667 those constraints **MUST** also be obeyed "post-expansion".
668
It should be absolutely clear that SV is **not** about providing new
functionality or changing the existing behaviour of a micro-architectural
design, or about changing the RISC-V Specification.
It is **purely** about compacting what would otherwise be contiguous
instructions that use sequentially-increasing register numbers down
to **one** instruction.
675
676 # Instructions <a name="instructions" />
677
678 See [[appendix]] for specific cases where instruction behaviour is
679 augmented. A greatly simplified example is below. Note that this
680 is the ADD implementation, not a separate VADD instruction:
681
682 [[!inline raw="yes" pages="simple_v_extension/simple_add_example" ]]
683
684 Note that several things have been left out of this example.
685 See [[appendix]] for additional examples that show how to add
686 support for additional features (twin predication, elwidth,
687 zeroing, SUBVL etc.)
688
689 Branches in particular have been transparently augmented to include
690 "collation" of comparison results into a tagged register.
691
692 # Exceptions
693
694 Exceptions may occur at any time, in any given underlying scalar
695 operation. This implies that context-switching (traps) may occur, and
696 operation must be returned to where it left off. That in turn implies
697 that the full state - including the current parallel element being
698 processed - has to be saved and restored. This is what the **STATE**
699 and **PCVBLK** CSRs are for.
700
The implications are that all underlying individual scalar operations
"issued" by the parallelisation have to appear to be executed sequentially.
The further implications are that if two or more individual element
operations are underway, and one with an earlier index causes an exception,
it will be necessary for the microarchitecture to **discard** or terminate
operations with higher indices. Optimised microarchitectures could
hypothetically store (cache) results, for subsequent replay if appropriate.
708
709 In short: exception handling **MUST** be precise, in-order, and exactly
710 like Standard RISC-V as far as the instruction execution order is
711 concerned, regardless of whether it is PC, PCVBLK, VL or SUBVL that
712 is currently being incremented.
713
714 # Hints
715
716 A "HINT" is an operation that has no effect on architectural state,
717 where its use may, by agreed convention, give advance notification
718 to the microarchitecture: branch prediction notification would be
719 a good example. Usually HINTs are where rd=x0.
720
721 With Simple-V being capable of issuing *parallel* instructions where
722 rd=x0, the space for possible HINTs is expanded considerably. VL
723 could be used to indicate different hints. In addition, if predication
724 is set, the predication register itself could hypothetically be passed
725 in as a *parameter* to the HINT operation.
726
No specific hints are yet defined in Simple-V.
728
729 # Vector Block Format <a name="vliw-format"></a>
730
731 The VBLOCK Format allows Register, Predication and Vector Length to be contextually associated with a group of RISC-V scalar opcodes. The format is as follows:
732
733 [[!inline raw="yes" pages="simple_v_extension/vblock_format_table" ]]
734
735 For more details, including the CSRs, see ancillary resource: [[vblock_format]]
736
737 # Under consideration <a name="issues"></a>
738
739 See [[discussion]]
740