1 # Simple-V (Parallelism Extension Proposal) Specification
2
3 * Copyright (C) 2017, 2018, 2019 Luke Kenneth Casson Leighton
4 * Status: DRAFTv0.6.1
5 * Last edited: 10 sep 2019
6 * Ancillary resource: [[opcodes]]
7 * Ancillary resource: [[sv_prefix_proposal]]
8 * Ancillary resource: [[abridged_spec]]
9 * Ancillary resource: [[vblock_format]]
10 * Ancillary resource: [[appendix]]
11
12 With thanks to:
13
14 * Allen Baum
15 * Bruce Hoult
16 * comp.arch
17 * Jacob Bachmeyer
18 * Guy Lemurieux
19 * Jacob Lifshay
20 * Terje Mathisen
21 * The RISC-V Founders, without whom this all would not be possible.
22
23 [[!toc ]]
24
25 # Summary and Background: Rationale
26
Simple-V is a uniform parallelism API for RISC-V hardware that has several
unplanned side-effects, including code-size reduction and expansion of
HINT space. The reason for creating it is to provide a manageable way to
turn a pre-existing design into a parallel one, in a step-by-step
incremental fashion, without adding any new opcodes, thus allowing the
implementor to focus on adding hardware where it is needed.
The primary target is mobile-class 3D GPUs and VPUs, with secondary
goals being to reduce executable size (by extending the effectiveness of
RV opcodes, RVC in particular) and to reduce context-switch latency.
35
36 Critically: **No new instructions are added**. The parallelism (if any
37 is implemented) is implicitly added by tagging *standard* scalar registers
38 for redirection. When such a tagged register is used in any instruction,
39 it indicates that the PC shall **not** be incremented; instead a loop
40 is activated where *multiple* instructions are issued to the pipeline
41 (as determined by a length CSR), with contiguously incrementing register
42 numbers starting from the tagged register. When the last "element"
43 has been reached, only then is the PC permitted to move on. Thus
44 Simple-V effectively sits (slots) *in between* the instruction decode phase
45 and the ALU(s).
46
47 The barrier to entry with SV is therefore very low. The minimum
48 compliant implementation is software-emulation (traps), requiring
49 only the CSRs and CSR tables, and that an exception be thrown if an
50 instruction's registers are detected to have been tagged. The looping
51 that would otherwise be done in hardware is thus carried out in software,
52 instead. Whilst much slower, it is "compliant" with the SV specification,
53 and may be suited for implementation in RV32E and also in situations
where the implementor wishes to focus on certain aspects of SV, without
investing unnecessary time and resources in the silicon, whilst also conforming
56 strictly with the API. A good area to punt to software would be the
57 polymorphic element width capability for example.
58
59 Hardware Parallelism, if any, is therefore added at the implementor's
60 discretion to turn what would otherwise be a sequential loop into a
61 parallel one.
62
63 To emphasise that clearly: Simple-V (SV) is *not*:
64
65 * A SIMD system
66 * A SIMT system
67 * A Vectorisation Microarchitecture
68 * A microarchitecture of any specific kind
* A mandatory parallel processor microarchitecture of any kind
70 * A supercomputer extension
71
72 SV does **not** tell implementors how or even if they should implement
73 parallelism: it is a hardware "API" (Application Programming Interface)
74 that, if implemented, presents a uniform and consistent way to *express*
75 parallelism, at the same time leaving the choice of if, how, how much,
76 when and whether to parallelise operations **entirely to the implementor**.
77
78 # Basic Operation
79
80 The principle of SV is as follows:
81
82 * Standard RV instructions are "prefixed" (extended) through a 48/64
83 bit format (single instruction option) or a variable
84 length VLIW-like prefix (multi or "grouped" option).
85 * The prefix(es) indicate which registers are "tagged" as
86 "vectorised". Predicates can also be added, and element widths
87 overridden on any src or dest register.
88 * A "Vector Length" CSR is set, indicating the span of any future
89 "parallel" operations.
90 * If any operation (a **scalar** standard RV opcode) uses a register
91 that has been so "marked" ("tagged"), a hardware "macro-unrolling loop"
92 is activated, of length VL, that effectively issues **multiple**
93 identical instructions using contiguous sequentially-incrementing
94 register numbers, based on the "tags".
95 * **Whether they be executed sequentially or in parallel or a
96 mixture of both or punted to software-emulation in a trap handler
97 is entirely up to the implementor**.
98
99 In this way an entire scalar algorithm may be vectorised with
100 the minimum of modification to the hardware and to compiler toolchains.
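
The "macro-unrolling loop" described above can be modelled in a few lines of Python. This is only an illustrative sketch: the `expand` function, the `tagged` set and the tuple representation of an instruction are hypothetical names invented for the example, not part of the specification, and the special handling of scalar destinations, predication and element widths is deliberately left out.

```python
def expand(opcode, rd, rs1, rs2, tagged, vl):
    """Expand one scalar instruction into the element operations it stands for.

    `tagged` is a hypothetical set of register numbers marked as vectors;
    `vl` is the current vector length. Both names are assumptions made
    for this sketch.
    """
    if not ({rd, rs1, rs2} & tagged):
        # No tagged register is involved: ordinary scalar behaviour.
        return [(opcode, rd, rs1, rs2)]
    ops = []
    for i in range(vl):
        # Only tagged registers step through contiguous register numbers;
        # untagged registers stay fixed (they remain scalars).
        step = lambda r: r + i if r in tagged else r
        ops.append((opcode, step(rd), step(rs1), step(rs2)))
    return ops


# x8 and x16 are tagged as vectors, x5 is left as a scalar.
ops = expand("add", 8, 16, 5, tagged={8, 16}, vl=3)
```

Here `ops` holds three `add` operations using x8/x16, x9/x17 and x10/x18 in turn, each with the scalar x5 as the second source. Whether those three operations run one after another or side by side is exactly the part that is left to the implementor.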
101
102 To reiterate: **There are *no* new opcodes**. The scheme works *entirely*
on hidden context that augments *scalar* RISC-V instructions.
104
105 # CSRs <a name="csrs"></a>
106
107 * An optional "reshaping" CSR key-value table which remaps from a 1D
108 linear shape to 2D or 3D, including full transposition.
109
There are six additional CSRs, available in any privilege level:
111
112 * MVL (the Maximum Vector Length)
113 * VL (sets which scalar register is to be the Vector Length)
114 * SUBVL (effectively a kind of SIMD)
115 * STATE (containing copies of MVL, VL and SUBVL as well as context information)
116 * SVPSTATE (state information for SVPrefix)
117 * PCVBLK (the current operation being executed within a VBLOCK Group)
118
119 For User Mode there are the following CSRs:
120
121 * uePCVBLK (a copy of the sub-execution Program Counter, that is relative
122 to the start of the current VBLOCK Group, set on a trap).
123 * ueSTATE (useful for saving and restoring during context switch,
124 and for providing fast transitions)
125 * ueSVPSTATE when SVPrefix is implemented
126 Note: ueSVPSTATE is mirrored in the top 32 bits of ueSTATE.
127
128 There are also three additional CSRs for Supervisor-Mode:
129
130 * sePCVBLK
131 * seSTATE (which contains seSVPSTATE)
132 * seSVPSTATE
133
134 And likewise for M-Mode:
135
136 * mePCVBLK
137 * meSTATE (which contains meSVPSTATE)
138 * meSVPSTATE
139
140 The u/m/s CSRs are treated and handled exactly like their (x)epc
141 equivalents. On entry to or exit from a privilege level, the contents
142 of its (x)eSTATE are swapped with STATE.
143
144 Thus for example, a User Mode trap will end up swapping STATE and ueSTATE
145 (on both entry and exit), allowing User Mode traps to have their own
146 Vectorisation Context set up, separated from and unaffected by normal
147 user applications. If an M Mode trap occurs in the middle of the U Mode
148 trap, STATE is swapped with meSTATE, and restored on exit: the U Mode
149 trap continues unaware that the M Mode trap even occurred.
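
The swap behaviour described above can be sketched as follows. The `Hart` class and the numeric values are hypothetical and exist only to make the nesting visible; the sketch assumes nothing beyond "STATE is swapped with (x)eSTATE on entry and on exit".

```python
class Hart:
    """Toy model holding STATE and the per-mode (x)eSTATE copies."""

    def __init__(self):
        self.STATE = 0
        self.xeSTATE = {"u": 0, "s": 0, "m": 0}

    def _swap(self, level):
        # Entry and exit perform the same operation: a swap.
        self.STATE, self.xeSTATE[level] = self.xeSTATE[level], self.STATE

    def trap_enter(self, level):
        self._swap(level)

    def trap_exit(self, level):
        self._swap(level)


hart = Hart()
hart.STATE = 0x111          # context of the interrupted application
hart.xeSTATE["u"] = 0x222   # context reserved for User Mode traps
hart.xeSTATE["m"] = 0x333   # context reserved for M Mode traps

hart.trap_enter("u")        # U Mode trap starts
hart.trap_enter("m")        # M Mode trap arrives in the middle of it
state_in_m = hart.STATE
hart.trap_exit("m")         # back inside the U Mode trap
state_in_u = hart.STATE
hart.trap_exit("u")         # back in the application
```

After the final exit, `hart.STATE` is the application's original value again, and the U Mode trap saw the same context before and after the nested M Mode trap.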
150
151 Likewise, Supervisor Mode may perform context-switches, safe in the
152 knowledge that its Vectorisation State is unaffected by User Mode.
153
154 The access pattern for these groups of CSRs in each mode follows the
155 same pattern for other CSRs that have M-Mode and S-Mode "mirrors":
156
157 * In M-Mode, the S-Mode and U-Mode CSRs are separate and distinct.
158 * In S-Mode, accessing and changing of the M-Mode CSRs is transparently
159 identical
160 to changing the S-Mode CSRs. Accessing and changing the U-Mode
161 CSRs is permitted.
162 * In U-Mode, accessing and changing of the S-Mode and U-Mode CSRs
163 is prohibited.
164
165 An interesting side effect of SV STATE being separate and distinct in S
166 Mode is that Vectorised saving of an entire register file to the stack
167 is a single instruction (through accidental provision of LOAD-MULTI
168 semantics). If the SVPrefix P64-LD-type format is used, LOAD-MULTI may
169 even be done with a single standalone 64 bit opcode (P64 may set up SVPSTATE.SUBVL,
170 SVPSTATE.VL and SVPSTATE.MVL from an immediate field, to cover the full regfile). It can
171 even be predicated, which opens up some very interesting possibilities.
172
The (x)ePCVBLK CSRs must be treated exactly like their corresponding (x)epc
equivalents. See the VBLOCK section for details.
175
176 ## MAXVECTORLENGTH (MVL) <a name="mvl" />
177
178 MAXVECTORLENGTH is the same concept as MVL in RVV, except that it
179 is variable length and may be dynamically set. MVL is
180 however limited to the regfile bitwidth XLEN (1-32 for RV32,
181 1-64 for RV64 and so on).
182
183 The reason for setting this limit is so that predication registers, when
184 marked as such, may fit into a single register as opposed to fanning
185 out over several registers. This keeps the hardware implementation a
186 little simpler.
187
188 The other important factor to note is that the actual MVL is internally
189 stored **offset by one**, so that it can fit into only 6 bits (for RV64)
190 and still cover a range up to XLEN bits. Attempts to set MVL to zero will
191 return an exception. This is expressed more clearly in the "pseudocode"
192 section, where there are subtle differences between CSRRW and CSRRWI.
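
As a small illustration of the offset-by-one storage, the following sketch encodes and decodes a 6-bit MVL field. The function names are hypothetical, and `XLEN = 64` is an assumption (RV64) made for the example.

```python
XLEN = 64  # assumption for this sketch: RV64


def encode_mvl(mvl):
    """Return the 6-bit field that stores MVL (stored value is MVL - 1)."""
    if not 1 <= mvl <= XLEN:
        # Mirrors the rule that MVL == 0 (or beyond XLEN) is rejected.
        raise ValueError("MVL must be in the range 1..XLEN")
    return mvl - 1


def decode_mvl(field):
    """Recover the architectural MVL from the stored 6-bit field."""
    return (field & 0x3F) + 1
```

With this encoding the all-zeros field means MVL=1 and the all-ones field means MVL=64, so the full range 1 to XLEN fits in 6 bits and a stored value of "zero elements" cannot be expressed at all.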
193
194 ## Vector Length (VL) <a name="vl" />
195
196 VL is very different from RVV's VL. It contains the scalar register *number* that is to be treated as the Vector Length. It is a sub-field of STATE. When set to zero (x0) VL (vectorisation) is disabled.
197
198 Implementations realistically should keep a cached copy of the register pointed to by VL in the instruction issue and decode phases. Out of Order Engines must then, if it is not x0, add this register to Vectorised instruction Dependency Checking as an additional read/write hazard as appropriate.
199
200 Setting VL via this CSR is very unusual. It should not normally be needed except when [[specification/sv.setvl]] is not implemented. Note that unlike in sv.setvl, setting VL does not change the contents of the scalar register that it points to, although if the scalar register's contents are not within the range of MVL at the time that VL is set, an illegal instruction exception must be raised.
201
202 ## SUBVL - Sub Vector Length
203
204 This is a "group by quantity" that effectively asks each iteration
205 of the hardware loop to load SUBVL elements of width elwidth at a
206 time. Effectively, SUBVL is like a SIMD multiplier: instead of just 1
207 operation issued, SUBVL operations are issued.
208
209 Another way to view SUBVL is that each element in the VL length vector is
210 now SUBVL times elwidth bits in length and now comprises SUBVL discrete
211 sub operations. This can be viewed as an inner SUBVL hardware for-loop within a VL hardware for-loop in effect,
212 with the sub-element increased every time in the innermost loop. This
213 is best illustrated in the (simplified) pseudocode example, in the
214 [[appendix]].
215
216 The primary use case for SUBVL is for 3D FP Vectors. A Vector of 3D
217 coordinates X,Y,Z for example may be loaded and multiplied then stored, per
218 VL element iteration, rather than having to set VL to three times larger.
219
220 Setting this CSR to 0 must raise an exception. Setting it to a value
221 greater than 4 likewise. To see the relationship with STATE, see below.
222
223 The main effect of SUBVL is that predication bits are applied per
224 **group**, rather than by individual element.
225
226 This saves a not insignificant number of instructions when handling 3D
227 vectors, as otherwise a much longer predicate mask would have to be set
228 up with regularly-repeated bit patterns.
229
230 See SUBVL Pseudocode illustration in the [[appendix]], for details.
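
A minimal sketch of the "inner SUBVL loop within a VL loop" view, with the predicate bit applied per group, might look like the following. It is not the appendix pseudocode: the function name, the flat Python list standing in for the register file and the non-zeroing behaviour are all assumptions made for the example.

```python
def subvl_loop(src, vl, subvl, predicate, op):
    """Apply `op` to vl groups of subvl elements, predicated per group."""
    out = list(src)
    for i in range(vl):                  # outer (VL) hardware loop
        if not (predicate >> i) & 1:
            continue                     # one predicate bit masks the whole group
        for s in range(subvl):           # inner (SUBVL) hardware loop
            idx = i * subvl + s
            out[idx] = op(src[idx])
    return out


# Three XYZ coordinates; the middle one is masked out by a 3-bit predicate.
xyz = [1, 2, 3, 4, 5, 6, 7, 8, 9]
scaled = subvl_loop(xyz, vl=3, subvl=3, predicate=0b101, op=lambda v: v * 10)
```

Without SUBVL the same effect would need VL=9 and a 9-bit mask of `0b111000111`, which is the "regularly-repeated bit pattern" mentioned above.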
231
232 ## STATE
233
NOTE: this section is out of date, see <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-June/001896.html>
235
236 This is a standard CSR that contains sufficient information for a
237 full context save/restore. It contains (and permits setting of):
238
239 * MVL
240 * VL
241 * destoffs - the destination element offset of the current parallel
242 instruction being executed
243 * srcoffs - for twin-predication, the source element offset as well.
244 * SUBVL
245 * svdestoffs - the subvector destination element offset of the current
246 parallel instruction being executed
247
248 Interestingly STATE may hypothetically also be modified to make the
249 immediately-following instruction to skip a certain number of elements,
250 by playing with destoffs and srcoffs (and the subvector offsets as well)
251
252 Setting destoffs and srcoffs is realistically intended for saving state
253 so that exceptions (page faults in particular) may be serviced and the
254 hardware-loop that was being executed at the time of the trap, from
255 user-mode (or Supervisor-mode), may be returned to and continued from
256 exactly where it left off. The reason why this works is because setting
257 User-Mode STATE will not change (not be used) in M-Mode or S-Mode (and
258 is entirely why M-Mode and S-Mode have their own STATE CSRs, meSTATE
259 and seSTATE).
260
261 The format of the STATE CSR is as follows:
262
263 | (31..28) | (27..26) | (25..24) | (23..18) | (17..12) | (11..6) | (5...0) |
264 | -------- | -------- | -------- | -------- | -------- | ------- | ------- |
265 | rsvd | dsvoffs | subvl | destoffs | srcoffs | vl | maxvl |
266
267 Legal values of vl are between 0 and 31.
268
269 The relationship between SUBVL and the subvl field is:
270
271 | SUBVL | (25..24) |
272 | ----- | -------- |
273 | 1 | 0b00 |
274 | 2 | 0b01 |
275 | 3 | 0b10 |
276 | 4 | 0b11 |
277
278 When setting this CSR, the following characteristics will be enforced:
279
280 * **MAXVL** will be truncated (after offset) to be within the range 1 to XLEN
281 * **VL** must be set to a scalar register between 0 and 31.
282 * **SUBVL** which sets a SIMD-like quantity, has only 4 values so there
283 are no changes needed
284 * **srcoffs** will be truncated to be within the range 0 to VL-1
285 * **destoffs** will be truncated to be within the range 0 to VL-1
286 * **dsvoffs** will be truncated to be within the range 0 to SUBVL-1
287
288 NOTE: if the following instruction is not a twin predicated instruction,
289 and destoffs or dsvoffs has been set to non-zero, subsequent execution
290 behaviour is undefined. **USE WITH CARE**.
291
292 NOTE: sub-vector looping does not require a twin-predicate corresponding
293 index, because sub-vectors use the *main* (VL) loop predicate bit.
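
The field layout in the STATE table can be exercised with a short packing and unpacking sketch. It works on the *raw* field values as stored (so no offset-by-one or register-pointer interpretation is applied), and the helper names `pack_state` and `unpack_state` are hypothetical.

```python
# name: (lowest bit, width), taken from the STATE format table
FIELDS = {
    "maxvl":    (0, 6),
    "vl":       (6, 6),
    "srcoffs":  (12, 6),
    "destoffs": (18, 6),
    "subvl":    (24, 2),
    "dsvoffs":  (26, 2),
}


def pack_state(**values):
    """Build a STATE word from raw field values (missing fields are 0)."""
    word = 0
    for name, (low, width) in FIELDS.items():
        value = values.get(name, 0)
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        word |= value << low
    return word


def unpack_state(word):
    """Split a STATE word back into its raw fields."""
    return {name: (word >> low) & ((1 << width) - 1)
            for name, (low, width) in FIELDS.items()}


raw = dict(maxvl=3, vl=5, srcoffs=2, destoffs=1, subvl=2, dsvoffs=1)
word = pack_state(**raw)
```

A context switch then only needs to save and restore `word`; `unpack_state(word)` returns the same raw fields that were packed.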
294
When SVPrefix is implemented, it can have its own VL, MVL and SUBVL, as well as element offsets. SVPSTATE.VL acts slightly differently in that it is no longer a pointer to a scalar register but is an actual value, just like RVV's VL.

The format of SVPSTATE, which fits into *both* the top bits of STATE and also into a separate CSR, is as follows:
298
299 | (31..28) | (27..26) | (25..24) | (23..18) | (17..12) | (11..6) | (5...0) |
300 | -------- | -------- | -------- | -------- | -------- | ------- | ------- |
301 | rsvd | dsvoffs | subvl | destoffs | srcoffs | vl | maxvl |
302
303 ### Hardware rules for when to increment STATE offsets
304
305 The offsets inside STATE are like the indices in a loop, except
306 in hardware. They are also partially (conceptually) similar to a
307 "sub-execution Program Counter". As such, and to allow proper context
308 switching and to define correct exception behaviour, the following rules
309 must be observed:
310
311 * When the VL CSR is set, srcoffs and destoffs are reset to zero.
312 * Each instruction that contains a "tagged" register shall start
313 execution at the *current* value of srcoffs (and destoffs in the case
314 of twin predication)
315 * Unpredicated bits (in nonzeroing mode) shall cause the element operation
316 to skip, incrementing the srcoffs (or destoffs)
317 * On execution of an element operation, Exceptions shall **NOT** cause
318 srcoffs or destoffs to increment.
319 * On completion of the full Vector Loop (srcoffs = VL-1 or destoffs =
320 VL-1 after the last element is executed), both srcoffs and destoffs
321 shall be reset to zero.
322
323 This latter is why srcoffs and destoffs may be stored as values from
324 0 to XLEN-1 in the STATE CSR, because as loop indices they refer to
325 elements. srcoffs and destoffs never need to be set to VL: their maximum
326 operating values are limited to 0 to VL-1.
327
328 The same corresponding rules apply to SUBVL, svsrcoffs and svdestoffs.
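
The rules above can be illustrated with a toy, single-predicate model in which only srcoffs is tracked. Everything here (the `Fault` exception, the `state` dictionary, the `element_op` callback) is hypothetical scaffolding for the example; the point is only that a faulting element leaves the offset untouched, so that re-running the same instruction resumes at that element.

```python
class Fault(Exception):
    """Stands in for a trap such as a page fault."""


def run_vector_op(state, vl, predicate, element_op):
    """Run (or resume) the element loop, starting at state['srcoffs']."""
    while state["srcoffs"] < vl:
        i = state["srcoffs"]
        if (predicate >> i) & 1:
            element_op(i)            # may raise Fault: srcoffs is NOT advanced
        state["srcoffs"] = i + 1     # completed or masked-out: move on
    state["srcoffs"] = 0             # full loop finished: reset the offset


done = []
already_faulted = set()


def element_op(i):
    # Element 2 faults the first time it is attempted, then succeeds.
    if i == 2 and i not in already_faulted:
        already_faulted.add(i)
        raise Fault(i)
    done.append(i)


state = {"srcoffs": 0}
try:
    run_vector_op(state, 4, 0b1101, element_op)
except Fault:
    resume_at = state["srcoffs"]     # what a trap handler would find saved
    run_vector_op(state, 4, 0b1101, element_op)
```

In this run element 1 is skipped by the predicate, element 2 faults once, and the second call picks up at element 2 without repeating element 0.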
329
330 ## MVL and VL Pseudocode
331
332 The pseudo-code for get and set of VL and MVL use the following internal
333 functions as follows:
334
    set_mvl_csr(value, rd):
        regs[rd] = STATE.MVL                 # normal CSR behaviour: return the old value
        STATE.MVL = MIN(value, STATE.MVL)

    get_mvl_csr(rd):
        regs[rd] = STATE.MVL

    set_vl_csr(value, rd):
        STATE.VL = rd
        return STATE.VL

    get_vl_csr(rd):
        return STATE.VL
347
348 Note that where setting MVL behaves as a normal CSR (returns the old
349 value), unlike standard CSR behaviour, setting VL will return the **new**
350 value of VL **not** the old one.
351
352 For CSRRWI, the range of the immediate is restricted to 5 bits. In order to
353 maximise the effectiveness, an immediate of 0 is used to set VL=1,
354 an immediate of 1 is used to set VL=2 and so on:
355
    CSRRWI_Set_MVL(value):
        set_mvl_csr(value+1, x0)

    CSRRWI_Set_VL(value):
        set_vl_csr(value+1, x0)
361
362 However for CSRRW the following pseudocode is used for MVL and VL,
363 where setting the value to zero will cause an exception to be raised.
364 The reason is that if VL or MVL are set to zero, the STATE CSR is
365 not capable of storing that value.
366
    CSRRW_Set_MVL(rs1, rd):
        value = regs[rs1]
        if value == 0 or value > XLEN:
            raise Exception
        set_mvl_csr(value, rd)

    CSRRW_Set_VL(rs1, rd):
        value = regs[rs1]
        if value == 0 or value > XLEN:
            raise Exception
        set_vl_csr(value, rd)
378
379 In this way, when CSRRW is utilised with a loop variable, the value
380 that goes into VL (and into the destination register) may be used
381 in an instruction-minimal fashion:
382
    CSRvect1 = {type: F, key: a3, val: a3, elwidth: dflt}
    CSRvect2 = {type: F, key: a7, val: a7, elwidth: dflt}
      CSRRWI MVL, 3          # sets MVL == **4** (not 3)
      j zerotest             # in case loop counter a0 already 0
    loop:
      CSRRW VL, t0, a0       # vl = t0 = min(mvl, a0)
      ld a3, a1              # load 4 registers a3-6 from x
      slli t1, t0, 3         # t1 = vl * 8 (in bytes)
      ld a7, a2              # load 4 registers a7-10 from y
      add a1, a1, t1         # increment pointer to x by vl*8
      fmadd a7, a3, fa0, a7  # v1 += v0 * fa0 (y = a * x + y)
      sub a0, a0, t0         # n -= vl (t0)
      st a7, a2              # store 4 registers a7-10 to y
      add a2, a2, t1         # increment pointer to y by vl*8
    zerotest:
      bnez a0, loop          # repeat if n != 0
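
The assembly loop above is a strip-mining pattern: each pass processes `min(mvl, n)` elements, as its `CSRRW VL` comment states. A behavioural sketch in Python, under the assumption that x and y are plain lists and with the hypothetical name `daxpy`, is:

```python
def daxpy(a, x, y, mvl=4):
    """Compute y = a*x + y in chunks of at most mvl elements.

    Returns the list of chunk sizes used, i.e. the value VL took on
    each pass of the loop.
    """
    n = len(x)
    pos = 0
    chunks = []
    while n != 0:                        # zerotest: bnez a0, loop
        vl = min(mvl, n)                 # CSRRW VL, t0, a0
        for i in range(vl):              # the hardware loop behind fmadd
            y[pos + i] += a * x[pos + i]
        chunks.append(vl)
        pos += vl                        # pointer increments (add a1 / add a2)
        n -= vl                          # sub a0, a0, t0
    return chunks


y = [1.0] * 10
chunks = daxpy(2.0, list(range(10)), y)
```

For ten elements and MVL=4 the loop runs three times, with VL taking the values 4, 4 and 2; no separate tail-handling code is needed.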
399
400 With the STATE CSR, just like with CSRRWI, in order to maximise the
401 utilisation of the limited bitspace, "000000" in binary represents
VL==1, "000001" represents VL==2 and so on (likewise for MVL):
403
    CSRRW_Set_SV_STATE(rs1, rd):
        value = regs[rs1]
        get_state_csr(rd)                        # return the old STATE in rd
        STATE.MVL = set_mvl_csr(value[5:0]+1)    # maxvl field, bits 5..0
        STATE.VL = set_vl_csr(value[11:6]+1)     # vl field, bits 11..6
        STATE.srcoffs = value[17:12]
        STATE.destoffs = value[23:18]

    get_state_csr(rd):
        regs[rd] = (STATE.MVL-1) | (STATE.VL-1)<<6 | (STATE.srcoffs)<<12 |
                   (STATE.destoffs)<<18
        return regs[rd]
416
417 In both cases, whilst CSR read of VL and MVL return the exact values
418 of VL and MVL respectively, reading and writing the STATE CSR returns
419 those values **minus one**. This is absolutely critical to implement
420 if the STATE CSR is to be used for fast context-switching.
421
422 ## VL, MVL and SUBVL instruction aliases
423
424 This table contains pseudo-assembly instruction aliases. Note the
425 subtraction of 1 from the CSRRWI pseudo variants, to compensate for the
426 reduced range of the 5 bit immediate.
427
428 | alias | CSR |
429 | - | - |
430 | SETVL rd, rs | CSRRW VL, rd, rs |
431 | SETVLi rd, #n | CSRRWI VL, rd, #n-1 |
432 | GETVL rd | CSRRW VL, rd, x0 |
433 | SETMVL rd, rs | CSRRW MVL, rd, rs |
434 | SETMVLi rd, #n | CSRRWI MVL,rd, #n-1 |
435 | GETMVL rd | CSRRW MVL, rd, x0 |
436
437 Note: CSRRC and other bitsetting may still be used, they are however not particularly useful (very obscure).
438
439 ## Register key-value (CAM) table <a name="regcsrtable" />
440
*NOTE: in prior versions of SV, this table used to be writable and
accessible via CSRs. It is now stored in the VBLOCK instruction format. Note
that this table does **not** get applied to the SVPrefix P48/64 format,
only to scalar opcodes.*
445
446 The purpose of the Register table is three-fold:
447
* To mark integer and floating-point registers as requiring "redirection"
if they are ever used as a source or destination in any given operation.
450 This involves a level of indirection through a 5-to-7-bit lookup table,
451 such that **unmodified** operands with 5 bits (3 for some RVC ops) may
452 access up to **128** registers.
453 * To indicate whether, after redirection through the lookup table, the
454 register is a vector (or remains a scalar).
455 * To over-ride the implicit or explicit bitwidth that the operation would
456 normally give the register.
457
Note: clearly, if an RVC operation uses a 3 bit spec'd register (x8-x15)
and the Register table contains entries that only refer to registers
x1-x7 or x16-x31, such operations will *never* activate the VL hardware
loop!
462
463 If however the (16 bit) Register table does contain such an entry (x8-x15
464 or x2 in the case of LWSP), that src or dest reg may be redirected
465 anywhere to the *full* 128 register range. Thus, RVC becomes far more
powerful and has many more opportunities to reduce code size than in
467 Standard RV32/RV64 executables.
468
469 [[!inline raw="yes" pages="simple_v_extension/reg_table_format" ]]
470
471 i/f is set to "1" to indicate that the redirection/tag entry is to
472 be applied to integer registers; 0 indicates that it is relevant to
473 floating-point registers.
474
475 The 8 bit format is used for a much more compact expression. "isvec"
476 is implicit and, similar to [[sv_prefix_proposal]], the target vector
477 is "regnum<<2", implicitly. Contrast this with the 16-bit format where
478 the target vector is *explicitly* named in bits 8 to 14, and bit 15 may
479 optionally set "scalar" mode.
480
481 Note that whilst SVPrefix adds one extra bit to each of rd, rs1 etc.,
482 and thus the "vector" mode need only shift the (6 bit) regnum by 1 to
483 get the actual (7 bit) register number to use, there is not enough space
484 in the 8 bit format (only 5 bits for regnum) so "regnum<<2" is required.
485
486 vew has the following meanings, indicating that the instruction's
487 operand size is "over-ridden" in a polymorphic fashion:
488
489 | vew | bitwidth |
490 | --- | ------------------- |
491 | 00 | default (XLEN/FLEN) |
492 | 01 | 8 bit |
493 | 10 | 16 bit |
494 | 11 | 32 bit |
495
496 As the above table is a CAM (key-value store) it may be appropriate
497 (faster, implementation-wise) to expand it as follows:
498
499 [[!inline raw="yes" pages="simple_v_extension/reg_table" ]]
500
501 ## Predication Table <a name="predication_csr_table"></a>
502
503 *NOTE: in prior versions of SV, this table used to be writable and
504 accessible via CSRs. It is now stored in the VBLOCK instruction format.
505 The table does **not** apply to SVPrefix opcodes*
506
507 The Predication Table is a key-value store indicating whether, if a
508 given destination register (integer or floating-point) is referred to
509 in an instruction, it is to be predicated. Like the Register table, it
510 is an indirect lookup that allows the RV opcodes to not need modification.
511
512 It is particularly important to note
513 that the *actual* register used can be *different* from the one that is
514 in the instruction, due to the redirection through the lookup table.
515
516 * regidx is the register that in combination with the
517 i/f flag, if that integer or floating-point register is referred to in a
518 (standard RV) instruction results in the lookup table being referenced
519 to find the predication mask to use for this operation.
520 * predidx is the *actual* (full, 7 bit) register to be used for the
521 predication mask.
522 * inv indicates that the predication mask bits are to be inverted
523 prior to use *without* actually modifying the contents of the
524 register from which those bits originated.
525 * zeroing is either 1 or 0, and if set to 1, the operation must
526 place zeros in any element position where the predication mask is
527 set to zero. If zeroing is set to 0, unpredicated elements *must*
528 be left alone. Some microarchitectures may choose to interpret
529 this as skipping the operation entirely. Others which wish to
530 stick more closely to a SIMD architecture may choose instead to
531 interpret unpredicated elements as an internal "copy element"
532 operation (which would be necessary in SIMD microarchitectures
533 that perform register-renaming)
534 * ffirst is a special mode that stops sequential element processing when
535 a data-dependent condition occurs, whether a trap or a conditional test.
536 The handling of each (trap or conditional test) is slightly different:
537 see Instruction sections for further details
538
539 [[!inline raw="yes" pages="simple_v_extension/pred_table_format" ]]
540
541 The 8 bit format is a compact and less expressive variant of the full
542 16 bit format. Using the 8 bit format is very different: the predicate
register to use is implicit, and numbering begins implicitly from x9. The
544 regnum is still used to "activate" predication, in the same fashion as
545 described above.
546
547 The 16 bit Predication CSR Table is a key-value store, so
548 implementation-wise it will be faster to turn the table around (maintain
549 topologically equivalent state). Opportunities then exist to access
550 registers in unary form instead of binary, saving gates and power by
551 only activating "redirection" with a single AND gate, instead of
552 multiple multi-bit XORs (a CAM):
553
554 [[!inline raw="yes" pages="simple_v_extension/pred_table" ]]
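
The idea of "turning the table around" can be sketched as follows: instead of searching a key-value list on every register access, the entries are expanded once into tables indexed directly by register number. The `PredEntry` tuple and `build_direct_tables` are hypothetical names, the field names follow the bullet list above (with `is_int` standing in for the i/f flag), and the table size of 32 is an assumption based on 5-bit register operands.

```python
from collections import namedtuple

# Hypothetical in-memory form of one key-value Predication Table entry.
PredEntry = namedtuple("PredEntry", "regidx is_int predidx inv zeroing")


def build_direct_tables(entries, nregs=32):
    """Expand key-value entries into two tables indexed by register number."""
    int_tbl = [None] * nregs
    fp_tbl = [None] * nregs
    for e in entries:
        table = int_tbl if e.is_int else fp_tbl
        table[e.regidx] = e              # one slot per register: no search later
    return int_tbl, fp_tbl


entries = [
    PredEntry(regidx=5, is_int=True, predidx=9, inv=False, zeroing=True),
    PredEntry(regidx=5, is_int=False, predidx=10, inv=True, zeroing=False),
]
int_pred_reg, fp_pred_reg = build_direct_tables(entries)
```

Looking up whether x5 is predicated is then a single indexed read (`int_pred_reg[5]`), which corresponds to the single-AND-gate activation described above rather than a multi-bit compare against every key.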
555
556 So when an operation is to be predicated, it is the internal state that
557 is used. In Section 6.4.2 of Hwacha's Manual (EECS-2015-262) the following
558 pseudo-code for operations is given, where p is the explicit (direct)
559 reference to the predication register to be used:
560
    for (int i=0; i<vl; ++i)
        if ([!]preg[p][i])
            (d ? vreg[rd][i] : sreg[rd]) =
                iop(s1 ? vreg[rs1][i] : sreg[rs1],
                    s2 ? vreg[rs2][i] : sreg[rs2]); // for insts with 2 inputs
566
567 This instead becomes an *indirect* reference using the *internal* state
568 table generated from the Predication CSR key-value store, which is used
569 as follows.
570
    if type(iop) == INT:
        preg = int_pred_reg[rd]
    else:
        preg = fp_pred_reg[rd]

    for (int i=0; i<vl; ++i)
        predicate, zeroing = get_pred_val(type(iop) == INT, rd)
        if (predicate & (1<<i))
            result = iop(s1 ? regfile[rs1+i] : regfile[rs1],
                         s2 ? regfile[rs2+i] : regfile[rs2]);
            (d ? regfile[rd+i] : regfile[rd]) = result
            if preg.ffirst and result == 0:
                VL = i # result was zero, end loop early, return VL
                return
        else if (zeroing)
            (d ? regfile[rd+i] : regfile[rd]) = 0
587
588 Note:
589
590 * d, s1 and s2 are booleans indicating whether destination,
591 source1 and source2 are vector or scalar
592 * key-value CSR-redirection of rd, rs1 and rs2 have NOT been included
above, for clarity. rd, rs1 and rs2 must ALSO go through
594 register-level redirection (from the Register table) if they are
595 vectors.
596 * fail-on-first mode stops execution early whenever an operation
597 returns a zero value. floating-point results count both
598 positive-zero as well as negative-zero as "fail".
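
For experimentation, the pseudocode above can be turned into a runnable Python model. This is a sketch under several assumptions: the register file is a plain list, the predicate mask and the zeroing and ffirst flags are passed in directly instead of being looked up through the tables, and the function returns the (possibly truncated) vector length rather than writing a CSR. The name `predicated_op` is hypothetical.

```python
import operator


def predicated_op(regfile, iop, rd, rs1, rs2, vl, predicate,
                  d=True, s1=True, s2=True, zeroing=False, ffirst=False):
    """Element loop with predication, zeroing and fail-on-first.

    Returns the new vector length: vl, or the index at which
    fail-on-first stopped the loop.
    """
    for i in range(vl):
        dest = rd + i if d else rd
        if predicate & (1 << i):
            result = iop(regfile[rs1 + i if s1 else rs1],
                         regfile[rs2 + i if s2 else rs2])
            regfile[dest] = result
            if ffirst and result == 0:   # 0.0 and -0.0 both compare equal to 0
                return i
        elif zeroing:
            regfile[dest] = 0            # masked-out element is zeroed
    return vl


regs = [0] * 32
regs[8:12] = [7, 7, 7, 7]
regs[16:20] = [1, 2, 3, 4]
regs[20:24] = [5, 6, 7, 8]
new_vl = predicated_op(regs, operator.add, 8, 16, 20, vl=4,
                       predicate=0b1011, zeroing=True)

regs2 = [0] * 32
regs2[16:20] = [5, 6, 7, 8]
regs2[20:24] = [1, 6, 3, 4]
ff_vl = predicated_op(regs2, operator.sub, 8, 16, 20, vl=4,
                      predicate=0b1111, ffirst=True)
```

In the first call element 2 is masked out and zeroed; in the second, the subtraction at element 1 produces zero, so fail-on-first stops there and reports a vector length of 1.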
599
600 If written as a function, obtaining the predication mask (and whether
601 zeroing takes place) may be done as follows:
602
603 [[!inline raw="yes" pages="simple_v_extension/get_pred_value" ]]
604
605 Note here, critically, that **only** if the register is marked
606 in its **register** table entry as being "active" does the testing
607 proceed further to check if the **predicate** table entry is
608 also active.
609
610 Note also that this is in direct contrast to branch operations
for the storage of comparisons: in these specific circumstances
612 the requirement for there to be an active *register* entry
613 is removed.
614
615 ## Fail-on-First Mode <a name="ffirst-mode"></a>
616
617 ffirst is a special data-dependent predicate mode. There are two
618 variants: one is for faults: typically for LOAD/STORE operations,
619 which may encounter end of page faults during a series of operations.
620 The other variant is comparisons such as FEQ (or the augmented behaviour
621 of Branch), and any operation that returns a result of zero (whether
622 integer or floating-point). In the FP case, this includes negative-zero.
623
ffirst interacts with zeroing and non-zeroing predication. In non-zeroing
625 mode, masked-out operations are simply excluded from testing (can never
626 fail). However for fail-comparisons (not faults) in zeroing mode, the
627 result will be zero: this *always* "fails", thus on the very first
628 masked-out element ffirst will always terminate.
629
Note that ffirst mode works because execution must *appear* to occur in
"program order". An in-order architecture must execute the element
632 operations in sequence, whilst an out-of-order architecture must *commit*
633 the element operations in sequence and cancel speculatively-executed
634 ones (giving the appearance of in-order execution).
635
636 Note also, that if ffirst mode is needed without predication, a special
637 "always-on" Predicate Table Entry may be constructed by setting
638 inverse-on and using x0 as the predicate register. This
639 will have the effect of creating a mask of all ones, allowing ffirst
640 to be set.
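
The "always-on" trick relies only on x0 reading as zero and on the inversion being applied to the mask bits before use. A minimal sketch, with the hypothetical helper `effective_mask` and an assumed `XLEN` of 64:

```python
XLEN = 64  # assumption for this sketch: RV64


def effective_mask(pred_value, inv):
    """Return the predicate mask actually applied to the element loop."""
    all_ones = (1 << XLEN) - 1
    mask = pred_value & all_ones
    if inv:
        mask ^= all_ones     # invert the bits; the source register is untouched
    return mask


# x0 always reads as zero, so inverting it gives a mask of all ones.
always_on = effective_mask(0, inv=True)
```

Every element is then enabled, which gives ffirst a predicate entry to attach to without masking anything out.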
641
See [[appendix]] for more details on fail-on-first modes, as well as
pseudo-code.
644
645 ## REMAP and SHAPE CSRs <a name="remap" />
646
647 See optional [[remap]] section.
648
649 # Instruction Execution Order
650
651 Simple-V behaves as if it is a hardware-level "macro expansion system",
652 substituting and expanding a single instruction into multiple sequential
653 instructions with contiguous and sequentially-incrementing registers.
654 As such, it does **not** modify - or specify - the behaviour and semantics of
655 the execution order: that may be deduced from the **existing** RV
656 specification in each and every case.
657
658 So for example if a particular micro-architecture permits out-of-order
659 execution, and it is augmented with Simple-V, then wherever instructions
660 may be out-of-order then so may the "post-expansion" SV ones.
661
662 If on the other hand there are memory guarantees which specifically
663 prevent and prohibit certain instructions from being re-ordered
664 (such as the Atomicity Axiom, or FENCE constraints), then clearly
665 those constraints **MUST** also be obeyed "post-expansion".
666
667 It should be absolutely clear that SV is **not** about providing new
functionality or changing the existing behaviour of a micro-architectural
669 design, or about changing the RISC-V Specification.
670 It is **purely** about compacting what would otherwise be contiguous
671 instructions that use sequentially-increasing register numbers down
672 to the **one** instruction.
673
674 # Instructions <a name="instructions" />
675
676 See [[appendix]] for specific cases where instruction behaviour is
677 augmented. A greatly simplified example is below. Note that this
678 is the ADD implementation, not a separate VADD instruction:
679
680 [[!inline raw="yes" pages="simple_v_extension/simple_add_example" ]]
681
682 Note that several things have been left out of this example.
683 See [[appendix]] for additional examples that show how to add
684 support for additional features (twin predication, elwidth,
685 zeroing, SUBVL etc.)
686
687 Branches in particular have been transparently augmented to include
688 "collation" of comparison results into a tagged register.
689
690 # Exceptions
691
692 Exceptions may occur at any time, in any given underlying scalar
693 operation. This implies that context-switching (traps) may occur, and
694 operation must be returned to where it left off. That in turn implies
695 that the full state - including the current parallel element being
696 processed - has to be saved and restored. This is what the **STATE**
697 and **PCVBLK** CSRs are for.
698
699 The implications are that all underlying individual scalar operations
700 "issued" by the parallelisation have to appear to be executed sequentially.
701 The further implications are that if two or more individual element
702 operations are underway, and one with an earlier index causes an exception,
703 it will be necessary for the microarchitecture to **discard** or terminate
operations with higher indices. Optimised microarchitectures could
705 hypothetically store (cache) results, for subsequent replay if appropriate.
706
707 In short: exception handling **MUST** be precise, in-order, and exactly
708 like Standard RISC-V as far as the instruction execution order is
709 concerned, regardless of whether it is PC, PCVBLK, VL or SUBVL that
710 is currently being incremented.
711
712 # Hints
713
714 A "HINT" is an operation that has no effect on architectural state,
715 where its use may, by agreed convention, give advance notification
716 to the microarchitecture: branch prediction notification would be
717 a good example. Usually HINTs are where rd=x0.
718
719 With Simple-V being capable of issuing *parallel* instructions where
720 rd=x0, the space for possible HINTs is expanded considerably. VL
721 could be used to indicate different hints. In addition, if predication
722 is set, the predication register itself could hypothetically be passed
723 in as a *parameter* to the HINT operation.
724
No specific hints are yet defined in Simple-V.
726
727 # Vector Block Format <a name="vliw-format"></a>
728
729 The VBLOCK Format allows Register, Predication and Vector Length to be contextually associated with a group of RISC-V scalar opcodes. The format is as follows:
730
731 [[!inline raw="yes" pages="simple_v_extension/vblock_format_table" ]]
732
733 For more details, including the CSRs, see ancillary resource: [[vblock_format]]
734
735 # Under consideration <a name="issues"></a>
736
737 See [[discussion]]
738