# SV Load and Store

Links:

* <https://bugs.libre-soc.org/show_bug.cgi?id=561>
* <https://bugs.libre-soc.org/show_bug.cgi?id=572>
* <https://bugs.libre-soc.org/show_bug.cgi?id=571>
* <https://bugs.libre-soc.org/show_bug.cgi?id=940> post autoincrement mode
* <https://bugs.libre-soc.org/show_bug.cgi?id=1047> Data-Dependent Fail-First
* <https://llvm.org/devmtg/2016-11/Slides/Emerson-ScalableVectorizationinLLVMIR.pdf>
* <https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vector-loads-and-stores>
* [[ldst/discussion]]

## Rationale

All Vector ISAs dating back fifty years have extensive and comprehensive
Load and Store operations that go far beyond the capabilities of Scalar
RISC and most CISC processors, yet at their heart, on an individual
element basis, are no different from their RISC Scalar equivalents.

The resource savings from Vector LD/ST are significant and stem
from the fact that one single instruction can trigger a dozen (or, in
some microarchitectures such as Cray or NEC SX Aurora, hundreds of)
element-level Memory accesses.

Additionally, and simply: if the Arithmetic side of an ISA supports
Vector Operations, then in order to keep the ALUs 100% occupied the
Memory infrastructure (and the ISA itself) correspondingly needs Vector
Memory Operations as well.

Vectorised Load and Store also presents an extra dimension (literally)
which creates scenarios unique to Vector applications, that a Scalar (and
even a SIMD) ISA simply never encounters. SVP64 endeavours to add the
modes typically found in *all* Scalable Vector ISAs, without changing the
behaviour of the underlying Base (Scalar) v3.0B operations in any way.
(The sole apparent exception is Post-Increment Mode on LD/ST-update
instructions.)

## Modes overview

Vectorisation of Load and Store requires the creation, from scalar
operations, of a number of different modes:

* **fixed aka "unit" stride** - contiguous sequence with no gaps
* **element strided** - sequential but regularly offset, with gaps
* **vector indexed** - vector of base addresses and vector of offsets
* **Speculative fail-first** - where it makes sense to do so
* **Structure Packing** - covered in SV by [[sv/remap]] and Pack/Unpack Mode.

*Despite being constructed from Scalar LD/ST none of these Modes exist
or make sense in any Scalar ISA. They **only** exist in Vector ISAs
and are a critical part of their value*.

Also included in SVP64 LD/ST is both signed and unsigned Saturation,
as well as Element-width overrides and Twin-Predication.

Note also that Indexed [[sv/remap]] mode may be applied to both Scalar
LD/ST Immediate Defined Words *and* LD/ST Indexed Defined Words.
LD/ST-Indexed should not be conflated with Indexed REMAP mode:
clarification is provided below.

**Determining the LD/ST Modes**

A minor complication (caused by the retro-fitting of modern Vector
features to a Scalar ISA) is that certain features do not exactly make
sense or are considered a security risk. Fail-first on Vector Indexed
would allow attackers to probe large numbers of pages from userspace,
whereas strided fail-first (by creating contiguous sequential LDs) does not.

In addition, reduce mode makes no sense. Realistically we need an
alternative table definition for [[sv/svp64]] `RM.MODE`. The following
modes make sense:

* saturation
* predicate-result would be useful but is lower priority than Data-Dependent Fail-First
* simple (no augmentation)
* fail-first (where Vector Indexed is banned)
* Signed Effective Address computation (Vector Indexed only)

More than that, however, it is necessary to fit the usual Vector ISA
capabilities onto both Power ISA LD/ST with immediate and onto LD/ST
Indexed. They present subtly different Mode tables, which, due to lack
of space, have the following quirks:

* LD/ST Immediate has no individual control over src/dest zeroing,
  whereas LD/ST Indexed does.
* LD/ST Immediate has saturation but LD/ST Indexed does not.

## Format and fields

Fields used in tables below:

* **sz / dz** if predication is enabled will put zeros into the dest
  (or as src in the case of twin pred) when the predicate bit is zero.
  Otherwise the element is ignored or skipped, depending on context.
* **zz**: both sz and dz are set equal to this flag.
* **inv CR bit** just as in branches (BO) these bits allow testing of
  a CR bit and whether it is set (inv=0) or unset (inv=1)
* **N** sets signed/unsigned saturation.
* **RC1** as if Rc=1, stores CRs *but not the result*
* **SEA** - Signed Effective Address, if enabled performs sign-extension on
  registers that have been reduced due to elwidth overrides
* **PI** - post-increment mode (applies to LD/ST with update only).
  The Effective Address utilised is always just RA, i.e. the computation of
  EA is stored in RA **after** it is actually used.
* **LF** - Load/Store Fail or Fault First: for any reason Load or Store Vectors
  may be truncated to (at least) one element, and VL altered to indicate such.
* **VLi** - Inclusive Data-Dependent Fail-First: the failing element is included
  in the Truncated Vector.
* **els** - Element-strided Mode: the element index (after REMAP)
  is multiplied by the immediate offset (or Scalar RB for Indexed). Restrictions apply.

When VLi=0 on Store Operations the Memory update does **not** take place
on the element that failed. EA does **not** update into RA on Load/Store
with Update instructions either.

**LD/ST immediate**

The table for [[sv/svp64]] for `immed(RA)` which is `RM.MODE`
(bits 19:23 of `RM`) is:

| 0 | 1 | 2   | 3 4    | description                    |
|---|---|-----|--------|--------------------------------|
| 0 | 0 | 0   | zz els | simple mode                    |
| 0 | 0 | 1   | PI LF  | post-increment and Fault-First |
| 1 | 0 | N   | zz els | sat mode: N=0/1 u/s            |
|VLi| 1 | inv | CR-bit | ffirst CR sel                  |

The `els` bit is only relevant when `RA.isvec` is clear: this indicates
whether stride is unit or element:

```
if RA.isvec:
    svctx.ldstmode = indexed
elif els == 0:
    svctx.ldstmode = unitstride
elif immediate != 0:
    svctx.ldstmode = elementstride
```

An immediate of zero is a safety-valve to allow `LD-VSPLAT`: in effect
the multiplication of the immediate-offset by zero results in reading from
the exact same memory location, *even with a Vector register*. (Normally
this type of behaviour is reserved for the mapreduce modes)
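
A small model may help illustrate the resulting per-element Effective
Addresses for each of the four LD-immediate cases, including the
immediate-of-zero `LD-VSPLAT` case. This is an informal sketch only: the
names `ireg` and `op_width` and the concrete register values are
illustrative, not normative.

```
# illustrative model only: per-element EA for the LD-immediate modes above.
# ireg is a stand-in integer register file; op_width is the LD width in bytes.
def ld_immediate_eas(ireg, RA, RA_isvec, els, immed, VL, op_width=8):
    eas = []
    for i in range(VL):
        if RA_isvec:                  # "quirky" Vector-indexed-with-immediate
            ea = ireg[RA + i] + immed
        elif els and immed != 0:      # element-strided: index scales the immediate
            ea = ireg[RA] + i * immed
        elif not els:                 # unit-strided: contiguous elements
            ea = ireg[RA] + immed + i * op_width
        else:                         # els=1, immed=0: LD-VSPLAT, same EA every element
            ea = ireg[RA] + immed
        eas.append(ea)
    return eas

regs = {3: 0x1000}                    # scalar r3 = 0x1000, VL=4
print([hex(ea) for ea in ld_immediate_eas(regs, 3, False, False, 16, 4)])  # unit stride
print([hex(ea) for ea in ld_immediate_eas(regs, 3, False, True,  16, 4)])  # element stride
print([hex(ea) for ea in ld_immediate_eas(regs, 3, False, True,   0, 4)])  # VSPLAT: same EA
```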

For `LD-VSPLAT`, on non-cache-inhibited Loads, the read can occur just
the once and be copied, rather than hitting the Data Cache multiple
times with the same memory read at the same location. The benefit of
Cache-inhibited LD-splats is that they allow for memory-mapped peripherals
to have multiple data values read in quick succession and stored in
sequentially numbered registers (but, see Note below).

For non-cache-inhibited ST from a vector source onto a scalar destination:
with the Vector loop effectively creating multiple memory writes to
the same location, we can deduce that the last of these will be the
"successful" one. Thus, implementations are free and clear to optimise
out the overwriting STs, leaving just the last one as the "winner".
Bear in mind that predicate masks will skip some elements (in source
non-zeroing mode). Cache-inhibited ST operations on the other hand
**MUST** write out a Vector source multiple successive times to the exact
same Scalar destination. Just like Cache-inhibited LDs, multiple values
may be written out in quick succession to a memory-mapped peripheral
from sequentially-numbered registers.
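
That deduction can be expressed as a rough model (illustrative only: `mem`,
`vec_src` and `mask` are stand-ins, not part of the specification): for a
non-cache-inhibited Store from a Vector source to an unchanging Scalar
Effective Address, only the last unmasked element needs to reach Memory.

```
# illustrative only: non-cache-inhibited ST, vector source, scalar destination EA.
# every element writes the same address, so the last unmasked element "wins";
# an implementation may legally skip all the earlier, overwritten stores.
def st_vector_to_scalar_ea(mem, ea, vec_src, mask):
    last = None
    for i, value in enumerate(vec_src):
        if (mask >> i) & 1:      # source non-zeroing: masked-out elements are skipped
            last = value
    if last is not None:
        mem[ea] = last           # a single write has the same visible effect
    return mem

print(st_vector_to_scalar_ea({}, 0x2000, [1, 2, 3, 4], 0b1011))  # {8192: 4}
```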

Note that any memory location may be Cache-inhibited
(Power ISA v3.1, Book III, 1.6.1, p1033).

*Programmer's Note: an immediate combined with a Scalar source as a "VSPLAT"
mode is simply not possible: there are not enough Mode bits. One single
Scalar Load operation may be used instead, followed by any arithmetic
operation (including a simple mv) in "Splat" mode.*

**LD/ST Indexed**

The modes for the `RA+RB` indexed version are slightly different
but use the same `RM.MODE` bits (19:23 of `RM`):

| 0 | 1 | 2   | 3 4    | description   |
|---|---|-----|--------|---------------|
|els| 0 | SEA | dz sz  | simple mode   |
|VLi| 1 | inv | CR-bit | ffirst CR sel |

Vector Indexed Strided Mode is qualified as follows:

```
if els and !RA.isvec and !RB.isvec:
    svctx.ldstmode = elementstride
```

A summary of the effect of Vectorisation of src or dest:

```
imm(RA)  RT.v  RA.v       no stride allowed
imm(RA)  RT.s  RA.v       no stride allowed
imm(RA)  RT.v  RA.s       stride-select allowed
imm(RA)  RT.s  RA.s       not vectorised
RA,RB    RT.v  {RA|RB}.v  Standard Indexed
RA,RB    RT.s  {RA|RB}.v  Indexed but single LD (no VSPLAT)
RA,RB    RT.v  {RA&RB}.s  VSPLAT possible. stride selectable
RA,RB    RT.s  {RA&RB}.s  not vectorised (scalar identity)
```

Signed Effective Address computation is only relevant for Vector Indexed
Mode, when elwidth overrides are applied. The source override applies to
RB: if SEA is set, RB is sign-extended from elwidth bits to the full 64
bits before being added to RA in order to calculate the Effective Address.
For other Modes (ffirst, saturate), all EA computation with elwidth
overrides is unsigned.
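
A short sketch of that rule, using an illustrative `sext` helper (the
function names and values below are assumptions for demonstration, not
normative pseudocode): RA is always read at the full 64 bits, while RB is
read at the source element width and, when SEA is set, sign-extended
before the addition.

```
# illustrative model of Signed Effective Address (SEA) for LD/ST Indexed
def sext(value, bits):
    # sign-extend a `bits`-wide two's-complement value to a Python int
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)

def indexed_ea(ra_full, rb_elem, src_elwidth, sea):
    # ra_full: RA read at the full 64 bits (never elwidth-overridden)
    # rb_elem: the RB element, read at src_elwidth bits
    offs = sext(rb_elem, src_elwidth) if sea else rb_elem
    return (ra_full + offs) & ((1 << 64) - 1)

# with an 8-bit source elwidth, 0xFF is -1 when SEA=1, +255 when SEA=0
print(hex(indexed_ea(0x1000, 0xFF, 8, sea=True)))   # 0xfff
print(hex(indexed_ea(0x1000, 0xFF, 8, sea=False)))  # 0x10ff
```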

Note that cache-inhibited LD/ST when VSPLAT is activated will perform
**multiple** LD/ST operations, sequentially. Even with scalar src
a Cache-inhibited LD will read the same memory location *multiple
times*, storing the result in successive Vector destination registers.
This is because the cache-inhibit instructions are typically used to read
and write memory-mapped peripherals. If a genuine cache-inhibited
LD-VSPLAT is required then a single *scalar* cache-inhibited LD should
be performed, followed by a VSPLAT-augmented mv, copying the one *scalar*
value into multiple register destinations.

Note also that cache-inhibited VSPLAT with Data-Dependent Fail-First is possible.
This allows, for example, a massive batch of memory-mapped peripheral
reads to be issued, stopping at the first NULL character and
truncating VL to that point. No branch is needed to issue that large
burst of LDs, which may be valuable in Embedded scenarios.

## Vectorisation of Scalar Power ISA v3.0B

Scalar Power ISA Load/Store operations may be seen from [[isa/fixedload]]
and [[isa/fixedstore]] pseudocode to be of the form:

```
lbux RT, RA, RB
    EA <- (RA) + (RB)
    RT <- MEM(EA)
```

and for immediate variants:

```
lb RT,D(RA)
    EA <- RA + EXTS(D)
    RT <- MEM(EA)
```

Thus in the first example, the source registers may each be independently
marked as scalar or vector, and likewise the destination; in the second
example only the one source and one dest may be marked as scalar or
vector.

Thus we can see that Vector Indexed may be covered, and, as demonstrated
with the pseudocode below, the immediate can be used to give unit
stride or element stride. As there is no way to tell which from
the Power v3.0B Scalar opcode alone, the choice is provided instead by
the SV Context.

```
# LD not VLD! format - ldop RT, immed(RA)
# op_width: lb=1, lh=2, lw=4, ld=8
op_load(RT, RA, op_width, immed, svctx, RAupdate):
  ps = get_pred_val(FALSE, RA); # predication on src
  pd = get_pred_val(FALSE, RT); # ... AND on dest
  for (i=0, j=0, u=0; i < VL && j < VL;):
    # skip nonpredicated elements
    if (RA.isvec) while (!(ps & 1<<i)) i++;
    if (RAupdate.isvec) while (!(ps & 1<<u)) u++;
    if (RT.isvec) while (!(pd & 1<<j)) j++;
    if postinc:
        offs = 0; # added afterwards
        if RA.isvec: srcbase = ireg[RA+i]
        else srcbase = ireg[RA]
    elif svctx.ldstmode == elementstride:
        # element stride mode
        srcbase = ireg[RA]
        offs = i * immed              # j*immed for a ST
    elif svctx.ldstmode == unitstride:
        # unit stride mode
        srcbase = ireg[RA]
        offs = immed + (i * op_width) # j*op_width for ST
    elif RA.isvec:
        # quirky Vector indexed mode but with an immediate
        srcbase = ireg[RA+i]
        offs = immed;
    else
        # standard scalar mode (but predicated)
        # no stride multiplier means VSPLAT mode
        srcbase = ireg[RA]
        offs = immed

    # compute EA
    EA = srcbase + offs
    # load from memory
    ireg[RT+j] <= MEM[EA];
    # check post-increment of EA
    if postinc: EA = srcbase + immed;
    # update RA?
    if RAupdate: ireg[RAupdate+u] = EA;
    if (!RT.isvec)
        break # destination scalar, end now
    if (RA.isvec) i++;
    if (RAupdate.isvec) u++;
    if (RT.isvec) j++;
```

Indexed LD is:

```
# format: ldop RT, RA, RB
function op_ldx(RT, RA, RB, RAupdate=False) # LD not VLD!
  ps = get_pred_val(FALSE, RA); # predication on src
  pd = get_pred_val(FALSE, RT); # ... AND on dest
  for (i=0, j=0, k=0, u=0; i < VL && j < VL && k < VL):
    # skip nonpredicated RA, RB and RT
    if (RA.isvec) while (!(ps & 1<<i)) i++;
    if (RAupdate.isvec) while (!(ps & 1<<u)) u++;
    if (RB.isvec) while (!(ps & 1<<k)) k++;
    if (RT.isvec) while (!(pd & 1<<j)) j++;
    if svctx.ldstmode == elementstride:
        EA = ireg[RA] + ireg[RB]*j   # register-strided
    else
        EA = ireg[RA+i] + ireg[RB+k] # indexed address
    if RAupdate: ireg[RAupdate+u] = EA
    ireg[RT+j] <= MEM[EA];
    if (!RT.isvec)
        break # destination scalar, end immediately
    if (RA.isvec) i++;
    if (RAupdate.isvec) u++;
    if (RB.isvec) k++;
    if (RT.isvec) j++;
```

Note that Element-Strided uses the Destination Step (j) because, with both
sources being Scalar as a prerequisite condition for activation of
Element-Stride Mode, the source step (being Scalar) would never advance.

Note in both cases that [[sv/svp64]] allows RA-as-a-dest in "update"
mode (`ldux`) to be effectively a *completely different* register from
RA-as-a-source. This is because there is room in svp64 to extend RA-as-src
as well as RA-as-dest, both independently as scalar or vector *and*
independently extending their range.

*Programmer's note: being able to set RA-as-a-source as separate from
RA-as-a-destination as Scalar is **extremely valuable** once it is
remembered that Simple-V element operations must be in Program Order,
especially in loops, for saving on multiple address computations. Care
does have to be taken, however, that RA-as-src is not overwritten by
RA-as-dest unless intentionally desired, especially in element-strided
Mode.*

## LD/ST Indexed vs Indexed REMAP

Unfortunately the word "Indexed" is used twice in completely different
contexts, potentially causing confusion.

* Instructions of the form `ld RT,RA,RB` have existed in the Power ISA since
  its creation: these are called "LD/ST Indexed" instructions and their
  name and meaning are well-established.
* There now exists, in Simple-V, a [[sv/remap]] mode called "Indexed"
  Mode that can be applied to *any* instruction **including those
  named LD/ST Indexed**.

Whilst allowing REMAP Indexed Mode to be applied to any Vectorised LD/ST
Indexed operation such as `sv.ld *RT,RA,*RB` may be costly in terms of
register reads, or may even misleadingly be labelled as redundant, firstly
the strict application of the RISC Paradigm that Simple-V follows makes
it awkward to consider *preventing* the application of Indexed REMAP to
such operations, and secondly the two are not actually the same at all.

Indexed REMAP, as applied to RB in the instruction `sv.ld *RT,RA,*RB`,
effectively performs an *in-place* re-ordering of the offsets, RB.
To achieve the same effect without Indexed REMAP would require taking
a *copy* of the Vector of offsets starting at RB, manually explicitly
reordering them, and finally using the copy of re-ordered offsets in a
non-REMAP'ed `sv.ld`. Using non-strided LD as an example, the pseudocode
below shows what actually occurs (the pseudocode for `indexed_remap`
may be found in [[sv/remap]]):

```
# sv.ld *RT,RA,*RB with Index REMAP applied to RB
for i in 0..VL-1:
    if remap.indexed:
        rb_idx = indexed_remap(i) # remap
    else:
        rb_idx = i # use the index as-is
    EA = GPR(RA) + GPR(RB+rb_idx)
    GPR(RT+i) = MEM(EA, 8)
```

Thus it can be seen that the use of Indexed REMAP saves copying
and manual reordering of the Vector of RB offsets.
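
For comparison, a sketch of the manual alternative (the names below are
placeholders, and the scratch copy is shown as a Python list rather than
as registers): without Indexed REMAP the offsets must first be gathered
into a re-ordered copy, and a plain indexed load then run over that copy.

```
# illustrative only: what must be done *without* Indexed REMAP
def ld_with_manual_reorder(GPR, MEM, RT, RA, RB, index, VL):
    # 1. copy the Vector of offsets at RB into a scratch area, re-ordered
    scratch = [GPR[RB + index[i]] for i in range(VL)]
    # 2. plain (non-REMAP) indexed load using the re-ordered copy
    for i in range(VL):
        EA = GPR[RA] + scratch[i]
        GPR[RT + i] = MEM[EA]

GPR = {0: 0x100, 4: 24, 5: 8, 6: 16, 10: 0, 11: 0, 12: 0}
MEM = {0x108: 7, 0x110: 8, 0x118: 9}
ld_with_manual_reorder(GPR, MEM, RT=10, RA=0, RB=4, index=[1, 2, 0], VL=3)
print(GPR[10], GPR[11], GPR[12])   # 7 8 9
```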

## LD/ST ffirst (Fault-First)

LD/ST ffirst treats the first LD/ST in a vector (element 0 if REMAP
is not active) as an ordinary one, with all behaviour with respect to
Interrupts, Exceptions, Page Faults and Memory Management being identical
in every regard to Scalar v3.0 Power ISA LD/ST. However, for elements
1 and above, if an exception would occur, then VL is **truncated**
to the previous element: the exception is **not** then raised because
the LD/ST that would otherwise have caused an exception is *required*
to be cancelled. Additionally an implementor may choose to truncate VL
for any arbitrary reason *except for the very first element*.

ffirst LD/ST to multiple pages via a Vectorised Index base is
considered a security risk due to the abuse of probing multiple
pages in rapid succession and getting speculative feedback on which
pages would fail. Therefore Vector Indexed LD/ST is prohibited
entirely, and the Mode bit is instead used for element-strided LD/ST.
See <https://bugs.libre-soc.org/show_bug.cgi?id=561>

```
for(i = 0; i < VL; i++)
    reg[rt + i] = mem[reg[ra] + i * reg[rb]];
```

High security implementations where any kind of speculative probing of
memory pages is considered a risk should take advantage of the fact
that implementations may truncate VL at any point, without requiring
software to be rewritten and made non-portable. Such implementations may
choose to *always* set VL=1, which will have the effect of terminating
any speculative probing (and also adversely affect performance), but
will at least not require applications to be rewritten.

Low-performance, simpler hardware implementations may likewise choose
(always) to set VL=1 as the bare minimum compliant implementation of LD/ST
Fail-First. It is, however, critically important to remember that the first
element LD/ST **MUST** be treated as an ordinary LD/ST, i.e. **MUST**
raise exceptions exactly like an ordinary LD/ST.
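
A behavioural sketch of that rule (a model only: a Python dictionary
stands in for memory and `KeyError` stands in for a page fault; none of
the names are normative):

```
# illustrative model of LD/ST Fail-First (ffirst) VL truncation
def ld_ffirst(mem, base, stride, VL):
    loaded = []
    for i in range(VL):
        ea = base + i * stride
        try:
            loaded.append(mem[ea])   # ordinary per-element load attempt
        except KeyError:             # stand-in for an exception / page fault
            if i == 0:
                raise                # element 0 MUST behave as an ordinary scalar LD
            VL = i                   # later elements: cancel the LD, truncate VL instead
            break
    return loaded, VL

mem = {0x1000: 7, 0x1008: 8}            # 0x1010 onwards is "unmapped"
print(ld_ffirst(mem, 0x1000, 8, VL=4))  # ([7, 8], 2): VL truncated, no exception raised
```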

For ffirst LD/STs, VL may be truncated arbitrarily to a nonzero value
for any implementation-specific reason. For example: it is perfectly
reasonable for implementations to alter VL when ffirst LD or ST operations
are initiated on a nonaligned boundary, such that within a loop the
subsequent iteration of that loop begins the following ffirst LD/ST
operations on an aligned boundary such as the beginning of a cache line,
or beginning of a Virtual Memory page. Likewise, to reduce workloads or
balance resources.

Vertical-First Mode is slightly strange in that only one element at a time
is ever executed anyway. Given that programmers may legitimately choose
to alter srcstep and dststep in non-sequential order as part of explicit
loops, it is neither possible nor safe to make speculative assumptions
about future LD/STs. Therefore, Fail-First LD/ST in Vertical-First is
`UNDEFINED`. This is very different from Arithmetic (Data-dependent)
FFirst where Vertical-First Mode is fully deterministic, not speculative.

## Data-Dependent Fail-First (not Fail/Fault-First)

Not to be confused with Fail/Fault First, Data-Fail-First performs an
additional check on the data, and if the test
fails then VL is truncated and further looping terminates.
This is precisely the same as Arithmetic Data-Dependent Fail-First,
the only difference being that the result comes from the LD/ST
rather than from an Arithmetic operation.

There is also a crucial difference between Arithmetic and LD/ST
Data-Dependent Fail-First: except for Store-Conditional, a 4-bit Condition
Register Field test is created for testing purposes
*but not stored* (thus there is no RC1 Mode as there is in Arithmetic).
The reason a CR Field is not stored is that Load/Store, particularly
the Update instructions, is already expensive in register terms,
and adding an extra Vector write would be too costly in hardware.

*Programmer's note: Programmers
may use Data-Dependent Load with a test to truncate VL, and may then
follow up with a `sv.cmpi` or other operation. The important aspect is
that the Vector Load is truncated on finding, for example, a NULL pointer.*

*Programmer's note: Load-with-Update may be used to update
the register used in Effective Address computation of the
next element. This may be used to perform single-linked-list
walking, where Data-Dependent Fail-First terminates and
truncates the Vector at the first NULL.*

In the case of Store operations there is a quirk when VLi (VL-inclusive)
is clear. Bear in mind the criterion is that the truncated
Vector of results, when VLi is clear, must all pass the "test", but when
VLi is set the *currently-failing element* is permitted to be included. Thus,
the actual update (store) to Memory is **not permitted to take place**
should the test fail. Therefore, on testing the value to be stored,
when VLi=0 and the test fails, the Memory store must **not** occur.

Additionally, when VLi=0 and a test fails then RA does **not** receive a
copy of the Effective Address. Hardware implementations with Out-of-Order
Micro-Architectures should use speculative Shadow-Hold and Cancellation
when the test fails.

By contrast, if VLi=1 and the test fails, Store may proceed *and then*
looping terminates. In this way, when non-Inclusive, the Vector of
Truncated results contains only Stores that passed the test (and RA=EA
updates if any), and when Inclusive the Vector of Truncated results
contains the first-failed data.

Below is an example of loading the starting addresses of Linked-List
nodes. If VLi=1 it will load the NULL pointer into the Vector of results.
If, however, VLi=0 it will *exclude* the NULL pointer by truncating VL to
one Element earlier.

*Programmer's Note: by also setting the RC1 qualifier as well as setting
VLi=1 it is possible to establish a Predicate Mask such that the first
zero in the predicate will be the NULL pointer.*

```
RT=1    # vec - deliberately overlaps by one with RA
RA=0    # vec - first one is valid, contains ptr
imm = 8 # offset_of(ptr->next)
for i in range(VL):
    # this part is the Scalar Defined Word (standard scalar ld operation)
    EA = GPR(RA+i) + imm                   # ptr + offset(next)
    data = MEM(EA, 8)                      # 64-bit address of ptr->next
    GPR(RT+i) = data                       # happens to be read on next loop!
    # was a normal vector-ld up to this point. now the Data-Fail-First
    cr_test = conditions(data)
    if Rc=1 or RC1: CR.field(i) = cr_test  # only store if Rc=1/RC1
    if cr_test.EQ == testbit:              # check if zero
        if VLi then VL = i+1               # update VL, inclusive
        else        VL = i                 # update VL, exclusive current
        break                              # stop looping
```
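
The same walk can be expressed as a small runnable model (illustrative
only: a Python dictionary stands in for memory, and the "test" is simply
`data == 0`, i.e. a NULL next-pointer; none of the names are normative):

```
# illustrative model of the Data-Dependent Fail-First linked-list walk above
def ddff_list_walk(mem, head, next_offs, VL, VLi):
    gpr = [0] * (VL + 2)
    gpr[0] = head                       # RA=0 holds the first node pointer
    for i in range(VL):
        ea = gpr[0 + i] + next_offs     # EA = GPR(RA+i) + imm
        data = mem[ea]                  # 64-bit load of ptr->next
        gpr[1 + i] = data               # RT=1: overlaps RA by one, read on next loop
        if data == 0:                   # the fail-first data test: NULL pointer
            VL = i + 1 if VLi else i    # inclusive keeps the NULL, exclusive drops it
            break
    return gpr[1:1 + VL], VL

# three nodes at 0x100 -> 0x200 -> 0x300 -> NULL, with 'next' at offset 8
mem = {0x108: 0x200, 0x208: 0x300, 0x308: 0}
print(ddff_list_walk(mem, 0x100, 8, VL=8, VLi=False))  # ([512, 768], 2)
print(ddff_list_walk(mem, 0x100, 8, VL=8, VLi=True))   # ([512, 768, 0], 3)
```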

**Data-Dependent Fault-First on Store-Conditional (Rc=1)**

There are very few instructions that allow Rc=1 for Load/Store:
among them are `stdcx.` and the other Atomic Store-Conditional
instructions. With Simple-V being a loop around Scalar instructions
strictly obeying Scalar Program Order, a Horizontal-First Fail-First loop
on an Atomic Store-Conditional will always fail the second and all
subsequent Store-Conditional instructions, because
Load-Reservation and Store-Conditional are required to be executed
in pairs.

By contrast, in Vertical-First Mode it is in fact possible to issue
the pairs, and consequently allowing Vectorised Data-Dependent Fail-First is
useful.

*Programmer's note: care should be taken when VL is truncated in
Vertical-First Mode.*

**Future potential**

Although Rc=1 on LD/ST is a rare occurrence at present, future versions
of Power ISA *might* conceivably have Rc=1 LD/ST Scalar instructions, and
with the SVP64 Vectorisation Prefixing being a RISC paradigm that
is itself fully independent of the Scalar Suffix Defined Words, prohibiting
the possibility of Rc=1 Data-Dependent Mode on future potential LD/ST
operations is not strategically sound.

## LOAD/STORE Elwidths <a name="elwidth"></a>

Loads and Stores are almost unique in that the Power Scalar ISA
provides a width for the operation (lb, lh, lw, ld). Only `extsb` and
others like it provide an explicit operation width. There are therefore
*three* widths involved:

* operation width (lb=8, lh=16, lw=32, ld=64)
* src element width override (8/16/32/default)
* destination element width override (8/16/32/default)

Some care is therefore needed to express and make clear the transformations,
which are expressly in this order (the two destination-width adjustments
are sketched after the list):

* Calculate the Effective Address from RA at full width
  but (on Indexed Load) allow srcwidth overrides on RB
* Load at the operation width (lb/lh/lw/ld) as usual
* byte-reversal as usual
* Non-saturated mode:
  - zero-extension or truncation from operation width to dest elwidth
  - place result in destination at dest elwidth
* Saturated mode:
  - Sign-extension or truncation from operation width to dest width
  - signed/unsigned saturation down to dest elwidth
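
Below is a sketch of those two destination-width paths. The helper names
(`adjust_wid`, `clamp`) deliberately match the ones used in the pseudocode
further down; their bodies here are a model of the intent only, not a
normative definition.

```
# illustrative models of the two dest-elwidth paths after the memory read
def adjust_wid(value, op_width_bits, dest_elwidth_bits):
    # non-saturated: zero-extend or truncate the loaded value to dest elwidth
    return value & ((1 << dest_elwidth_bits) - 1)

def clamp(value, op_width_bits, dest_elwidth_bits, signed):
    # saturated: sign-extend (if signed) from the operation width first,
    # then saturate down to the destination element width
    sign = 1 << (op_width_bits - 1)
    v = (value & (sign - 1)) - (value & sign) if signed else value
    if signed:
        lo, hi = -(1 << (dest_elwidth_bits - 1)), (1 << (dest_elwidth_bits - 1)) - 1
    else:
        lo, hi = 0, (1 << dest_elwidth_bits) - 1
    return max(lo, min(hi, v))

print(hex(adjust_wid(0x1234, 32, 8)))       # 0x34: simple truncation
print(clamp(0x1234, 32, 8, signed=False))   # 255: unsigned saturation to 8 bits
print(clamp(0xFE00, 16, 8, signed=True))    # -128: -512 saturates to the signed 8-bit min
```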

In order to respect Power v3.0B Scalar behaviour the memory side
is treated effectively as completely separate and distinct from SV
augmentation. This is primarily down to quirks surrounding LE/BE and
byte-reversal.

It is rather unfortunately possible to request an elwidth override on
the memory side which does not mesh with the overridden operation width:
this results in `UNDEFINED` behaviour. The reason is that the effect
of attempting a 64-bit `sv.ld` operation with a source elwidth override
of 8/16/32 would result in overlapping memory requests, particularly
on unit and element strided operations. Thus it is `UNDEFINED` when
the elwidth is smaller than the memory operation width. Examples include
`sv.lw/sw=16/els` which requests (overlapping) 4-byte memory reads offset
from each other at 2-byte intervals. Store likewise is also `UNDEFINED`
where the dest elwidth override is less than the operation width.

Note the following regarding the pseudocode to follow:

* `scalar identity behaviour` SV Context parameter conditions turn this
  into a straight absolute fully-compliant Scalar v3.0B LD operation
* `brev` selects whether the operation is the byte-reversed variant (`ldbrx`
  rather than `ld`)
* `op_width` specifies the operation width (`lb`, `lh`, `lw`, `ld`) as
  a "normal" part of Scalar v3.0B LD
* `imm_offs` specifies the immediate offset `ld r3, imm_offs(r5)`, again
  as a "normal" part of Scalar v3.0B LD
* `svctx` specifies the SV Context and includes VL as well as
  source and destination elwidth overrides.

Below is the pseudocode for Unit-Strided LD (which includes Vector
capability). Observe in particular that RA, as the base address in both
Immediate and Indexed LD/ST, does not have element-width overriding
applied to it.

Note that predication, predication-zeroing, and other modes except
saturation have all been removed, for clarity and simplicity:

```
# LD not VLD!
# this covers unit stride mode and a type of vector offset
function op_ld(RT, RA, op_width, imm_offs, svctx)
  for (int i = 0, int j = 0; i < svctx.VL && j < svctx.VL):
    if not svctx.unit/el-strided:
        # strange vector mode, compute 64 bit address which is
        # not polymorphic! elwidth hardcoded to 64 here
        srcbase = get_polymorphed_reg(RA, 64, i)
    else:
        # unit / element stride mode, compute 64 bit address
        srcbase = get_polymorphed_reg(RA, 64, 0)
        # adjust for unit/el-stride
        srcbase += ....

    # read the underlying memory
    memread <= MEM(srcbase + imm_offs, op_width)

    # check saturation.
    if svctx.saturation_mode:
        # ... saturation adjustment...
        memread = clamp(memread, op_width, svctx.dest_elwidth)
    else:
        # truncate/extend to over-ridden dest width.
        memread = adjust_wid(memread, op_width, svctx.dest_elwidth)

    # takes care of inserting memory-read (now correctly byteswapped)
    # into regfile underlying LE-defined order, into the right place
    # within the NEON-like register, respecting destination element
    # bitwidth, and the element index (j)
    set_polymorphed_reg(RT, svctx.dest_elwidth, j, memread)

    # increments both src and dest element indices (no predication here)
    i++;
    j++;
```

Note above that the source elwidth is *not used at all* in LD-immediate.
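
`get_polymorphed_reg` and `set_polymorphed_reg` are defined along with the
rest of the SVP64 pseudocode; purely for illustration, a minimal model of
the intent (elements at the overridden width packed consecutively,
LE-ordered, into a flat byte-level view of the register file) is sketched
below. The model is an assumption for readability, not the normative
definition.

```
# illustrative model only: elements of width `elwidth` bits packed LE into
# a flat byte-addressable view of the register file, starting at register RA
regfile = bytearray(128 * 8)      # 128 x 64-bit GPRs as a flat byte array

def get_polymorphed_reg(RA, elwidth, i):
    nbytes = elwidth // 8
    offs = RA * 8 + i * nbytes
    return int.from_bytes(regfile[offs:offs + nbytes], "little")

def set_polymorphed_reg(RT, elwidth, j, value):
    nbytes = elwidth // 8
    offs = RT * 8 + j * nbytes
    regfile[offs:offs + nbytes] = value.to_bytes(nbytes, "little")

set_polymorphed_reg(4, 16, 3, 0xBEEF)       # element 3 of a 16-bit view of r4
print(hex(get_polymorphed_reg(4, 16, 3)))   # 0xbeef
print(hex(get_polymorphed_reg(4, 64, 0)))   # 0xbeef000000000000
```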

For LD/Indexed, the key is that in the calculation of the Effective Address,
RA has no elwidth override but RB does. Pseudocode below is simplified
for clarity: predication and all modes except saturation are removed:

```
# LD not VLD! ld*rx if brev else ld*
function op_ld(RT, RA, RB, op_width, svctx, brev)
  for (int i = 0, int j = 0; i < svctx.VL && j < svctx.VL):
    if not svctx.el-strided:
        # RA not polymorphic! elwidth hardcoded to 64 here
        srcbase = get_polymorphed_reg(RA, 64, i)
    else:
        # element stride mode, again RA not polymorphic
        srcbase = get_polymorphed_reg(RA, 64, 0)
    # RB *is* polymorphic
    offs = get_polymorphed_reg(RB, svctx.src_elwidth, i)
    # sign-extend
    if svctx.SEA: offs = sext(offs, svctx.src_elwidth, 64)

    # takes care of (merges) processor LE/BE and ld/ldbrx
    bytereverse = brev XNOR MSR.LE

    # read the underlying memory
    memread <= MEM(srcbase + offs, op_width)

    # optionally performs byteswap at op width
    if (bytereverse):
        memread = byteswap(memread, op_width)

    if svctx.saturation_mode:
        # ... saturation adjustment...
        memread = clamp(memread, op_width, svctx.dest_elwidth)
    else:
        # truncate/extend to over-ridden dest width.
        memread = adjust_wid(memread, op_width, svctx.dest_elwidth)

    # takes care of inserting memory-read (now correctly byteswapped)
    # into regfile underlying LE-defined order, into the right place
    # within the NEON-like register, respecting destination element
    # bitwidth, and the element index (j)
    set_polymorphed_reg(RT, svctx.dest_elwidth, j, memread)

    # increments both src and dest element indices (no predication here)
    i++;
    j++;
```

## Remapped LD/ST

In the [[sv/remap]] page the concept of "Remapping" is described. Whilst
it is expensive to set up (2 64-bit opcodes minimum) it provides a way to
arbitrarily perform 1D, 2D and 3D "remapping" of up to 64 elements worth
of LDs or STs. The usual interest in such re-mapping is for example in
separating out 24-bit RGB channel data into separate contiguous registers.
NEON covers this as shown in the diagram below:

![Load/Store remap](/openpower/sv/load-store.svg)

REMAP easily covers this capability, and with dest elwidth overrides
and saturation may do so with built-in conversion that would normally
require additional width-extension, sign-extension and min/max Vectorised
instructions as post-processing stages.

Thus we do not need to provide specialist LD/ST "Structure Packed" opcodes
because the generic abstracted concept of "Remapping", when applied to
LD/ST, will give that same capability, with far more flexibility.

It is worth noting that Pack/Unpack Modes of SVSTATE, which may be
established through `svstep`, are also an easy way to perform regular
Structure Packing, at the vec2/vec3/vec4 granularity level. Beyond that,
REMAP will need to be used.
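
As an illustration of the intent (not of any particular REMAP encoding or
Pack/Unpack setting), de-interleaving packed RGB byte triplets into three
separate contiguous destination groups amounts to the following gather,
which a REMAP or vec3 Pack/Unpack Schedule performs without the explicit
index arithmetic:

```
# illustrative only: the "structure packing" effect that REMAP/Pack-Unpack
# provides, shown as an explicit de-interleave of RGB byte triplets
def deinterleave_rgb(mem, base, npixels):
    r, g, b = [], [], []
    for i in range(npixels):
        r.append(mem[base + 3 * i + 0])
        g.append(mem[base + 3 * i + 1])
        b.append(mem[base + 3 * i + 2])
    return r, g, b

pixels = {0x100 + k: v for k, v in enumerate([10, 20, 30, 11, 21, 31, 12, 22, 32])}
print(deinterleave_rgb(pixels, 0x100, 3))  # ([10, 11, 12], [20, 21, 22], [30, 31, 32])
```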

**Parallel Reduction REMAP**

No REMAP Schedule is prohibited in SVP64 because the RISC-paradigm Prefix
is completely separate from the RISC-paradigm Scalar Defined Words. Although
obscure, there does exist the outside possibility that Parallel Reduction
Schedules on LD/ST would find a use in Computer Science.
Readers are invited to contact the authors of this document if one is ever
found.

--------

[[!tag standards]]

\newpage{}