1 # SV Load and Store
2
3 This section describes how Standard Load/Store Defined Word-instructions are exploited as
4 Element-level Load/Stores and augmented to create direct equivalents of
5 Vector Load/Store instructions.
6
7 <!-- hide -->
8 Links:
9
10 * <https://bugs.libre-soc.org/show_bug.cgi?id=561>
11 * <https://bugs.libre-soc.org/show_bug.cgi?id=572>
12 * <https://bugs.libre-soc.org/show_bug.cgi?id=571>
13 * <https://bugs.libre-soc.org/show_bug.cgi?id=940> post autoincrement mode
14 * <https://bugs.libre-soc.org/show_bug.cgi?id=1047> Data-Dependent Fail-First
15 * <https://llvm.org/devmtg/2016-11/Slides/Emerson-ScalableVectorizationinLLVMIR.pdf>
16 * <https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vector-loads-and-stores>
17 * [[ldst/discussion]]
18
19 ## Rationale
20
21 All Vector ISAs dating back fifty years have extensive and comprehensive
22 Load and Store operations that go far beyond the capabilities of Scalar
23 RISC and most CISC processors, yet at their heart on an individual element
24 basis may be found to be no different from RISC Scalar equivalents.
25
The resource savings from Vector LD/ST are significant and stem
from the fact that one single instruction can trigger a dozen (or, in
some microarchitectures such as Cray or NEC SX Aurora, hundreds of)
element-level Memory accesses.
30
31 Additionally, and simply: if the Arithmetic side of an ISA supports
32 Vector Operations, then in order to keep the ALUs 100% occupied the
33 Memory infrastructure (and the ISA itself) correspondingly needs Vector
34 Memory Operations as well.
35
Vectorized Load and Store also present an extra dimension (literally)
which creates scenarios unique to Vector applications that a Scalar (and
even a SIMD) ISA simply never encounters: not even the complex Addressing
Modes of the 68000 or S/360 resemble Vector Load/Store.
SVP64 endeavours to add the
modes typically found in *all* Scalable Vector ISAs, without changing the
behaviour of the underlying Base (Scalar) v3.0B operations in any way.
(The sole apparent exception is Post-Increment Mode on LD/ST-update
instructions.)
45 <!-- show -->
46
47 ## Modes overview
48
Vectorization of Load and Store requires the creation, from scalar
operations, of a number of different modes:
51
52 * **fixed aka "unit" stride** - contiguous sequence with no gaps
53 * **element strided** - sequential but regularly offset, with gaps
54 * **vector indexed** - vector of base addresses and vector of offsets
55 * **Speculative Fault-first** - where it makes sense to do so
56 * **Data-Dependent Fail-First** - Conditional truncation of Vector Length
57 * **Structure Packing** - covered in SV by [[sv/remap]] and Pack/Unpack Mode.
58
*Despite being constructed from Scalar LD/ST none of these Modes exist
or make sense in any Scalar ISA. They **only** exist in Vector ISAs
and are a critical part of a Vector ISA's value*.
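
As an informal illustration only (a sketch, not the normative pseudocode
given later in this section), the Effective Address of element `i` under
the three addressing modes above could be modelled as follows, with all
names being placeholders:

```
# Illustrative model only: how the EA of element i is formed under each
# addressing mode. All names and values are hypothetical.
def ea_unit_stride(base, imm, i, op_width):
    # contiguous: consecutive elements sit op_width bytes apart
    return base + imm + i * op_width

def ea_element_stride(base, imm, i):
    # regularly offset: consecutive elements sit imm bytes apart (gaps allowed)
    return base + i * imm

def ea_vector_indexed(bases, offsets, i):
    # per-element base and per-element offset, both taken from vectors
    return bases[i] + offsets[i]

print([hex(ea_unit_stride(0x1000, 0, i, 8)) for i in range(4)])   # 0x1000, 0x1008, ...
print([hex(ea_element_stride(0x1000, 24, i)) for i in range(4)])  # 0x1000, 0x1018, ...
print(hex(ea_vector_indexed([0x1000, 0x2000], [8, 16], 1)))       # 0x2010
```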
62
Also included in SVP64 LD/ST are Element-width overrides and Twin-Predication.
64
65 Note also that Indexed [[sv/remap]] mode may be applied to both Scalar
66 LD/ST Immediate Defined Word-instructions *and* LD/ST Indexed Defined Word-instructions.
67 LD/ST-Indexed should not be conflated with Indexed REMAP mode:
68 clarification is provided below.
69
70 **Determining the LD/ST Modes**
71
72 A minor complication (caused by the retro-fitting of modern Vector
73 features to a Scalar ISA) is that certain features do not exactly make
74 sense or are considered a security risk. Fault-first on Vector Indexed
would allow attackers to probe large numbers of pages from userspace,
whereas strided Fault-first (by creating contiguous sequential LDs likely
to be in the same Page) does not.
78
In addition, reduce mode makes no sense for LD/ST. Realistically we need an
80 alternative table definition for [[sv/svp64]] `RM.MODE`. The following
81 modes make sense:
82
83 * simple (no augmentation)
84 * Fault-first (where Vector Indexed is banned)
85 * Data-dependent Fail-First (extremely useful for Linked-List pointer-chasing)
86 * Signed Effective Address computation (Vector Indexed only, on RB)
87
More than that, however, it is necessary to fit the usual Vector ISA
capabilities onto both Power ISA LD/ST with immediate and onto LD/ST
90 Indexed. They present subtly different Mode tables, which, due to lack
91 of space, have the following quirks:
92
93 * LD/ST Immediate has no individual control over src/dest zeroing,
94 whereas LD/ST Indexed does.
95 * LD/ST Immediate has saturation but LD/ST Indexed does not.
96
97 ## Format and fields
98
99 Fields used in tables below:
100
* **zz**: both sz and dz are set equal to this flag.
If predication is enabled, zeros will be put into the dest
(or used as src in the case of twin pred) when the predicate bit is zero;
otherwise the element is ignored or skipped, depending on context.
* **inv CR bit**: just as in branches (BO), these bits allow testing of
a CR bit and whether it is set (inv=0) or unset (inv=1)
107 * **RC1** as if Rc=1, stores CRs *but not the result*
108 * **SEA** - Signed Effective Address, if enabled performs sign-extension on
109 registers that have been reduced due to elwidth overrides
* **PI** - post-increment mode (applies to LD/ST with update only).
The Effective Address utilised is always just RA, i.e. the computed
EA is stored in RA **after** it is actually used.
113 * **LF** - Load/Store Fail or Fault First: for any reason Load or Store Vectors
114 may be truncated to (at least) one element, and VL altered to indicate such.
115 * **VLi** - Inclusive Data-Dependent Fail-First: the failing element is included
116 in the Truncated Vector.
117 * **els** - Element-strided Mode: the element index (after REMAP)
118 is multiplied by the immediate offset (or Scalar RB for Indexed). Restrictions apply.
119
120 When VLi=0 on Store Operations the Memory update does **not** take place
121 on the element that failed. EA does **not** update into RA on Load/Store
122 with Update instructions either.
123
124 **LD/ST immediate**
125
126 The table for [[sv/svp64]] for `immed(RA)` which is `RM.MODE`
127 (bits 19:23 of `RM`) is:
128
129 | 0 | 1 | 2 | 3 4 | description |
130 |---|---| --- |---------|--------------------------- |
131 |els| 0 | PI | zz LF | post-increment and Fault-First |
132 |VLi| 1 | inv | CR-bit | Data-Dependent ffirst CR sel |
133
134 The `els` bit is only relevant when `RA.isvec` is clear: this indicates
135 whether stride is unit or element:
136
```
if RA.isvec:
    svctx.ldstmode = indexed
elif els == 0:
    svctx.ldstmode = unitstride
elif immediate != 0:
    svctx.ldstmode = elementstride
```
145
An immediate of zero is a safety-valve to allow `LD-VSPLAT`: in effect,
multiplying a zero immediate-offset by the element index results in reading
from the exact same memory location for every element, *even with a Vector
register*. (Normally this type of behaviour is reserved for the mapreduce
modes.)
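
As a hedged worked example (mirroring the normative `op_load` pseudocode
further below, using illustrative values `imm=8`, `op_width=4`, `VL=4`),
the offsets generated by the LD/ST-immediate form are:

```
# Informal model of LD/ST-immediate offset generation (see op_load below
# for the normative pseudocode). imm, op_width and VL are example values.
def immediate_offsets(mode, imm, op_width, VL):
    if mode == "unitstride":        # els=0: contiguous, gapless
        return [imm + i * op_width for i in range(VL)]
    if mode == "elementstride":     # els=1: strided by the immediate
        return [i * imm for i in range(VL)]
    raise ValueError(mode)

print(immediate_offsets("unitstride", 8, 4, 4))      # [8, 12, 16, 20]
print(immediate_offsets("elementstride", 8, 4, 4))   # [0, 8, 16, 24]
# LD-VSPLAT: element-strided with imm=0 reads the same location every time
print(immediate_offsets("elementstride", 0, 4, 4))   # [0, 0, 0, 0]
```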
150
151 For `LD-VSPLAT`, on non-cache-inhibited Loads, the read can occur just
152 the once and be copied, rather than hitting the Data Cache multiple
153 times with the same memory read at the same location. The benefit of
Cache-inhibited LD-splats is that they allow memory-mapped peripherals
155 to have multiple data values read in quick succession and stored in
156 sequentially numbered registers (but, see Note below).
157
158 For non-cache-inhibited ST from a vector source onto a scalar destination:
159 with the Vector loop effectively creating multiple memory writes to
160 the same location, we can deduce that the last of these will be the
161 "successful" one. Thus, implementations are free and clear to optimise
162 out the overwriting STs, leaving just the last one as the "winner".
163 Bear in mind that predicate masks will skip some elements (in source
164 non-zeroing mode). Cache-inhibited ST operations on the other hand
165 **MUST** write out a Vector source multiple successive times to the exact
166 same Scalar destination. Just like Cache-inhibited LDs, multiple values
167 may be written out in quick succession to a memory-mapped peripheral
168 from sequentially-numbered registers.
169
170 Note that any memory location may be Cache-inhibited
171 (Power ISA v3.1, Book III, 1.6.1, p1033)
172
*Programmer's Note: a "VSPLAT" mode for LD-immediate with a Scalar source
is simply not possible: there are not enough Mode bits. One single
Scalar Load operation may be used instead, followed by any arithmetic
operation (including a simple mv) in "Splat" mode.*
177
178 **LD/ST Indexed**
179
180 The modes for `RA+RB` indexed version are slightly different
181 but are the same `RM.MODE` bits (19:23 of `RM`):
182
183 | 0 | 1 | 2 | 3 4 | description |
184 |---|---| --- |---------|--------------------------- |
185 |els| 0 | PI | zz SEA | post-increment and Fault-First |
186 |VLi| 1 | inv | CR-bit | Data-Dependent ffirst CR sel |
187
188 Vector Indexed Strided Mode is qualified as follows:
189
```
if els and !RA.isvec and !RB.isvec:
    svctx.ldstmode = elementstride
```
194
195 A summary of the effect of Vectorization of src or dest:
196
```
imm(RA)  RT.v  RA.v        no stride allowed
imm(RA)  RT.s  RA.v        no stride allowed
imm(RA)  RT.v  RA.s        stride-select allowed
imm(RA)  RT.s  RA.s        not vectorized
RA,RB    RT.v  {RA|RB}.v   Standard Indexed
RA,RB    RT.s  {RA|RB}.v   Indexed but single LD (no VSPLAT)
RA,RB    RT.v  {RA&RB}.s   VSPLAT possible. stride selectable
RA,RB    RT.s  {RA&RB}.s   not vectorized (scalar identity)
```
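
The summary above can be read as a decision procedure. The following is an
informal sketch only, combining the two qualification snippets above
(function and argument names are illustrative, not part of the specification):

```
# Illustrative decision procedure for the SV Context ldstmode.
def ldstmode(form, RA_isvec, RB_isvec, els, immediate=0):
    if form == "immediate":              # ldop RT, immed(RA)
        if RA_isvec:
            return "indexed"             # quirky vector-indexed, immediate offset
        if els == 0:
            return "unitstride"
        # els=1: element-strided; an immediate of zero degenerates to
        # LD-VSPLAT (every element reads the same location)
        return "elementstride"
    if form == "indexed":                # ldop RT, RA, RB
        if els and not RA_isvec and not RB_isvec:
            return "elementstride"
        return "indexed"
    raise ValueError(form)

# examples, matching rows of the table above
assert ldstmode("immediate", RA_isvec=True,  RB_isvec=False, els=0) == "indexed"
assert ldstmode("indexed",   RA_isvec=False, RB_isvec=False, els=1) == "elementstride"
```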
207
Signed Effective Address computation is only relevant for Vector Indexed
Mode, when elwidth overrides are applied. The source override applies to
RB: if SEA is set, RB is sign-extended from elwidth bits to the full 64 bits
before being added to RA to calculate the Effective Address.
For other Modes (ffirst), all EA computation with elwidth
overrides is unsigned. RA is *never* altered (never truncated)
by element-width overrides.
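
A small informal sketch of SEA (not the normative definition, which appears
in the Elwidths pseudocode later): RB's elwidth-reduced element is
sign-extended before the 64-bit addition with RA:

```
# Sketch of Signed Effective Address computation: RB's element is read at
# the overridden element width and, when SEA=1, sign-extended to 64 bits
# before the add. RA is always used at the full 64 bits.
def sext(value, from_bits, to_bits=64):
    sign = 1 << (from_bits - 1)
    mask = (1 << to_bits) - 1
    return ((value ^ sign) - sign) & mask

def effective_address(ra, rb_elem, src_elwidth, SEA):
    offs = rb_elem & ((1 << src_elwidth) - 1)    # elwidth-reduced RB element
    if SEA:
        offs = sext(offs, src_elwidth)           # signed offset
    return (ra + offs) & ((1 << 64) - 1)

# 0xFFFF as a 16-bit element: -1 when SEA=1, +65535 when SEA=0
print(hex(effective_address(0x1000, 0xFFFF, 16, SEA=True)))   # 0xfff
print(hex(effective_address(0x1000, 0xFFFF, 16, SEA=False)))  # 0x10fff
```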
215
216 Note that cache-inhibited LD/ST when VSPLAT is activated will perform
217 **multiple** LD/ST operations, sequentially. Even with scalar src
218 a Cache-inhibited LD will read the same memory location *multiple
219 times*, storing the result in successive Vector destination registers.
This is because the cache-inhibit instructions are typically used to read
221 and write memory-mapped peripherals. If a genuine cache-inhibited
222 LD-VSPLAT is required then a single *scalar* cache-inhibited LD should
223 be performed, followed by a VSPLAT-augmented mv, copying the one *scalar*
224 value into multiple register destinations.
225
Note also that cache-inhibited VSPLAT with Data-Dependent Fail-First is possible.
This allows, for example, a massive batch of memory-mapped
peripheral reads to be issued, stopping at the first NULL (zero) character and
truncating VL to that point. No branch is needed to issue that large
burst of LDs, which may be valuable in Embedded scenarios.
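
A hedged behavioural model of that scenario, where `read_mmio` is a
hypothetical stand-in for a cache-inhibited access to a memory-mapped
peripheral:

```
# Model of a cache-inhibited LD-VSPLAT combined with Data-Dependent
# Fail-First: the same peripheral address is read repeatedly, results land
# in successive destination registers, and VL truncates at the first zero.
def ci_vsplat_ddffirst(read_mmio, addr, regs, RT, VL, VLi=False):
    for i in range(VL):
        data = read_mmio(addr)            # every element re-reads the device
        if data == 0:                     # fail condition: NULL/zero byte
            if VLi:
                regs[RT + i] = data       # inclusive: failing element kept
                return i + 1              # truncated VL
            return i                      # exclusive: failing element dropped
        regs[RT + i] = data
    return VL

# toy usage: a "peripheral" that streams a NUL-terminated string
stream = iter(b"hello\0world")
regs = [0] * 32
vl = ci_vsplat_ddffirst(lambda a: next(stream), 0x40000000, regs, RT=8, VL=16)
print(vl, bytes(regs[8:8 + vl]))          # 5 b'hello'
```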
231
232 ## Vectorization of Scalar Power ISA v3.0B
233
234 Scalar Power ISA Load/Store operations may be seen from [[isa/fixedload]]
235 and [[isa/fixedstore]] pseudocode to be of the form:
236
237 ```
238 lbux RT, RA, RB
239 EA <- (RA) + (RB)
240 RT <- MEM(EA)
241 ```
242
243 and for immediate variants:
244
245 ```
246 lb RT,D(RA)
247 EA <- RA + EXTS(D)
248 RT <- MEM(EA)
249 ```
250
251 Thus in the first example, the source registers may each be independently
252 marked as scalar or vector, and likewise the destination; in the second
253 example only the one source and one dest may be marked as scalar or
254 vector.
255
256 Thus we can see that Vector Indexed may be covered, and, as demonstrated
257 with the pseudocode below, the immediate can be used to give unit
258 stride or element stride. With there being no way to tell which from
259 the Power v3.0B Scalar opcode alone, the choice is provided instead by
260 the SV Context.
261
```
# LD not VLD! format - ldop RT, immed(RA)
# op_width: lb=1, lh=2, lw=4, ld=8
op_load(RT, RA, op_width, immed, svctx, RAupdate):
  ps = get_pred_val(FALSE, RA); # predication on src
  pd = get_pred_val(FALSE, RT); # ... AND on dest
  for (i=0, j=0, u=0; i < VL && j < VL;):
    # skip non-predicated elements
    if (RA.isvec) while (!(ps & 1<<i)) i++;
    if (RAupdate.isvec) while (!(ps & 1<<u)) u++;
    if (RT.isvec) while (!(pd & 1<<j)) j++;
    if postinc:
        offs = 0; # added afterwards
        if RA.isvec: srcbase = ireg[RA+i]
        else         srcbase = ireg[RA]
    elif svctx.ldstmode == elementstride:
        # element stride mode
        srcbase = ireg[RA]
        offs = i * immed              # j*immed for a ST
    elif svctx.ldstmode == unitstride:
        # unit stride mode
        srcbase = ireg[RA]
        offs = immed + (i * op_width) # j*op_width for ST
    elif RA.isvec:
        # quirky Vector indexed mode but with an immediate
        srcbase = ireg[RA+i]
        offs = immed;
    else
        # standard scalar mode (but predicated)
        # no stride multiplier means VSPLAT mode
        srcbase = ireg[RA]
        offs = immed

    # compute EA
    EA = srcbase + offs
    # load from memory
    ireg[RT+j] <= MEM[EA];
    # check post-increment of EA
    if postinc: EA = srcbase + immed;
    # update RA?
    if RAupdate: ireg[RAupdate+u] = EA;
    if (!RT.isvec)
        break # destination scalar, end now
    if (RA.isvec) i++;
    if (RAupdate.isvec) u++;
    if (RT.isvec) j++;
```
309
310 Indexed LD is:
311
```
# format: ldop RT, RA, RB
function op_ldx(RT, RA, RB, RAupdate=False) # LD not VLD!
  ps = get_pred_val(FALSE, RA); # predication on src
  pd = get_pred_val(FALSE, RT); # ... AND on dest
  for (i=0, j=0, k=0, u=0; i < VL && j < VL && k < VL):
    # skip nonpredicated RA, RB and RT
    if (RA.isvec) while (!(ps & 1<<i)) i++;
    if (RAupdate.isvec) while (!(ps & 1<<u)) u++;
    if (RB.isvec) while (!(ps & 1<<k)) k++;
    if (RT.isvec) while (!(pd & 1<<j)) j++;
    if svctx.ldstmode == elementstride:
        EA = ireg[RA] + ireg[RB]*j   # register-strided
    else
        EA = ireg[RA+i] + ireg[RB+k] # indexed address
    if RAupdate: ireg[RAupdate+u] = EA
    ireg[RT+j] <= MEM[EA];
    if (!RT.isvec)
        break # destination scalar, end immediately
    if (RA.isvec) i++;
    if (RAupdate.isvec) u++;
    if (RB.isvec) k++;
    if (RT.isvec) j++;
```
336
Note that Element-Strided mode uses the Destination Step (j): because both
sources being Scalar is a prerequisite condition for activation of
Element-Stride Mode, the source step (being Scalar) would never advance.
340
341 Note in both cases that [[sv/svp64]] allows RA-as-a-dest in "update"
342 mode (`ldux`) to be effectively a *completely different* register from
RA-as-a-source. This is because there is room in svp64 to extend RA-as-src
344 as well as RA-as-dest, both independently as scalar or vector *and*
345 independently extending their range.
346
347 *Programmer's note: being able to set RA-as-a-source as separate from
348 RA-as-a-destination as Scalar is **extremely valuable** once it is
349 remembered that Simple-V element operations must be in Program Order,
350 especially in loops, for saving on multiple address computations. Care
351 does have to be taken however that RA-as-src is not overwritten by
352 RA-as-dest unless intentionally desired, especially in element-strided
353 Mode.*
354
355 ## LD/ST Indexed vs Indexed REMAP
356
357 Unfortunately the word "Indexed" is used twice in completely different
358 contexts, potentially causing confusion.
359
* There have existed instructions of the form `ld RT,RA,RB` in the Power ISA
since its creation: these are called "LD/ST Indexed" instructions and their
name and meaning are well-established.
363 * There now exists, in Simple-V, a [[sv/remap]] mode called "Indexed"
364 Mode that can be applied to *any* instruction **including those
365 named LD/ST Indexed**.
366
Whilst allowing REMAP Indexed Mode to be applied to any Vectorized LD/ST
Indexed operation such as `sv.ld *RT,RA,*RB` may be costly in terms of
register reads, and may even be misleadingly labelled as redundant, firstly
the strict application of the RISC Paradigm that Simple-V follows makes
it awkward to consider *preventing* the application of Indexed REMAP to
such operations, and secondly the two are not actually the same at all.
373
374 Indexed REMAP, as applied to RB in the instruction `sv.ld *RT,RA,*RB`
375 effectively performs an *in-place* re-ordering of the offsets, RB.
376 To achieve the same effect without Indexed REMAP would require taking
377 a *copy* of the Vector of offsets starting at RB, manually explicitly
378 reordering them, and finally using the copy of re-ordered offsets in a
379 non-REMAP'ed `sv.ld`. Using non-strided LD as an example, pseudocode
380 showing what actually occurs, where the pseudocode for `indexed_remap`
381 may be found in [[sv/remap]]:
382
```
# sv.ld *RT,RA,*RB with Index REMAP applied to RB
for i in 0..VL-1:
  if remap.indexed:
      rb_idx = indexed_remap(i) # remap
  else:
      rb_idx = i                # use the index as-is
  EA = GPR(RA) + GPR(RB+rb_idx)
  GPR(RT+i) = MEM(EA, 8)
```
393
394 Thus it can be seen that the use of Indexed REMAP saves copying
395 and manual reordering of the Vector of RB offsets.
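
For comparison, an informal sketch of what would be required *without*
Indexed REMAP: the Vector of offsets must be copied into spare registers
(`RB_copy` is a hypothetical allocation), reordered using the same
`indexed_remap` function (treated here as a given), and only then used by a
plain `sv.ld`. `GPR` and `MEM` are simple models for illustration:

```
# Without Indexed REMAP: the Vector of offsets at RB must be copied and
# explicitly reordered before a plain (non-REMAP) sv.ld can be issued,
# costing extra registers and extra instructions.
def manual_reorder_then_load(GPR, MEM, RA, RB, RB_copy, RT, VL, indexed_remap):
    for i in range(VL):                          # explicit copy-and-reorder loop
        GPR[RB_copy + i] = GPR[RB + indexed_remap(i)]
    for i in range(VL):                          # plain sv.ld *RT,RA,*RB_copy
        EA = GPR[RA] + GPR[RB_copy + i]
        GPR[RT + i] = MEM[EA]

# toy demo with a hypothetical 4-entry Index schedule
GPR = list(range(64)); MEM = {k: k * 10 for k in range(200)}
manual_reorder_then_load(GPR, MEM, RA=2, RB=8, RB_copy=16, RT=24, VL=4,
                         indexed_remap=lambda i: [2, 0, 1, 3][i])
print(GPR[24:28])   # [120, 100, 110, 130]
```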
396
397 ## LD/ST ffirst (Fault-First)
398
399 LD/ST ffirst treats the first LD/ST in a vector (element 0 if REMAP
400 is not active and predication is not applied)
401 as an ordinary one, with all behaviour with respect to
Interrupts, Exceptions, Page Faults and Memory Management being identical
in every regard to Scalar v3.0 Power ISA LD/ST. However, for elements
404 1 and above, if an exception would occur, then VL is **truncated**
405 to the previous element: the exception is **not** then raised because
406 the LD/ST that would otherwise have caused an exception is *required*
407 to be cancelled. Additionally an implementor may choose to truncate VL
408 for any arbitrary reason *except for the very first*.
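
A hedged behavioural sketch of the exception rule only (truncation for
other implementation-specific reasons, described below, is not modelled;
all names are illustrative):

```
# Behavioural model of Speculative Fault-First: the first element's
# exception is delivered as usual; an exception on element 1 onwards is
# suppressed and VL is truncated to the elements already completed.
class PageFault(Exception):
    pass

def ffirst_load(load_elem, VL):
    completed = []
    for i in range(VL):
        try:
            completed.append(load_elem(i))
        except PageFault:
            if i == 0:
                raise                  # element 0: ordinary trap semantics
            return completed, i        # truncate VL, exception not raised
    return completed, VL

# toy usage: element 3 would fault, so VL becomes 3
def loader(i):
    if i == 3:
        raise PageFault()
    return 100 + i

print(ffirst_load(loader, VL=8))       # ([100, 101, 102], 3)
```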
409
410 ffirst LD/ST to multiple pages via a Vectorized Index base is
411 considered a security risk due to the abuse of probing multiple
412 pages in rapid succession and getting speculative feedback on which
413 pages would fail. Therefore Vector Indexed LD/ST is prohibited
414 entirely, and the Mode bit instead used for element-strided LD/ST.
415 <!-- hide -->
416 See <https://bugs.libre-soc.org/show_bug.cgi?id=561>
417 <!-- show -->
418
```
for(i = 0; i < VL; i++)
    reg[rt + i] = mem[reg[ra] + i * reg[rb]];
```
423
424 High security implementations where any kind of speculative probing of
425 memory pages is considered a risk should take advantage of the fact
426 that implementations may truncate VL at any point, without requiring
427 software to be rewritten and made non-portable. Such implementations may
428 choose to *always* set VL=1 which will have the effect of terminating
429 any speculative probing (and also adversely affect performance), but
430 will at least not require applications to be rewritten.
431
Low-performance simpler hardware implementations may also choose to
(always) set VL=1 as the bare minimum compliant implementation of LD/ST
434 Fail-First. It is however critically important to remember that the first
435 element LD/ST **MUST** be treated as an ordinary LD/ST, i.e. **MUST**
436 raise exceptions exactly like an ordinary LD/ST.
437
438 For ffirst LD/STs, VL may be truncated arbitrarily to a nonzero value
439 for any implementation-specific reason. For example: it is perfectly
440 reasonable for implementations to alter VL when ffirst LD or ST operations
441 are initiated on a nonaligned boundary, such that within a loop the
442 subsequent iteration of that loop begins the following ffirst LD/ST
443 operations on an aligned boundary such as the beginning of a cache line,
444 or beginning of a Virtual Memory page. Likewise, to reduce workloads or
445 balance resources.
446
447 When Predication is used, the "first" element is considered to be the first
448 non-predicated element rather than specifically `srcstep=0`.
449
450 Vertical-First Mode is slightly strange in that only one element at a time
451 is ever executed anyway. Given that programmers may legitimately choose
452 to alter srcstep and dststep in non-sequential order as part of explicit
453 loops, it is neither possible nor safe to make speculative assumptions
454 about future LD/STs. Therefore, Fail-First LD/ST in Vertical-First is
455 `UNDEFINED`. This is very different from Arithmetic (Data-dependent)
456 FFirst where Vertical-First Mode is fully deterministic, not speculative.
457
458 ## Data-Dependent Fail-First (not Fail/Fault-First)
459
460 Not to be confused with Fail/Fault First, Data-Fail-First performs an
461 additional check on the data, and if the test
462 fails then VL is truncated and further looping terminates.
463 This is precisely the same as Arithmetic Data-Dependent Fail-First,
464 the only difference being that the result comes from the LD/ST
465 rather than from an Arithmetic operation.
466
Important to note is that reduce mode is implied by Data-Dependent Fail-First.
In other words, where normally the looping would terminate at the first
load/store if the destination is Scalar, Data-Dependent Fail-First
*continues*, just as it does in reduce mode.
471
There is also a crucial difference between Arithmetic and LD/ST Data-Dependent Fail-First:
473 except for Store-Conditional a 4-bit Condition Register Field test is created
474 for testing purposes
475 *but not stored* (thus there is no RC1 Mode as there is in Arithmetic).
476 The reason why a CR Field is not stored is because Load/Store, particularly
477 the Update instructions, is already expensive in register terms,
478 and adding an extra Vector write would be too costly in hardware.
479
480 *Programmer's note: Programmers
481 may use Data-Dependent Load with a test to truncate VL, and may then
482 follow up with a `sv.cmpi` or other operation. The important aspect is
that the Vector Load is truncated on finding a NULL pointer, for example.*
484
485 *Programmer's note: Load-with-Update may be used to update
the register used in Effective Address computation of the
487 next element. This may be used to perform single-linked-list
488 walking, where Data-Dependent Fail-First terminates and
489 truncates the Vector at the first NULL.*
490
491 **Load/Store Data-Dependent Fail-First, VLi=0**
492
In the case of Store operations there is a quirk when VLi ("VL
Inclusive") is clear. Bear in mind the criterion is that the truncated
Vector of results, when VLi is clear, must all pass the "test", but when
VLi is set the *current failed test* is permitted to be included. Thus,
497 the actual update (store) to Memory is **not permitted to take place**
498 should the test fail.
499
500 Additionally in any Load/Store with Update instruction,
501 when VLi=0 and a test fails then RA does **not** receive a
502 copy of the Effective Address. Hardware implementations with Out-of-Order
503 Micro-Architectures should use speculative Shadow-Hold and Cancellation
504 (or other Transactional Rollback mechanism) when the test fails.
505
506 * **Load, VLi=0**: perform the Memory Load, do not put the result into the regfile yet (or EA into RA). Test the Loaded data: if fail do not store the Load in the register file (or EA into RA). Otherwise proceed with updating regfiles. VL is truncated to "only elements that passed the test"
507 * **Store, VLi=0**: even before the Store takes place, perform the test on the data to *be* stored. If fail do not proceed with the Store at all. VL is truncated to "only elements that passed the test"
508
509 **Load/Store Data-Dependent Fail-First, VLi=1**
510
511 By contrast if VLi=1 and the test fails, the Store may proceed *and then*
512 looping terminates. In this way, when Inclusive the Vector of Truncated results
contains the first-failed data (including RA on Updates).
514
515 * **Load, VLi=1**: perform the Memory Load, complete it in full (including EA into RA). Test the Loaded data: if fail then VL is truncated to "elements tested".
* **Store, VLi=1**: same as Load. Perform the Store in full and after-the-fact carry out the test of the original data requested to be stored. If fail then VL is truncated to "elements tested". (A Store-side sketch contrasting VLi=0 and VLi=1 follows below.)
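
A hedged Store-side sketch contrasting the two settings (illustrative only;
`test` stands in for the Data-Dependent condition, and `MEM`/`addrs`/`data`
are simple models):

```
# Data-Dependent Fail-First on Store: with VLi=0 the failing element's
# Store is suppressed and VL excludes it; with VLi=1 the Store completes
# and VL includes the failing element.
def ddffirst_store(MEM, addrs, data, VL, test, VLi):
    for i in range(VL):
        if not test(data[i]):            # test the value to *be* stored
            if VLi:
                MEM[addrs[i]] = data[i]  # inclusive: the store proceeds
                return i + 1
            return i                     # exclusive: the store is suppressed
        MEM[addrs[i]] = data[i]
    return VL

MEM = {}
vl = ddffirst_store(MEM, [0x10, 0x18, 0x20], [5, 0, 7], 3,
                    test=lambda x: x != 0, VLi=False)
print(vl, MEM)     # 1 {16: 5}
```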
517
518 Below is an example of loading the starting addresses of Linked-List
519 nodes. If VLi=1 it will load the NULL pointer into the Vector of results.
520 If however VLi=0 it will *exclude* the NULL pointer by truncating VL to
521 one Element earlier (only loading non-NULL data into registers).
522
*Programmer's Note: by also setting the RC1 qualifier as well as setting
VLi=1, it is possible to establish a Predicate Mask such that the first
zero in the predicate will be the NULL pointer.*
526
```
RT=1 # vec - deliberately overlaps by one with RA
RA=0 # vec - first one is valid, contains ptr
imm = 8 # offset_of(ptr->next)
for i in range(VL):
    # this part is the Scalar Defined Word-instruction (standard scalar ld operation)
    EA = GPR(RA+i) + imm           # ptr + offset(next)
    data = MEM(EA, 8)              # 64-bit address of ptr->next
    # was a normal vector-ld up to this point. now the Data-Fail-First
    cr_test = conditions(data)
    if Rc=1 or RC1: CR.field(i) = cr_test # only store if Rc=1/RC1
    action_load = True
    if cr_test.EQ == testbit: # check if zero
        if VLI then
            VL = i+1            # update VL, inclusive
        else
            VL = i              # update VL, exclusive current
            action_load = False # current load excluded
        stop = True # stop looping
    if action_load:
        GPR(RT+i) = data # happens to be read on next loop!
    if stop: break
```
550
551 **Data-Dependent Fail-First on Store-Conditional (Rc=1)**
552
There are very few instructions that allow Rc=1 for Load/Store:
among them are `stdcx.` and the other Atomic Store-Conditional
instructions. With Simple-V being a loop around Scalar instructions
strictly obeying Scalar Program Order, a Horizontal-First Fail-First loop
on an Atomic Store-Conditional will always fail the second and all other
Store-Conditional instructions, because
Load-Reservation and Store-Conditional are required to be executed
in pairs.
561
562 By contrast, in Vertical-First Mode it is in fact possible to issue
563 the pairs, and consequently allowing Vectorized Data-Dependent Fail-First is
564 useful.
565
566 Programmer's note: Care should be taken when VL is truncated in
567 Vertical-First Mode.
568
569 **Future potential**
570
Although Rc=1 on LD/ST is a rare occurrence at present, future versions
of Power ISA *might* conceivably have Rc=1 LD/ST Scalar instructions, and
with the SVP64 Vectorization Prefix being a RISC paradigm that
is fully independent of the Scalar Suffix Defined Word-instructions, prohibiting
the possibility of Rc=1 Data-Dependent Mode on future potential LD/ST
operations is not strategically sound.
577
578 ## LOAD/STORE Elwidths <a name="elwidth"></a>
579
580 Loads and Stores are almost unique in that the Power Scalar ISA
581 provides a width for the operation (lb, lh, lw, ld). Only `extsb` and
582 others like it provide an explicit operation width. There are therefore
583 *three* widths involved:
584
585 * operation width (lb=8, lh=16, lw=32, ld=64)
586 * src element width override (8/16/32/default)
587 * destination element width override (8/16/32/default)
588
589 Some care is therefore needed to express and make clear the transformations,
590 which are expressly in this order:
591
592 * Calculate the Effective Address from RA at full width
593 but (on Indexed Load) allow srcwidth overrides on RB
594 * Load at the operation width (lb/lh/lw/ld) as usual
595 * byte-reversal as usual
596 * zero-extension or truncation from operation width to dest elwidth
597 * place result in destination at dest elwidth
598
599 In order to respect Power v3.0B Scalar behaviour the memory side
600 is treated effectively as completely separate and distinct from SV
601 augmentation. This is primarily down to quirks surrounding LE/BE and
602 byte-reversal.
603
604 It is rather unfortunately possible to request an elwidth override on
605 the memory side which does not mesh with the overridden operation width:
this results in `UNDEFINED` behaviour. The reason is that the effect
607 of attempting a 64-bit `sv.ld` operation with a source elwidth override
608 of 8/16/32 would result in overlapping memory requests, particularly
609 on unit and element strided operations. Thus it is `UNDEFINED` when
610 the elwidth is smaller than the memory operation width. Examples include
611 `sv.lw/sw=16/els` which requests (overlapping) 4-byte memory reads offset
612 from each other at 2-byte intervals. Store likewise is also `UNDEFINED`
613 where the dest elwidth override is less than the operation width.
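
A small numerical illustration of the `sv.lw/sw=16/els` example above
(address arithmetic only, assuming a hypothetical base address):

```
# Why elwidth smaller than the memory operation width is UNDEFINED:
# sv.lw (4-byte operation) stepped at a 2-byte (sw=16) interval produces
# overlapping requests. Illustrative address arithmetic only.
op_width = 4       # lw: 4-byte memory operation
step = 2           # sw=16 override: elements stepped at 2-byte intervals
base = 0x1000
ranges = [(base + i * step, base + i * step + op_width) for i in range(4)]
print(ranges)      # [(4096, 4100), (4098, 4102), (4100, 4104), (4102, 4106)]
# each 4-byte read overlaps the next by 2 bytes: behaviour is UNDEFINED
```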
614
615 Note the following regarding the pseudocode to follow:
616
617 * `scalar identity behaviour` SV Context parameter conditions turn this
618 into a straight absolute fully-compliant Scalar v3.0B LD operation
619 * `brev` selects whether the operation is the byte-reversed variant (`ldbrx`
620 rather than `ld`)
621 * `op_width` specifies the operation width (`lb`, `lh`, `lw`, `ld`) as
622 a "normal" part of Scalar v3.0B LD
623 * `imm_offs` specifies the immediate offset `ld r3, imm_offs(r5)`, again
624 as a "normal" part of Scalar v3.0B LD
625 * `svctx` specifies the SV Context and includes VL as well as
626 source and destination elwidth overrides.
627
628 Below is the pseudocode for Unit-Strided LD (which includes Vector
629 capability). Observe in particular that RA, as the base address in both
630 Immediate and Indexed LD/ST, does not have element-width overriding
631 applied to it.
632
633 Note that predication, predication-zeroing, and other modes
634 have all been removed, for clarity and simplicity:
635
```
# LD not VLD!
# this covers unit stride mode and a type of vector offset
function op_ld(RT, RA, op_width, imm_offs, svctx)
  for (int i = 0, int j = 0; i < svctx.VL && j < svctx.VL):
    if not svctx.unit/el-strided:
        # strange vector mode, compute 64 bit address which is
        # not polymorphic! elwidth hardcoded to 64 here
        srcbase = get_polymorphed_reg(RA, 64, i)
    else:
        # unit / element stride mode, compute 64 bit address
        srcbase = get_polymorphed_reg(RA, 64, 0)
        # adjust for unit/el-stride
        srcbase += .... uses op_width here

    # read the underlying memory
    memread <= MEM(srcbase + imm_offs, op_width)

    # truncate/extend to over-ridden dest width.
    memread = adjust_wid(memread, op_width, svctx.elwidth)

    # takes care of inserting memory-read (now correctly byteswapped)
    # into regfile underlying LE-defined order, into the right place
    # using Element-Packing starting at register RT, respecting destination
    # element bitwidth, and the element index (j)
    set_polymorphed_reg(RT, svctx.elwidth, j, memread)

    # increments both src and dest element indices (no predication here)
    i++;
    j++;
```
667
668 Note above that the source elwidth is *not used at all* in LD-immediate: RA
669 never has elwidth overrides, leaving the elwidth free for truncation/extension
670 of the result.
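
A hedged, highly-simplified model of the element-packing performed by
`set_polymorphed_reg` above (not the reference implementation: predication,
SVSTATE and element-width encodings are omitted), treating the register
file as a flat little-endian byte array:

```
# Model of destination element packing: the regfile is viewed as a flat
# LE byte array, and element j of width 'elwidth' bits starting at
# register RT is written at byte offset RT*8 + j*elwidth//8.
regfile = bytearray(128 * 8)          # 128 x 64-bit registers

def set_polymorphed_reg(RT, elwidth, j, value):
    nbytes = elwidth // 8
    offs = RT * 8 + j * nbytes
    regfile[offs:offs + nbytes] = value.to_bytes(nbytes, "little")

# four 16-bit loads packed into the single 64-bit register r4
for j, val in enumerate([0x1111, 0x2222, 0x3333, 0x4444]):
    set_polymorphed_reg(4, 16, j, val)
print(regfile[4*8:5*8].hex())         # 1111222233334444 in LE byte order
```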
671
672 For LD/Indexed, the key is that in the calculation of the Effective Address,
673 RA has no elwidth override but RB does. Pseudocode below is simplified
674 for clarity: predication and all modes are removed:
675
```
# LD not VLD! ld*rx if brev else ld*
function op_ld(RT, RA, RB, op_width, svctx, brev)
  for (int i = 0, int j = 0; i < svctx.VL && j < svctx.VL):
    if not svctx.el-strided:
        # RA not polymorphic! elwidth hardcoded to 64 here
        srcbase = get_polymorphed_reg(RA, 64, i)
    else:
        # element stride mode, again RA not polymorphic
        srcbase = get_polymorphed_reg(RA, 64, 0)
    # RB *is* polymorphic
    offs = get_polymorphed_reg(RB, svctx.src_elwidth, i)
    # sign-extend
    if svctx.SEA: offs = sext(offs, svctx.src_elwidth, 64)

    # takes care of (merges) processor LE/BE and ld/ldbrx
    bytereverse = brev XNOR MSR.LE

    # read the underlying memory
    memread <= MEM(srcbase + offs, op_width)

    # optionally performs byteswap at op width
    if (bytereverse):
        memread = byteswap(memread, op_width)

    # truncate/extend to over-ridden dest width.
    dest_width = op_width if RT.isvec else 64
    memread = adjust_wid(memread, op_width, dest_width)

    # takes care of inserting memory-read (now correctly byteswapped)
    # into regfile underlying LE-defined order, into the right place
    # within the NEON-like register, respecting destination element
    # bitwidth, and the element index (j)
    set_polymorphed_reg(RT, dest_width, j, memread)

    # increments both src and dest element indices (no predication here)
    i++;
    j++;
```
715
716 *Programmer's note: with no destination elwidth override the destination
717 width must be implicitly ascertained. The assumption is that if the destination
is a Scalar then the entire 64-bit register must be written, thus the width is
719 extended to 64-bit. If however the destination is a Vector then it is deemed
720 appropriate to use the LD/ST width and to perform contiguous register element
721 packing at that width. The justification for doing so is that if further
722 sign-extension or saturation is required after a LD, these may be performed by a
723 follow-up instruction that uses a source elwidth override matching the exact width
724 of the LD operation. Correspondingly for a ST a destination elwidth override
725 on a prior instruction may match the exact width of the ST instruction.*
726
727 ## Remapped LD/ST
728
729 In the [[sv/remap]] page the concept of "Remapping" is described. Whilst
730 it is expensive to set up (2 64-bit opcodes minimum) it provides a way to
731 arbitrarily perform 1D, 2D and 3D "remapping" of up to 64 elements worth
732 of LDs or STs. The usual interest in such re-mapping is for example in
733 separating out 24-bit RGB channel data into separate contiguous registers.
734 NEON covers this as shown in the diagram below:
735
736 ![Load/Store remap](/openpower/sv/load-store.svg)
737
738 REMAP easily covers this capability, and with dest elwidth overrides
739 and saturation may do so with built-in conversion that would normally
740 require additional width-extension, sign-extension and min/max Vectorized
741 instructions as post-processing stages.
742
743 Thus we do not need to provide specialist LD/ST "Structure Packed" opcodes
744 because the generic abstracted concept of "Remapping", when applied to
745 LD/ST, will give that same capability, with far more flexibility.
746
747 It is worth noting that Pack/Unpack Modes of SVSTATE, which may be
748 established through `svstep`, are also an easy way to perform regular
749 Structure Packing, at the vec2/vec3/vec4 granularity level. Beyond that,
750 REMAP will need to be used.
751
752 **Parallel Reduction REMAP**
753
754 No REMAP Schedule is prohibited in SVP64 because the RISC-paradigm Prefix
755 is completely separate from the RISC-paradigm Scalar Defined Word-instructions. Although
obscure, there does exist the outside possibility that
757 Parallel Reduction Schedules on LD/ST would find a use in Computer Science.
758 Readers are invited to contact the authors of this document if one is ever
759 found.
760
761 --------
762
763 [[!tag standards]]
764
765 \newpage{}