1 # SV Load and Store
2
3 This section describes how Standard Load/Store Defined Word-instructions are exploited as
4 Element-level Load/Stores and augmented to create direct equivalents of
5 Vector Load/Store instructions.
6
7 <!-- hide -->
8 Links:
9
10 * <https://bugs.libre-soc.org/show_bug.cgi?id=561>
11 * <https://bugs.libre-soc.org/show_bug.cgi?id=572>
12 * <https://bugs.libre-soc.org/show_bug.cgi?id=571>
13 * <https://bugs.libre-soc.org/show_bug.cgi?id=940> post autoincrement mode
14 * <https://bugs.libre-soc.org/show_bug.cgi?id=1047> Data-Dependent Fail-First
15 * <https://llvm.org/devmtg/2016-11/Slides/Emerson-ScalableVectorizationinLLVMIR.pdf>
16 * <https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vector-loads-and-stores>
17 * [[ldst/discussion]]
18
19 ## Rationale
20
21 All Vector ISAs dating back fifty years have extensive and comprehensive
22 Load and Store operations that go far beyond the capabilities of Scalar
23 RISC and most CISC processors, yet at their heart on an individual element
24 basis may be found to be no different from RISC Scalar equivalents.
25
The resource savings from Vector LD/ST are significant and stem
from the fact that one single instruction can trigger a dozen (or, in
some microarchitectures such as Cray or NEC SX Aurora, hundreds of)
element-level Memory accesses.
30
31 Additionally, and simply: if the Arithmetic side of an ISA supports
32 Vector Operations, then in order to keep the ALUs 100% occupied the
33 Memory infrastructure (and the ISA itself) correspondingly needs Vector
34 Memory Operations as well.
35
Vectorized Load and Store also present an extra dimension (literally)
which creates scenarios unique to Vector applications that a Scalar (and
even a SIMD) ISA simply never encounters: not even the complex Addressing
Modes of the 68000 or S/360 resemble Vector Load/Store.
SVP64 endeavours to add the
modes typically found in *all* Scalable Vector ISAs, without changing the
behaviour of the underlying Base (Scalar) v3.0B operations in any way.
(The sole apparent exception is Post-Increment Mode on LD/ST-update
instructions.)
45 <!-- show -->
46
47 ## Modes overview
48
Vectorization of Load and Store requires the creation, from scalar
operations, of a number of different modes:
51
52 * **fixed aka "unit" stride** - contiguous sequence with no gaps
53 * **element strided** - sequential but regularly offset, with gaps
54 * **vector indexed** - vector of base addresses and vector of offsets
55 * **Speculative Fault-first** - where it makes sense to do so
56 * **Data-Dependent Fail-First** - Conditional truncation of Vector Length
57 * **Structure Packing** - covered in SV by [[sv/remap]] and Pack/Unpack Mode.
58
*Despite being constructed from Scalar LD/ST none of these Modes exist
or make sense in any Scalar ISA. They **only** exist in Vector ISAs
and are a critical part of a Vector ISA's value*.
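
As an informal illustration only (a sketch, not the normative pseudocode
given later in this section), the Effective Address of element `i` under
the three addressing modes above could be modelled as follows, with all
names being placeholders:

```
# Illustrative model only: how the EA of element i is formed under each
# addressing mode. All names and values are hypothetical.
def ea_unit_stride(base, imm, i, op_width):
    # contiguous: consecutive elements sit op_width bytes apart
    return base + imm + i * op_width

def ea_element_stride(base, imm, i):
    # regularly offset: consecutive elements sit imm bytes apart (gaps allowed)
    return base + i * imm

def ea_vector_indexed(bases, offsets, i):
    # per-element base and per-element offset, both taken from vectors
    return bases[i] + offsets[i]

print([hex(ea_unit_stride(0x1000, 0, i, 8)) for i in range(4)])   # 0x1000, 0x1008, ...
print([hex(ea_element_stride(0x1000, 24, i)) for i in range(4)])  # 0x1000, 0x1018, ...
print(hex(ea_vector_indexed([0x1000, 0x2000], [8, 16], 1)))       # 0x2010
```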
62
Also included in SVP64 LD/ST are Element-width overrides and Twin-Predication.
64
65 Note also that Indexed [[sv/remap]] mode may be applied to both Scalar
66 LD/ST Immediate Defined Word-instructions *and* LD/ST Indexed Defined Word-instructions.
67 LD/ST-Indexed should not be conflated with Indexed REMAP mode:
68 clarification is provided below.
69
70 **Determining the LD/ST Modes**
71
72 A minor complication (caused by the retro-fitting of modern Vector
73 features to a Scalar ISA) is that certain features do not exactly make
74 sense or are considered a security risk. Fault-first on Vector Indexed
would allow attackers to probe large numbers of pages from userspace,
whereas strided Fault-first (by creating contiguous sequential LDs likely
to be in the same Page) does not.
78
In addition, reduce mode makes no sense for LD/ST. Realistically we need an
80 alternative table definition for [[sv/svp64]] `RM.MODE`. The following
81 modes make sense:
82
83 * simple (no augmentation)
84 * Fault-first (where Vector Indexed is banned)
85 * Data-dependent Fail-First (extremely useful for Linked-List pointer-chasing)
86 * Signed Effective Address computation (Vector Indexed only, on RB)
87
More than that, however, it is necessary to fit the usual Vector ISA
capabilities onto both Power ISA LD/ST with immediate and onto LD/ST
90 Indexed. They present subtly different Mode tables, which, due to lack
91 of space, have the following quirks:
92
93 * LD/ST Immediate has no individual control over src/dest zeroing,
94 whereas LD/ST Indexed does.
95 * LD/ST Immediate has saturation but LD/ST Indexed does not.
96
97 ## Format and fields
98
99 Fields used in tables below:
100
* **zz**: both sz and dz are set equal to this flag.
If predication is enabled, zeros will be put into the dest
(or used as src in the case of twin pred) when the predicate bit is zero;
otherwise the element is ignored or skipped, depending on context.
* **inv CR bit**: just as in branches (BO), these bits allow testing of
a CR bit and whether it is set (inv=0) or unset (inv=1)
107 * **RC1** as if Rc=1, stores CRs *but not the result*
108 * **SEA** - Signed Effective Address, if enabled performs sign-extension on
109 registers that have been reduced due to elwidth overrides
* **PI** - post-increment mode (applies to LD/ST with update only).
The Effective Address utilised is always just RA, i.e. the computed
EA is stored in RA **after** it is actually used.
113 * **LF** - Load/Store Fail or Fault First: for any reason Load or Store Vectors
114 may be truncated to (at least) one element, and VL altered to indicate such.
115 * **VLi** - Inclusive Data-Dependent Fail-First: the failing element is included
116 in the Truncated Vector.
117 * **els** - Element-strided Mode: the element index (after REMAP)
118 is multiplied by the immediate offset (or Scalar RB for Indexed). Restrictions apply.
119
120 When VLi=0 on Store Operations the Memory update does **not** take place
121 on the element that failed. EA does **not** update into RA on Load/Store
122 with Update instructions either.
123
124 **LD/ST immediate**
125
126 The table for [[sv/svp64]] for `immed(RA)` which is `RM.MODE`
127 (bits 19:23 of `RM`) is:
128
129 | 0 | 1 | 2 | 3 4 | description |
130 |---|---| --- |---------|--------------------------- |
131 |els| 0 | PI | zz LF | post-increment and Fault-First |
132 |VLi| 1 | inv | CR-bit | Data-Dependent ffirst CR sel |
133
134 The `els` bit is only relevant when `RA.isvec` is clear: this indicates
135 whether stride is unit or element:
136
```
if RA.isvec:
    svctx.ldstmode = indexed
elif els == 0:
    svctx.ldstmode = unitstride
elif immediate != 0:
    svctx.ldstmode = elementstride
```
145
An immediate of zero is a safety-valve to allow `LD-VSPLAT`: in effect,
multiplying a zero immediate-offset by the element index results in reading
from the exact same memory location for every element, *even with a Vector
register*. (Normally this type of behaviour is reserved for the mapreduce
modes.)
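
As a hedged worked example (mirroring the normative `op_load` pseudocode
further below, using illustrative values `imm=8`, `op_width=4`, `VL=4`),
the offsets generated by the LD/ST-immediate form are:

```
# Informal model of LD/ST-immediate offset generation (see op_load below
# for the normative pseudocode). imm, op_width and VL are example values.
def immediate_offsets(mode, imm, op_width, VL):
    if mode == "unitstride":        # els=0: contiguous, gapless
        return [imm + i * op_width for i in range(VL)]
    if mode == "elementstride":     # els=1: strided by the immediate
        return [i * imm for i in range(VL)]
    raise ValueError(mode)

print(immediate_offsets("unitstride", 8, 4, 4))      # [8, 12, 16, 20]
print(immediate_offsets("elementstride", 8, 4, 4))   # [0, 8, 16, 24]
# LD-VSPLAT: element-strided with imm=0 reads the same location every time
print(immediate_offsets("elementstride", 0, 4, 4))   # [0, 0, 0, 0]
```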
150
151 For `LD-VSPLAT`, on non-cache-inhibited Loads, the read can occur just
152 the once and be copied, rather than hitting the Data Cache multiple
153 times with the same memory read at the same location. The benefit of
Cache-inhibited LD-splats is that they allow memory-mapped peripherals
155 to have multiple data values read in quick succession and stored in
156 sequentially numbered registers (but, see Note below).
157
158 For non-cache-inhibited ST from a vector source onto a scalar destination:
159 with the Vector loop effectively creating multiple memory writes to
160 the same location, we can deduce that the last of these will be the
161 "successful" one. Thus, implementations are free and clear to optimise
162 out the overwriting STs, leaving just the last one as the "winner".
163 Bear in mind that predicate masks will skip some elements (in source
164 non-zeroing mode). Cache-inhibited ST operations on the other hand
165 **MUST** write out a Vector source multiple successive times to the exact
166 same Scalar destination. Just like Cache-inhibited LDs, multiple values
167 may be written out in quick succession to a memory-mapped peripheral
168 from sequentially-numbered registers.
169
170 Note that any memory location may be Cache-inhibited
171 (Power ISA v3.1, Book III, 1.6.1, p1033)
172
*Programmer's Note: a "VSPLAT" mode for LD-immediate with a Scalar source
is simply not possible: there are not enough Mode bits. One single
Scalar Load operation may be used instead, followed by any arithmetic
operation (including a simple mv) in "Splat" mode.*
177
178 **LD/ST Indexed**
179
180 The modes for `RA+RB` indexed version are slightly different
181 but are the same `RM.MODE` bits (19:23 of `RM`):
182
183 | 0 | 1 | 2 | 3 4 | description |
184 |---|---| --- |---------|--------------------------- |
185 |els| 0 | PI | zz SEA | post-increment and Fault-First |
186 |VLi| 1 | inv | CR-bit | Data-Dependent ffirst CR sel |
187
188 Vector Indexed Strided Mode is qualified as follows:
189
```
if els and !RA.isvec and !RB.isvec:
    svctx.ldstmode = elementstride
```
194
195 A summary of the effect of Vectorization of src or dest:
196
```
imm(RA)  RT.v  RA.v        no stride allowed
imm(RA)  RT.s  RA.v        no stride allowed
imm(RA)  RT.v  RA.s        stride-select allowed
imm(RA)  RT.s  RA.s        not vectorized
RA,RB    RT.v  {RA|RB}.v   Standard Indexed
RA,RB    RT.s  {RA|RB}.v   Indexed but single LD (no VSPLAT)
RA,RB    RT.v  {RA&RB}.s   VSPLAT possible. stride selectable
RA,RB    RT.s  {RA&RB}.s   not vectorized (scalar identity)
```
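
The summary above can be read as a decision procedure. The following is an
informal sketch only, combining the two qualification snippets above
(function and argument names are illustrative, not part of the specification):

```
# Illustrative decision procedure for the SV Context ldstmode.
def ldstmode(form, RA_isvec, RB_isvec, els, immediate=0):
    if form == "immediate":              # ldop RT, immed(RA)
        if RA_isvec:
            return "indexed"             # quirky vector-indexed, immediate offset
        if els == 0:
            return "unitstride"
        # els=1: element-strided; an immediate of zero degenerates to
        # LD-VSPLAT (every element reads the same location)
        return "elementstride"
    if form == "indexed":                # ldop RT, RA, RB
        if els and not RA_isvec and not RB_isvec:
            return "elementstride"
        return "indexed"
    raise ValueError(form)

# examples, matching rows of the table above
assert ldstmode("immediate", RA_isvec=True,  RB_isvec=False, els=0) == "indexed"
assert ldstmode("indexed",   RA_isvec=False, RB_isvec=False, els=1) == "elementstride"
```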
207
Signed Effective Address computation is only relevant for Vector Indexed
Mode, when elwidth overrides are applied. The source override applies to
RB: if SEA is set, RB is sign-extended from elwidth bits to the full 64 bits
before being added to RA to calculate the Effective Address.
For other Modes (ffirst), all EA computation with elwidth
overrides is unsigned. RA is *never* altered (never truncated)
by element-width overrides.
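
A small informal sketch of SEA (not the normative definition, which appears
in the Elwidths pseudocode later): RB's elwidth-reduced element is
sign-extended before the 64-bit addition with RA:

```
# Sketch of Signed Effective Address computation: RB's element is read at
# the overridden element width and, when SEA=1, sign-extended to 64 bits
# before the add. RA is always used at the full 64 bits.
def sext(value, from_bits, to_bits=64):
    sign = 1 << (from_bits - 1)
    mask = (1 << to_bits) - 1
    return ((value ^ sign) - sign) & mask

def effective_address(ra, rb_elem, src_elwidth, SEA):
    offs = rb_elem & ((1 << src_elwidth) - 1)    # elwidth-reduced RB element
    if SEA:
        offs = sext(offs, src_elwidth)           # signed offset
    return (ra + offs) & ((1 << 64) - 1)

# 0xFFFF as a 16-bit element: -1 when SEA=1, +65535 when SEA=0
print(hex(effective_address(0x1000, 0xFFFF, 16, SEA=True)))   # 0xfff
print(hex(effective_address(0x1000, 0xFFFF, 16, SEA=False)))  # 0x10fff
```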
215
216 Note that cache-inhibited LD/ST when VSPLAT is activated will perform
217 **multiple** LD/ST operations, sequentially. Even with scalar src
218 a Cache-inhibited LD will read the same memory location *multiple
219 times*, storing the result in successive Vector destination registers.
This is because the cache-inhibit instructions are typically used to read
221 and write memory-mapped peripherals. If a genuine cache-inhibited
222 LD-VSPLAT is required then a single *scalar* cache-inhibited LD should
223 be performed, followed by a VSPLAT-augmented mv, copying the one *scalar*
224 value into multiple register destinations.
225
Note also that cache-inhibited VSPLAT with Data-Dependent Fail-First is possible.
This allows, for example, a massive batch of memory-mapped
peripheral reads to be issued, stopping at the first NULL (zero) character and
truncating VL to that point. No branch is needed to issue that large
burst of LDs, which may be valuable in Embedded scenarios.
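
A hedged behavioural model of that scenario, where `read_mmio` is a
hypothetical stand-in for a cache-inhibited access to a memory-mapped
peripheral:

```
# Model of a cache-inhibited LD-VSPLAT combined with Data-Dependent
# Fail-First: the same peripheral address is read repeatedly, results land
# in successive destination registers, and VL truncates at the first zero.
def ci_vsplat_ddffirst(read_mmio, addr, regs, RT, VL, VLi=False):
    for i in range(VL):
        data = read_mmio(addr)            # every element re-reads the device
        if data == 0:                     # fail condition: NULL/zero byte
            if VLi:
                regs[RT + i] = data       # inclusive: failing element kept
                return i + 1              # truncated VL
            return i                      # exclusive: failing element dropped
        regs[RT + i] = data
    return VL

# toy usage: a "peripheral" that streams a NUL-terminated string
stream = iter(b"hello\0world")
regs = [0] * 32
vl = ci_vsplat_ddffirst(lambda a: next(stream), 0x40000000, regs, RT=8, VL=16)
print(vl, bytes(regs[8:8 + vl]))          # 5 b'hello'
```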
231
232 ## Vectorization of Scalar Power ISA v3.0B
233
234 Scalar Power ISA Load/Store operations may be seen from [[isa/fixedload]]
235 and [[isa/fixedstore]] pseudocode to be of the form:
236
237 ```
238 lbux RT, RA, RB
239 EA <- (RA) + (RB)
240 RT <- MEM(EA)
241 ```
242
243 and for immediate variants:
244
245 ```
246 lb RT,D(RA)
247 EA <- RA + EXTS(D)
248 RT <- MEM(EA)
249 ```
250
251 Thus in the first example, the source registers may each be independently
252 marked as scalar or vector, and likewise the destination; in the second
253 example only the one source and one dest may be marked as scalar or
254 vector.
255
256 Thus we can see that Vector Indexed may be covered, and, as demonstrated
257 with the pseudocode below, the immediate can be used to give unit
258 stride or element stride. With there being no way to tell which from
259 the Power v3.0B Scalar opcode alone, the choice is provided instead by
260 the SV Context.
261
```
# LD not VLD! format - ldop RT, immed(RA)
# op_width: lb=1, lh=2, lw=4, ld=8
op_load(RT, RA, op_width, immed, svctx, RAupdate):
  ps = get_pred_val(FALSE, RA); # predication on src
  pd = get_pred_val(FALSE, RT); # ... AND on dest
  for (i=0, j=0, u=0; i < VL && j < VL;):
    # skip non-predicated elements
    if (RA.isvec) while (!(ps & 1<<i)) i++;
    if (RAupdate.isvec) while (!(ps & 1<<u)) u++;
    if (RT.isvec) while (!(pd & 1<<j)) j++;
    if postinc:
        offs = 0; # added afterwards
        if RA.isvec: srcbase = ireg[RA+i]
        else         srcbase = ireg[RA]
    elif svctx.ldstmode == elementstride:
        # element stride mode
        srcbase = ireg[RA]
        offs = i * immed              # j*immed for a ST
    elif svctx.ldstmode == unitstride:
        # unit stride mode
        srcbase = ireg[RA]
        offs = immed + (i * op_width) # j*op_width for ST
    elif RA.isvec:
        # quirky Vector indexed mode but with an immediate
        srcbase = ireg[RA+i]
        offs = immed;
    else
        # standard scalar mode (but predicated)
        # no stride multiplier means VSPLAT mode
        srcbase = ireg[RA]
        offs = immed

    # compute EA
    EA = srcbase + offs
    # load from memory
    ireg[RT+j] <= MEM[EA];
    # check post-increment of EA
    if postinc: EA = srcbase + immed;
    # update RA?
    if RAupdate: ireg[RAupdate+u] = EA;
    if (!RT.isvec)
        break # destination scalar, end now
    if (RA.isvec) i++;
    if (RAupdate.isvec) u++;
    if (RT.isvec) j++;
```
309
310 Indexed LD is:
311
```
# format: ldop RT, RA, RB
function op_ldx(RT, RA, RB, RAupdate=False) # LD not VLD!
  ps = get_pred_val(FALSE, RA); # predication on src
  pd = get_pred_val(FALSE, RT); # ... AND on dest
  for (i=0, j=0, k=0, u=0; i < VL && j < VL && k < VL):
    # skip nonpredicated RA, RB and RT
    if (RA.isvec) while (!(ps & 1<<i)) i++;
    if (RAupdate.isvec) while (!(ps & 1<<u)) u++;
    if (RB.isvec) while (!(ps & 1<<k)) k++;
    if (RT.isvec) while (!(pd & 1<<j)) j++;
    if svctx.ldstmode == elementstride:
        EA = ireg[RA] + ireg[RB]*j   # register-strided
    else
        EA = ireg[RA+i] + ireg[RB+k] # indexed address
    if RAupdate: ireg[RAupdate+u] = EA
    ireg[RT+j] <= MEM[EA];
    if (!RT.isvec)
        break # destination scalar, end immediately
    if (RA.isvec) i++;
    if (RAupdate.isvec) u++;
    if (RB.isvec) k++;
    if (RT.isvec) j++;
```
336
Note that Element-Strided mode uses the Destination Step (j): because both
sources being Scalar is a prerequisite condition for activation of
Element-Stride Mode, the source step (being Scalar) would never advance.
340
341 Note in both cases that [[sv/svp64]] allows RA-as-a-dest in "update"
342 mode (`ldux`) to be effectively a *completely different* register from
RA-as-a-source. This is because there is room in svp64 to extend RA-as-src
344 as well as RA-as-dest, both independently as scalar or vector *and*
345 independently extending their range.
346
347 *Programmer's note: being able to set RA-as-a-source as separate from
348 RA-as-a-destination as Scalar is **extremely valuable** once it is
349 remembered that Simple-V element operations must be in Program Order,
350 especially in loops, for saving on multiple address computations. Care
351 does have to be taken however that RA-as-src is not overwritten by
352 RA-as-dest unless intentionally desired, especially in element-strided
353 Mode.*
354
355 ## LD/ST Indexed vs Indexed REMAP
356
357 Unfortunately the word "Indexed" is used twice in completely different
358 contexts, potentially causing confusion.
359
* There have existed instructions of the form `ld RT,RA,RB` in the Power ISA
since its creation: these are called "LD/ST Indexed" instructions and their
name and meaning are well-established.
363 * There now exists, in Simple-V, a [[sv/remap]] mode called "Indexed"
364 Mode that can be applied to *any* instruction **including those
365 named LD/ST Indexed**.
366
Whilst allowing REMAP Indexed Mode to be applied to any Vectorized LD/ST
Indexed operation such as `sv.ld *RT,RA,*RB` may be costly in terms of
register reads, and may even be misleadingly labelled as redundant, firstly
the strict application of the RISC Paradigm that Simple-V follows makes
it awkward to consider *preventing* the application of Indexed REMAP to
such operations, and secondly the two are not actually the same at all.
373
374 Indexed REMAP, as applied to RB in the instruction `sv.ld *RT,RA,*RB`
375 effectively performs an *in-place* re-ordering of the offsets, RB.
376 To achieve the same effect without Indexed REMAP would require taking
377 a *copy* of the Vector of offsets starting at RB, manually explicitly
378 reordering them, and finally using the copy of re-ordered offsets in a
379 non-REMAP'ed `sv.ld`. Using non-strided LD as an example, pseudocode
380 showing what actually occurs, where the pseudocode for `indexed_remap`
381 may be found in [[sv/remap]]:
382
```
# sv.ld *RT,RA,*RB with Index REMAP applied to RB
for i in 0..VL-1:
  if remap.indexed:
      rb_idx = indexed_remap(i) # remap
  else:
      rb_idx = i                # use the index as-is
  EA = GPR(RA) + GPR(RB+rb_idx)
  GPR(RT+i) = MEM(EA, 8)
```
393
394 Thus it can be seen that the use of Indexed REMAP saves copying
395 and manual reordering of the Vector of RB offsets.
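
For comparison, an informal sketch of what would be required *without*
Indexed REMAP: the Vector of offsets must be copied into spare registers
(`RB_copy` is a hypothetical allocation), reordered using the same
`indexed_remap` function (treated here as a given), and only then used by a
plain `sv.ld`. `GPR` and `MEM` are simple models for illustration:

```
# Without Indexed REMAP: the Vector of offsets at RB must be copied and
# explicitly reordered before a plain (non-REMAP) sv.ld can be issued,
# costing extra registers and extra instructions.
def manual_reorder_then_load(GPR, MEM, RA, RB, RB_copy, RT, VL, indexed_remap):
    for i in range(VL):                          # explicit copy-and-reorder loop
        GPR[RB_copy + i] = GPR[RB + indexed_remap(i)]
    for i in range(VL):                          # plain sv.ld *RT,RA,*RB_copy
        EA = GPR[RA] + GPR[RB_copy + i]
        GPR[RT + i] = MEM[EA]

# toy demo with a hypothetical 4-entry Index schedule
GPR = list(range(64)); MEM = {k: k * 10 for k in range(200)}
manual_reorder_then_load(GPR, MEM, RA=2, RB=8, RB_copy=16, RT=24, VL=4,
                         indexed_remap=lambda i: [2, 0, 1, 3][i])
print(GPR[24:28])   # [120, 100, 110, 130]
```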
396
397 ## LD/ST ffirst (Fault-First)
398
399 LD/ST ffirst treats the first LD/ST in a vector (element 0 if REMAP
400 is not active and predication is not applied)
401 as an ordinary one, with all behaviour with respect to
Interrupts, Exceptions, Page Faults and Memory Management being identical
in every regard to Scalar v3.0 Power ISA LD/ST. However, for elements
404 1 and above, if an exception would occur, then VL is **truncated**
405 to the previous element: the exception is **not** then raised because
406 the LD/ST that would otherwise have caused an exception is *required*
407 to be cancelled. Additionally an implementor may choose to truncate VL
408 for any arbitrary reason *except for the very first*.
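
A hedged behavioural sketch of the exception rule only (truncation for
other implementation-specific reasons, described below, is not modelled;
all names are illustrative):

```
# Behavioural model of Speculative Fault-First: the first element's
# exception is delivered as usual; an exception on element 1 onwards is
# suppressed and VL is truncated to the elements already completed.
class PageFault(Exception):
    pass

def ffirst_load(load_elem, VL):
    completed = []
    for i in range(VL):
        try:
            completed.append(load_elem(i))
        except PageFault:
            if i == 0:
                raise                  # element 0: ordinary trap semantics
            return completed, i        # truncate VL, exception not raised
    return completed, VL

# toy usage: element 3 would fault, so VL becomes 3
def loader(i):
    if i == 3:
        raise PageFault()
    return 100 + i

print(ffirst_load(loader, VL=8))       # ([100, 101, 102], 3)
```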
409
410 ffirst LD/ST to multiple pages via a Vectorized Index base is
411 considered a security risk due to the abuse of probing multiple
412 pages in rapid succession and getting speculative feedback on which
413 pages would fail. Therefore Vector Indexed LD/ST is prohibited
414 entirely, and the Mode bit instead used for element-strided LD/ST.
415 <!-- hide -->
416 See <https://bugs.libre-soc.org/show_bug.cgi?id=561>
417 <!-- show -->
418
```
for(i = 0; i < VL; i++)
    reg[rt + i] = mem[reg[ra] + i * reg[rb]];
```
423
424 High security implementations where any kind of speculative probing of
425 memory pages is considered a risk should take advantage of the fact
426 that implementations may truncate VL at any point, without requiring
427 software to be rewritten and made non-portable. Such implementations may
428 choose to *always* set VL=1 which will have the effect of terminating
429 any speculative probing (and also adversely affect performance), but
430 will at least not require applications to be rewritten.
431
Low-performance simpler hardware implementations may also choose to
(always) set VL=1 as the bare minimum compliant implementation of LD/ST
434 Fail-First. It is however critically important to remember that the first
435 element LD/ST **MUST** be treated as an ordinary LD/ST, i.e. **MUST**
436 raise exceptions exactly like an ordinary LD/ST.
437
438 For ffirst LD/STs, VL may be truncated arbitrarily to a nonzero value
439 for any implementation-specific reason. For example: it is perfectly
440 reasonable for implementations to alter VL when ffirst LD or ST operations
441 are initiated on a nonaligned boundary, such that within a loop the
442 subsequent iteration of that loop begins the following ffirst LD/ST
443 operations on an aligned boundary such as the beginning of a cache line,
444 or beginning of a Virtual Memory page. Likewise, to reduce workloads or
445 balance resources.
446
447 When Predication is used, the "first" element is considered to be the first
448 non-predicated element rather than specifically `srcstep=0`.
449
450 Vertical-First Mode is slightly strange in that only one element at a time
451 is ever executed anyway. Given that programmers may legitimately choose
452 to alter srcstep and dststep in non-sequential order as part of explicit
453 loops, it is neither possible nor safe to make speculative assumptions
454 about future LD/STs. Therefore, Fail-First LD/ST in Vertical-First is
455 `UNDEFINED`. This is very different from Arithmetic (Data-dependent)
456 FFirst where Vertical-First Mode is fully deterministic, not speculative.
457
458 ## Data-Dependent Fail-First (not Fail/Fault-First)
459
460 Not to be confused with Fail/Fault First, Data-Fail-First performs an
461 additional check on the data, and if the test
462 fails then VL is truncated and further looping terminates.
463 This is precisely the same as Arithmetic Data-Dependent Fail-First,
464 the only difference being that the result comes from the LD/ST
465 rather than from an Arithmetic operation.
466
Important to note is that reduce mode is implied by Data-Dependent Fail-First.
In other words, where normally the looping would terminate at the first
load/store if the destination is Scalar, Data-Dependent Fail-First
*continues*, just as it does in reduce mode.
471
There is also a crucial difference between Arithmetic and LD/ST Data-Dependent Fail-First:
473 except for Store-Conditional a 4-bit Condition Register Field test is created
474 for testing purposes
475 *but not stored* (thus there is no RC1 Mode as there is in Arithmetic).
476 The reason why a CR Field is not stored is because Load/Store, particularly
477 the Update instructions, is already expensive in register terms,
478 and adding an extra Vector write would be too costly in hardware.
479
480 *Programmer's note: Programmers
481 may use Data-Dependent Load with a test to truncate VL, and may then
482 follow up with a `sv.cmpi` or other operation. The important aspect is
that the Vector Load is truncated on finding a NULL pointer, for example.*
484
485 *Programmer's note: Load-with-Update may be used to update
the register used in Effective Address computation of the
487 next element. This may be used to perform single-linked-list
488 walking, where Data-Dependent Fail-First terminates and
489 truncates the Vector at the first NULL.*
490
491 **Load/Store Data-Dependent Fail-First, VLi=0**
492
In the case of Store operations there is a quirk when VLi ("VL
Inclusive") is clear. Bear in mind the criterion is that the truncated
Vector of results, when VLi is clear, must all pass the "test", but when
VLi is set the *current failed test* is permitted to be included. Thus,
497 the actual update (store) to Memory is **not permitted to take place**
498 should the test fail.
499
500 Additionally in any Load/Store with Update instruction,
501 when VLi=0 and a test fails then RA does **not** receive a
502 copy of the Effective Address. Hardware implementations with Out-of-Order
503 Micro-Architectures should use speculative Shadow-Hold and Cancellation
504 (or other Transactional Rollback mechanism) when the test fails.
505
506 * **Load, VLi=0**: perform the Memory Load, do not put the result into the regfile yet (or EA into RA). Test the Loaded data: if fail do not store the Load in the register file (or EA into RA). Otherwise proceed with updating regfiles. VL is truncated to "only elements that passed the test"
507 * **Store, VLi=0**: even before the Store takes place, perform the test on the data to *be* stored. If fail do not proceed with the Store at all. VL is truncated to "only elements that passed the test"
508
509 **Load/Store Data-Dependent Fail-First, VLi=1**
510
511 By contrast if VLi=1 and the test fails, the Store may proceed *and then*
512 looping terminates. In this way, when Inclusive the Vector of Truncated results
contains the first-failed data (including RA on Updates).
514
515 * **Load, VLi=1**: perform the Memory Load, complete it in full (including EA into RA). Test the Loaded data: if fail then VL is truncated to "elements tested".
* **Store, VLi=1**: same as Load. Perform the Store in full and after-the-fact carry out the test of the original data requested to be stored. If fail then VL is truncated to "elements tested". (A Store-side sketch contrasting VLi=0 and VLi=1 follows below.)
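
A hedged Store-side sketch contrasting the two settings (illustrative only;
`test` stands in for the Data-Dependent condition, and `MEM`/`addrs`/`data`
are simple models):

```
# Data-Dependent Fail-First on Store: with VLi=0 the failing element's
# Store is suppressed and VL excludes it; with VLi=1 the Store completes
# and VL includes the failing element.
def ddffirst_store(MEM, addrs, data, VL, test, VLi):
    for i in range(VL):
        if not test(data[i]):            # test the value to *be* stored
            if VLi:
                MEM[addrs[i]] = data[i]  # inclusive: the store proceeds
                return i + 1
            return i                     # exclusive: the store is suppressed
        MEM[addrs[i]] = data[i]
    return VL

MEM = {}
vl = ddffirst_store(MEM, [0x10, 0x18, 0x20], [5, 0, 7], 3,
                    test=lambda x: x != 0, VLi=False)
print(vl, MEM)     # 1 {16: 5}
```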
517
518 Below is an example of loading the starting addresses of Linked-List
519 nodes. If VLi=1 it will load the NULL pointer into the Vector of results.
520 If however VLi=0 it will *exclude* the NULL pointer by truncating VL to
521 one Element earlier (only loading non-NULL data into registers).
522
*Programmer's Note: by also setting the RC1 qualifier as well as setting
VLi=1, it is possible to establish a Predicate Mask such that the first
zero in the predicate will be the NULL pointer.*
526
```
RT=1 # vec - deliberately overlaps by one with RA
RA=0 # vec - first one is valid, contains ptr
imm = 8 # offset_of(ptr->next)
for i in range(VL):
    # this part is the Scalar Defined Word-instruction (standard scalar ld operation)
    EA = GPR(RA+i) + imm           # ptr + offset(next)
    data = MEM(EA, 8)              # 64-bit address of ptr->next
    # was a normal vector-ld up to this point. now the Data-Fail-First
    cr_test = conditions(data)
    if Rc=1 or RC1: CR.field(i) = cr_test # only store if Rc=1/RC1
    action_load = True
    if cr_test.EQ == testbit: # check if zero
        if VLI then
            VL = i+1            # update VL, inclusive
        else
            VL = i              # update VL, exclusive current
            action_load = False # current load excluded
        stop = True # stop looping
    if action_load:
        GPR(RT+i) = data # happens to be read on next loop!
    if stop: break
```
550
551 **Data-Dependent Fail-First on Store-Conditional (Rc=1)**
552
There are very few instructions that allow Rc=1 for Load/Store:
among them are `stdcx.` and the other Atomic Store-Conditional
instructions. With Simple-V being a loop around Scalar instructions
strictly obeying Scalar Program Order, a Horizontal-First Fail-First loop
on an Atomic Store-Conditional will always fail the second and all other
Store-Conditional instructions, because
Load-Reservation and Store-Conditional are required to be executed
in pairs.
561
562 By contrast, in Vertical-First Mode it is in fact possible to issue
563 the pairs, and consequently allowing Vectorized Data-Dependent Fail-First is
564 useful.
565
566 Programmer's note: Care should be taken when VL is truncated in
567 Vertical-First Mode.
568
569 **Future potential**
570
Although Rc=1 on LD/ST is a rare occurrence at present, future versions
of Power ISA *might* conceivably have Rc=1 LD/ST Scalar instructions, and
with the SVP64 Vectorization Prefix being a RISC paradigm that
is fully independent of the Scalar Suffix Defined Word-instructions, prohibiting
the possibility of Rc=1 Data-Dependent Mode on future potential LD/ST
operations is not strategically sound.
577
578 ## LOAD/STORE Elwidths <a name="elwidth"></a>
579
580 Loads and Stores are almost unique in that the Power Scalar ISA
581 provides a width for the operation (lb, lh, lw, ld). Only `extsb` and
582 others like it provide an explicit operation width. There are therefore
583 *three* widths involved:
584
585 * operation width (lb=8, lh=16, lw=32, ld=64)
586 * src element width override (8/16/32/default)
587 * destination element width override (8/16/32/default)
588
589 Some care is therefore needed to express and make clear the transformations,
590 which are expressly in this order:
591
592 * Calculate the Effective Address from RA at full width
593 but (on Indexed Load) allow srcwidth overrides on RB
594 * Load at the operation width (lb/lh/lw/ld) as usual
595 * byte-reversal as usual
596 * zero-extension or truncation from operation width to dest elwidth
597 * place result in destination at dest elwidth
598
599 In order to respect Power v3.0B Scalar behaviour the memory side
600 is treated effectively as completely separate and distinct from SV
601 augmentation. This is primarily down to quirks surrounding LE/BE and
602 byte-reversal.
603
604 It is rather unfortunately possible to request an elwidth override on
605 the memory side which does not mesh with the overridden operation width:
this results in `UNDEFINED` behaviour. The reason is that the effect
607 of attempting a 64-bit `sv.ld` operation with a source elwidth override
608 of 8/16/32 would result in overlapping memory requests, particularly
609 on unit and element strided operations. Thus it is `UNDEFINED` when
610 the elwidth is smaller than the memory operation width. Examples include
611 `sv.lw/sw=16/els` which requests (overlapping) 4-byte memory reads offset
612 from each other at 2-byte intervals. Store likewise is also `UNDEFINED`
613 where the dest elwidth override is less than the operation width.
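
A small numerical illustration of the `sv.lw/sw=16/els` example above
(address arithmetic only, assuming a hypothetical base address):

```
# Why elwidth smaller than the memory operation width is UNDEFINED:
# sv.lw (4-byte operation) stepped at a 2-byte (sw=16) interval produces
# overlapping requests. Illustrative address arithmetic only.
op_width = 4       # lw: 4-byte memory operation
step = 2           # sw=16 override: elements stepped at 2-byte intervals
base = 0x1000
ranges = [(base + i * step, base + i * step + op_width) for i in range(4)]
print(ranges)      # [(4096, 4100), (4098, 4102), (4100, 4104), (4102, 4106)]
# each 4-byte read overlaps the next by 2 bytes: behaviour is UNDEFINED
```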
614
615 Note the following regarding the pseudocode to follow:
616
617 * `scalar identity behaviour` SV Context parameter conditions turn this
618 into a straight absolute fully-compliant Scalar v3.0B LD operation
619 * `brev` selects whether the operation is the byte-reversed variant (`ldbrx`
620 rather than `ld`)
621 * `op_width` specifies the operation width (`lb`, `lh`, `lw`, `ld`) as
622 a "normal" part of Scalar v3.0B LD
623 * `imm_offs` specifies the immediate offset `ld r3, imm_offs(r5)`, again
624 as a "normal" part of Scalar v3.0B LD
625 * `svctx` specifies the SV Context and includes VL as well as
626 source and destination elwidth overrides.
627
628 Below is the pseudocode for Unit-Strided LD (which includes Vector
629 capability). Observe in particular that RA, as the base address in both
630 Immediate and Indexed LD/ST, does not have element-width overriding
631 applied to it.
632
633 Note that predication, predication-zeroing, and other modes
634 have all been removed, for clarity and simplicity:
635
```
# LD not VLD!
# this covers unit stride mode and a type of vector offset
function op_ld(RT, RA, op_width, imm_offs, svctx)
  for (int i = 0, int j = 0; i < svctx.VL && j < svctx.VL):
    if not svctx.unit/el-strided:
        # strange vector mode, compute 64 bit address which is
        # not polymorphic! elwidth hardcoded to 64 here
        srcbase = get_polymorphed_reg(RA, 64, i)
    else:
        # unit / element stride mode, compute 64 bit address
        srcbase = get_polymorphed_reg(RA, 64, 0)
        # adjust for unit/el-stride
        srcbase += .... uses op_width here

    # read the underlying memory
    memread <= MEM(srcbase + imm_offs, op_width)

    # truncate/extend to over-ridden dest width.
    memread = adjust_wid(memread, op_width, svctx.elwidth)

    # takes care of inserting memory-read (now correctly byteswapped)
    # into regfile underlying LE-defined order, into the right place
    # using Element-Packing starting at register RT, respecting destination
    # element bitwidth, and the element index (j)
    set_polymorphed_reg(RT, svctx.elwidth, j, memread)

    # increments both src and dest element indices (no predication here)
    i++;
    j++;
```
667
668 Note above that the source elwidth is *not used at all* in LD-immediate: RA
669 never has elwidth overrides, leaving the elwidth free for truncation/extension
670 of the result.
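
A hedged, highly-simplified model of the element-packing performed by
`set_polymorphed_reg` above (not the reference implementation: predication,
SVSTATE and element-width encodings are omitted), treating the register
file as a flat little-endian byte array:

```
# Model of destination element packing: the regfile is viewed as a flat
# LE byte array, and element j of width 'elwidth' bits starting at
# register RT is written at byte offset RT*8 + j*elwidth//8.
regfile = bytearray(128 * 8)          # 128 x 64-bit registers

def set_polymorphed_reg(RT, elwidth, j, value):
    nbytes = elwidth // 8
    offs = RT * 8 + j * nbytes
    regfile[offs:offs + nbytes] = value.to_bytes(nbytes, "little")

# four 16-bit loads packed into the single 64-bit register r4
for j, val in enumerate([0x1111, 0x2222, 0x3333, 0x4444]):
    set_polymorphed_reg(4, 16, j, val)
print(regfile[4*8:5*8].hex())         # 1111222233334444 in LE byte order
```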
671
672 For LD/Indexed, the key is that in the calculation of the Effective Address,
673 RA has no elwidth override but RB does. Pseudocode below is simplified
674 for clarity: predication and all modes are removed:
675
```
# LD not VLD! ld*rx if brev else ld*
function op_ld(RT, RA, RB, op_width, svctx, brev)
  for (int i = 0, int j = 0; i < svctx.VL && j < svctx.VL):
    if not svctx.el-strided:
        # RA not polymorphic! elwidth hardcoded to 64 here
        srcbase = get_polymorphed_reg(RA, 64, i)
    else:
        # element stride mode, again RA not polymorphic
        srcbase = get_polymorphed_reg(RA, 64, 0)
    # RB *is* polymorphic
    offs = get_polymorphed_reg(RB, svctx.src_elwidth, i)
    # sign-extend
    if svctx.SEA: offs = sext(offs, svctx.src_elwidth, 64)

    # takes care of (merges) processor LE/BE and ld/ldbrx
    bytereverse = brev XNOR MSR.LE

    # read the underlying memory
    memread <= MEM(srcbase + offs, op_width)

    # optionally performs byteswap at op width
    if (bytereverse):
        memread = byteswap(memread, op_width)

    # truncate/extend to over-ridden dest width.
    dest_width = op_width if RT.isvec else 64
    memread = adjust_wid(memread, op_width, dest_width)

    # takes care of inserting memory-read (now correctly byteswapped)
    # into regfile underlying LE-defined order, into the right place
    # within the NEON-like register, respecting destination element
    # bitwidth, and the element index (j)
    set_polymorphed_reg(RT, dest_width, j, memread)

    # increments both src and dest element indices (no predication here)
    i++;
    j++;
```
715
716 *Programmer's note: with no destination elwidth override the destination
717 width must be implicitly ascertained. The assumption is that if the destination
is a Scalar then the entire 64-bit register must be written, thus the width is
719 extended to 64-bit. If however the destination is a Vector then it is deemed
720 appropriate to use the LD/ST width and to perform contiguous register element
721 packing at that width. The justification for doing so is that if further
722 sign-extension or saturation is required after a LD, these may be performed by a
723 follow-up instruction that uses a source elwidth override matching the exact width
724 of the LD operation. Correspondingly for a ST a destination elwidth override
725 on a prior instruction may match the exact width of the ST instruction.*
726
727 ## Remapped LD/ST
728
729 In the [[sv/remap]] page the concept of "Remapping" is described. Whilst
730 it is expensive to set up (2 64-bit opcodes minimum) it provides a way to
731 arbitrarily perform 1D, 2D and 3D "remapping" of up to 64 elements worth
732 of LDs or STs. The usual interest in such re-mapping is for example in
733 separating out 24-bit RGB channel data into separate contiguous registers.
734 NEON covers this as shown in the diagram below:
735
736 ![Load/Store remap](/openpower/sv/load-store.svg)
737
738 REMAP easily covers this capability, and with dest elwidth overrides
739 and saturation may do so with built-in conversion that would normally
740 require additional width-extension, sign-extension and min/max Vectorized
741 instructions as post-processing stages.
742
743 Thus we do not need to provide specialist LD/ST "Structure Packed" opcodes
744 because the generic abstracted concept of "Remapping", when applied to
745 LD/ST, will give that same capability, with far more flexibility.
746
747 It is worth noting that Pack/Unpack Modes of SVSTATE, which may be
748 established through `svstep`, are also an easy way to perform regular
749 Structure Packing, at the vec2/vec3/vec4 granularity level. Beyond that,
750 REMAP will need to be used.
751
752 **Parallel Reduction REMAP**
753
754 No REMAP Schedule is prohibited in SVP64 because the RISC-paradigm Prefix
755 is completely separate from the RISC-paradigm Scalar Defined Word-instructions. Although
obscure, there does exist the outside possibility that
757 Parallel Reduction Schedules on LD/ST would find a use in Computer Science.
758 Readers are invited to contact the authors of this document if one is ever
759 found.
760
761 --------
762
763 [[!tag standards]]
764
765 \newpage{}