# SV Load and Store

Links:

* <https://bugs.libre-soc.org/show_bug.cgi?id=561>
* <https://bugs.libre-soc.org/show_bug.cgi?id=572>
* <https://bugs.libre-soc.org/show_bug.cgi?id=571>
* <https://bugs.libre-soc.org/show_bug.cgi?id=940> post autoincrement mode
* <https://bugs.libre-soc.org/show_bug.cgi?id=1047> Data-Dependent Fail-First
* <https://llvm.org/devmtg/2016-11/Slides/Emerson-ScalableVectorizationinLLVMIR.pdf>
* <https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vector-loads-and-stores>
* [[ldst/discussion]]

## Rationale

All Vector ISAs dating back fifty years have extensive and comprehensive
Load and Store operations that go far beyond the capabilities of Scalar
RISC and most CISC processors, yet at their heart, on an individual
element basis, are no different from their RISC Scalar equivalents.

The resource savings from Vector LD/ST are significant and stem
from the fact that one single instruction can trigger a dozen (or, in
some microarchitectures such as Cray or NEC SX Aurora, hundreds of)
element-level Memory accesses.

Additionally, and simply: if the Arithmetic side of an ISA supports
Vector Operations, then in order to keep the ALUs 100% occupied the
Memory infrastructure (and the ISA itself) correspondingly needs Vector
Memory Operations as well.

Vectorised Load and Store also presents an extra dimension (literally)
which creates scenarios unique to Vector applications, that a Scalar (and
even a SIMD) ISA simply never encounters. SVP64 endeavours to add the
modes typically found in *all* Scalable Vector ISAs, without changing the
behaviour of the underlying Base (Scalar) v3.0B operations in any way.
(The sole apparent exception is Post-Increment Mode on LD/ST-update
instructions.)

## Modes overview

Vectorisation of Load and Store requires the creation, from scalar
operations, of a number of different modes:

* **fixed aka "unit" stride** - contiguous sequence with no gaps
* **element strided** - sequential but regularly offset, with gaps
* **vector indexed** - vector of base addresses and vector of offsets
* **Speculative fail-first** - where it makes sense to do so
* **Structure Packing** - covered in SV by [[sv/remap]] and Pack/Unpack Mode.

*Despite being constructed from Scalar LD/ST none of these Modes exist
or make sense in any Scalar ISA. They **only** exist in Vector ISAs
and are a critical part of their value*.

Also included in SVP64 LD/ST is both signed and unsigned Saturation,
as well as Element-width overrides and Twin-Predication.

Note also that Indexed [[sv/remap]] mode may be applied to both Scalar
LD/ST Immediate Defined Words *and* LD/ST Indexed Defined Words.
LD/ST-Indexed should not be conflated with Indexed REMAP mode:
clarification is provided below.

**Determining the LD/ST Modes**

A minor complication (caused by the retro-fitting of modern Vector
features to a Scalar ISA) is that certain features do not exactly make
sense or are considered a security risk. Fail-first on Vector Indexed
would allow attackers to probe large numbers of pages from userspace,
whereas strided fail-first (by creating contiguous sequential LDs) does not.

In addition, reduce mode makes no sense. Realistically we need an
alternative table definition for [[sv/svp64]] `RM.MODE`. The following
modes make sense:

* saturation
* predicate-result would be useful but is lower priority than Data-Dependent Fail-First
* simple (no augmentation)
* fail-first (where Vector Indexed is banned)
* Signed Effective Address computation (Vector Indexed only)

More than that, however, it is necessary to fit the usual Vector ISA
capabilities onto both Power ISA LD/ST with immediate and onto LD/ST
Indexed. They present subtly different Mode tables, which, due to lack
of space, have the following quirks:

* LD/ST Immediate has no individual control over src/dest zeroing,
  whereas LD/ST Indexed does.
* LD/ST Immediate has saturation but LD/ST Indexed does not.

## Format and fields

Fields used in tables below:

* **sz / dz** if predication is enabled will put zeros into the dest
  (or as src in the case of twin pred) when the predicate bit is zero.
  Otherwise the element is ignored or skipped, depending on context.
* **zz**: both sz and dz are set equal to this flag.
* **inv CR bit** just as in branches (BO) these bits allow testing of
  a CR bit and whether it is set (inv=0) or unset (inv=1)
* **N** sets signed/unsigned saturation.
* **RC1** as if Rc=1, stores CRs *but not the result*
* **SEA** - Signed Effective Address, if enabled performs sign-extension on
  registers that have been reduced due to elwidth overrides
* **PI** - post-increment mode (applies to LD/ST with update only).
  The Effective Address utilised is always just RA, i.e. the computation of
  EA is stored in RA **after** it is actually used.
* **LF** - Load/Store Fail or Fault First: for any reason Load or Store Vectors
  may be truncated to (at least) one element, and VL altered to indicate such.
* **VLi** - Inclusive Data-Dependent Fail-First: the failing element is included
  in the Truncated Vector.
* **els** - Element-strided Mode: the element index (after REMAP)
  is multiplied by the immediate offset (or Scalar RB for Indexed). Restrictions apply.

When VLi=0 on Store Operations the Memory update does **not** take place
on the element that failed. EA does **not** update into RA on Load/Store
with Update instructions either.

**LD/ST immediate**

The table for [[sv/svp64]] for `immed(RA)` which is `RM.MODE`
(bits 19:23 of `RM`) is:

| 0 | 1 | 2   | 3 4    | description                    |
|---|---|-----|--------|--------------------------------|
| 0 | 0 | 0   | zz els | simple mode                    |
| 0 | 0 | 1   | PI LF  | post-increment and Fault-First |
| 1 | 0 | N   | zz els | sat mode: N=0/1 u/s            |
|VLi| 1 | inv | CR-bit | ffirst CR sel                  |

The `els` bit is only relevant when `RA.isvec` is clear: this indicates
whether stride is unit or element:

```
if RA.isvec:
    svctx.ldstmode = indexed
elif els == 0:
    svctx.ldstmode = unitstride
elif immediate != 0:
    svctx.ldstmode = elementstride
```

An immediate of zero is a safety-valve to allow `LD-VSPLAT`: in effect
the multiplication of the immediate-offset by zero results in reading from
the exact same memory location, *even with a Vector register*. (Normally
this type of behaviour is reserved for the mapreduce modes)
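
A small model may help illustrate the resulting per-element Effective
Addresses for each of the four LD-immediate cases, including the
immediate-of-zero `LD-VSPLAT` case. This is an informal sketch only: the
names `ireg` and `op_width` and the concrete register values are
illustrative, not normative.

```
# illustrative model only: per-element EA for the LD-immediate modes above.
# ireg is a stand-in integer register file; op_width is the LD width in bytes.
def ld_immediate_eas(ireg, RA, RA_isvec, els, immed, VL, op_width=8):
    eas = []
    for i in range(VL):
        if RA_isvec:                  # "quirky" Vector-indexed-with-immediate
            ea = ireg[RA + i] + immed
        elif els and immed != 0:      # element-strided: index scales the immediate
            ea = ireg[RA] + i * immed
        elif not els:                 # unit-strided: contiguous elements
            ea = ireg[RA] + immed + i * op_width
        else:                         # els=1, immed=0: LD-VSPLAT, same EA every element
            ea = ireg[RA] + immed
        eas.append(ea)
    return eas

regs = {3: 0x1000}                    # scalar r3 = 0x1000, VL=4
print([hex(ea) for ea in ld_immediate_eas(regs, 3, False, False, 16, 4)])  # unit stride
print([hex(ea) for ea in ld_immediate_eas(regs, 3, False, True,  16, 4)])  # element stride
print([hex(ea) for ea in ld_immediate_eas(regs, 3, False, True,   0, 4)])  # VSPLAT: same EA
```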

For `LD-VSPLAT`, on non-cache-inhibited Loads, the read can occur just
the once and be copied, rather than hitting the Data Cache multiple
times with the same memory read at the same location. The benefit of
Cache-inhibited LD-splats is that they allow for memory-mapped peripherals
to have multiple data values read in quick succession and stored in
sequentially numbered registers (but, see Note below).

For non-cache-inhibited ST from a vector source onto a scalar destination:
with the Vector loop effectively creating multiple memory writes to
the same location, we can deduce that the last of these will be the
"successful" one. Thus, implementations are free and clear to optimise
out the overwriting STs, leaving just the last one as the "winner".
Bear in mind that predicate masks will skip some elements (in source
non-zeroing mode). Cache-inhibited ST operations on the other hand
**MUST** write out a Vector source multiple successive times to the exact
same Scalar destination. Just like Cache-inhibited LDs, multiple values
may be written out in quick succession to a memory-mapped peripheral
from sequentially-numbered registers.
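
That deduction can be expressed as a rough model (illustrative only: `mem`,
`vec_src` and `mask` are stand-ins, not part of the specification): for a
non-cache-inhibited Store from a Vector source to an unchanging Scalar
Effective Address, only the last unmasked element needs to reach Memory.

```
# illustrative only: non-cache-inhibited ST, vector source, scalar destination EA.
# every element writes the same address, so the last unmasked element "wins";
# an implementation may legally skip all the earlier, overwritten stores.
def st_vector_to_scalar_ea(mem, ea, vec_src, mask):
    last = None
    for i, value in enumerate(vec_src):
        if (mask >> i) & 1:      # source non-zeroing: masked-out elements are skipped
            last = value
    if last is not None:
        mem[ea] = last           # a single write has the same visible effect
    return mem

print(st_vector_to_scalar_ea({}, 0x2000, [1, 2, 3, 4], 0b1011))  # {8192: 4}
```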

Note that any memory location may be Cache-inhibited
(Power ISA v3.1, Book III, 1.6.1, p1033).

*Programmer's Note: an immediate combined with a Scalar source as a "VSPLAT"
mode is simply not possible: there are not enough Mode bits. One single
Scalar Load operation may be used instead, followed by any arithmetic
operation (including a simple mv) in "Splat" mode.*

**LD/ST Indexed**

The modes for the `RA+RB` indexed version are slightly different
but use the same `RM.MODE` bits (19:23 of `RM`):

| 0 | 1 | 2   | 3 4    | description   |
|---|---|-----|--------|---------------|
|els| 0 | SEA | dz sz  | simple mode   |
|VLi| 1 | inv | CR-bit | ffirst CR sel |

Vector Indexed Strided Mode is qualified as follows:

```
if els and !RA.isvec and !RB.isvec:
    svctx.ldstmode = elementstride
```

A summary of the effect of Vectorisation of src or dest:

```
imm(RA)  RT.v  RA.v       no stride allowed
imm(RA)  RT.s  RA.v       no stride allowed
imm(RA)  RT.v  RA.s       stride-select allowed
imm(RA)  RT.s  RA.s       not vectorised
RA,RB    RT.v  {RA|RB}.v  Standard Indexed
RA,RB    RT.s  {RA|RB}.v  Indexed but single LD (no VSPLAT)
RA,RB    RT.v  {RA&RB}.s  VSPLAT possible. stride selectable
RA,RB    RT.s  {RA&RB}.s  not vectorised (scalar identity)
```

Signed Effective Address computation is only relevant for Vector Indexed
Mode, when elwidth overrides are applied. The source override applies to
RB: if SEA is set, RB is sign-extended from elwidth bits to the full 64
bits before being added to RA in order to calculate the Effective Address.
For other Modes (ffirst, saturate), all EA computation with elwidth
overrides is unsigned.
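
A short sketch of that rule, using an illustrative `sext` helper (the
function names and values below are assumptions for demonstration, not
normative pseudocode): RA is always read at the full 64 bits, while RB is
read at the source element width and, when SEA is set, sign-extended
before the addition.

```
# illustrative model of Signed Effective Address (SEA) for LD/ST Indexed
def sext(value, bits):
    # sign-extend a `bits`-wide two's-complement value to a Python int
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)

def indexed_ea(ra_full, rb_elem, src_elwidth, sea):
    # ra_full: RA read at the full 64 bits (never elwidth-overridden)
    # rb_elem: the RB element, read at src_elwidth bits
    offs = sext(rb_elem, src_elwidth) if sea else rb_elem
    return (ra_full + offs) & ((1 << 64) - 1)

# with an 8-bit source elwidth, 0xFF is -1 when SEA=1, +255 when SEA=0
print(hex(indexed_ea(0x1000, 0xFF, 8, sea=True)))   # 0xfff
print(hex(indexed_ea(0x1000, 0xFF, 8, sea=False)))  # 0x10ff
```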

Note that cache-inhibited LD/ST when VSPLAT is activated will perform
**multiple** LD/ST operations, sequentially. Even with scalar src
a Cache-inhibited LD will read the same memory location *multiple
times*, storing the result in successive Vector destination registers.
This is because the cache-inhibit instructions are typically used to read
and write memory-mapped peripherals. If a genuine cache-inhibited
LD-VSPLAT is required then a single *scalar* cache-inhibited LD should
be performed, followed by a VSPLAT-augmented mv, copying the one *scalar*
value into multiple register destinations.

Note also that cache-inhibited VSPLAT with Data-Dependent Fail-First is possible.
This allows, for example, a massive batch of memory-mapped peripheral
reads to be issued, stopping at the first NULL character and
truncating VL to that point. No branch is needed to issue that large
burst of LDs, which may be valuable in Embedded scenarios.

## Vectorisation of Scalar Power ISA v3.0B

Scalar Power ISA Load/Store operations may be seen from [[isa/fixedload]]
and [[isa/fixedstore]] pseudocode to be of the form:

```
lbux RT, RA, RB
    EA <- (RA) + (RB)
    RT <- MEM(EA)
```

and for immediate variants:

```
lb RT,D(RA)
    EA <- RA + EXTS(D)
    RT <- MEM(EA)
```

Thus in the first example, the source registers may each be independently
marked as scalar or vector, and likewise the destination; in the second
example only the one source and one dest may be marked as scalar or
vector.

Thus we can see that Vector Indexed may be covered, and, as demonstrated
with the pseudocode below, the immediate can be used to give unit
stride or element stride. As there is no way to tell which from
the Power v3.0B Scalar opcode alone, the choice is provided instead by
the SV Context.

```
# LD not VLD! format - ldop RT, immed(RA)
# op_width: lb=1, lh=2, lw=4, ld=8
op_load(RT, RA, op_width, immed, svctx, RAupdate):
  ps = get_pred_val(FALSE, RA); # predication on src
  pd = get_pred_val(FALSE, RT); # ... AND on dest
  for (i=0, j=0, u=0; i < VL && j < VL;):
    # skip nonpredicated elements
    if (RA.isvec) while (!(ps & 1<<i)) i++;
    if (RAupdate.isvec) while (!(ps & 1<<u)) u++;
    if (RT.isvec) while (!(pd & 1<<j)) j++;
    if postinc:
        offs = 0; # added afterwards
        if RA.isvec: srcbase = ireg[RA+i]
        else srcbase = ireg[RA]
    elif svctx.ldstmode == elementstride:
        # element stride mode
        srcbase = ireg[RA]
        offs = i * immed              # j*immed for a ST
    elif svctx.ldstmode == unitstride:
        # unit stride mode
        srcbase = ireg[RA]
        offs = immed + (i * op_width) # j*op_width for ST
    elif RA.isvec:
        # quirky Vector indexed mode but with an immediate
        srcbase = ireg[RA+i]
        offs = immed;
    else
        # standard scalar mode (but predicated)
        # no stride multiplier means VSPLAT mode
        srcbase = ireg[RA]
        offs = immed

    # compute EA
    EA = srcbase + offs
    # load from memory
    ireg[RT+j] <= MEM[EA];
    # check post-increment of EA
    if postinc: EA = srcbase + immed;
    # update RA?
    if RAupdate: ireg[RAupdate+u] = EA;
    if (!RT.isvec)
        break # destination scalar, end now
    if (RA.isvec) i++;
    if (RAupdate.isvec) u++;
    if (RT.isvec) j++;
```

Indexed LD is:

```
# format: ldop RT, RA, RB
function op_ldx(RT, RA, RB, RAupdate=False) # LD not VLD!
  ps = get_pred_val(FALSE, RA); # predication on src
  pd = get_pred_val(FALSE, RT); # ... AND on dest
  for (i=0, j=0, k=0, u=0; i < VL && j < VL && k < VL):
    # skip nonpredicated RA, RB and RT
    if (RA.isvec) while (!(ps & 1<<i)) i++;
    if (RAupdate.isvec) while (!(ps & 1<<u)) u++;
    if (RB.isvec) while (!(ps & 1<<k)) k++;
    if (RT.isvec) while (!(pd & 1<<j)) j++;
    if svctx.ldstmode == elementstride:
        EA = ireg[RA] + ireg[RB]*j   # register-strided
    else
        EA = ireg[RA+i] + ireg[RB+k] # indexed address
    if RAupdate: ireg[RAupdate+u] = EA
    ireg[RT+j] <= MEM[EA];
    if (!RT.isvec)
        break # destination scalar, end immediately
    if (RA.isvec) i++;
    if (RAupdate.isvec) u++;
    if (RB.isvec) k++;
    if (RT.isvec) j++;
```

Note that Element-Strided uses the Destination Step (j) because, with both
sources being Scalar as a prerequisite condition for activation of
Element-Stride Mode, the source step (being Scalar) would never advance.

Note in both cases that [[sv/svp64]] allows RA-as-a-dest in "update"
mode (`ldux`) to be effectively a *completely different* register from
RA-as-a-source. This is because there is room in svp64 to extend RA-as-src
as well as RA-as-dest, both independently as scalar or vector *and*
independently extending their range.

*Programmer's note: being able to set RA-as-a-source as separate from
RA-as-a-destination as Scalar is **extremely valuable** once it is
remembered that Simple-V element operations must be in Program Order,
especially in loops, for saving on multiple address computations. Care
does have to be taken, however, that RA-as-src is not overwritten by
RA-as-dest unless intentionally desired, especially in element-strided
Mode.*

## LD/ST Indexed vs Indexed REMAP

Unfortunately the word "Indexed" is used twice in completely different
contexts, potentially causing confusion.

* Instructions of the form `ld RT,RA,RB` have existed in the Power ISA since
  its creation: these are called "LD/ST Indexed" instructions and their
  name and meaning are well-established.
* There now exists, in Simple-V, a [[sv/remap]] mode called "Indexed"
  Mode that can be applied to *any* instruction **including those
  named LD/ST Indexed**.

Whilst allowing REMAP Indexed Mode to be applied to any Vectorised LD/ST
Indexed operation such as `sv.ld *RT,RA,*RB` may be costly in terms of
register reads, or may even misleadingly be labelled as redundant, firstly
the strict application of the RISC Paradigm that Simple-V follows makes
it awkward to consider *preventing* the application of Indexed REMAP to
such operations, and secondly the two are not actually the same at all.

Indexed REMAP, as applied to RB in the instruction `sv.ld *RT,RA,*RB`,
effectively performs an *in-place* re-ordering of the offsets, RB.
To achieve the same effect without Indexed REMAP would require taking
a *copy* of the Vector of offsets starting at RB, manually explicitly
reordering them, and finally using the copy of re-ordered offsets in a
non-REMAP'ed `sv.ld`. Using non-strided LD as an example, the pseudocode
below shows what actually occurs (the pseudocode for `indexed_remap`
may be found in [[sv/remap]]):

```
# sv.ld *RT,RA,*RB with Index REMAP applied to RB
for i in 0..VL-1:
    if remap.indexed:
        rb_idx = indexed_remap(i) # remap
    else:
        rb_idx = i # use the index as-is
    EA = GPR(RA) + GPR(RB+rb_idx)
    GPR(RT+i) = MEM(EA, 8)
```

Thus it can be seen that the use of Indexed REMAP saves copying
and manual reordering of the Vector of RB offsets.
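
For comparison, a sketch of the manual alternative (the names below are
placeholders, and the scratch copy is shown as a Python list rather than
as registers): without Indexed REMAP the offsets must first be gathered
into a re-ordered copy, and a plain indexed load then run over that copy.

```
# illustrative only: what must be done *without* Indexed REMAP
def ld_with_manual_reorder(GPR, MEM, RT, RA, RB, index, VL):
    # 1. copy the Vector of offsets at RB into a scratch area, re-ordered
    scratch = [GPR[RB + index[i]] for i in range(VL)]
    # 2. plain (non-REMAP) indexed load using the re-ordered copy
    for i in range(VL):
        EA = GPR[RA] + scratch[i]
        GPR[RT + i] = MEM[EA]

GPR = {0: 0x100, 4: 24, 5: 8, 6: 16, 10: 0, 11: 0, 12: 0}
MEM = {0x108: 7, 0x110: 8, 0x118: 9}
ld_with_manual_reorder(GPR, MEM, RT=10, RA=0, RB=4, index=[1, 2, 0], VL=3)
print(GPR[10], GPR[11], GPR[12])   # 7 8 9
```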

## LD/ST ffirst (Fault-First)

LD/ST ffirst treats the first LD/ST in a vector (element 0 if REMAP
is not active) as an ordinary one, with all behaviour with respect to
Interrupts, Exceptions, Page Faults and Memory Management being identical
in every regard to Scalar v3.0 Power ISA LD/ST. However, for elements
1 and above, if an exception would occur, then VL is **truncated**
to the previous element: the exception is **not** then raised because
the LD/ST that would otherwise have caused an exception is *required*
to be cancelled. Additionally an implementor may choose to truncate VL
for any arbitrary reason *except for the very first element*.

ffirst LD/ST to multiple pages via a Vectorised Index base is
considered a security risk due to the abuse of probing multiple
pages in rapid succession and getting speculative feedback on which
pages would fail. Therefore Vector Indexed LD/ST is prohibited
entirely, and the Mode bit is instead used for element-strided LD/ST.
See <https://bugs.libre-soc.org/show_bug.cgi?id=561>

```
for(i = 0; i < VL; i++)
    reg[rt + i] = mem[reg[ra] + i * reg[rb]];
```

High security implementations where any kind of speculative probing of
memory pages is considered a risk should take advantage of the fact
that implementations may truncate VL at any point, without requiring
software to be rewritten and made non-portable. Such implementations may
choose to *always* set VL=1, which will have the effect of terminating
any speculative probing (and also adversely affect performance), but
will at least not require applications to be rewritten.

Low-performance, simpler hardware implementations may likewise choose
(always) to set VL=1 as the bare minimum compliant implementation of LD/ST
Fail-First. It is, however, critically important to remember that the first
element LD/ST **MUST** be treated as an ordinary LD/ST, i.e. **MUST**
raise exceptions exactly like an ordinary LD/ST.
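
A behavioural sketch of that rule (a model only: a Python dictionary
stands in for memory and `KeyError` stands in for a page fault; none of
the names are normative):

```
# illustrative model of LD/ST Fail-First (ffirst) VL truncation
def ld_ffirst(mem, base, stride, VL):
    loaded = []
    for i in range(VL):
        ea = base + i * stride
        try:
            loaded.append(mem[ea])   # ordinary per-element load attempt
        except KeyError:             # stand-in for an exception / page fault
            if i == 0:
                raise                # element 0 MUST behave as an ordinary scalar LD
            VL = i                   # later elements: cancel the LD, truncate VL instead
            break
    return loaded, VL

mem = {0x1000: 7, 0x1008: 8}            # 0x1010 onwards is "unmapped"
print(ld_ffirst(mem, 0x1000, 8, VL=4))  # ([7, 8], 2): VL truncated, no exception raised
```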

For ffirst LD/STs, VL may be truncated arbitrarily to a nonzero value
for any implementation-specific reason. For example: it is perfectly
reasonable for implementations to alter VL when ffirst LD or ST operations
are initiated on a nonaligned boundary, such that within a loop the
subsequent iteration of that loop begins the following ffirst LD/ST
operations on an aligned boundary such as the beginning of a cache line,
or beginning of a Virtual Memory page. Likewise, to reduce workloads or
balance resources.

Vertical-First Mode is slightly strange in that only one element at a time
is ever executed anyway. Given that programmers may legitimately choose
to alter srcstep and dststep in non-sequential order as part of explicit
loops, it is neither possible nor safe to make speculative assumptions
about future LD/STs. Therefore, Fail-First LD/ST in Vertical-First is
`UNDEFINED`. This is very different from Arithmetic (Data-dependent)
FFirst where Vertical-First Mode is fully deterministic, not speculative.

## Data-Dependent Fail-First (not Fail/Fault-First)

Not to be confused with Fail/Fault First, Data-Fail-First performs an
additional check on the data, and if the test
fails then VL is truncated and further looping terminates.
This is precisely the same as Arithmetic Data-Dependent Fail-First,
the only difference being that the result comes from the LD/ST
rather than from an Arithmetic operation.

There is also a crucial difference between Arithmetic and LD/ST
Data-Dependent Fail-First: except for Store-Conditional, a 4-bit Condition
Register Field test is created for testing purposes
*but not stored* (thus there is no RC1 Mode as there is in Arithmetic).
The reason a CR Field is not stored is that Load/Store, particularly
the Update instructions, is already expensive in register terms,
and adding an extra Vector write would be too costly in hardware.

*Programmer's note: Programmers
may use Data-Dependent Load with a test to truncate VL, and may then
follow up with a `sv.cmpi` or other operation. The important aspect is
that the Vector Load is truncated on finding, for example, a NULL pointer.*

*Programmer's note: Load-with-Update may be used to update
the register used in Effective Address computation of the
next element. This may be used to perform single-linked-list
walking, where Data-Dependent Fail-First terminates and
truncates the Vector at the first NULL.*

In the case of Store operations there is a quirk when VLi (VL-inclusive)
is clear. Bear in mind the criterion is that the truncated
Vector of results, when VLi is clear, must all pass the "test", but when
VLi is set the *currently-failing element* is permitted to be included. Thus,
the actual update (store) to Memory is **not permitted to take place**
should the test fail. Therefore, on testing the value to be stored,
when VLi=0 and the test fails, the Memory store must **not** occur.

Additionally, when VLi=0 and a test fails then RA does **not** receive a
copy of the Effective Address. Hardware implementations with Out-of-Order
Micro-Architectures should use speculative Shadow-Hold and Cancellation
when the test fails.

By contrast, if VLi=1 and the test fails, Store may proceed *and then*
looping terminates. In this way, when non-Inclusive, the Vector of
Truncated results contains only Stores that passed the test (and RA=EA
updates if any), and when Inclusive the Vector of Truncated results
contains the first-failed data.

Below is an example of loading the starting addresses of Linked-List
nodes. If VLi=1 it will load the NULL pointer into the Vector of results.
If, however, VLi=0 it will *exclude* the NULL pointer by truncating VL to
one Element earlier.

*Programmer's Note: by also setting the RC1 qualifier as well as setting
VLi=1 it is possible to establish a Predicate Mask such that the first
zero in the predicate will be the NULL pointer.*

```
RT=1    # vec - deliberately overlaps by one with RA
RA=0    # vec - first one is valid, contains ptr
imm = 8 # offset_of(ptr->next)
for i in range(VL):
    # this part is the Scalar Defined Word (standard scalar ld operation)
    EA = GPR(RA+i) + imm                   # ptr + offset(next)
    data = MEM(EA, 8)                      # 64-bit address of ptr->next
    GPR(RT+i) = data                       # happens to be read on next loop!
    # was a normal vector-ld up to this point. now the Data-Fail-First
    cr_test = conditions(data)
    if Rc=1 or RC1: CR.field(i) = cr_test  # only store if Rc=1/RC1
    if cr_test.EQ == testbit:              # check if zero
        if VLi then VL = i+1               # update VL, inclusive
        else        VL = i                 # update VL, exclusive current
        break                              # stop looping
```
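
The same walk can be expressed as a small runnable model (illustrative
only: a Python dictionary stands in for memory, and the "test" is simply
`data == 0`, i.e. a NULL next-pointer; none of the names are normative):

```
# illustrative model of the Data-Dependent Fail-First linked-list walk above
def ddff_list_walk(mem, head, next_offs, VL, VLi):
    gpr = [0] * (VL + 2)
    gpr[0] = head                       # RA=0 holds the first node pointer
    for i in range(VL):
        ea = gpr[0 + i] + next_offs     # EA = GPR(RA+i) + imm
        data = mem[ea]                  # 64-bit load of ptr->next
        gpr[1 + i] = data               # RT=1: overlaps RA by one, read on next loop
        if data == 0:                   # the fail-first data test: NULL pointer
            VL = i + 1 if VLi else i    # inclusive keeps the NULL, exclusive drops it
            break
    return gpr[1:1 + VL], VL

# three nodes at 0x100 -> 0x200 -> 0x300 -> NULL, with 'next' at offset 8
mem = {0x108: 0x200, 0x208: 0x300, 0x308: 0}
print(ddff_list_walk(mem, 0x100, 8, VL=8, VLi=False))  # ([512, 768], 2)
print(ddff_list_walk(mem, 0x100, 8, VL=8, VLi=True))   # ([512, 768, 0], 3)
```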

**Data-Dependent Fault-First on Store-Conditional (Rc=1)**

There are very few instructions that allow Rc=1 for Load/Store:
among them are `stdcx.` and the other Atomic Store-Conditional
instructions. With Simple-V being a loop around Scalar instructions
strictly obeying Scalar Program Order, a Horizontal-First Fail-First loop
on an Atomic Store-Conditional will always fail the second and all
subsequent Store-Conditional instructions, because
Load-Reservation and Store-Conditional are required to be executed
in pairs.

By contrast, in Vertical-First Mode it is in fact possible to issue
the pairs, and consequently allowing Vectorised Data-Dependent Fail-First is
useful.

*Programmer's note: care should be taken when VL is truncated in
Vertical-First Mode.*

**Future potential**

Although Rc=1 on LD/ST is a rare occurrence at present, future versions
of Power ISA *might* conceivably have Rc=1 LD/ST Scalar instructions, and
with the SVP64 Vectorisation Prefixing being a RISC paradigm that
is itself fully independent of the Scalar Suffix Defined Words, prohibiting
the possibility of Rc=1 Data-Dependent Mode on future potential LD/ST
operations is not strategically sound.

## LOAD/STORE Elwidths <a name="elwidth"></a>

Loads and Stores are almost unique in that the Power Scalar ISA
provides a width for the operation (lb, lh, lw, ld). Only `extsb` and
others like it provide an explicit operation width. There are therefore
*three* widths involved:

* operation width (lb=8, lh=16, lw=32, ld=64)
* src element width override (8/16/32/default)
* destination element width override (8/16/32/default)

Some care is therefore needed to express and make clear the transformations,
which are expressly in this order (the two destination-width adjustments
are sketched after the list):

* Calculate the Effective Address from RA at full width
  but (on Indexed Load) allow srcwidth overrides on RB
* Load at the operation width (lb/lh/lw/ld) as usual
* byte-reversal as usual
* Non-saturated mode:
  - zero-extension or truncation from operation width to dest elwidth
  - place result in destination at dest elwidth
* Saturated mode:
  - Sign-extension or truncation from operation width to dest width
  - signed/unsigned saturation down to dest elwidth
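
Below is a sketch of those two destination-width paths. The helper names
(`adjust_wid`, `clamp`) deliberately match the ones used in the pseudocode
further down; their bodies here are a model of the intent only, not a
normative definition.

```
# illustrative models of the two dest-elwidth paths after the memory read
def adjust_wid(value, op_width_bits, dest_elwidth_bits):
    # non-saturated: zero-extend or truncate the loaded value to dest elwidth
    return value & ((1 << dest_elwidth_bits) - 1)

def clamp(value, op_width_bits, dest_elwidth_bits, signed):
    # saturated: sign-extend (if signed) from the operation width first,
    # then saturate down to the destination element width
    sign = 1 << (op_width_bits - 1)
    v = (value & (sign - 1)) - (value & sign) if signed else value
    if signed:
        lo, hi = -(1 << (dest_elwidth_bits - 1)), (1 << (dest_elwidth_bits - 1)) - 1
    else:
        lo, hi = 0, (1 << dest_elwidth_bits) - 1
    return max(lo, min(hi, v))

print(hex(adjust_wid(0x1234, 32, 8)))       # 0x34: simple truncation
print(clamp(0x1234, 32, 8, signed=False))   # 255: unsigned saturation to 8 bits
print(clamp(0xFE00, 16, 8, signed=True))    # -128: -512 saturates to the signed 8-bit min
```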

In order to respect Power v3.0B Scalar behaviour the memory side
is treated effectively as completely separate and distinct from SV
augmentation. This is primarily down to quirks surrounding LE/BE and
byte-reversal.

It is rather unfortunately possible to request an elwidth override on
the memory side which does not mesh with the overridden operation width:
this results in `UNDEFINED` behaviour. The reason is that the effect
of attempting a 64-bit `sv.ld` operation with a source elwidth override
of 8/16/32 would result in overlapping memory requests, particularly
on unit and element strided operations. Thus it is `UNDEFINED` when
the elwidth is smaller than the memory operation width. Examples include
`sv.lw/sw=16/els` which requests (overlapping) 4-byte memory reads offset
from each other at 2-byte intervals. Store likewise is also `UNDEFINED`
where the dest elwidth override is less than the operation width.

Note the following regarding the pseudocode to follow:

* `scalar identity behaviour` SV Context parameter conditions turn this
  into a straight absolute fully-compliant Scalar v3.0B LD operation
* `brev` selects whether the operation is the byte-reversed variant (`ldbrx`
  rather than `ld`)
* `op_width` specifies the operation width (`lb`, `lh`, `lw`, `ld`) as
  a "normal" part of Scalar v3.0B LD
* `imm_offs` specifies the immediate offset `ld r3, imm_offs(r5)`, again
  as a "normal" part of Scalar v3.0B LD
* `svctx` specifies the SV Context and includes VL as well as
  source and destination elwidth overrides.

Below is the pseudocode for Unit-Strided LD (which includes Vector
capability). Observe in particular that RA, as the base address in both
Immediate and Indexed LD/ST, does not have element-width overriding
applied to it.

Note that predication, predication-zeroing, and other modes except
saturation have all been removed, for clarity and simplicity:

```
# LD not VLD!
# this covers unit stride mode and a type of vector offset
function op_ld(RT, RA, op_width, imm_offs, svctx)
  for (int i = 0, int j = 0; i < svctx.VL && j < svctx.VL):
    if not svctx.unit/el-strided:
        # strange vector mode, compute 64 bit address which is
        # not polymorphic! elwidth hardcoded to 64 here
        srcbase = get_polymorphed_reg(RA, 64, i)
    else:
        # unit / element stride mode, compute 64 bit address
        srcbase = get_polymorphed_reg(RA, 64, 0)
        # adjust for unit/el-stride
        srcbase += ....

    # read the underlying memory
    memread <= MEM(srcbase + imm_offs, op_width)

    # check saturation.
    if svctx.saturation_mode:
        # ... saturation adjustment...
        memread = clamp(memread, op_width, svctx.dest_elwidth)
    else:
        # truncate/extend to over-ridden dest width.
        memread = adjust_wid(memread, op_width, svctx.dest_elwidth)

    # takes care of inserting memory-read (now correctly byteswapped)
    # into regfile underlying LE-defined order, into the right place
    # within the NEON-like register, respecting destination element
    # bitwidth, and the element index (j)
    set_polymorphed_reg(RT, svctx.dest_elwidth, j, memread)

    # increments both src and dest element indices (no predication here)
    i++;
    j++;
```

Note above that the source elwidth is *not used at all* in LD-immediate.
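
`get_polymorphed_reg` and `set_polymorphed_reg` are defined along with the
rest of the SVP64 pseudocode; purely for illustration, a minimal model of
the intent (elements at the overridden width packed consecutively,
LE-ordered, into a flat byte-level view of the register file) is sketched
below. The model is an assumption for readability, not the normative
definition.

```
# illustrative model only: elements of width `elwidth` bits packed LE into
# a flat byte-addressable view of the register file, starting at register RA
regfile = bytearray(128 * 8)      # 128 x 64-bit GPRs as a flat byte array

def get_polymorphed_reg(RA, elwidth, i):
    nbytes = elwidth // 8
    offs = RA * 8 + i * nbytes
    return int.from_bytes(regfile[offs:offs + nbytes], "little")

def set_polymorphed_reg(RT, elwidth, j, value):
    nbytes = elwidth // 8
    offs = RT * 8 + j * nbytes
    regfile[offs:offs + nbytes] = value.to_bytes(nbytes, "little")

set_polymorphed_reg(4, 16, 3, 0xBEEF)       # element 3 of a 16-bit view of r4
print(hex(get_polymorphed_reg(4, 16, 3)))   # 0xbeef
print(hex(get_polymorphed_reg(4, 64, 0)))   # 0xbeef000000000000
```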

For LD/Indexed, the key is that in the calculation of the Effective Address,
RA has no elwidth override but RB does. Pseudocode below is simplified
for clarity: predication and all modes except saturation are removed:

```
# LD not VLD! ld*rx if brev else ld*
function op_ld(RT, RA, RB, op_width, svctx, brev)
  for (int i = 0, int j = 0; i < svctx.VL && j < svctx.VL):
    if not svctx.el-strided:
        # RA not polymorphic! elwidth hardcoded to 64 here
        srcbase = get_polymorphed_reg(RA, 64, i)
    else:
        # element stride mode, again RA not polymorphic
        srcbase = get_polymorphed_reg(RA, 64, 0)
    # RB *is* polymorphic
    offs = get_polymorphed_reg(RB, svctx.src_elwidth, i)
    # sign-extend
    if svctx.SEA: offs = sext(offs, svctx.src_elwidth, 64)

    # takes care of (merges) processor LE/BE and ld/ldbrx
    bytereverse = brev XNOR MSR.LE

    # read the underlying memory
    memread <= MEM(srcbase + offs, op_width)

    # optionally performs byteswap at op width
    if (bytereverse):
        memread = byteswap(memread, op_width)

    if svctx.saturation_mode:
        # ... saturation adjustment...
        memread = clamp(memread, op_width, svctx.dest_elwidth)
    else:
        # truncate/extend to over-ridden dest width.
        memread = adjust_wid(memread, op_width, svctx.dest_elwidth)

    # takes care of inserting memory-read (now correctly byteswapped)
    # into regfile underlying LE-defined order, into the right place
    # within the NEON-like register, respecting destination element
    # bitwidth, and the element index (j)
    set_polymorphed_reg(RT, svctx.dest_elwidth, j, memread)

    # increments both src and dest element indices (no predication here)
    i++;
    j++;
```

## Remapped LD/ST

In the [[sv/remap]] page the concept of "Remapping" is described. Whilst
it is expensive to set up (2 64-bit opcodes minimum) it provides a way to
arbitrarily perform 1D, 2D and 3D "remapping" of up to 64 elements worth
of LDs or STs. The usual interest in such re-mapping is for example in
separating out 24-bit RGB channel data into separate contiguous registers.
NEON covers this as shown in the diagram below:

![Load/Store remap](/openpower/sv/load-store.svg)

REMAP easily covers this capability, and with dest elwidth overrides
and saturation may do so with built-in conversion that would normally
require additional width-extension, sign-extension and min/max Vectorised
instructions as post-processing stages.

Thus we do not need to provide specialist LD/ST "Structure Packed" opcodes
because the generic abstracted concept of "Remapping", when applied to
LD/ST, will give that same capability, with far more flexibility.

It is worth noting that Pack/Unpack Modes of SVSTATE, which may be
established through `svstep`, are also an easy way to perform regular
Structure Packing, at the vec2/vec3/vec4 granularity level. Beyond that,
REMAP will need to be used.
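
As an illustration of the intent (not of any particular REMAP encoding or
Pack/Unpack setting), de-interleaving packed RGB byte triplets into three
separate contiguous destination groups amounts to the following gather,
which a REMAP or vec3 Pack/Unpack Schedule performs without the explicit
index arithmetic:

```
# illustrative only: the "structure packing" effect that REMAP/Pack-Unpack
# provides, shown as an explicit de-interleave of RGB byte triplets
def deinterleave_rgb(mem, base, npixels):
    r, g, b = [], [], []
    for i in range(npixels):
        r.append(mem[base + 3 * i + 0])
        g.append(mem[base + 3 * i + 1])
        b.append(mem[base + 3 * i + 2])
    return r, g, b

pixels = {0x100 + k: v for k, v in enumerate([10, 20, 30, 11, 21, 31, 12, 22, 32])}
print(deinterleave_rgb(pixels, 0x100, 3))  # ([10, 11, 12], [20, 21, 22], [30, 31, 32])
```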

**Parallel Reduction REMAP**

No REMAP Schedule is prohibited in SVP64 because the RISC-paradigm Prefix
is completely separate from the RISC-paradigm Scalar Defined Words. Although
obscure, there does exist the outside possibility that Parallel Reduction
Schedules on LD/ST would find a use in Computer Science.
Readers are invited to contact the authors of this document if one is ever
found.

--------

[[!tag standards]]

\newpage{}