1 [[!tag standards]]
2
3 # SV Load and Store
4
5 Links:
6
7 * <https://bugs.libre-soc.org/show_bug.cgi?id=561>
8 * <https://bugs.libre-soc.org/show_bug.cgi?id=572>
9 * <https://bugs.libre-soc.org/show_bug.cgi?id=571>
10 * <https://bugs.libre-soc.org/show_bug.cgi?id=940> autoincrement mode
11 * <https://llvm.org/devmtg/2016-11/Slides/Emerson-ScalableVectorizationinLLVMIR.pdf>
12 * <https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vector-loads-and-stores>
13 * [[ldst/discussion]]
14
15 # Rationale
16
All Vector ISAs dating back fifty years have extensive and comprehensive
Load and Store operations that go far beyond the capabilities of Scalar
RISC and most CISC processors, yet at their heart, on an individual
element basis, they may be found to be no different from RISC Scalar
equivalents.
21
The resource savings from Vector LD/ST are significant and stem from
the fact that one single instruction can trigger a dozen (or, in some
microarchitectures such as Cray or NEC SX Aurora, hundreds of)
element-level Memory accesses.
25
26 Additionally, and simply: if the Arithmetic side of an ISA supports
27 Vector Operations, then in order to keep the ALUs 100% occupied the
28 Memory infrastructure (and the ISA itself) correspondingly needs Vector
29 Memory Operations as well.
30
31 Vectorised Load and Store also presents an extra dimension (literally)
32 which creates scenarios unique to Vector applications, that a Scalar
33 (and even a SIMD) ISA simply never encounters. SVP64 endeavours to
34 add the modes typically found in *all* Scalable Vector ISAs,
35 without changing the behaviour of the underlying Base
36 (Scalar) v3.0B operations in any way.
37
38 # Modes overview
39
Vectorisation of Load and Store requires the creation, from scalar operations,
of a number of different modes:
42
43 * **fixed aka "unit" stride** - contiguous sequence with no gaps
44 * **element strided** - sequential but regularly offset, with gaps
45 * **vector indexed** - vector of base addresses and vector of offsets
46 * **Speculative fail-first** - where it makes sense to do so
47 * **Structure Packing** - covered in SV by [[sv/remap]] and Pack/Unpack Mode.
48
*Despite being constructed from Scalar LD/ST, none of these Modes
exist or make sense in any Scalar ISA. They **only** exist in Vector ISAs.*
51
52 Also included in SVP64 LD/ST is both signed and unsigned Saturation,
53 as well as Element-width overrides and Twin-Predication.
54
55 Note also that Indexed [[sv/remap]] mode may be applied to both
56 v3.0 LD/ST Immediate instructions *and* v3.0 LD/ST Indexed instructions.
57 LD/ST-Indexed should not be conflated with Indexed REMAP mode: clarification
58 is provided below.
59
60 **Determining the LD/ST Modes**
61
62 A minor complication (caused by the retro-fitting of modern Vector
63 features to a Scalar ISA) is that certain features do not exactly make
sense or are considered a security risk. Fail-first on Vector Indexed
would allow attackers to probe large numbers of pages from userspace, whereas
strided fail-first (which creates contiguous sequential LDs) does not.
67
In addition, reduce mode makes no sense for LD/ST.
Realistically, therefore, an alternative table definition is needed
for [[sv/svp64]] `RM.MODE`.
The following modes make sense:
72
73 * saturation
74 * predicate-result (mostly for cache-inhibited LD/ST)
75 * simple (no augmentation)
76 * fail-first (where Vector Indexed is banned)
77 * Signed Effective Address computation (Vector Indexed only)
78 * Pack/Unpack (on LD/ST immediate operations only)
79
More than that, however, it is necessary to fit the usual Vector ISA
capabilities onto both Power ISA LD/ST with immediate and onto
LD/ST Indexed. They present subtly different Mode tables which, due
to lack of space, have the following quirks:
84
85 * LD/ST Immediate has no individual control over src/dest zeroing,
86 whereas LD/ST Indexed does.
87 * LD/ST Immediate has no Saturated Pack/Unpack (Arithmetic Mode does)
88 * LD/ST Indexed has no Pack/Unpack (REMAP may be used instead)
89
90 # Format and fields
91
92 Fields used in tables below:
93
* **sz / dz** if predication is enabled will put zeros into the dest (or as src in the case of twin pred) when the predicate bit is zero. Otherwise the element is ignored or skipped, depending on context.
95 * **zz**: both sz and dz are set equal to this flag.
96 * **inv CR bit** just as in branches (BO) these bits allow testing of a CR bit and whether it is set (inv=0) or unset (inv=1)
97 * **N** sets signed/unsigned saturation.
98 * **RC1** as if Rc=1, stores CRs *but not the result*
99 * **SEA** - Signed Effective Address, if enabled performs sign-extension on
100 registers that have been reduced due to elwidth overrides
101
102 **LD/ST immediate**
103
104 The table for [[sv/svp64]] for `immed(RA)` which is `RM.MODE`
105 (bits 19:23 of `RM`) is:
106
107 | 0-1 | 2 | 3 4 | description |
108 | --- | --- |---------|--------------------------- |
109 | 00 | 0 | zz els | simple mode |
110 | 00 | 1 | PI LF | post-increment and Fault-First |
111 | 01 | inv | CR-bit | Rc=1: ffirst CR sel |
112 | 01 | inv | els RC1 | Rc=0: ffirst z/nonz |
113 | 10 | N | zz els | sat mode: N=0/1 u/s |
114 | 11 | inv | CR-bit | Rc=1: pred-result CR sel |
115 | 11 | inv | els RC1 | Rc=0: pred-result z/nonz |
116
117 The `els` bit is only relevant when `RA.isvec` is clear: this indicates
118 whether stride is unit or element:
119
    if RA.isvec:
        svctx.ldstmode = indexed
    elif els == 0:
        svctx.ldstmode = unitstride
    elif immediate != 0:
        svctx.ldstmode = elementstride
126
127 An immediate of zero is a safety-valve to allow `LD-VSPLAT`:
128 in effect the multiplication of the immediate-offset by zero results
129 in reading from the exact same memory location, *even with a Vector
130 register*. (Normally this type of behaviour is reserved for the
131 mapreduce modes)
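
Below is a small, non-normative Python sketch that combines the `els`
rule above with the immediate-of-zero (`LD-VSPLAT`) observation; the
function name and the string labels are purely illustrative:

    def determine_ldstmode(RA_isvec, els, immediate):
        """Illustrative only: select the LD/ST-immediate addressing mode."""
        if RA_isvec:
            return "indexed"        # quirky Vector-indexed-with-immediate mode
        if els == 0:
            return "unitstride"     # contiguous sequence, no gaps
        if immediate != 0:
            return "elementstride"  # sequential but regularly offset
        # els=1 with an immediate of zero: every element reads the exact
        # same memory location, i.e. LD-VSPLAT as described above
        return "elementstride (LD-VSPLAT)"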
132
133 For `LD-VSPLAT`, on non-cache-inhibited Loads, the read can occur
134 just the once and be copied, rather than hitting the Data Cache
135 multiple times with the same memory read at the same location.
136 The benefit of Cache-inhibited LD-splats is that it allows
137 for memory-mapped peripherals to have multiple
138 data values read in quick succession and stored in sequentially
139 numbered registers (but, see Note below).
140
141 For non-cache-inhibited ST from a vector source onto a scalar
142 destination: with the Vector
143 loop effectively creating multiple memory writes to the same location,
144 we can deduce that the last of these will be the "successful" one. Thus,
145 implementations are free and clear to optimise out the overwriting STs,
146 leaving just the last one as the "winner". Bear in mind that predicate
147 masks will skip some elements (in source non-zeroing mode).
148 Cache-inhibited ST operations on the other hand **MUST** write out
149 a Vector source multiple successive times to the exact same Scalar
150 destination. Just like Cache-inhibited LDs, multiple values may be
151 written out in quick succession to a memory-mapped peripheral from
152 sequentially-numbered registers.
153
154 Note that any memory location may be Cache-inhibited
155 (Power ISA v3.1, Book III, 1.6.1, p1033)
156
*Programmer's Note: a "VSPLAT" mode combining an immediate with a
Scalar source is simply not possible: there are not enough
Mode bits. A single Scalar Load operation may be used instead, followed
by any arithmetic operation (including a simple mv) in "Splat"
mode.*
162
163 **LD/ST Indexed**
164
165 The modes for `RA+RB` indexed version are slightly different
166 but are the same `RM.MODE` bits (19:23 of `RM`):
167
168 | 0-1 | 2 | 3 4 | description |
169 | --- | --- |---------|-------------------------- |
170 | 00 | SEA | dz sz | simple mode |
171 | 01 | SEA | dz sz | Strided (scalar only source) |
172 | 10 | N | dz sz | sat mode: N=0/1 u/s |
173 | 11 | inv | CR-bit | Rc=1: pred-result CR sel |
174 | 11 | inv | zz RC1 | Rc=0: pred-result z/nonz |
175
176 Vector Indexed Strided Mode is qualified as follows:
177
    if mode = 0b01 and !RA.isvec and !RB.isvec:
        svctx.ldstmode = elementstride
180
181 A summary of the effect of Vectorisation of src or dest:
182
    imm(RA)  RT.v  RA.v       no stride allowed
    imm(RA)  RT.s  RA.v       no stride allowed
    imm(RA)  RT.v  RA.s       stride-select allowed
    imm(RA)  RT.s  RA.s       not vectorised
    RA,RB    RT.v  {RA|RB}.v  Standard Indexed
    RA,RB    RT.s  {RA|RB}.v  Indexed but single LD (no VSPLAT)
    RA,RB    RT.v  {RA&RB}.s  VSPLAT possible. stride selectable
    RA,RB    RT.s  {RA&RB}.s  not vectorised (scalar identity)
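
The same summary may be sketched in Python for the Indexed (`RA,RB`)
rows only (an illustration, not part of the specification; the function
name and return labels are invented here):

    def indexed_ldst_summary(mode, RA_isvec, RB_isvec, RT_isvec):
        """Illustrative only: the RA,RB (Indexed) rows of the table above."""
        if RA_isvec or RB_isvec:
            # at least one source is a Vector: standard Indexed behaviour
            if RT_isvec:
                return "Standard Indexed"
            return "Indexed but single LD (no VSPLAT)"
        # both sources Scalar
        if not RT_isvec:
            return "not vectorised (scalar identity)"
        # Strided mode (0b01) is qualified on both sources being Scalar
        return "elementstride" if mode == 0b01 else "VSPLAT possible"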
191
Signed Effective Address computation is only relevant for
Vector Indexed Mode, when elwidth overrides are applied.
The source override applies to RB: if SEA is set, RB is
sign-extended from elwidth bits to the full 64 bits before
being added to RA to calculate the Effective Address.
For other Modes (ffirst, saturate),
all EA computation with elwidth overrides is unsigned.
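
A minimal Python sketch of that sign-extension step (the `sext` helper
and the example values are illustrative, not part of the specification):

    def sext(value, from_bits, to_bits=64):
        """Sign-extend 'value' from 'from_bits' wide to 'to_bits' wide."""
        value &= (1 << from_bits) - 1        # mask to the source width
        if value & (1 << (from_bits - 1)):   # negative: extend with ones
            value -= 1 << from_bits
        return value & ((1 << to_bits) - 1)

    # SEA example: an 8-bit RB element of 0xF0 (-16) added to a 64-bit RA
    RA, RB_elem = 0x1000, 0xF0
    EA = (RA + sext(RB_elem, 8)) & ((1 << 64) - 1)   # 0x0FF0, not 0x10F0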
199
Note that cache-inhibited LD/ST, when VSPLAT is activated, will perform
**multiple** LD/ST operations, sequentially. Even with a scalar src a
Cache-inhibited LD will read the same memory location *multiple times*,
storing the result in successive Vector destination registers. This is
because the cache-inhibit instructions are typically used to read and
write memory-mapped peripherals.
202 If a genuine cache-inhibited LD-VSPLAT is required then a single *scalar*
203 cache-inhibited LD should be performed, followed by a VSPLAT-augmented mv,
204 copying the one *scalar* value into multiple register destinations.
205
Note also that cache-inhibited VSPLAT with Predicate-result is possible.
This allows, for example, a massive batch of memory-mapped peripheral
reads to be issued, stopping at the first NULL (zero) character and
truncating VL to that point. No branch is needed to issue that large
burst of LDs, which may be valuable in Embedded scenarios.
211
212 # Vectorisation of Scalar Power ISA v3.0B
213
214 Scalar Power ISA Load/Store operations may be seen from [[isa/fixedload]] and
215 [[isa/fixedstore]] pseudocode to be of the form:
216
217 lbux RT, RA, RB
218 EA <- (RA) + (RB)
219 RT <- MEM(EA)
220
221 and for immediate variants:
222
223 lb RT,D(RA)
224 EA <- RA + EXTS(D)
225 RT <- MEM(EA)
226
227 Thus in the first example, the source registers may each be independently
228 marked as scalar or vector, and likewise the destination; in the second
229 example only the one source and one dest may be marked as scalar or
230 vector.
231
232 Thus we can see that Vector Indexed may be covered, and, as demonstrated
233 with the pseudocode below, the immediate can be used to give unit stride or element stride. With there being no way to tell which from the Power v3.0B Scalar opcode alone, the choice is provided instead by the SV Context.
234
    # LD not VLD! format - ldop RT, immed(RA)
    # op_width: lb=1, lh=2, lw=4, ld=8
    op_load(RT, RA, op_width, immed, svctx, RAupdate):
      ps = get_pred_val(FALSE, RA); # predication on src
      pd = get_pred_val(FALSE, RT); # ... AND on dest
      for (i=0, j=0, u=0; i < VL && j < VL;):
        # skip nonpredicated elements
        if (RA.isvec) while (!(ps & 1<<i)) i++;
        if (RAupdate.isvec) while (!(ps & 1<<u)) u++;
        if (RT.isvec) while (!(pd & 1<<j)) j++;
        if postinc:
            offs = 0; # added afterwards
            if RA.isvec: srcbase = ireg[RA+i]
            else         srcbase = ireg[RA]
        elif svctx.ldstmode == elementstride:
            # element stride mode
            srcbase = ireg[RA]
            offs = i * immed              # j*immed for a ST
        elif svctx.ldstmode == unitstride:
            # unit stride mode
            srcbase = ireg[RA]
            offs = immed + (i * op_width) # j*op_width for ST
        elif RA.isvec:
            # quirky Vector indexed mode but with an immediate
            srcbase = ireg[RA+i]
            offs = immed;
        else
            # standard scalar mode (but predicated)
            # no stride multiplier means VSPLAT mode
            srcbase = ireg[RA]
            offs = immed

        # compute EA
        EA = srcbase + offs
        # load from memory
        ireg[RT+j] <= MEM[EA];
        # check post-increment of EA
        if postinc: EA = srcbase + immed;
        # update RA?
        if RAupdate: ireg[RAupdate+u] = EA;
        if (!RT.isvec)
            break # destination scalar, end now
        if (RA.isvec) i++;
        if (RAupdate.isvec) u++;
        if (RT.isvec) j++;
280
281 Indexed LD is:
282
    # format: ldop RT, RA, RB
    function op_ldx(RT, RA, RB, RAupdate=False) # LD not VLD!
      ps = get_pred_val(FALSE, RA); # predication on src
      pd = get_pred_val(FALSE, RT); # ... AND on dest
      for (i=0, j=0, k=0, u=0; i < VL && j < VL && k < VL):
        # skip nonpredicated RA, RB and RT
        if (RA.isvec) while (!(ps & 1<<i)) i++;
        if (RAupdate.isvec) while (!(ps & 1<<u)) u++;
        if (RB.isvec) while (!(ps & 1<<k)) k++;
        if (RT.isvec) while (!(pd & 1<<j)) j++;
        if svctx.ldstmode == elementstride:
            EA = ireg[RA] + ireg[RB]*j   # register-strided
        else
            EA = ireg[RA+i] + ireg[RB+k] # indexed address
        if RAupdate: ireg[RAupdate+u] = EA
        ireg[RT+j] <= MEM[EA];
        if (!RT.isvec)
            break # destination scalar, end immediately
        if (RA.isvec) i++;
        if (RAupdate.isvec) u++;
        if (RB.isvec) k++;
        if (RT.isvec) j++;
305
Note that Element-Strided Mode uses the Destination Step (j) because
both sources being Scalar is a prerequisite condition for activation of
Element-Stride Mode, and consequently the source step (being Scalar)
would never advance.
309
Note in both cases that [[sv/svp64]] allows RA-as-a-dest in "update"
mode (`ldux`) to be effectively a *completely different* register from
RA-as-a-source. This is because there is room in svp64 to extend
RA-as-src as well as RA-as-dest, both independently as scalar or vector
*and* independently extending their range.
311
312 *Programmer's note: being able to set RA-as-a-source
313 as separate from RA-as-a-destination as Scalar is **extremely valuable**
314 once it is remembered that Simple-V element operations must
315 be in Program Order, especially in loops, for saving on
316 multiple address computations. Care does have
317 to be taken however that RA-as-src is not overwritten by
318 RA-as-dest unless intentionally desired, especially in element-strided Mode.*
319
320 # LD/ST Indexed vs Indexed REMAP
321
322 Unfortunately the word "Indexed" is used twice in completely different
323 contexts, potentially causing confusion.
324
* Instructions of the form `ld RT,RA,RB` have existed in the Power ISA since
  its creation: these are called "LD/ST Indexed" instructions and their
  name and meaning are well-established.
328 * There now exists, in Simple-V, a [[sv/remap]] mode called "Indexed"
329 Mode that can be applied to *any* instruction **including those
330 named LD/ST Indexed**.
331
Whilst allowing REMAP Indexed Mode to be applied to any Vectorised
LD/ST Indexed operation such as `sv.ld *RT,RA,*RB` may be costly in
terms of register reads, or may even misleadingly be labelled as
redundant, firstly the strict application of the RISC Paradigm that
Simple-V follows makes it awkward to consider *preventing* the
application of Indexed REMAP to such operations, and secondly the two
are not actually the same at all.
339
340 Indexed REMAP, as applied to RB in the instruction `sv.ld *RT,RA,*RB`
341 effectively performs an *in-place* re-ordering of the offsets, RB.
342 To achieve the same effect without Indexed REMAP would require taking
343 a *copy* of the Vector of offsets starting at RB, manually explicitly
344 reordering them, and finally using the copy of re-ordered offsets in
a non-REMAP'ed `sv.ld`. Using non-strided LD as an example, the
pseudocode below shows what actually occurs (the pseudocode for
`indexed_remap` may be found in [[sv/remap]]):
348
    # sv.ld *RT,RA,*RB with Index REMAP applied to RB
    for i in 0..VL-1:
        if remap.indexed:
            rb_idx = indexed_remap(i) # remap
        else:
            rb_idx = i                # use the index as-is
        EA = GPR(RA) + GPR(RB+rb_idx)
        GPR(RT+i) = MEM(EA, 8)
357
358 Thus it can be seen that the use of Indexed REMAP saves copying
359 and manual reordering of the Vector of RB offsets.
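
For comparison, a plain-Python model (registers and memory replaced by
lists and a dict, names invented for illustration) of what software
would have to do *without* Indexed REMAP:

    def ld_indexed_without_remap(mem, RA, rb_offsets, index_map):
        # take a copy of the Vector of offsets and explicitly reorder it...
        reordered = [rb_offsets[index_map[i]] for i in range(len(index_map))]
        # ...then perform an ordinary (non-REMAP) indexed load using the copy
        return [mem[RA + offs] for offs in reordered]

    mem = {0x100: 11, 0x108: 22, 0x110: 33, 0x118: 44}
    print(ld_indexed_without_remap(mem, 0x100, [0, 8, 16, 24], [3, 2, 1, 0]))
    # [44, 33, 22, 11]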
360
361 # LD/ST ffirst
362
LD/ST ffirst treats the first LD/ST in a vector (element 0 if REMAP
is not active) as an
ordinary one, with all behaviour with respect to Interrupts, Exceptions,
Page Faults and Memory Management being identical in every regard to Scalar
v3.0 Power ISA LD/ST. However for elements 1
and above, if an exception would occur, then VL is **truncated** to the
previous element: the exception is **not** raised, because the
LD/ST that would otherwise have caused it is *required* to be cancelled.
Additionally an implementor may choose to truncate VL for
any arbitrary reason *except for the very first element*.
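
The truncation rule may be modelled (non-normatively) in Python as
follows; `would_fault` and the data structures are invented purely for
illustration:

    def ffirst_load(mem, base, stride, VL, would_fault):
        regs = []
        for i in range(VL):
            ea = base + i * stride
            if would_fault(ea):
                if i == 0:
                    # element 0 behaves exactly like a Scalar LD: the
                    # exception *must* be raised as normal
                    raise MemoryError("page fault at %#x" % ea)
                # elements 1 and above: truncate VL instead of raising
                return regs, i          # i is the new (truncated) VL
            regs.append(mem.get(ea, 0))
        return regs, VL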
372
373 ffirst LD/ST to multiple pages via a Vectorised Index base is considered a security risk due to the abuse of probing multiple pages in rapid succession and getting speculative feedback on which pages would fail. Therefore Vector Indexed LD/ST is prohibited entirely, and the Mode bit instead used for element-strided LD/ST. See <https://bugs.libre-soc.org/show_bug.cgi?id=561>
374
    for(i = 0; i < VL; i++)
        reg[rt + i] = mem[reg[ra] + i * reg[rb]];
377
378 High security implementations where any kind of speculative probing
379 of memory pages is considered a risk should take advantage of the fact that
380 implementations may truncate VL at any point, without requiring software
381 to be rewritten and made non-portable. Such implementations may choose
382 to *always* set VL=1 which will have the effect of terminating any
383 speculative probing (and also adversely affect performance), but will
384 at least not require applications to be rewritten.
385
Simpler low-performance hardware implementations may also choose to
always set VL=1 as the bare minimum compliant implementation of
LD/ST Fail-First. It is however critically important to remember that
the first element LD/ST **MUST** be treated as an ordinary LD/ST, i.e.
**MUST** raise exceptions exactly like an ordinary LD/ST.
391
392 For ffirst LD/STs, VL may be truncated arbitrarily to a nonzero value for any implementation-specific reason. For example: it is perfectly reasonable for implementations to alter VL when ffirst LD or ST operations are initiated on a nonaligned boundary, such that within a loop the subsequent iteration of that loop begins the following ffirst LD/ST operations on an aligned boundary
such as the beginning of a cache line, or the beginning of a Virtual Memory
page. Likewise, VL may be truncated to reduce workloads or to balance
resources.
395
396 Vertical-First Mode is slightly strange in that only one element
397 at a time is ever executed anyway. Given that programmers may
398 legitimately choose to alter srcstep and dststep in non-sequential
399 order as part of explicit loops, it is neither possible nor
400 safe to make speculative assumptions about future LD/STs.
401 Therefore, Fail-First LD/ST in Vertical-First is `UNDEFINED`.
402 This is very different from Arithmetic (Data-dependent) FFirst
403 where Vertical-First Mode is fully deterministic, not speculative.
404
405 # LOAD/STORE Elwidths <a name="elwidth"></a>
406
407 Loads and Stores are almost unique in that the Power Scalar ISA
408 provides a width for the operation (lb, lh, lw, ld). Only `extsb` and
409 others like it provide an explicit operation width. There are therefore
410 *three* widths involved:
411
412 * operation width (lb=8, lh=16, lw=32, ld=64)
413 * src element width override (8/16/32/default)
414 * destination element width override (8/16/32/default)
415
416 Some care is therefore needed to express and make clear the transformations,
417 which are expressly in this order:
418
419 * Calculate the Effective Address from RA at full width
420 but (on Indexed Load) allow srcwidth overrides on RB
421 * Load at the operation width (lb/lh/lw/ld) as usual
422 * byte-reversal as usual
423 * Non-saturated mode:
424 - zero-extension or truncation from operation width to dest elwidth
425 - place result in destination at dest elwidth
426 * Saturated mode:
427 - Sign-extension or truncation from operation width to dest width
428 - signed/unsigned saturation down to dest elwidth
429
430 In order to respect Power v3.0B Scalar behaviour the memory side
431 is treated effectively as completely separate and distinct from SV
432 augmentation. This is primarily down to quirks surrounding LE/BE and
433 byte-reversal.
434
435 It is rather unfortunately possible to request an elwidth override
436 on the memory side which
437 does not mesh with the overridden operation width: these result in
438 `UNDEFINED`
439 behaviour. The reason is that the effect of attempting a 64-bit `sv.ld`
440 operation with a source elwidth override of 8/16/32 would result in
441 overlapping memory requests, particularly on unit and element strided
442 operations. Thus it is `UNDEFINED` when the elwidth is smaller than
443 the memory operation width. Examples include `sv.lw/sw=16/els` which
444 requests (overlapping) 4-byte memory reads offset from
445 each other at 2-byte intervals. Store likewise is also `UNDEFINED`
446 where the dest elwidth override is less than the operation width.
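
A quick numeric illustration of the overlap described above (purely
illustrative arithmetic, using byte offsets):

    # 4-byte (lw) reads issued every 2 bytes (sw=16 element stride): the
    # byte ranges of successive element reads overlap, hence UNDEFINED
    op_width, elwidth = 4, 2
    reads = [(i * elwidth, i * elwidth + op_width) for i in range(4)]
    print(reads)   # [(0, 4), (2, 6), (4, 8), (6, 10)]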
447
448 Note the following regarding the pseudocode to follow:
449
450 * `scalar identity behaviour` SV Context parameter conditions turn this
451 into a straight absolute fully-compliant Scalar v3.0B LD operation
452 * `brev` selects whether the operation is the byte-reversed variant (`ldbrx`
453 rather than `ld`)
454 * `op_width` specifies the operation width (`lb`, `lh`, `lw`, `ld`) as
455 a "normal" part of Scalar v3.0B LD
456 * `imm_offs` specifies the immediate offset `ld r3, imm_offs(r5)`, again
457 as a "normal" part of Scalar v3.0B LD
458 * `svctx` specifies the SV Context and includes VL as well as
459 source and destination elwidth overrides.
460
461 Below is the pseudocode for Unit-Strided LD (which includes Vector capability). Observe in particular that RA, as the base address in
462 both Immediate and Indexed LD/ST,
463 does not have element-width overriding applied to it.
464
465 Note that predication, predication-zeroing,
466 and other modes except saturation have all been removed,
467 for clarity and simplicity:
468
    # LD not VLD!
    # this covers unit stride mode and a type of vector offset
    function op_ld(RT, RA, op_width, imm_offs, svctx)
      for (int i = 0, int j = 0; i < svctx.VL && j < svctx.VL):
        if not svctx.unit/el-strided:
            # strange vector mode, compute 64 bit address which is
            # not polymorphic! elwidth hardcoded to 64 here
            srcbase = get_polymorphed_reg(RA, 64, i)
        else:
            # unit / element stride mode, compute 64 bit address
            srcbase = get_polymorphed_reg(RA, 64, 0)
            # adjust for unit/el-stride
            srcbase += ....

        # read the underlying memory
        memread <= MEM(srcbase + imm_offs, op_width)

        # check saturation.
        if svctx.saturation_mode:
            # ... saturation adjustment...
            memread = clamp(memread, op_width, svctx.dest_elwidth)
        else:
            # truncate/extend to over-ridden dest width.
            memread = adjust_wid(memread, op_width, svctx.dest_elwidth)

        # takes care of inserting memory-read (now correctly byteswapped)
        # into regfile underlying LE-defined order, into the right place
        # within the NEON-like register, respecting destination element
        # bitwidth, and the element index (j)
        set_polymorphed_reg(RT, svctx.dest_elwidth, j, memread)

        # increments both src and dest element indices (no predication here)
        i++;
        j++;
503
504 Note above that the source elwidth is *not used at all* in LD-immediate.
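
One possible Python model of the `get_polymorphed_reg` and
`set_polymorphed_reg` helpers used above is given below; the flat
byte-array layout of the register file is an assumption made purely to
keep the sketch concrete and self-contained:

    # model the GPR file as a flat little-endian byte array: 128 x 8 bytes
    regfile = bytearray(128 * 8)

    def get_polymorphed_reg(reg, elwidth, element):
        """Read element 'element', 'elwidth' bits wide, starting at GPR 'reg'."""
        nbytes = elwidth // 8
        offs = reg * 8 + element * nbytes
        return int.from_bytes(regfile[offs:offs + nbytes], "little")

    def set_polymorphed_reg(reg, elwidth, element, value):
        """Write element 'element', 'elwidth' bits wide, starting at GPR 'reg'."""
        nbytes = elwidth // 8
        offs = reg * 8 + element * nbytes
        regfile[offs:offs + nbytes] = value.to_bytes(nbytes, "little")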
505
506 For LD/Indexed, the key is that in the calculation of the Effective Address,
507 RA has no elwidth override but RB does. Pseudocode below is simplified
508 for clarity: predication and all modes except saturation are removed:
509
    # LD not VLD! ld*rx if brev else ld*
    function op_ld(RT, RA, RB, op_width, svctx, brev)
      for (int i = 0, int j = 0; i < svctx.VL && j < svctx.VL):
        if not svctx.el-strided:
            # RA not polymorphic! elwidth hardcoded to 64 here
            srcbase = get_polymorphed_reg(RA, 64, i)
        else:
            # element stride mode, again RA not polymorphic
            srcbase = get_polymorphed_reg(RA, 64, 0)
        # RB *is* polymorphic
        offs = get_polymorphed_reg(RB, svctx.src_elwidth, i)
        # sign-extend
        if svctx.SEA: offs = sext(offs, svctx.src_elwidth, 64)

        # takes care of (merges) processor LE/BE and ld/ldbrx
        bytereverse = brev XNOR MSR.LE

        # read the underlying memory
        memread <= MEM(srcbase + offs, op_width)

        # optionally performs byteswap at op width
        if (bytereverse):
            memread = byteswap(memread, op_width)

        if svctx.saturation_mode:
            # ... saturation adjustment...
            memread = clamp(memread, op_width, svctx.dest_elwidth)
        else:
            # truncate/extend to over-ridden dest width.
            memread = adjust_wid(memread, op_width, svctx.dest_elwidth)

        # takes care of inserting memory-read (now correctly byteswapped)
        # into regfile underlying LE-defined order, into the right place
        # within the NEON-like register, respecting destination element
        # bitwidth, and the element index (j)
        set_polymorphed_reg(RT, svctx.dest_elwidth, j, memread)

        # increments both src and dest element indices (no predication here)
        i++;
        j++;
550
551 # Remapped LD/ST
552
553 In the [[sv/remap]] page the concept of "Remapping" is described.
554 Whilst it is expensive to set up (2 64-bit opcodes minimum) it provides
555 a way to arbitrarily perform 1D, 2D and 3D "remapping" of up to 64
556 elements worth of LDs or STs. The usual interest in such re-mapping
557 is for example in separating out 24-bit RGB channel data into separate
558 contiguous registers. NEON covers this as shown in the diagram below:
559
![Load/Store remap](/openpower/sv/load-store.svg)
561
562 Remap easily covers this capability, and with dest
563 elwidth overrides and saturation may do so with built-in conversion that
564 would normally require additional width-extension, sign-extension and
565 min/max Vectorised instructions as post-processing stages.
566
567 Thus we do not need to provide specialist LD/ST "Structure Packed" opcodes
568 because the generic abstracted concept of "Remapping", when applied to
569 LD/ST, will give that same capability, with far more flexibility.
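
As a plain-Python illustration of the *effect* being described
(structure de-interleaving of packed RGB data; this is not SVP64 code,
just a model of the data movement involved):

    # de-interleave packed R,G,B bytes into three contiguous "register"
    # arrays -- the rearrangement that LD/ST REMAP (or Pack/Unpack)
    # performs as part of a single Vectorised LD/ST
    flat = [10, 20, 30, 11, 21, 31, 12, 22, 32]   # R,G,B triples in memory
    r, g, b = flat[0::3], flat[1::3], flat[2::3]
    print(r, g, b)   # [10, 11, 12] [20, 21, 22] [30, 31, 32]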
570
571 It is worth noting that Pack/Unpack Modes of SVSTATE, which may be
572 established through sv.setvl, are also an easy way to perform regular
573 Structure Packing, at the vec2/vec3/vec4 granularity level. Beyond
574 that, REMAP will need to be used.