[libreriscv.git] / simple_v_extension / v_comparative_analysis.mdwn
# V-Extension to Simple-V Comparative Analysis

[[!toc ]]

This section covers the ways in which Simple-V is comparable
to, or more flexible than, the V-Extension (V2.3-draft). Also covered is
one major weak point (register files are of fixed size, where V is
arbitrary-length), and how best to deal with that, should V be adapted
to sit on top of Simple-V.

The first stages of this section go over each of the relevant sections
of the V2.3-draft V specification.

# 17.3 Shape Encoding

Simple-V's proposed means of expressing whether a register (from the
standard integer or the standard floating-point file) is a scalar or
a vector is to simply set the vector length to 1. The instruction
would however have to specify which register file (integer or FP) the
vector-length was to be applied to.

Extended shapes (2-D etc.) would not be part of Simple-V at all.

# 17.4 Representation Encoding

Simple-V would not have representation-encoding. This is part of
polymorphism, which is considered too complex to implement (TODO: confirm?).

# 17.5 Element Bitwidth

This is directly equivalent to Simple-V's "Packed" mode, and implies that
integer (or floating-point) registers are divided down into vector-indexable
chunks of size Bitwidth.

In this way it becomes possible to have ADD effectively and implicitly
turn into ADDb (8-bit add), ADDw (16-bit add) and so on, and where the
vector-length has been set to greater than 1, it becomes a "Packed"
(SIMD) instruction.

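As a hedged illustration of the implicit Packed behaviour described above (function name and semantics are assumptions for illustration, not from either spec), here is a minimal C sketch of a 32-bit ADD with element-bitwidth set to 8, i.e. four independent lanes with no carry propagation across lane boundaries:

```c
#include <stdint.h>

/* Illustrative sketch: an ADD on a 32-bit register with element-bitwidth
 * set to 8 behaves as four independent 8-bit adds (ADDb), a "Packed"
 * SIMD operation in which no carry propagates between lanes. */
uint32_t packed_add8(uint32_t a, uint32_t b)
{
    uint32_t result = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint8_t ea = (uint8_t)(a >> (lane * 8));
        uint8_t eb = (uint8_t)(b >> (lane * 8));
        uint8_t sum = (uint8_t)(ea + eb);   /* per-lane carry discarded */
        result |= (uint32_t)sum << (lane * 8);
    }
    return result;
}
```

Note how an overflowing lane wraps within its 8 bits rather than carrying into its neighbour, which is exactly what distinguishes a Packed add from a plain 32-bit add.
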
It remains to be decided what should be done when RV32 / RV64 ADD (sized)
opcodes are used. One useful idea would be, on an RV64 system where
a 32-bit-sized ADD was performed, to simply use the least significant
32 bits of the register (exactly as is currently done) but at the same
time to *respect the packed bitwidth as well*.

The extended encoding (Table 17.6) would not be part of Simple-V.

# 17.6 Base Vector Extension Supported Types

TODO: analyse. Probably exactly the same.

# 17.7 Maximum Vector Element Width

No equivalent in Simple-V.

# 17.8 Vector Configuration Registers

TODO: analyse.

# 17.9 Legal Vector Unit Configurations

TODO: analyse.

# 17.10 Vector Unit CSRs

TODO: analyse.

> Ok so this is an aspect of Simple-V that I hadn't thought through,
> yet (proposal / idea only a few days old!).  in V2.3-Draft ISA Section
> 17.10 the CSRs are listed.  I note that there's some general-purpose
> CSRs (including a global/active vector-length) and 16 vcfgN CSRs.  i
> don't precisely know what those are for.

>  In the Simple-V proposal, *every* register in both the integer
> register-file *and* the floating-point register-file would have at
> least a 2-bit "data-width" CSR and probably something like an 8-bit
> "vector-length" CSR (less in RV32E, by exactly one bit).

>  What I *don't* know is whether that would be considered perfectly
> reasonable or completely insane.  If it turns out that the proposed
> Simple-V CSRs can indeed be stored in SRAM then I would imagine that
> adding somewhere in the region of 10 bits per register would be... okay?
> I really don't honestly know.

>  Would these proposed 10-or-so-bit per-register Simple-V CSRs need to
> be multi-ported? No I don't believe they would.

# 17.11 Maximum Vector Length (MVL)

Implicitly, this is set to the maximum size of the register
file multiplied by the number of 8-bit packed ints that can fit into
a register (4 for RV32, 8 for RV64 and 16 for RV128).

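The arithmetic above can be sketched as follows (a hypothetical helper, assuming the standard 32-entry RV integer register file):

```c
/* Sketch of the implicit MVL calculation: 32 registers multiplied by
 * the number of 8-bit packed ints per register (XLEN/8: 4 for RV32,
 * 8 for RV64, 16 for RV128). */
int simple_v_mvl(int xlen_bits)
{
    int num_regs = 32;
    int packed_per_reg = xlen_bits / 8;
    return num_regs * packed_per_reg;
}
```

So the implicit MVL would be 128 on RV32, 256 on RV64 and 512 on RV128.
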
# 17.12 Vector Instruction Formats

No equivalent in Simple-V because *all* instructions of *all* Extensions
are implicitly parallelised (and packed).

# 17.13 Polymorphic Vector Instructions

Polymorphism (implicit type-casting) is deliberately not supported
in Simple-V.

# 17.14 Rapid Configuration Instructions

TODO: analyse whether it would be useful to have an equivalent in Simple-V.

# 17.15 Vector-Type-Change Instructions

TODO: analyse whether it would be useful to have an equivalent in Simple-V.

# 17.16 Vector Length

Has a direct corresponding equivalent.

# 17.17 Predicated Execution

Predicated Execution is another name for "masking" or "tagging". Masked
(or tagged) implies that there is a bit field which is indexed, with each
bit associated with the correspondingly-indexed element register within
the "Vector". If the tag / mask bit is 1, when a parallel operation is
issued the indexed element of the vector has the operation carried out.
However if the tag / mask bit is *zero*, that particular indexed element
of the vector does *not* have the requested operation carried out.

In V2.3-draft V, there is a significant (not recommended) difference:
the zero-tagged elements are *set to zero*. This loses a *significant*
advantage of mask / tagging, particularly if the entire mask register
is itself a general-purpose register, as that general-purpose register
can be inverted, shifted, AND'ed, OR'ed and so on. In other words
it becomes possible, especially if Carry/Overflow from each vector
operation is also accessible, to do conditional (step-by-step) vector
operations, including things like turning vectors into 1024-bit or greater
operands with very few instructions, by treating the "carry" from
one instruction as a way to do "conditional add of 1 to the register
next door". If V2.3-draft V sets zero-tagged elements to zero, such
extremely powerful techniques are simply not possible.

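The two predication semantics can be contrasted in a short C sketch (function names are illustrative, not from either spec): Simple-V-style masking leaves unselected destination elements untouched, whereas the V2.3-draft behaviour described above zeroes them:

```c
#include <stdint.h>
#include <stddef.h>

/* Simple-V-style semantics: destination elements whose mask bit is 0
 * are left completely untouched. */
void masked_add_preserve(int64_t *dst, const int64_t *a, const int64_t *b,
                         uint64_t mask, size_t vl)
{
    for (size_t i = 0; i < vl; i++)
        if ((mask >> i) & 1)
            dst[i] = a[i] + b[i];
}

/* V2.3-draft-style semantics: elements whose mask bit is 0 are
 * overwritten with zero, destroying the prior destination contents. */
void masked_add_zeroing(int64_t *dst, const int64_t *a, const int64_t *b,
                        uint64_t mask, size_t vl)
{
    for (size_t i = 0; i < vl; i++)
        dst[i] = ((mask >> i) & 1) ? a[i] + b[i] : (int64_t)0;
}
```

With the preserving variant, two passes with inverted masks merge cleanly into one destination; with the zeroing variant, the second pass wipes out the first.
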
It is noted that there is no mention of an equivalent to BEXT (element
skipping), which would be particularly fascinating and powerful to have.
In this mode, the "mask" would skip elements where its mask bit was zero
in either the source or the destination operand.

Lots to be discussed.

# 17.18 Vector Load/Store Instructions

The Vector Load/Store instructions as proposed in V are extremely powerful
and can be used for reordering and regular restructuring.

Vector Load:

    if (unit-strided) stride = elsize;
    else stride = areg[as2]; // constant-strided
    for (int i=0; i<vl; ++i)
      if ([!]preg[p][i])
        for (int j=0; j<seglen+1; j++)
          vreg[vd+j][i] = mem[areg[as1] + (i*(seglen+1)+j)*stride];

Store:

    if (unit-strided) stride = elsize;
    else stride = areg[as2]; // constant-strided
    for (int i=0; i<vl; ++i)
      if ([!]preg[p][i])
        for (int j=0; j<seglen+1; j++)
          mem[areg[base] + (i*(seglen+1)+j)*stride] = vreg[vd+j][i];

Indexed Load:

    for (int i=0; i<vl; ++i)
      if ([!]preg[p][i])
        for (int j=0; j<seglen+1; j++)
          vreg[vd+j][i] = mem[sreg[base] + vreg[vs2][i] + j*elsize];

Indexed Store:

    for (int i=0; i<vl; ++i)
      if ([!]preg[p][i])
        for (int j=0; j<seglen+1; j++)
          mem[sreg[base] + vreg[vs2][i] + j*elsize] = vreg[vd+j][i];

Keeping these instructions as-is for Simple-V is highly recommended.
However: one of the goals of this Extension is to retro-fit (re-use)
existing RV Load/Store:

[[!table data="""
31 20 | 19 15 | 14 12 | 11 7 | 6 0 |
imm[11:0] | rs1 | funct3 | rd | opcode |
12 | 5 | 3 | 5 | 7 |
offset[11:0] | base | width | dest | LOAD |
"""]]

[[!table data="""
31 25 | 24 20 | 19 15 | 14 12 | 11 7 | 6 0 |
imm[11:5] | rs2 | rs1 | funct3 | imm[4:0] | opcode |
7 | 5 | 5 | 3 | 5 | 7 |
offset[11:5] | src | base | width | offset[4:0] | STORE |
"""]]

The RV32 instruction opcodes are as follows:

[[!table data="""
31 28 27 | 26 25 | 24 20 | 19 15 | 14 | 13 12 | 11 7 | 6 0 | op |
imm[4:0] | 00 | 00000 | rs1 | 1 | m | vd | 0000111 | VLD |
imm[4:0] | 01 | rs2 | rs1 | 1 | m | vd | 0000111 | VLDS |
imm[4:0] | 11 | vs2 | rs1 | 1 | m | vd | 0000111 | VLDX |
vs3 | 00 | 00000 | rs1 | 1 | m | imm[4:0] | 0100111 | VST |
vs3 | 01 | rs2 | rs1 | 1 | m | imm[4:0] | 0100111 | VSTS |
vs3 | 11 | vs2 | rs1 | 1 | m | imm[4:0] | 0100111 | VSTX |
"""]]

Conversion on LOAD is as follows:

* rd or rs1 are CSR-vectorised, indicating "Vector Mode"
* rd is equivalent to vd
* rs1 is equivalent to rs1
* imm[4:0] from the RV format (bits 11..7) is the same
* imm[9:5] from the RV format (bits 29..25) is rs2 (rs2=00000 for VLD)
* imm[11:10] from the RV format (bits 31..30) is the opcode (VLD, VLDS, VLDX)
* width from the RV format (bits 14..12) is the same (width and zero/sign extend)

[[!table data="""
31 30 | 29 25 | 24 20 | 19 15 | 14 12 | 11 7 | 6 0 |
imm[11:0] ||| rs1 | funct3 | rd | opcode |
2 | 5 | 5 | 5 | 3 | 5 | 7 |
00 | 00000 | imm[4:0] | base | width | dest | LOAD |
01 | rs2 | imm[4:0] | base | width | dest | LOAD.S |
11 | rs2 | imm[4:0] | base | width | dest | LOAD.X |
"""]]

A similar conversion on STORE is as follows:

[[!table data="""
31 30 | 29 25 | 24 20 | 19 15 | 14 12 | 11 7 | 6 0 |
imm[11:0] ||| rs1 | funct3 | rd | opcode |
2 | 5 | 5 | 5 | 3 | 5 | 7 |
00 | 00000 | src | base | width | offs[4:0] | STORE |
01 | rs3 | src | base | width | offs[4:0] | STORE.S |
11 | rs3 | src | base | width | offs[4:0] | STORE.X |
"""]]

Notes:

* The predication CSR-marking register is not explicitly shown in the instruction.
* In both LOAD and STORE, it is now possible to use rs2 (or rs3) as a vector.
* That in turn means that Indexed Load need not have an explicit opcode.
* That in turn means that bit 30 may indicate "stride" and bit 31 is free.

Revised LOAD:

[[!table data="""
31 | 30 | 29 25 | 24 20 | 19 15 | 14 12 | 11 7 | 6 0 |
imm[11:0] |||| rs1 | funct3 | rd | opcode |
1 | 1 | 5 | 5 | 5 | 3 | 5 | 7 |
? | s | rs2 | imm[4:0] | base | width | dest | LOAD |
"""]]

Where in turn the pseudo-code may now combine the two:

    if (unit-strided) stride = elsize;
    else stride = areg[as2]; // constant-strided
    for (int i=0; i<vl; ++i)
      if ([!]preg[p][i])
        for (int j=0; j<seglen+1; j++)
        {
          if (CSRvectorised[rs2])
            offs = vreg[rs2][i];
          else
            offs = i*(seglen+1)*stride;
          vreg[vd+j][i] = mem[sreg[base] + offs + j*stride];
        }

Notes:

* j is multiplied by stride, not elsize, including in the rs2 vectorised case.
* There may be more sophisticated variants involving the 31st bit; however,
it would be nice to reserve that bit for post-increment of address registers.

# 17.19 Vector Register Gather

TODO

# TODO, sort

> However, there are also several features that go beyond simply attaching VL
> to a scalar operation and are crucial to being able to vectorize a lot of
> code. To name a few:
> - Conditional execution (i.e., predicated operations)
> - Inter-lane data movement (e.g. SLIDE, SELECT)
> - Reductions (e.g., VADD with a scalar destination)

Ok so the Conditional and also the Reductions are some of the reasons
why, as part of SimpleV / variable-SIMD / parallelism (gah, gotta think
of a decent name), i proposed that it be implemented as "if you say r0
is to be a vector / SIMD, that means operations actually take place on
r0,r1,r2... r(N-1)".

Consequently any parallel operation could be paused (or, more
specifically, vectors disabled by resetting the length back to a default /
scalar / vector-length=1), yet the results would actually be in the
*main register file* (integer or float), and so anything that wasn't
possible to easily do in "simple" parallel terms could be done *out*
of parallel "mode" instead.

I do appreciate that the above does imply that there is a limit to the
length that SimpleV (whatever) can be parallelised, namely that you
run out of registers! my thought there was, "leave space for the main
V-Ext proposal to extend it to the length that V currently supports".
Honestly i had not thought through precisely how that would work.

Inter-lane (SELECT): i saw 17.19 in V2.3-Draft p117 and liked it;
it reminds me of the discussion with Clifford on bit-manipulation
(gather-scatter except not Bit Gather Scatter, *data* gather scatter): if
applied "globally and outside of V and P", SLIDE and SELECT might become
an extremely powerful way to do fast memory copy and reordering [2].

However I haven't quite got my head round how that would work: i am
used to the concept of register "tags" (the modern term is "masks")
and i *think* if "masks" were applied to a Simple-V-enhanced LOAD /
STORE you would get the exact same thing as SELECT.

SLIDE you could do simply by setting, say, r0's vector-length to 16
(meaning that if referred to in any operation it would be an implicit
parallel operation on *all* registers r0 through r15), and temporarily
setting, say, r7's vector-length to 5. Do a LOAD on r7 and it would
implicitly mean "load from memory into r7 through r11". Then you go
back and do an operation on r0 and ta-daa, you're actually doing an
operation on a SLID (SLIDED?) vector.

The advantage of Simple-V (whatever) over V would be that you could
actually do *operations* in the middle of vectors (not just SLIDEs)
simply by (as above) setting r0's vector-length to 16 and r7's vector-length
to 5. There would be nothing preventing you from doing an ADD on r0
(which meant "do an ADD on r0 through r15") followed *immediately in the
next instruction, with no setup cost*, by a MUL on r7 (which actually meant
"do a parallel MUL on r7 through r11").

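The overlapping-vector idea above can be sketched as a toy model (purely illustrative, with register and CSR handling vastly simplified): one flat register file, where an operation's vector-length decides how many consecutive registers it touches:

```c
#include <stdint.h>

/* Toy model of overlapping vectors in one flat register file: an
 * operation on register rd with vector-length vlen touches
 * rd..rd+vlen-1.  So after treating r0 as vector-length 16, an
 * operation on r7 with vector-length 5 works on the middle (r7..r11)
 * of the r0..r15 "vector", with no setup cost. */
int64_t regfile[32];

void vec_add_imm(int rd, int vlen, int64_t imm)
{
    for (int i = 0; i < vlen; i++)
        regfile[rd + i] += imm;   /* implicit parallel op over vlen regs */
}
```

Issuing `vec_add_imm(0, 16, 1)` then `vec_add_imm(7, 5, 10)` shows the point: the second operation lands in the middle of the register range the first one covered.
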
Btw it's worth mentioning that you'd get scalar-vector and vector-scalar
implicitly by having one of the source registers be vector-length 1
(the default) and the other being N > 1, but without having special opcodes
to do it. i *believe* (or more like "logically infer or deduce", as
i haven't got access to the spec) that that would result in a further
opcode reduction when comparing [draft] V-Ext to [proposed] Simple-V.

Also, Reduction *might* be possible by specifying that the destination be
a scalar (vector-length=1) whilst the source is a vector. However... it
would be an awful lot of work to go through *every single instruction*
in *every* Extension, working out which ones could be parallelised (ADD,
MUL, XOR) and those that definitely could not (DIV, SUB). Is that worth
the effort? Maybe. Would it result in huge complexity? Probably.
Could an implementor just go "I ain't doing *that* as parallel!
Let's make it virtual-parallelism (sequential reduction) instead"?
Absolutely. So, now that I think it through, Simple-V (whatever)
covers Reduction as well. Huh, that's a surprise.

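A sequential-reduction fallback of the kind described ("virtual parallelism") might look like the following sketch, where the scalar-destination convention is an assumption:

```c
#include <stdint.h>

/* Sketch of virtual-parallelism reduction: destination is a scalar
 * (vector-length 1), source is a vector, and the implementation falls
 * back to a sequential loop.  Associative ops (ADD, MUL, XOR) would
 * also permit a parallel tree; non-associative ones (SUB, DIV) would
 * not. */
int64_t reduce_add(const int64_t *src, int vl)
{
    int64_t acc = 0;
    for (int i = 0; i < vl; i++)
        acc += src[i];            /* sequential reduction */
    return acc;
}
```
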
> - Vector-length speculation (making it possible to vectorize some loops with
> unknown trip count) - I don't think this part of the proposal is written
> down yet.

Now that _is_ an interesting concept. A little scary, i imagine, with
the possibility of putting a processor into a hard infinite execution
loop... :)

> Also, note the vector ISA consumes relatively little opcode space (all the
> arithmetic fits in 7/8ths of a major opcode). This is mainly because data
> type and size is a function of runtime configuration, rather than of opcode.

Yes. i love that aspect of V; i am a huge fan of polymorphism [1],
which is why i am keen to advocate that the same runtime principle be
extended to the rest of the RISC-V ISA [3].

Yikes, that's a lot. I'm going to need to pull this into the wiki to
make sure it's not lost.

[1] Inherent data type conversion: 25 years ago i designed a hypothetical
hyper-hyper-hyper-escape-code-sequencing ISA based around 2-bit
(escape-extended) opcodes and 2-bit (escape-extended) operands that
only required a fixed 8-bit instruction length. That relied heavily
on polymorphism and runtime size configurations as well. At the time
I thought it would have meant one HELL of a lot of CSRs... but then I
met RISC-V and was cured instantly of that delusion^Wmisapprehension :)

[2] Interestingly, if you then also add in the other aspect of Simple-V
(the data-size, which is effectively functionally orthogonal / identical
to "Packed" of Packed-SIMD), masked and packed *and* vectored LOAD / STORE
operations become byte / half-word / word augmenters of B-Ext's proposed
"BGS", i.e. where B-Ext's BGS dealt with bits, masked-packed-vectored
LOAD / STORE would deal with 8 / 16 / 32 bits at a time. Where it
would get really REALLY interesting would be masked-packed-vectored
B-Ext BGS instructions. I can't even get my head fully round that,
which is a good sign that the combination would be *really* powerful :)

[3] Ok, sadly, maybe not the polymorphism: it's too complicated, and i
think it would be much too hard for implementors to easily "slide in" to an
existing non-Simple-V implementation. i say that despite really *really*
wanting IEEE 754 FP half-precision to end up somewhere in RISC-V in some
fashion, for optimising 3D graphics. *sigh*.

# TODO: analyse, auto-increment on unit-stride and constant-stride

So i thought about that for a day or so, and wondered if it would be
possible to propose a variant of zero-overhead loop that included
auto-incrementing the two address registers a2 and a3, as well as
providing a means to interact between the zero-overhead loop and the
vsetvl instruction. A sort-of pseudo-assembly of that would look like:

    # a2 to be auto-incremented by t0 times 4
    zero-overhead-set-auto-increment a2, t0, 4
    # a3 to be auto-incremented by t0 times 4
    zero-overhead-set-auto-increment a3, t0, 4
    zero-overhead-set-loop-terminator-condition a0 zero
    zero-overhead-set-start-end stripmine, stripmine+endoffset
    stripmine:
        vsetvl t0,a0
        vlw v0, a2
        vlw v1, a3
        vfma v1, a1, v0, v1
        vsw v1, a3
        sub a0, a0, t0
    stripmine+endoffset:

The question is: would something like this even be desirable? It's a
variant of auto-increment [1]. The last time i saw any hint of auto-increment
register opcodes was in the 1980s... 68000, if i recall correctly... yep,
see [1].

[1] http://fourier.eng.hmc.edu/e85_old/lectures/instruction/node6.html

Reply:

Another option for auto-increment is for vector-memory-access instructions
to support post-increment addressing for unit-stride and constant-stride
modes. This can be implemented by the scalar unit passing the operation
to the vector unit while itself executing an appropriate multiply-and-add
to produce the incremented address. This does *not* require additional
ports on the scalar register file, unlike scalar post-increment addressing
modes.

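The scheme in the reply can be sketched in one line of C (function and parameter names are illustrative): the scalar unit computes the post-incremented base with a single multiply-and-add while the vector unit performs the memory accesses:

```c
#include <stdint.h>

/* One scalar multiply-and-add produces the post-incremented base
 * address; no extra ports on the scalar register file are needed. */
uint64_t post_increment_base(uint64_t base, uint64_t vl, uint64_t stride)
{
    return base + vl * stride;
}
```
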
# TODO: instructions V-Ext duplication analysis <a name="duplication_analysis">

This is partly speculative due to lack of access to an up-to-date
V-Ext Spec (V2.3-draft RVV 0.4-Draft at the time of writing).
A cursory examination shows an **85%** duplication of V-Ext
operand-related instructions when compared to a standard RV64G base,
and a **95%** duplication of arithmetic and floating-point operations.

Exceptions are:

* The Vector Misc ops: VEIDX, VFIRST, VPOPC,
and potentially more (9 control-related instructions)
* VCLIP and VCLIPI (the only 2 opcodes not duplicated out of 47
total arithmetic / floating-point operations)

Table of RV32V Instructions:

| RV32V | RV Std (FP) | RV Std (Int) | Notes |
| ----- | ----------- | ------------ | ----- |
| VADD | FADD | ADD | |
| VSUB | FSUB | SUB | |
| VSL | | SLL | |
| VSR | | SRL | |
| VAND | | AND | |
| VOR | | OR | |
| VXOR | | XOR | |
| VSEQ | FEQ | BEQ | (1) |
| VSNE | !FEQ | BNE | (1) |
| VSLT | FLT | BLT | (1) |
| VSGE | !FLE | BGE | (1) |
| VCLIP | | | |
| VCVT | FCVT | | |
| VMPOP | | | |
| VMFIRST | | | |
| VEXTRACT | | | |
| VINSERT | | | |
| VMERGE | | | |
| VSELECT | | | |
| VSLIDE | | | |
| VDIV | FDIV | DIV | |
| VREM | | REM | |
| VMUL | FMUL | MUL | |
| VMULH | | MULH | |
| VMIN | FMIN | | |
| VMAX | FMAX | | |
| VSGNJ | FSGNJ | | |
| VSGNJN | FSGNJN | | |
| VSGNJX | FSGNJX | | |
| VSQRT | FSQRT | | |
| VCLASS | FCLASS | | |
| VPOPC | | | |
| VADDI | | ADDI | |
| VSLI | | SLLI | |
| VSRI | | SRLI | |
| VANDI | | ANDI | |
| VORI | | ORI | |
| VXORI | | XORI | |
| VCLIPI | | | |
| VMADD | FMADD | | |
| VMSUB | FMSUB | | |
| VNMADD | FNMSUB | | |
| VNMSUB | FNMADD | | |
| VLD | FLD | LD | |
| VLDS | FLD | LD | (2) |
| VLDX | FLD | LD | (3) |
| VST | FST | ST | |
| VSTS | FST | ST | (2) |
| VSTX | FST | ST | (3) |
| VAMOSWAP | | AMOSWAP | |
| VAMOADD | | AMOADD | |
| VAMOAND | | AMOAND | |
| VAMOOR | | AMOOR | |
| VAMOXOR | | AMOXOR | |
| VAMOMIN | | AMOMIN | |
| VAMOMAX | | AMOMAX | |

Notes:

* (1) Retro-fit predication variants into branch instructions (base and C),
decoding triggered by a CSR bit marking the register as "Vector type".
* (2) Retro-fit LOAD/STORE constant-stride by reinterpreting one bit of the
immediate offset when register arguments are detected as being vectorised.
* (3) Retro-fit LOAD/STORE indexed-stride through detection of the address
register argument being vectorised.

# TODO: sort

> I suspect that the "hardware loop" in question is actually a zero-overhead
> loop unit that diverts execution from address X to address Y if a certain
> condition is met.

Not quite. The zero-overhead loop unit interestingly would be at
an [independent] level above vector-length. The distinctions are
as follows:

* Vector-length issues *virtual* instructions where the register
operands are *specifically* altered (to cover a range of registers),
whereas zero-overhead loops *specifically* do *NOT* alter the operands
in *ANY* way.

* Vector-length-driven "virtual" instructions are driven by *one*
and *only* one instruction (whether it be a LOAD, STORE, or pure
one/two/three-operand opcode), whereas zero-overhead loop units
specifically apply to *multiple* instructions.

Where vector-length-driven "virtual" instructions might get conceptually
blurred with zero-overhead loops is LOAD / STORE. In the case of LOAD /
STORE, to actually be useful, vector-length-driven LOAD / STORE should
increment the LOAD / STORE memory address to correspondingly match the
increment in the register bank. Example:

* set vector-length for r0 to 4
* issue RV32 LOAD from addr 0x1230 to r0

translates effectively to:

* RV32 LOAD from addr 0x1230 to r0
* ...
* ...
* RV32 LOAD from addr 0x123C to r3