-# Virtual Registers
-
-The fourth part is the grid in green, in the top left corner. This is
-a "virtual" to "real" one-bit table. It's here because the size of
-these matrices is so enormous that there is deep concern about the line
-driver strength, as well as the actual size. 128 registers means
-that one single gate, when it goes high or low, has to "drive" the
-input of 128 other gates. That takes longer and longer to do, the higher
-the number of gates, so it becomes a critical factor in determining the
-maximum speed of the entire processor. We will have to keep an eye
-on this.
-
-So, to keep the FU to Register matrix size down, this "virtual" register
-concept was introduced. Only one bit in each row of the green table
-may be active: it says, for example, "IR1 actually represents that there
-is an instruction being executed using R3". This does mean however that
-if this table is not high enough (not enough IRs), the processor has to
-stall until an instruction is completed, so that one register becomes
-free. Again, another thing to keep an eye on, in simulations.
-
-# Refinements
-
-The second major concern is the purple matrix: the FU-to-FU one. Basically
-where previously we would have FU1 cover all ADDs, FU2 would cover all MUL
-operations, FU3 covers BRANCH and so on, now we have to multiply those
-numbers by **four** (64-bit ops, 32-bit ops, 16-bit and 8), which in turn
-means that the size of the FU-to-FU Matrix has gone up by a staggering
-**sixteen** times. This is not really acceptable, so we have to do something
-different.
-
-So the refinement is based on an observation that 16-bit operations of
-course may be constructed from 8-bit values, and that 64-bit operations
-can be constructed from 32-bit ones. So, what if we skipped the
-cascade on 64 and 16 bit, and made the cascade out of just 32-bit and 8-bit?
-Then, very simply, the top half of a 64-bit source register is allocated
-to one Function Unit, the bottom half to the one next to it, and when it
-comes to actually passing the source registers to the relevant ALU, take
-from *both* FUs.
-
-The primary focus is on 32-bit (single-precision floating-point) performance
-anyway, for 3D, so if 64-bit operations happen to have half the number of
-Reservation Stations / Function Units, and block more often, we actually
-don't mind so much. Also, we can still apply the same "banks" trick on
-the Register File, except this time with 4-way multiplexing on 32-bit
-wide banks, and 4x4 crossbars on the bytes as well:
-
-{{register_file_multiplexing.jpg}}
-
-To cope with 16-bit operations, pairs of 8-bit values in adjacent Function
-Units are reserved. Likewise for 64-bit operations, the 8-bit crossbars
+### Virtual Registers
+
+The fourth part of the above diagram is the grid in green, in the top
+left corner. This is a "virtual" to "real" one-bit table. It's here
+because the size of these matrices is so enormous that there is deep
+concern about the line driver strength, as well as the actual size.
+128 registers means that one single gate, when it goes high or low,
+has to "drive" the input of 128 other gates. That takes longer and
+longer to do, the higher the number of gates, so it becomes a critical
+factor in determining the maximum speed of the entire processor. We
+will have to keep an eye on this.
+
+So, to keep the function unit to register matrix size down, this
+"virtual" register concept was introduced. Only one bit in each row
+of the green table may be active: it says, for example, "IR1 actually
+represents that there is an instruction being executed using R3."
+This does mean, however, that if this table is not tall enough (too
+few rows, that is, not enough IRs), the processor has to stall until
+an instruction is completed, so that one register becomes free. Again,
+another thing to keep an eye on, in simulations.
+
+### Refinements
+
+The second major concern is the purple matrix, the function unit to
+function unit one. Basically, where previously we would have FU1
+cover all ADDs, FU2 cover all MUL operations, FU3 cover BRANCH, and
+so on, now we have to multiply those numbers by **four** (64-bit,
+32-bit, 16-bit, and 8-bit ops). Because the FU-to-FU matrix has one
+row and one column per function unit, four times as many function
+units means that its size has gone up by a staggering **sixteen**
+times. This is not really acceptable, so we have to do something
+different.
+
+The refinement is based on an observation that 16-bit operations of
+course may be constructed from 8-bit values, and that 64-bit
+operations can be constructed from 32-bit ones. So, what if we
+skipped the cascade on 64-bit and 16-bit, and made the cascade out of
+just 32-bit and 8-bit? Then, very simply, the top half of a 64-bit
+source register is allocated to one function unit, the bottom half to
+the one next to it, and when it comes to actually passing the source
+registers to the relevant ALU, we take from *both* function units.
+
+For 3D, the primary focus is on 32-bit (single-precision
+floating-point) performance anyway, so if 64-bit operations happen to
+have half the number of reservation stations / function units, and
+block more often, we actually don't mind so much. Also, we can still
+apply the same "banks" trick on the register file, except this time
+with four-way multiplexing on 32-bit wide banks, and 4 x 4 crossbars
+on the bytes as well:
+
+{register-file-multiplexing | link}
+
+To cope with 16-bit operations, pairs of 8-bit values in adjacent function
+units are reserved. Likewise for 64-bit operations, the 8-bit crossbars