# Beyond 39-bit instruction virtual address extension

Peter says:

I'd like to propose a spec change and don't know who to contact. My
suggestion is that the instruction virtual address remain at 39 bits
(or lower) while moving the data virtual address to 48 bits. These two
spaces do not need to be the same size, and the instruction space will
naturally be a very small subset. The reason we expand is to access
more data, but the HW cost comes primarily from the instruction virtual
address. I don't believe there are any applications that require nearly
this much instruction space, so it's possible compilers already abide by
this restriction. However, we would need to formalize it to take
advantage of it in HW.

I've participated in many feasibility studies to expand the virtual
address through the years, and the costs (frequency, area, and power)
are prohibitive and get worse with each process. The main reason it is
so expensive is that the virtual address is used within the core to
track each instruction, so it exists in almost every functional block.
We try to implement address compression where possible, but it is still
perhaps the costliest group of signals we have. This false dependency
between instruction and data address space is the reason x86 processors
have been stuck at 48 bits for more than a decade despite strong demand
for expansion from server customers.

This seems like the type of HW/SW collaboration that RISC-V was meant
to address. Any suggestions on how to proceed?

# Discussion with Peter and lkcl

>> i *believe* that would have implications that only a 32/36/39 bit
>> *total* application execution space could be fitted into the TLB at
>> any one time, i.e. that if there were two applications approaching
>> those limits, that the TLBs would need to be entirely swapped out to
>> make room for one (and only one) of those insanely-large programs to
>> execute at any one time.
>>
> Yes, one solution would be to restrict the instruction TLB to one (or a few)
> segments. Our interface to SW is on page misses and when reading from
> registers (e.g. indirect branches), so we can translate to the different
> address size at these points. It would be preferable if the corner cases
> were disallowed by SW.

ok so just to be clear:

* application instruction space addressing is restricted to
32/36/39-bit (whatever)
* virtual address space for applications is restricted to 48-bit (on
rv64: rv128 has higher?)
* TLBs for application instruction space can then be restricted to
32+N/36+N/39+N where 0 <= N <= a small number.
* the smaller application space results in less virtual instruction
address routing hardware (the primary goal)
* an indirect branch, which will always be to an address within the
32/36/39-bit range, will result in a virtual TLB table miss
* the miss will be in:
  -> the 32+N/36+N/39+N space that will be
  -> redirected to a virtual 48-bit address that will be
  -> redirected to real RAM through the TLB.

assuming i have that right, in this way:

* you still have up to 48-bit *actual* virtual addressing (and
potentially even higher, even on RV64)
* but any one application is limited in instruction addressing range
to 32/36/39-bit
* *BUT* you *CAN* actually have multiple such applications running
simultaneously (depending on whether N is greater than zero or not).

is that about right?

if so, what are the disadvantages? what is lost (vs what is gained)?
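
As a concrete illustration of the split being discussed, here is a
minimal C sketch of the branch-target check and the rebasing into the
wider data space. The 39/48 split, the per-process segment base, and
every name in it are illustrative assumptions, not anything from the
spec:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Illustrative parameter: a 39-bit instruction range living
     * inside a 48-bit data virtual address space. */
    #define INSN_VA_BITS 39

    /* The check HW would always perform on an indirect-branch
     * target: it must lie inside the narrow instruction range. */
    static bool insn_target_in_range(uint64_t va)
    {
        return (va >> INSN_VA_BITS) == 0;
    }

    /* On an ITLB miss, rebase the narrow instruction address into
     * the full 48-bit space (assumed per-process segment base)
     * before the page walk. */
    static uint64_t rebase_insn_va(uint64_t insn_va, uint64_t segment_base)
    {
        return segment_base | (insn_va & ((1ULL << INSN_VA_BITS) - 1));
    }

    int main(void)
    {
        uint64_t target = 0x12345678ULL;     /* fits in 39 bits         */
        uint64_t base   = 0x010000000000ULL; /* assumed 48-bit seg base */
        if (insn_target_in_range(target))
            printf("target 0x%llx maps to VA 0x%llx\n",
                   (unsigned long long)target,
                   (unsigned long long)rebase_insn_va(target, base));
        return 0;
    }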

--------

reply:

> ok so just to be clear:
>
> * application instruction space addressing is restricted to
> 32/36/39-bit (whatever)

The address space of a process would ideally be restricted to a range
such as this. If not, SW would preferably help with corner cases
(e.g. an instruction overlapping a segment boundary).

> * virtual address space for applications is restricted to 48-bit (on
> rv64: rv128 has higher?)

Anything 64 bits or less would be fine (more of an ISA issue).

> * TLBs for application instruction space can then be restricted to
> 32+N/36+N/39+N where 0 <= N <= a small number.

Yes.

> * the smaller application space results in less virtual instruction
> address routing hardware (the primary goal)

The primary goal is frequency, but routing in key areas is a major
component of this (and is increasingly important on each new silicon
process). Area and power are secondary goals.

> * an indirect branch, which will always be to an address within the
> 32/36/39-bit range, will result in a virtual TLB table miss

Indirect branches would ideally always map to the range, but HW would
always check.

> * the miss will be in:
>   -> the 32+N/36+N/39+N space that will be
>   -> redirected to a virtual 48-bit address that will be
>   -> redirected to real RAM through the TLB.

Actually a page walk through the page miss handler, but the concept
is correct.

> if so, what are the disadvantages? what is lost (vs what is gained)?

I think the disadvantages are mainly SW implementation costs. The
advantages are frequency, power, and area. Also a mechanism for expanded
addressability and security.

[hypothetically, the same scheme could equally be applied to 48-bit
executables (so 32/36/39/48).]

# Jacob and Albert discussion

Albert Cahalan wrote:

> The solution is likely to limit the addresses that can be living in the
> pipeline at any one moment. If that would be exceeded, you wait.
>
> For example, split a 64-bit address into a 40-bit upper part and a
> 24-bit lower part. Assign 3-bit codes in place of the 40-bit portion,
> first-come-first-served. Track just 27 bits (3+24) through the
> processor. You can do a reference count on the 3-bit codes or just wait
> for the whole pipeline to clear and then recycle all of the 3-bit codes.
>
> Adjust all those numbers as determined by benchmarking.
>
> I must say, this bears a strong resemblance to the TLB. Maybe you could
> use a TLB entry index for the tracking.
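
A minimal sketch of this code-table idea, using the sizes from the
post (3-bit codes, 24-bit lower part); the names and the details of
the recycling policy are illustrative only:

    #include <stdint.h>

    /* Split a 64-bit VA into a 40-bit upper and a 24-bit lower part;
     * hand out 3-bit codes for upper parts first-come-first-served,
     * so only 3+24 = 27 bits travel down the pipeline. */
    #define NCODES   8           /* 2^3 codes               */
    #define LOW_BITS 24

    static uint64_t upper_of[NCODES];  /* upper 40 bits per code  */
    static unsigned refcount[NCODES];  /* in-flight uses per code */
    static unsigned used;              /* codes handed out so far */

    /* Returns a 27-bit internal tag, or -1 when all codes are busy
     * (the "you wait" case: stall until codes can be recycled). */
    int32_t assign_tag(uint64_t va)
    {
        uint64_t upper = va >> LOW_BITS;
        uint32_t low   = (uint32_t)va & ((1u << LOW_BITS) - 1);
        for (unsigned c = 0; c < used; c++)
            if (upper_of[c] == upper) {
                refcount[c]++;
                return (int32_t)((c << LOW_BITS) | low);
            }
        if (used == NCODES)
            return -1;
        upper_of[used] = upper;
        refcount[used] = 1;
        return (int32_t)((used++ << LOW_BITS) | low);
    }

    void retire_tag(int32_t tag) { refcount[tag >> LOW_BITS]--; }

    /* The simple recycling policy from the post: once the pipeline
     * has drained (no tag in flight), recycle all codes at once. */
    void recycle_if_drained(void)
    {
        for (unsigned c = 0; c < used; c++)
            if (refcount[c])
                return;
        used = 0;
    }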

I had thought of a similar solution.

The key is that the pipeline can only care about some subset of the
virtual address space at any one time. All that is needed is some way
to distinguish the instructions that are currently in the pipeline,
rather than every instruction in the process, as virtual addresses do.

I suggest using cache or TLB coordinates as instruction tags. This would
require that the L1 I-cache or ITLB "pin" each cacheline or slot that
holds a currently-pending instruction until that instruction is retired.
The L1 I-cache is probably an ideal reference, since the cache tag
array has the current base virtual address for each cacheline and the
rest of the pipeline would only need {cacheline number, offset} tuples.
Evicting the cacheline containing the most-recently-fetched instruction
would be insane in general, so this should have minimal impact on L1
I-cache management. If the virtual address of the instruction is needed
for any reason, it can be read from the I-cache tag array.

This approach can be trivially extended to multi-ASID or even multi-VMID
systems by simply adding VMID and ASID fields to the tag tuples.

The L1 I-cache provides an easy solution for assigning "short codes"
to replace the upper portion of an instruction's virtual address.
As an example, consider an 8KiB L1 I-cache with 128-byte cachelines.
Such a cache has 64 cachelines (6 bits) and each cacheline has 64 or
32 possible instructions (depending on implementation of RVC or other
odd-alignment ISA extensions). For an RVC-capable system (the worst
case), each 128-byte cacheline has 64 possible instruction locations, for
another 6 bits. So now the rest of the pipeline need only track 12-bit
tags that reference the L1 I-cache. A similar approach could also use
the ITLB, but the ITLB variant results in larger tags, due both to the
need to track page offsets (11 bits) and the larger number of slots the
ITLB is likely to have.
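
For concreteness, a sketch of the coordinate packing this implies; the
cc_t type and helpers are hypothetical, while the geometry is the
8KiB/128-byte/RVC example above:

    #include <stdint.h>

    /* 8 KiB I-cache, 128-byte lines, RVC (2-byte slots):
     * 64 lines x 64 slots = a 12-bit cache coordinate. */
    #define LINE_BITS   6   /* log2(8192 / 128) */
    #define OFFSET_BITS 6   /* log2(128 / 2)    */

    typedef uint16_t cc_t;  /* {line, slot} cache coordinate */

    static inline cc_t cc_pack(unsigned line, unsigned slot)
    {
        return (cc_t)((line << OFFSET_BITS) | slot);
    }

    static inline unsigned cc_line(cc_t cc) { return cc >> OFFSET_BITS; }
    static inline unsigned cc_slot(cc_t cc)
    {
        return cc & ((1u << OFFSET_BITS) - 1);
    }

    /* A multi-ASID/multi-VMID system, as noted above, simply widens
     * the tuple, e.g. {VMID, ASID, line, slot}. */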

Conceivably, even the program counter could be internally implemented
in this way.

-----

Jacob replies

The idea is that the internal encoding for (example) sepc could be the cache coordinates, and reading the CSR uses the actual value stored as an address to perform a read from the L1 I-cache tag array. In other words, cache coordinates do not need to be resolved back to virtual addresses until software does something that requires the virtual address.
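
A sketch of that resolution, reusing the 12-bit coordinate from the
example above; the tag-array variable and field widths are assumptions
for illustration:

    #include <stdint.h>

    /* sepc internally holds a {line, slot} coordinate; reading the
     * CSR resolves it via the I-cache tag array. */
    #define OFFSET_BITS 6

    static uint64_t itag_base_va[64]; /* base VA of each (pinned) line */

    uint64_t csr_read_sepc(uint16_t sepc_cc)
    {
        unsigned line = sepc_cc >> OFFSET_BITS;
        unsigned slot = sepc_cc & ((1u << OFFSET_BITS) - 1);
        /* tag-array read plus the in-line offset (2-byte RVC slots) */
        return itag_base_va[line] + ((uint64_t)slot << 1);
    }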

Branch target addresses get "interesting" since the implementation must either be able to carry a virtual address for a branch target into the pipeline (JALR needs the ability to transfer to a virtual address anyway) or prefetch all branch targets so the branch address can be written as a cache coordinate. An implementation could also simply have both "branch to VA" and "branch to CC" macro-ops and probe the cache when a branch is decoded: if the branch target is already in the cache, decode as "branch to CC", otherwise decode as "branch to VA". This requires tracking both forms of the program counter, however, and adds a performance-optimization rule: branch targets should be in the same or next cacheline when feasible. (I expect most implementations that implement I-cache prefetch at all to automatically prefetch the next cacheline of the instruction stream. That is very cheap to implement and the prefetch will hit whenever execution proceeds sequentially, which should be fairly common.)
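
A decode-time sketch of that probe-and-choose step; the macro-op names
and the icache_probe helper are hypothetical stand-ins:

    #include <stdbool.h>
    #include <stdint.h>

    enum uop_kind { BRANCH_TO_VA, BRANCH_TO_CC };

    struct branch_uop {
        enum uop_kind kind;
        uint64_t va;   /* full target, used by BRANCH_TO_VA        */
        uint16_t cc;   /* 12-bit cache coordinate for BRANCH_TO_CC */
    };

    /* hypothetical probe: returns true (and fills *cc) when the
     * target's line is resident; stubbed so the sketch compiles */
    static bool icache_probe(uint64_t va, uint16_t *cc)
    {
        (void)va; (void)cc;
        return false;
    }

    struct branch_uop decode_branch(uint64_t target_va)
    {
        struct branch_uop u = { BRANCH_TO_VA, target_va, 0 };
        uint16_t cc;
        if (icache_probe(target_va, &cc)) {
            u.kind = BRANCH_TO_CC; /* small coordinate rides pipeline */
            u.cc   = cc;
        }
        return u;  /* else carry the full VA (JALR needs this anyway) */
    }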

Limiting which instructions can take traps helps with this model, and interrupts (which can otherwise introduce interrupt traps anywhere) would need to be handled by inserting a "take interrupt trap" macro-op into the decoded instruction stream.
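
A sketch of that insertion point; the decode-queue primitive is a
hypothetical stand-in:

    #include <stdbool.h>

    /* Interrupts enter the pipeline as an explicit macro-op in the
     * decoded stream, instead of redirecting an arbitrary
     * instruction mid-flight. */
    enum uop { UOP_INSN, UOP_TAKE_INTERRUPT_TRAP };

    static void enqueue(enum uop k) { (void)k; /* stand-in queue */ }

    void decode_cycle(bool irq_pending)
    {
        if (irq_pending)
            enqueue(UOP_TAKE_INTERRUPT_TRAP); /* precise trap point */
        else
            enqueue(UOP_INSN);                /* normal decoded insn */
    }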

Also, this approach can use coordinates into either the L1 I-cache or the ITLB. I have been describing the cache version because I find it more interesting and it can use smaller tags than the TLB version. You mention evaluating TLB pointers and finding them insufficient; do cache pointers reduce or solve those issues? What were the problems with using TLB coordinates instead of virtual addresses?

More directly addressing lkcl's question, I expect the use of cache coordinates to be completely transparent to software, requiring no change to the ISA spec. As a purely microarchitectural solution, it also meets Dr. Waterman's goal.

# Microarchitecture design preference

andrew expressed a preference that the spec not require changes, and instead that implementors design microarchitectures that solve the problem transparently.

> so jacob (and peter, and albert, and others), what are your thoughts
> on whether these proposals would require a specification change. are
> they entirely transparent or are they guaranteed to have ramifications
> that propagate through the hardware and on to the toolchains and OSes,
> requiring formal platform-level specification and ratification?

I had hoped for software proposals, but these HW proposals would not require a specification change. I found that TLB ptrs didn't address our primary design issues (about 10 years ago), but they do simplify areas of the design. At least a partial TLB would be needed at other points in the pipeline when reading the VA from registers or checking branch addresses.

I still think the spec should recognize that the instruction space has very different requirements and costs.

----

213 " sepc could be the cache coordinates [set,way?], and reading the CSR uses the actual value stored as an address to perform a read from the L1 I-cache tag array"
214 This makes no sense to me. First, reading the CSR move the CSR into a GPR, it doesn't look up anything in the cache.
215
216 In an implementation using cache coordinates for *epc, reading *epc _does_ perform a cache tag lookup.
217
218 In case you instead meant that it is then used to index into the cache, then either:
219 - Reading the CSR into a GPR resolves to a VA, or
220
221 This is correct.
222
223 [...]
224 Neither of those explanations makes sense- could you explain better?
225
226 In this case, where sepc stores a (cache row, offset) tuple, reading sepc requires resolving that tuple into a virtual address, which is done by reading the high bits from the cache tag array and carrying over the offset within the cacheline. CPU-internal "magic cookie" cache coordinates are not software-visible. In this specific case, at entry to the trap handler, the relevant cacheline must be present -- it holds the most-recently executed instruction before the trap.
227
In general, the cacheline can be guaranteed to remain present using interlock logic that prevents its eviction unless no part of the processor is "looking at" it. Reference counting is a solved problem and should be sufficient for this. This gets a bit more complex with speculative execution and multiple privilege levels, although a cache-per-privilege-level model (needed to avoid side channels) also solves the problem of the cacheline being evicted -- the user cache is frozen while the supervisor runs and vice versa.

I have an outline for a solution to this problem involving shadow cachelines (enabling speculative prefetch/eviction in a VIPT cache) and a "trace scoreboard": a multi-column reference counter array in which each column tracks references from one pending execution trace. Issuing an instruction increments a cell, retiring an instruction decrements a cell, dropping a speculative trace (resolving its predicate as false) zeros an entire column, and a cacheline may be selected for eviction iff its entire row is zero.
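
That scoreboard reduces to a small two-dimensional reference-counter
array; a minimal sketch with illustrative sizes:

    #include <stdbool.h>
    #include <stdint.h>

    /* One row per cacheline, one column per pending execution trace. */
    #define LINES  64
    #define TRACES 8

    static uint16_t score[LINES][TRACES];

    void on_issue(unsigned line, unsigned trace)  { score[line][trace]++; }
    void on_retire(unsigned line, unsigned trace) { score[line][trace]--; }

    /* a speculative trace resolved as not-taken: zero its column */
    void drop_trace(unsigned trace)
    {
        for (unsigned l = 0; l < LINES; l++)
            score[l][trace] = 0;
    }

    /* a cacheline may be evicted iff its entire row is zero */
    bool evictable(unsigned line)
    {
        for (unsigned t = 0; t < TRACES; t++)
            if (score[line][t])
                return false;
        return true;
    }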

CSR reads are allowed to have software-visible side effects in RISC-V, although none of the current standard CSRs have side effects on read. Looking at it this way, resolving cache coordinates to a virtual address upon reading sepc is simply a side effect that is not visible to software.