# Beyond 39-bit instruction virtual address extension

Peter says:

I'd like to propose a spec change and don't know who to contact. My
suggestion is that the instruction virtual address remain at 39 bits
(or lower) while moving the data virtual address to 48 bits. These two
spaces do not need to be the same size, and the instruction space will
naturally be a very small subset. The reason we expand is to access
more data, but the HW cost comes primarily from the instruction virtual
address. I don't believe there are any applications that require nearly
this much instruction space, so it's possible compilers already abide by
this restriction. However, we would need to formalize it to take
advantage of it in HW.

I've participated in many feasibility studies to expand the virtual
address through the years, and the costs (frequency, area, and power)
are prohibitive and get worse with each process. The main reason it is
so expensive is that the virtual address is used within the core to
track each instruction, so it exists in almost every functional block.
We try to implement address compression where possible, but it is still
perhaps the costliest group of signals we have. This false dependency
between instruction and data address space is the reason x86 processors
have been stuck at 48 bits for more than a decade despite strong demand
for expansion from server customers.

This seems like the type of HW/SW collaboration that RISC-V was meant
to address. Any suggestions on how to proceed?

# Discussion with Peter and lkcl

>> i *believe* that would have implications that only a 32/36/39 bit
>> *total* application execution space could be fitted into the TLB at
>> any one time, i.e. that if there were two applications approaching
>> those limits, that the TLBs would need to be entirely swapped out to
>> make room for one (and only one) of those insanely-large programs to
>> execute at any one time.
>>
> Yes, one solution would be to restrict the instruction TLB to one (or a few)
> segments. Our interface to SW is on page misses and when reading from
> registers (e.g. indirect branches), so we can translate to the different
> address size at these points. It would be preferable if the corner cases
> were disallowed by SW.

ok so just to be clear:

* application instruction space addressing is restricted to
32/36/39-bit (whatever)
* virtual address space for applications is restricted to 48-bit (on
rv64: rv128 has higher?)
* TLBs for application instruction space can then be restricted to
32+N/36+N/39+N where 0 <= N <= a small number.
* the smaller application space results in less virtual instruction
address routing hardware (the primary goal)
* an indirect branch, which will always be to an address within the
32/36/39-bit range, will result in a virtual TLB table miss
* the miss will be in:
  -> the 32+N/36+N/39+N space that will be
  -> redirected to a virtual 48-bit address that will be
  -> redirected to real RAM through the TLB.

assuming i have that right, in this way:

* you still have up to 48-bit *actual* virtual addressing (and
potentially even higher, even on RV64)
* but any one application is limited in instruction addressing range
to 32/36/39-bit
* *BUT* you *CAN* actually have multiple such applications running
simultaneously (depending on whether N is greater than zero or not).

is that about right?

if so, what are the disadvantages? what is lost (vs what is gained)?
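
As a concrete illustration of the split being discussed, here is a
minimal C sketch of the branch-target check and the rebasing into the
wider data space. The 39/48 split, the per-process segment base, and
every name in it are illustrative assumptions, not anything from the
spec:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Illustrative parameter: a 39-bit instruction range living
     * inside a 48-bit data virtual address space. */
    #define INSN_VA_BITS 39

    /* The check HW would always perform on an indirect-branch
     * target: it must lie inside the narrow instruction range. */
    static bool insn_target_in_range(uint64_t va)
    {
        return (va >> INSN_VA_BITS) == 0;
    }

    /* On an ITLB miss, rebase the narrow instruction address into
     * the full 48-bit space (assumed per-process segment base)
     * before the page walk. */
    static uint64_t rebase_insn_va(uint64_t insn_va, uint64_t segment_base)
    {
        return segment_base | (insn_va & ((1ULL << INSN_VA_BITS) - 1));
    }

    int main(void)
    {
        uint64_t target = 0x12345678ULL;     /* fits in 39 bits         */
        uint64_t base   = 0x010000000000ULL; /* assumed 48-bit seg base */
        if (insn_target_in_range(target))
            printf("target 0x%llx maps to VA 0x%llx\n",
                   (unsigned long long)target,
                   (unsigned long long)rebase_insn_va(target, base));
        return 0;
    }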

--------

reply:

> ok so just to be clear:
>
> * application instruction space addressing is restricted to
> 32/36/39-bit (whatever)

The address space of a process would ideally be restricted to a range
such as this. If not, SW would preferably help with corner cases
(e.g. an instruction overlapping a segment boundary).

> * virtual address space for applications is restricted to 48-bit (on
> rv64: rv128 has higher?)

Anything 64 bits or less would be fine (more of an ISA issue).

> * TLBs for application instruction space can then be restricted to
> 32+N/36+N/39+N where 0 <= N <= a small number.

Yes.

> * the smaller application space results in less virtual instruction
> address routing hardware (the primary goal)

The primary goal is frequency, but routing in key areas is a major
component of this (and is increasingly important on each new silicon
process). Area and power are secondary goals.

> * an indirect branch, which will always be to an address within the
> 32/36/39-bit range, will result in a virtual TLB table miss

Indirect branches would ideally always map to the range, but HW would
always check.

> * the miss will be in:
>   -> the 32+N/36+N/39+N space that will be
>   -> redirected to a virtual 48-bit address that will be
>   -> redirected to real RAM through the TLB.

Actually a page walk through the page miss handler, but the concept
is correct.

> if so, what are the disadvantages? what is lost (vs what is gained)?

I think the disadvantages are mainly SW implementation costs. The
advantages are frequency, power, and area. Also a mechanism for expanded
addressability and security.

[hypothetically, the same scheme could equally be applied to 48-bit
executables (so 32/36/39/48).]

# Jacob and Albert discussion

Albert Cahalan wrote:

> The solution is likely to limit the addresses that can be living in the
> pipeline at any one moment. If that would be exceeded, you wait.
>
> For example, split a 64-bit address into a 40-bit upper part and a
> 24-bit lower part. Assign 3-bit codes in place of the 40-bit portion,
> first-come-first-served. Track just 27 bits (3+24) through the
> processor. You can do a reference count on the 3-bit codes or just wait
> for the whole pipeline to clear and then recycle all of the 3-bit codes.
>
> Adjust all those numbers as determined by benchmarking.
>
> I must say, this bears a strong resemblance to the TLB. Maybe you could
> use a TLB entry index for the tracking.
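
A minimal sketch of this code-table idea, using the sizes from the
post (3-bit codes, 24-bit lower part); the names and the details of
the recycling policy are illustrative only:

    #include <stdint.h>

    /* Split a 64-bit VA into a 40-bit upper and a 24-bit lower part;
     * hand out 3-bit codes for upper parts first-come-first-served,
     * so only 3+24 = 27 bits travel down the pipeline. */
    #define NCODES   8           /* 2^3 codes               */
    #define LOW_BITS 24

    static uint64_t upper_of[NCODES];  /* upper 40 bits per code  */
    static unsigned refcount[NCODES];  /* in-flight uses per code */
    static unsigned used;              /* codes handed out so far */

    /* Returns a 27-bit internal tag, or -1 when all codes are busy
     * (the "you wait" case: stall until codes can be recycled). */
    int32_t assign_tag(uint64_t va)
    {
        uint64_t upper = va >> LOW_BITS;
        uint32_t low   = (uint32_t)va & ((1u << LOW_BITS) - 1);
        for (unsigned c = 0; c < used; c++)
            if (upper_of[c] == upper) {
                refcount[c]++;
                return (int32_t)((c << LOW_BITS) | low);
            }
        if (used == NCODES)
            return -1;
        upper_of[used] = upper;
        refcount[used] = 1;
        return (int32_t)((used++ << LOW_BITS) | low);
    }

    void retire_tag(int32_t tag) { refcount[tag >> LOW_BITS]--; }

    /* The simple recycling policy from the post: once the pipeline
     * has drained (no tag in flight), recycle all codes at once. */
    void recycle_if_drained(void)
    {
        for (unsigned c = 0; c < used; c++)
            if (refcount[c])
                return;
        used = 0;
    }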

I had thought of a similar solution.

The key is that the pipeline can only care about some subset of the
virtual address space at any one time. All that is needed is some way
to distinguish the instructions that are currently in the pipeline,
rather than every instruction in the process, as virtual addresses do.

I suggest using cache or TLB coordinates as instruction tags. This would
require that the L1 I-cache or ITLB "pin" each cacheline or slot that
holds a currently-pending instruction until that instruction is retired.
The L1 I-cache is probably an ideal reference, since the cache tag
array has the current base virtual address for each cacheline and the
rest of the pipeline would only need {cacheline number, offset} tuples.
Evicting the cacheline containing the most-recently-fetched instruction
would be insane in general, so this should have minimal impact on L1
I-cache management. If the virtual address of the instruction is needed
for any reason, it can be read from the I-cache tag array.

This approach can be trivially extended to multi-ASID or even multi-VMID
systems by simply adding VMID and ASID fields to the tag tuples.

The L1 I-cache provides an easy solution for assigning "short codes"
to replace the upper portion of an instruction's virtual address.
As an example, consider an 8KiB L1 I-cache with 128-byte cachelines.
Such a cache has 64 cachelines (6 bits) and each cacheline has 64 or
32 possible instructions (depending on implementation of RVC or other
odd-alignment ISA extensions). For an RVC-capable system (the worst
case), each 128-byte cacheline has 64 possible instruction locations, for
another 6 bits. So now the rest of the pipeline need only track 12-bit
tags that reference the L1 I-cache. A similar approach could also use
the ITLB, but the ITLB variant results in larger tags, due both to the
need to track page offsets (11 bits) and the larger number of slots the
ITLB is likely to have.
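
For concreteness, a sketch of the coordinate packing this implies; the
cc_t type and helpers are hypothetical, while the geometry is the
8KiB/128-byte/RVC example above:

    #include <stdint.h>

    /* 8 KiB I-cache, 128-byte lines, RVC (2-byte slots):
     * 64 lines x 64 slots = a 12-bit cache coordinate. */
    #define LINE_BITS   6   /* log2(8192 / 128) */
    #define OFFSET_BITS 6   /* log2(128 / 2)    */

    typedef uint16_t cc_t;  /* {line, slot} cache coordinate */

    static inline cc_t cc_pack(unsigned line, unsigned slot)
    {
        return (cc_t)((line << OFFSET_BITS) | slot);
    }

    static inline unsigned cc_line(cc_t cc) { return cc >> OFFSET_BITS; }
    static inline unsigned cc_slot(cc_t cc)
    {
        return cc & ((1u << OFFSET_BITS) - 1);
    }

    /* A multi-ASID/multi-VMID system, as noted above, simply widens
     * the tuple, e.g. {VMID, ASID, line, slot}. */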

Conceivably, even the program counter could be internally implemented
in this way.

-----

Jacob replies

The idea is that the internal encoding for (example) sepc could be the cache coordinates, and reading the CSR uses the actual value stored as an address to perform a read from the L1 I-cache tag array. In other words, cache coordinates do not need to be resolved back to virtual addresses until software does something that requires the virtual address.
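
A sketch of that resolution, reusing the 12-bit coordinate from the
example above; the tag-array variable and field widths are assumptions
for illustration:

    #include <stdint.h>

    /* sepc internally holds a {line, slot} coordinate; reading the
     * CSR resolves it via the I-cache tag array. */
    #define OFFSET_BITS 6

    static uint64_t itag_base_va[64]; /* base VA of each (pinned) line */

    uint64_t csr_read_sepc(uint16_t sepc_cc)
    {
        unsigned line = sepc_cc >> OFFSET_BITS;
        unsigned slot = sepc_cc & ((1u << OFFSET_BITS) - 1);
        /* tag-array read plus the in-line offset (2-byte RVC slots) */
        return itag_base_va[line] + ((uint64_t)slot << 1);
    }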

Branch target addresses get "interesting" since the implementation must either be able to carry a virtual address for a branch target into the pipeline (JALR needs the ability to transfer to a virtual address anyway) or prefetch all branch targets so the branch address can be written as a cache coordinate. An implementation could also simply have both "branch to VA" and "branch to CC" macro-ops and probe the cache when a branch is decoded: if the branch target is already in the cache, decode as "branch to CC", otherwise decode as "branch to VA". This requires tracking both forms of the program counter, however, and adds a performance-optimization rule: branch targets should be in the same or next cacheline when feasible. (I expect most implementations that implement I-cache prefetch at all to automatically prefetch the next cacheline of the instruction stream. That is very cheap to implement and the prefetch will hit whenever execution proceeds sequentially, which should be fairly common.)
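
A decode-time sketch of that probe-and-choose step; the macro-op names
and the icache_probe helper are hypothetical stand-ins:

    #include <stdbool.h>
    #include <stdint.h>

    enum uop_kind { BRANCH_TO_VA, BRANCH_TO_CC };

    struct branch_uop {
        enum uop_kind kind;
        uint64_t va;   /* full target, used by BRANCH_TO_VA        */
        uint16_t cc;   /* 12-bit cache coordinate for BRANCH_TO_CC */
    };

    /* hypothetical probe: returns true (and fills *cc) when the
     * target's line is resident; stubbed so the sketch compiles */
    static bool icache_probe(uint64_t va, uint16_t *cc)
    {
        (void)va; (void)cc;
        return false;
    }

    struct branch_uop decode_branch(uint64_t target_va)
    {
        struct branch_uop u = { BRANCH_TO_VA, target_va, 0 };
        uint16_t cc;
        if (icache_probe(target_va, &cc)) {
            u.kind = BRANCH_TO_CC; /* small coordinate rides pipeline */
            u.cc   = cc;
        }
        return u;  /* else carry the full VA (JALR needs this anyway) */
    }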

Limiting which instructions can take traps helps with this model, and interrupts (which can otherwise introduce interrupt traps anywhere) would need to be handled by inserting a "take interrupt trap" macro-op into the decoded instruction stream.
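
A sketch of that insertion point; the decode-queue primitive is a
hypothetical stand-in:

    #include <stdbool.h>

    /* Interrupts enter the pipeline as an explicit macro-op in the
     * decoded stream, instead of redirecting an arbitrary
     * instruction mid-flight. */
    enum uop { UOP_INSN, UOP_TAKE_INTERRUPT_TRAP };

    static void enqueue(enum uop k) { (void)k; /* stand-in queue */ }

    void decode_cycle(bool irq_pending)
    {
        if (irq_pending)
            enqueue(UOP_TAKE_INTERRUPT_TRAP); /* precise trap point */
        else
            enqueue(UOP_INSN);                /* normal decoded insn */
    }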

Also, this approach can use coordinates into either the L1 I-cache or the ITLB. I have been describing the cache version because I find it more interesting and it can use smaller tags than the TLB version. You mention evaluating TLB pointers and finding them insufficient; do cache pointers reduce or solve those issues? What were the problems with using TLB coordinates instead of virtual addresses?

More directly addressing lkcl's question, I expect the use of cache coordinates to be completely transparent to software, requiring no change to the ISA spec. As a purely microarchitectural solution, it also meets Dr. Waterman's goal.

# Microarchitecture design preference

andrew expressed a preference that the spec not require changes, and instead that implementors design microarchitectures that solve the problem transparently.

> so jacob (and peter, and albert, and others), what are your thoughts
> on whether these proposals would require a specification change. are
> they entirely transparent or are they guaranteed to have ramifications
> that propagate through the hardware and on to the toolchains and OSes,
> requiring formal platform-level specification and ratification?

I had hoped for software proposals, but these HW proposals would not require a specification change. I found that TLB ptrs didn't address our primary design issues (about 10 years ago), but they do simplify areas of the design. At least a partial TLB would be needed at other points in the pipeline when reading the VA from registers or checking branch addresses.

I still think the spec should recognize that the instruction space has very different requirements and costs.

----

213 " sepc could be the cache coordinates [set,way?], and reading the CSR uses the actual value stored as an address to perform a read from the L1 I-cache tag array"
214 This makes no sense to me. First, reading the CSR move the CSR into a GPR, it doesn't look up anything in the cache.
215
216 In an implementation using cache coordinates for *epc, reading *epc _does_ perform a cache tag lookup.
217
218 In case you instead meant that it is then used to index into the cache, then either:
219 - Reading the CSR into a GPR resolves to a VA, or
220
221 This is correct.
222
223 [...]
224 Neither of those explanations makes sense- could you explain better?
225
226 In this case, where sepc stores a (cache row, offset) tuple, reading sepc requires resolving that tuple into a virtual address, which is done by reading the high bits from the cache tag array and carrying over the offset within the cacheline. CPU-internal "magic cookie" cache coordinates are not software-visible. In this specific case, at entry to the trap handler, the relevant cacheline must be present -- it holds the most-recently executed instruction before the trap.
227
In general, the cacheline can be guaranteed to remain present using interlock logic that prevents its eviction unless no part of the processor is "looking at" it. Reference counting is a solved problem and should be sufficient for this. This gets a bit more complex with speculative execution and multiple privilege levels, although a cache-per-privilege-level model (needed to avoid side channels) also solves the problem of the cacheline being evicted -- the user cache is frozen while the supervisor runs and vice versa.

I have an outline for a solution to this problem involving shadow cachelines (enabling speculative prefetch/eviction in a VIPT cache) and a "trace scoreboard": a multi-column reference counter array in which each column tracks references from one pending execution trace. Issuing an instruction increments a cell, retiring an instruction decrements a cell, dropping a speculative trace (resolving its predicate as false) zeros an entire column, and a cacheline may be selected for eviction iff its entire row is zero.
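
That scoreboard reduces to a small two-dimensional reference-counter
array; a minimal sketch with illustrative sizes:

    #include <stdbool.h>
    #include <stdint.h>

    /* One row per cacheline, one column per pending execution trace. */
    #define LINES  64
    #define TRACES 8

    static uint16_t score[LINES][TRACES];

    void on_issue(unsigned line, unsigned trace)  { score[line][trace]++; }
    void on_retire(unsigned line, unsigned trace) { score[line][trace]--; }

    /* a speculative trace resolved as not-taken: zero its column */
    void drop_trace(unsigned trace)
    {
        for (unsigned l = 0; l < LINES; l++)
            score[l][trace] = 0;
    }

    /* a cacheline may be evicted iff its entire row is zero */
    bool evictable(unsigned line)
    {
        for (unsigned t = 0; t < TRACES; t++)
            if (score[line][t])
                return false;
        return true;
    }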

CSR reads are allowed to have software-visible side effects in RISC-V, although none of the current standard CSRs have side effects on read. Looking at it this way, resolving cache coordinates to a virtual address upon reading sepc is simply a side effect that is not visible to software.