\documentclass[11pt,twoside,final,openright]{report}
\usepackage{a4,graphicx,html,setspace,times}
\usepackage{comment,parskip}
\setstretch{1.15}

% LIBRARY FUNCTIONS

\newcommand{\hypercall}[1]{\vspace{2mm}{\sf #1}}

\begin{document}

% TITLE PAGE
\pagestyle{empty}
\begin{center}
\vspace*{\fill}
\includegraphics{figs/xenlogo.eps}
\vfill
\vfill
\vfill
\begin{tabular}{l}
{\Huge \bf Interface manual} \\[4mm]
{\huge Xen v3.0 for x86} \\[80mm]

{\Large Xen is Copyright (c) 2002-2005, The Xen Team} \\[3mm]
{\Large University of Cambridge, UK} \\[20mm]
\end{tabular}
\end{center}

{\bf DISCLAIMER: This documentation is always under active development
and as such there may be mistakes and omissions --- watch out for
these and please report any you find to the developer's mailing list.
The latest version is always available on-line. Contributions of
material, suggestions and corrections are welcome. }

\vfill
\cleardoublepage

% TABLE OF CONTENTS
\pagestyle{plain}
\pagenumbering{roman}
{ \parskip 0pt plus 1pt
\tableofcontents }
\cleardoublepage

% PREPARE FOR MAIN TEXT
\pagenumbering{arabic}
\raggedbottom
\widowpenalty=10000
\clubpenalty=10000
\parindent=0pt
\parskip=5pt
\renewcommand{\topfraction}{.8}
\renewcommand{\bottomfraction}{.8}
\renewcommand{\textfraction}{.2}
\renewcommand{\floatpagefraction}{.8}
\setstretch{1.1}

\chapter{Introduction}

Xen allows the hardware resources of a machine to be virtualized and
dynamically partitioned, allowing multiple different {\em guest}
operating system images to be run simultaneously. Virtualizing the
machine in this manner provides considerable flexibility, for example
allowing different users to choose their preferred operating system
(e.g., Linux, NetBSD, or a custom operating system). Furthermore, Xen
provides secure partitioning between virtual machines (known as
{\em domains} in Xen terminology), and enables better resource
accounting and QoS isolation than can be achieved with a conventional
operating system.

Xen essentially takes a `whole machine' virtualization approach as
pioneered by IBM VM/370. However, unlike VM/370 or more recent
efforts such as VMware and Virtual PC, Xen does not attempt to
completely virtualize the underlying hardware. Instead parts of the
hosted guest operating systems are modified to work with the VMM; the
operating system is effectively ported to a new target architecture,
typically requiring changes in just the machine-dependent code. The
user-level API is unchanged, and so existing binaries and operating
system distributions work without modification.

In addition to exporting virtualized instances of CPU, memory, network
and block devices, Xen exposes a control interface to manage how these
resources are shared between the running domains. Access to the
control interface is restricted: it may only be used by one
specially-privileged VM, known as {\em domain 0}. This domain is a
required part of any Xen-based server and runs the application software
that manages the control-plane aspects of the platform. Running the
control software in {\it domain 0}, distinct from the hypervisor
itself, allows the Xen framework to separate the notions of
mechanism and policy within the system.


\chapter{Virtual Architecture}

In a Xen/x86 system, only the hypervisor runs with full processor
privileges ({\it ring 0} in the x86 four-ring model). It has full
access to the physical memory available in the system and is
responsible for allocating portions of it to running domains.

On a 32-bit x86 system, guest operating systems may use {\it rings 1},
{\it 2} and {\it 3} as they see fit. Segmentation is used to prevent
the guest OS from accessing the portion of the address space that is
reserved for Xen. We expect most guest operating systems will use
ring 1 for their own operation and place applications in ring 3.

On 64-bit systems it is not possible to protect the hypervisor from
untrusted guest code running in rings 1 and 2. Guests are therefore
restricted to run in ring 3 only. The guest kernel is protected from its
applications by context switching between the kernel and the currently
running application.

In this chapter we consider the basic virtual architecture provided by
Xen: CPU state, exception and interrupt handling, and time.
Other aspects such as memory and device access are discussed in later
chapters.


\section{CPU state}

All privileged state must be handled by Xen. The guest OS has no
direct access to CR3 and is not permitted to update privileged bits in
EFLAGS. Guest OSes use \emph{hypercalls} to invoke operations in Xen;
these are analogous to system calls but occur from ring 1 to ring 0.

A list of all hypercalls is given in Appendix~\ref{a:hypercalls}.


\section{Exceptions}

A virtual IDT is provided --- a domain can submit a table of trap
handlers to Xen via the {\bf set\_trap\_table} hypercall. The
exception stack frame presented to a virtual trap handler is identical
to its native equivalent.


\section{Interrupts and events}

Interrupts are virtualized by mapping them to \emph{event channels},
which are delivered asynchronously to the target domain using a callback
supplied via the {\bf set\_callbacks} hypercall. A guest OS can map
these events onto its standard interrupt dispatch mechanisms. Xen is
responsible for determining the target domain that will handle each
physical interrupt source. For more details on the binding of event
sources to event channels, see Chapter~\ref{c:devices}.


\section{Time}

Guest operating systems need to be aware of the passage of both real
(or wallclock) time and their own `virtual time' (the time for which
they have been executing). Furthermore, Xen has a notion of time which
is used for scheduling. The following notions of time are provided:

\begin{description}
\item[Cycle counter time.]

This provides a fine-grained time reference. The cycle counter time
is used to accurately extrapolate the other time references. On SMP
machines it is currently assumed that the cycle counter time is
synchronized between CPUs. The current x86-based implementation
achieves this within inter-CPU communication latencies.

\item[System time.]

This is a 64-bit counter which holds the number of nanoseconds that
have elapsed since system boot.

\item[Wall clock time.]

This is the time of day in a Unix-style {\bf struct timeval}
(seconds and microseconds since 1 January 1970, adjusted by leap
seconds). An NTP client hosted by {\it domain 0} can keep this
value accurate.

\item[Domain virtual time.]

This progresses at the same pace as system time, but only while a
domain is executing --- it stops while a domain is de-scheduled.
Therefore the share of the CPU that a domain receives is indicated
by the rate at which its virtual time increases.

\end{description}


Xen exports timestamps for system time and wall-clock time to guest
operating systems through a shared page of memory. Xen also provides
the cycle counter time at the instant the timestamps were calculated,
and the CPU frequency in Hertz. This allows the guest to extrapolate
system and wall-clock times accurately based on the current cycle
counter time.

Since all time stamps need to be updated and read \emph{atomically},
a version number is also stored in the shared info page, which is
incremented before and after updating the timestamps. Thus a guest can
be sure that it read a consistent state by checking that the two version
numbers are equal and even.
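As an illustration, a guest might read the wallclock fields with a retry loop
along the following lines. This is a minimal sketch: the field names
{\tt wc\_version}, {\tt wc\_sec} and {\tt wc\_nsec} are those of the shared
info page documented later in this manual, and {\tt rmb()} stands for
whatever read-memory-barrier primitive the guest kernel provides.

\scriptsize
\begin{verbatim}
uint32_t ver;
uint64_t sec, nsec;

do {
    ver = shared_info->wc_version;
    rmb();                         /* read the version before the payload */
    sec  = shared_info->wc_sec;
    nsec = shared_info->wc_nsec;
    rmb();                         /* read the payload before re-checking */
} while ((ver & 1) ||                           /* odd: update in progress */
         (ver != shared_info->wc_version));     /* changed: retry          */
\end{verbatim}
\normalsize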

Xen includes a periodic ticker which sends a timer event to the
currently executing domain every 10ms. The Xen scheduler also sends a
timer event whenever a domain is scheduled; this allows the guest OS
to adjust for the time that has passed while it has been inactive. In
addition, Xen allows each domain to request that it receive a timer
event sent at a specified system time by using the {\bf
set\_timer\_op} hypercall. Guest OSes may use this timer to
implement timeout values when they block.


\section{Xen CPU Scheduling}

Xen offers a uniform API for CPU schedulers. It is possible to choose
from a number of schedulers at boot and it should be easy to add more.
The SEDF and Credit schedulers are part of the normal Xen
distribution. SEDF will eventually be removed and its use should be
avoided once the Credit scheduler has stabilized and become the default.
The Credit scheduler provides proportional fair shares of the
host's CPUs to the running domains. It does this while transparently
load balancing runnable VCPUs across the whole system.

\paragraph*{Note: SMP host support}
Xen has always supported SMP host systems. When using the Credit scheduler,
a domain's VCPUs will be dynamically moved across physical CPUs to maximise
domain and system throughput. VCPUs can also be manually restricted to be
mapped only on a subset of the host's physical CPUs, using the pinning
mechanism.


%% More information on the characteristics and use of these schedulers
%% is available in {\bf Sched-HOWTO.txt}.


\section{Privileged operations}

Xen exports an extended interface to privileged domains (viz.\ {\it
Domain 0}). This allows such domains to build and boot other domains
on the server, and provides control interfaces for managing
scheduling, memory, networking, and block devices.

\chapter{Memory}
\label{c:memory}

Xen is responsible for managing the allocation of physical memory to
domains, and for ensuring safe use of the paging and segmentation
hardware.


\section{Memory Allocation}

As well as allocating a portion of physical memory for its own private
use, Xen also reserves a small fixed portion of every virtual address
space. This is located in the top 64MB on 32-bit systems, the top
168MB on PAE systems, and a larger portion in the middle of the
address space on 64-bit systems. Unreserved physical memory is
available for allocation to domains at a page granularity. Xen tracks
the ownership and use of each page, which allows it to enforce secure
partitioning between domains.

Each domain has a maximum and current physical memory allocation. A
guest OS may run a `balloon driver' to dynamically adjust its current
memory allocation up to its limit.


\section{Pseudo-Physical Memory}

Since physical memory is allocated and freed on a page granularity,
there is no guarantee that a domain will receive a contiguous stretch
of physical memory. However, most operating systems do not have good
support for operating in a fragmented physical address space. To aid
porting such operating systems to run on top of Xen, we make a
distinction between \emph{machine memory} and \emph{pseudo-physical
memory}.

Put simply, machine memory refers to the entire amount of memory
installed in the machine, including that reserved by Xen, in use by
various domains, or currently unallocated. We consider machine memory
to comprise a set of 4kB \emph{machine page frames} numbered
consecutively starting from 0. Machine frame numbers mean the same
within Xen or any domain.

Pseudo-physical memory, on the other hand, is a per-domain
abstraction. It allows a guest operating system to consider its memory
allocation to consist of a contiguous range of physical page frames
starting at physical frame 0, despite the fact that the underlying
machine page frames may be sparsely allocated and in any order.

To achieve this, Xen maintains a globally readable {\it
machine-to-physical} table which records the mapping from machine
page frames to pseudo-physical ones. In addition, each domain is
supplied with a {\it physical-to-machine} table which performs the
inverse mapping. Clearly the machine-to-physical table has size
proportional to the amount of RAM installed in the machine, while each
physical-to-machine table has size proportional to the memory
allocation of the given domain.

Architecture-dependent code in guest operating systems can then use
the two tables to provide the abstraction of pseudo-physical memory.
In general, only certain specialized parts of the operating system
(such as page table management) need to understand the difference
between machine and pseudo-physical addresses.
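To make the relationship concrete, a paravirtualized guest typically wraps
the two tables in a pair of helper routines along the following lines. This
is a minimal sketch: the symbol names {\tt machine\_to\_phys\_mapping} and
{\tt phys\_to\_machine\_mapping} are illustrative, standing for wherever the
guest maps the globally readable M2P table and its own P2M table.

\scriptsize
\begin{verbatim}
/* Globally readable table maintained by Xen (machine -> pseudo-physical). */
extern unsigned long *machine_to_phys_mapping;
/* Per-domain table supplied to the guest (pseudo-physical -> machine).    */
extern unsigned long *phys_to_machine_mapping;

/* Translate a pseudo-physical frame number into a machine frame number,
 * e.g. when constructing page-table entries. */
static inline unsigned long pfn_to_mfn(unsigned long pfn)
{
    return phys_to_machine_mapping[pfn];
}

/* Translate a machine frame number back to the guest's pseudo-physical
 * frame number, e.g. when interpreting page-table entries that hold
 * machine addresses. */
static inline unsigned long mfn_to_pfn(unsigned long mfn)
{
    return machine_to_phys_mapping[mfn];
}
\end{verbatim}
\normalsize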


\section{Page Table Updates}

In the default mode of operation, Xen enforces read-only access to
page tables and requires guest operating systems to explicitly request
any modifications. Xen validates all such requests and only applies
updates that it deems safe. This is necessary to prevent domains from
adding arbitrary mappings to their page tables.

To aid validation, Xen associates a type and reference count with each
memory page. A page has one of the following mutually-exclusive types
at any point in time: page directory ({\sf PD}), page table ({\sf
PT}), local descriptor table ({\sf LDT}), global descriptor table
({\sf GDT}), or writable ({\sf RW}). Note that a guest OS may always
create readable mappings of its own memory regardless of its current
type.

%%% XXX: possibly explain more about ref count 'lifecycle' here?
This mechanism is used to maintain the invariants required for safety;
for example, a domain cannot have a writable mapping to any part of a
page table as this would require the page concerned to simultaneously
be of types {\sf PT} and {\sf RW}.

\hypercall{mmu\_update(mmu\_update\_t *req, int count, int *success\_count, domid\_t domid)}

This hypercall is used to make updates to either the domain's
pagetables or to the machine-to-physical mapping table. It supports
submitting a queue of updates, allowing batching for maximal
performance. Explicitly queuing updates using this interface will
cause any outstanding writable pagetable state to be flushed from the
system.
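For instance, a guest that wants to install several page-table entries in one
go might batch them as follows. This is a minimal sketch:
{\bf HYPERVISOR\_mmu\_update} is an assumed guest-side wrapper for the
hypercall above, {\tt MMU\_NORMAL\_PT\_UPDATE} and {\tt DOMID\_SELF} are taken
from the public headers of this era, {\tt pfn\_to\_mfn()} is the helper
sketched earlier in this chapter, and the arrays {\tt pte\_machine\_addr[]},
{\tt target\_pfn[]} and the flag word {\tt prot\_flags} are purely
illustrative.

\scriptsize
\begin{verbatim}
#define NR_UPDATES 4

mmu_update_t req[NR_UPDATES];
int i, done;

for (i = 0; i < NR_UPDATES; i++) {
    /* Machine address of the PTE to modify; the low bits of 'ptr' select
     * the update type (assumed: 0 == normal page-table update). */
    req[i].ptr = pte_machine_addr[i] | MMU_NORMAL_PT_UPDATE;
    /* New PTE contents, expressed with a *machine* frame number. */
    req[i].val = (pfn_to_mfn(target_pfn[i]) << PAGE_SHIFT) | prot_flags;
}

if (HYPERVISOR_mmu_update(req, NR_UPDATES, &done, DOMID_SELF) != 0 ||
    done != NR_UPDATES)
    /* one or more updates were rejected by Xen's validation */ ;
\end{verbatim}
\normalsize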

\section{Writable Page Tables}

Xen also provides an alternative mode of operation in which guests
have the illusion that their page tables are directly writable. Of
course this is not really the case, since Xen must still validate
modifications to ensure secure partitioning. To this end, Xen traps
any write attempt to a memory page of type {\sf PT} (i.e., one that is
currently part of a page table). If such an access occurs, Xen
temporarily allows write access to that page while at the same time
\emph{disconnecting} it from the page table that is currently in use.
This allows the guest to safely make updates to the page because the
newly-updated entries cannot be used by the MMU until Xen revalidates
and reconnects the page. Reconnection occurs automatically in a
number of situations: for example, when the guest modifies a different
page-table page, when the domain is preempted, or whenever the guest
uses Xen's explicit page-table update interfaces.

Writable pagetable functionality is enabled when the guest requests
it, using a {\bf vm\_assist} hypercall. Writable pagetables do {\em
not} provide full virtualisation of the MMU, so the memory management
code of the guest still needs to be aware that it is running on Xen.
Since the guest's page tables are used directly, it must translate
pseudo-physical addresses to real machine addresses when building page
table entries. The guest may not attempt to map its own pagetables
writably, since this would violate the memory type invariants; page
tables will automatically be made writable by the hypervisor, as
necessary.

\section{Shadow Page Tables}

Finally, Xen also supports a form of \emph{shadow page tables} in
which the guest OS uses an independent copy of page tables which are
unknown to the hardware (i.e.\ which are never pointed to by {\tt
cr3}). Instead Xen propagates changes made to the guest's tables to
the real ones, and vice versa. This is useful for logging page writes
(e.g.\ for live migration or checkpointing). A full version of the shadow
page tables also allows guest OS porting with less effort.


\section{Segment Descriptor Tables}

At start of day a guest is supplied with a default GDT, which does not reside
within its own memory allocation. If the guest wishes to use other
than the default `flat' ring-1 and ring-3 segments that this GDT
provides, it must register a custom GDT and/or LDT with Xen, allocated
from its own memory.

The following hypercall is used to specify a new GDT:

\begin{quote}
int {\bf set\_gdt}(unsigned long *{\em frame\_list}, int {\em
entries})

\emph{frame\_list}: An array of up to 14 machine page frames within
which the GDT resides. Any frame registered as a GDT frame may only
be mapped read-only within the guest's address space (e.g., no
writable mappings, no use as a page-table page, and so on). Only 14
pages may be specified because pages 15 and 16 are reserved for
the hypervisor's GDT entries.

\emph{entries}: The number of descriptor-entry slots in the GDT.
\end{quote}
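A guest might register a one-page custom GDT roughly as follows. This is a
minimal sketch: {\bf HYPERVISOR\_set\_gdt} is an assumed guest-side wrapper
for the hypercall above, and {\tt virt\_to\_mfn()} and
{\tt make\_page\_readonly()} stand for guest helpers built on the
physical-to-machine table and page-table interfaces described earlier.

\scriptsize
\begin{verbatim}
/* One page of 8-byte segment descriptors, page-aligned in guest memory. */
static uint64_t gdt[512] __attribute__((aligned(4096)));
unsigned long frames[1];

/* Xen insists that GDT frames are only ever mapped read-only. */
make_page_readonly(gdt);
frames[0] = virt_to_mfn(gdt);

if (HYPERVISOR_set_gdt(frames, 512) != 0)
    /* registration rejected: keep using the default flat GDT */ ;
\end{verbatim}
\normalsize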

The LDT is updated via the generic MMU update mechanism (i.e., via the
{\bf mmu\_update} hypercall).

\section{Start of Day}

The start-of-day environment for guest operating systems is rather
different to that provided by the underlying hardware. In particular,
the processor is already executing in protected mode with paging
enabled.

{\it Domain 0} is created and booted by Xen itself. For all subsequent
domains, the analogue of the boot-loader is the {\it domain builder},
user-space software running in {\it domain 0}. The domain builder is
responsible for building the initial page tables for a domain and
loading its kernel image at the appropriate virtual address.

\section{VM assists}

Xen provides a number of ``assists'' for guest memory management.
These are available on an ``opt-in'' basis to provide commonly-used
extra functionality to a guest.

\hypercall{vm\_assist(unsigned int cmd, unsigned int type)}

The {\bf cmd} parameter describes the action to be taken, whilst the
{\bf type} parameter describes the kind of assist that is being
referred to. Available commands are as follows:

\begin{description}
\item[VMASST\_CMD\_enable] Enable a particular assist type
\item[VMASST\_CMD\_disable] Disable a particular assist type
\end{description}

And the available types are:

\begin{description}
\item[VMASST\_TYPE\_4gb\_segments] Provide emulated support for
instructions that rely on 4GB segments (such as the techniques used
by some TLS solutions).
\item[VMASST\_TYPE\_4gb\_segments\_notify] Provide a callback to the
guest if the above segment fixups are used: allows the guest to
display a warning message during boot.
\item[VMASST\_TYPE\_writable\_pagetables] Enable writable pagetable
mode, as described above (see the sketch below).
\end{description}
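As an illustration, a guest that wants the writable pagetable assist could
request it early in boot. This is a minimal sketch;
{\bf HYPERVISOR\_vm\_assist} is an assumed guest-side wrapper for the
hypercall above.

\scriptsize
\begin{verbatim}
/* Opt in to writable pagetable mode: Xen will then trap and validate
 * direct writes to pages of type PT, as described earlier. */
if (HYPERVISOR_vm_assist(VMASST_CMD_enable,
                         VMASST_TYPE_writable_pagetables) != 0)
    /* assist unavailable: fall back to explicit mmu_update batching */ ;
\end{verbatim}
\normalsize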


\chapter{Xen Info Pages}

The {\bf Shared info page} is used to share various CPU-related state
between the guest OS and the hypervisor. This information includes VCPU
status, time information and event channel (virtual interrupt) state.
The {\bf Start info page} is used to pass build-time information to
the guest when it boots and when it is resumed from a suspended state.
This chapter documents the fields included in the {\bf
shared\_info\_t} and {\bf start\_info\_t} structures for use by the
guest OS.

\section{Shared info page}

The {\bf shared\_info\_t} is accessed at run time by both Xen and the
guest OS. It is used to pass information relating to the
virtual CPU and virtual machine state between the OS and the
hypervisor.

The structure is declared in {\bf xen/include/public/xen.h}:

\scriptsize
\begin{verbatim}
typedef struct shared_info {
    vcpu_info_t vcpu_info[MAX_VIRT_CPUS];

    /*
     * A domain can create "event channels" on which it can send and receive
     * asynchronous event notifications. There are three classes of event that
     * are delivered by this mechanism:
     *  1. Bi-directional inter- and intra-domain connections. Domains must
     *     arrange out-of-band to set up a connection (usually by allocating
     *     an unbound 'listener' port and advertising that via a storage service
     *     such as xenstore).
     *  2. Physical interrupts. A domain with suitable hardware-access
     *     privileges can bind an event-channel port to a physical interrupt
     *     source.
     *  3. Virtual interrupts ('events'). A domain can bind an event-channel
     *     port to a virtual interrupt source, such as the virtual-timer
     *     device or the emergency console.
     *
     * Event channels are addressed by a "port index". Each channel is
     * associated with two bits of information:
     *  1. PENDING -- notifies the domain that there is a pending notification
     *     to be processed. This bit is cleared by the guest.
     *  2. MASK -- if this bit is clear then a 0->1 transition of PENDING
     *     will cause an asynchronous upcall to be scheduled. This bit is only
     *     updated by the guest. It is read-only within Xen. If a channel
     *     becomes pending while the channel is masked then the 'edge' is lost
     *     (i.e., when the channel is unmasked, the guest must manually handle
     *     pending notifications as no upcall will be scheduled by Xen).
     *
     * To expedite scanning of pending notifications, any 0->1 pending
     * transition on an unmasked channel causes a corresponding bit in a
     * per-vcpu selector word to be set. Each bit in the selector covers a
     * 'C long' in the PENDING bitfield array.
     */
    unsigned long evtchn_pending[sizeof(unsigned long) * 8];
    unsigned long evtchn_mask[sizeof(unsigned long) * 8];

    /*
     * Wallclock time: updated only by control software. Guests should base
     * their gettimeofday() syscall on this wallclock-base value.
     */
    uint32_t wc_version;      /* Version counter: see vcpu_time_info_t. */
    uint32_t wc_sec;          /* Secs  00:00:00 UTC, Jan 1, 1970.  */
    uint32_t wc_nsec;         /* Nsecs 00:00:00 UTC, Jan 1, 1970.  */

    arch_shared_info_t arch;

} shared_info_t;
\end{verbatim}
\normalsize

\begin{description}
\item[vcpu\_info] An array of {\bf vcpu\_info\_t} structures, each of
which holds either runtime information about a virtual CPU, or is
``empty'' if the corresponding VCPU does not exist.
\item[evtchn\_pending] Guest-global array, with one bit per event
channel. Bits are set if an event is currently pending on that
channel.
\item[evtchn\_mask] Guest-global array for masking notifications on
event channels.
\item[wc\_version] Version counter for current wallclock time.
\item[wc\_sec] Whole seconds component of current wallclock time.
\item[wc\_nsec] Nanoseconds component of current wallclock time.
\item[arch] Host architecture-dependent portion of the shared info
structure.
\end{description}

\subsection{vcpu\_info\_t}

\scriptsize
\begin{verbatim}
typedef struct vcpu_info {
    /*
     * 'evtchn_upcall_pending' is written non-zero by Xen to indicate
     * a pending notification for a particular VCPU. It is then cleared
     * by the guest OS /before/ checking for pending work, thus avoiding
     * a set-and-check race. Note that the mask is only accessed by Xen
     * on the CPU that is currently hosting the VCPU. This means that the
     * pending and mask flags can be updated by the guest without special
     * synchronisation (i.e., no need for the x86 LOCK prefix).
     * This may seem suboptimal because if the pending flag is set by
     * a different CPU then an IPI may be scheduled even when the mask
     * is set. However, note:
     *  1. The task of 'interrupt holdoff' is covered by the per-event-
     *     channel mask bits. A 'noisy' event that is continually being
     *     triggered can be masked at source at this very precise
     *     granularity.
     *  2. The main purpose of the per-VCPU mask is therefore to restrict
     *     reentrant execution: whether for concurrency control, or to
     *     prevent unbounded stack usage. Whatever the purpose, we expect
     *     that the mask will be asserted only for short periods at a time,
     *     and so the likelihood of a 'spurious' IPI is suitably small.
     * The mask is read before making an event upcall to the guest: a
     * non-zero mask therefore guarantees that the VCPU will not receive
     * an upcall activation. The mask is cleared when the VCPU requests
     * to block: this avoids wakeup-waiting races.
     */
    uint8_t evtchn_upcall_pending;
    uint8_t evtchn_upcall_mask;
    unsigned long evtchn_pending_sel;
    arch_vcpu_info_t arch;
    vcpu_time_info_t time;
} vcpu_info_t; /* 64 bytes (x86) */
\end{verbatim}
\normalsize

\begin{description}
\item[evtchn\_upcall\_pending] This is set non-zero by Xen to indicate
that there are pending events to be received.
\item[evtchn\_upcall\_mask] This is set non-zero to disable all
interrupts for this CPU for short periods of time. If individual
event channels need to be masked, the {\bf evtchn\_mask} in the {\bf
shared\_info\_t} is used instead.
\item[evtchn\_pending\_sel] When an event is delivered to this VCPU, a
bit is set in this selector to indicate which word of the {\bf
evtchn\_pending} array in the {\bf shared\_info\_t} contains the
event in question.
\item[arch] Architecture-specific VCPU info. On x86 this contains the
virtualized CR2 register (page fault linear address) for this VCPU.
\item[time] Time values for this VCPU.
\end{description}

\subsection{vcpu\_time\_info}

\scriptsize
\begin{verbatim}
typedef struct vcpu_time_info {
    /*
     * Updates to the following values are preceded and followed by an
     * increment of 'version'. The guest can therefore detect updates by
     * looking for changes to 'version'. If the least-significant bit of
     * the version number is set then an update is in progress and the guest
     * must wait to read a consistent set of values.
     * The correct way to interact with the version number is similar to
     * Linux's seqlock: see the implementations of read_seqbegin/read_seqretry.
     */
    uint32_t version;
    uint32_t pad0;
    uint64_t tsc_timestamp;   /* TSC at last update of time vals.  */
    uint64_t system_time;     /* Time, in nanosecs, since boot.    */
    /*
     * Current system time:
     *   system_time + ((tsc - tsc_timestamp) << tsc_shift) * tsc_to_system_mul
     * CPU frequency (Hz):
     *   ((10^9 << 32) / tsc_to_system_mul) >> tsc_shift
     */
    uint32_t tsc_to_system_mul;
    int8_t   tsc_shift;
    int8_t   pad1[3];
} vcpu_time_info_t; /* 32 bytes */
\end{verbatim}
\normalsize

\begin{description}
\item[version] Used to ensure the guest gets consistent time updates.
\item[tsc\_timestamp] Cycle counter timestamp of last time value;
could be used to extrapolate in between updates, for instance.
\item[system\_time] Time since boot (nanoseconds).
\item[tsc\_to\_system\_mul] Cycle counter to nanoseconds multiplier
(used in extrapolating current time).
\item[tsc\_shift] Cycle counter to nanoseconds shift (used in
extrapolating current time).
\end{description}
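Putting the fields together, a guest could compute the current system time
from a freshly read TSC value roughly as follows. This is a minimal sketch:
it assumes {\tt tsc\_to\_system\_mul} is a 32.32 fixed-point multiplier so
that the product is shifted right by 32 bits (which is consistent with the
CPU-frequency formula quoted above); {\tt rdtsc()} and {\tt rmb()} stand for
the guest's cycle-counter read and read-memory-barrier primitives.

\scriptsize
\begin{verbatim}
uint64_t current_system_time(volatile vcpu_time_info_t *t)
{
    uint32_t ver;
    uint64_t tsc, delta, ns;

    do {
        ver = t->version;
        rmb();                          /* seqlock-style consistent read */
        tsc   = rdtsc();
        delta = tsc - t->tsc_timestamp;
        if (t->tsc_shift >= 0)
            delta <<= t->tsc_shift;
        else
            delta >>= -t->tsc_shift;
        /* 32.32 fixed-point scaling of cycles to nanoseconds.  A real
         * implementation would use a wider multiply to avoid overflow
         * for very large deltas. */
        ns = t->system_time + ((delta * t->tsc_to_system_mul) >> 32);
        rmb();
    } while ((ver & 1) || (ver != t->version));

    return ns;
}
\end{verbatim}
\normalsize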

\subsection{arch\_shared\_info\_t}

On x86, the {\bf arch\_shared\_info\_t} is defined as follows (from
xen/public/arch-x86\_32.h):

\scriptsize
\begin{verbatim}
typedef struct arch_shared_info {
    unsigned long max_pfn;          /* max pfn that appears in table */
    /* Frame containing list of mfns containing list of mfns containing p2m. */
    unsigned long pfn_to_mfn_frame_list_list;
} arch_shared_info_t;
\end{verbatim}
\normalsize

\begin{description}
\item[max\_pfn] The maximum PFN listed in the physical-to-machine
mapping table (P2M table).
\item[pfn\_to\_mfn\_frame\_list\_list] Machine address of the frame
that contains the machine addresses of the P2M table frames.
\end{description}

\section{Start info page}

The start info structure is declared as the following (in {\bf
xen/include/public/xen.h}):

\scriptsize
\begin{verbatim}
#define MAX_GUEST_CMDLINE 1024
typedef struct start_info {
    /* THE FOLLOWING ARE FILLED IN BOTH ON INITIAL BOOT AND ON RESUME.    */
    char magic[32];             /* "Xen-<version>.<subversion>".          */
    unsigned long nr_pages;     /* Total pages allocated to this domain.  */
    unsigned long shared_info;  /* MACHINE address of shared info struct. */
    uint32_t flags;             /* SIF_xxx flags.                         */
    unsigned long store_mfn;    /* MACHINE page number of shared page.    */
    uint32_t store_evtchn;      /* Event channel for store communication. */
    unsigned long console_mfn;  /* MACHINE address of console page.       */
    uint32_t console_evtchn;    /* Event channel for console messages.    */
    /* THE FOLLOWING ARE ONLY FILLED IN ON INITIAL BOOT (NOT RESUME).     */
    unsigned long pt_base;      /* VIRTUAL address of page directory.     */
    unsigned long nr_pt_frames; /* Number of bootstrap p.t. frames.       */
    unsigned long mfn_list;     /* VIRTUAL address of page-frame list.    */
    unsigned long mod_start;    /* VIRTUAL address of pre-loaded module.  */
    unsigned long mod_len;      /* Size (bytes) of pre-loaded module.     */
    int8_t cmd_line[MAX_GUEST_CMDLINE];
} start_info_t;
\end{verbatim}
\normalsize

The fields are in two groups: the first group is always filled in
when a domain is booted or resumed, while the second is only used at
boot time.

The always-available group is as follows:

\begin{description}
\item[magic] A text string identifying the Xen version to the guest.
\item[nr\_pages] The number of real machine pages available to the
guest.
\item[shared\_info] Machine address of the shared info structure,
allowing the guest to map it during initialisation.
\item[flags] Flags for describing optional extra settings to the
guest.
\item[store\_mfn] Machine address of the Xenstore communications page.
\item[store\_evtchn] Event channel to communicate with the store.
\item[console\_mfn] Machine address of the console data page.
\item[console\_evtchn] Event channel to notify the console backend.
\end{description}

The boot-only group may only be safely referred to during system boot:

\begin{description}
\item[pt\_base] Virtual address of the page directory created for us
by the domain builder.
\item[nr\_pt\_frames] Number of frames used by the domain builder's bootstrap
pagetables.
\item[mfn\_list] Virtual address of the list of machine frames this
domain owns.
\item[mod\_start] Virtual address of any pre-loaded modules
(e.g.\ a ramdisk).
\item[mod\_len] Size of pre-loaded module (if any).
\item[cmd\_line] Kernel command line passed by the domain builder.
\end{description}


% by Mark Williamson <mark.williamson@cl.cam.ac.uk>

\chapter{Event Channels}
\label{c:eventchannels}

Event channels are the basic primitive provided by Xen for event
notifications. An event is the Xen equivalent of a hardware
interrupt. They essentially store one bit of information; the event
of interest is signalled by transitioning this bit from 0 to 1.

Notifications are received by a guest via an upcall from Xen,
indicating when an event arrives (setting the bit). Further
notifications are masked until the bit is cleared again (therefore,
guests must check the value of the bit after re-enabling event
delivery to ensure no missed notifications).

Event notifications can be masked by setting a flag; this is
equivalent to disabling interrupts and can be used to ensure atomicity
of certain operations in the guest kernel.
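To illustrate how the bits described in the previous chapter fit together, a
guest's upcall handler might scan for pending events along the following
lines. This is a minimal sketch: {\tt xchg()}, {\tt clear\_bit()},
{\tt find\_first\_set()}, {\tt BITS\_PER\_LONG} and
{\tt do\_IRQ\_for\_port()} are assumed guest-side primitives, and the
shared-info and VCPU-info field names are those documented earlier.

\scriptsize
\begin{verbatim}
void evtchn_do_upcall(shared_info_t *s, vcpu_info_t *v)
{
    unsigned long sel, pending, word, bit;

    v->evtchn_upcall_pending = 0;       /* clear /before/ scanning        */

    /* Atomically grab and reset the per-VCPU selector word. */
    sel = xchg(&v->evtchn_pending_sel, 0);

    while (sel != 0) {
        word = find_first_set(sel);     /* index into the pending array   */
        sel &= ~(1UL << word);

        /* Consider only channels that are pending and not masked. */
        pending = s->evtchn_pending[word] & ~s->evtchn_mask[word];
        while (pending != 0) {
            bit = find_first_set(pending);
            pending &= ~(1UL << bit);

            clear_bit(word * BITS_PER_LONG + bit, s->evtchn_pending);
            do_IRQ_for_port(word * BITS_PER_LONG + bit);
        }
    }
}
\end{verbatim}
\normalsize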

\section{Hypercall interface}

\hypercall{event\_channel\_op(evtchn\_op\_t *op)}

The event channel operation hypercall is used for all operations on
event channels / ports. Operations are distinguished by the value of
the {\bf cmd} field of the {\bf op} structure. The possible commands
are described below (a usage sketch follows the list):

\begin{description}

\item[EVTCHNOP\_alloc\_unbound]
Allocate a new event channel port, ready to be connected to by a
remote domain.
\begin{itemize}
\item Specified domain must exist.
\item A free port must exist in that domain.
\end{itemize}
Unprivileged domains may only allocate their own ports; privileged
domains may also allocate ports in other domains.
\item[EVTCHNOP\_bind\_interdomain]
Bind an event channel for interdomain communications.
\begin{itemize}
\item Caller domain must have a free port to bind.
\item Remote domain must exist.
\item Remote port must be allocated and currently unbound.
\item Remote port must be expecting the caller domain as the ``remote''.
\end{itemize}
\item[EVTCHNOP\_bind\_virq]
Allocate a port and bind a VIRQ to it.
\begin{itemize}
\item Caller domain must have a free port to bind.
\item VIRQ must be valid.
\item VCPU must exist.
\item VIRQ must not currently be bound to an event channel.
\end{itemize}
\item[EVTCHNOP\_bind\_ipi]
Allocate and bind a port for notifying other virtual CPUs.
\begin{itemize}
\item Caller domain must have a free port to bind.
\item VCPU must exist.
\end{itemize}
\item[EVTCHNOP\_bind\_pirq]
Allocate and bind a port to a real IRQ.
\begin{itemize}
\item Caller domain must have a free port to bind.
\item PIRQ must be within the valid range.
\item Another binding for this PIRQ must not exist for this domain.
\item Caller must have an available port.
\end{itemize}
\item[EVTCHNOP\_close]
Close an event channel (no more events will be received).
\begin{itemize}
\item Port must be valid (currently allocated).
\end{itemize}
\item[EVTCHNOP\_send] Send a notification on an event channel attached
to a port.
\begin{itemize}
\item Port must be valid.
\item Only valid for Interdomain, IPI or Allocated Unbound ports.
\end{itemize}
\item[EVTCHNOP\_status] Query the status of a port: what kind of port it is,
whether it is bound, what remote domain is expected, what PIRQ or
VIRQ it is bound to, what VCPU will be notified, etc.
Unprivileged domains may only query the state of their own ports.
Privileged domains may query any port.
\item[EVTCHNOP\_bind\_vcpu] Bind an event channel to a particular VCPU ---
receive notification upcalls only on that VCPU.
\begin{itemize}
\item VCPU must exist.
\item Port must be valid.
\item Event channel must be either allocated but unbound, bound to
an interdomain event channel, or bound to a PIRQ.
\end{itemize}

\end{description}
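As an example of how these commands are issued, the granting side of an
interdomain channel might be set up roughly like this. This is a minimal
sketch: {\bf HYPERVISOR\_event\_channel\_op} is an assumed guest-side
wrapper, the {\tt evtchn\_op\_t} field names ({\tt cmd},
{\tt u.alloc\_unbound.dom}, etc.) are taken from the public headers of this
era and should be checked against {\bf xen/include/public/event\_channel.h},
and {\tt remote\_domid} and {\tt local\_port} are illustrative variables.

\scriptsize
\begin{verbatim}
evtchn_op_t op;

/* Allocate an unbound port on which a specific remote domain may
 * later perform EVTCHNOP_bind_interdomain. */
op.cmd                        = EVTCHNOP_alloc_unbound;
op.u.alloc_unbound.dom        = DOMID_SELF;   /* port lives in our domain */
op.u.alloc_unbound.remote_dom = remote_domid; /* who may bind to it       */

if (HYPERVISOR_event_channel_op(&op) != 0)
    return -1;

/* The allocated local port; typically advertised via xenstore so that
 * the remote domain knows which port to bind to. */
local_port = op.u.alloc_unbound.port;
\end{verbatim}
\normalsize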

%%
%% grant_tables.tex
%%
%% Made by Mark Williamson
%% Login <mark@maw48>
%%

\chapter{Grant tables}
\label{c:granttables}

Xen's grant tables provide a generic mechanism for memory sharing
between domains. This shared memory interface underpins the split
device drivers for block and network IO.

Each domain has its own {\bf grant table}. This is a data structure
that is shared with Xen; it allows the domain to tell Xen what kind of
permissions other domains have on its pages. Entries in the grant
table are identified by {\bf grant references}. A grant reference is
an integer, which indexes into the grant table. It acts as a
capability which the grantee can use to perform operations on the
granter's memory.

This capability-based system allows shared-memory communications
between unprivileged domains. A grant reference also encapsulates the
details of a shared page, removing the need for a domain to know the
real machine address of a page it is sharing. This makes it possible
to share memory correctly with domains running in fully virtualised
memory.

\section{Interface}

\subsection{Grant table manipulation}

Creating and destroying grant references is done by direct access to
the grant table. This removes the need to involve Xen when creating
grant references, modifying access permissions, etc. The grantee
domain will invoke hypercalls to use the grant references. Four main
operations can be accomplished by directly manipulating the table:

\begin{description}
\item[Grant foreign access] allocate a new entry in the grant table
and fill out the access permissions accordingly. The access
permissions will be looked up by Xen when the grantee attempts to
use the reference to map the granted frame.
\item[End foreign access] check that the grant reference is not
currently in use, then remove the mapping permissions for the frame.
This prevents further mappings from taking place but does not allow
forced revocations of existing mappings.
\item[Grant foreign transfer] allocate a new entry in the table
specifying transfer permissions for the grantee. Xen will look up
this entry when the grantee attempts to transfer a frame to the
granter.
\item[End foreign transfer] remove permissions to prevent a transfer
occurring in future. If the transfer is already committed,
modifying the grant table cannot prevent it from completing.
\end{description}

\subsection{Hypercalls}

Use of grant references is accomplished via a hypercall. The grant
table op hypercall takes three arguments:

\hypercall{grant\_table\_op(unsigned int cmd, void *uop, unsigned int count)}

{\bf cmd} indicates the grant table operation of interest. {\bf uop}
is a pointer to a structure (or an array of structures) describing the
operation to be performed. The {\bf count} field describes how many
grant table operations are being batched together.

The core logic is situated in {\bf xen/common/grant\_table.c}. The
grant table operation hypercall can be used to perform the following
actions (a mapping sketch follows the list):

\begin{description}
\item[GNTTABOP\_map\_grant\_ref] Given a grant reference from another
domain, map the referred page into the caller's address space.
\item[GNTTABOP\_unmap\_grant\_ref] Remove a mapping to a granted frame
from the caller's address space. This is used to voluntarily
relinquish a mapping to a granted page.
\item[GNTTABOP\_setup\_table] Set up a grant table for the caller domain.
\item[GNTTABOP\_dump\_table] Debugging operation.
\item[GNTTABOP\_transfer] Given a transfer reference from another
domain, transfer ownership of a page frame to that domain.
\end{description}
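For example, a grantee domain that has obtained a grant reference (typically
via xenstore) might map the granted frame roughly as follows. This is a
minimal sketch: {\bf HYPERVISOR\_grant\_table\_op} is an assumed guest-side
wrapper, the {\tt gnttab\_map\_grant\_ref\_t} field names and the
{\tt GNTMAP\_host\_map} flag are taken from the public grant-table headers of
this era and should be checked against {\bf xen/include/public/grant\_table.h},
and {\tt mapping\_vaddr}, {\tt grant\_ref} and {\tt granting\_domid} are
illustrative variables.

\scriptsize
\begin{verbatim}
gnttab_map_grant_ref_t map;

map.host_addr = mapping_vaddr;   /* where to map it in our address space */
map.flags     = GNTMAP_host_map; /* an ordinary host (CPU) mapping       */
map.ref       = grant_ref;       /* reference advertised by the granter  */
map.dom       = granting_domid;  /* domain that owns the page            */

if (HYPERVISOR_grant_table_op(GNTTABOP_map_grant_ref, &map, 1) != 0 ||
    map.status != 0)
    return -1;                   /* Xen refused the mapping              */

/* Keep map.handle: it is needed later for GNTTABOP_unmap_grant_ref. */
\end{verbatim}
\normalsize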

%%
%% xenstore.tex
%%
%% Made by Mark Williamson
%% Login <mark@maw48>
%%

\chapter{Xenstore}

Xenstore is the mechanism by which control-plane activities occur.
These activities include:

\begin{itemize}
\item Setting up shared memory regions and event channels for use with
the split device drivers.
\item Notifying the guest of control events (e.g. balloon driver
requests).
\item Reporting back status information from the guest
(e.g. performance-related statistics, etc.).
\end{itemize}

The store is arranged as a hierarchical collection of key-value pairs.
Each domain has a directory hierarchy containing data related to its
configuration. Domains are permitted to register for notifications
about changes in subtrees of the store, and to apply changes to the
store transactionally.

\section{Guidelines}

A few principles govern the operation of the store:

\begin{itemize}
\item Domains should only modify the contents of their own
directories.
\item The setup protocol for a device channel should simply consist of
entering the configuration data into the store.
\item The store should allow device discovery without requiring the
relevant device drivers to be loaded: a Xen ``bus'' should be
visible to probing code in the guest.
\item The store should be usable for inter-tool communications,
allowing the tools themselves to be decomposed into a number of
smaller utilities, rather than a single monolithic entity. This
also facilitates the development of alternate user interfaces to the
same functionality.
\end{itemize}

\section{Store layout}

There are three main paths in XenStore:

\begin{description}
\item[/vm] stores configuration information about a domain
\item[/local/domain] stores information about the domain on the local node (domid, etc.)
\item[/tool] stores information for the various tools
\end{description}

The {\bf /vm} path stores configuration information for a domain.
This information doesn't change and is indexed by the domain's UUID.
A {\bf /vm} entry contains the following information:

\begin{description}
\item[uuid] uuid of the domain (somewhat redundant)
\item[on\_reboot] the action to take on a domain reboot request (destroy or restart)
\item[on\_poweroff] the action to take on a domain halt request (destroy or restart)
\item[on\_crash] the action to take on a domain crash (destroy or restart)
\item[vcpus] the number of allocated vcpus for the domain
\item[memory] the amount of memory (in megabytes) for the domain. Note: appears to sometimes be empty for domain-0.
\item[vcpu\_avail] the number of active vcpus for the domain (vcpus minus the number of disabled vcpus)
\item[name] the name of the domain
\end{description}

---|
969 | {\bf /vm/$<$uuid$>$/image/} |
---|
970 | |
---|
971 | The image path is only available for Domain-Us and contains: |
---|
972 | \begin{description} |
---|
973 | \item[ostype] identifies the builder type (linux or vmx) |
---|
974 | \item[kernel] path to kernel on domain-0 |
---|
975 | \item[cmdline] command line to pass to domain-U kernel |
---|
976 | \item[ramdisk] path to ramdisk on domain-0 |
---|
977 | \end{description} |
---|
978 | |
---|
979 | {\bf /local} |
---|
980 | |
---|
981 | The {\tt /local} path currently only contains one directory, {\tt |
---|
982 | /local/domain} that is indexed by domain id. It contains the running |
---|
983 | domain information. The reason to have two storage areas is that |
---|
984 | during migration, the uuid doesn't change but the domain id does. The |
---|
985 | {\tt /local/domain} directory can be created and populated before |
---|
986 | finalizing the migration enabling localhost to localhost migration. |
---|
987 | |
---|
988 | {\bf /local/domain/$<$domid$>$} |
---|
989 | |
---|
990 | This path contains: |
---|
991 | |
---|
992 | \begin{description} |
---|
993 | \item[cpu\_time] xend start time (this is only around for domain-0) |
---|
994 | \item[handle] private handle for xend |
---|
995 | \item[name] see /vm |
---|
996 | \item[on\_reboot] see /vm |
---|
997 | \item[on\_poweroff] see /vm |
---|
998 | \item[on\_crash] see /vm |
---|
999 | \item[vm] the path to the VM directory for the domain |
---|
1000 | \item[domid] the domain id (somewhat redundant) |
---|
1001 | \item[running] indicates that the domain is currently running |
---|
1002 | \item[memory] the current memory in megabytes for the domain (empty for domain-0?) |
---|
1003 | \item[maxmem\_KiB] the maximum memory for the domain (in kilobytes) |
---|
1004 | \item[memory\_KiB] the memory allocated to the domain (in kilobytes) |
---|
1005 | \item[cpu] the current CPU the domain is pinned to (empty for domain-0?) |
---|
1006 | \item[cpu\_weight] the weight assigned to the domain |
---|
1007 | \item[vcpu\_avail] a bitmap telling the domain whether it may use a given VCPU |
---|
1008 | \item[online\_vcpus] how many vcpus are currently online |
---|
1009 | \item[vcpus] the total number of vcpus allocated to the domain |
---|
1010 | \item[console/] a directory for console information |
---|
1011 | \begin{description} |
---|
1012 | \item[ring-ref] the grant table reference of the console ring queue |
---|
1013 | \item[port] the event channel being used for the console ring queue (local port) |
---|
1014 | \item[tty] the current tty the console data is being exposed of |
---|
1015 | \item[limit] the limit (in bytes) of console data to buffer |
---|
1016 | \end{description} |
---|
1017 | \item[backend/] a directory containing all backends the domain hosts |
---|
1018 | \begin{description} |
---|
1019 | \item[vbd/] a directory containing vbd backends |
---|
1020 | \begin{description} |
---|
1021 | \item[$<$domid$>$/] a directory containing vbd's for domid |
---|
1022 | \begin{description} |
---|
1023 | \item[$<$virtual-device$>$/] a directory for a particular |
---|
1024 | virtual-device on domid |
---|
1025 | \begin{description} |
---|
1026 | \item[frontend-id] domain id of frontend |
---|
1027 | \item[frontend] the path to the frontend domain |
---|
1028 | \item[physical-device] backend device number |
---|
1029 | \item[sector-size] backend sector size |
---|
1030 | \item[info] 0 read/write, 1 read-only (is this right?) |
---|
1031 | \item[domain] name of frontend domain |
---|
1032 | \item[params] parameters for device |
---|
1033 | \item[type] the type of the device |
---|
1034 | \item[dev] the virtual device (as given by the user) |
---|
1035 | \item[node] output from block creation script |
---|
1036 | \end{description} |
---|
1037 | \end{description} |
---|
1038 | \end{description} |
---|
1039 | |
---|
1040 | \item[vif/] a directory containing vif backends |
---|
1041 | \begin{description} |
---|
1042 | \item[$<$domid$>$/] a directory containing vif's for domid |
---|
1043 | \begin{description} |
---|
1044 | \item[$<$vif number$>$/] a directory for each vif |
---|
1045 | \item[frontend-id] the domain id of the frontend |
---|
1046 | \item[frontend] the path to the frontend |
---|
1047 | \item[mac] the mac address of the vif |
---|
1048 | \item[bridge] the bridge the vif is connected to |
---|
1049 | \item[handle] the handle of the vif |
---|
1050 | \item[script] the script used to create/stop the vif |
---|
1051 | \item[domain] the name of the frontend |
---|
1052 | \end{description} |
---|
1053 | \end{description} |
---|
1054 | |
---|
1055 | \item[vtpm/] a directory containin vtpm backends |
---|
1056 | \begin{description} |
---|
1057 | \item[$<$domid$>$/] a directory containing vtpm's for domid |
---|
1058 | \begin{description} |
---|
1059 | \item[$<$vtpm number$>$/] a directory for each vtpm |
---|
1060 | \item[frontend-id] the domain id of the frontend |
---|
1061 | \item[frontend] the path to the frontend |
---|
1062 | \item[instance] the instance of the virtual TPM that is used |
---|
1063 | \item[pref{\textunderscore}instance] the instance number as given in the VM configuration file; |
---|
1064 | may be different from {\bf instance} |
---|
1065 | \item[domain] the name of the domain of the frontend |
---|
1066 | \end{description} |
---|
1067 | \end{description} |
---|
1068 | |
---|
1069 | \end{description} |
---|
1070 | |
---|
1071 | \item[device/] a directory containing the frontend devices for the |
---|
1072 | domain |
---|
1073 | \begin{description} |
---|
1074 | \item[vbd/] a directory containing vbd frontend devices for the |
---|
1075 | domain |
---|
1076 | \begin{description} |
---|
1077 | \item[$<$virtual-device$>$/] a directory containing the vbd frontend for |
---|
1078 | virtual-device |
---|
1079 | \begin{description} |
---|
1080 | \item[virtual-device] the device number of the frontend device |
---|
1081 | \item[backend-id] the domain id of the backend |
---|
1082 | \item[backend] the path of the backend in the store (/local/domain |
---|
1083 | path) |
---|
1084 | \item[ring-ref] the grant table reference for the block request |
---|
1085 | ring queue |
---|
1086 | \item[event-channel] the event channel used for the block request |
---|
1087 | ring queue |
---|
1088 | \end{description} |
---|
1089 | |
---|
1090 | \item[vif/] a directory containing vif frontend devices for the |
---|
1091 | domain |
---|
1092 | \begin{description} |
---|
1093 | \item[$<$id$>$/] a directory for vif id frontend device for the domain |
---|
1094 | \begin{description} |
---|
1095 | \item[backend-id] the backend domain id |
---|
1096 | \item[mac] the mac address of the vif |
---|
1097 | \item[handle] the internal vif handle |
---|
1098 | \item[backend] a path to the backend's store entry |
---|
1099 | \item[tx-ring-ref] the grant table reference for the transmission ring queue |
---|
1100 | \item[rx-ring-ref] the grant table reference for the receiving ring queue |
---|
1101 | \item[event-channel] the event channel used for the two ring queues |
---|
1102 | \end{description} |
---|
1103 | \end{description} |
---|
1104 | |
---|
1105 | \item[vtpm/] a directory containing the vtpm frontend device for the |
---|
1106 | domain |
---|
1107 | \begin{description} |
---|
1108 | \item[$<$id$>$] a directory for vtpm id frontend device for the domain |
---|
1109 | \begin{description} |
---|
1110 | \item[backend-id] the backend domain id |
---|
1111 | \item[backend] a path to the backend's store entry |
---|
1112 | \item[ring-ref] the grant table reference for the tx/rx ring |
---|
1113 | \item[event-channel] the event channel used for the ring |
---|
1114 | \end{description} |
---|
1115 | \end{description} |
---|
1116 | |
---|
1117 | \item[device-misc/] miscellaneous information for devices |
---|
1118 | \begin{description} |
---|
1119 | \item[vif/] miscellaneous information for vif devices |
---|
1120 | \begin{description} |
---|
1121 | \item[nextDeviceID] the next device id to use |
---|
1122 | \end{description} |
---|
1123 | \end{description} |
---|
1124 | \end{description} |
---|
1125 | \end{description} |
---|
1126 | |
---|
1127 | \item[security/] access control information for the domain |
---|
1128 | \begin{description} |
---|
1129 | \item[ssidref] security reference identifier used inside the hypervisor |
---|
1130 | \item[access\_control/] security label used by management tools |
---|
1131 | \begin{description} |
---|
1132 | \item[label] security label name |
---|
1133 | \item[policy] security policy name |
---|
1134 | \end{description} |
---|
1135 | \end{description} |
---|
1136 | |
---|
1137 | \item[store/] per-domain information for the store |
---|
1138 | \begin{description} |
---|
1139 | \item[port] the event channel used for the store ring queue |
---|
1140 | \item[ring-ref] the grant table reference used for the store's |
---|
1141 | communication channel |
---|
1142 | \end{description} |
---|
1143 | |
---|
1144 | \item[image] private xend information |
---|
1145 | \end{description} |
---|
1146 | |
---|
1147 | |
---|
1148 | \chapter{Devices} |
---|
1149 | \label{c:devices} |
---|
1150 | |
---|
1151 | Virtual devices under Xen are provided by a {\bf split device driver} |
---|
1152 | architecture. The illusion of the virtual device is provided by two |
---|
1153 | co-operating drivers: the {\bf frontend}, which runs in an |
---|
1154 | unprivileged domain, and the {\bf backend}, which runs in a domain with |
---|
1155 | access to the real device hardware (often called a {\bf driver |
---|
1156 | domain}; in practice domain 0 usually fulfills this function). |
---|
1157 | |
---|
1158 | The frontend driver appears to the unprivileged guest as if it were a |
---|
1159 | real device, for instance a block or network device. It receives IO |
---|
1160 | requests from its kernel as usual; however, since it does not have |
---|
1161 | access to the physical hardware of the system it must then issue |
---|
1162 | requests to the backend. The backend driver is responsible for |
---|
1163 | receiving these IO requests, verifying that they are safe and then |
---|
1164 | issuing them to the real device hardware. The backend driver appears |
---|
1165 | to its kernel as a normal user of in-kernel IO functionality. When |
---|
1166 | the IO completes the backend notifies the frontend that the data is |
---|
1167 | ready for use; the frontend is then able to report IO completion to |
---|
1168 | its own kernel. |
---|
1169 | |
---|
1170 | Frontend drivers are designed to be simple; most of the complexity is |
---|
1171 | in the backend, which has responsibility for translating device |
---|
1172 | addresses, verifying that requests are well-formed and do not violate |
---|
1173 | isolation guarantees, etc. |
---|
1174 | |
---|
1175 | Split drivers exchange requests and responses in shared memory, with |
---|
1176 | an event channel for asynchronous notifications of activity. When the |
---|
1177 | frontend driver comes up, it uses Xenstore to set up a shared memory |
---|
1178 | frame and an interdomain event channel for communications with the |
---|
1179 | backend. Once this connection is established, the two can communicate |
---|
1180 | directly by placing requests / responses into shared memory and then |
---|
1181 | sending notifications on the event channel. This separation of |
---|
1182 | notification from data transfer allows message batching, and results |
---|
1183 | in very efficient device access. |
---|
1184 | |
---|
1185 | This chapter focuses on some individual split device interfaces |
---|
1186 | available to Xen guests. |
---|
1187 | |
---|
1188 | |
---|
1189 | \section{Network I/O} |
---|
1190 | |
---|
1191 | Virtual network device services are provided by shared memory |
---|
1192 | communication with a backend domain. From the point of view of other |
---|
1193 | domains, the backend may be viewed as a virtual ethernet switch |
---|
1194 | element with each domain having one or more virtual network interfaces |
---|
1195 | connected to it. |
---|
1196 | |
---|
1197 | From the point of view of the backend domain itself, the network |
---|
1198 | backend driver consists of a number of ethernet devices. Each of |
---|
1199 | these has a logical direct connection to a virtual network device in |
---|
1200 | another domain. This allows the backend domain to route, bridge, |
---|
1201 | firewall, etc.\ the traffic to / from the other domains using normal |
---|
1202 | operating system mechanisms. |
---|
1203 | |
---|
1204 | \subsection{Backend Packet Handling} |
---|
1205 | |
---|
1206 | The backend driver is responsible for a variety of actions relating to |
---|
1207 | the transmission and reception of packets from the physical device. |
---|
1208 | With regard to transmission, the backend performs these key actions: |
---|
1209 | |
---|
1210 | \begin{itemize} |
---|
1211 | \item {\bf Validation:} To ensure that domains do not attempt to |
---|
1212 | generate invalid (e.g. spoofed) traffic, the backend driver may |
---|
1213 | validate headers, ensuring that source MAC and IP addresses match the |
---|
1214 | interface from which they were sent. |
---|
1215 | |
---|
1216 | Validation functions can be configured using standard firewall rules |
---|
1217 | ({\small{\tt iptables}} in the case of Linux). |
---|
1218 | |
---|
1219 | \item {\bf Scheduling:} Since a number of domains can share a single |
---|
1220 | physical network interface, the backend must mediate access when |
---|
1221 | several domains each have packets queued for transmission. This |
---|
1222 | general scheduling function subsumes basic shaping or rate-limiting |
---|
1223 | schemes. |
---|
1224 | |
---|
1225 | \item {\bf Logging and Accounting:} The backend domain can be |
---|
1226 | configured with classifier rules that control how packets are |
---|
1227 | accounted or logged. For example, log messages might be generated |
---|
1228 | whenever a domain attempts to send a TCP packet containing a SYN. |
---|
1229 | \end{itemize} |
---|
1230 | |
---|
1231 | On receipt of incoming packets, the backend acts as a simple |
---|
1232 | demultiplexer: Packets are passed to the appropriate virtual interface |
---|
1233 | after any necessary logging and accounting have been carried out. |
---|
1234 | |
---|
1235 | \subsection{Data Transfer} |
---|
1236 | |
---|
1237 | Each virtual interface uses two ``descriptor rings'', one for |
---|
1238 | transmit, the other for receive. Each descriptor identifies a block |
---|
1239 | of contiguous machine memory allocated to the domain. |
---|
1240 | |
---|
1241 | The transmit ring carries packets to transmit from the guest to the |
---|
1242 | backend domain. The return path of the transmit ring carries messages |
---|
1243 | indicating that the contents have been physically transmitted and the |
---|
1244 | backend no longer requires the associated pages of memory. |
---|
1245 | |
---|
1246 | To receive packets, the guest places descriptors of unused pages on |
---|
1247 | the receive ring. The backend will return received packets by |
---|
1248 | exchanging these pages in the domain's memory with new pages |
---|
1249 | containing the received data, and passing back descriptors regarding |
---|
1250 | the new packets on the ring. This zero-copy approach allows the |
---|
1251 | backend to maintain a pool of free pages to receive packets into, and |
---|
1252 | then deliver them to appropriate domains after examining their |
---|
1253 | headers. |
---|
1254 | |
---|
1255 | % Real physical addresses are used throughout, with the domain |
---|
1256 | % performing translation from pseudo-physical addresses if that is |
---|
1257 | % necessary. |
---|
1258 | |
---|
1259 | If a domain does not keep its receive ring stocked with empty buffers |
---|
1260 | then packets destined to it may be dropped. This provides some |
---|
1261 | defence against receive livelock problems because an overloaded domain |
---|
1262 | will cease to receive further data. Similarly, on the transmit path, |
---|
1263 | it provides the application with feedback on the rate at which packets |
---|
1264 | are able to leave the system. |
---|
1265 | |
---|
1266 | Flow control on rings is achieved by including a pair of producer |
---|
1267 | indexes on the shared ring page. Each side will maintain a private |
---|
1268 | consumer index indicating the next outstanding message. In this |
---|
1269 | manner, the domains cooperate to divide the ring into two message |
---|
1270 | lists, one in each direction. Notification is decoupled from the |
---|
1271 | immediate placement of new messages on the ring; the event channel |
---|
1272 | will be used to generate notification when {\em either} a certain |
---|
1273 | number of outstanding messages are queued, {\em or} a specified number |
---|
1274 | of nanoseconds have elapsed since the oldest message was placed on the |
---|
1275 | ring. |
---|
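
The simplified sketch below illustrates this producer/consumer
discipline for one direction of such a ring. The shared page carries
the producer indices, each side keeps its consumer index privately,
and the event-channel notification may be deferred and batched. The
structure and function names here are illustrative only; they are not
the actual ring implementation used by Xen guests.

\scriptsize
\begin{verbatim}
#include <stdint.h>

#define RING_SIZE 256                /* a power of two, so indices wrap freely */

/* A shared ring page (simplified). */
struct shared_ring {
    volatile uint32_t req_prod;      /* advanced by the requesting domain   */
    volatile uint32_t rsp_prod;      /* advanced by the responding domain   */
    uint64_t slot[RING_SIZE];        /* message slots                       */
};

static uint32_t rsp_cons;            /* private consumer index (this side)  */

/* Producer: place a request; the event-channel notification may be
 * batched and sent once several requests have been queued. */
static void put_request(struct shared_ring *r, uint64_t msg)
{
    r->slot[r->req_prod % RING_SIZE] = msg;
    /* A write barrier belongs here, so the payload is visible to the
     * other domain before the producer index advances. */
    r->req_prod++;
}

/* Consumer: drain all responses produced since we last looked. */
static void drain_responses(struct shared_ring *r)
{
    while (rsp_cons != r->rsp_prod) {
        uint64_t msg = r->slot[rsp_cons % RING_SIZE];
        (void)msg;                   /* process the response */
        rsp_cons++;
    }
}
\end{verbatim}
\normalsize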
1276 | |
---|
1277 | %% Not sure if my version is any better -- here is what was here |
---|
1278 | %% before: Synchronization between the backend domain and the guest is |
---|
1279 | %% achieved using counters held in shared memory that is accessible to |
---|
1280 | %% both. Each ring has associated producer and consumer indices |
---|
1281 | %% indicating the area in the ring that holds descriptors that contain |
---|
1282 | %% data. After receiving {\it n} packets or {\t nanoseconds} after |
---|
1283 | %% receiving the first packet, the hypervisor sends an event to the |
---|
1284 | %% domain. |
---|
1285 | |
---|
1286 | |
---|
1287 | \subsection{Network ring interface} |
---|
1288 | |
---|
1289 | The network device uses two shared memory rings for communication: one |
---|
1290 | for transmit, one for receive. |
---|
1291 | |
---|
1292 | Transmit requests are described by the following structure: |
---|
1293 | |
---|
1294 | \scriptsize |
---|
1295 | \begin{verbatim} |
---|
1296 | typedef struct netif_tx_request { |
---|
1297 | grant_ref_t gref; /* Reference to buffer page */ |
---|
1298 | uint16_t offset; /* Offset within buffer page */ |
---|
1299 | uint16_t flags; /* NETTXF_* */ |
---|
1300 | uint16_t id; /* Echoed in response message. */ |
---|
1301 | uint16_t size; /* Packet size in bytes. */ |
---|
1302 | } netif_tx_request_t; |
---|
1303 | \end{verbatim} |
---|
1304 | \normalsize |
---|
1305 | |
---|
1306 | \begin{description} |
---|
1307 | \item[gref] Grant reference for the network buffer |
---|
1308 | \item[offset] Offset to data |
---|
1309 | \item[flags] Transmit flags (currently only NETTXF\_csum\_blank is |
---|
1310 | supported, to indicate that the protocol checksum field is |
---|
1311 | incomplete). |
---|
1312 | \item[id] Echoed to guest by the backend in the ring-level response so |
---|
1313 | that the guest can match it to this request |
---|
1314 | \item[size] Buffer size |
---|
1315 | \end{description} |
---|
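
As an illustrative sketch (using the structure above), a frontend
might fill in a transmit request as follows. The function name is
hypothetical, the caller is assumed to have already granted the
backend access to the buffer page, and free-slot accounting and the
final event-channel notification are omitted.

\scriptsize
\begin{verbatim}
/* Sketch: queue one packet for transmission.  'tx_ring' is the shared
 * transmit ring and 'tx_prod' the frontend's request-producer index. */
static void queue_tx(netif_tx_request_t *tx_ring, unsigned int ring_size,
                     uint32_t *tx_prod, grant_ref_t gref,
                     uint16_t pkt_off, uint16_t pkt_len, uint16_t req_id)
{
    netif_tx_request_t *req = &tx_ring[*tx_prod % ring_size];

    req->gref   = gref;      /* grant reference for the buffer page        */
    req->offset = pkt_off;   /* start of the packet within that page       */
    req->flags  = 0;         /* or NETTXF_csum_blank to defer checksumming */
    req->id     = req_id;    /* echoed back in the transmit response       */
    req->size   = pkt_len;   /* packet length in bytes                     */

    (*tx_prod)++;            /* make visible, then notify the event channel */
}
\end{verbatim}
\normalsize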
1316 | |
---|
1317 | Each transmit request is followed by a transmit response at some later |
---|
1318 | date. This is part of the shared-memory communication protocol and |
---|
1319 | allows the guest to (potentially) retire internal structures related |
---|
1320 | to the request. It does not imply a network-level response. This |
---|
1321 | structure is as follows: |
---|
1322 | |
---|
1323 | \scriptsize |
---|
1324 | \begin{verbatim} |
---|
1325 | typedef struct netif_tx_response { |
---|
1326 | uint16_t id; |
---|
1327 | int16_t status; |
---|
1328 | } netif_tx_response_t; |
---|
1329 | \end{verbatim} |
---|
1330 | \normalsize |
---|
1331 | |
---|
1332 | \begin{description} |
---|
1333 | \item[id] Echo of the ID field in the corresponding transmit request. |
---|
1334 | \item[status] Success / failure status of the transmit request. |
---|
1335 | \end{description} |
---|
1336 | |
---|
1337 | Receive requests must be queued by the frontend, accompanied by a |
---|
1338 | donation of page-frames to the backend. The backend transfers page |
---|
1339 | frames full of data back to the guest. |
---|
1340 | |
---|
1341 | \scriptsize |
---|
1342 | \begin{verbatim} |
---|
1343 | typedef struct { |
---|
1344 | uint16_t id; /* Echoed in response message. */ |
---|
1345 | grant_ref_t gref; /* Reference to incoming granted frame */ |
---|
1346 | } netif_rx_request_t; |
---|
1347 | \end{verbatim} |
---|
1348 | \normalsize |
---|
1349 | |
---|
1350 | \begin{description} |
---|
1351 | \item[id] Echoed by the frontend to identify this request when |
---|
1352 | responding. |
---|
1353 | \item[gref] Transfer reference - the backend will use this reference |
---|
1354 | to transfer a frame of network data to us. |
---|
1355 | \end{description} |
---|
1356 | |
---|
1357 | Receive response descriptors are queued for each received frame. Note |
---|
1358 | that these may only be queued in reply to an existing receive request, |
---|
1359 | providing an in-built form of traffic throttling. |
---|
1360 | |
---|
1361 | \scriptsize |
---|
1362 | \begin{verbatim} |
---|
1363 | typedef struct { |
---|
1364 | uint16_t id; |
---|
1365 | uint16_t offset; /* Offset in page of start of received packet */ |
---|
1366 | uint16_t flags; /* NETRXF_* */ |
---|
1367 | int16_t status; /* -ve: BLKIF_RSP_* ; +ve: Rx'ed pkt size. */ |
---|
1368 | } netif_rx_response_t; |
---|
1369 | \end{verbatim} |
---|
1370 | \normalsize |
---|
1371 | |
---|
1372 | \begin{description} |
---|
1373 | \item[id] ID echoed from the original request, used by the guest to |
---|
1374 | match this response to the original request. |
---|
1375 | \item[offset] Offset to data within the transferred frame. |
---|
1376 | \item[flags] Receive flags (currently only NETRXF\_csum\_valid is |
---|
1377 | supported, to indicate that the protocol checksum field has already |
---|
1378 | been validated). |
---|
1379 | \item[status] Success / error status for this operation. |
---|
1380 | \end{description} |
---|
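
A corresponding sketch of the frontend receive path is shown below:
empty pages are offered to the backend via receive requests, and each
receive response is later matched to its request by {\tt id}. The
function names are hypothetical, and index handling and page
bookkeeping are again simplified.

\scriptsize
\begin{verbatim}
/* Sketch: offer one empty page to the backend for incoming data. */
static void post_rx_buffer(netif_rx_request_t *rx_ring, unsigned int ring_size,
                           uint32_t *rx_req_prod, uint16_t id, grant_ref_t gref)
{
    netif_rx_request_t *req = &rx_ring[*rx_req_prod % ring_size];
    req->id   = id;          /* lets us match the eventual response       */
    req->gref = gref;        /* transfer reference for the donated frame  */
    (*rx_req_prod)++;        /* then notify via the event channel         */
}

/* Sketch: handle one receive response. */
static void handle_rx_response(const netif_rx_response_t *rsp)
{
    if (rsp->status < 0) {
        /* Error: recycle the buffer associated with rsp->id. */
        return;
    }
    /* rsp->status is the packet length; the data begins at rsp->offset
     * within the page that was transferred into our address space.    */
}
\end{verbatim}
\normalsize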
1381 | |
---|
1382 | Note that the receive protocol includes a mechanism for guests to |
---|
1383 | receive incoming memory frames but there is no explicit transfer of |
---|
1384 | frames in the other direction. Guests are expected to return memory |
---|
1385 | to the hypervisor in order to use the network interface. They {\em |
---|
1386 | must} do this or they will exceed their maximum memory reservation and |
---|
1387 | will not be able to receive incoming frame transfers. When necessary, |
---|
1388 | the backend is able to replenish its pool of free network buffers by |
---|
1389 | claiming some of this free memory from the hypervisor. |
---|
1390 | |
---|
1391 | \section{Block I/O} |
---|
1392 | |
---|
1393 | All guest OS disk access goes through the virtual block device VBD |
---|
1394 | interface. This interface allows domains access to portions of block |
---|
1395 | storage devices visible to the block backend device. The VBD |
---|
1396 | interface is a split driver, similar to the network interface |
---|
1397 | described above. A single shared memory ring is used between the |
---|
1398 | frontend and backend drivers for each virtual device, across which |
---|
1399 | IO requests and responses are sent. |
---|
1400 | |
---|
1401 | Any block device accessible to the backend domain, including |
---|
1402 | network-based block (iSCSI, *NBD, etc), loopback and LVM/MD devices, |
---|
1403 | can be exported as a VBD. Each VBD is mapped to a device node in the |
---|
1404 | guest, specified in the guest's startup configuration. |
---|
1405 | |
---|
1406 | \subsection{Data Transfer} |
---|
1407 | |
---|
1408 | The per-(virtual)-device ring between the guest and the block backend |
---|
1409 | supports two messages: |
---|
1410 | |
---|
1411 | \begin{description} |
---|
1412 | \item [{\small {\tt READ}}:] Read data from the specified block |
---|
1413 | device. The front end identifies the device and location to read |
---|
1414 | from and attaches pages for the data to be copied to (typically via |
---|
1415 | DMA from the device). The backend acknowledges completed read |
---|
1416 | requests as they finish. |
---|
1417 | |
---|
1418 | \item [{\small {\tt WRITE}}:] Write data to the specified block |
---|
1419 | device. This functions essentially as {\small {\tt READ}}, except |
---|
1420 | that the data moves to the device instead of from it. |
---|
1421 | \end{description} |
---|
1422 | |
---|
1423 | %% Rather than copying data, the backend simply maps the domain's |
---|
1424 | %% buffers in order to enable direct DMA to them. The act of mapping |
---|
1425 | %% the buffers also increases the reference counts of the underlying |
---|
1426 | %% pages, so that the unprivileged domain cannot try to return them to |
---|
1427 | %% the hypervisor, install them as page tables, or any other unsafe |
---|
1428 | %% behaviour. |
---|
1429 | %% |
---|
1430 | %% % block API here |
---|
1431 | |
---|
1432 | \subsection{Block ring interface} |
---|
1433 | |
---|
1434 | The block interface is defined by the structures passed over the |
---|
1435 | shared memory interface. These structures are either requests (from |
---|
1436 | the frontend to the backend) or responses (from the backend to the |
---|
1437 | frontend). |
---|
1438 | |
---|
1439 | The request structure is defined as follows: |
---|
1440 | |
---|
1441 | \scriptsize |
---|
1442 | \begin{verbatim} |
---|
1443 | typedef struct blkif_request { |
---|
1444 | uint8_t operation; /* BLKIF_OP_??? */ |
---|
1445 | uint8_t nr_segments; /* number of segments */ |
---|
1446 | blkif_vdev_t handle; /* only for read/write requests */ |
---|
1447 | uint64_t id; /* private guest value, echoed in resp */ |
---|
1448 | blkif_sector_t sector_number;/* start sector idx on disk (r/w only) */ |
---|
1449 | struct blkif_request_segment { |
---|
1450 | grant_ref_t gref; /* reference to I/O buffer frame */ |
---|
1451 | /* @first_sect: first sector in frame to transfer (inclusive). */ |
---|
1452 | /* @last_sect: last sector in frame to transfer (inclusive). */ |
---|
1453 | uint8_t first_sect, last_sect; |
---|
1454 | } seg[BLKIF_MAX_SEGMENTS_PER_REQUEST]; |
---|
1455 | } blkif_request_t; |
---|
1456 | \end{verbatim} |
---|
1457 | \normalsize |
---|
1458 | |
---|
1459 | The fields are as follows: |
---|
1460 | |
---|
1461 | \begin{description} |
---|
1462 | \item[operation] operation ID: one of the operations described above |
---|
1463 | \item[nr\_segments] number of segments for scatter / gather IO |
---|
1464 | described by this request |
---|
1465 | \item[handle] identifier for a particular virtual device on this |
---|
1466 | interface |
---|
1467 | \item[id] this value is echoed in the response message for this IO; |
---|
1468 | the guest may use it to identify the original request |
---|
1469 | \item[sector\_number] start sector on the virtual device for this |
---|
1470 | request |
---|
1471 | \item[seg] This array contains structures encoding |
---|
1472 | scatter-gather IO to be performed: |
---|
1473 | \begin{description} |
---|
1474 | \item[gref] The grant reference for the foreign I/O buffer page. |
---|
1475 | \item[first\_sect] First sector to access within the buffer page (0 to 7). |
---|
1476 | \item[last\_sect] Last sector to access within the buffer page (0 to 7). |
---|
1477 | \end{description} |
---|
1478 | Data will be transferred into frames at an offset determined by the |
---|
1479 | value of {\tt first\_sect}. |
---|
1480 | \end{description} |
---|
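
To make the layout concrete, the sketch below fills in a
single-segment read request using the structure above. The function
name is hypothetical, {\tt BLKIF\_OP\_READ} is assumed to be the read
operation code from {\tt xen/include/public/io/blkif.h}, and
ring-index management and notification are omitted.

\scriptsize
\begin{verbatim}
/* Sketch: build a one-segment READ request for the block ring.
 * 'gref' grants the backend access to the destination buffer page, and
 * the request covers sectors [sector, sector + 7] of virtual device
 * 'handle', i.e. one full page of 512-byte sectors. */
static void build_read_request(blkif_request_t *req, blkif_vdev_t handle,
                               uint64_t id, blkif_sector_t sector,
                               grant_ref_t gref)
{
    req->operation     = BLKIF_OP_READ;  /* operation code described above */
    req->nr_segments   = 1;              /* single scatter-gather segment  */
    req->handle        = handle;         /* which virtual device           */
    req->id            = id;             /* echoed in the response         */
    req->sector_number = sector;         /* starting sector on the device  */

    req->seg[0].gref       = gref;       /* I/O buffer page                */
    req->seg[0].first_sect = 0;          /* use the whole page:            */
    req->seg[0].last_sect  = 7;          /*   sectors 0..7 inclusive       */
}
\end{verbatim}
\normalsize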
1481 | |
---|
1482 | \section{Virtual TPM} |
---|
1483 | |
---|
1484 | Virtual TPM (VTPM) support provides TPM functionality to each virtual |
---|
1485 | machine that requests this functionality in its configuration file. |
---|
1486 | The interface enables domains to access their own private TPM as if it |
---|
1487 | were a hardware TPM built into the machine. |
---|
1488 | |
---|
1489 | The virtual TPM interface is implemented as a split driver, |
---|
1490 | similar to the network and block interfaces described above. |
---|
1491 | The user domain hosting the frontend exports a character device /dev/tpm0 |
---|
1492 | to user-level applications for communicating with the virtual TPM. |
---|
1493 | This is the same device interface that is also offered if a hardware TPM |
---|
1494 | is available in the system. The backend provides a single interface |
---|
1495 | /dev/vtpm where the virtual TPM is waiting for commands from all domains |
---|
1496 | that have located their backend in a given domain. |
---|
1497 | |
---|
1498 | \subsection{Data Transfer} |
---|
1499 | |
---|
1500 | A single shared memory ring is used between the frontend and backend |
---|
1501 | drivers. TPM requests and responses are sent in pages where a pointer |
---|
1502 | to those pages and other information is placed into the ring such that |
---|
1503 | the backend can map the pages into its memory space using the grant |
---|
1504 | table mechanism. |
---|
1505 | |
---|
1506 | The backend driver has been implemented to only accept well-formed |
---|
1507 | TPM requests. To meet this requirement, the length indicator in the |
---|
1508 | TPM request must correctly indicate the length of the request. |
---|
1509 | Otherwise an error message is automatically sent back by the device driver. |
---|
1510 | |
---|
1511 | The virtual TPM implementation listens for TPM requests on /dev/vtpm. Since |
---|
1512 | it must be able to apply the TPM request packet to the virtual TPM instance |
---|
1513 | associated with the virtual machine, a 4-byte virtual TPM instance |
---|
1514 | identifier is prepended to each packet by the backend driver (in network |
---|
1515 | byte order) for internal routing of the request. |
---|
1516 | |
---|
1517 | \subsection{Virtual TPM ring interface} |
---|
1518 | |
---|
1519 | The TPM protocol is a strict request/response protocol and therefore |
---|
1520 | only one ring is used to send requests from the frontend to the backend |
---|
1521 | and responses on the reverse path. |
---|
1522 | |
---|
1523 | The request/response structure is defined as follows: |
---|
1524 | |
---|
1525 | \scriptsize |
---|
1526 | \begin{verbatim} |
---|
1527 | typedef struct { |
---|
1528 | unsigned long addr; /* Machine address of packet. */ |
---|
1529 | grant_ref_t ref; /* grant table access reference. */ |
---|
1530 | uint16_t unused; /* unused */ |
---|
1531 | uint16_t size; /* Packet size in bytes. */ |
---|
1532 | } tpmif_tx_request_t; |
---|
1533 | \end{verbatim} |
---|
1534 | \normalsize |
---|
1535 | |
---|
1536 | The fields are as follows: |
---|
1537 | |
---|
1538 | \begin{description} |
---|
1539 | \item[addr] The machine address of the page associated with the TPM |
---|
1540 | request/response; a request/response may span multiple |
---|
1541 | pages |
---|
1542 | \item[ref] The grant table reference associated with the address. |
---|
1543 | \item[size] The size of the remaining packet; up to |
---|
1544 | PAGE{\textunderscore}SIZE bytes can be found in the |
---|
1545 | page referenced by 'addr' |
---|
1546 | \end{description} |
---|
1547 | |
---|
1548 | The frontend initially allocates several pages whose addresses |
---|
1549 | are stored in the ring. Only these pages are used for exchange of |
---|
1550 | requests and responses. |
---|
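
For illustration, filling in one ring slot for a request that fits in
a single page might look as follows. This is only a sketch using the
structure above: the function name is hypothetical, and the page is
assumed to have been allocated and granted to the backend already.

\scriptsize
\begin{verbatim}
/* Sketch: describe one page of a TPM command in the ring. */
static void fill_tpm_request(tpmif_tx_request_t *slot, unsigned long page_ma,
                             grant_ref_t ref, uint16_t bytes_in_page)
{
    slot->addr = page_ma;          /* machine address of the packet page   */
    slot->ref  = ref;              /* grant reference for that page        */
    slot->size = bytes_in_page;    /* bytes of the request in this page    */
}
\end{verbatim}
\normalsize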
1551 | |
---|
1552 | |
---|
1553 | \chapter{Further Information} |
---|
1554 | |
---|
1555 | If you have questions that are not answered by this manual, the |
---|
1556 | sources of information listed below may be of interest to you. Note |
---|
1557 | that bug reports, suggestions and contributions related to the |
---|
1558 | software (or the documentation) should be sent to the Xen developers' |
---|
1559 | mailing list (address below). |
---|
1560 | |
---|
1561 | |
---|
1562 | \section{Other documentation} |
---|
1563 | |
---|
1564 | If you are mainly interested in using (rather than developing for) |
---|
1565 | Xen, the \emph{Xen Users' Manual} is distributed in the {\tt docs/} |
---|
1566 | directory of the Xen source distribution. |
---|
1567 | |
---|
1568 | % Various HOWTOs are also available in {\tt docs/HOWTOS}. |
---|
1569 | |
---|
1570 | |
---|
1571 | \section{Online references} |
---|
1572 | |
---|
1573 | The official Xen web site can be found at: |
---|
1574 | \begin{quote} {\tt http://www.xensource.com} |
---|
1575 | \end{quote} |
---|
1576 | |
---|
1577 | |
---|
1578 | This contains links to the latest versions of all online |
---|
1579 | documentation, including the latest version of the FAQ. |
---|
1580 | |
---|
1581 | Information regarding Xen is also available at the Xen Wiki at |
---|
1582 | \begin{quote} {\tt http://wiki.xensource.com/xenwiki/}\end{quote} |
---|
1583 | The Xen project uses Bugzilla as its bug tracking system. You'll find |
---|
1584 | the Xen Bugzilla at {\tt http://bugzilla.xensource.com/bugzilla/}. |
---|
1585 | |
---|
1586 | |
---|
1587 | \section{Mailing lists} |
---|
1588 | |
---|
1589 | There are several mailing lists that are used to discuss Xen related |
---|
1590 | topics. The most widely relevant are listed below. An official page of |
---|
1591 | mailing lists and subscription information can be found at \begin{quote} |
---|
1592 | {\tt http://lists.xensource.com/} \end{quote} |
---|
1593 | |
---|
1594 | \begin{description} |
---|
1595 | \item[xen-devel@lists.xensource.com] Used for development |
---|
1596 | discussions and bug reports. Subscribe at: \\ |
---|
1597 | {\small {\tt http://lists.xensource.com/xen-devel}} |
---|
1598 | \item[xen-users@lists.xensource.com] Used for installation and usage |
---|
1599 | discussions and requests for help. Subscribe at: \\ |
---|
1600 | {\small {\tt http://lists.xensource.com/xen-users}} |
---|
1601 | \item[xen-announce@lists.xensource.com] Used for announcements only. |
---|
1602 | Subscribe at: \\ |
---|
1603 | {\small {\tt http://lists.xensource.com/xen-announce}} |
---|
1604 | \item[xen-changelog@lists.xensource.com] Changelog feed |
---|
1605 | from the unstable and 2.0 trees - developer oriented. Subscribe at: \\ |
---|
1606 | {\small {\tt http://lists.xensource.com/xen-changelog}} |
---|
1607 | \end{description} |
---|
1608 | |
---|
1609 | \appendix |
---|
1610 | |
---|
1611 | |
---|
1612 | \chapter{Xen Hypercalls} |
---|
1613 | \label{a:hypercalls} |
---|
1614 | |
---|
1615 | Hypercalls represent the procedural interface to Xen; this appendix |
---|
1616 | categorizes and describes the current set of hypercalls. |
---|
1617 | |
---|
1618 | \section{Invoking Hypercalls} |
---|
1619 | |
---|
1620 | Hypercalls are invoked in a manner analogous to system calls in a |
---|
1621 | conventional operating system; a software interrupt is issued which |
---|
1622 | vectors to an entry point within Xen. On x86/32 machines the |
---|
1623 | instruction required is {\tt int \$0x82}; the (real) IDT is set up so |
---|
1624 | that this may only be issued from within ring 1. The particular |
---|
1625 | hypercall to be invoked is contained in {\tt EAX} --- a list |
---|
1626 | mapping these values to symbolic hypercall names can be found |
---|
1627 | in {\tt xen/include/public/xen.h}. |
---|
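
As a concrete (and simplified) illustration, the stub below shows how
a paravirtualized guest kernel might issue a two-argument hypercall on
x86/32 using GCC inline assembly. The function name is hypothetical,
and the register convention shown (arguments in {\tt EBX} and {\tt
ECX}, result returned in {\tt EAX}) follows the wrapper macros in the
XenLinux sparse tree; real guests should use those macros rather than
open-coding the trap.

\scriptsize
\begin{verbatim}
/* Minimal sketch of a hypercall stub -- not the official macros. */
static inline long xen_hypercall2(unsigned int op,
                                  unsigned long arg1,
                                  unsigned long arg2)
{
    long result;
    asm volatile ("int $0x82"          /* trap into Xen (ring 1 only)    */
                  : "=a" (result)      /* return value comes back in EAX */
                  : "a" (op),          /* hypercall number in EAX        */
                    "b" (arg1),        /* first argument in EBX          */
                    "c" (arg2)         /* second argument in ECX         */
                  : "memory");
    return result;
}

/* Example use: yield the CPU via sched_op(SCHEDOP_yield, 0).           */
/*   long ret = xen_hypercall2(__HYPERVISOR_sched_op, SCHEDOP_yield, 0); */
\end{verbatim}
\normalsize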
1628 | |
---|
1629 | On some occasions a set of hypercalls will be required to carry |
---|
1630 | out a higher-level function; a good example is when a guest |
---|
1631 | operating system wishes to context switch to a new process which |
---|
1632 | requires updating various privileged CPU state. As an optimization |
---|
1633 | for these cases, there is a generic mechanism to issue a set of |
---|
1634 | hypercalls as a batch: |
---|
1635 | |
---|
1636 | \begin{quote} |
---|
1637 | \hypercall{multicall(void *call\_list, int nr\_calls)} |
---|
1638 | |
---|
1639 | Execute a series of hypervisor calls; {\tt nr\_calls} is the length of |
---|
1640 | the array of {\tt multicall\_entry\_t} structures pointed to by {\tt |
---|
1641 | call\_list}. Each entry contains the hypercall operation code followed |
---|
1642 | by up to 7 word-sized arguments. |
---|
1643 | \end{quote} |
---|
1644 | |
---|
1645 | Note that multicalls are provided purely as an optimization; there is |
---|
1646 | no requirement to use them when first porting a guest operating |
---|
1647 | system. |
---|
1648 | |
---|
1649 | |
---|
1650 | \section{Virtual CPU Setup} |
---|
1651 | |
---|
1652 | At start of day, a guest operating system needs to set up the virtual |
---|
1653 | CPU it is executing on. This includes installing vectors for the |
---|
1654 | virtual IDT so that the guest OS can handle interrupts, page faults, |
---|
1655 | etc. However, the very first thing a guest OS must set up is a pair |
---|
1656 | of hypervisor callbacks: these are the entry points which Xen will |
---|
1657 | use when it wishes to notify the guest OS of an occurrence. |
---|
1658 | |
---|
1659 | \begin{quote} |
---|
1660 | \hypercall{set\_callbacks(unsigned long event\_selector, unsigned long |
---|
1661 | event\_address, unsigned long failsafe\_selector, unsigned long |
---|
1662 | failsafe\_address) } |
---|
1663 | |
---|
1664 | Register the normal (``event'') and failsafe callbacks for |
---|
1665 | event processing. In each case the code segment selector and |
---|
1666 | address within that segment are provided. The selectors must |
---|
1667 | have RPL 1; in XenLinux we simply use the kernel's CS for both |
---|
1668 | {\bf event\_selector} and {\bf failsafe\_selector}. |
---|
1669 | |
---|
1670 | The value {\bf event\_address} specifies the address of the guest OS's |
---|
1671 | event handling and dispatch routine; the {\bf failsafe\_address} |
---|
1672 | specifies a separate entry point which is used only if a fault occurs |
---|
1673 | when Xen attempts to use the normal callback. |
---|
1674 | |
---|
1675 | \end{quote} |
---|
1676 | |
---|
1677 | On x86/64 systems the hypercall takes slightly different |
---|
1678 | arguments. This is because callback CS does not need to be specified |
---|
1679 | (since the callbacks are entered via SYSRET), and also because an |
---|
1680 | entry address needs to be specified for SYSCALLs from guest user |
---|
1681 | space: |
---|
1682 | |
---|
1683 | \begin{quote} |
---|
1684 | \hypercall{set\_callbacks(unsigned long event\_address, unsigned long |
---|
1685 | failsafe\_address, unsigned long syscall\_address)} |
---|
1686 | \end{quote} |
---|
1687 | |
---|
1688 | |
---|
1689 | After installing the hypervisor callbacks, the guest OS can |
---|
1690 | install a `virtual IDT' by using the following hypercall: |
---|
1691 | |
---|
1692 | \begin{quote} |
---|
1693 | \hypercall{set\_trap\_table(trap\_info\_t *table)} |
---|
1694 | |
---|
1695 | Install one or more entries into the per-domain |
---|
1696 | trap handler table (essentially a software version of the IDT). |
---|
1697 | Each entry in the array pointed to by {\bf table} includes the |
---|
1698 | exception vector number with the corresponding segment selector |
---|
1699 | and entry point. Most guest OSes can use the same handlers on |
---|
1700 | Xen as when running on the real hardware. |
---|
1701 | |
---|
1702 | |
---|
1703 | \end{quote} |
---|
1704 | |
---|
1705 | A further hypercall is provided for the management of virtual CPUs: |
---|
1706 | |
---|
1707 | \begin{quote} |
---|
1708 | \hypercall{vcpu\_op(int cmd, int vcpuid, void *extra\_args)} |
---|
1709 | |
---|
1710 | This hypercall can be used to bootstrap VCPUs, to bring them up and |
---|
1711 | down and to test their current status. |
---|
1712 | |
---|
1713 | \end{quote} |
---|
1714 | |
---|
1715 | \section{Scheduling and Timer} |
---|
1716 | |
---|
1717 | Domains are preemptively scheduled by Xen according to the |
---|
1718 | parameters installed by domain 0 (see Section~\ref{s:dom0ops}). |
---|
1719 | In addition, however, a domain may choose to explicitly |
---|
1720 | control certain behavior with the following hypercall: |
---|
1721 | |
---|
1722 | \begin{quote} |
---|
1723 | \hypercall{sched\_op\_new(int cmd, void *extra\_args)} |
---|
1724 | |
---|
1725 | Request scheduling operation from hypervisor. The following |
---|
1726 | sub-commands are available: |
---|
1727 | |
---|
1728 | \begin{description} |
---|
1729 | \item[SCHEDOP\_yield] voluntarily yields the CPU, but leaves the |
---|
1730 | caller marked as runnable. No extra arguments are passed to this |
---|
1731 | command. |
---|
1732 | \item[SCHEDOP\_block] removes the calling domain from the run queue |
---|
1733 | and causes it to sleep until an event is delivered to it. No extra |
---|
1734 | arguments are passed to this command. |
---|
1735 | \item[SCHEDOP\_shutdown] is used to end the calling domain's |
---|
1736 | execution. The extra argument is a {\bf sched\_shutdown} structure |
---|
1737 | which indicates the reason why the domain suspended (e.g., for reboot, |
---|
1738 | halt, power-off). |
---|
1739 | \item[SCHEDOP\_poll] allows a VCPU to wait on a set of event channels |
---|
1740 | with an optional timeout (all of which are specified in the {\bf |
---|
1741 | sched\_poll} extra argument). The semantics are similar to the UNIX |
---|
1742 | {\bf poll} system call. The caller must have event-channel upcalls |
---|
1743 | masked when executing this command. |
---|
1744 | \end{description} |
---|
1745 | \end{quote} |
---|
1746 | |
---|
1747 | {\bf sched\_op\_new} was not available prior to Xen 3.0.2. Older versions |
---|
1748 | provide only the following hypercall: |
---|
1749 | |
---|
1750 | \begin{quote} |
---|
1751 | \hypercall{sched\_op(int cmd, unsigned long extra\_arg)} |
---|
1752 | |
---|
1753 | This hypercall supports the following subset of {\bf sched\_op\_new} commands: |
---|
1754 | |
---|
1755 | \begin{description} |
---|
1756 | \item[SCHEDOP\_yield] (extra argument is 0). |
---|
1757 | \item[SCHEDOP\_block] (extra argument is 0). |
---|
1758 | \item[SCHEDOP\_shutdown] (extra argument is numeric reason code). |
---|
1759 | \end{description} |
---|
1760 | \end{quote} |
---|
1761 | |
---|
1762 | To aid the implementation of a process scheduler within a guest OS, |
---|
1763 | Xen provides a virtual programmable timer: |
---|
1764 | |
---|
1765 | \begin{quote} |
---|
1766 | \hypercall{set\_timer\_op(uint64\_t timeout)} |
---|
1767 | |
---|
1768 | Request a timer event to be sent at the specified system time (time |
---|
1769 | in nanoseconds since system boot). |
---|
1770 | |
---|
1771 | \end{quote} |
---|
1772 | |
---|
1773 | Note that calling {\bf set\_timer\_op} prior to {\bf sched\_op} |
---|
1774 | allows block-with-timeout semantics. |
---|
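
For example, a guest can block for at most a given interval roughly as
follows. This is only a sketch: the {\tt HYPERVISOR\_*} wrapper names
follow the XenLinux convention and are assumptions, and the current
system time is taken as an argument rather than read from the shared
info page.

\scriptsize
\begin{verbatim}
extern long HYPERVISOR_set_timer_op(uint64_t timeout);      /* illustrative   */
extern long HYPERVISOR_sched_op(int cmd, unsigned long arg); /* wrapper names */

/* Sketch: sleep until either an event is delivered or 'timeout_ns'
 * nanoseconds (of system time) have elapsed. */
static void block_with_timeout(uint64_t now_ns, uint64_t timeout_ns)
{
    HYPERVISOR_set_timer_op(now_ns + timeout_ns);  /* arm the one-shot timer */
    HYPERVISOR_sched_op(SCHEDOP_block, 0);         /* block until next event */
}
\end{verbatim}
\normalsize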
1775 | |
---|
1776 | |
---|
1777 | \section{Page Table Management} |
---|
1778 | |
---|
1779 | Since guest operating systems have read-only access to their page |
---|
1780 | tables, Xen must be involved when making any changes. The following |
---|
1781 | multi-purpose hypercall can be used to modify page-table entries, |
---|
1782 | update the machine-to-physical mapping table, flush the TLB, install |
---|
1783 | a new page-table base pointer, and more. |
---|
1784 | |
---|
1785 | \begin{quote} |
---|
1786 | \hypercall{mmu\_update(mmu\_update\_t *req, int count, int *success\_count)} |
---|
1787 | |
---|
1788 | Update the page table for the domain; a set of {\bf count} updates are |
---|
1789 | submitted for processing in a batch, with {\bf success\_count} being |
---|
1790 | updated to report the number of successful updates. |
---|
1791 | |
---|
1792 | Each element of {\bf req[]} contains a pointer (address) and value; |
---|
1793 | the least significant 2-bits of the pointer are used to distinguish |
---|
1794 | the type of update requested as follows: |
---|
1795 | \begin{description} |
---|
1796 | |
---|
1797 | \item[MMU\_NORMAL\_PT\_UPDATE:] update a page directory entry or |
---|
1798 | page table entry to the associated value; Xen will check that the |
---|
1799 | update is safe, as described in Chapter~\ref{c:memory}. |
---|
1800 | |
---|
1801 | \item[MMU\_MACHPHYS\_UPDATE:] update an entry in the |
---|
1802 | machine-to-physical table. The calling domain must own the machine |
---|
1803 | page in question (or be privileged). |
---|
1804 | \end{description} |
---|
1805 | |
---|
1806 | \end{quote} |
---|
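
For example, two page-table entries can be updated in a single batch
as shown below. This is a sketch: the two-field layout of {\bf
mmu\_update\_t} simply follows the pointer/value description above,
and the {\tt HYPERVISOR\_mmu\_update} wrapper name is illustrative.

\scriptsize
\begin{verbatim}
#include <stdint.h>

typedef struct { uint64_t ptr, val; } mmu_update_t;   /* pointer + value, as above */

extern int HYPERVISOR_mmu_update(mmu_update_t *req, int count,
                                 int *success_count); /* illustrative wrapper */

/* Sketch: install new values into two PTEs with one hypercall.
 * 'pte_ma' holds the machine addresses of the PTEs to modify; the low
 * two bits of each pointer select the request type. */
static int update_two_ptes(const uint64_t pte_ma[2], const uint64_t new_val[2])
{
    mmu_update_t req[2];
    int done = 0;

    req[0].ptr = pte_ma[0] | MMU_NORMAL_PT_UPDATE;
    req[0].val = new_val[0];
    req[1].ptr = pte_ma[1] | MMU_NORMAL_PT_UPDATE;
    req[1].val = new_val[1];

    return HYPERVISOR_mmu_update(req, 2, &done);      /* 'done' reports successes */
}
\end{verbatim}
\normalsize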
1807 | |
---|
1808 | Explicitly updating batches of page table entries is extremely |
---|
1809 | efficient, but can require a number of alterations to the guest |
---|
1810 | OS. Using the writable page table mode (Chapter~\ref{c:memory}) is |
---|
1811 | recommended for new OS ports. |
---|
1812 | |
---|
1813 | Regardless of which page table update mode is being used, however, |
---|
1814 | there are some occasions (notably handling a demand page fault) where |
---|
1815 | a guest OS will wish to modify exactly one PTE rather than a |
---|
1816 | batch, and where that PTE is mapped into the current address space. |
---|
1817 | This is catered for by the following: |
---|
1818 | |
---|
1819 | \begin{quote} |
---|
1820 | \hypercall{update\_va\_mapping(unsigned long va, uint64\_t val, |
---|
1821 | unsigned long flags)} |
---|
1822 | |
---|
1823 | Update the currently installed PTE that maps virtual address {\bf va} |
---|
1824 | to new value {\bf val}. As with {\bf mmu\_update}, Xen checks the |
---|
1825 | modification is safe before applying it. The {\bf flags} determine |
---|
1826 | which kind of TLB flush, if any, should follow the update. |
---|
1827 | |
---|
1828 | \end{quote} |
---|
1829 | |
---|
1830 | Finally, sufficiently privileged domains may occasionally wish to manipulate |
---|
1831 | the pages of others: |
---|
1832 | |
---|
1833 | \begin{quote} |
---|
1834 | \hypercall{update\_va\_mapping(unsigned long va, uint64\_t val, |
---|
1835 | unsigned long flags, domid\_t domid)} |
---|
1836 | |
---|
1837 | Identical to {\bf update\_va\_mapping} save that the pages being |
---|
1838 | mapped must belong to the domain {\bf domid}. |
---|
1839 | |
---|
1840 | \end{quote} |
---|
1841 | |
---|
1842 | An additional MMU hypercall provides an ``extended command'' |
---|
1843 | interface. This provides additional functionality beyond the basic |
---|
1844 | table updating commands: |
---|
1845 | |
---|
1846 | \begin{quote} |
---|
1847 | |
---|
1848 | \hypercall{mmuext\_op(struct mmuext\_op *op, int count, int *success\_count, domid\_t domid)} |
---|
1849 | |
---|
1850 | This hypercall is used to perform additional MMU operations. These |
---|
1851 | include updating {\tt cr3} (or just re-installing it for a TLB flush), |
---|
1852 | requesting various kinds of TLB flush, flushing the cache, installing |
---|
1853 | a new LDT, or pinning \& unpinning page-table pages (to ensure their |
---|
1854 | reference count doesn't drop to zero which would require a |
---|
1855 | revalidation of all entries). Some of the operations available are |
---|
1856 | restricted to domains with sufficient system privileges. |
---|
1857 | |
---|
1858 | It is also possible for privileged domains to reassign page ownership |
---|
1859 | via an extended MMU operation, although grant tables are used instead |
---|
1860 | of this where possible; see Section~\ref{s:idc}. |
---|
1861 | |
---|
1862 | \end{quote} |
---|
1863 | |
---|
1864 | Finally, a hypercall interface is exposed to activate and deactivate |
---|
1865 | various optional facilities provided by Xen for memory management. |
---|
1866 | |
---|
1867 | \begin{quote} |
---|
1868 | \hypercall{vm\_assist(unsigned int cmd, unsigned int type)} |
---|
1869 | |
---|
1870 | Toggle various memory management modes (in particular writable page |
---|
1871 | tables). |
---|
1872 | |
---|
1873 | \end{quote} |
---|
1874 | |
---|
1875 | \section{Segmentation Support} |
---|
1876 | |
---|
1877 | Xen allows guest OSes to install a custom GDT if they require it; |
---|
1878 | this is context switched transparently whenever a domain is |
---|
1879 | [de]scheduled. The following hypercall is effectively a |
---|
1880 | `safe' version of {\tt lgdt}: |
---|
1881 | |
---|
1882 | \begin{quote} |
---|
1883 | \hypercall{set\_gdt(unsigned long *frame\_list, int entries)} |
---|
1884 | |
---|
1885 | Install a global descriptor table for a domain; {\bf frame\_list} is |
---|
1886 | an array of up to 16 machine page frames within which the GDT resides, |
---|
1887 | with {\bf entries} being the actual number of descriptor-entry |
---|
1888 | slots. All page frames must be mapped read-only within the guest's |
---|
1889 | address space, and the table must be large enough to contain Xen's |
---|
1890 | reserved entries (see {\bf xen/include/public/arch-x86\_32.h}). |
---|
1891 | |
---|
1892 | \end{quote} |
---|
1893 | |
---|
1894 | Many guest OSes will also wish to install LDTs; this is achieved by |
---|
1895 | using {\bf mmu\_update} with an extended command, passing the |
---|
1896 | linear address of the LDT base along with the number of entries. No |
---|
1897 | special safety checks are required; Xen needs to perform this task |
---|
1898 | simply because {\tt lldt} requires CPL 0. |
---|
1899 | |
---|
1900 | |
---|
1901 | Xen also allows guest operating systems to update just an |
---|
1902 | individual segment descriptor in the GDT or LDT: |
---|
1903 | |
---|
1904 | \begin{quote} |
---|
1905 | \hypercall{update\_descriptor(uint64\_t ma, uint64\_t desc)} |
---|
1906 | |
---|
1907 | Update the GDT/LDT entry at machine address {\bf ma}; the new |
---|
1908 | 8-byte descriptor is stored in {\bf desc}. |
---|
1909 | Xen performs a number of checks to ensure the descriptor is |
---|
1910 | valid. |
---|
1911 | |
---|
1912 | \end{quote} |
---|
1913 | |
---|
1914 | Guest OSes can use the above in place of context switching entire |
---|
1915 | LDTs (or the GDT) when the number of changing descriptors is small. |
---|
1916 | |
---|
1917 | \section{Context Switching} |
---|
1918 | |
---|
1919 | When a guest OS wishes to context switch between two processes, |
---|
1920 | it can use the page table and segmentation hypercalls described |
---|
1921 | above to perform the bulk of the privileged work. In addition, |
---|
1922 | however, it will need to invoke Xen to switch the kernel (ring 1) |
---|
1923 | stack pointer: |
---|
1924 | |
---|
1925 | \begin{quote} |
---|
1926 | \hypercall{stack\_switch(unsigned long ss, unsigned long esp)} |
---|
1927 | |
---|
1928 | Request kernel stack switch from hypervisor; {\bf ss} is the new |
---|
1929 | stack segment, and {\bf esp} is the new stack pointer. |
---|
1930 | |
---|
1931 | \end{quote} |
---|
1932 | |
---|
1933 | A useful hypercall for context switching allows ``lazy'' save and |
---|
1934 | restore of floating point state: |
---|
1935 | |
---|
1936 | \begin{quote} |
---|
1937 | \hypercall{fpu\_taskswitch(int set)} |
---|
1938 | |
---|
1939 | This call instructs Xen to set the {\tt TS} bit in the {\tt cr0} |
---|
1940 | control register; this means that the next attempt to use floating |
---|
1941 | point will cause a trap which the guest OS can catch. Typically it will |
---|
1942 | then save/restore the FP state, and clear the {\tt TS} bit, using the |
---|
1943 | same call. |
---|
1944 | \end{quote} |
---|
1945 | |
---|
1946 | This is provided as an optimization only; guest OSes can also choose |
---|
1947 | to save and restore FP state on all context switches for simplicity. |
---|
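
A sketch of the lazy scheme described above is given below; the
{\tt HYPERVISOR\_*} wrapper and the save/restore helper names are
illustrative only.

\scriptsize
\begin{verbatim}
/* Sketch of lazy FPU switching: defer saving/restoring FP state until
 * the newly scheduled process actually touches the FPU. */
extern void HYPERVISOR_fpu_taskswitch(int set);  /* illustrative wrapper */
extern void fpu_save(void *ctx);                 /* hypothetical helpers */
extern void fpu_restore(void *ctx);

static void *prev_fpu_ctx, *next_fpu_ctx;

/* Called on every context switch: just set TS and return. */
static void switch_fpu_lazy(void *prev, void *next)
{
    prev_fpu_ctx = prev;
    next_fpu_ctx = next;
    HYPERVISOR_fpu_taskswitch(1);          /* next FP use will trap */
}

/* Called from the device-not-available trap handler. */
static void fpu_trap_handler(void)
{
    fpu_save(prev_fpu_ctx);
    fpu_restore(next_fpu_ctx);
    HYPERVISOR_fpu_taskswitch(0);          /* clear TS: FP usable again */
}
\end{verbatim}
\normalsize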
1948 | |
---|
1949 | Finally, a hypercall is provided for entering vm86 mode: |
---|
1950 | |
---|
1951 | \begin{quote} |
---|
1952 | \hypercall{switch\_vm86} |
---|
1953 | |
---|
1954 | This allows the guest to run code in vm86 mode, which is needed for |
---|
1955 | some legacy software. |
---|
1956 | \end{quote} |
---|
1957 | |
---|
1958 | \section{Physical Memory Management} |
---|
1959 | |
---|
1960 | As mentioned previously, each domain has a maximum and current |
---|
1961 | memory allocation. The maximum allocation, set at domain creation |
---|
1962 | time, cannot be modified. However a domain can choose to reduce |
---|
1963 | and subsequently grow its current allocation by using the |
---|
1964 | following call: |
---|
1965 | |
---|
1966 | \begin{quote} |
---|
1967 | \hypercall{memory\_op(unsigned int op, void *arg)} |
---|
1968 | |
---|
1969 | Increase or decrease current memory allocation (as determined by |
---|
1970 | the value of {\bf op}). The available operations are: |
---|
1971 | |
---|
1972 | \begin{description} |
---|
1973 | \item[XENMEM\_increase\_reservation] Request an increase in machine |
---|
1974 | memory allocation; {\bf arg} must point to a {\bf |
---|
1975 | xen\_memory\_reservation} structure. |
---|
1976 | \item[XENMEM\_decrease\_reservation] Request a decrease in machine |
---|
1977 | memory allocation; {\bf arg} must point to a {\bf |
---|
1978 | xen\_memory\_reservation} structure. |
---|
1979 | \item[XENMEM\_maximum\_ram\_page] Request the frame number of the |
---|
1980 | highest-addressed frame of machine memory in the system. {\bf arg} |
---|
1981 | must point to an {\bf unsigned long} where this value will be |
---|
1982 | stored. |
---|
1983 | \item[XENMEM\_current\_reservation] Returns current memory reservation |
---|
1984 | of the specified domain. |
---|
1985 | \item[XENMEM\_maximum\_reservation] Returns maximum memory reservation |
---|
1986 | of the specified domain. |
---|
1987 | \end{description} |
---|
1988 | |
---|
1989 | \end{quote} |
---|
1990 | |
---|
1991 | In addition to simply reducing or increasing the current memory |
---|
1992 | allocation via a `balloon driver', this call is also useful for |
---|
1993 | obtaining contiguous regions of machine memory when required (e.g. |
---|
1994 | for certain PCI devices, or if using superpages). |
---|
1995 | |
---|
1996 | |
---|
1997 | \section{Inter-Domain Communication} |
---|
1998 | \label{s:idc} |
---|
1999 | |
---|
2000 | Xen provides a simple asynchronous notification mechanism via |
---|
2001 | \emph{event channels}. Each domain has a set of end-points (or |
---|
2002 | \emph{ports}) which may be bound to an event source (e.g. a physical |
---|
2003 | IRQ, a virtual IRQ, or a port in another domain). When a pair of |
---|
2004 | end-points in two different domains are bound together, then a `send' |
---|
2005 | operation on one will cause an event to be received by the destination |
---|
2006 | domain. |
---|
2007 | |
---|
2008 | The control and use of event channels involves the following hypercall: |
---|
2009 | |
---|
2010 | \begin{quote} |
---|
2011 | \hypercall{event\_channel\_op(evtchn\_op\_t *op)} |
---|
2012 | |
---|
2013 | Inter-domain event-channel management; {\bf op} is a discriminated |
---|
2014 | union which allows the following 7 operations: |
---|
2015 | |
---|
2016 | \begin{description} |
---|
2017 | |
---|
2018 | \item[alloc\_unbound:] allocate a free (unbound) local |
---|
2019 | port and prepare for connection from a specified domain. |
---|
2020 | \item[bind\_virq:] bind a local port to a virtual |
---|
2021 | IRQ; any particular VIRQ can be bound to at most one port per domain. |
---|
2022 | \item[bind\_pirq:] bind a local port to a physical IRQ; |
---|
2023 | once more, a given pIRQ can be bound to at most one port per |
---|
2024 | domain. Furthermore the calling domain must be sufficiently |
---|
2025 | privileged. |
---|
2026 | \item[bind\_interdomain:] construct an interdomain event |
---|
2027 | channel; in general, the target domain must have previously allocated |
---|
2028 | an unbound port for this channel, although this can be bypassed by |
---|
2029 | privileged domains during domain setup. |
---|
2030 | \item[close:] close an interdomain event channel. |
---|
2031 | \item[send:] send an event to the remote end of an |
---|
2032 | interdomain event channel. |
---|
2033 | \item[status:] determine the current status of a local port. |
---|
2034 | \end{description} |
---|
2035 | |
---|
2036 | For more details see |
---|
2037 | {\bf xen/include/public/event\_channel.h}. |
---|
2038 | |
---|
2039 | \end{quote} |
---|
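
The sketch below shows how one side might allocate an unbound port and
later notify its peer. The field and constant names are taken from
{\bf xen/include/public/event\_channel.h} but should be checked
against the headers, and the {\tt HYPERVISOR\_event\_channel\_op}
wrapper name is illustrative.

\scriptsize
\begin{verbatim}
extern int HYPERVISOR_event_channel_op(evtchn_op_t *op);  /* illustrative wrapper */

/* Sketch: allocate an unbound local port that domain 'remote' may later
 * bind to (the port is typically advertised to the peer, e.g. via the
 * store).  Returns the port number, or -1 on failure. */
static int alloc_port_for(domid_t remote)
{
    evtchn_op_t op = { .cmd = EVTCHNOP_alloc_unbound };
    op.u.alloc_unbound.dom        = DOMID_SELF;
    op.u.alloc_unbound.remote_dom = remote;
    if (HYPERVISOR_event_channel_op(&op) != 0)
        return -1;
    return op.u.alloc_unbound.port;
}

/* Sketch: send a notification over an established channel. */
static void notify_remote(evtchn_port_t local_port)
{
    evtchn_op_t op = { .cmd = EVTCHNOP_send };
    op.u.send.port = local_port;
    HYPERVISOR_event_channel_op(&op);
}
\end{verbatim}
\normalsize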
2040 | |
---|
2041 | Event channels are the fundamental communication primitive between |
---|
2042 | Xen domains and seamlessly support SMP. However they provide little |
---|
2043 | bandwidth for communication {\sl per se}, and hence are typically |
---|
2044 | married with a piece of shared memory to produce effective and |
---|
2045 | high-performance inter-domain communication. |
---|
2046 | |
---|
2047 | Safe sharing of memory pages between guest OSes is carried out by |
---|
2048 | granting access on a per page basis to individual domains. This is |
---|
2049 | achieved by using the {\tt grant\_table\_op} hypercall. |
---|
2050 | |
---|
2051 | \begin{quote} |
---|
2052 | \hypercall{grant\_table\_op(unsigned int cmd, void *uop, unsigned int count)} |
---|
2053 | |
---|
2054 | Used to invoke operations on a grant reference, to setup the grant |
---|
2055 | table and to dump the tables' contents for debugging. |
---|
2056 | |
---|
2057 | \end{quote} |
---|
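
In practice a guest grants access to one of its frames by writing an
entry in its grant table directly (the table itself having been set up
with {\tt grant\_table\_op}); the sketch below illustrates this. The
{\tt grant\_entry\_t} layout and the {\tt GTF\_*} flag names follow
{\bf xen/include/public/grant\_table.h} but should be checked against
the headers.

\scriptsize
\begin{verbatim}
/* Sketch: grant domain 'remote' access to machine frame 'frame' via
 * grant reference 'ref' in our grant table 'gnttab'. */
static void grant_frame(grant_entry_t *gnttab, grant_ref_t ref,
                        domid_t remote, unsigned long frame, int readonly)
{
    gnttab[ref].domid = remote;                 /* who may map the frame    */
    gnttab[ref].frame = frame;                  /* which machine frame      */
    /* A write barrier belongs here: flags must be written last, since   */
    /* setting GTF_permit_access makes the entry live.                    */
    gnttab[ref].flags = GTF_permit_access |
                        (readonly ? GTF_readonly : 0);
}
\end{verbatim}
\normalsize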
2058 | |
---|
2059 | \section{IO Configuration} |
---|
2060 | |
---|
2061 | Domains with physical device access (i.e.\ driver domains) receive |
---|
2062 | limited access to certain PCI devices (bus address space and |
---|
2063 | interrupts). However, many guest operating systems attempt to |
---|
2064 | determine the PCI configuration by directly accessing the PCI BIOS, |
---|
2065 | which cannot be allowed for safety. |
---|
2066 | |
---|
2067 | Instead, Xen provides the following hypercall: |
---|
2068 | |
---|
2069 | \begin{quote} |
---|
2070 | \hypercall{physdev\_op(void *physdev\_op)} |
---|
2071 | |
---|
2072 | Set and query IRQ configuration details, set the system IOPL, set the |
---|
2073 | TSS IO bitmap. |
---|
2074 | |
---|
2075 | \end{quote} |
---|
2076 | |
---|
2077 | |
---|
2078 | For examples of using {\tt physdev\_op}, see the |
---|
2079 | Xen-specific PCI code in the Linux sparse tree. |
---|
2080 | |
---|
2081 | \section{Administrative Operations} |
---|
2082 | \label{s:dom0ops} |
---|
2083 | |
---|
2084 | A large number of control operations are available to a sufficiently |
---|
2085 | privileged domain (typically domain 0). These allow the creation and |
---|
2086 | management of new domains, for example. A complete list is given |
---|
2087 | below: for more details on any or all of these, please see |
---|
2088 | {\tt xen/include/public/dom0\_ops.h} |
---|
2089 | |
---|
2090 | |
---|
2091 | \begin{quote} |
---|
2092 | \hypercall{dom0\_op(dom0\_op\_t *op)} |
---|
2093 | |
---|
2094 | Administrative domain operations for domain management. The options are: |
---|
2095 | |
---|
2096 | \begin{description} |
---|
2097 | \item [DOM0\_GETMEMLIST:] get list of pages used by the domain |
---|
2098 | |
---|
2099 | \item [DOM0\_SCHEDCTL:] |
---|
2100 | |
---|
2101 | \item [DOM0\_ADJUSTDOM:] adjust scheduling priorities for domain |
---|
2102 | |
---|
2103 | \item [DOM0\_CREATEDOMAIN:] create a new domain |
---|
2104 | |
---|
2105 | \item [DOM0\_DESTROYDOMAIN:] deallocate all resources associated |
---|
2106 | with a domain |
---|
2107 | |
---|
2108 | \item [DOM0\_PAUSEDOMAIN:] remove a domain from the scheduler run |
---|
2109 | queue. |
---|
2110 | |
---|
2111 | \item [DOM0\_UNPAUSEDOMAIN:] mark a paused domain as schedulable |
---|
2112 | once again. |
---|
2113 | |
---|
2114 | \item [DOM0\_GETDOMAININFO:] get statistics about the domain |
---|
2115 | |
---|
2116 | \item [DOM0\_SETDOMAININFO:] set VCPU-related attributes |
---|
2117 | |
---|
2118 | \item [DOM0\_MSR:] read or write model specific registers |
---|
2119 | |
---|
2120 | \item [DOM0\_DEBUG:] interactively invoke the debugger |
---|
2121 | |
---|
2122 | \item [DOM0\_SETTIME:] set system time |
---|
2123 | |
---|
2124 | \item [DOM0\_GETPAGEFRAMEINFO:] |
---|
2125 | |
---|
2126 | \item [DOM0\_READCONSOLE:] read console content from hypervisor buffer ring |
---|
2127 | |
---|
2128 | \item [DOM0\_PINCPUDOMAIN:] pin domain to a particular CPU |
---|
2129 | |
---|
2130 | \item [DOM0\_TBUFCONTROL:] get and set trace buffer attributes |
---|
2131 | |
---|
2132 | \item [DOM0\_PHYSINFO:] get information about the host machine |
---|
2133 | |
---|
2134 | \item [DOM0\_SCHED\_ID:] get the ID of the current Xen scheduler |
---|
2135 | |
---|
2136 | \item [DOM0\_SHADOW\_CONTROL:] switch between shadow page-table modes |
---|
2137 | |
---|
2138 | \item [DOM0\_SETDOMAINMAXMEM:] set maximum memory allocation of a domain |
---|
2139 | |
---|
2140 | \item [DOM0\_GETPAGEFRAMEINFO2:] batched interface for getting |
---|
2141 | page frame info |
---|
2142 | |
---|
2143 | \item [DOM0\_ADD\_MEMTYPE:] set MTRRs |
---|
2144 | |
---|
2145 | \item [DOM0\_DEL\_MEMTYPE:] remove a memory type range |
---|
2146 | |
---|
2147 | \item [DOM0\_READ\_MEMTYPE:] read MTRR |
---|
2148 | |
---|
2149 | \item [DOM0\_PERFCCONTROL:] control Xen's software performance |
---|
2150 | counters |
---|
2151 | |
---|
2152 | \item [DOM0\_MICROCODE:] update CPU microcode |
---|
2153 | |
---|
2154 | \item [DOM0\_IOPORT\_PERMISSION:] modify domain permissions for an |
---|
2155 | IO port range (enable / disable a range for a particular domain) |
---|
2156 | |
---|
2157 | \item [DOM0\_GETVCPUCONTEXT:] get context from a VCPU |
---|
2158 | |
---|
2159 | \item [DOM0\_GETVCPUINFO:] get current state for a VCPU |
---|
2160 | \item [DOM0\_GETDOMAININFOLIST:] batched interface to get domain |
---|
2161 | info |
---|
2162 | |
---|
2163 | \item [DOM0\_PLATFORM\_QUIRK:] inform Xen of a platform quirk it |
---|
2164 | needs to handle (e.g. noirqbalance) |
---|
2165 | |
---|
2166 | \item [DOM0\_PHYSICAL\_MEMORY\_MAP:] get info about dom0's memory |
---|
2167 | map |
---|
2168 | |
---|
2169 | \item [DOM0\_MAX\_VCPUS:] change max number of VCPUs for a domain |
---|
2170 | |
---|
2171 | \item [DOM0\_SETDOMAINHANDLE:] set the handle for a domain |
---|
2172 | |
---|
2173 | \end{description} |
---|
2174 | \end{quote} |
---|
2175 | |
---|
2176 | Most of the above are best understood by looking at the code |
---|
2177 | implementing them (in {\tt xen/common/dom0\_ops.c}) and in |
---|
2178 | the user-space tools that use them (mostly in {\tt tools/libxc}). |
---|
2179 | |
---|
2180 | \section{Access Control Module Hypercalls} |
---|
2181 | \label{s:acmops} |
---|
2182 | |
---|
2183 | Hypercalls relating to the management of the Access Control Module are |
---|
2184 | also restricted to domain 0 access for now. For more details on any or |
---|
2185 | all of these, please see {\tt xen/include/public/acm\_ops.h}. A |
---|
2186 | complete list is given below: |
---|
2187 | |
---|
2188 | \begin{quote} |
---|
2189 | |
---|
2190 | \hypercall{acm\_op(int cmd, void *args)} |
---|
2191 | |
---|
2192 | This hypercall can be used to configure the state of the ACM, query |
---|
2193 | that state, request access control decisions and dump additional |
---|
2194 | information. |
---|
2195 | |
---|
2196 | \begin{description} |
---|
2197 | |
---|
2198 | \item [ACMOP\_SETPOLICY:] set the access control policy |
---|
2199 | |
---|
2200 | \item [ACMOP\_GETPOLICY:] get the current access control policy and |
---|
2201 | status |
---|
2202 | |
---|
2203 | \item [ACMOP\_DUMPSTATS:] get current access control hook invocation |
---|
2204 | statistics |
---|
2205 | |
---|
2206 | \item [ACMOP\_GETSSID:] get security access control information for a |
---|
2207 | domain |
---|
2208 | |
---|
2209 | \item [ACMOP\_GETDECISION:] get access decision based on the currently |
---|
2210 | enforced access control policy |
---|
2211 | |
---|
2212 | \end{description} |
---|
2213 | \end{quote} |
---|
2214 | |
---|
2215 | Most of the above are best understood by looking at the code |
---|
2216 | implementing them (in {\tt xen/common/acm\_ops.c}) and in the |
---|
2217 | user-space tools that use them (mostly in {\tt tools/security} and |
---|
2218 | {\tt tools/python/xen/lowlevel/acm}). |
---|
2219 | |
---|
2220 | |
---|
2221 | \section{Debugging Hypercalls} |
---|
2222 | |
---|
2223 | A few additional hypercalls are mainly useful for debugging: |
---|
2224 | |
---|
2225 | \begin{quote} |
---|
2226 | \hypercall{console\_io(int cmd, int count, char *str)} |
---|
2227 | |
---|
2228 | Use Xen to interact with the console; operations are: |
---|
2229 | |
---|
2230 | {CONSOLEIO\_write}: Output count characters from buffer str. |
---|
2231 | |
---|
2232 | {CONSOLEIO\_read}: Input at most count characters into buffer str. |
---|
2233 | \end{quote} |
---|
2234 | |
---|
2235 | A pair of hypercalls allows access to the underlying debug registers: |
---|
2236 | \begin{quote} |
---|
2237 | \hypercall{set\_debugreg(int reg, unsigned long value)} |
---|
2238 | |
---|
2239 | Set debug register {\bf reg} to {\bf value} |
---|
2240 | |
---|
2241 | \hypercall{get\_debugreg(int reg)} |
---|
2242 | |
---|
2243 | Return the contents of the debug register {\bf reg} |
---|
2244 | \end{quote} |
---|
2245 | |
---|
2246 | And finally: |
---|
2247 | \begin{quote} |
---|
2248 | \hypercall{xen\_version(int cmd)} |
---|
2249 | |
---|
2250 | Request Xen version number. |
---|
2251 | \end{quote} |
---|
2252 | |
---|
2253 | This is useful to ensure that user-space tools are in sync |
---|
2254 | with the underlying hypervisor. |
---|
2255 | |
---|
2256 | |
---|
2257 | \end{document} |
---|