\documentclass[11pt,twoside,final,openright]{report}
\usepackage{a4,graphicx,html,setspace,times}
\usepackage{comment,parskip}
\setstretch{1.15}

% LIBRARY FUNCTIONS

\newcommand{\hypercall}[1]{\vspace{2mm}{\sf #1}}

\begin{document}

% TITLE PAGE
\pagestyle{empty}
\begin{center}
\vspace*{\fill}
\includegraphics{figs/xenlogo.eps}
\vfill
\vfill
\vfill
\begin{tabular}{l}
{\Huge \bf Interface manual} \\[4mm]
{\huge Xen v3.0 for x86} \\[80mm]

{\Large Xen is Copyright (c) 2002-2005, The Xen Team} \\[3mm]
{\Large University of Cambridge, UK} \\[20mm]
\end{tabular}
\end{center}

{\bf DISCLAIMER: This documentation is always under active development
and as such there may be mistakes and omissions --- watch out for
these and please report any you find to the developer's mailing list.
The latest version is always available on-line. Contributions of
material, suggestions and corrections are welcome. }

\vfill
\cleardoublepage

% TABLE OF CONTENTS
\pagestyle{plain}
\pagenumbering{roman}
{ \parskip 0pt plus 1pt
\tableofcontents }
\cleardoublepage

% PREPARE FOR MAIN TEXT
\pagenumbering{arabic}
\raggedbottom
\widowpenalty=10000
\clubpenalty=10000
\parindent=0pt
\parskip=5pt
\renewcommand{\topfraction}{.8}
\renewcommand{\bottomfraction}{.8}
\renewcommand{\textfraction}{.2}
\renewcommand{\floatpagefraction}{.8}
\setstretch{1.1}
\chapter{Introduction}

Xen allows the hardware resources of a machine to be virtualized and
dynamically partitioned, allowing multiple different {\em guest}
operating system images to be run simultaneously. Virtualizing the
machine in this manner provides considerable flexibility, for example
allowing different users to choose their preferred operating system
(e.g., Linux, NetBSD, or a custom operating system). Furthermore, Xen
provides secure partitioning between virtual machines (known as
{\em domains} in Xen terminology), and enables better resource
accounting and QoS isolation than can be achieved with a conventional
operating system.

Xen essentially takes a `whole machine' virtualization approach as
pioneered by IBM VM/370. However, unlike VM/370 or more recent
efforts such as VMware and Virtual PC, Xen does not attempt to
completely virtualize the underlying hardware. Instead parts of the
hosted guest operating systems are modified to work with the VMM; the
operating system is effectively ported to a new target architecture,
typically requiring changes in just the machine-dependent code. The
user-level API is unchanged, and so existing binaries and operating
system distributions work without modification.

In addition to exporting virtualized instances of CPU, memory, network
and block devices, Xen exposes a control interface to manage how these
resources are shared between the running domains. Access to the
control interface is restricted: it may only be used by one
specially-privileged VM, known as {\em domain 0}. This domain is a
required part of any Xen-based server and runs the application software
that manages the control-plane aspects of the platform. Running the
control software in {\it domain 0}, distinct from the hypervisor
itself, allows the Xen framework to separate the notions of
mechanism and policy within the system.


\chapter{Virtual Architecture}

In a Xen/x86 system, only the hypervisor runs with full processor
privileges ({\it ring 0} in the x86 four-ring model). It has full
access to the physical memory available in the system and is
responsible for allocating portions of it to running domains.

On a 32-bit x86 system, guest operating systems may use {\it rings 1},
{\it 2} and {\it 3} as they see fit. Segmentation is used to prevent
the guest OS from accessing the portion of the address space that is
reserved for Xen. We expect most guest operating systems will use
ring 1 for their own operation and place applications in ring 3.

On 64-bit systems it is not possible to protect the hypervisor from
untrusted guest code running in rings 1 and 2. Guests are therefore
restricted to run in ring 3 only. The guest kernel is protected from its
applications by context switching between the kernel and currently
running application.

In this chapter we consider the basic virtual architecture provided by
Xen: CPU state, exception and interrupt handling, and time.
Other aspects such as memory and device access are discussed in later
chapters.

\section{CPU state}

All privileged state must be handled by Xen. The guest OS has no
direct access to CR3 and is not permitted to update privileged bits in
EFLAGS. Guest OSes use \emph{hypercalls} to invoke operations in Xen;
these are analogous to system calls but occur from ring 1 to ring 0.

A list of all hypercalls is given in Appendix~\ref{a:hypercalls}.


\section{Exceptions}

A virtual IDT is provided --- a domain can submit a table of trap
handlers to Xen via the {\bf set\_trap\_table} hypercall. The
exception stack frame presented to a virtual trap handler is identical
to its native equivalent.


\section{Interrupts and events}

Interrupts are virtualized by mapping them to \emph{event channels},
which are delivered asynchronously to the target domain using a callback
supplied via the {\bf set\_callbacks} hypercall. A guest OS can map
these events onto its standard interrupt dispatch mechanisms. Xen is
responsible for determining the target domain that will handle each
physical interrupt source. For more details on the binding of event
sources to event channels, see Chapter~\ref{c:devices}.


\section{Time}

Guest operating systems need to be aware of the passage of both real
(or wallclock) time and their own `virtual time' (the time for which
they have been executing). Furthermore, Xen has a notion of time which
is used for scheduling. The following notions of time are provided:

\begin{description}
\item[Cycle counter time.]

This provides a fine-grained time reference. The cycle counter time
is used to accurately extrapolate the other time references. On SMP
machines it is currently assumed that the cycle counter time is
synchronized between CPUs. The current x86-based implementation
achieves this within inter-CPU communication latencies.

\item[System time.]

This is a 64-bit counter which holds the number of nanoseconds that
have elapsed since system boot.

\item[Wall clock time.]

This is the time of day in a Unix-style {\bf struct timeval}
(seconds and microseconds since 1 January 1970, adjusted by leap
seconds). An NTP client hosted by {\it domain 0} can keep this
value accurate.

\item[Domain virtual time.]

This progresses at the same pace as system time, but only while a
domain is executing --- it stops while a domain is de-scheduled.
Therefore the share of the CPU that a domain receives is indicated
by the rate at which its virtual time increases.

\end{description}


Xen exports timestamps for system time and wall-clock time to guest
operating systems through a shared page of memory. Xen also provides
the cycle counter time at the instant the timestamps were calculated,
and the CPU frequency in Hertz. This allows the guest to extrapolate
system and wall-clock times accurately based on the current cycle
counter time.
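As a concrete illustration, the extrapolation can be sketched in C as
below. The structure and field names are simplified stand-ins for the
values Xen exports (the real layout appears in the {\bf
vcpu\_time\_info} structure documented later in this manual), and the
naive multiply-then-divide shown here can overflow for very large cycle
deltas, so real guests use scaled fixed-point arithmetic instead:

```c
#include <stdint.h>

/* Snapshot of the values Xen exports (simplified field names). */
struct time_snapshot {
    uint64_t system_time_ns; /* system time when the snapshot was taken */
    uint64_t tsc_at_snap;    /* cycle counter when the snapshot was taken */
    uint64_t cpu_freq_hz;    /* CPU frequency in Hertz */
};

/* Extrapolate the current system time from the current cycle counter. */
uint64_t current_system_time_ns(const struct time_snapshot *s,
                                uint64_t tsc_now)
{
    uint64_t cycles = tsc_now - s->tsc_at_snap;
    return s->system_time_ns + cycles * 1000000000ULL / s->cpu_freq_hz;
}
```

The same scheme extrapolates wall-clock time from its exported base value.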

Since all time stamps need to be updated and read \emph{atomically},
a version number is also stored in the shared info page, which is
incremented before and after updating the timestamps. Thus a guest can
be sure that it read a consistent state by checking that the two version
numbers are equal and even.
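In C, a reader following this protocol looks like the sketch below. The
structure is a simplified stand-in for the shared-page fields, and the
memory barriers a real guest would need between the reads are omitted
for clarity:

```c
#include <stdint.h>

/* Simplified stand-in for the timestamp fields in the shared page. */
struct time_info {
    uint32_t version;        /* odd while an update is in progress */
    uint64_t system_time_ns; /* nanoseconds since boot */
};

/* Writer (Xen): increment the version before and after the update,
 * so it is odd for exactly the duration of the update. */
void update_time(struct time_info *t, uint64_t now_ns)
{
    t->version++;            /* odd: update in progress */
    t->system_time_ns = now_ns;
    t->version++;            /* even: update complete */
}

/* Reader (guest): retry until two reads of the version are equal
 * and even, guaranteeing a consistent snapshot. */
uint64_t read_time(const struct time_info *t)
{
    uint32_t v;
    uint64_t ns;
    do {
        v = t->version;
        ns = t->system_time_ns;
    } while (v != t->version || (v & 1));
    return ns;
}
```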

Xen includes a periodic ticker which sends a timer event to the
currently executing domain every 10ms. The Xen scheduler also sends a
timer event whenever a domain is scheduled; this allows the guest OS
to adjust for the time that has passed while it has been inactive. In
addition, Xen allows each domain to request that it receive a timer
event sent at a specified system time by using the {\bf
set\_timer\_op} hypercall. Guest OSes may use this timer to
implement timeout values when they block.


\section{Xen CPU Scheduling}

Xen offers a uniform API for CPU schedulers. It is possible to choose
from a number of schedulers at boot and it should be easy to add more.
The SEDF and Credit schedulers are part of the normal Xen
distribution. SEDF will be going away and its use should be
avoided once the credit scheduler has stabilized and become the default.
The Credit scheduler provides proportional fair shares of the
host's CPUs to the running domains. It does this while transparently
load balancing runnable VCPUs across the whole system.

\paragraph*{Note: SMP host support}
Xen has always supported SMP host systems. When using the credit scheduler,
a domain's VCPUs will be dynamically moved across physical CPUs to maximise
domain and system throughput. VCPUs can also be manually restricted to be
mapped only on a subset of the host's physical CPUs, using the pinning
mechanism.


%% More information on the characteristics and use of these schedulers
%% is available in {\bf Sched-HOWTO.txt}.


\section{Privileged operations}

Xen exports an extended interface to privileged domains (viz.\ {\it
Domain 0}). This allows such domains to build and boot other domains
on the server, and provides control interfaces for managing
scheduling, memory, networking, and block devices.

\chapter{Memory}
\label{c:memory}

Xen is responsible for managing the allocation of physical memory to
domains, and for ensuring safe use of the paging and segmentation
hardware.


\section{Memory Allocation}

As well as allocating a portion of physical memory for its own private
use, Xen also reserves a small fixed portion of every virtual address
space. This is located in the top 64MB on 32-bit systems, the top
168MB on PAE systems, and a larger portion in the middle of the
address space on 64-bit systems. Unreserved physical memory is
available for allocation to domains at a page granularity. Xen tracks
the ownership and use of each page, which allows it to enforce secure
partitioning between domains.

Each domain has a maximum and current physical memory allocation. A
guest OS may run a `balloon driver' to dynamically adjust its current
memory allocation up to its limit.


\section{Pseudo-Physical Memory}

Since physical memory is allocated and freed on a page granularity,
there is no guarantee that a domain will receive a contiguous stretch
of physical memory. However, most operating systems do not have good
support for operating in a fragmented physical address space. To aid
porting such operating systems to run on top of Xen, we make a
distinction between \emph{machine memory} and \emph{pseudo-physical
memory}.

Put simply, machine memory refers to the entire amount of memory
installed in the machine, including that reserved by Xen, in use by
various domains, or currently unallocated. We consider machine memory
to comprise a set of 4kB \emph{machine page frames} numbered
consecutively starting from 0. Machine frame numbers mean the same
within Xen or any domain.

Pseudo-physical memory, on the other hand, is a per-domain
abstraction. It allows a guest operating system to consider its memory
allocation to consist of a contiguous range of physical page frames
starting at physical frame 0, despite the fact that the underlying
machine page frames may be sparsely allocated and in any order.

To achieve this, Xen maintains a globally readable {\it
machine-to-physical} table which records the mapping from machine
page frames to pseudo-physical ones. In addition, each domain is
supplied with a {\it physical-to-machine} table which performs the
inverse mapping. Clearly the machine-to-physical table has size
proportional to the amount of RAM installed in the machine, while each
physical-to-machine table has size proportional to the memory
allocation of the given domain.

Architecture-dependent code in guest operating systems can then use
the two tables to provide the abstraction of pseudo-physical memory.
In general, only certain specialized parts of the operating system
(such as page table management) need to understand the difference
between machine and pseudo-physical addresses.
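The use of the two tables can be sketched as below. The array names and
sizes are purely illustrative (a toy machine of 16 frames, a domain of
4); in reality the machine-to-physical table is supplied read-only by
Xen rather than built by the guest:

```c
#define NR_PAGES 4  /* illustrative domain allocation of four pages */

/* Physical-to-machine: per-domain, indexed by pseudo-physical frame
 * number; the machine frames backing this domain are illustrative. */
static unsigned long phys_to_machine[NR_PAGES] = { 7, 2, 9, 4 };

/* Machine-to-physical: global, indexed by machine frame number. */
static unsigned long machine_to_phys[16];

/* Build the inverse mapping for this domain's frames (done here only
 * so the toy example is self-contained). */
void build_m2p(void)
{
    for (unsigned long pfn = 0; pfn < NR_PAGES; pfn++)
        machine_to_phys[phys_to_machine[pfn]] = pfn;
}

unsigned long pfn_to_mfn(unsigned long pfn) { return phys_to_machine[pfn]; }
unsigned long mfn_to_pfn(unsigned long mfn) { return machine_to_phys[mfn]; }
```

A guest's page-table code uses {\sf pfn\_to\_mfn()} when writing page-table
entries, and {\sf mfn\_to\_pfn()} when interpreting them.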


\section{Page Table Updates}

In the default mode of operation, Xen enforces read-only access to
page tables and requires guest operating systems to explicitly request
any modifications. Xen validates all such requests and only applies
updates that it deems safe. This is necessary to prevent domains from
adding arbitrary mappings to their page tables.

To aid validation, Xen associates a type and reference count with each
memory page. A page has one of the following mutually-exclusive types
at any point in time: page directory ({\sf PD}), page table ({\sf
PT}), local descriptor table ({\sf LDT}), global descriptor table
({\sf GDT}), or writable ({\sf RW}). Note that a guest OS may always
create readable mappings of its own memory regardless of its current
type.

%%% XXX: possibly explain more about ref count 'lifecycle' here?
This mechanism is used to maintain the invariants required for safety;
for example, a domain cannot have a writable mapping to any part of a
page table as this would require the page concerned to simultaneously
be of types {\sf PT} and {\sf RW}.

\hypercall{mmu\_update(mmu\_update\_t *req, int count, int *success\_count, domid\_t domid)}

This hypercall is used to make updates to either the domain's
pagetables or to the machine-to-physical mapping table. It supports
submitting a queue of updates, allowing batching for maximal
performance. Explicitly queuing updates using this interface will
cause any outstanding writable pagetable state to be flushed from the
system.
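The following sketch shows how a guest might assemble such a queue. The
two-word request layout, with the update type carried in the low bits
of the pointer word, follows Xen's public headers, but the submission
to the real {\bf mmu\_update} hypercall is left as a stub so the sketch
is self-contained:

```c
#include <stdint.h>

/* Update-type codes carried in the low two bits of the request pointer,
 * as in Xen's public headers. */
#define MMU_NORMAL_PT_UPDATE 0  /* update a page-table entry */
#define MMU_MACHPHYS_UPDATE  1  /* update an entry in the M2P table */

typedef struct mmu_update {
    uint64_t ptr;  /* machine address of the entry; low 2 bits = type */
    uint64_t val;  /* new contents for the entry */
} mmu_update_t;

#define MAX_BATCH 16

static mmu_update_t queue[MAX_BATCH];
static int queued;

/* Queue one page-table-entry update for later submission. */
int queue_pt_update(uint64_t entry_maddr, uint64_t new_val)
{
    if (queued == MAX_BATCH)
        return -1;  /* caller must flush first */
    queue[queued].ptr = (entry_maddr & ~3ULL) | MMU_NORMAL_PT_UPDATE;
    queue[queued].val = new_val;
    queued++;
    return 0;
}

/* Submit the queue; the real hypercall is stubbed out in this sketch. */
int flush_updates(void)
{
    /* e.g. return HYPERVISOR_mmu_update(queue, queued, NULL, DOMID_SELF); */
    queued = 0;
    return 0;
}
```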

\section{Writable Page Tables}

Xen also provides an alternative mode of operation in which guests
have the illusion that their page tables are directly writable. Of
course this is not really the case, since Xen must still validate
modifications to ensure secure partitioning. To this end, Xen traps
any write attempt to a memory page of type {\sf PT} (i.e., one that is
currently part of a page table). If such an access occurs, Xen
temporarily allows write access to that page while at the same time
\emph{disconnecting} it from the page table that is currently in use.
This allows the guest to safely make updates to the page because the
newly-updated entries cannot be used by the MMU until Xen revalidates
and reconnects the page. Reconnection occurs automatically in a
number of situations: for example, when the guest modifies a different
page-table page, when the domain is preempted, or whenever the guest
uses Xen's explicit page-table update interfaces.

Writable pagetable functionality is enabled when the guest requests
it, using a {\bf vm\_assist} hypercall. Writable pagetables do {\em
not} provide full virtualisation of the MMU, so the memory management
code of the guest still needs to be aware that it is running on Xen.
Since the guest's page tables are used directly, it must translate
pseudo-physical addresses to real machine addresses when building page
table entries. The guest may not attempt to map its own pagetables
writably, since this would violate the memory type invariants; page
tables will automatically be made writable by the hypervisor, as
necessary.

\section{Shadow Page Tables}

Finally, Xen also supports a form of \emph{shadow page tables} in
which the guest OS uses an independent copy of page tables which are
unknown to the hardware (i.e.\ which are never pointed to by {\tt
cr3}). Instead Xen propagates changes made to the guest's tables to
the real ones, and vice versa. This is useful for logging page writes
(e.g.\ for live migration or checkpointing). A full version of the
shadow page tables also allows guest OS porting with less effort.


\section{Segment Descriptor Tables}

At start of day a guest is supplied with a default GDT, which does not reside
within its own memory allocation. If the guest wishes to use segments other
than the default `flat' ring-1 and ring-3 segments that this GDT
provides, it must register a custom GDT and/or LDT with Xen, allocated
from its own memory.

The following hypercall is used to specify a new GDT:

\begin{quote}
int {\bf set\_gdt}(unsigned long *{\em frame\_list}, int {\em
entries})

\emph{frame\_list}: An array of up to 14 machine page frames within
which the GDT resides. Any frame registered as a GDT frame may only
be mapped read-only within the guest's address space (e.g., no
writable mappings, no use as a page-table page, and so on). Only 14
pages may be specified because pages 15 and 16 are reserved for
the hypervisor's GDT entries.

\emph{entries}: The number of descriptor-entry slots in the GDT.
\end{quote}
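For illustration, a guest might prepare the frame list as below. The
{\sf pfn\_to\_mfn()} translation here is a hypothetical stand-in for the
guest's physical-to-machine table lookup, and the figure of 512
descriptors per page follows from 8-byte segment descriptors in 4kB
pages:

```c
#define PAGE_SIZE      4096
#define DESCS_PER_PAGE (PAGE_SIZE / 8)  /* 8-byte segment descriptors */
#define MAX_GDT_PAGES  14

/* Illustrative stand-in for the guest's physical-to-machine lookup. */
static unsigned long pfn_to_mfn(unsigned long pfn) { return pfn + 100; }

/* Fill in the machine-frame list for a GDT of 'entries' descriptors
 * held in pages starting at pseudo-physical frame 'first_pfn'.
 * Returns the number of frames used, or -1 if the GDT is too large. */
int prepare_gdt_frames(unsigned long first_pfn, int entries,
                       unsigned long frame_list[MAX_GDT_PAGES])
{
    int pages = (entries + DESCS_PER_PAGE - 1) / DESCS_PER_PAGE;
    if (pages > MAX_GDT_PAGES)
        return -1;
    for (int i = 0; i < pages; i++)
        frame_list[i] = pfn_to_mfn(first_pfn + i);
    return pages;
}
```

The resulting {\sf frame\_list} and {\sf entries} would then be passed
to {\bf set\_gdt}.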

The LDT is updated via the generic MMU update mechanism (i.e., via the
{\bf mmu\_update} hypercall).

\section{Start of Day}

The start-of-day environment for guest operating systems is rather
different to that provided by the underlying hardware. In particular,
the processor is already executing in protected mode with paging
enabled.

{\it Domain 0} is created and booted by Xen itself. For all subsequent
domains, the analogue of the boot-loader is the {\it domain builder},
user-space software running in {\it domain 0}. The domain builder is
responsible for building the initial page tables for a domain and
loading its kernel image at the appropriate virtual address.

\section{VM assists}

Xen provides a number of ``assists'' for guest memory management.
These are available on an ``opt-in'' basis to provide commonly-used
extra functionality to a guest.

\hypercall{vm\_assist(unsigned int cmd, unsigned int type)}

The {\bf cmd} parameter describes the action to be taken, whilst the
{\bf type} parameter describes the kind of assist that is being
referred to. Available commands are as follows:

\begin{description}
\item[VMASST\_CMD\_enable] Enable a particular assist type
\item[VMASST\_CMD\_disable] Disable a particular assist type
\end{description}

And the available types are:

\begin{description}
\item[VMASST\_TYPE\_4gb\_segments] Provide emulated support for
instructions that rely on 4GB segments (such as the techniques used
by some TLS solutions).
\item[VMASST\_TYPE\_4gb\_segments\_notify] Provide a callback to the
guest if the above segment fixups are used: allows the guest to
display a warning message during boot.
\item[VMASST\_TYPE\_writable\_pagetables] Enable writable pagetable
mode --- described above.
\end{description}
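For example, a guest that wants writable pagetables might issue the
hypercall as below. The numeric values shown for the commands and types
match Xen's public headers at the time of writing, but should always be
taken from {\bf xen/include/public/xen.h} in practice; the hypercall
itself is replaced by a recording stub so the sketch is self-contained:

```c
/* Command and type values as defined in xen/include/public/xen.h. */
#define VMASST_CMD_enable               0
#define VMASST_CMD_disable              1
#define VMASST_TYPE_4gb_segments        0
#define VMASST_TYPE_4gb_segments_notify 1
#define VMASST_TYPE_writable_pagetables 2

/* Recording stub standing in for the real hypercall, so the sketch
 * can run without a hypervisor present. */
static unsigned int last_cmd, last_type;
static int HYPERVISOR_vm_assist(unsigned int cmd, unsigned int type)
{
    last_cmd = cmd;
    last_type = type;
    return 0;
}

/* Opt in to writable-pagetable mode. */
int enable_writable_pagetables(void)
{
    return HYPERVISOR_vm_assist(VMASST_CMD_enable,
                                VMASST_TYPE_writable_pagetables);
}
```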


\chapter{Xen Info Pages}

The {\bf Shared info page} is used to share various CPU-related state
between the guest OS and the hypervisor. This information includes VCPU
status, time information and event channel (virtual interrupt) state.
The {\bf Start info page} is used to pass build-time information to
the guest when it boots and when it is resumed from a suspended state.
This chapter documents the fields included in the {\bf
shared\_info\_t} and {\bf start\_info\_t} structures for use by the
guest OS.

\section{Shared info page}

The {\bf shared\_info\_t} is accessed at run time by both Xen and the
guest OS. It is used to pass information relating to the
virtual CPU and virtual machine state between the OS and the
hypervisor.

The structure is declared in {\bf xen/include/public/xen.h}:

\scriptsize
\begin{verbatim}
typedef struct shared_info {
    vcpu_info_t vcpu_info[MAX_VIRT_CPUS];

    /*
     * A domain can create "event channels" on which it can send and receive
     * asynchronous event notifications. There are three classes of event that
     * are delivered by this mechanism:
     *  1. Bi-directional inter- and intra-domain connections. Domains must
     *     arrange out-of-band to set up a connection (usually by allocating
     *     an unbound 'listener' port and advertising that via a storage service
     *     such as xenstore).
     *  2. Physical interrupts. A domain with suitable hardware-access
     *     privileges can bind an event-channel port to a physical interrupt
     *     source.
     *  3. Virtual interrupts ('events'). A domain can bind an event-channel
     *     port to a virtual interrupt source, such as the virtual-timer
     *     device or the emergency console.
     *
     * Event channels are addressed by a "port index". Each channel is
     * associated with two bits of information:
     *  1. PENDING -- notifies the domain that there is a pending notification
     *     to be processed. This bit is cleared by the guest.
     *  2. MASK -- if this bit is clear then a 0->1 transition of PENDING
     *     will cause an asynchronous upcall to be scheduled. This bit is only
     *     updated by the guest. It is read-only within Xen. If a channel
     *     becomes pending while the channel is masked then the 'edge' is lost
     *     (i.e., when the channel is unmasked, the guest must manually handle
     *     pending notifications as no upcall will be scheduled by Xen).
     *
     * To expedite scanning of pending notifications, any 0->1 pending
     * transition on an unmasked channel causes a corresponding bit in a
     * per-vcpu selector word to be set. Each bit in the selector covers a
     * 'C long' in the PENDING bitfield array.
     */
    unsigned long evtchn_pending[sizeof(unsigned long) * 8];
    unsigned long evtchn_mask[sizeof(unsigned long) * 8];

    /*
     * Wallclock time: updated only by control software. Guests should base
     * their gettimeofday() syscall on this wallclock-base value.
     */
    uint32_t wc_version;      /* Version counter: see vcpu_time_info_t. */
    uint32_t wc_sec;          /* Secs  00:00:00 UTC, Jan 1, 1970.  */
    uint32_t wc_nsec;         /* Nsecs 00:00:00 UTC, Jan 1, 1970.  */

    arch_shared_info_t arch;

} shared_info_t;
\end{verbatim}
\normalsize

\begin{description}
\item[vcpu\_info] An array of {\bf vcpu\_info\_t} structures, each of
  which holds either runtime information about a virtual CPU, or is
  ``empty'' if the corresponding VCPU does not exist.
\item[evtchn\_pending] Guest-global array, with one bit per event
  channel. Bits are set if an event is currently pending on that
  channel.
\item[evtchn\_mask] Guest-global array for masking notifications on
  event channels.
\item[wc\_version] Version counter for current wallclock time.
\item[wc\_sec] Whole seconds component of current wallclock time.
\item[wc\_nsec] Nanoseconds component of current wallclock time.
\item[arch] Host architecture-dependent portion of the shared info
  structure.
\end{description}

\subsection{vcpu\_info\_t}

\scriptsize
\begin{verbatim}
typedef struct vcpu_info {
    /*
     * 'evtchn_upcall_pending' is written non-zero by Xen to indicate
     * a pending notification for a particular VCPU. It is then cleared
     * by the guest OS /before/ checking for pending work, thus avoiding
     * a set-and-check race. Note that the mask is only accessed by Xen
     * on the CPU that is currently hosting the VCPU. This means that the
     * pending and mask flags can be updated by the guest without special
     * synchronisation (i.e., no need for the x86 LOCK prefix).
     * This may seem suboptimal because if the pending flag is set by
     * a different CPU then an IPI may be scheduled even when the mask
     * is set. However, note:
     *  1. The task of 'interrupt holdoff' is covered by the per-event-
     *     channel mask bits. A 'noisy' event that is continually being
     *     triggered can be masked at source at this very precise
     *     granularity.
     *  2. The main purpose of the per-VCPU mask is therefore to restrict
     *     reentrant execution: whether for concurrency control, or to
     *     prevent unbounded stack usage. Whatever the purpose, we expect
     *     that the mask will be asserted only for short periods at a time,
     *     and so the likelihood of a 'spurious' IPI is suitably small.
     * The mask is read before making an event upcall to the guest: a
     * non-zero mask therefore guarantees that the VCPU will not receive
     * an upcall activation. The mask is cleared when the VCPU requests
     * to block: this avoids wakeup-waiting races.
     */
    uint8_t evtchn_upcall_pending;
    uint8_t evtchn_upcall_mask;
    unsigned long evtchn_pending_sel;
    arch_vcpu_info_t arch;
    vcpu_time_info_t time;
} vcpu_info_t; /* 64 bytes (x86) */
\end{verbatim}
\normalsize

\begin{description}
\item[evtchn\_upcall\_pending] This is set non-zero by Xen to indicate
  that there are pending events to be received.
\item[evtchn\_upcall\_mask] This is set non-zero to disable all
  interrupts for this CPU for short periods of time. If individual
  event channels need to be masked, the {\bf evtchn\_mask} in the {\bf
  shared\_info\_t} is used instead.
\item[evtchn\_pending\_sel] When an event is delivered to this VCPU, a
  bit is set in this selector to indicate which word of the {\bf
  evtchn\_pending} array in the {\bf shared\_info\_t} contains the
  event in question.
\item[arch] Architecture-specific VCPU info. On x86 this contains the
  virtualized CR2 register (page fault linear address) for this VCPU.
\item[time] Time values for this VCPU.
\end{description}
---|
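The interplay of these three fields can be seen in a sketch of a guest's upcall dispatch loop.  The following is an illustrative simplification, not Linux's actual handler: the structures are trimmed mirrors of the ones described above, the demo handler is hypothetical, and a real guest would use atomic (locked) bit operations and memory barriers where this sketch uses plain loads and stores.

```c
#include <stdint.h>

#define NR_EVENT_WORDS 32
#define BITS_PER_LONG  (8 * (unsigned int)sizeof(unsigned long))

/* Trimmed mirrors of the structures described above. */
struct shared_info {
    unsigned long evtchn_pending[NR_EVENT_WORDS];
    unsigned long evtchn_mask[NR_EVENT_WORDS];
};

struct vcpu_info {
    uint8_t       evtchn_upcall_pending;
    uint8_t       evtchn_upcall_mask;
    unsigned long evtchn_pending_sel;
};

/* Fetch-and-clear; a real guest uses an atomic xchg here. */
static unsigned long grab(unsigned long *p)
{
    unsigned long v = *p;
    *p = 0;
    return v;
}

/* Scan the selector, then the selected pending words, invoking the
 * handler once per pending, unmasked event channel port. */
void do_upcall(struct vcpu_info *v, struct shared_info *s,
               void (*handler)(unsigned int port))
{
    /* Clear the pending flag BEFORE scanning for work, avoiding the
     * set-and-check race described in the comment above. */
    v->evtchn_upcall_pending = 0;

    unsigned long sel = grab(&v->evtchn_pending_sel);
    while (sel != 0) {
        unsigned int word = (unsigned int)__builtin_ctzl(sel);
        sel &= sel - 1;                      /* clear lowest set bit */

        unsigned long bits = s->evtchn_pending[word]
                             & ~s->evtchn_mask[word];
        while (bits != 0) {
            unsigned int bit = (unsigned int)__builtin_ctzl(bits);
            bits &= bits - 1;
            s->evtchn_pending[word] &= ~(1UL << bit);  /* ack the event */
            handler(word * BITS_PER_LONG + bit);
        }
    }
}

/* Tiny demo handler: records the last port delivered. */
static unsigned int last_port = ~0u;
static void record_port(unsigned int port) { last_port = port; }
```

Note that masked events are deliberately left pending: when the guest later clears the mask bit, it must re-check the pending word, exactly as the chapter on event channels requires.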

\subsection{vcpu\_time\_info}

\scriptsize
\begin{verbatim}
typedef struct vcpu_time_info {
    /*
     * Updates to the following values are preceded and followed by an
     * increment of 'version'. The guest can therefore detect updates by
     * looking for changes to 'version'. If the least-significant bit of
     * the version number is set then an update is in progress and the guest
     * must wait to read a consistent set of values.
     * The correct way to interact with the version number is similar to
     * Linux's seqlock: see the implementations of read_seqbegin/read_seqretry.
     */
    uint32_t version;
    uint32_t pad0;
    uint64_t tsc_timestamp;   /* TSC at last update of time vals.  */
    uint64_t system_time;     /* Time, in nanosecs, since boot.    */
    /*
     * Current system time:
     *   system_time + ((tsc - tsc_timestamp) << tsc_shift) * tsc_to_system_mul
     * CPU frequency (Hz):
     *   ((10^9 << 32) / tsc_to_system_mul) >> tsc_shift
     */
    uint32_t tsc_to_system_mul;
    int8_t   tsc_shift;
    int8_t   pad1[3];
} vcpu_time_info_t; /* 32 bytes */
\end{verbatim}
\normalsize

\begin{description}
\item[version] Used to ensure the guest gets consistent time updates.
\item[tsc\_timestamp] Cycle counter timestamp of the last time value;
  can be used, for instance, to extrapolate between updates.
\item[system\_time] Time since boot (nanoseconds).
\item[tsc\_to\_system\_mul] Cycle counter to nanoseconds multiplier
  (used in extrapolating current time).
\item[tsc\_shift] Cycle counter to nanoseconds shift (used in
  extrapolating current time).
\end{description}
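The extrapolation formula and the {\tt version} protocol combine into a consistent-read routine.  The C sketch below is illustrative only (it is not code from Xen or Linux): the structure is a simplified mirror of {\tt vcpu\_time\_info\_t}, the 128-bit multiply relies on a gcc/clang extension, and real guest code must additionally place compiler/CPU barriers around the {\tt version} reads.

```c
#include <stdint.h>

/* Simplified mirror of vcpu_time_info_t (padding omitted). */
struct vcpu_time_info {
    uint32_t version;
    uint64_t tsc_timestamp;
    uint64_t system_time;
    uint32_t tsc_to_system_mul;  /* 32.32 fixed-point ns per scaled tick */
    int8_t   tsc_shift;
};

/* Scale a TSC delta to nanoseconds: shift, then apply the 32.32
 * fixed-point multiplier (64x32 multiply, keep bits 32..95). */
static uint64_t scale_delta(uint64_t delta, uint32_t mul, int8_t shift)
{
    if (shift >= 0)
        delta <<= shift;
    else
        delta >>= -shift;
    /* unsigned __int128 is a gcc/clang extension. */
    return (uint64_t)(((unsigned __int128)delta * mul) >> 32);
}

/* Seqlock-style consistent read, per the 'version' protocol above:
 * retry while an update is in progress (odd version) or the version
 * changed under us. */
uint64_t system_time_ns(const volatile struct vcpu_time_info *t,
                        uint64_t tsc)
{
    uint32_t ver, mul;
    uint64_t stamp, base;
    int8_t shift;
    do {
        ver   = t->version;
        stamp = t->tsc_timestamp;
        base  = t->system_time;
        mul   = t->tsc_to_system_mul;
        shift = t->tsc_shift;
    } while ((ver & 1) || ver != t->version);
    return base + scale_delta(tsc - stamp, mul, shift);
}

/* CPU frequency (Hz), per the formula quoted in the structure. */
uint64_t cpu_freq_hz(uint32_t mul, int8_t shift)
{
    uint64_t f = ((uint64_t)1000000000 << 32) / mul;
    return (shift >= 0) ? (f >> shift) : (f << -shift);
}
```

For example, a multiplier of {\tt 0x80000000} (0.5 in 32.32 fixed point) with a shift of 1 describes a 1\,GHz cycle counter: each tick is doubled and then halved to one nanosecond.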

\subsection{arch\_shared\_info\_t}

On x86, the {\bf arch\_shared\_info\_t} is defined as follows (from
xen/public/arch-x86\_32.h):

\scriptsize
\begin{verbatim}
typedef struct arch_shared_info {
    unsigned long max_pfn;        /* max pfn that appears in table */
    /* Frame containing list of mfns containing list of mfns containing p2m. */
    unsigned long pfn_to_mfn_frame_list_list;
} arch_shared_info_t;
\end{verbatim}
\normalsize

\begin{description}
\item[max\_pfn] The maximum PFN listed in the physical-to-machine
  mapping table (P2M table).
\item[pfn\_to\_mfn\_frame\_list\_list] Machine address of the frame
  that contains the machine addresses of the P2M table frames.
\end{description}

\section{Start info page}

The start info structure is declared as follows (in {\bf
xen/include/public/xen.h}):

\scriptsize
\begin{verbatim}
#define MAX_GUEST_CMDLINE 1024
typedef struct start_info {
    /* THE FOLLOWING ARE FILLED IN BOTH ON INITIAL BOOT AND ON RESUME.    */
    char magic[32];             /* "Xen-<version>.<subversion>".          */
    unsigned long nr_pages;     /* Total pages allocated to this domain.  */
    unsigned long shared_info;  /* MACHINE address of shared info struct. */
    uint32_t flags;             /* SIF_xxx flags.                         */
    unsigned long store_mfn;    /* MACHINE page number of shared page.    */
    uint32_t store_evtchn;      /* Event channel for store communication. */
    unsigned long console_mfn;  /* MACHINE address of console page.       */
    uint32_t console_evtchn;    /* Event channel for console messages.    */
    /* THE FOLLOWING ARE ONLY FILLED IN ON INITIAL BOOT (NOT RESUME).     */
    unsigned long pt_base;      /* VIRTUAL address of page directory.     */
    unsigned long nr_pt_frames; /* Number of bootstrap p.t. frames.       */
    unsigned long mfn_list;     /* VIRTUAL address of page-frame list.    */
    unsigned long mod_start;    /* VIRTUAL address of pre-loaded module.  */
    unsigned long mod_len;      /* Size (bytes) of pre-loaded module.     */
    int8_t cmd_line[MAX_GUEST_CMDLINE];
} start_info_t;
\end{verbatim}
\normalsize

The fields fall into two groups: the first group is filled in whenever
a domain is booted or resumed; the second is only filled in at boot
time.

The always-available group is as follows:

\begin{description}
\item[magic] A text string identifying the Xen version to the guest.
\item[nr\_pages] The number of real machine pages available to the
  guest.
\item[shared\_info] Machine address of the shared info structure,
  allowing the guest to map it during initialisation.
\item[flags] Flags describing optional extra settings to the
  guest.
\item[store\_mfn] Machine address of the Xenstore communications page.
\item[store\_evtchn] Event channel to communicate with the store.
\item[console\_mfn] Machine address of the console data page.
\item[console\_evtchn] Event channel to notify the console backend.
\end{description}

The boot-only group may only safely be referred to during system boot:

\begin{description}
\item[pt\_base] Virtual address of the page directory created for us
  by the domain builder.
\item[nr\_pt\_frames] Number of frames used by the domain builder's
  bootstrap pagetables.
\item[mfn\_list] Virtual address of the list of machine frames this
  domain owns.
\item[mod\_start] Virtual address of any pre-loaded module
  (e.g., a ramdisk).
\item[mod\_len] Size of the pre-loaded module (if any).
\item[cmd\_line] Kernel command line passed by the domain builder.
\end{description}
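As an illustration of how a guest consumes the always-available group, the {\bf magic} string can be validated at start of day before anything else in the structure is trusted.  The helpers below are a hypothetical sketch rather than code from any real guest kernel; the {\tt "Xen-<version>.<subversion>"} format they assume is the one documented in the structure above.

```c
#include <stdio.h>

/* Parse a start_info magic string of the form "Xen-<major>.<minor>".
 * Returns 1 and fills in the version numbers on success, 0 otherwise. */
int parse_xen_magic(const char *magic, int *major, int *minor)
{
    return sscanf(magic, "Xen-%d.%d", major, minor) == 2;
}

/* A guest might then insist on a minimum hypervisor major version
 * before continuing with boot. */
int xen_version_ok(const char *magic, int min_major)
{
    int ma, mi;
    return parse_xen_magic(magic, &ma, &mi) && ma >= min_major;
}
```

A guest booting under the hypervisor described here would see a magic string of {\tt "Xen-3.0"}.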


% by Mark Williamson <mark.williamson@cl.cam.ac.uk>

\chapter{Event Channels}
\label{c:eventchannels}

Event channels are the basic primitive provided by Xen for event
notifications.  An event is the Xen equivalent of a hardware
interrupt.  An event channel essentially stores one bit of
information; the event of interest is signalled by transitioning this
bit from 0 to 1.

Notifications are received by a guest via an upcall from Xen,
indicating when an event arrives (setting the bit).  Further
notifications are masked until the bit is cleared again (therefore,
guests must check the value of the bit after re-enabling event
delivery to ensure no missed notifications).

Event notifications can be masked by setting a flag; this is
equivalent to disabling interrupts and can be used to ensure atomicity
of certain operations in the guest kernel.

\section{Hypercall interface}

\hypercall{event\_channel\_op(evtchn\_op\_t *op)}

The event channel operation hypercall is used for all operations on
event channels / ports.  Operations are distinguished by the value of
the {\bf cmd} field of the {\bf op} structure.  The possible commands
are described below:

\begin{description}

\item[EVTCHNOP\_alloc\_unbound]
  Allocate a new event channel port, ready to be connected to by a
  remote domain.
  \begin{itemize}
  \item Specified domain must exist.
  \item A free port must exist in that domain.
  \end{itemize}
  Unprivileged domains may only allocate their own ports; privileged
  domains may also allocate ports in other domains.
\item[EVTCHNOP\_bind\_interdomain]
  Bind an event channel for interdomain communications.
  \begin{itemize}
  \item Caller domain must have a free port to bind.
  \item Remote domain must exist.
  \item Remote port must be allocated and currently unbound.
  \item Remote port must be expecting the caller domain as the ``remote''.
  \end{itemize}
\item[EVTCHNOP\_bind\_virq]
  Allocate a port and bind a VIRQ to it.
  \begin{itemize}
  \item Caller domain must have a free port to bind.
  \item VIRQ must be valid.
  \item VCPU must exist.
  \item VIRQ must not currently be bound to an event channel.
  \end{itemize}
\item[EVTCHNOP\_bind\_ipi]
  Allocate and bind a port for notifying other virtual CPUs.
  \begin{itemize}
  \item Caller domain must have a free port to bind.
  \item VCPU must exist.
  \end{itemize}
\item[EVTCHNOP\_bind\_pirq]
  Allocate and bind a port to a real IRQ.
  \begin{itemize}
  \item Caller domain must have a free port to bind.
  \item PIRQ must be within the valid range.
  \item Another binding for this PIRQ must not exist for this domain.
  \item Caller must have an available port.
  \end{itemize}
\item[EVTCHNOP\_close]
  Close an event channel (no more events will be received).
  \begin{itemize}
  \item Port must be valid (currently allocated).
  \end{itemize}
\item[EVTCHNOP\_send] Send a notification on an event channel attached
  to a port.
  \begin{itemize}
  \item Port must be valid.
  \item Only valid for interdomain, IPI, or allocated-unbound ports.
  \end{itemize}
\item[EVTCHNOP\_status] Query the status of a port: what kind of port
  it is, whether it is bound, what remote domain is expected, what PIRQ
  or VIRQ it is bound to, what VCPU will be notified, etc.
  Unprivileged domains may only query the state of their own ports.
  Privileged domains may query any port.
\item[EVTCHNOP\_bind\_vcpu] Bind an event channel to a particular VCPU:
  notification upcalls will be received only on that VCPU.
  \begin{itemize}
  \item VCPU must exist.
  \item Port must be valid.
  \item Event channel must be either: allocated but unbound, bound to
    an interdomain event channel, or bound to a PIRQ.
  \end{itemize}

\end{description}
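The preconditions above imply a particular connection handshake: domain A allocates an unbound port naming B as the expected remote, communicates the port number to B out of band (typically via Xenstore), and B then binds one of its own free ports to it.  The toy model below illustrates only that ordering and its precondition checks; the structures and functions are hypothetical simplifications, not Xen's actual {\bf evtchn\_op\_t} interface, and no hypercalls are involved.

```c
#include <stdint.h>

#define NR_PORTS 16

enum port_state { PORT_FREE, PORT_UNBOUND, PORT_INTERDOMAIN };

struct port {
    enum port_state state;
    uint16_t remote_dom;    /* expected/connected peer domain */
    uint16_t remote_port;
};

struct domain {
    uint16_t domid;
    struct port ports[NR_PORTS];
};

/* EVTCHNOP_alloc_unbound (simplified): allocate a free port in 'd',
 * ready to be connected to by 'remote_dom'.  Returns port or -1. */
int alloc_unbound(struct domain *d, uint16_t remote_dom)
{
    for (int p = 0; p < NR_PORTS; p++) {
        if (d->ports[p].state == PORT_FREE) {
            d->ports[p].state = PORT_UNBOUND;
            d->ports[p].remote_dom = remote_dom;
            return p;
        }
    }
    return -1;                       /* no free port in that domain */
}

/* EVTCHNOP_bind_interdomain (simplified): bind a free local port in
 * 'local' to an allocated, unbound port in 'remote' that is expecting
 * the caller as its peer.  Returns the local port or -1. */
int bind_interdomain(struct domain *local, struct domain *remote, int rport)
{
    struct port *r = &remote->ports[rport];
    if (r->state != PORT_UNBOUND || r->remote_dom != local->domid)
        return -1;                   /* precondition violated */
    int lport = alloc_unbound(local, remote->domid);  /* reuse allocator */
    if (lport < 0)
        return -1;
    local->ports[lport].state = PORT_INTERDOMAIN;
    local->ports[lport].remote_port = (uint16_t)rport;
    r->state = PORT_INTERDOMAIN;
    r->remote_port = (uint16_t)lport;
    return lport;
}
```

Note how the model rejects a bind from any domain other than the one named at allocation time, mirroring the ``remote port must be expecting the caller domain'' rule.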

%%
%% grant_tables.tex
%%
%% Made by Mark Williamson
%% Login <mark@maw48>
%%

\chapter{Grant tables}
\label{c:granttables}

Xen's grant tables provide a generic mechanism for memory sharing
between domains.  This shared memory interface underpins the split
device drivers for block and network IO.

Each domain has its own {\bf grant table}.  This is a data structure
that is shared with Xen; it allows the domain to tell Xen what kind of
permissions other domains have on its pages.  Entries in the grant
table are identified by {\bf grant references}.  A grant reference is
an integer, which indexes into the grant table.  It acts as a
capability which the grantee can use to perform operations on the
granter's memory.

This capability-based system allows shared-memory communications
between unprivileged domains.  A grant reference also encapsulates the
details of a shared page, removing the need for a domain to know the
real machine address of a page it is sharing.  This makes it possible
to share memory correctly with domains running in fully virtualised
memory.

\section{Interface}

\subsection{Grant table manipulation}

Creating and destroying grant references is done by direct access to
the grant table.  This removes the need to involve Xen when creating
grant references, modifying access permissions, etc.  The grantee
domain will invoke hypercalls to use the grant references.  Four main
operations can be accomplished by directly manipulating the table:

\begin{description}
\item[Grant foreign access] allocate a new entry in the grant table
  and fill out the access permissions accordingly.  The access
  permissions will be looked up by Xen when the grantee attempts to
  use the reference to map the granted frame.
\item[End foreign access] check that the grant reference is not
  currently in use, then remove the mapping permissions for the frame.
  This prevents further mappings from taking place but does not allow
  forced revocation of existing mappings.
\item[Grant foreign transfer] allocate a new entry in the table
  specifying transfer permissions for the grantee.  Xen will look up
  this entry when the grantee attempts to transfer a frame to the
  granter.
\item[End foreign transfer] remove permissions to prevent a transfer
  occurring in future.  If the transfer is already committed,
  modifying the grant table cannot prevent it from completing.
\end{description}
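A grant-foreign-access operation is, concretely, just a write to a grant table entry.  The sketch below uses a hypothetical mirror of a version-1 grant entry (flags, domid, frame), with flag values modelled on those in {\bf xen/include/public/grant\_table.h} but not guaranteed to match; as the comment notes, real code must write {\tt domid} and {\tt frame} before setting the flags, with a write barrier in between, because Xen may inspect the entry concurrently.

```c
#include <stdint.h>

/* Hypothetical mirror of a v1 grant entry and its flag bits. */
typedef struct grant_entry {
    uint16_t flags;     /* GTF_xxx; inspected and updated by Xen */
    uint16_t domid;     /* domain being granted access           */
    uint32_t frame;     /* machine frame being granted           */
} grant_entry_t;

#define GTF_invalid        0
#define GTF_permit_access  1
#define GTF_readonly       (1U << 2)  /* grantee may only map read-only  */
#define GTF_reading        (1U << 3)  /* set by Xen while mapped to read  */
#define GTF_writing        (1U << 4)  /* set by Xen while mapped to write */

/* Grant foreign access: fill in the entry, making the flags word
 * valid only after the rest of the entry is in place. */
void gnttab_grant_access(grant_entry_t *e, uint16_t domid,
                         uint32_t frame, int readonly)
{
    e->domid = domid;
    e->frame = frame;
    /* A real implementation needs a write barrier here: Xen must not
     * observe GTF_permit_access before domid/frame are valid. */
    e->flags = GTF_permit_access | (readonly ? GTF_readonly : 0);
}

/* End foreign access: refuse if the grantee still has the frame
 * mapped (Xen keeps the GTF_reading/GTF_writing bits up to date).
 * Returns 1 on success, 0 if the reference is still in use --
 * existing mappings cannot be forcibly revoked. */
int gnttab_end_access(grant_entry_t *e)
{
    if (e->flags & (GTF_reading | GTF_writing))
        return 0;
    e->flags = GTF_invalid;
    return 1;
}
```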

\subsection{Hypercalls}

Use of grant references is accomplished via a hypercall.  The grant
table op hypercall takes three arguments:

\hypercall{grant\_table\_op(unsigned int cmd, void *uop, unsigned int count)}

{\bf cmd} indicates the grant table operation of interest.  {\bf uop}
is a pointer to a structure (or an array of structures) describing the
operation to be performed.  The {\bf count} field describes how many
grant table operations are being batched together.

The core logic is situated in {\bf xen/common/grant\_table.c}.  The
grant table operation hypercall can be used to perform the following
actions:

\begin{description}
\item[GNTTABOP\_map\_grant\_ref] Given a grant reference from another
  domain, map the referred page into the caller's address space.
\item[GNTTABOP\_unmap\_grant\_ref] Remove a mapping to a granted frame
  from the caller's address space.  This is used to voluntarily
  relinquish a mapping to a granted page.
\item[GNTTABOP\_setup\_table] Set up the grant table for the calling
  domain.
\item[GNTTABOP\_dump\_table] Debugging operation.
\item[GNTTABOP\_transfer] Given a transfer reference from another
  domain, transfer ownership of a page frame to that domain.
\end{description}
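To illustrate the batching afforded by {\bf count}, the sketch below prepares an array of map operations for one hypercall.  The structure is a hypothetical mirror of the map-grant-reference operation: the field names follow the public headers, but the exact layout and the {\tt GNTMAP\_host\_map} flag value are illustrative assumptions, and the hypercall invocation itself (architecture-specific) is omitted.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical mirror of the map-grant-reference operation. */
typedef struct gnttab_map_grant_ref {
    /* IN parameters. */
    uint64_t host_addr;   /* address at which to create the mapping    */
    uint32_t flags;       /* GNTMAP_xxx                                */
    uint32_t ref;         /* grant reference to map                    */
    uint16_t dom;         /* granter domain id                         */
    /* OUT parameters, filled in by Xen. */
    int16_t  status;
    uint32_t handle;      /* needed later for GNTTABOP_unmap_grant_ref */
} gnttab_map_grant_ref_t;

#define GNTMAP_host_map (1U << 0)   /* illustrative flag value */

/* Prepare a batch of map operations for consecutive grant references;
 * the caller would then issue ONE grant_table_op hypercall with
 * cmd = GNTTABOP_map_grant_ref and count = n. */
void prepare_map_batch(gnttab_map_grant_ref_t *ops, size_t n,
                       uint16_t dom, uint32_t first_ref,
                       uint64_t base_addr, uint64_t page_size)
{
    for (size_t i = 0; i < n; i++) {
        ops[i].host_addr = base_addr + i * page_size;
        ops[i].flags     = GNTMAP_host_map;
        ops[i].ref       = first_ref + (uint32_t)i;
        ops[i].dom       = dom;
        ops[i].status    = 0;
    }
}
```

Batching this way amortises the hypercall cost over many mappings, which matters for the block and network rings described in the next chapter.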
| 896 | |
---|
| 897 | %% |
---|
| 898 | %% xenstore.tex |
---|
| 899 | %% |
---|
| 900 | %% Made by Mark Williamson |
---|
| 901 | %% Login <mark@maw48> |
---|
| 902 | %% |
---|
| 903 | |
---|
| 904 | \chapter{Xenstore} |
---|
| 905 | |
---|
| 906 | Xenstore is the mechanism by which control-plane activities occur. |
---|
| 907 | These activities include: |
---|
| 908 | |
---|
| 909 | \begin{itemize} |
---|
| 910 | \item Setting up shared memory regions and event channels for use with |
---|
| 911 | the split device drivers. |
---|
| 912 | \item Notifying the guest of control events (e.g. balloon driver |
---|
| 913 | requests) |
---|
| 914 | \item Reporting back status information from the guest |
---|
| 915 | (e.g. performance-related statistics, etc). |
---|
| 916 | \end{itemize} |
---|
| 917 | |
---|
| 918 | The store is arranged as a hierachical collection of key-value pairs. |
---|
| 919 | Each domain has a directory hierarchy containing data related to its |
---|
| 920 | configuration. Domains are permitted to register for notifications |
---|
| 921 | about changes in subtrees of the store, and to apply changes to the |
---|
| 922 | store transactionally. |
---|
| 923 | |
---|
| 924 | \section{Guidelines} |
---|
| 925 | |
---|
| 926 | A few principles govern the operation of the store: |
---|
| 927 | |
---|
| 928 | \begin{itemize} |
---|
| 929 | \item Domains should only modify the contents of their own |
---|
| 930 | directories. |
---|
| 931 | \item The setup protocol for a device channel should simply consist of |
---|
| 932 | entering the configuration data into the store. |
---|
| 933 | \item The store should allow device discovery without requiring the |
---|
| 934 | relevant device drivers to be loaded: a Xen ``bus'' should be |
---|
| 935 | visible to probing code in the guest. |
---|
| 936 | \item The store should be usable for inter-tool communications, |
---|
| 937 | allowing the tools themselves to be decomposed into a number of |
---|
| 938 | smaller utilities, rather than a single monolithic entity. This |
---|
| 939 | also facilitates the development of alternate user interfaces to the |
---|
| 940 | same functionality. |
---|
| 941 | \end{itemize} |
---|
| 942 | |
---|
| 943 | \section{Store layout} |
---|
| 944 | |
---|
| 945 | There are three main paths in XenStore: |
---|
| 946 | |
---|
| 947 | \begin{description} |
---|
| 948 | \item[/vm] stores configuration information about domain |
---|
| 949 | \item[/local/domain] stores information about the domain on the local node (domid, etc.) |
---|
| 950 | \item[/tool] stores information for the various tools |
---|
| 951 | \end{description} |
---|
| 952 | |
---|
| 953 | The {\bf /vm} path stores configuration information for a domain. |
---|
| 954 | This information doesn't change and is indexed by the domain's UUID. |
---|
| 955 | A {\bf /vm} entry contains the following information: |
---|
| 956 | |
---|
| 957 | \begin{description} |
---|
| 958 | \item[uuid] uuid of the domain (somewhat redundant) |
---|
| 959 | \item[on\_reboot] the action to take on a domain reboot request (destroy or restart) |
---|
| 960 | \item[on\_poweroff] the action to take on a domain halt request (destroy or restart) |
---|
| 961 | \item[on\_crash] the action to take on a domain crash (destroy or restart) |
---|
| 962 | \item[vcpus] the number of allocated vcpus for the domain |
---|
| 963 | \item[memory] the amount of memory (in megabytes) for the domain Note: appears to sometimes be empty for domain-0 |
---|
| 964 | \item[vcpu\_avail] the number of active vcpus for the domain (vcpus - number of disabled vcpus) |
---|
| 965 | \item[name] the name of the domain |
---|
| 966 | \end{description} |
---|
| 967 | |
---|
| 968 | |
---|
| 969 | {\bf /vm/$<$uuid$>$/image/} |
---|
| 970 | |
---|
| 971 | The image path is only available for Domain-Us and contains: |
---|
| 972 | \begin{description} |
---|
| 973 | \item[ostype] identifies the builder type (linux or vmx) |
---|
| 974 | \item[kernel] path to kernel on domain-0 |
---|
| 975 | \item[cmdline] command line to pass to domain-U kernel |
---|
| 976 | \item[ramdisk] path to ramdisk on domain-0 |
---|
| 977 | \end{description} |
---|
| 978 | |
---|
| 979 | {\bf /local} |
---|
| 980 | |
---|
| 981 | The {\tt /local} path currently only contains one directory, {\tt |
---|
| 982 | /local/domain} that is indexed by domain id. It contains the running |
---|
| 983 | domain information. The reason to have two storage areas is that |
---|
| 984 | during migration, the uuid doesn't change but the domain id does. The |
---|
| 985 | {\tt /local/domain} directory can be created and populated before |
---|
| 986 | finalizing the migration enabling localhost to localhost migration. |
---|
| 987 | |
---|
| 988 | {\bf /local/domain/$<$domid$>$} |
---|
| 989 | |
---|
| 990 | This path contains: |
---|
| 991 | |
---|
| 992 | \begin{description} |
---|
| 993 | \item[cpu\_time] xend start time (this is only around for domain-0) |
---|
| 994 | \item[handle] private handle for xend |
---|
| 995 | \item[name] see /vm |
---|
| 996 | \item[on\_reboot] see /vm |
---|
| 997 | \item[on\_poweroff] see /vm |
---|
| 998 | \item[on\_crash] see /vm |
---|
| 999 | \item[vm] the path to the VM directory for the domain |
---|
| 1000 | \item[domid] the domain id (somewhat redundant) |
---|
| 1001 | \item[running] indicates that the domain is currently running |
---|
| 1002 | \item[memory] the current memory in megabytes for the domain (empty for domain-0?) |
---|
| 1003 | \item[maxmem\_KiB] the maximum memory for the domain (in kilobytes) |
---|
| 1004 | \item[memory\_KiB] the memory allocated to the domain (in kilobytes) |
---|
| 1005 | \item[cpu] the current CPU the domain is pinned to (empty for domain-0?) |
---|
| 1006 | \item[cpu\_weight] the weight assigned to the domain |
---|
| 1007 | \item[vcpu\_avail] a bitmap telling the domain whether it may use a given VCPU |
---|
| 1008 | \item[online\_vcpus] how many vcpus are currently online |
---|
| 1009 | \item[vcpus] the total number of vcpus allocated to the domain |
---|
| 1010 | \item[console/] a directory for console information |
---|
| 1011 | \begin{description} |
---|
| 1012 | \item[ring-ref] the grant table reference of the console ring queue |
---|
| 1013 | \item[port] the event channel being used for the console ring queue (local port) |
---|
| 1014 | \item[tty] the current tty the console data is being exposed of |
---|
| 1015 | \item[limit] the limit (in bytes) of console data to buffer |
---|
| 1016 | \end{description} |
---|
| 1017 | \item[backend/] a directory containing all backends the domain hosts |
---|
| 1018 | \begin{description} |
---|
| 1019 | \item[vbd/] a directory containing vbd backends |
---|
| 1020 | \begin{description} |
---|
| 1021 | \item[$<$domid$>$/] a directory containing vbd's for domid |
---|
| 1022 | \begin{description} |
---|
| 1023 | \item[$<$virtual-device$>$/] a directory for a particular |
---|
| 1024 | virtual-device on domid |
---|
| 1025 | \begin{description} |
---|
| 1026 | \item[frontend-id] domain id of frontend |
---|
| 1027 | \item[frontend] the path to the frontend domain |
---|
| 1028 | \item[physical-device] backend device number |
---|
| 1029 | \item[sector-size] backend sector size |
---|
| 1030 | \item[info] 0 read/write, 1 read-only (is this right?) |
---|
| 1031 | \item[domain] name of frontend domain |
---|
| 1032 | \item[params] parameters for device |
---|
| 1033 | \item[type] the type of the device |
---|
| 1034 | \item[dev] the virtual device (as given by the user) |
---|
| 1035 | \item[node] output from block creation script |
---|
| 1036 | \end{description} |
---|
| 1037 | \end{description} |
---|
| 1038 | \end{description} |
---|
| 1039 | |
---|
| 1040 | \item[vif/] a directory containing vif backends |
---|
| 1041 | \begin{description} |
---|
| 1042 | \item[$<$domid$>$/] a directory containing vif's for domid |
---|
| 1043 | \begin{description} |
---|
| 1044 | \item[$<$vif number$>$/] a directory for each vif |
---|
| 1045 | \item[frontend-id] the domain id of the frontend |
---|
| 1046 | \item[frontend] the path to the frontend |
---|
| 1047 | \item[mac] the mac address of the vif |
---|
| 1048 | \item[bridge] the bridge the vif is connected to |
---|
| 1049 | \item[handle] the handle of the vif |
---|
| 1050 | \item[script] the script used to create/stop the vif |
---|
| 1051 | \item[domain] the name of the frontend |
---|
| 1052 | \end{description} |
---|
| 1053 | \end{description} |
---|
| 1054 | |
---|
| 1055 | \item[vtpm/] a directory containin vtpm backends |
---|
| 1056 | \begin{description} |
---|
| 1057 | \item[$<$domid$>$/] a directory containing vtpm's for domid |
---|
| 1058 | \begin{description} |
---|
| 1059 | \item[$<$vtpm number$>$/] a directory for each vtpm |
---|
| 1060 | \item[frontend-id] the domain id of the frontend |
---|
| 1061 | \item[frontend] the path to the frontend |
---|
| 1062 | \item[instance] the instance of the virtual TPM that is used |
---|
| 1063 | \item[pref{\textunderscore}instance] the instance number as given in the VM configuration file; |
---|
| 1064 | may be different from {\bf instance} |
---|
| 1065 | \item[domain] the name of the domain of the frontend |
---|
| 1066 | \end{description} |
---|
| 1067 | \end{description} |
---|
| 1068 | |
---|
| 1069 | \end{description} |
---|
| 1070 | |
---|
| 1071 | \item[device/] a directory containing the frontend devices for the |
---|
| 1072 | domain |
---|
| 1073 | \begin{description} |
---|
| 1074 | \item[vbd/] a directory containing vbd frontend devices for the |
---|
| 1075 | domain |
---|
| 1076 | \begin{description} |
---|
| 1077 | \item[$<$virtual-device$>$/] a directory containing the vbd frontend for |
---|
| 1078 | virtual-device |
---|
| 1079 | \begin{description} |
---|
| 1080 | \item[virtual-device] the device number of the frontend device |
---|
| 1081 | \item[backend-id] the domain id of the backend |
---|
| 1082 | \item[backend] the path of the backend in the store (/local/domain |
---|
| 1083 | path) |
---|
| 1084 | \item[ring-ref] the grant table reference for the block request |
---|
| 1085 | ring queue |
---|
| 1086 | \item[event-channel] the event channel used for the block request |
---|
| 1087 | ring queue |
---|
| 1088 | \end{description} |
---|
| 1089 | |
---|
| 1090 | \item[vif/] a directory containing vif frontend devices for the |
---|
| 1091 | domain |
---|
| 1092 | \begin{description} |
---|
| 1093 | \item[$<$id$>$/] a directory for vif id frontend device for the domain |
---|
| 1094 | \begin{description} |
---|
| 1095 | \item[backend-id] the backend domain id |
---|
| 1096 | \item[mac] the mac address of the vif |
---|
| 1097 | \item[handle] the internal vif handle |
---|
| 1098 | \item[backend] a path to the backend's store entry |
---|
| 1099 | \item[tx-ring-ref] the grant table reference for the transmission ring queue |
---|
| 1100 | \item[rx-ring-ref] the grant table reference for the receiving ring queue |
---|
| 1101 | \item[event-channel] the event channel used for the two ring queues |
---|
| 1102 | \end{description} |
---|
| 1103 | \end{description} |
---|
| 1104 | |
---|
| 1105 | \item[vtpm/] a directory containing the vtpm frontend device for the |
---|
| 1106 | domain |
---|
| 1107 | \begin{description} |
---|
| 1108 | \item[$<$id$>$] a directory for vtpm id frontend device for the domain |
---|
| 1109 | \begin{description} |
---|
| 1110 | \item[backend-id] the backend domain id |
---|
| 1111 | \item[backend] a path to the backend's store entry |
---|
| 1112 | \item[ring-ref] the grant table reference for the tx/rx ring |
---|
| 1113 | \item[event-channel] the event channel used for the ring |
---|
| 1114 | \end{description} |
---|
| 1115 | \end{description} |
---|
| 1116 | |
---|
| 1117 | \item[device-misc/] miscellanous information for devices |
---|
| 1118 | \begin{description} |
---|
| 1119 | \item[vif/] miscellanous information for vif devices |
---|
| 1120 | \begin{description} |
---|
| 1121 | \item[nextDeviceID] the next device id to use |
---|
| 1122 | \end{description} |
---|
| 1123 | \end{description} |
---|
| 1124 | \end{description} |
---|
| 1125 | \end{description} |
---|
| 1126 | |
---|
| 1127 | \item[security/] access control information for the domain |
---|
| 1128 | \begin{description} |
---|
| 1129 | \item[ssidref] security reference identifier used inside the hypervisor |
---|
| 1130 | \item[access\_control/] security label used by management tools |
---|
| 1131 | \begin{description} |
---|
| 1132 | \item[label] security label name |
---|
| 1133 | \item[policy] security policy name |
---|
| 1134 | \end{description} |
---|
| 1135 | \end{description} |
---|
| 1136 | |
---|
| 1137 | \item[store/] per-domain information for the store |
---|
| 1138 | \begin{description} |
---|
| 1139 | \item[port] the event channel used for the store ring queue |
---|
| 1140 | \item[ring-ref] - the grant table reference used for the store's |
---|
| 1141 | communication channel |
---|
| 1142 | \end{description} |
---|
| 1143 | |
---|
| 1144 | \item[image] - private xend information |
---|
| 1145 | \end{description} |
---|
| 1146 | |
---|
| 1147 | |
\chapter{Devices}
\label{c:devices}

Virtual devices under Xen are provided by a {\bf split device driver}
architecture. The illusion of the virtual device is provided by two
co-operating drivers: the {\bf frontend}, which runs in the
unprivileged domain, and the {\bf backend}, which runs in a domain with
access to the real device hardware (often called a {\bf driver
domain}; in practice domain 0 usually fulfills this function).

The frontend driver appears to the unprivileged guest as if it were a
real device, for instance a block or network device. It receives IO
requests from its kernel as usual; however, since it does not have
access to the physical hardware of the system, it must then issue
requests to the backend. The backend driver is responsible for
receiving these IO requests, verifying that they are safe, and then
issuing them to the real device hardware. The backend driver appears
to its kernel as a normal user of in-kernel IO functionality. When
the IO completes, the backend notifies the frontend that the data is
ready for use; the frontend is then able to report IO completion to
its own kernel.

Frontend drivers are designed to be simple; most of the complexity is
in the backend, which has responsibility for translating device
addresses, verifying that requests are well-formed and do not violate
isolation guarantees, etc.

Split drivers exchange requests and responses in shared memory, with
an event channel for asynchronous notifications of activity. When the
frontend driver comes up, it uses Xenstore to set up a shared memory
frame and an interdomain event channel for communications with the
backend. Once this connection is established, the two can communicate
directly by placing requests / responses into shared memory and then
sending notifications on the event channel. This separation of
notification from data transfer allows message batching, and results
in very efficient device access.

This chapter focuses on some individual split device interfaces
available to Xen guests.

\section{Network I/O}

Virtual network device services are provided by shared memory
communication with a backend domain. From the point of view of other
domains, the backend may be viewed as a virtual ethernet switch
element with each domain having one or more virtual network interfaces
connected to it.

From the point of view of the backend domain itself, the network
backend driver consists of a number of ethernet devices. Each of
these has a logical direct connection to a virtual network device in
another domain. This allows the backend domain to route, bridge,
firewall, etc.\ the traffic to / from the other domains using normal
operating system mechanisms.

\subsection{Backend Packet Handling}

The backend driver is responsible for a variety of actions relating to
the transmission and reception of packets from the physical device.
With regard to transmission, the backend performs these key actions:

\begin{itemize}
\item {\bf Validation:} To ensure that domains do not attempt to
generate invalid (e.g.\ spoofed) traffic, the backend driver may
validate headers, ensuring that source MAC and IP addresses match the
interface that they have been sent from.

Validation functions can be configured using standard firewall rules
({\small{\tt iptables}} in the case of Linux).

\item {\bf Scheduling:} Since a number of domains can share a single
physical network interface, the backend must mediate access when
several domains each have packets queued for transmission. This
general scheduling function subsumes basic shaping or rate-limiting
schemes.

\item {\bf Logging and Accounting:} The backend domain can be
configured with classifier rules that control how packets are
accounted or logged. For example, log messages might be generated
whenever a domain attempts to send a TCP packet containing a SYN.
\end{itemize}

On receipt of incoming packets, the backend acts as a simple
demultiplexer: packets are passed to the appropriate virtual interface
after any necessary logging and accounting have been carried out.

\subsection{Data Transfer}

Each virtual interface uses two ``descriptor rings'', one for
transmit, the other for receive. Each descriptor identifies a block
of contiguous machine memory allocated to the domain.

The transmit ring carries packets to transmit from the guest to the
backend domain. The return path of the transmit ring carries messages
indicating that the contents have been physically transmitted and the
backend no longer requires the associated pages of memory.

To receive packets, the guest places descriptors of unused pages on
the receive ring. The backend will return received packets by
exchanging these pages in the domain's memory with new pages
containing the received data, and passing back descriptors regarding
the new packets on the ring. This zero-copy approach allows the
backend to maintain a pool of free pages to receive packets into, and
then deliver them to appropriate domains after examining their
headers.

% Real physical addresses are used throughout, with the domain
% performing translation from pseudo-physical addresses if that is
% necessary.

If a domain does not keep its receive ring stocked with empty buffers
then packets destined to it may be dropped. This provides some
defence against receive livelock problems because an overloaded domain
will cease to receive further data. Similarly, on the transmit path,
it provides the application with feedback on the rate at which packets
are able to leave the system.

Flow control on rings is achieved by including a pair of producer
indexes on the shared ring page. Each side will maintain a private
consumer index indicating the next outstanding message. In this
manner, the domains cooperate to divide the ring into two message
lists, one in each direction. Notification is decoupled from the
immediate placement of new messages on the ring; the event channel
will be used to generate notification when {\em either} a certain
number of outstanding messages are queued, {\em or} a specified number
of nanoseconds have elapsed since the oldest message was placed on the
ring.
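As an illustration of this scheme, the following C sketch shows how a
frontend might publish requests against a shared producer index while
keeping its consumer index private. The structure and function names
here are illustrative only; they are not the definitions from the Xen
headers.

```c
#include <assert.h>
#include <stdint.h>

/* Ring size must be a power of two so indices can wrap freely and be
 * masked on access. */
#define RING_SIZE 8
#define RING_MASK (RING_SIZE - 1)

struct shared_ring {
    uint32_t req_prod;     /* written by frontend, read by backend */
    uint32_t rsp_prod;     /* written by backend, read by frontend */
    int slot[RING_SIZE];   /* message payloads (simplified) */
};

/* Frontend: queue a request if the ring has space. The frontend's
 * private rsp_cons counts responses it has consumed, so
 * (req_prod - rsp_cons) is the number of slots currently in flight.
 * Returns 1 on success, 0 if the ring is full. */
static int queue_request(struct shared_ring *r, uint32_t rsp_cons, int msg)
{
    if (r->req_prod - rsp_cons == RING_SIZE)
        return 0;                           /* full: would overwrite */
    r->slot[r->req_prod & RING_MASK] = msg;
    r->req_prod++;                          /* publish to the backend */
    return 1;
}
```

The backend symmetrically maintains a private request-consumer index
and publishes {\tt rsp\_prod} as it places responses.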

%% Not sure if my version is any better -- here is what was here
%% before: Synchronization between the backend domain and the guest is
%% achieved using counters held in shared memory that is accessible to
%% both. Each ring has associated producer and consumer indices
%% indicating the area in the ring that holds descriptors that contain
%% data. After receiving {\it n} packets or {\t nanoseconds} after
%% receiving the first packet, the hypervisor sends an event to the
%% domain.

\subsection{Network ring interface}

The network device uses two shared memory rings for communication: one
for transmit, one for receive.

Transmit requests are described by the following structure:

\scriptsize
\begin{verbatim}
typedef struct netif_tx_request {
    grant_ref_t gref;      /* Reference to buffer page */
    uint16_t offset;       /* Offset within buffer page */
    uint16_t flags;        /* NETTXF_* */
    uint16_t id;           /* Echoed in response message. */
    uint16_t size;         /* Packet size in bytes. */
} netif_tx_request_t;
\end{verbatim}
\normalsize

\begin{description}
\item[gref] Grant reference for the network buffer
\item[offset] Offset to data
\item[flags] Transmit flags (currently only NETTXF\_csum\_blank is
supported, to indicate that the protocol checksum field is
incomplete).
\item[id] Echoed to the guest by the backend in the ring-level response
so that the guest can match it to this request
\item[size] Buffer size
\end{description}
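For illustration, the following sketch shows how a frontend might
populate one of these transmit requests. The typedefs and the flag
value are declared locally to mirror the structure above; real guests
include the Xen public headers instead.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t grant_ref_t;          /* illustrative local typedef */
#define NETTXF_csum_blank (1 << 0)     /* checksum not yet filled in */

typedef struct netif_tx_request {
    grant_ref_t gref;     /* Reference to buffer page */
    uint16_t offset;      /* Offset within buffer page */
    uint16_t flags;       /* NETTXF_* */
    uint16_t id;          /* Echoed in response message. */
    uint16_t size;        /* Packet size in bytes. */
} netif_tx_request_t;

/* Build a request for a packet of `len` bytes starting `off` bytes
 * into the granted page; `id` is a private handle the guest later
 * uses to match the tx response back to this request. */
static netif_tx_request_t make_tx_request(grant_ref_t gref, uint16_t off,
                                          uint16_t len, uint16_t id,
                                          int csum_offload)
{
    netif_tx_request_t req;
    req.gref   = gref;
    req.offset = off;
    req.size   = len;
    req.id     = id;
    req.flags  = csum_offload ? NETTXF_csum_blank : 0;
    return req;
}
```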

Each transmit request is followed by a transmit response at some later
time. This is part of the shared-memory communication protocol and
allows the guest to (potentially) retire internal structures related
to the request. It does not imply a network-level response. This
structure is as follows:

\scriptsize
\begin{verbatim}
typedef struct netif_tx_response {
    uint16_t id;
    int16_t status;
} netif_tx_response_t;
\end{verbatim}
\normalsize

\begin{description}
\item[id] Echo of the ID field in the corresponding transmit request.
\item[status] Success / failure status of the transmit request.
\end{description}

Receive requests must be queued by the frontend, accompanied by a
donation of page-frames to the backend. The backend transfers page
frames full of data back to the guest.

\scriptsize
\begin{verbatim}
typedef struct {
    uint16_t id;       /* Echoed in response message. */
    grant_ref_t gref;  /* Reference to incoming granted frame */
} netif_rx_request_t;
\end{verbatim}
\normalsize

\begin{description}
\item[id] Echoed in the response by the backend, so that the frontend
can identify this request.
\item[gref] Transfer reference: the backend will use this reference
to transfer a frame of network data to the guest.
\end{description}

Receive response descriptors are queued for each received frame. Note
that these may only be queued in reply to an existing receive request,
providing an in-built form of traffic throttling.

\scriptsize
\begin{verbatim}
typedef struct {
    uint16_t id;
    uint16_t offset;   /* Offset in page of start of received packet */
    uint16_t flags;    /* NETRXF_* */
    int16_t status;    /* -ve: BLKIF_RSP_* ; +ve: Rx'ed pkt size. */
} netif_rx_response_t;
\end{verbatim}
\normalsize

\begin{description}
\item[id] ID echoed from the original request, used by the guest to
match this response to the original request.
\item[offset] Offset to data within the transferred frame.
\item[flags] Receive flags (currently only NETRXF\_csum\_valid is
supported, to indicate that the protocol checksum field has already
been validated).
\item[status] Success / error status for this operation: negative
values indicate an error, while non-negative values give the size in
bytes of the received packet.
\end{description}
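The status convention can be captured by two small helpers (an
illustrative sketch, not backend code):

```c
#include <assert.h>
#include <stdint.h>

/* Negative status values are error codes; non-negative values give
 * the received packet size in bytes. */
static int rx_is_error(int16_t status) { return status < 0; }

/* Packet size for a successful receive, 0 on error. */
static int rx_pkt_size(int16_t status) { return status < 0 ? 0 : status; }
```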

Note that the receive protocol includes a mechanism for guests to
receive incoming memory frames but there is no explicit transfer of
frames in the other direction. Guests are expected to return memory
to the hypervisor in order to use the network interface. They {\em
must} do this or they will exceed their maximum memory reservation and
will not be able to receive incoming frame transfers. When necessary,
the backend is able to replenish its pool of free network buffers by
claiming some of this free memory from the hypervisor.

\section{Block I/O}

All guest OS disk access goes through the virtual block device (VBD)
interface. This interface allows domains access to portions of block
storage devices visible to the block backend device. The VBD
interface is a split driver, similar to the network interface
described above. A single shared memory ring is used between the
frontend and backend drivers for each virtual device, across which
IO requests and responses are sent.

Any block device accessible to the backend domain, including
network-based block (iSCSI, *NBD, etc.), loopback and LVM/MD devices,
can be exported as a VBD. Each VBD is mapped to a device node in the
guest, specified in the guest's startup configuration.

\subsection{Data Transfer}

The per-(virtual)-device ring between the guest and the block backend
supports two messages:

\begin{description}
\item [{\small {\tt READ}}:] Read data from the specified block
device. The frontend identifies the device and location to read
from and attaches pages for the data to be copied to (typically via
DMA from the device). The backend acknowledges completed read
requests as they finish.

\item [{\small {\tt WRITE}}:] Write data to the specified block
device. This functions essentially as {\small {\tt READ}}, except
that the data moves to the device instead of from it.
\end{description}

%% Rather than copying data, the backend simply maps the domain's
%% buffers in order to enable direct DMA to them. The act of mapping
%% the buffers also increases the reference counts of the underlying
%% pages, so that the unprivileged domain cannot try to return them to
%% the hypervisor, install them as page tables, or any other unsafe
%% behaviour.
%%
%% % block API here

\subsection{Block ring interface}

The block interface is defined by the structures passed over the
shared memory interface. These structures are either requests (from
the frontend to the backend) or responses (from the backend to the
frontend).

The request structure is defined as follows:

\scriptsize
\begin{verbatim}
typedef struct blkif_request {
    uint8_t operation;           /* BLKIF_OP_??? */
    uint8_t nr_segments;         /* number of segments */
    blkif_vdev_t handle;         /* only for read/write requests */
    uint64_t id;                 /* private guest value, echoed in resp */
    blkif_sector_t sector_number;/* start sector idx on disk (r/w only) */
    struct blkif_request_segment {
        grant_ref_t gref;        /* reference to I/O buffer frame */
        /* @first_sect: first sector in frame to transfer (inclusive). */
        /* @last_sect: last sector in frame to transfer (inclusive). */
        uint8_t first_sect, last_sect;
    } seg[BLKIF_MAX_SEGMENTS_PER_REQUEST];
} blkif_request_t;
\end{verbatim}
\normalsize

The fields are as follows:

\begin{description}
\item[operation] operation ID: one of the operations described above
\item[nr\_segments] number of segments for scatter / gather IO
described by this request
\item[handle] identifier for a particular virtual device on this
interface
\item[id] this value is echoed in the response message for this IO;
the guest may use it to identify the original request
\item[sector\_number] start sector on the virtual device for this
request
\item[seg] This array contains structures encoding
scatter-gather IO to be performed:
\begin{description}
\item[gref] The grant reference for the foreign I/O buffer page.
\item[first\_sect] First sector to access within the buffer page (0 to 7).
\item[last\_sect] Last sector to access within the buffer page (0 to 7).
\end{description}
Data will be transferred into frames at an offset determined by the
value of {\tt first\_sect}.
\end{description}
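The sector fields imply simple byte arithmetic. The sketch below
assumes 512-byte sectors (consistent with eight sectors per 4 KiB
page, as the 0 to 7 range above suggests); the constant is an
assumption of this sketch, not a value quoted from the Xen headers.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SIZE 512   /* assumed: 8 x 512 B sectors per 4 KiB page */

/* Byte offset within the granted frame at which the transfer starts. */
static uint32_t seg_byte_offset(uint8_t first_sect)
{
    return (uint32_t)first_sect * SECTOR_SIZE;
}

/* Length in bytes of the transfer; both bounds are inclusive. */
static uint32_t seg_byte_length(uint8_t first_sect, uint8_t last_sect)
{
    return ((uint32_t)last_sect - first_sect + 1) * SECTOR_SIZE;
}
```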

\section{Virtual TPM}

Virtual TPM (VTPM) support provides TPM functionality to each virtual
machine that requests this functionality in its configuration file.
The interface enables domains to access their own private TPM as if it
were a hardware TPM built into the machine.

The virtual TPM interface is implemented as a split driver,
similar to the network and block interfaces described above.
The user domain hosting the frontend exports a character device
{\tt /dev/tpm0} to user-level applications for communicating with the
virtual TPM. This is the same device interface that is also offered if
a hardware TPM is available in the system. The backend provides a
single interface {\tt /dev/vtpm} where the virtual TPM is waiting for
commands from all domains that have located their backend in a given
domain.

\subsection{Data Transfer}

A single shared memory ring is used between the frontend and backend
drivers. TPM requests and responses are sent in pages, where a pointer
to those pages and other information is placed into the ring such that
the backend can map the pages into its memory space using the grant
table mechanism.

The backend driver has been implemented to only accept well-formed
TPM requests. To meet this requirement, the length indicator in the
TPM request must correctly indicate the length of the request.
Otherwise an error message is automatically sent back by the device
driver.

The virtual TPM implementation listens for TPM requests on {\tt
/dev/vtpm}. Since it must be able to apply the TPM request packet to
the virtual TPM instance associated with the virtual machine, a 4-byte
virtual TPM instance identifier is prepended to each packet by the
backend driver (in network byte order) for internal routing of the
request.
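This routing header can be sketched as follows; the helper name is
ours, and the real backend code differs:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Prepend a 4-byte virtual TPM instance identifier, in network
 * (big-endian) byte order, to a TPM command packet. `out` must have
 * room for cmd_len + 4 bytes. Returns the total length written. */
static size_t prepend_instance(uint8_t *out, uint32_t instance,
                               const uint8_t *cmd, size_t cmd_len)
{
    out[0] = (uint8_t)(instance >> 24);   /* most significant byte first */
    out[1] = (uint8_t)(instance >> 16);
    out[2] = (uint8_t)(instance >> 8);
    out[3] = (uint8_t)(instance);
    memcpy(out + 4, cmd, cmd_len);
    return cmd_len + 4;
}
```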

\subsection{Virtual TPM ring interface}

The TPM protocol is a strict request/response protocol, and therefore
only one ring is used: it carries requests from the frontend to the
backend and responses on the reverse path.

The request/response structure is defined as follows:

\scriptsize
\begin{verbatim}
typedef struct {
    unsigned long addr;   /* Machine address of packet. */
    grant_ref_t ref;      /* grant table access reference. */
    uint16_t unused;      /* unused */
    uint16_t size;        /* Packet size in bytes. */
} tpmif_tx_request_t;
\end{verbatim}
\normalsize

The fields are as follows:

\begin{description}
\item[addr] The machine address of the page associated with the TPM
request/response; a request/response may span multiple
pages
\item[ref] The grant table reference associated with the address.
\item[size] The size of the remaining packet; up to
PAGE\_SIZE bytes can be found in the
page referenced by {\tt addr}
\end{description}

The frontend initially allocates several pages whose addresses
are stored in the ring. Only these pages are used for exchange of
requests and responses.


\chapter{Further Information}

If you have questions that are not answered by this manual, the
sources of information listed below may be of interest to you. Note
that bug reports, suggestions and contributions related to the
software (or the documentation) should be sent to the Xen developers'
mailing list (address below).


\section{Other documentation}

If you are mainly interested in using (rather than developing for)
Xen, the \emph{Xen Users' Manual} is distributed in the {\tt docs/}
directory of the Xen source distribution.

% Various HOWTOs are also available in {\tt docs/HOWTOS}.


\section{Online references}

The official Xen web site can be found at:
\begin{quote} {\tt http://www.xensource.com}
\end{quote}

This contains links to the latest versions of all online
documentation, including the latest version of the FAQ.

Information regarding Xen is also available at the Xen Wiki at
\begin{quote} {\tt http://wiki.xensource.com/xenwiki/}\end{quote}
The Xen project uses Bugzilla as its bug tracking system. You'll find
the Xen Bugzilla at
\begin{quote} {\tt http://bugzilla.xensource.com/bugzilla/}\end{quote}


\section{Mailing lists}

There are several mailing lists that are used to discuss Xen-related
topics. The most widely relevant are listed below. An official page of
mailing lists and subscription information can be found at
\begin{quote} {\tt http://lists.xensource.com/} \end{quote}

\begin{description}
\item[xen-devel@lists.xensource.com] Used for development
discussions and bug reports. Subscribe at: \\
{\small {\tt http://lists.xensource.com/xen-devel}}
\item[xen-users@lists.xensource.com] Used for installation and usage
discussions and requests for help. Subscribe at: \\
{\small {\tt http://lists.xensource.com/xen-users}}
\item[xen-announce@lists.xensource.com] Used for announcements only.
Subscribe at: \\
{\small {\tt http://lists.xensource.com/xen-announce}}
\item[xen-changelog@lists.xensource.com] Changelog feed
from the unstable and 2.0 trees; developer oriented. Subscribe at: \\
{\small {\tt http://lists.xensource.com/xen-changelog}}
\end{description}

\appendix


\chapter{Xen Hypercalls}
\label{a:hypercalls}

Hypercalls represent the procedural interface to Xen; this appendix
categorizes and describes the current set of hypercalls.

\section{Invoking Hypercalls}

Hypercalls are invoked in a manner analogous to system calls in a
conventional operating system; a software interrupt is issued which
vectors to an entry point within Xen. On x86/32 machines the
instruction required is {\tt int \$0x82}; the (real) IDT is set up so
that this may only be issued from within ring 1. The particular
hypercall to be invoked is contained in {\tt EAX} --- a list
mapping these values to symbolic hypercall names can be found
in {\tt xen/include/public/xen.h}.

On some occasions a set of hypercalls will be required to carry
out a higher-level function; a good example is when a guest
operating system wishes to context switch to a new process, which
requires updating various pieces of privileged CPU state. As an
optimization for these cases, there is a generic mechanism to issue a
set of hypercalls as a batch:

\begin{quote}
\hypercall{multicall(void *call\_list, int nr\_calls)}

Execute a series of hypervisor calls; {\tt nr\_calls} is the length of
the array of {\tt multicall\_entry\_t} structures pointed to by {\tt
call\_list}. Each entry contains the hypercall operation code followed
by up to 7 word-sized arguments.
\end{quote}

Note that multicalls are provided purely as an optimization; there is
no requirement to use them when first porting a guest operating
system.
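A batch might be assembled as in the following sketch. The entry
layout mirrors the description above (an operation code plus up to 7
word-sized arguments), but the operation numbers and the helper are
illustrative; real guests use the {\tt multicall\_entry\_t} definition
from the Xen public headers.

```c
#include <assert.h>

#define MC_MAX_ARGS 7

/* Illustrative mirror of the multicall entry described above. */
typedef struct multicall_entry {
    unsigned long op;                 /* hypercall operation code */
    unsigned long args[MC_MAX_ARGS];  /* up to 7 word-sized arguments */
} multicall_entry_t;

/* Append one two-argument call to a batch; returns the new length.
 * The batch would then be issued with a single multicall hypercall. */
static int mc_add(multicall_entry_t *list, int n, unsigned long op,
                  unsigned long arg0, unsigned long arg1)
{
    list[n].op = op;
    list[n].args[0] = arg0;
    list[n].args[1] = arg1;
    return n + 1;
}
```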

\section{Virtual CPU Setup}

At start of day, a guest operating system needs to set up the virtual
CPU it is executing on. This includes installing vectors for the
virtual IDT so that the guest OS can handle interrupts, page faults,
etc. However, the very first thing a guest OS must set up is a pair
of hypervisor callbacks: these are the entry points which Xen will
use when it wishes to notify the guest OS of an occurrence.

\begin{quote}
\hypercall{set\_callbacks(unsigned long event\_selector, unsigned long
event\_address, unsigned long failsafe\_selector, unsigned long
failsafe\_address) }

Register the normal (``event'') and failsafe callbacks for
event processing. In each case the code segment selector and
address within that segment are provided. The selectors must
have RPL 1; in XenLinux we simply use the kernel's CS for both
{\bf event\_selector} and {\bf failsafe\_selector}.

The value {\bf event\_address} specifies the address of the guest OS's
event handling and dispatch routine; the {\bf failsafe\_address}
specifies a separate entry point, which is used only if a fault occurs
when Xen attempts to use the normal callback.

\end{quote}

On x86/64 systems the hypercall takes slightly different
arguments. This is because the callback CS does not need to be
specified (since the callbacks are entered via SYSRET), and also
because an entry address needs to be specified for SYSCALLs from
guest user space:

\begin{quote}
\hypercall{set\_callbacks(unsigned long event\_address, unsigned long
failsafe\_address, unsigned long syscall\_address)}
\end{quote}


After installing the hypervisor callbacks, the guest OS can
install a `virtual IDT' by using the following hypercall:

\begin{quote}
\hypercall{set\_trap\_table(trap\_info\_t *table)}

Install one or more entries into the per-domain
trap handler table (essentially a software version of the IDT).
Each entry in the array pointed to by {\bf table} includes the
exception vector number with the corresponding segment selector
and entry point. Most guest OSes can use the same handlers on
Xen as when running on the real hardware.

\end{quote}

A further hypercall is provided for the management of virtual CPUs:

\begin{quote}
\hypercall{vcpu\_op(int cmd, int vcpuid, void *extra\_args)}

This hypercall can be used to bootstrap VCPUs, to bring them up and
down, and to test their current status.

\end{quote}

\section{Scheduling and Timer}

Domains are preemptively scheduled by Xen according to the
parameters installed by domain 0 (see Section~\ref{s:dom0ops}).
In addition, however, a domain may choose to explicitly
control certain behavior with the following hypercall:

\begin{quote}
\hypercall{sched\_op\_new(int cmd, void *extra\_args)}

Request a scheduling operation from the hypervisor. The following
sub-commands are available:

\begin{description}
\item[SCHEDOP\_yield] voluntarily yields the CPU, but leaves the
caller marked as runnable. No extra arguments are passed to this
command.
\item[SCHEDOP\_block] removes the calling domain from the run queue
and causes it to sleep until an event is delivered to it. No extra
arguments are passed to this command.
\item[SCHEDOP\_shutdown] is used to end the calling domain's
execution. The extra argument is a {\bf sched\_shutdown} structure
which indicates the reason why the domain suspended (e.g., for reboot,
halt, power-off).
\item[SCHEDOP\_poll] allows a VCPU to wait on a set of event channels
with an optional timeout (all of which are specified in the {\bf
sched\_poll} extra argument). The semantics are similar to the UNIX
{\bf poll} system call. The caller must have event-channel upcalls
masked when executing this command.
\end{description}
\end{quote}
---|
| 1746 | |
---|
| 1747 | {\bf sched\_op\_new} was not available prior to Xen 3.0.2. Older versions |
---|
| 1748 | provide only the following hypercall: |
---|
| 1749 | |
---|
| 1750 | \begin{quote} |
---|
| 1751 | \hypercall{sched\_op(int cmd, unsigned long extra\_arg)} |
---|
| 1752 | |
---|
| 1753 | This hypercall supports the following subset of {\bf sched\_op\_new} commands: |
---|
| 1754 | |
---|
| 1755 | \begin{description} |
---|
| 1756 | \item[SCHEDOP\_yield] (extra argument is 0). |
---|
| 1757 | \item[SCHEDOP\_block] (extra argument is 0). |
---|
| 1758 | \item[SCHEDOP\_shutdown] (extra argument is numeric reason code). |
---|
| 1759 | \end{description} |
---|
| 1760 | \end{quote} |
---|
| 1761 | |
---|
To aid the implementation of a process scheduler within a guest OS,
Xen provides a virtual programmable timer:

\begin{quote}
\hypercall{set\_timer\_op(uint64\_t timeout)}

Request a timer event to be sent at the specified system time (time
in nanoseconds since system boot).

\end{quote}

Note that calling {\bf set\_timer\_op} prior to blocking via {\bf
sched\_op} allows block-with-timeout semantics.


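The block-with-timeout pattern can be sketched as below. Both hypercalls are stubbed here (they merely record what was requested) so the fragment is self-contained; in a real guest the recorded timeout would arm a one-shot timer event that wakes the blocked domain if nothing else does.

```c
#include <assert.h>

#define SCHEDOP_block 1

/* Stubs standing in for the real hypercalls; they record what was
 * requested rather than entering the hypervisor. */
static unsigned long long pending_timeout;
static int blocked;

static int set_timer_op(unsigned long long timeout_ns)
{ pending_timeout = timeout_ns; return 0; }

static int sched_op(int cmd, unsigned long arg)
{ (void)arg; if (cmd == SCHEDOP_block) blocked = 1; return 0; }

/* Block until an event arrives or the given deadline (ns since boot)
 * passes: arm the one-shot timer first, then block. */
static void block_until(unsigned long long deadline_ns)
{
    set_timer_op(deadline_ns);
    sched_op(SCHEDOP_block, 0);
}
```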
\section{Page Table Management}

Since guest operating systems have read-only access to their page
tables, Xen must be involved when making any changes. The following
multi-purpose hypercall can be used to modify page-table entries,
update the machine-to-physical mapping table, flush the TLB, install
a new page-table base pointer, and more.

\begin{quote}
\hypercall{mmu\_update(mmu\_update\_t *req, int count, int *success\_count)}

Update the page table for the domain; a set of {\bf count} updates are
submitted for processing in a batch, with {\bf success\_count} being
updated to report the number of successful updates.

Each element of {\bf req[]} contains a pointer (address) and value;
the least significant 2 bits of the pointer are used to distinguish
the type of update requested as follows:
\begin{description}

\item[MMU\_NORMAL\_PT\_UPDATE:] update a page directory entry or
page table entry to the associated value; Xen will check that the
update is safe, as described in Chapter~\ref{c:memory}.

\item[MMU\_MACHPHYS\_UPDATE:] update an entry in the
machine-to-physical table. The calling domain must own the machine
page in question (or be privileged).
\end{description}

\end{quote}

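The encoding of the request array can be sketched as follows. The type values (0 and 1) are assumed to match those in {\tt xen/include/public/xen.h}; the helpers mirror how a guest would build a batch entry and how the hypervisor splits the pointer back apart.

```c
#include <assert.h>
#include <stdint.h>

/* Update types, encoded in the low 2 bits of the request pointer
 * (values assumed from the Xen public headers). */
#define MMU_NORMAL_PT_UPDATE  0
#define MMU_MACHPHYS_UPDATE   1

typedef struct { uint64_t ptr; uint64_t val; } mmu_update_t;

/* Build one batched request: machine address of the PTE (or
 * machine-to-physical entry) with the update type folded into the
 * otherwise-unused low bits. */
static mmu_update_t make_update(uint64_t machine_addr, int type,
                                uint64_t new_val)
{
    mmu_update_t u;
    u.ptr = (machine_addr & ~3ULL) | (uint64_t)type;
    u.val = new_val;
    return u;
}

/* The receiving side's view: split pointer and type back apart. */
static int update_type(const mmu_update_t *u) { return (int)(u->ptr & 3); }
static uint64_t update_addr(const mmu_update_t *u) { return u->ptr & ~3ULL; }
```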
Explicitly updating batches of page table entries is extremely
efficient, but can require a number of alterations to the guest
OS. Using the writable page table mode (Chapter~\ref{c:memory}) is
recommended for new OS ports.

Regardless of which page table update mode is being used, however,
there are some occasions (notably handling a demand page fault) where
a guest OS will wish to modify exactly one PTE rather than a
batch, and where that PTE is mapped into the current address space.
This is catered for by the following:

\begin{quote}
\hypercall{update\_va\_mapping(unsigned long va, uint64\_t val,
unsigned long flags)}

Update the currently installed PTE that maps virtual address {\bf va}
to new value {\bf val}. As with {\bf mmu\_update}, Xen checks the
modification is safe before applying it. The {\bf flags} determine
which kind of TLB flush, if any, should follow the update.

\end{quote}

Finally, sufficiently privileged domains may occasionally wish to manipulate
the pages of others:

\begin{quote}
\hypercall{update\_va\_mapping\_otherdomain(unsigned long va, uint64\_t val,
unsigned long flags, domid\_t domid)}

Identical to {\bf update\_va\_mapping} save that the pages being
mapped must belong to the domain {\bf domid}.

\end{quote}

An additional MMU hypercall provides an ``extended command''
interface. This provides additional functionality beyond the basic
table updating commands:

\begin{quote}

\hypercall{mmuext\_op(struct mmuext\_op *op, int count, int *success\_count, domid\_t domid)}

This hypercall is used to perform additional MMU operations. These
include updating {\tt cr3} (or just re-installing it for a TLB flush),
requesting various kinds of TLB flush, flushing the cache, installing
a new LDT, or pinning \& unpinning page-table pages (to ensure their
reference count doesn't drop to zero, which would require a
revalidation of all entries). Some of the operations available are
restricted to domains with sufficient system privileges.

It is also possible for privileged domains to reassign page ownership
via an extended MMU operation, although grant tables are used instead
of this where possible; see Section~\ref{s:idc}.

\end{quote}

Finally, a hypercall interface is exposed to activate and deactivate
various optional facilities provided by Xen for memory management.

\begin{quote}
\hypercall{vm\_assist(unsigned int cmd, unsigned int type)}

Toggle various memory management modes (in particular writable page
tables).

\end{quote}

\section{Segmentation Support}

Xen allows guest OSes to install a custom GDT if they require it;
this is context switched transparently whenever a domain is
[de]scheduled. The following hypercall is effectively a
`safe' version of {\tt lgdt}:

\begin{quote}
\hypercall{set\_gdt(unsigned long *frame\_list, int entries)}

Install a global descriptor table for a domain; {\bf frame\_list} is
an array of up to 16 machine page frames within which the GDT resides,
with {\bf entries} being the actual number of descriptor-entry
slots. All page frames must be mapped read-only within the guest's
address space, and the table must be large enough to contain Xen's
reserved entries (see {\bf xen/include/public/arch-x86\_32.h}).

\end{quote}

Many guest OSes will also wish to install LDTs; this is achieved by
using {\bf mmuext\_op} with an extended command, passing the
linear address of the LDT base along with the number of entries. No
special safety checks are required; Xen needs to perform this task
simply because {\tt lldt} requires CPL 0.

Xen also allows guest operating systems to update just an
individual segment descriptor in the GDT or LDT:

\begin{quote}
\hypercall{update\_descriptor(uint64\_t ma, uint64\_t desc)}

Update the GDT/LDT entry at machine address {\bf ma}; the new
8-byte descriptor is stored in {\bf desc}.
Xen performs a number of checks to ensure the descriptor is
valid.

\end{quote}

Guest OSes can use the above in place of context switching entire
LDTs (or the GDT) when the number of changing descriptors is small.

\section{Context Switching}

When a guest OS wishes to context switch between two processes,
it can use the page table and segmentation hypercalls described
above to perform the bulk of the privileged work. In addition,
however, it will need to invoke Xen to switch the kernel (ring 1)
stack pointer:

\begin{quote}
\hypercall{stack\_switch(unsigned long ss, unsigned long esp)}

Request a kernel stack switch from the hypervisor; {\bf ss} is the new
stack segment and {\bf esp} is the new stack pointer.

\end{quote}

A useful hypercall for context switching allows ``lazy'' save and
restore of floating point state:

\begin{quote}
\hypercall{fpu\_taskswitch(int set)}

This call instructs Xen to set (or clear, according to {\bf set}) the
{\tt TS} bit in the {\tt cr0} control register; when the bit is set,
the next attempt to use floating point will cause a fault which the
guest OS can catch. Typically it will then save/restore the FP state,
and clear the {\tt TS} bit, using the same call.
\end{quote}

This is provided as an optimization only; guest OSes can also choose
to save and restore FP state on all context switches for simplicity.

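The lazy save/restore pattern can be sketched as follows. The hypercall is stubbed (it just tracks a virtual {\tt TS} bit), and the handler name is illustrative; the point is that FP state is only reloaded when a task actually faults on an FP instruction, not on every context switch.

```c
#include <assert.h>

/* Stub for the hypercall: tracks the TS bit in a virtual cr0. */
static int ts_bit;
static void fpu_taskswitch(int set) { ts_bit = set; }

static int fpu_dirty;   /* does the FPU still hold old state? */
static int restores;    /* how many times we actually reloaded */

/* On a context switch, don't touch the FPU: just ask Xen to set TS
 * so the next FP instruction faults. */
static void context_switch(void) { fpu_taskswitch(1); fpu_dirty = 1; }

/* Device-not-available handler: clear TS and (re)load FP state only
 * now that some task has really used the FPU. */
static void do_device_not_available(void)
{
    fpu_taskswitch(0);
    if (fpu_dirty) { restores++; fpu_dirty = 0; }
}
```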
Finally, a hypercall is provided for entering vm86 mode:

\begin{quote}
\hypercall{switch\_vm86}

This allows the guest to run code in vm86 mode, which is needed for
some legacy software.
\end{quote}

\section{Physical Memory Management}

As mentioned previously, each domain has a maximum and current
memory allocation. The maximum allocation, set at domain creation
time, cannot be modified. However, a domain can choose to reduce
and subsequently grow its current allocation by using the
following call:

\begin{quote}
\hypercall{memory\_op(unsigned int op, void *arg)}

Increase or decrease current memory allocation (as determined by
the value of {\bf op}). The available operations are:

\begin{description}
\item[XENMEM\_increase\_reservation] Request an increase in machine
memory allocation; {\bf arg} must point to a {\bf
xen\_memory\_reservation} structure.
\item[XENMEM\_decrease\_reservation] Request a decrease in machine
memory allocation; {\bf arg} must point to a {\bf
xen\_memory\_reservation} structure.
\item[XENMEM\_maximum\_ram\_page] Request the frame number of the
highest-addressed frame of machine memory in the system. {\bf arg}
must point to an {\bf unsigned long} where this value will be
stored.
\item[XENMEM\_current\_reservation] Returns the current memory reservation
of the specified domain.
\item[XENMEM\_maximum\_reservation] Returns the maximum memory reservation
of the specified domain.
\end{description}

\end{quote}

In addition to simply reducing or increasing the current memory
allocation via a `balloon driver', this call is also useful for
obtaining contiguous regions of machine memory when required (e.g.\
for certain PCI devices, or if using superpages).

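A balloon-driver style release of pages can be sketched as below. The structure fields and the operation value are assumptions based on the Xen 3.0 public headers (the real definitions live in {\tt xen/include/public/memory.h}), and the hypercall is stubbed so the fragment is self-contained.

```c
#include <assert.h>
#include <stdint.h>

/* Operation value and field names assumed from the public headers. */
#define XENMEM_decrease_reservation 1

typedef struct {
    unsigned long *extent_start;  /* array of page frame numbers     */
    unsigned long  nr_extents;    /* number of extents in the array  */
    unsigned int   extent_order;  /* log2 pages per extent (0 = 4kB) */
    uint16_t       domid;
} xen_memory_reservation_t;

/* Stub for the hypercall: returns the number of extents "released". */
static unsigned long memory_op(unsigned int op, void *arg)
{
    xen_memory_reservation_t *r = arg;
    return op == XENMEM_decrease_reservation ? r->nr_extents : 0;
}

/* Balloon-driver style release: hand a batch of our own frames back
 * to Xen, shrinking the current reservation. */
static unsigned long balloon_release(unsigned long *frames,
                                     unsigned long n, uint16_t domid)
{
    xen_memory_reservation_t r = { frames, n, 0, domid };
    return memory_op(XENMEM_decrease_reservation, &r);
}
```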
\section{Inter-Domain Communication}
\label{s:idc}

Xen provides a simple asynchronous notification mechanism via
\emph{event channels}. Each domain has a set of end-points (or
\emph{ports}) which may be bound to an event source (e.g.\ a physical
IRQ, a virtual IRQ, or a port in another domain). When a pair of
end-points in two different domains are bound together, a `send'
operation on one will cause an event to be received by the destination
domain.

The control and use of event channels involves the following hypercall:

\begin{quote}
\hypercall{event\_channel\_op(evtchn\_op\_t *op)}

Inter-domain event-channel management; {\bf op} is a discriminated
union which allows the following seven operations:

\begin{description}

\item[alloc\_unbound:] allocate a free (unbound) local
port and prepare for connection from a specified domain.
\item[bind\_virq:] bind a local port to a virtual
IRQ; any particular VIRQ can be bound to at most one port per domain.
\item[bind\_pirq:] bind a local port to a physical IRQ;
once more, a given pIRQ can be bound to at most one port per
domain. Furthermore the calling domain must be sufficiently
privileged.
\item[bind\_interdomain:] construct an interdomain event
channel; in general, the target domain must have previously allocated
an unbound port for this channel, although this can be bypassed by
privileged domains during domain setup.
\item[close:] close an interdomain event channel.
\item[send:] send an event to the remote end of an
interdomain event channel.
\item[status:] determine the current status of a local port.
\end{description}

For more details see
{\bf xen/include/public/event\_channel.h}.

\end{quote}

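The alloc/bind/send handshake can be sketched with a toy model, below. The functions here are illustrative stand-ins for the corresponding {\bf event\_channel\_op} sub-operations (the real interface passes a discriminated union, and the two sides run in different domains); the model simply wires two port numbers together.

```c
#include <assert.h>

/* Toy model of the event-channel operations, standing in for
 * event_channel_op() so the handshake can be shown end to end. */
#define NPORTS 8
static int port_bound[NPORTS];   /* far end a local port is wired to */
static int port_pending[NPORTS]; /* event delivered, not yet handled */
static int next_free = 1;

/* alloc_unbound: reserve a local port for a future connection from
 * the named domain (domid unused in this toy model). */
static int evtchn_alloc_unbound(int remote_domid)
{ (void)remote_domid; return next_free++; }

/* bind_interdomain: connect a fresh local port to the remote one. */
static int evtchn_bind_interdomain(int remote_port)
{
    int local = next_free++;
    port_bound[local] = remote_port;
    port_bound[remote_port] = local;
    return local;
}

/* send: raise the event at whatever the far end of this port is. */
static void evtchn_send(int port) { port_pending[port_bound[port]] = 1; }
```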
Event channels are the fundamental communication primitive between
Xen domains and seamlessly support SMP. However they provide little
bandwidth for communication {\sl per se}, and hence are typically
married with a piece of shared memory to produce effective and
high-performance inter-domain communication.

Safe sharing of memory pages between guest OSes is carried out by
granting access on a per-page basis to individual domains. This is
achieved by using the {\tt grant\_table\_op} hypercall:

\begin{quote}
\hypercall{grant\_table\_op(unsigned int cmd, void *uop, unsigned int count)}

Used to invoke operations on a grant reference, to set up the grant
table, and to dump the table's contents for debugging.

\end{quote}

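A grant offer itself is made by writing an entry in the shared grant table, not by a hypercall. A sketch of that entry and of making a page accessible to another domain follows; the structure layout and flag values are assumptions based on {\tt xen/include/public/grant\_table.h}.

```c
#include <assert.h>
#include <stdint.h>

/* Entry layout and flag values assumed from the public headers. */
#define GTF_permit_access 1        /* grant maps the page            */
#define GTF_readonly      (1 << 2) /* disallow writable mappings     */

typedef struct {
    uint16_t flags;   /* type of grant plus permission bits */
    uint16_t domid;   /* domain being granted access        */
    uint32_t frame;   /* machine frame number being shared  */
} grant_entry_t;

/* Offer one of our frames to another domain.  Writing flags last
 * (after domid/frame) is what makes the entry go live; a real driver
 * would issue a write barrier before that final store. */
static void gnttab_grant_access(grant_entry_t *e, uint16_t domid,
                                uint32_t frame, int readonly)
{
    e->domid = domid;
    e->frame = frame;
    e->flags = (uint16_t)(GTF_permit_access | (readonly ? GTF_readonly : 0));
}
```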
\section{IO Configuration}

Domains with physical device access (i.e.\ driver domains) receive
limited access to certain PCI devices (bus address space and
interrupts). However, many guest operating systems attempt to
determine the PCI configuration by directly accessing the PCI BIOS,
which cannot be allowed for safety reasons.

Instead, Xen provides the following hypercall:

\begin{quote}
\hypercall{physdev\_op(void *physdev\_op)}

Set and query IRQ configuration details, set the system IOPL, and set
the TSS IO bitmap.

\end{quote}


For examples of using {\tt physdev\_op}, see the
Xen-specific PCI code in the Linux sparse tree.

\section{Administrative Operations}
\label{s:dom0ops}

A large number of control operations are available to a sufficiently
privileged domain (typically domain 0). These allow the creation and
management of new domains, for example. A complete list is given
below; for more details on any or all of these, please see
{\tt xen/include/public/dom0\_ops.h}.


\begin{quote}
\hypercall{dom0\_op(dom0\_op\_t *op)}

Administrative domain operations for domain management. The options are:

\begin{description}
\item [DOM0\_GETMEMLIST:] get list of pages used by the domain

\item [DOM0\_SCHEDCTL:] set scheduler parameters

\item [DOM0\_ADJUSTDOM:] adjust scheduling priorities for domain

\item [DOM0\_CREATEDOMAIN:] create a new domain

\item [DOM0\_DESTROYDOMAIN:] deallocate all resources associated
with a domain

\item [DOM0\_PAUSEDOMAIN:] remove a domain from the scheduler run
queue

\item [DOM0\_UNPAUSEDOMAIN:] mark a paused domain as schedulable
once again

\item [DOM0\_GETDOMAININFO:] get statistics about the domain

\item [DOM0\_SETDOMAININFO:] set VCPU-related attributes

\item [DOM0\_MSR:] read or write model-specific registers

\item [DOM0\_DEBUG:] interactively invoke the debugger

\item [DOM0\_SETTIME:] set system time

\item [DOM0\_GETPAGEFRAMEINFO:] get information about a page frame

\item [DOM0\_READCONSOLE:] read console content from hypervisor buffer ring

\item [DOM0\_PINCPUDOMAIN:] pin domain to a particular CPU

\item [DOM0\_TBUFCONTROL:] get and set trace buffer attributes

\item [DOM0\_PHYSINFO:] get information about the host machine

\item [DOM0\_SCHED\_ID:] get the ID of the current Xen scheduler

\item [DOM0\_SHADOW\_CONTROL:] switch between shadow page-table modes

\item [DOM0\_SETDOMAINMAXMEM:] set maximum memory allocation of a domain

\item [DOM0\_GETPAGEFRAMEINFO2:] batched interface for getting
page frame info

\item [DOM0\_ADD\_MEMTYPE:] set MTRRs

\item [DOM0\_DEL\_MEMTYPE:] remove a memory type range

\item [DOM0\_READ\_MEMTYPE:] read MTRR

\item [DOM0\_PERFCCONTROL:] control Xen's software performance
counters

\item [DOM0\_MICROCODE:] update CPU microcode

\item [DOM0\_IOPORT\_PERMISSION:] modify domain permissions for an
IO port range (enable / disable a range for a particular domain)

\item [DOM0\_GETVCPUCONTEXT:] get context from a VCPU

\item [DOM0\_GETVCPUINFO:] get current state for a VCPU

\item [DOM0\_GETDOMAININFOLIST:] batched interface to get domain
info

\item [DOM0\_PLATFORM\_QUIRK:] inform Xen of a platform quirk it
needs to handle (e.g.\ noirqbalance)

\item [DOM0\_PHYSICAL\_MEMORY\_MAP:] get info about dom0's memory
map

\item [DOM0\_MAX\_VCPUS:] change max number of VCPUs for a domain

\item [DOM0\_SETDOMAINHANDLE:] set the handle for a domain

\end{description}
\end{quote}

Most of the above are best understood by looking at the code
implementing them (in {\tt xen/common/dom0\_ops.c}) and in
the user-space tools that use them (mostly in {\tt tools/libxc}).

\section{Access Control Module Hypercalls}
\label{s:acmops}

Hypercalls relating to the management of the Access Control Module are
also restricted to domain 0 access for now. For more details on any or
all of these, please see {\tt xen/include/public/acm\_ops.h}. A
complete list is given below:

\begin{quote}

\hypercall{acm\_op(int cmd, void *args)}

This hypercall can be used to configure the state of the ACM, query
that state, request access control decisions, and dump additional
information.

\begin{description}

\item [ACMOP\_SETPOLICY:] set the access control policy

\item [ACMOP\_GETPOLICY:] get the current access control policy and
status

\item [ACMOP\_DUMPSTATS:] get current access control hook invocation
statistics

\item [ACMOP\_GETSSID:] get security access control information for a
domain

\item [ACMOP\_GETDECISION:] get access decision based on the currently
enforced access control policy

\end{description}
\end{quote}

Most of the above are best understood by looking at the code
implementing them (in {\tt xen/common/acm\_ops.c}) and in the
user-space tools that use them (mostly in {\tt tools/security} and
{\tt tools/python/xen/lowlevel/acm}).


\section{Debugging Hypercalls}

A few additional hypercalls are mainly useful for debugging:

\begin{quote}
\hypercall{console\_io(int cmd, int count, char *str)}

Use Xen to interact with the console; operations are:

\begin{description}
\item[CONSOLEIO\_write:] output {\bf count} characters from buffer {\bf str}.
\item[CONSOLEIO\_read:] input at most {\bf count} characters into buffer {\bf str}.
\end{description}
\end{quote}

A pair of hypercalls allows access to the underlying debug registers:
\begin{quote}
\hypercall{set\_debugreg(int reg, unsigned long value)}

Set debug register {\bf reg} to {\bf value}.

\hypercall{get\_debugreg(int reg)}

Return the contents of the debug register {\bf reg}.
\end{quote}


And finally:
\begin{quote}
\hypercall{xen\_version(int cmd)}

Request Xen version number.
\end{quote}

This is useful to ensure that user-space tools are in sync
with the underlying hypervisor.

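For the basic version command, the hypervisor is generally understood to return the major and minor revision packed into a single word ({\tt xen/include/public/version.h} has the definitive encoding; the helpers below assume major in the upper 16 bits, minor in the lower):

```c
#include <assert.h>

/* Decode the packed version word: major in bits 31:16, minor in
 * bits 15:0 (encoding assumed from the public version headers). */
static int xen_major(unsigned long v) { return (int)((v >> 16) & 0xffff); }
static int xen_minor(unsigned long v) { return (int)(v & 0xffff); }

/* A user-space tool can refuse to run against an unexpected
 * hypervisor revision: */
static int version_ok(unsigned long v, int want_major)
{ return xen_major(v) == want_major; }
```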


\end{document}