1\documentclass[11pt,twoside,final,openright]{report}
2\usepackage{a4,graphicx,html,setspace,times}
3\usepackage{comment,parskip}
4\setstretch{1.15}
5
6% LIBRARY FUNCTIONS
7
8\newcommand{\hypercall}[1]{\vspace{2mm}{\sf #1}}
9
10\begin{document}
11
12% TITLE PAGE
13\pagestyle{empty}
14\begin{center}
15\vspace*{\fill}
16\includegraphics{figs/xenlogo.eps}
17\vfill
18\vfill
19\vfill
20\begin{tabular}{l}
21{\Huge \bf Interface manual} \\[4mm]
22{\huge Xen v3.0 for x86} \\[80mm]
23
24{\Large Xen is Copyright (c) 2002-2005, The Xen Team} \\[3mm]
25{\Large University of Cambridge, UK} \\[20mm]
26\end{tabular}
27\end{center}
28
29{\bf DISCLAIMER: This documentation is always under active development
30and as such there may be mistakes and omissions --- watch out for
31these and please report any you find to the developer's mailing list.
32The latest version is always available on-line.  Contributions of
33material, suggestions and corrections are welcome.  }
34
35\vfill
36\cleardoublepage
37
38% TABLE OF CONTENTS
39\pagestyle{plain}
40\pagenumbering{roman}
41{ \parskip 0pt plus 1pt
42  \tableofcontents }
43\cleardoublepage
44
45% PREPARE FOR MAIN TEXT
46\pagenumbering{arabic}
47\raggedbottom
48\widowpenalty=10000
49\clubpenalty=10000
50\parindent=0pt
51\parskip=5pt
52\renewcommand{\topfraction}{.8}
53\renewcommand{\bottomfraction}{.8}
54\renewcommand{\textfraction}{.2}
55\renewcommand{\floatpagefraction}{.8}
56\setstretch{1.1}
57
58\chapter{Introduction}
59
60Xen allows the hardware resources of a machine to be virtualized and
61dynamically partitioned, allowing multiple different {\em guest}
62operating system images to be run simultaneously.  Virtualizing the
63machine in this manner provides considerable flexibility, for example
64allowing different users to choose their preferred operating system
65(e.g., Linux, NetBSD, or a custom operating system).  Furthermore, Xen
66provides secure partitioning between virtual machines (known as
67{\em domains} in Xen terminology), and enables better resource
68accounting and QoS isolation than can be achieved with a conventional
69operating system.
70
71Xen essentially takes a `whole machine' virtualization approach as
72pioneered by IBM VM/370.  However, unlike VM/370 or more recent
73efforts such as VMware and Virtual PC, Xen does not attempt to
74completely virtualize the underlying hardware.  Instead parts of the
75hosted guest operating systems are modified to work with the VMM; the
76operating system is effectively ported to a new target architecture,
77typically requiring changes in just the machine-dependent code.  The
78user-level API is unchanged, and so existing binaries and operating
79system distributions work without modification.
80
81In addition to exporting virtualized instances of CPU, memory, network
82and block devices, Xen exposes a control interface to manage how these
83resources are shared between the running domains. Access to the
84control interface is restricted: it may only be used by one
85specially-privileged VM, known as {\em domain 0}.  This domain is a
86required part of any Xen-based server and runs the application software
87that manages the control-plane aspects of the platform.  Running the
88control software in {\it domain 0}, distinct from the hypervisor
89itself, allows the Xen framework to separate the notions of
90mechanism and policy within the system.
91
92
93\chapter{Virtual Architecture}
94
95In a Xen/x86 system, only the hypervisor runs with full processor
96privileges ({\it ring 0} in the x86 four-ring model). It has full
97access to the physical memory available in the system and is
98responsible for allocating portions of it to running domains. 
99
100On a 32-bit x86 system, guest operating systems may use {\it rings 1},
101{\it 2} and {\it 3} as they see fit.  Segmentation is used to prevent
102the guest OS from accessing the portion of the address space that is
103reserved for Xen.  We expect most guest operating systems will use
104ring 1 for their own operation and place applications in ring 3.
105
106On 64-bit systems it is not possible to protect the hypervisor from
107untrusted guest code running in rings 1 and 2. Guests are therefore
108restricted to run in ring 3 only. The guest kernel is protected from its
109applications by context switching between the kernel and the currently
110running application.
111
112In this chapter we consider the basic virtual architecture provided by
113Xen: CPU state, exception and interrupt handling, and time.
114Other aspects such as memory and device access are discussed in later
115chapters.
116
117
118\section{CPU state}
119
120All privileged state must be handled by Xen.  The guest OS has no
121direct access to CR3 and is not permitted to update privileged bits in
122EFLAGS. Guest OSes use \emph{hypercalls} to invoke operations in Xen;
123these are analogous to system calls but occur from ring 1 to ring 0.
124
125A list of all hypercalls is given in Appendix~\ref{a:hypercalls}.
126
127
128\section{Exceptions}
129
130A virtual IDT is provided --- a domain can submit a table of trap
131handlers to Xen via the {\bf set\_trap\_table} hypercall.  The
132exception stack frame presented to a virtual trap handler is identical
133to its native equivalent.
134
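The following sketch is illustrative rather than normative: it shows how a
guest might populate and register such a table.  The {\tt trap\_info}
layout and the {\tt FLAT\_RING1\_CS} selector follow the public headers,
while the handler symbols and the {\tt HYPERVISOR\_set\_trap\_table()}
wrapper are assumed to be provided by the guest's own Xen support code.

\scriptsize
\begin{verbatim}
/* Sketch: registering a virtual IDT.  trap_info fields are
 * { vector, flags (DPL), cs, address }; a zeroed entry ends the table. */
static trap_info_t trap_table[] = {
    { 14, 0, FLAT_RING1_CS, (unsigned long)page_fault_handler },
    { 13, 0, FLAT_RING1_CS, (unsigned long)gp_fault_handler   },
    {  3, 3, FLAT_RING1_CS, (unsigned long)int3_handler       }, /* DPL 3 */
    {  0, 0, 0, 0 }
};

static void install_trap_table(void)
{
    HYPERVISOR_set_trap_table(trap_table);
}
\end{verbatim}
\normalsize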
135
136\section{Interrupts and events}
137
138Interrupts are virtualized by mapping them to \emph{event channels},
139which are delivered asynchronously to the target domain using a callback
140supplied via the {\bf set\_callbacks} hypercall.  A guest OS can map
141these events onto its standard interrupt dispatch mechanisms.  Xen is
142responsible for determining the target domain that will handle each
143physical interrupt source. For more details on the binding of event
144sources to event channels, see Chapter~\ref{c:devices}.
145
146
147\section{Time}
148
149Guest operating systems need to be aware of the passage of both real
150(or wallclock) time and their own `virtual time' (the time for which
151they have been executing). Furthermore, Xen has a notion of time which
152is used for scheduling. The following notions of time are provided:
153
154\begin{description}
155\item[Cycle counter time.]
156
157  This provides a fine-grained time reference.  The cycle counter time
158  is used to accurately extrapolate the other time references.  On SMP
159  machines it is currently assumed that the cycle counter time is
160  synchronized between CPUs.  The current x86-based implementation
161  achieves this within inter-CPU communication latencies.
162
163\item[System time.]
164
165  This is a 64-bit counter which holds the number of nanoseconds that
166  have elapsed since system boot.
167
168\item[Wall clock time.]
169
170  This is the time of day in a Unix-style {\bf struct timeval}
171  (seconds and microseconds since 1 January 1970, adjusted by leap
172  seconds).  An NTP client hosted by {\it domain 0} can keep this
173  value accurate.
174
175\item[Domain virtual time.]
176
177  This progresses at the same pace as system time, but only while a
178  domain is executing --- it stops while a domain is de-scheduled.
179  Therefore the share of the CPU that a domain receives is indicated
180  by the rate at which its virtual time increases.
181
182\end{description}
183
184
185Xen exports timestamps for system time and wall-clock time to guest
186operating systems through a shared page of memory.  Xen also provides
187the cycle counter time at the instant the timestamps were calculated,
188and the CPU frequency in Hertz.  This allows the guest to extrapolate
189system and wall-clock times accurately based on the current cycle
190counter time.
191
192Since all time stamps need to be updated and read \emph{atomically},
193a version number is also stored in the shared info page; it is
194incremented before and after updating the timestamps. Thus a guest can
195be sure that it has read a consistent state by checking that the two
196version numbers it read are equal and even.
197
198Xen includes a periodic ticker which sends a timer event to the
199currently executing domain every 10ms.  The Xen scheduler also sends a
200timer event whenever a domain is scheduled; this allows the guest OS
201to adjust for the time that has passed while it has been inactive.  In
202addition, Xen allows each domain to request that it receive a timer
203event sent at a specified system time by using the {\bf
204  set\_timer\_op} hypercall.  Guest OSes may use this timer to
205implement timeout values when they block.
206
207
208\section{Xen CPU Scheduling}
209
210Xen offers a uniform API for CPU schedulers.  It is possible to choose
211from a number of schedulers at boot and it should be easy to add more.
212The SEDF and Credit schedulers are part of the normal Xen
213distribution.  SEDF is scheduled for removal and its use should be
214avoided once the Credit scheduler has stabilized and become the default.
215The Credit scheduler provides proportional fair shares of the
216host's CPUs to the running domains. It does this while transparently
217load balancing runnable VCPUs across the whole system.
218
219\paragraph*{Note: SMP host support}
220Xen has always supported SMP host systems. When using the credit scheduler,
221a domain's VCPUs will be dynamically moved across physical CPUs to maximise
222domain and system throughput. VCPUs can also be manually restricted to run
223only on a subset of the host's physical CPUs, using the pinning
224mechanism.
225
226
227%% More information on the characteristics and use of these schedulers
228%% is available in {\bf Sched-HOWTO.txt}.
229
230
231\section{Privileged operations}
232
233Xen exports an extended interface to privileged domains (viz.\ {\it
234  Domain 0}). This allows such domains to build and boot other domains
235on the server, and provides control interfaces for managing
236scheduling, memory, networking, and block devices.
237
238\chapter{Memory}
239\label{c:memory} 
240
241Xen is responsible for managing the allocation of physical memory to
242domains, and for ensuring safe use of the paging and segmentation
243hardware.
244
245
246\section{Memory Allocation}
247
248As well as allocating a portion of physical memory for its own private
249use, Xen also reserves a small fixed portion of every virtual address
250space. This is located in the top 64MB on 32-bit systems, the top
251168MB on PAE systems, and a larger portion in the middle of the
252address space on 64-bit systems. Unreserved physical memory is
253available for allocation to domains at a page granularity.  Xen tracks
254the ownership and use of each page, which allows it to enforce secure
255partitioning between domains.
256
257Each domain has a maximum and current physical memory allocation.  A
258guest OS may run a `balloon driver' to dynamically adjust its current
259memory allocation up to its limit.
260
261
262\section{Pseudo-Physical Memory}
263
264Since physical memory is allocated and freed on a page granularity,
265there is no guarantee that a domain will receive a contiguous stretch
266of physical memory. However, most operating systems do not have good
267support for operating in a fragmented physical address space. To aid
268porting such operating systems to run on top of Xen, we make a
269distinction between \emph{machine memory} and \emph{pseudo-physical
270  memory}.
271
272Put simply, machine memory refers to the entire amount of memory
273installed in the machine, including that reserved by Xen, in use by
274various domains, or currently unallocated. We consider machine memory
275to comprise a set of 4kB \emph{machine page frames} numbered
276consecutively starting from 0. Machine frame numbers mean the same
277within Xen or any domain.
278
279Pseudo-physical memory, on the other hand, is a per-domain
280abstraction. It allows a guest operating system to consider its memory
281allocation to consist of a contiguous range of physical page frames
282starting at physical frame 0, despite the fact that the underlying
283machine page frames may be sparsely allocated and in any order.
284
285To achieve this, Xen maintains a globally readable {\it
286  machine-to-physical} table which records the mapping from machine
287page frames to pseudo-physical ones. In addition, each domain is
288supplied with a {\it physical-to-machine} table which performs the
289inverse mapping. Clearly the machine-to-physical table has size
290proportional to the amount of RAM installed in the machine, while each
291physical-to-machine table has size proportional to the memory
292allocation of the given domain.
293
294Architecture dependent code in guest operating systems can then use
295the two tables to provide the abstraction of pseudo-physical memory.
296In general, only certain specialized parts of the operating system
297(such as page table management) need to understand the difference
298between machine and pseudo-physical addresses.
299
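For illustration only, the sketch below shows how a guest's
architecture-dependent code might convert between the two address spaces.
{\tt machine\_to\_phys\_mapping} is the globally readable table described
above (mapped by Xen at a fixed virtual address), while
{\tt phys\_to\_machine\_mapping} and {\tt PAGE\_SHIFT} are assumed to be
set up by the guest at start of day.

\scriptsize
\begin{verbatim}
/* Sketch: pseudo-physical <-> machine frame translation. */
extern unsigned long *machine_to_phys_mapping;   /* global M2P, mapped by Xen */
static unsigned long *phys_to_machine_mapping;   /* per-domain P2M table */

static inline unsigned long pfn_to_mfn(unsigned long pfn)
{
    return phys_to_machine_mapping[pfn];
}

static inline unsigned long mfn_to_pfn(unsigned long mfn)
{
    return machine_to_phys_mapping[mfn];
}

/* Page-table entries must contain machine addresses, not
 * pseudo-physical ones: */
static inline unsigned long make_pte(unsigned long pfn, unsigned long prot)
{
    return (pfn_to_mfn(pfn) << PAGE_SHIFT) | prot;
}
\end{verbatim}
\normalsize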
300
301\section{Page Table Updates}
302
303In the default mode of operation, Xen enforces read-only access to
304page tables and requires guest operating systems to explicitly request
305any modifications.  Xen validates all such requests and only applies
306updates that it deems safe.  This is necessary to prevent domains from
307adding arbitrary mappings to their page tables.
308
309To aid validation, Xen associates a type and reference count with each
310memory page. A page has one of the following mutually-exclusive types
311at any point in time: page directory ({\sf PD}), page table ({\sf
312  PT}), local descriptor table ({\sf LDT}), global descriptor table
313({\sf GDT}), or writable ({\sf RW}). Note that a guest OS may always
314create readable mappings of its own memory regardless of its current
315type.
316
317%%% XXX: possibly explain more about ref count 'lifecyle' here?
318This mechanism is used to maintain the invariants required for safety;
319for example, a domain cannot have a writable mapping to any part of a
320page table as this would require the page concerned to simultaneously
321be of types {\sf PT} and {\sf RW}.
322
323\hypercall{mmu\_update(mmu\_update\_t *req, int count, int *success\_count, domid\_t domid)}
324
325This hypercall is used to make updates to either the domain's
326pagetables or to the machine-to-physical mapping table.  It supports
327submitting a queue of updates, allowing batching for maximal
328performance.  Explicitly queuing updates using this interface will
329cause any outstanding writable pagetable state to be flushed from the
330system.
331
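As an illustrative sketch, the fragment below batches two page-table
writes into a single {\bf mmu\_update} call.  {\tt MMU\_NORMAL\_PT\_UPDATE}
and {\tt DOMID\_SELF} come from the public headers; {\tt
pte\_machine\_addr()}, the new PTE values and the error handling are
assumed to be supplied by the guest.

\scriptsize
\begin{verbatim}
/* Sketch: batching page-table writes through mmu_update. */
static void update_two_ptes(unsigned long va0, uint64_t new_pte0,
                            unsigned long va1, uint64_t new_pte1)
{
    mmu_update_t req[2];
    int done = 0;

    req[0].ptr = pte_machine_addr(va0) | MMU_NORMAL_PT_UPDATE;
    req[0].val = new_pte0;
    req[1].ptr = pte_machine_addr(va1) | MMU_NORMAL_PT_UPDATE;
    req[1].val = new_pte1;

    if (HYPERVISOR_mmu_update(req, 2, &done, DOMID_SELF) < 0 || done != 2)
        handle_update_failure();   /* guest-defined */
}
\end{verbatim}
\normalsize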
332\section{Writable Page Tables}
333
334Xen also provides an alternative mode of operation in which guests
335have the illusion that their page tables are directly writable.  Of
336course this is not really the case, since Xen must still validate
337modifications to ensure secure partitioning. To this end, Xen traps
338any write attempt to a memory page of type {\sf PT} (i.e., that is
339currently part of a page table).  If such an access occurs, Xen
340temporarily allows write access to that page while at the same time
341\emph{disconnecting} it from the page table that is currently in use.
342This allows the guest to safely make updates to the page because the
343newly-updated entries cannot be used by the MMU until Xen revalidates
344and reconnects the page.  Reconnection occurs automatically in a
345number of situations: for example, when the guest modifies a different
346page-table page, when the domain is preempted, or whenever the guest
347uses Xen's explicit page-table update interfaces.
348
349Writable pagetable functionality is enabled when the guest requests
350it, using a {\bf vm\_assist} hypercall.  Writable pagetables do {\em
351not} provide full virtualisation of the MMU, so the memory management
352code of the guest still needs to be aware that it is running on Xen.
353Since the guest's page tables are used directly, it must translate
354pseudo-physical addresses to real machine addresses when building page
355table entries.  The guest may not attempt to map its own pagetables
356writably, since this would violate the memory type invariants; page
357tables will automatically be made writable by the hypervisor, as
358necessary.
359
360\section{Shadow Page Tables}
361
362Finally, Xen also supports a form of \emph{shadow page tables} in
363which the guest OS uses an independent copy of page tables which are
364unknown to the hardware (i.e.\ which are never pointed to by {\tt
365  cr3}). Instead Xen propagates changes made to the guest's tables to
366the real ones, and vice versa. This is useful for logging page writes
367(e.g.\ for live migration or checkpointing). A full version of the shadow
368page tables also allows guest OS porting with less effort.
369
370
371\section{Segment Descriptor Tables}
372
373At start of day a guest is supplied with a default GDT, which does not reside
374within its own memory allocation.  If the guest wishes to use segments
375other than the default `flat' ring-1 and ring-3 segments that this GDT
376provides, it must register a custom GDT and/or LDT with Xen, allocated
377from its own memory.
378
379The following hypercall is used to specify a new GDT:
380
381\begin{quote}
382  int {\bf set\_gdt}(unsigned long *{\em frame\_list}, int {\em
383    entries})
384
385  \emph{frame\_list}: An array of up to 14 machine page frames within
386  which the GDT resides.  Any frame registered as a GDT frame may only
387  be mapped read-only within the guest's address space (e.g., no
388  writable mappings, no use as a page-table page, and so on). Only 14
389  pages may be specified because pages 15 and 16 are reserved for
390  the hypervisor's GDT entries.
391
392  \emph{entries}: The number of descriptor-entry slots in the GDT.
393\end{quote}
394
395The LDT is updated via the generic MMU update mechanism (i.e., via the
396{\bf mmu\_update} hypercall).
397
398\section{Start of Day}
399
400The start-of-day environment for guest operating systems is rather
401different to that provided by the underlying hardware. In particular,
402the processor is already executing in protected mode with paging
403enabled.
404
405{\it Domain 0} is created and booted by Xen itself. For all subsequent
406domains, the analogue of the boot-loader is the {\it domain builder},
407user-space software running in {\it domain 0}. The domain builder is
408responsible for building the initial page tables for a domain and
409loading its kernel image at the appropriate virtual address.
410
411\section{VM assists}
412
413Xen provides a number of ``assists'' for guest memory management.
414These are available on an ``opt-in'' basis to provide commonly-used
415extra functionality to a guest.
416
417\hypercall{vm\_assist(unsigned int cmd, unsigned int type)}
418
419The {\bf cmd} parameter describes the action to be taken, whilst the
420{\bf type} parameter describes the kind of assist that is being
421referred to.  Available commands are as follows:
422
423\begin{description}
424\item[VMASST\_CMD\_enable] Enable a particular assist type
425\item[VMASST\_CMD\_disable] Disable a particular assist type
426\end{description}
427
428And the available types are:
429
430\begin{description}
431\item[VMASST\_TYPE\_4gb\_segments] Provide emulated support for
432  instructions that rely on 4GB segments (such as the techniques used
433  by some TLS solutions).
434\item[VMASST\_TYPE\_4gb\_segments\_notify] Provide a callback to the
435  guest if the above segment fixups are used: allows the guest to
436  display a warning message during boot.
437\item[VMASST\_TYPE\_writable\_pagetables] Enable writable pagetable
438  mode, as described above (see the sketch below).
439\end{description}
440
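As a brief sketch, a guest that wants the writable-pagetable assist might
enable it during early boot as follows; {\tt HYPERVISOR\_vm\_assist()} is
assumed to be the guest's wrapper for the hypercall above, and the
fallback flag is purely illustrative.

\scriptsize
\begin{verbatim}
/* Sketch: opting in to writable pagetables during early boot. */
static int use_explicit_mmu_updates;

static void enable_writable_pagetables(void)
{
    if (HYPERVISOR_vm_assist(VMASST_CMD_enable,
                             VMASST_TYPE_writable_pagetables) != 0)
        use_explicit_mmu_updates = 1;   /* fall back to mmu_update */
}
\end{verbatim}
\normalsize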
441
442\chapter{Xen Info Pages}
443
444The {\bf Shared info page} is used to share various CPU-related state
445between the guest OS and the hypervisor.  This information includes VCPU
446status, time information and event channel (virtual interrupt) state.
447The {\bf Start info page} is used to pass build-time information to
448the guest when it boots and when it is resumed from a suspended state.
449This chapter documents the fields included in the {\bf
450shared\_info\_t} and {\bf start\_info\_t} structures for use by the
451guest OS.
452
453\section{Shared info page}
454
455The {\bf shared\_info\_t} is accessed at run time by both Xen and the
456guest OS.  It is used to pass information relating to the
457virtual CPU and virtual machine state between the OS and the
458hypervisor.
459
460The structure is declared in {\bf xen/include/public/xen.h}:
461
462\scriptsize
463\begin{verbatim}
464typedef struct shared_info {
465    vcpu_info_t vcpu_info[MAX_VIRT_CPUS];
466
467    /*
468     * A domain can create "event channels" on which it can send and receive
469     * asynchronous event notifications. There are three classes of event that
470     * are delivered by this mechanism:
471     *  1. Bi-directional inter- and intra-domain connections. Domains must
472     *     arrange out-of-band to set up a connection (usually by allocating
473     *     an unbound 'listener' port and advertising that via a storage service
474     *     such as xenstore).
475     *  2. Physical interrupts. A domain with suitable hardware-access
476     *     privileges can bind an event-channel port to a physical interrupt
477     *     source.
478     *  3. Virtual interrupts ('events'). A domain can bind an event-channel
479     *     port to a virtual interrupt source, such as the virtual-timer
480     *     device or the emergency console.
481     *
482     * Event channels are addressed by a "port index". Each channel is
483     * associated with two bits of information:
484     *  1. PENDING -- notifies the domain that there is a pending notification
485     *     to be processed. This bit is cleared by the guest.
486     *  2. MASK -- if this bit is clear then a 0->1 transition of PENDING
487     *     will cause an asynchronous upcall to be scheduled. This bit is only
488     *     updated by the guest. It is read-only within Xen. If a channel
489     *     becomes pending while the channel is masked then the 'edge' is lost
490     *     (i.e., when the channel is unmasked, the guest must manually handle
491     *     pending notifications as no upcall will be scheduled by Xen).
492     *
493     * To expedite scanning of pending notifications, any 0->1 pending
494     * transition on an unmasked channel causes a corresponding bit in a
495     * per-vcpu selector word to be set. Each bit in the selector covers a
496     * 'C long' in the PENDING bitfield array.
497     */
498    unsigned long evtchn_pending[sizeof(unsigned long) * 8];
499    unsigned long evtchn_mask[sizeof(unsigned long) * 8];
500
501    /*
502     * Wallclock time: updated only by control software. Guests should base
503     * their gettimeofday() syscall on this wallclock-base value.
504     */
505    uint32_t wc_version;      /* Version counter: see vcpu_time_info_t. */
506    uint32_t wc_sec;          /* Secs  00:00:00 UTC, Jan 1, 1970.  */
507    uint32_t wc_nsec;         /* Nsecs 00:00:00 UTC, Jan 1, 1970.  */
508
509    arch_shared_info_t arch;
510
511} shared_info_t;
512\end{verbatim}
513\normalsize
514
515\begin{description}
516\item[vcpu\_info] An array of {\bf vcpu\_info\_t} structures, each of
517  which either holds runtime information about a virtual CPU or is
518  ``empty'' if the corresponding VCPU does not exist.
519\item[evtchn\_pending] Guest-global array, with one bit per event
520  channel.  Bits are set if an event is currently pending on that
521  channel.
522\item[evtchn\_mask] Guest-global array for masking notifications on
523  event channels.
524\item[wc\_version] Version counter for current wallclock time.
525\item[wc\_sec] Whole seconds component of current wallclock time.
526\item[wc\_nsec] Nanoseconds component of current wallclock time.
527\item[arch] Host architecture-dependent portion of the shared info
528  structure.
529\end{description}
530
531\subsection{vcpu\_info\_t}
532
533\scriptsize
534\begin{verbatim}
535typedef struct vcpu_info {
536    /*
537     * 'evtchn_upcall_pending' is written non-zero by Xen to indicate
538     * a pending notification for a particular VCPU. It is then cleared
539     * by the guest OS /before/ checking for pending work, thus avoiding
540     * a set-and-check race. Note that the mask is only accessed by Xen
541     * on the CPU that is currently hosting the VCPU. This means that the
542     * pending and mask flags can be updated by the guest without special
543     * synchronisation (i.e., no need for the x86 LOCK prefix).
544     * This may seem suboptimal because if the pending flag is set by
545     * a different CPU then an IPI may be scheduled even when the mask
546     * is set. However, note:
547     *  1. The task of 'interrupt holdoff' is covered by the per-event-
548     *     channel mask bits. A 'noisy' event that is continually being
549     *     triggered can be masked at source at this very precise
550     *     granularity.
551     *  2. The main purpose of the per-VCPU mask is therefore to restrict
552     *     reentrant execution: whether for concurrency control, or to
553     *     prevent unbounded stack usage. Whatever the purpose, we expect
554     *     that the mask will be asserted only for short periods at a time,
555     *     and so the likelihood of a 'spurious' IPI is suitably small.
556     * The mask is read before making an event upcall to the guest: a
557     * non-zero mask therefore guarantees that the VCPU will not receive
558     * an upcall activation. The mask is cleared when the VCPU requests
559     * to block: this avoids wakeup-waiting races.
560     */
561    uint8_t evtchn_upcall_pending;
562    uint8_t evtchn_upcall_mask;
563    unsigned long evtchn_pending_sel;
564    arch_vcpu_info_t arch;
565    vcpu_time_info_t time;
566} vcpu_info_t; /* 64 bytes (x86) */
567\end{verbatim}
568\normalsize
569
570\begin{description}
571\item[evtchn\_upcall\_pending] This is set non-zero by Xen to indicate
572  that there are pending events to be received.
573\item[evtchn\_upcall\_mask] This is set non-zero to disable all
574  interrupts for this CPU for short periods of time.  If individual
575  event channels need to be masked, the {\bf evtchn\_mask} in the {\bf
576  shared\_info\_t} is used instead.
577\item[evtchn\_pending\_sel] When an event is delivered to this VCPU, a
578  bit is set in this selector to indicate which word of the {\bf
579  evtchn\_pending} array in the {\bf shared\_info\_t} contains the
580  event in question.
581\item[arch] Architecture-specific VCPU info. On x86 this contains the
582  virtualized CR2 register (page fault linear address) for this VCPU.
583\item[time] Time values for this VCPU.
584\end{description}
585
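To show how these fields fit together, the following sketch gives the
rough shape of a guest's event upcall handler.  {\tt shared} and {\tt
vcpu} are assumed to point at the mapped {\bf shared\_info\_t} and this
CPU's {\bf vcpu\_info\_t}; {\tt xchg()}, {\tt first\_bit()}, {\tt
clear\_bit()}, {\tt BITS\_PER\_LONG} and {\tt do\_event()} are
guest-provided helpers, not part of the Xen interface.

\scriptsize
\begin{verbatim}
/* Sketch: dispatch pending event channels to guest handlers. */
extern shared_info_t *shared;        /* mapped shared info page    */
extern vcpu_info_t   *vcpu;          /* this CPU's vcpu_info entry */

void evtchn_do_upcall(void)
{
    unsigned long sel, pend;
    unsigned int  word, bit, port;

    vcpu->evtchn_upcall_pending = 0;
    sel = xchg(&vcpu->evtchn_pending_sel, 0);  /* atomically take selector */

    while (sel != 0) {
        word = first_bit(sel);
        sel &= ~(1UL << word);
        pend = shared->evtchn_pending[word] & ~shared->evtchn_mask[word];
        while (pend != 0) {
            bit  = first_bit(pend);
            pend &= ~(1UL << bit);
            clear_bit(bit, &shared->evtchn_pending[word]);
            port = (word * BITS_PER_LONG) + bit;
            do_event(port);                    /* guest-specific handler */
        }
    }
}
\end{verbatim}
\normalsize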
586\subsection{vcpu\_time\_info}
587
588\scriptsize
589\begin{verbatim}
590typedef struct vcpu_time_info {
591    /*
592     * Updates to the following values are preceded and followed by an
593     * increment of 'version'. The guest can therefore detect updates by
594     * looking for changes to 'version'. If the least-significant bit of
595     * the version number is set then an update is in progress and the guest
596     * must wait to read a consistent set of values.
597     * The correct way to interact with the version number is similar to
598     * Linux's seqlock: see the implementations of read_seqbegin/read_seqretry.
599     */
600    uint32_t version;
601    uint32_t pad0;
602    uint64_t tsc_timestamp;   /* TSC at last update of time vals.  */
603    uint64_t system_time;     /* Time, in nanosecs, since boot.    */
604    /*
605     * Current system time:
606     *   system_time + ((tsc - tsc_timestamp) << tsc_shift) * tsc_to_system_mul
607     * CPU frequency (Hz):
608     *   ((10^9 << 32) / tsc_to_system_mul) >> tsc_shift
609     */
610    uint32_t tsc_to_system_mul;
611    int8_t   tsc_shift;
612    int8_t   pad1[3];
613} vcpu_time_info_t; /* 32 bytes */
614\end{verbatim}
615\normalsize
616
617\begin{description}
618\item[version] Used to ensure the guest gets consistent time updates.
619\item[tsc\_timestamp] Cycle counter timestamp of last time value;
620  could be used to extrapolate in between updates, for instance.
621\item[system\_time] Time since boot (nanoseconds).
622\item[tsc\_to\_system\_mul] Cycle counter to nanoseconds multiplier
623(used in extrapolating current time).
624\item[tsc\_shift] Cycle counter to nanoseconds shift (used in
625extrapolating current time).
626\end{description}
627
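Putting the version protocol and the scaling factors together, a guest's
clock code might look roughly like the sketch below.  {\tt rdtsc()} and
{\tt rmb()} are assumed guest helpers; a production implementation would
use a wider (96-bit) multiply so that the scaled delta cannot overflow.

\scriptsize
\begin{verbatim}
/* Sketch: extrapolate current system time from a vcpu_time_info_t. */
uint64_t current_system_time(volatile vcpu_time_info_t *t)
{
    uint32_t ver;
    uint64_t delta, time;

    do {
        ver = t->version;
        rmb();                    /* read version before the payload */
        delta = rdtsc() - t->tsc_timestamp;
        if (t->tsc_shift >= 0)
            delta <<= t->tsc_shift;
        else
            delta >>= -t->tsc_shift;
        /* tsc_to_system_mul is a 0.32 fixed-point multiplier */
        time = t->system_time + ((delta * t->tsc_to_system_mul) >> 32);
        rmb();                    /* re-check version after the payload */
    } while ((ver & 1) || (ver != t->version));

    return time;
}
\end{verbatim}
\normalsize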
628\subsection{arch\_shared\_info\_t}
629
630On x86, the {\bf arch\_shared\_info\_t} is defined as follows (from
631xen/include/public/arch-x86\_32.h):
632
633\scriptsize
634\begin{verbatim}
635typedef struct arch_shared_info {
636    unsigned long max_pfn;                  /* max pfn that appears in table */
637    /* Frame containing list of mfns containing list of mfns containing p2m. */
638    unsigned long pfn_to_mfn_frame_list_list;
639} arch_shared_info_t;
640\end{verbatim}
641\normalsize
642
643\begin{description}
644\item[max\_pfn] The maximum PFN listed in the physical-to-machine
645  mapping table (P2M table).
646\item[pfn\_to\_mfn\_frame\_list\_list] Machine address of the frame
647  that contains the machine addresses of the P2M table frames.
648\end{description}
649
650\section{Start info page}
651
652The start info structure is declared as follows (in {\bf
653xen/include/public/xen.h}):
654
655\scriptsize
656\begin{verbatim}
657#define MAX_GUEST_CMDLINE 1024
658typedef struct start_info {
659    /* THE FOLLOWING ARE FILLED IN BOTH ON INITIAL BOOT AND ON RESUME.    */
660    char magic[32];             /* "Xen-<version>.<subversion>". */
661    unsigned long nr_pages;     /* Total pages allocated to this domain.  */
662    unsigned long shared_info;  /* MACHINE address of shared info struct. */
663    uint32_t flags;             /* SIF_xxx flags.                         */
664    unsigned long store_mfn;    /* MACHINE page number of shared page.    */
665    uint32_t store_evtchn;      /* Event channel for store communication. */
666    unsigned long console_mfn;  /* MACHINE address of console page.       */
667    uint32_t console_evtchn;    /* Event channel for console messages.    */
668    /* THE FOLLOWING ARE ONLY FILLED IN ON INITIAL BOOT (NOT RESUME).     */
669    unsigned long pt_base;      /* VIRTUAL address of page directory.     */
670    unsigned long nr_pt_frames; /* Number of bootstrap p.t. frames.       */
671    unsigned long mfn_list;     /* VIRTUAL address of page-frame list.    */
672    unsigned long mod_start;    /* VIRTUAL address of pre-loaded module.  */
673    unsigned long mod_len;      /* Size (bytes) of pre-loaded module.     */
674    int8_t cmd_line[MAX_GUEST_CMDLINE];
675} start_info_t;
676\end{verbatim}
677\normalsize
678
679The fields are in two groups: the first group is always filled in
680when a domain is booted or resumed; the second set is only used at
681boot time.
682
683The always-available group is as follows:
684
685\begin{description}
686\item[magic] A text string identifying the Xen version to the guest.
687\item[nr\_pages] The number of real machine pages available to the
688  guest.
689\item[shared\_info] Machine address of the shared info structure,
690  allowing the guest to map it during initialisation.
691\item[flags] Flags for describing optional extra settings to the
692  guest.
693\item[store\_mfn] Machine address of the Xenstore communications page.
694\item[store\_evtchn] Event channel to communicate with the store.
695\item[console\_mfn] Machine address of the console data page.
696\item[console\_evtchn] Event channel to notify the console backend.
697\end{description}
698
699The boot-only group may only be safely referred to during system boot:
700
701\begin{description}
702\item[pt\_base] Virtual address of the page directory created for us
703  by the domain builder.
704\item[nr\_pt\_frames] Number of frames used by the builder's bootstrap
705  pagetables.
706\item[mfn\_list] Virtual address of the list of machine frames this
707  domain owns.
708\item[mod\_start] Virtual address of any pre-loaded modules
709  (e.g. ramdisk).
710\item[mod\_len] Size of pre-loaded module (if any).
711\item[cmd\_line] Kernel command line passed by the domain builder.
712\end{description}
713
714
715% by Mark Williamson <mark.williamson@cl.cam.ac.uk>
716
717\chapter{Event Channels}
718\label{c:eventchannels}
719
720Event channels are the basic primitive provided by Xen for event
721notifications.  An event is the Xen equivalent of a hardware
722interrupt.  They essentially store one bit of information; the event
723of interest is signalled by transitioning this bit from 0 to 1.
724
725Notifications are received by a guest via an upcall from Xen,
726indicating when an event arrives (setting the bit).  Further
727notifications are masked until the bit is cleared again (therefore,
728guests must check the value of the bit after re-enabling event
729delivery to ensure no missed notifications).
730
731Event notifications can be masked by setting a flag; this is
732equivalent to disabling interrupts and can be used to ensure atomicity
733of certain operations in the guest kernel.
734
735\section{Hypercall interface}
736
737\hypercall{event\_channel\_op(evtchn\_op\_t *op)}
738
739The event channel operation hypercall is used for all operations on
740event channels / ports.  Operations are distinguished by the value of
741the {\bf cmd} field of the {\bf op} structure.  The possible commands
742are described below:
743
744\begin{description}
745
746\item[EVTCHNOP\_alloc\_unbound]
747  Allocate a new event channel port, ready to be connected to by a
748  remote domain.
749  \begin{itemize}
750  \item Specified domain must exist.
751  \item A free port must exist in that domain.
752  \end{itemize}
753  Unprivileged domains may only allocate their own ports; privileged
754  domains may also allocate ports in other domains.
755\item[EVTCHNOP\_bind\_interdomain]
756  Bind an event channel for interdomain communications.
757  \begin{itemize}
758  \item Caller domain must have a free port to bind.
759  \item Remote domain must exist.
760  \item Remote port must be allocated and currently unbound.
761  \item Remote port must be expecting the caller domain as the ``remote''.
762  \end{itemize}
763\item[EVTCHNOP\_bind\_virq]
764  Allocate a port and bind a VIRQ to it.
765  \begin{itemize}
766  \item Caller domain must have a free port to bind.
767  \item VIRQ must be valid.
768  \item VCPU must exist.
769  \item VIRQ must not currently be bound to an event channel.
770  \end{itemize}
771\item[EVTCHNOP\_bind\_ipi]
772  Allocate and bind a port for notifying other virtual CPUs.
773  \begin{itemize}
774  \item Caller domain must have a free port to bind.
775  \item VCPU must exist.
776  \end{itemize}
777\item[EVTCHNOP\_bind\_pirq]
778  Allocate and bind a port to a real IRQ.
779  \begin{itemize}
780  \item Caller domain must have a free port to bind.
781  \item PIRQ must be within the valid range.
782  \item Another binding for this PIRQ must not exist for this domain.
783  \item Caller must have an available port.
784  \end{itemize}
785\item[EVTCHNOP\_close]
786  Close an event channel (no more events will be received).
787  \begin{itemize}
788  \item Port must be valid (currently allocated).
789  \end{itemize}
790\item[EVTCHNOP\_send] Send a notification on an event channel attached
791  to a port.
792  \begin{itemize}
793  \item Port must be valid.
794  \item Only valid for Interdomain, IPI or Allocated Unbound ports.
795  \end{itemize}
796\item[EVTCHNOP\_status] Query the status of a port; what kind of port,
797  whether it is bound, what remote domain is expected, what PIRQ or
798  VIRQ it is bound to, what VCPU will be notified, etc.
799  Unprivileged domains may only query the state of their own ports.
800  Privileged domains may query any port.
801\item[EVTCHNOP\_bind\_vcpu] Bind an event channel to a particular VCPU ---
802  receive notification upcalls only on that VCPU.
803  \begin{itemize}
804  \item VCPU must exist.
805  \item Port must be valid.
806  \item Event channel must be either allocated but unbound, bound to
807  an interdomain event channel, or bound to a PIRQ.
808  \end{itemize}
809
810\end{description}
811
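For illustration, the sketch below allocates an unbound port for a known
remote domain and later notifies it.  The structure and field names
follow {\bf xen/include/public/event\_channel.h}; {\tt remote\_domid} is
assumed to have been obtained out-of-band (e.g.\ via xenstore), and {\tt
HYPERVISOR\_event\_channel\_op()} is assumed to be the guest's wrapper for
the hypercall above.

\scriptsize
\begin{verbatim}
/* Sketch: allocate an unbound port, then notify the remote end. */
static uint32_t local_port;

static void alloc_and_notify(domid_t remote_domid)
{
    evtchn_op_t op;

    op.cmd                        = EVTCHNOP_alloc_unbound;
    op.u.alloc_unbound.dom        = DOMID_SELF;
    op.u.alloc_unbound.remote_dom = remote_domid;
    if (HYPERVISOR_event_channel_op(&op) == 0)
        local_port = op.u.alloc_unbound.port;  /* advertise via xenstore */

    /* ... later, once the remote domain has bound to local_port ... */
    op.cmd         = EVTCHNOP_send;
    op.u.send.port = local_port;
    (void)HYPERVISOR_event_channel_op(&op);
}
\end{verbatim}
\normalsize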
812%%
813%% grant_tables.tex
814%%
815%% Made by Mark Williamson
816%% Login   <mark@maw48>
817%%
818
819\chapter{Grant tables}
820\label{c:granttables}
821
822Xen's grant tables provide a generic mechanism for memory sharing
823between domains.  This shared memory interface underpins the split
824device drivers for block and network IO.
825
826Each domain has its own {\bf grant table}.  This is a data structure
827that is shared with Xen; it allows the domain to tell Xen what kind of
828permissions other domains have on its pages.  Entries in the grant
829table are identified by {\bf grant references}.  A grant reference is
830an integer, which indexes into the grant table.  It acts as a
831capability which the grantee can use to perform operations on the
832granter's memory.
833
834This capability-based system allows shared-memory communications
835between unprivileged domains.  A grant reference also encapsulates the
836details of a shared page, removing the need for a domain to know the
837real machine address of a page it is sharing.  This makes it possible
838to share memory correctly with domains running in fully virtualised
839memory.
840
841\section{Interface}
842
843\subsection{Grant table manipulation}
844
845Creating and destroying grant references is done by direct access to
846the grant table.  This removes the need to involve Xen when creating
847grant references, modifying access permissions, etc.  The grantee
848domain will invoke hypercalls to use the grant references.  Four main
849operations can be accomplished by directly manipulating the table:
850
851\begin{description}
852\item[Grant foreign access] allocate a new entry in the grant table
853  and fill out the access permissions accordingly.  The access
854  permissions will be looked up by Xen when the grantee attempts to
855  use the reference to map the granted frame (see the sketch below).
856\item[End foreign access] check that the grant reference is not
857  currently in use, then remove the mapping permissions for the frame.
858  This prevents further mappings from taking place but does not allow
859  forced revocations of existing mappings.
860\item[Grant foreign transfer] allocate a new entry in the table
861  specifying transfer permissions for the grantee.  Xen will look up
862  this entry when the grantee attempts to transfer a frame to the
863  granter.
864\item[End foreign transfer] remove permissions to prevent a transfer
865  occurring in future.  If the transfer is already committed,
866  modifying the grant table cannot prevent it from completing.
867\end{description}
868
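As a sketch of the ``grant foreign access'' operation, a granter might
fill in an entry of its (already mapped) grant table as follows.  The
{\tt GTF\_*} flags come from the public grant-table header; {\tt gnttab},
{\tt get\_free\_ref()}, {\tt pfn\_to\_mfn()} and {\tt wmb()} are assumed
to be provided by the guest.

\scriptsize
\begin{verbatim}
/* Sketch: grant remote_domid read-only access to one of our frames. */
extern grant_entry_t *gnttab;      /* the domain's mapped grant table */

static grant_ref_t grant_buffer(domid_t remote_domid,
                                unsigned long buffer_pfn)
{
    grant_ref_t ref = get_free_ref();

    gnttab[ref].domid = remote_domid;
    gnttab[ref].frame = pfn_to_mfn(buffer_pfn);
    wmb();              /* frame/domid must be visible before flags */
    gnttab[ref].flags = GTF_permit_access | GTF_readonly;

    return ref;         /* hand this to the remote domain, e.g. via xenstore */
}
\end{verbatim}
\normalsize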
869\subsection{Hypercalls}
870
871Use of grant references is accomplished via a hypercall.  The grant
872table op hypercall takes three arguments:
873
874\hypercall{grant\_table\_op(unsigned int cmd, void *uop, unsigned int count)}
875
876{\bf cmd} indicates the grant table operation of interest.  {\bf uop}
877is a pointer to a structure (or an array of structures) describing the
878operation to be performed.  The {\bf count} field describes how many
879grant table operations are being batched together.
880
881The core logic is situated in {\bf xen/common/grant\_table.c}.  The
882grant table operation hypercall can be used to perform the following
883actions:
884
885\begin{description}
886\item[GNTTABOP\_map\_grant\_ref] Given a grant reference from another
887  domain, map the referred page into the caller's address space (sketched below).
888\item[GNTTABOP\_unmap\_grant\_ref] Remove a mapping to a granted frame
889  from the caller's address space.  This is used to voluntarily
890  relinquish a mapping to a granted page.
891\item[GNTTABOP\_setup\_table] Set up the grant table for the calling domain.
892\item[GNTTABOP\_dump\_table] Debugging operation.
893\item[GNTTABOP\_transfer] Given a transfer reference from another
894  domain, transfer ownership of a page frame to that domain.
895\end{description}
896
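The sketch below shows the grantee side of {\bf GNTTABOP\_map\_grant\_ref}.
The field names are assumed to match {\bf
xen/include/public/grant\_table.h}; {\tt map\_vaddr} is assumed to be a
free virtual address in the caller, and error handling is reduced to a
simple failure return.

\scriptsize
\begin{verbatim}
/* Sketch: map a granted frame into the caller's address space. */
static int map_remote_frame(domid_t granter_domid, grant_ref_t remote_ref,
                            unsigned long map_vaddr)
{
    struct gnttab_map_grant_ref map;

    map.host_addr = map_vaddr;     /* where the granted frame should appear */
    map.flags     = GNTMAP_host_map;
    map.ref       = remote_ref;    /* grant reference given by the granter */
    map.dom       = granter_domid;

    if (HYPERVISOR_grant_table_op(GNTTABOP_map_grant_ref, &map, 1) != 0 ||
        map.status != GNTST_okay)
        return -1;                 /* mapping failed */

    return (int)map.handle;        /* needed later for unmap_grant_ref */
}
\end{verbatim}
\normalsize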
897%%
898%% xenstore.tex
899%%
900%% Made by Mark Williamson
901%% Login   <mark@maw48>
902%%
903
904\chapter{Xenstore}
905
906Xenstore is the mechanism by which control-plane activities occur.
907These activities include:
908
909\begin{itemize}
910\item Setting up shared memory regions and event channels for use with
911  the split device drivers.
912\item Notifying the guest of control events (e.g. balloon driver
913  requests).
914\item Reporting back status information from the guest
915  (e.g. performance-related statistics, etc).
916\end{itemize}
917
918The store is arranged as a hierarchical collection of key-value pairs.
919Each domain has a directory hierarchy containing data related to its
920configuration.  Domains are permitted to register for notifications
921about changes in subtrees of the store, and to apply changes to the
922store transactionally.
923
924\section{Guidelines}
925
926A few principles govern the operation of the store:
927
928\begin{itemize}
929\item Domains should only modify the contents of their own
930  directories.
931\item The setup protocol for a device channel should simply consist of
932  entering the configuration data into the store.
933\item The store should allow device discovery without requiring the
934  relevant device drivers to be loaded: a Xen ``bus'' should be
935  visible to probing code in the guest.
936\item The store should be usable for inter-tool communications,
937  allowing the tools themselves to be decomposed into a number of
938  smaller utilities, rather than a single monolithic entity.  This
939  also facilitates the development of alternate user interfaces to the
940  same functionality.
941\end{itemize}
942
943\section{Store layout}
944
945There are three main paths in XenStore:
946
947\begin{description}
948\item[/vm] stores configuration information about a domain
949\item[/local/domain] stores information about the domain on the local node (domid, etc.)
950\item[/tool] stores information for the various tools
951\end{description}
952
953The {\bf /vm} path stores configuration information for a domain.
954This information doesn't change and is indexed by the domain's UUID.
955A {\bf /vm} entry contains the following information:
956
957\begin{description}
958\item[uuid] uuid of the domain (somewhat redundant)
959\item[on\_reboot] the action to take on a domain reboot request (destroy or restart)
960\item[on\_poweroff] the action to take on a domain halt request (destroy or restart)
961\item[on\_crash] the action to take on a domain crash (destroy or restart)
962\item[vcpus] the number of allocated vcpus for the domain
963\item[memory] the amount of memory (in megabytes) for the domain (note: this sometimes appears to be empty for domain-0)
964\item[vcpu\_avail] the number of active vcpus for the domain (vcpus - number of disabled vcpus)
965\item[name] the name of the domain
966\end{description}
967
968
969{\bf /vm/$<$uuid$>$/image/}
970
971The image path is only available for Domain-Us and contains:
972\begin{description}
973\item[ostype] identifies the builder type (linux or vmx)
974\item[kernel] path to kernel on domain-0
975\item[cmdline] command line to pass to domain-U kernel
976\item[ramdisk] path to ramdisk on domain-0
977\end{description}
978
979{\bf /local}
980
981The {\tt /local} path currently only contains one directory, {\tt
982/local/domain}, which is indexed by domain id.  It contains the running
983domain information.  The reason for having two storage areas is that
984during migration, the uuid doesn't change but the domain id does.  The
985{\tt /local/domain} directory can be created and populated before
986finalizing the migration, enabling localhost-to-localhost migration.
987
988{\bf /local/domain/$<$domid$>$}
989
990This path contains:
991
992\begin{description}
993\item[cpu\_time] xend start time (this is only around for domain-0)
994\item[handle] private handle for xend
995\item[name] see /vm
996\item[on\_reboot] see /vm
997\item[on\_poweroff] see /vm
998\item[on\_crash] see /vm
999\item[vm] the path to the VM directory for the domain
1000\item[domid] the domain id (somewhat redundant)
1001\item[running] indicates that the domain is currently running
1002\item[memory] the current memory in megabytes for the domain (empty for domain-0?)
1003\item[maxmem\_KiB] the maximum memory for the domain (in kilobytes)
1004\item[memory\_KiB] the memory allocated to the domain (in kilobytes)
1005\item[cpu] the current CPU the domain is pinned to (empty for domain-0?)
1006\item[cpu\_weight] the weight assigned to the domain
1007\item[vcpu\_avail] a bitmap telling the domain whether it may use a given VCPU
1008\item[online\_vcpus] how many vcpus are currently online
1009\item[vcpus] the total number of vcpus allocated to the domain
1010\item[console/] a directory for console information
1011  \begin{description}
1012  \item[ring-ref] the grant table reference of the console ring queue
1013  \item[port] the event channel being used for the console ring queue (local port)
1014  \item[tty] the tty on which the console data is currently being exposed
1015  \item[limit] the limit (in bytes) of console data to buffer
1016  \end{description}
1017\item[backend/] a directory containing all backends the domain hosts
1018  \begin{description}
1019  \item[vbd/] a directory containing vbd backends
1020    \begin{description}
1021    \item[$<$domid$>$/] a directory containing vbd's for domid
1022      \begin{description}
1023      \item[$<$virtual-device$>$/] a directory for a particular
1024        virtual-device on domid
1025        \begin{description}
1026        \item[frontend-id] domain id of frontend
1027        \item[frontend] the path to the frontend domain
1028        \item[physical-device] backend device number
1029        \item[sector-size] backend sector size
1030        \item[info] 0 read/write, 1 read-only (is this right?)
1031        \item[domain] name of frontend domain
1032        \item[params] parameters for device
1033        \item[type] the type of the device
1034        \item[dev] the virtual device (as given by the user)
1035        \item[node] output from block creation script
1036        \end{description}
1037      \end{description}
1038    \end{description}
1039 
1040  \item[vif/] a directory containing vif backends
1041    \begin{description}
1042    \item[$<$domid$>$/] a directory containing vif's for domid
1043      \begin{description}
1044      \item[$<$vif number$>$/] a directory for each vif
1045      \item[frontend-id] the domain id of the frontend
1046      \item[frontend] the path to the frontend
1047      \item[mac] the mac address of the vif
1048      \item[bridge] the bridge the vif is connected to
1049      \item[handle] the handle of the vif
1050      \item[script] the script used to create/stop the vif
1051      \item[domain] the name of the frontend
1052      \end{description}
1053    \end{description}
1054
1055  \item[vtpm/] a directory containing vtpm backends
1056    \begin{description}
1057    \item[$<$domid$>$/] a directory containing vtpm's for domid
1058      \begin{description}
1059      \item[$<$vtpm number$>$/] a directory for each vtpm
1060      \item[frontend-id] the domain id of the frontend
1061      \item[frontend] the path to the frontend
1062      \item[instance] the instance of the virtual TPM that is used
1063      \item[pref{\textunderscore}instance] the instance number as given in the VM configuration file;
1064           may be different from {\bf instance}
1065      \item[domain] the name of the domain of the frontend
1066      \end{description}
1067    \end{description}
1068
1069  \end{description}
1070
1071  \item[device/] a directory containing the frontend devices for the
1072    domain
1073    \begin{description}
1074    \item[vbd/] a directory containing vbd frontend devices for the
1075      domain
1076      \begin{description}
1077      \item[$<$virtual-device$>$/] a directory containing the vbd frontend for
1078        virtual-device
1079        \begin{description}
1080        \item[virtual-device] the device number of the frontend device
1081        \item[backend-id] the domain id of the backend
1082        \item[backend] the path of the backend in the store (/local/domain
1083          path)
1084        \item[ring-ref] the grant table reference for the block request
1085          ring queue
1086        \item[event-channel] the event channel used for the block request
1087          ring queue
1088        \end{description}
1089       
1090      \item[vif/] a directory containing vif frontend devices for the
1091        domain
1092        \begin{description}
1093        \item[$<$id$>$/] a directory for vif id frontend device for the domain
1094          \begin{description}
1095          \item[backend-id] the backend domain id
1096          \item[mac] the mac address of the vif
1097          \item[handle] the internal vif handle
1098          \item[backend] a path to the backend's store entry
1099          \item[tx-ring-ref] the grant table reference for the transmission ring queue
1100          \item[rx-ring-ref] the grant table reference for the receiving ring queue
1101          \item[event-channel] the event channel used for the two ring queues
1102          \end{description}
1103        \end{description}
1104
1105      \item[vtpm/] a directory containing the vtpm frontend device for the
1106        domain
1107        \begin{description}
1108        \item[$<$id$>$] a directory for vtpm id frontend device for the domain
1109          \begin{description}
1110          \item[backend-id] the backend domain id
1111          \item[backend] a path to the backend's store entry
1112          \item[ring-ref] the grant table reference for the tx/rx ring
1113          \item[event-channel] the event channel used for the ring
1114          \end{description}
1115        \end{description}
1116       
1117      \item[device-misc/] miscellaneous information for devices
1118        \begin{description}
1119        \item[vif/] miscellaneous information for vif devices
1120          \begin{description}
1121          \item[nextDeviceID] the next device id to use
1122          \end{description}
1123        \end{description}
1124      \end{description}
1125    \end{description}
1126
1127  \item[security/] access control information for the domain
1128    \begin{description}
1129    \item[ssidref] security reference identifier used inside the hypervisor
1130    \item[access\_control/] security label used by management tools
1131      \begin{description}
1132       \item[label] security label name
1133       \item[policy] security policy name
1134      \end{description}
1135    \end{description}
1136
1137  \item[store/] per-domain information for the store
1138    \begin{description}
1139    \item[port] the event channel used for the store ring queue
1140    \item[ring-ref] the grant table reference used for the store's
1141      communication channel
1142    \end{description}
1143   
1144  \item[image] private xend information
1145\end{description}
1146
1147
1148\chapter{Devices}
1149\label{c:devices}
1150
1151Virtual devices under Xen are provided by a {\bf split device driver}
1152architecture.  The illusion of the virtual device is provided by two
1153co-operating drivers: the {\bf frontend}, which runs in the
1154unprivileged domain, and the {\bf backend}, which runs in a domain with
1155access to the real device hardware (often called a {\bf driver
1156domain}; in practice domain 0 usually fulfills this function).
1157
1158The frontend driver appears to the unprivileged guest as if it were a
1159real device, for instance a block or network device.  It receives IO
1160requests from its kernel as usual; however, since it does not have
1161access to the physical hardware of the system, it must then issue
1162requests to the backend.  The backend driver is responsible for
1163receiving these IO requests, verifying that they are safe and then
1164issuing them to the real device hardware.  The backend driver appears
1165to its kernel as a normal user of in-kernel IO functionality.  When
1166the IO completes the backend notifies the frontend that the data is
1167ready for use; the frontend is then able to report IO completion to
1168its own kernel.
1169
1170Frontend drivers are designed to be simple; most of the complexity is
1171in the backend, which has responsibility for translating device
1172addresses, verifying that requests are well-formed and do not violate
1173isolation guarantees, etc.
1174
1175Split drivers exchange requests and responses in shared memory, with
1176an event channel for asynchronous notifications of activity.  When the
1177frontend driver comes up, it uses Xenstore to set up a shared memory
1178frame and an interdomain event channel for communications with the
1179backend.  Once this connection is established, the two can communicate
1180directly by placing requests / responses into shared memory and then
1181sending notifications on the event channel.  This separation of
1182notification from data transfer allows message batching, and results
1183in very efficient device access.
1184
1185This chapter focuses on some individual split device interfaces
1186available to Xen guests.
1187
1188       
1189\section{Network I/O}
1190
1191Virtual network device services are provided by shared memory
1192communication with a backend domain.  From the point of view of other
1193domains, the backend may be viewed as a virtual ethernet switch
1194element with each domain having one or more virtual network interfaces
1195connected to it.
1196
1197From the point of view of the backend domain itself, the network
1198backend driver consists of a number of ethernet devices.  Each of
1199these has a logical direct connection to a virtual network device in
1200another domain.  This allows the backend domain to route, bridge,
1201firewall, etc.\ the traffic to/from the other domains using normal
1202operating system mechanisms.
1203
1204\subsection{Backend Packet Handling}
1205
1206The backend driver is responsible for a variety of actions relating to
1207the transmission and reception of packets from the physical device.
1208With regard to transmission, the backend performs these key actions:
1209
1210\begin{itemize}
1211\item {\bf Validation:} To ensure that domains do not attempt to
1212  generate invalid (e.g. spoofed) traffic, the backend driver may
1213  validate headers ensuring that source MAC and IP addresses match the
1214  interface that they have been sent from.
1215
1216  Validation functions can be configured using standard firewall rules
1217  ({\small{\tt iptables}} in the case of Linux).
1218 
1219\item {\bf Scheduling:} Since a number of domains can share a single
1220  physical network interface, the backend must mediate access when
1221  several domains each have packets queued for transmission.  This
1222  general scheduling function subsumes basic shaping or rate-limiting
1223  schemes.
1224 
1225\item {\bf Logging and Accounting:} The backend domain can be
1226  configured with classifier rules that control how packets are
1227  accounted or logged.  For example, log messages might be generated
1228  whenever a domain attempts to send a TCP packet containing a SYN.
1229\end{itemize}
1230
1231On receipt of incoming packets, the backend acts as a simple
1232demultiplexer: Packets are passed to the appropriate virtual interface
1233after any necessary logging and accounting have been carried out.
1234
1235\subsection{Data Transfer}
1236
1237Each virtual interface uses two ``descriptor rings'', one for
1238transmit, the other for receive.  Each descriptor identifies a block
1239of contiguous machine memory allocated to the domain.
1240
1241The transmit ring carries packets to transmit from the guest to the
1242backend domain.  The return path of the transmit ring carries messages
1243indicating that the contents have been physically transmitted and the
1244backend no longer requires the associated pages of memory.
1245
1246To receive packets, the guest places descriptors of unused pages on
1247the receive ring.  The backend will return received packets by
1248exchanging these pages in the domain's memory with new pages
1249containing the received data, and passing back descriptors regarding
1250the new packets on the ring.  This zero-copy approach allows the
1251backend to maintain a pool of free pages to receive packets into, and
1252then deliver them to appropriate domains after examining their
1253headers.
1254
1255% Real physical addresses are used throughout, with the domain
1256% performing translation from pseudo-physical addresses if that is
1257% necessary.
1258
1259If a domain does not keep its receive ring stocked with empty buffers
1260then packets destined to it may be dropped.  This provides some
1261defence against receive livelock problems because an overloaded domain
1262will cease to receive further data.  Similarly, on the transmit path,
1263it provides the application with feedback on the rate at which packets
1264are able to leave the system.
1265
1266Flow control on rings is achieved by including a pair of producer
1267indexes on the shared ring page.  Each side will maintain a private
1268consumer index indicating the next outstanding message.  In this
1269manner, the domains cooperate to divide the ring into two message
1270lists, one in each direction.  Notification is decoupled from the
1271immediate placement of new messages on the ring; the event channel
1272will be used to generate notification when {\em either} a certain
1273number of outstanding messages are queued, {\em or} a specified number
1274of nanoseconds have elapsed since the oldest message was placed on the
1275ring.
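
A minimal sketch of this scheme in C is shown below. The structure and
helper names ({\tt shared\_ring}, {\tt RING\_SIZE}, the {\tt wmb()} /
{\tt rmb()} barriers) are illustrative assumptions only; the real
definitions used by the drivers live in {\tt xen/include/public/io/ring.h}.

\scriptsize
\begin{verbatim}
/* Illustrative flow control on a shared ring page.  One producer index
 * per direction lives in shared memory; each side keeps a private
 * consumer index for the messages it receives. */
#define RING_SIZE 256                  /* assumed power-of-two capacity */

struct shared_ring {
    unsigned int req_prod;             /* written by the frontend       */
    unsigned int rsp_prod;             /* written by the backend        */
    /* ... fixed-size array of request/response slots ... */
};

static unsigned int rsp_cons;          /* frontend-private consumer idx */

/* Frontend: queue a request if the ring has space.  At most RING_SIZE
 * requests may be outstanding (queued but not yet answered). */
static int frontend_queue_request(struct shared_ring *ring)
{
    if (ring->req_prod - rsp_cons >= RING_SIZE)
        return 0;                      /* ring full: caller backs off   */
    /* ... fill slot (ring->req_prod & (RING_SIZE - 1)) ... */
    wmb();                             /* publish the slot contents ... */
    ring->req_prod++;                  /* ... before the index          */
    return 1;
}

/* Frontend: consume any responses the backend has produced. */
static void frontend_poll_responses(struct shared_ring *ring)
{
    while (rsp_cons != ring->rsp_prod) {
        rmb();                         /* read index before slot data   */
        /* ... process slot (rsp_cons & (RING_SIZE - 1)) ... */
        rsp_cons++;
    }
}
\end{verbatim}
\normalsize

Event-channel notification is layered on top of this, raised only when
the batching thresholds described above are met.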
1276
1277%% Not sure if my version is any better -- here is what was here
1278%% before: Synchronization between the backend domain and the guest is
1279%% achieved using counters held in shared memory that is accessible to
1280%% both.  Each ring has associated producer and consumer indices
1281%% indicating the area in the ring that holds descriptors that contain
1282%% data.  After receiving {\it n} packets or {\t nanoseconds} after
1283%% receiving the first packet, the hypervisor sends an event to the
1284%% domain.
1285
1286
1287\subsection{Network ring interface}
1288
1289The network device uses two shared memory rings for communication: one
for transmit, one for receive.
1291
1292Transmit requests are described by the following structure:
1293
1294\scriptsize
1295\begin{verbatim}
1296typedef struct netif_tx_request {
1297    grant_ref_t gref;      /* Reference to buffer page */
1298    uint16_t offset;       /* Offset within buffer page */
1299    uint16_t flags;        /* NETTXF_* */
1300    uint16_t id;           /* Echoed in response message. */
1301    uint16_t size;         /* Packet size in bytes.       */
1302} netif_tx_request_t;
1303\end{verbatim}
1304\normalsize
1305
1306\begin{description}
1307\item[gref] Grant reference for the network buffer
1308\item[offset] Offset to data
1309\item[flags] Transmit flags (currently only NETTXF\_csum\_blank is
1310  supported, to indicate that the protocol checksum field is
1311  incomplete).
1312\item[id] Echoed to guest by the backend in the ring-level response so
1313  that the guest can match it to this request
1314\item[size] Buffer size
1315\end{description}
1316
1317Each transmit request is followed by a transmit response at some later
1318date.  This is part of the shared-memory communication protocol and
1319allows the guest to (potentially) retire internal structures related
1320to the request.  It does not imply a network-level response.  This
1321structure is as follows:
1322
1323\scriptsize
1324\begin{verbatim}
1325typedef struct netif_tx_response {
1326    uint16_t id;
1327    int16_t  status;
1328} netif_tx_response_t;
1329\end{verbatim}
1330\normalsize
1331
1332\begin{description}
1333\item[id] Echo of the ID field in the corresponding transmit request.
1334\item[status] Success / failure status of the transmit request.
1335\end{description}
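
By way of illustration, a frontend might fill in a transmit slot as
follows; the helper name is hypothetical and the grant and ring-index
bookkeeping are elided.

\scriptsize
\begin{verbatim}
/* Hypothetical helper: describe one packet for transmission.  The
 * grant reference must already give the backend access to the page
 * holding the packet. */
static void fill_tx_request(netif_tx_request_t *slot, grant_ref_t gref,
                            uint16_t offset, uint16_t len, uint16_t id)
{
    slot->gref   = gref;       /* granted buffer page                   */
    slot->offset = offset;     /* start of packet within that page      */
    slot->flags  = 0;          /* or NETTXF_csum_blank if appropriate   */
    slot->id     = id;         /* echoed back in netif_tx_response_t    */
    slot->size   = len;        /* packet length in bytes                */
}
\end{verbatim}
\normalsize

When the matching {\tt netif\_tx\_response\_t} arrives, its {\tt id}
field tells the frontend which internal packet buffer may now be
retired.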
1336
1337Receive requests must be queued by the frontend, accompanied by a
1338donation of page-frames to the backend.  The backend transfers page
frames full of data back to the guest:
1340
1341\scriptsize
1342\begin{verbatim}
1343typedef struct {
1344    uint16_t    id;        /* Echoed in response message.        */
1345    grant_ref_t gref;      /* Reference to incoming granted frame */
1346} netif_rx_request_t;
1347\end{verbatim}
1348\normalsize
1349
1350\begin{description}
1351\item[id] Echoed by the frontend to identify this request when
1352  responding.
1353\item[gref] Transfer reference - the backend will use this reference
1354  to transfer a frame of network data to us.
1355\end{description}
1356
1357Receive response descriptors are queued for each received frame.  Note
1358that these may only be queued in reply to an existing receive request,
1359providing an in-built form of traffic throttling.
1360
1361\scriptsize
1362\begin{verbatim}
1363typedef struct {
1364    uint16_t id;
1365    uint16_t offset;       /* Offset in page of start of received packet  */
1366    uint16_t flags;        /* NETRXF_* */
    int16_t  status;       /* -ve: NETIF_RSP_* ; +ve: Rx'ed pkt size. */
1368} netif_rx_response_t;
1369\end{verbatim}
1370\normalsize
1371
1372\begin{description}
1373\item[id] ID echoed from the original request, used by the guest to
1374  match this response to the original request.
1375\item[offset] Offset to data within the transferred frame.
\item[flags] Receive flags (currently only NETRXF\_csum\_valid is
1377  supported, to indicate that the protocol checksum field has already
1378  been validated).
1379\item[status] Success / error status for this operation.
1380\end{description}
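
A corresponding sketch for stocking the receive ring follows; the
helper name is again hypothetical and the grant-transfer setup is
elided.

\scriptsize
\begin{verbatim}
/* Hypothetical helper: post one empty page on the receive ring.  The
 * grant reference must allow the backend to *transfer* a frame of data
 * into this domain, not merely to map the page. */
static void fill_rx_request(netif_rx_request_t *slot, uint16_t id,
                            grant_ref_t transfer_ref)
{
    slot->id   = id;            /* echoed in netif_rx_response_t.id     */
    slot->gref = transfer_ref;  /* page to be exchanged by the backend  */
}

/* In the response, a negative status signals an error; a positive
 * status is the length of the received packet, which begins at
 * 'offset' bytes into the transferred page. */
\end{verbatim}
\normalsize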
1381
1382Note that the receive protocol includes a mechanism for guests to
1383receive incoming memory frames but there is no explicit transfer of
1384frames in the other direction.  Guests are expected to return memory
1385to the hypervisor in order to use the network interface.  They {\em
1386must} do this or they will exceed their maximum memory reservation and
1387will not be able to receive incoming frame transfers.  When necessary,
1388the backend is able to replenish its pool of free network buffers by
1389claiming some of this free memory from the hypervisor.
1390
1391\section{Block I/O}
1392
1393All guest OS disk access goes through the virtual block device VBD
1394interface.  This interface allows domains access to portions of block
storage devices visible to the block backend device.  The VBD
1396interface is a split driver, similar to the network interface
1397described above.  A single shared memory ring is used between the
1398frontend and backend drivers for each virtual device, across which
1399IO requests and responses are sent.
1400
1401Any block device accessible to the backend domain, including
1402network-based block (iSCSI, *NBD, etc), loopback and LVM/MD devices,
1403can be exported as a VBD.  Each VBD is mapped to a device node in the
1404guest, specified in the guest's startup configuration.
1405
1406\subsection{Data Transfer}
1407
1408The per-(virtual)-device ring between the guest and the block backend
1409supports two messages:
1410
1411\begin{description}
1412\item [{\small {\tt READ}}:] Read data from the specified block
1413  device.  The front end identifies the device and location to read
1414  from and attaches pages for the data to be copied to (typically via
1415  DMA from the device).  The backend acknowledges completed read
1416  requests as they finish.
1417
1418\item [{\small {\tt WRITE}}:] Write data to the specified block
1419  device.  This functions essentially as {\small {\tt READ}}, except
1420  that the data moves to the device instead of from it.
1421\end{description}
1422
1423%% Rather than copying data, the backend simply maps the domain's
1424%% buffers in order to enable direct DMA to them.  The act of mapping
1425%% the buffers also increases the reference counts of the underlying
1426%% pages, so that the unprivileged domain cannot try to return them to
1427%% the hypervisor, install them as page tables, or any other unsafe
1428%% behaviour.
1429%%
1430%% % block API here
1431
1432\subsection{Block ring interface}
1433
1434The block interface is defined by the structures passed over the
1435shared memory interface.  These structures are either requests (from
1436the frontend to the backend) or responses (from the backend to the
1437frontend).
1438
1439The request structure is defined as follows:
1440
1441\scriptsize
1442\begin{verbatim}
1443typedef struct blkif_request {
1444    uint8_t        operation;    /* BLKIF_OP_???                         */
1445    uint8_t        nr_segments;  /* number of segments                   */
1446    blkif_vdev_t   handle;       /* only for read/write requests         */
1447    uint64_t       id;           /* private guest value, echoed in resp  */
1448    blkif_sector_t sector_number;/* start sector idx on disk (r/w only)  */
1449    struct blkif_request_segment {
1450        grant_ref_t gref;        /* reference to I/O buffer frame        */
1451        /* @first_sect: first sector in frame to transfer (inclusive).   */
1452        /* @last_sect: last sector in frame to transfer (inclusive).     */
1453        uint8_t     first_sect, last_sect;
1454    } seg[BLKIF_MAX_SEGMENTS_PER_REQUEST];
1455} blkif_request_t;
1456\end{verbatim}
1457\normalsize
1458
1459The fields are as follows:
1460
1461\begin{description}
1462\item[operation] operation ID: one of the operations described above
1463\item[nr\_segments] number of segments for scatter / gather IO
1464  described by this request
1465\item[handle] identifier for a particular virtual device on this
1466  interface
1467\item[id] this value is echoed in the response message for this IO;
1468  the guest may use it to identify the original request
\item[sector\_number] start sector on the virtual device for this
1470  request
\item[seg] This array contains structures encoding
1472  scatter-gather IO to be performed:
1473  \begin{description}
1474  \item[gref] The grant reference for the foreign I/O buffer page.
1475  \item[first\_sect] First sector to access within the buffer page (0 to 7).
1476  \item[last\_sect] Last sector to access within the buffer page (0 to 7).
1477  \end{description}
1478  Data will be transferred into frames at an offset determined by the
1479  value of {\tt first\_sect}.
1480\end{description}
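
As a concrete (illustrative) example, a single-segment read covering
one whole page of a virtual disk, i.e.\ sectors 0--7 of the granted
frame, could be assembled as follows; the helper name is hypothetical.

\scriptsize
\begin{verbatim}
/* Illustrative single-segment READ request. */
static void build_read_request(blkif_request_t *req, blkif_vdev_t handle,
                               uint64_t id, blkif_sector_t start_sector,
                               grant_ref_t buffer_gref)
{
    req->operation     = BLKIF_OP_READ;   /* operation codes in blkif.h */
    req->nr_segments   = 1;               /* one scatter-gather segment */
    req->handle        = handle;          /* which virtual device       */
    req->id            = id;              /* echoed in the response     */
    req->sector_number = start_sector;    /* start sector on the VBD    */

    req->seg[0].gref       = buffer_gref; /* granted I/O buffer frame   */
    req->seg[0].first_sect = 0;           /* fill the page from sector 0*/
    req->seg[0].last_sect  = 7;           /* ... through to sector 7    */
}
\end{verbatim}
\normalsize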
1481
1482\section{Virtual TPM}
1483
1484Virtual TPM (VTPM) support provides TPM functionality to each virtual
1485machine that requests this functionality in its configuration file.
The interface enables domains to access their own private TPM as if it
were a hardware TPM built into the machine.
1488
1489The virtual TPM interface is implemented as a split driver,
1490similar to the network and block interfaces described above.
1491The user domain hosting the frontend exports a character device /dev/tpm0
1492to user-level applications for communicating with the virtual TPM.
1493This is the same device interface that is also offered if a hardware TPM
1494is available in the system. The backend provides a single interface
/dev/vtpm, on which the virtual TPM waits for commands from all domains
whose backend is located in that domain.
1497
1498\subsection{Data Transfer}
1499
1500A single shared memory ring is used between the frontend and backend
1501drivers. TPM requests and responses are sent in pages where a pointer
1502to those pages and other information is placed into the ring such that
1503the backend can map the pages into its memory space using the grant
1504table mechanism.
1505
1506The backend driver has been implemented to only accept well-formed
TPM requests. To meet this requirement, the length indicator in the
1508TPM request must correctly indicate the length of the request.
1509Otherwise an error message is automatically sent back by the device driver.
1510
The virtual TPM implementation listens for TPM requests on /dev/vtpm. Since
1512it must be able to apply the TPM request packet to the virtual TPM instance
1513associated with the virtual machine, a 4-byte virtual TPM instance
1514identifier is prepended to each packet by the backend driver (in network
1515byte order) for internal routing of the request.
1516
1517\subsection{Virtual TPM ring interface}
1518
1519The TPM protocol is a strict request/response protocol and therefore
1520only one ring is used to send requests from the frontend to the backend
1521and responses on the reverse path.
1522
1523The request/response structure is defined as follows:
1524
1525\scriptsize
1526\begin{verbatim}
1527typedef struct {
1528    unsigned long addr;     /* Machine address of packet.     */
1529    grant_ref_t ref;        /* grant table access reference.  */
1530    uint16_t unused;        /* unused                         */
1531    uint16_t size;          /* Packet size in bytes.          */
1532} tpmif_tx_request_t;
1533\end{verbatim}
1534\normalsize
1535
1536The fields are as follows:
1537
1538\begin{description}
\item[addr] The machine address of the page associated with the TPM
1540            request/response; a request/response may span multiple
1541            pages
1542\item[ref]  The grant table reference associated with the address.
1543\item[size] The size of the remaining packet; up to
1544            PAGE{\textunderscore}SIZE bytes can be found in the
1545            page referenced by 'addr'
1546\end{description}
1547
1548The frontend initially allocates several pages whose addresses
1549are stored in the ring. Only these pages are used for exchange of
1550requests and responses.
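
For illustration, one ring entry describing a page of a TPM request
might be filled in as follows (helper name hypothetical); a request
larger than PAGE\_SIZE simply continues in the next entry.

\scriptsize
\begin{verbatim}
/* Illustrative: describe one page of a TPM request/response. */
static void fill_tpm_slot(tpmif_tx_request_t *slot, unsigned long maddr,
                          grant_ref_t ref, uint16_t remaining_bytes)
{
    slot->addr   = maddr;            /* machine address of packet page  */
    slot->ref    = ref;              /* grant reference for that page   */
    slot->unused = 0;
    slot->size   = remaining_bytes;  /* bytes left in the packet; up to */
                                     /* PAGE_SIZE of them are here      */
}
\end{verbatim}
\normalsize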
1551
1552
1553\chapter{Further Information}
1554
1555If you have questions that are not answered by this manual, the
1556sources of information listed below may be of interest to you.  Note
1557that bug reports, suggestions and contributions related to the
1558software (or the documentation) should be sent to the Xen developers'
1559mailing list (address below).
1560
1561
1562\section{Other documentation}
1563
1564If you are mainly interested in using (rather than developing for)
1565Xen, the \emph{Xen Users' Manual} is distributed in the {\tt docs/}
1566directory of the Xen source distribution.
1567
1568% Various HOWTOs are also available in {\tt docs/HOWTOS}.
1569
1570
1571\section{Online references}
1572
1573The official Xen web site can be found at:
1574\begin{quote} {\tt http://www.xensource.com}
1575\end{quote}
1576
1577
1578This contains links to the latest versions of all online
1579documentation, including the latest version of the FAQ.
1580
1581Information regarding Xen is also available at the Xen Wiki at
1582\begin{quote} {\tt http://wiki.xensource.com/xenwiki/}\end{quote}
1583The Xen project uses Bugzilla as its bug tracking system. You'll find
the Xen Bugzilla at {\tt http://bugzilla.xensource.com/bugzilla/}.
1585
1586
1587\section{Mailing lists}
1588
1589There are several mailing lists that are used to discuss Xen related
1590topics. The most widely relevant are listed below. An official page of
1591mailing lists and subscription information can be found at \begin{quote}
1592  {\tt http://lists.xensource.com/} \end{quote}
1593
1594\begin{description}
1595\item[xen-devel@lists.xensource.com] Used for development
1596  discussions and bug reports.  Subscribe at: \\
1597  {\small {\tt http://lists.xensource.com/xen-devel}}
1598\item[xen-users@lists.xensource.com] Used for installation and usage
1599  discussions and requests for help.  Subscribe at: \\
1600  {\small {\tt http://lists.xensource.com/xen-users}}
1601\item[xen-announce@lists.xensource.com] Used for announcements only.
1602  Subscribe at: \\
1603  {\small {\tt http://lists.xensource.com/xen-announce}}
1604\item[xen-changelog@lists.xensource.com] Changelog feed
1605  from the unstable and 2.0 trees - developer oriented.  Subscribe at: \\
1606  {\small {\tt http://lists.xensource.com/xen-changelog}}
1607\end{description}
1608
1609\appendix
1610
1611
1612\chapter{Xen Hypercalls}
1613\label{a:hypercalls}
1614
1615Hypercalls represent the procedural interface to Xen; this appendix
1616categorizes and describes the current set of hypercalls.
1617
1618\section{Invoking Hypercalls} 
1619
1620Hypercalls are invoked in a manner analogous to system calls in a
1621conventional operating system; a software interrupt is issued which
1622vectors to an entry point within Xen. On x86/32 machines the
instruction required is {\tt int \$0x82}; the (real) IDT is set up so
1624that this may only be issued from within ring 1. The particular
1625hypercall to be invoked is contained in {\tt EAX} --- a list
1626mapping these values to symbolic hypercall names can be found
1627in {\tt xen/include/public/xen.h}.
1628
1629On some occasions a set of hypercalls will be required to carry
1630out a higher-level function; a good example is when a guest
operating system wishes to context switch to a new process, which
1632requires updating various privileged CPU state. As an optimization
1633for these cases, there is a generic mechanism to issue a set of
1634hypercalls as a batch:
1635
1636\begin{quote}
1637\hypercall{multicall(void *call\_list, int nr\_calls)}
1638
1639Execute a series of hypervisor calls; {\tt nr\_calls} is the length of
the array of {\tt multicall\_entry\_t} structures pointed to by {\tt
1641call\_list}. Each entry contains the hypercall operation code followed
1642by up to 7 word-sized arguments.
1643\end{quote}
1644
1645Note that multicalls are provided purely as an optimization; there is
1646no requirement to use them when first porting a guest operating
1647system.
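
As a hedged sketch, a guest's context-switch path might batch two of
the hypercalls described later in this appendix into a single trap.
The {\tt HYPERVISOR\_multicall} wrapper is assumed to be a
XenLinux-style hypercall macro, and the {\tt op} / {\tt args[]} fields
of {\tt multicall\_entry\_t} follow {\tt xen/include/public/xen.h}.

\scriptsize
\begin{verbatim}
/* Sketch: batch a kernel stack switch and an FPU task-switch toggle
 * into a single trap to Xen. */
static void switch_stack_and_lazy_fpu(unsigned long new_ss,
                                      unsigned long new_esp)
{
    multicall_entry_t batch[2];

    batch[0].op      = __HYPERVISOR_stack_switch;
    batch[0].args[0] = new_ss;           /* new kernel stack segment    */
    batch[0].args[1] = new_esp;          /* new kernel stack pointer    */

    batch[1].op      = __HYPERVISOR_fpu_taskswitch;
    batch[1].args[0] = 1;                /* set TS in cr0               */

    HYPERVISOR_multicall(batch, 2);      /* one trap applies both calls */
}
\end{verbatim}
\normalsize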
1648
1649
1650\section{Virtual CPU Setup} 
1651
At start of day, a guest operating system needs to set up the virtual
1653CPU it is executing on. This includes installing vectors for the
1654virtual IDT so that the guest OS can handle interrupts, page faults,
etc. However, the very first thing a guest OS must set up is a pair
1656of hypervisor callbacks: these are the entry points which Xen will
1657use when it wishes to notify the guest OS of an occurrence.
1658
1659\begin{quote}
1660\hypercall{set\_callbacks(unsigned long event\_selector, unsigned long
1661  event\_address, unsigned long failsafe\_selector, unsigned long
1662  failsafe\_address) }
1663
1664Register the normal (``event'') and failsafe callbacks for
1665event processing. In each case the code segment selector and
1666address within that segment are provided. The selectors must
1667have RPL 1; in XenLinux we simply use the kernel's CS for both
1668{\bf event\_selector} and {\bf failsafe\_selector}.
1669
The value {\bf event\_address} specifies the address of the guest OS's
1671event handling and dispatch routine; the {\bf failsafe\_address}
1672specifies a separate entry point which is used only if a fault occurs
1673when Xen attempts to use the normal callback.
1674
1675\end{quote} 
1676
1677On x86/64 systems the hypercall takes slightly different
1678arguments. This is because callback CS does not need to be specified
(since the callbacks are entered via SYSRET), and also because an
1680entry address needs to be specified for SYSCALLs from guest user
1681space:
1682
1683\begin{quote}
1684\hypercall{set\_callbacks(unsigned long event\_address, unsigned long
1685  failsafe\_address, unsigned long syscall\_address)}
1686\end{quote} 
1687
1688
1689After installing the hypervisor callbacks, the guest OS can
1690install a `virtual IDT' by using the following hypercall:
1691
1692\begin{quote} 
1693\hypercall{set\_trap\_table(trap\_info\_t *table)} 
1694
1695Install one or more entries into the per-domain
1696trap handler table (essentially a software version of the IDT).
1697Each entry in the array pointed to by {\bf table} includes the
1698exception vector number with the corresponding segment selector
1699and entry point. Most guest OSes can use the same handlers on
1700Xen as when running on the real hardware.
1701
1702
1703\end{quote} 
1704
1705A further hypercall is provided for the management of virtual CPUs:
1706
1707\begin{quote}
1708\hypercall{vcpu\_op(int cmd, int vcpuid, void *extra\_args)}
1709
1710This hypercall can be used to bootstrap VCPUs, to bring them up and
1711down and to test their current status.
1712
1713\end{quote}
1714
1715\section{Scheduling and Timer}
1716
1717Domains are preemptively scheduled by Xen according to the
1718parameters installed by domain 0 (see Section~\ref{s:dom0ops}).
1719In addition, however, a domain may choose to explicitly
1720control certain behavior with the following hypercall:
1721
1722\begin{quote} 
1723\hypercall{sched\_op\_new(int cmd, void *extra\_args)}
1724
1725Request scheduling operation from hypervisor. The following
1726sub-commands are available:
1727
1728\begin{description}
1729\item[SCHEDOP\_yield] voluntarily yields the CPU, but leaves the
1730caller marked as runnable. No extra arguments are passed to this
1731command.
1732\item[SCHEDOP\_block] removes the calling domain from the run queue
1733and causes it to sleep until an event is delivered to it. No extra
1734arguments are passed to this command.
1735\item[SCHEDOP\_shutdown] is used to end the calling domain's
1736execution. The extra argument is a {\bf sched\_shutdown} structure
1737which indicates the reason why the domain suspended (e.g., for reboot,
1738halt, power-off).
1739\item[SCHEDOP\_poll] allows a VCPU to wait on a set of event channels
1740with an optional timeout (all of which are specified in the {\bf
1741sched\_poll} extra argument). The semantics are similar to the UNIX
1742{\bf poll} system call. The caller must have event-channel upcalls
1743masked when executing this command.
1744\end{description}
1745\end{quote} 
1746
1747{\bf sched\_op\_new}  was not available prior to Xen 3.0.2. Older versions
1748provide only the following hypercall:
1749
1750\begin{quote} 
1751\hypercall{sched\_op(int cmd, unsigned long extra\_arg)}
1752
1753This hypercall supports the following subset of {\bf sched\_op\_new} commands:
1754
1755\begin{description}
1756\item[SCHEDOP\_yield] (extra argument is 0).
1757\item[SCHEDOP\_block] (extra argument is 0).
1758\item[SCHEDOP\_shutdown] (extra argument is numeric reason code).
1759\end{description}
1760\end{quote}
1761
1762To aid the implementation of a process scheduler within a guest OS,
1763Xen provides a virtual programmable timer:
1764
1765\begin{quote}
1766\hypercall{set\_timer\_op(uint64\_t timeout)} 
1767
1768Request a timer event to be sent at the specified system time (time
1769in nanoseconds since system boot).
1770
1771\end{quote} 
1772
1773Note that calling {\bf set\_timer\_op} prior to {\bf sched\_op} 
1774allows block-with-timeout semantics.
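
A minimal sketch of this pattern, assuming XenLinux-style
{\tt HYPERVISOR\_*} wrapper macros:

\scriptsize
\begin{verbatim}
/* Sleep until either the one-shot timer fires or any other event is
 * delivered to this VCPU, whichever happens first. */
static void block_until(uint64_t wakeup_time_ns)
{
    HYPERVISOR_set_timer_op(wakeup_time_ns);   /* absolute system time  */
    HYPERVISOR_sched_op(SCHEDOP_block, 0);     /* yield until an event  */
}
\end{verbatim}
\normalsize

Note that {\bf SCHEDOP\_block} re-enables event delivery for the
calling VCPU, which avoids losing a wakeup that arrives just before the
call.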
1775
1776
1777\section{Page Table Management} 
1778
1779Since guest operating systems have read-only access to their page
1780tables, Xen must be involved when making any changes. The following
1781multi-purpose hypercall can be used to modify page-table entries,
1782update the machine-to-physical mapping table, flush the TLB, install
1783a new page-table base pointer, and more.
1784
1785\begin{quote} 
1786\hypercall{mmu\_update(mmu\_update\_t *req, int count, int *success\_count)} 
1787
1788Update the page table for the domain; a set of {\bf count} updates are
1789submitted for processing in a batch, with {\bf success\_count} being
1790updated to report the number of successful updates. 
1791
1792Each element of {\bf req[]} contains a pointer (address) and value;
1793the least significant 2-bits of the pointer are used to distinguish
1794the type of update requested as follows:
1795\begin{description} 
1796
1797\item[MMU\_NORMAL\_PT\_UPDATE:] update a page directory entry or
1798page table entry to the associated value; Xen will check that the
1799update is safe, as described in Chapter~\ref{c:memory}.
1800
1801\item[MMU\_MACHPHYS\_UPDATE:] update an entry in the
1802  machine-to-physical table. The calling domain must own the machine
1803  page in question (or be privileged).
1804\end{description}
1805
1806\end{quote}
1807
1808Explicitly updating batches of page table entries is extremely
1809efficient, but can require a number of alterations to the guest
1810OS. Using the writable page table mode (Chapter~\ref{c:memory}) is
1811recommended for new OS ports.
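
For example, a guest might batch two PTE writes as follows. The call is
shown with the signature given above (a real guest would issue it via
its hypercall wrapper); the field names {\tt ptr} / {\tt val} and the
{\tt MMU\_NORMAL\_PT\_UPDATE} constant follow
{\tt xen/include/public/xen.h}, and the PTE machine addresses are
assumed to be known.

\scriptsize
\begin{verbatim}
/* Sketch: validate and apply two page-table entry updates in one trap
 * to Xen.  The low two bits of each 'ptr' select the update type. */
static void update_two_ptes(uint64_t pte0_maddr, uint64_t new_pte0,
                            uint64_t pte1_maddr, uint64_t new_pte1)
{
    mmu_update_t req[2];
    int done;

    req[0].ptr = pte0_maddr | MMU_NORMAL_PT_UPDATE;  /* ordinary PTE    */
    req[0].val = new_pte0;
    req[1].ptr = pte1_maddr | MMU_NORMAL_PT_UPDATE;
    req[1].val = new_pte1;

    mmu_update(req, 2, &done);        /* done == 2 if both were applied */
}
\end{verbatim}
\normalsize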
1812
1813Regardless of which page table update mode is being used, however,
1814there are some occasions (notably handling a demand page fault) where
1815a guest OS will wish to modify exactly one PTE rather than a
1816batch, and where that PTE is mapped into the current address space.
1817This is catered for by the following:
1818
1819\begin{quote} 
1820\hypercall{update\_va\_mapping(unsigned long va, uint64\_t val,
1821                         unsigned long flags)}
1822
1823Update the currently installed PTE that maps virtual address {\bf va}
1824to new value {\bf val}. As with {\bf mmu\_update}, Xen checks the
1825modification  is safe before applying it. The {\bf flags} determine
1826which kind of TLB flush, if any, should follow the update.
1827
1828\end{quote} 
1829
1830Finally, sufficiently privileged domains may occasionally wish to manipulate
1831the pages of others:
1832
1833\begin{quote}
1834\hypercall{update\_va\_mapping(unsigned long va, uint64\_t val,
1835                         unsigned long flags, domid\_t domid)}
1836
1837Identical to {\bf update\_va\_mapping} save that the pages being
1838mapped must belong to the domain {\bf domid}.
1839
1840\end{quote}
1841
1842An additional MMU hypercall provides an ``extended command''
1843interface.  This provides additional functionality beyond the basic
1844table updating commands:
1845
1846\begin{quote}
1847
1848\hypercall{mmuext\_op(struct mmuext\_op *op, int count, int *success\_count, domid\_t domid)}
1849
1850This hypercall is used to perform additional MMU operations.  These
1851include updating {\tt cr3} (or just re-installing it for a TLB flush),
1852requesting various kinds of TLB flush, flushing the cache, installing
1853a new LDT, or pinning \& unpinning page-table pages (to ensure their
1854reference count doesn't drop to zero which would require a
1855revalidation of all entries).  Some of the operations available are
1856restricted to domains with sufficient system privileges.
1857
1858It is also possible for privileged domains to reassign page ownership
1859via an extended MMU operation, although grant tables are used instead
1860of this where possible; see Section~\ref{s:idc}.
1861
1862\end{quote}
1863
1864Finally, a hypercall interface is exposed to activate and deactivate
1865various optional facilities provided by Xen for memory management.
1866
1867\begin{quote} 
1868\hypercall{vm\_assist(unsigned int cmd, unsigned int type)}
1869
1870Toggle various memory management modes (in particular writable page
1871tables).
1872
1873\end{quote} 
1874
1875\section{Segmentation Support}
1876
1877Xen allows guest OSes to install a custom GDT if they require it;
1878this is context switched transparently whenever a domain is
1879[de]scheduled.  The following hypercall is effectively a
1880`safe' version of {\tt lgdt}:
1881
1882\begin{quote}
1883\hypercall{set\_gdt(unsigned long *frame\_list, int entries)} 
1884
1885Install a global descriptor table for a domain; {\bf frame\_list} is
1886an array of up to 16 machine page frames within which the GDT resides,
1887with {\bf entries} being the actual number of descriptor-entry
1888slots. All page frames must be mapped read-only within the guest's
1889address space, and the table must be large enough to contain Xen's
1890reserved entries (see {\bf xen/include/public/arch-x86\_32.h}).
1891
1892\end{quote}
1893
1894Many guest OSes will also wish to install LDTs; this is achieved by
1895using {\bf mmu\_update} with an extended command, passing the
1896linear address of the LDT base along with the number of entries. No
1897special safety checks are required; Xen needs to perform this task
1898simply since {\tt lldt} requires CPL 0.
1899
1900
1901Xen also allows guest operating systems to update just an
1902individual segment descriptor in the GDT or LDT: 
1903
1904\begin{quote}
1905\hypercall{update\_descriptor(uint64\_t ma, uint64\_t desc)}
1906
1907Update the GDT/LDT entry at machine address {\bf ma}; the new
19088-byte descriptor is stored in {\bf desc}.
1909Xen performs a number of checks to ensure the descriptor is
1910valid.
1911
1912\end{quote}
1913
1914Guest OSes can use the above in place of context switching entire
1915LDTs (or the GDT) when the number of changing descriptors is small.
1916
1917\section{Context Switching} 
1918
1919When a guest OS wishes to context switch between two processes,
1920it can use the page table and segmentation hypercalls described
above to perform the bulk of the privileged work. In addition,
1922however, it will need to invoke Xen to switch the kernel (ring 1)
1923stack pointer:
1924
1925\begin{quote} 
1926\hypercall{stack\_switch(unsigned long ss, unsigned long esp)} 
1927
1928Request kernel stack switch from hypervisor; {\bf ss} is the new
stack segment, and {\bf esp} is the new stack pointer.
1930
1931\end{quote} 
1932
1933A useful hypercall for context switching allows ``lazy'' save and
1934restore of floating point state:
1935
1936\begin{quote}
1937\hypercall{fpu\_taskswitch(int set)} 
1938
1939This call instructs Xen to set the {\tt TS} bit in the {\tt cr0}
1940control register; this means that the next attempt to use floating
point will cause a fault which the guest OS can catch. Typically it will
1942then save/restore the FP state, and clear the {\tt TS} bit, using the
1943same call.
1944\end{quote} 
1945
1946This is provided as an optimization only; guest OSes can also choose
1947to save and restore FP state on all context switches for simplicity.
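
A minimal sketch of the lazy scheme follows; {\tt struct task} and
{\tt restore\_fpu\_state()} are hypothetical guest-internal names, and
{\tt HYPERVISOR\_fpu\_taskswitch} is assumed to be a XenLinux-style
wrapper for the hypercall above.

\scriptsize
\begin{verbatim}
/* When switching away from a task, do not save FPU state eagerly;
 * just ask Xen to set CR0.TS so the next FP instruction faults. */
static void fpu_switch_out(void)
{
    HYPERVISOR_fpu_taskswitch(1);        /* set TS                      */
}

/* In the guest's device-not-available (#NM) fault handler: */
static void do_device_not_available(struct task *current)
{
    HYPERVISOR_fpu_taskswitch(0);        /* clear TS: FP ops may resume */
    restore_fpu_state(current);          /* hypothetical helper         */
}
\end{verbatim}
\normalsize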
1948
1949Finally, a hypercall is provided for entering vm86 mode:
1950
1951\begin{quote}
1952\hypercall{switch\_vm86}
1953
1954This allows the guest to run code in vm86 mode, which is needed for
1955some legacy software.
1956\end{quote}
1957
1958\section{Physical Memory Management}
1959
1960As mentioned previously, each domain has a maximum and current
1961memory allocation. The maximum allocation, set at domain creation
1962time, cannot be modified. However a domain can choose to reduce
1963and subsequently grow its current allocation by using the
1964following call:
1965
1966\begin{quote} 
1967\hypercall{memory\_op(unsigned int op, void *arg)}
1968
1969Increase or decrease current memory allocation (as determined by
1970the value of {\bf op}).  The available operations are:
1971
1972\begin{description}
1973\item[XENMEM\_increase\_reservation] Request an increase in machine
1974  memory allocation; {\bf arg} must point to a {\bf
1975  xen\_memory\_reservation} structure.
1976\item[XENMEM\_decrease\_reservation] Request a decrease in machine
1977  memory allocation; {\bf arg} must point to a {\bf
1978  xen\_memory\_reservation} structure.
1979\item[XENMEM\_maximum\_ram\_page] Request the frame number of the
1980  highest-addressed frame of machine memory in the system.  {\bf arg}
1981  must point to an {\bf unsigned long} where this value will be
1982  stored.
1983\item[XENMEM\_current\_reservation] Returns current memory reservation
1984  of the specified domain.
\item[XENMEM\_maximum\_reservation] Returns maximum memory reservation
1986  of the specified domain.
1987\end{description}
1988
1989\end{quote} 
1990
1991In addition to simply reducing or increasing the current memory
1992allocation via a `balloon driver', this call is also useful for
1993obtaining contiguous regions of machine memory when required (e.g.
1994for certain PCI devices, or if using superpages). 
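
A balloon driver's ``return {\it count} pages to Xen'' path might look
roughly as follows, assuming XenLinux-style wrappers and the
{\bf xen\_memory\_reservation} layout from
{\tt xen/include/public/memory.h}; {\tt frame\_list[]} is assumed to
hold the machine frame numbers being released.

\scriptsize
\begin{verbatim}
/* Sketch: give 'count' individual (order-0) machine pages, whose frame
 * numbers are listed in frame_list[], back to Xen. */
static void balloon_out(unsigned long *frame_list, unsigned long count)
{
    struct xen_memory_reservation reservation = {
        .nr_extents   = count,      /* number of extents in frame_list  */
        .extent_order = 0,          /* each extent is one 4kB page      */
        .domid        = DOMID_SELF,
    };

    set_xen_guest_handle(reservation.extent_start, frame_list);
    HYPERVISOR_memory_op(XENMEM_decrease_reservation, &reservation);
}
\end{verbatim}
\normalsize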
1995
1996
1997\section{Inter-Domain Communication}
1998\label{s:idc} 
1999
2000Xen provides a simple asynchronous notification mechanism via
2001\emph{event channels}. Each domain has a set of end-points (or
2002\emph{ports}) which may be bound to an event source (e.g. a physical
IRQ, a virtual IRQ, or a port in another domain). When a pair of
2004end-points in two different domains are bound together, then a `send'
2005operation on one will cause an event to be received by the destination
2006domain.
2007
2008The control and use of event channels involves the following hypercall:
2009
2010\begin{quote}
2011\hypercall{event\_channel\_op(evtchn\_op\_t *op)} 
2012
2013Inter-domain event-channel management; {\bf op} is a discriminated
2014union which allows the following 7 operations:
2015
2016\begin{description} 
2017
2018\item[alloc\_unbound:] allocate a free (unbound) local
2019  port and prepare for connection from a specified domain.
2020\item[bind\_virq:] bind a local port to a virtual
2021IRQ; any particular VIRQ can be bound to at most one port per domain.
2022\item[bind\_pirq:] bind a local port to a physical IRQ;
2023once more, a given pIRQ can be bound to at most one port per
2024domain. Furthermore the calling domain must be sufficiently
2025privileged.
2026\item[bind\_interdomain:] construct an interdomain event
2027channel; in general, the target domain must have previously allocated
2028an unbound port for this channel, although this can be bypassed by
2029privileged domains during domain setup.
2030\item[close:] close an interdomain event channel.
\item[send:] send an event to the remote end of an
2032interdomain event channel.
2033\item[status:] determine the current status of a local port.
2034\end{description} 
2035
2036For more details see
2037{\bf xen/include/public/event\_channel.h}.
2038
2039\end{quote} 
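
For illustration, the typical connection sequence for a split driver
might be coded as follows. The {\tt HYPERVISOR\_event\_channel\_op}
wrapper is assumed to take a single operation pointer, matching the
signature above (the wrapper details vary between guest ports), and
error handling is elided.

\scriptsize
\begin{verbatim}
/* Sketch: allocate an unbound local port that 'backend_domid' may bind
 * to, and return it so it can be advertised to the backend. */
static int alloc_port_for_backend(domid_t backend_domid)
{
    evtchn_op_t op;

    op.cmd = EVTCHNOP_alloc_unbound;
    op.u.alloc_unbound.dom        = DOMID_SELF;    /* port lives here   */
    op.u.alloc_unbound.remote_dom = backend_domid; /* who may bind      */
    HYPERVISOR_event_channel_op(&op);
    return op.u.alloc_unbound.port;
}

/* Once the backend has bound its end: notify it, e.g. after queuing
 * requests on a shared ring. */
static void notify_backend(int port)
{
    evtchn_op_t op;

    op.cmd = EVTCHNOP_send;
    op.u.send.port = port;
    HYPERVISOR_event_channel_op(&op);
}
\end{verbatim}
\normalsize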
2040
2041Event channels are the fundamental communication primitive between
2042Xen domains and seamlessly support SMP. However they provide little
2043bandwidth for communication {\sl per se}, and hence are typically
2044married with a piece of shared memory to produce effective and
2045high-performance inter-domain communication.
2046
2047Safe sharing of memory pages between guest OSes is carried out by
2048granting access on a per page basis to individual domains. This is
2049achieved by using the {\tt grant\_table\_op} hypercall.
2050
2051\begin{quote}
2052\hypercall{grant\_table\_op(unsigned int cmd, void *uop, unsigned int count)}
2053
2054Used to invoke operations on a grant reference, to setup the grant
2055table and to dump the tables' contents for debugging.
2056
2057\end{quote} 
2058
2059\section{IO Configuration} 
2060
2061Domains with physical device access (i.e.\ driver domains) receive
2062limited access to certain PCI devices (bus address space and
2063interrupts). However many guest operating systems attempt to
determine the PCI configuration by directly accessing the PCI BIOS,
2065which cannot be allowed for safety.
2066
2067Instead, Xen provides the following hypercall:
2068
2069\begin{quote}
2070\hypercall{physdev\_op(void *physdev\_op)}
2071
2072Set and query IRQ configuration details, set the system IOPL, set the
2073TSS IO bitmap.
2074
2075\end{quote} 
2076
2077
2078For examples of using {\tt physdev\_op}, see the
Xen-specific PCI code in the Linux sparse tree.
2080
2081\section{Administrative Operations}
2082\label{s:dom0ops}
2083
2084A large number of control operations are available to a sufficiently
2085privileged domain (typically domain 0). These allow the creation and
2086management of new domains, for example. A complete list is given
2087below: for more details on any or all of these, please see
2088{\tt xen/include/public/dom0\_ops.h} 
2089
2090
2091\begin{quote}
2092\hypercall{dom0\_op(dom0\_op\_t *op)} 
2093
2094Administrative domain operations for domain management. The options are:
2095
2096\begin{description} 
2097\item [DOM0\_GETMEMLIST:] get list of pages used by the domain
2098
2099\item [DOM0\_SCHEDCTL:]
2100
2101\item [DOM0\_ADJUSTDOM:] adjust scheduling priorities for domain
2102
2103\item [DOM0\_CREATEDOMAIN:] create a new domain
2104
2105\item [DOM0\_DESTROYDOMAIN:] deallocate all resources associated
2106with a domain
2107
2108\item [DOM0\_PAUSEDOMAIN:] remove a domain from the scheduler run
2109queue.
2110
2111\item [DOM0\_UNPAUSEDOMAIN:] mark a paused domain as schedulable
2112  once again.
2113
2114\item [DOM0\_GETDOMAININFO:] get statistics about the domain
2115
2116\item [DOM0\_SETDOMAININFO:] set VCPU-related attributes
2117
2118\item [DOM0\_MSR:] read or write model specific registers
2119
2120\item [DOM0\_DEBUG:] interactively invoke the debugger
2121
2122\item [DOM0\_SETTIME:] set system time
2123
\item [DOM0\_GETPAGEFRAMEINFO:] get information about a single page frame
2125
2126\item [DOM0\_READCONSOLE:] read console content from hypervisor buffer ring
2127
2128\item [DOM0\_PINCPUDOMAIN:] pin domain to a particular CPU
2129
2130\item [DOM0\_TBUFCONTROL:] get and set trace buffer attributes
2131
2132\item [DOM0\_PHYSINFO:] get information about the host machine
2133
2134\item [DOM0\_SCHED\_ID:] get the ID of the current Xen scheduler
2135
2136\item [DOM0\_SHADOW\_CONTROL:] switch between shadow page-table modes
2137
2138\item [DOM0\_SETDOMAINMAXMEM:] set maximum memory allocation of a domain
2139
2140\item [DOM0\_GETPAGEFRAMEINFO2:] batched interface for getting
2141page frame info
2142
2143\item [DOM0\_ADD\_MEMTYPE:] set MTRRs
2144
2145\item [DOM0\_DEL\_MEMTYPE:] remove a memory type range
2146
2147\item [DOM0\_READ\_MEMTYPE:] read MTRR
2148
2149\item [DOM0\_PERFCCONTROL:] control Xen's software performance
2150counters
2151
2152\item [DOM0\_MICROCODE:] update CPU microcode
2153
2154\item [DOM0\_IOPORT\_PERMISSION:] modify domain permissions for an
2155IO port range (enable / disable a range for a particular domain)
2156
2157\item [DOM0\_GETVCPUCONTEXT:] get context from a VCPU
2158
2159\item [DOM0\_GETVCPUINFO:] get current state for a VCPU
2160\item [DOM0\_GETDOMAININFOLIST:] batched interface to get domain
2161info
2162
2163\item [DOM0\_PLATFORM\_QUIRK:] inform Xen of a platform quirk it
2164needs to handle (e.g. noirqbalance)
2165
2166\item [DOM0\_PHYSICAL\_MEMORY\_MAP:] get info about dom0's memory
2167map
2168
2169\item [DOM0\_MAX\_VCPUS:] change max number of VCPUs for a domain
2170
2171\item [DOM0\_SETDOMAINHANDLE:] set the handle for a domain
2172
2173\end{description} 
2174\end{quote} 
2175
2176Most of the above are best understood by looking at the code
2177implementing them (in {\tt xen/common/dom0\_ops.c}) and in
2178the user-space tools that use them (mostly in {\tt tools/libxc}).
2179
2180\section{Access Control Module Hypercalls}
2181\label{s:acmops}
2182
2183Hypercalls relating to the management of the Access Control Module are
2184also restricted to domain 0 access for now. For more details on any or
2185all of these, please see {\tt xen/include/public/acm\_ops.h}.  A
2186complete list is given below:
2187
2188\begin{quote}
2189
2190\hypercall{acm\_op(int cmd, void *args)}
2191
2192This hypercall can be used to configure the state of the ACM, query
2193that state, request access control decisions and dump additional
2194information.
2195
2196\begin{description}
2197
2198\item [ACMOP\_SETPOLICY:] set the access control policy
2199
2200\item [ACMOP\_GETPOLICY:] get the current access control policy and
2201  status
2202
2203\item [ACMOP\_DUMPSTATS:] get current access control hook invocation
2204  statistics
2205
2206\item [ACMOP\_GETSSID:] get security access control information for a
2207  domain
2208
2209\item [ACMOP\_GETDECISION:] get access decision based on the currently
2210  enforced access control policy
2211
2212\end{description}
2213\end{quote}
2214
2215Most of the above are best understood by looking at the code
2216implementing them (in {\tt xen/common/acm\_ops.c}) and in the
2217user-space tools that use them (mostly in {\tt tools/security} and
2218{\tt tools/python/xen/lowlevel/acm}).
2219
2220
2221\section{Debugging Hypercalls} 
2222
2223A few additional hypercalls are mainly useful for debugging:
2224
2225\begin{quote} 
2226\hypercall{console\_io(int cmd, int count, char *str)}
2227
2228Use Xen to interact with the console; operations are:
2229
{\tt CONSOLEIO\_write}: output {\bf count} characters from buffer {\bf str}.

{\tt CONSOLEIO\_read}: input at most {\bf count} characters into buffer {\bf str}.
2233\end{quote} 
2234
2235A pair of hypercalls allows access to the underlying debug registers:
2236\begin{quote}
2237\hypercall{set\_debugreg(int reg, unsigned long value)}
2238
2239Set debug register {\bf reg} to {\bf value} 
2240
2241\hypercall{get\_debugreg(int reg)}
2242
2243Return the contents of the debug register {\bf reg}
2244\end{quote}
2245
2246And finally:
2247\begin{quote}
2248\hypercall{xen\_version(int cmd)}
2249
2250Request Xen version number.
2251\end{quote} 
2252
2253This is useful to ensure that user-space tools are in sync
2254with the underlying hypervisor.
2255
2256
2257\end{document}