\documentclass[11pt,twoside,final,openright]{report}
\usepackage{a4,graphicx,html,setspace,times}
\usepackage{comment,parskip}
\setstretch{1.15}

% LIBRARY FUNCTIONS

\newcommand{\hypercall}[1]{\vspace{2mm}{\sf #1}}

\begin{document}

% TITLE PAGE
\pagestyle{empty}
\begin{center}
\vspace*{\fill}
\includegraphics{figs/xenlogo.eps}
\vfill
\vfill
\vfill
\begin{tabular}{l}
{\Huge \bf Interface manual} \\[4mm]
{\huge Xen v3.0 for x86} \\[80mm]

{\Large Xen is Copyright (c) 2002-2005, The Xen Team} \\[3mm]
{\Large University of Cambridge, UK} \\[20mm]
\end{tabular}
\end{center}

{\bf DISCLAIMER: This documentation is always under active development
and as such there may be mistakes and omissions --- watch out for
these and please report any you find to the developer's mailing list.
The latest version is always available on-line. Contributions of
material, suggestions and corrections are welcome. }

\vfill
\cleardoublepage

% TABLE OF CONTENTS
\pagestyle{plain}
\pagenumbering{roman}
{ \parskip 0pt plus 1pt
\tableofcontents }
\cleardoublepage

% PREPARE FOR MAIN TEXT
\pagenumbering{arabic}
\raggedbottom
\widowpenalty=10000
\clubpenalty=10000
\parindent=0pt
\parskip=5pt
\renewcommand{\topfraction}{.8}
\renewcommand{\bottomfraction}{.8}
\renewcommand{\textfraction}{.2}
\renewcommand{\floatpagefraction}{.8}
\setstretch{1.1}
\chapter{Introduction}

Xen allows the hardware resources of a machine to be virtualized and
dynamically partitioned, allowing multiple different {\em guest}
operating system images to be run simultaneously. Virtualizing the
machine in this manner provides considerable flexibility, for example
allowing different users to choose their preferred operating system
(e.g., Linux, NetBSD, or a custom operating system). Furthermore, Xen
provides secure partitioning between virtual machines (known as
{\em domains} in Xen terminology), and enables better resource
accounting and QoS isolation than can be achieved with a conventional
operating system.

Xen essentially takes a `whole machine' virtualization approach as
pioneered by IBM VM/370. However, unlike VM/370 or more recent
efforts such as VMware and Virtual PC, Xen does not attempt to
completely virtualize the underlying hardware. Instead parts of the
hosted guest operating systems are modified to work with the VMM; the
operating system is effectively ported to a new target architecture,
typically requiring changes in just the machine-dependent code. The
user-level API is unchanged, and so existing binaries and operating
system distributions work without modification.

In addition to exporting virtualized instances of CPU, memory, network
and block devices, Xen exposes a control interface to manage how these
resources are shared between the running domains. Access to the
control interface is restricted: it may only be used by one
specially-privileged VM, known as {\em domain 0}. This domain is a
required part of any Xen-based server and runs the application software
that manages the control-plane aspects of the platform. Running the
control software in {\it domain 0}, distinct from the hypervisor
itself, allows the Xen framework to separate the notions of
mechanism and policy within the system.


\chapter{Virtual Architecture}

In a Xen/x86 system, only the hypervisor runs with full processor
privileges ({\it ring 0} in the x86 four-ring model). It has full
access to the physical memory available in the system and is
responsible for allocating portions of it to running domains.

On a 32-bit x86 system, guest operating systems may use {\it rings 1},
{\it 2} and {\it 3} as they see fit. Segmentation is used to prevent
the guest OS from accessing the portion of the address space that is
reserved for Xen. We expect most guest operating systems will use
ring 1 for their own operation and place applications in ring 3.

On 64-bit systems it is not possible to protect the hypervisor from
untrusted guest code running in rings 1 and 2. Guests are therefore
restricted to run in ring 3 only. The guest kernel is protected from its
applications by context switching between the kernel and currently
running application.

In this chapter we consider the basic virtual architecture provided by
Xen: CPU state, exception and interrupt handling, and time.
Other aspects such as memory and device access are discussed in later
chapters.

\section{CPU state}

All privileged state must be handled by Xen. The guest OS has no
direct access to CR3 and is not permitted to update privileged bits in
EFLAGS. Guest OSes use \emph{hypercalls} to invoke operations in Xen;
these are analogous to system calls but occur from ring 1 to ring 0.

A list of all hypercalls is given in Appendix~\ref{a:hypercalls}.


\section{Exceptions}

A virtual IDT is provided --- a domain can submit a table of trap
handlers to Xen via the {\bf set\_trap\_table} hypercall. The
exception stack frame presented to a virtual trap handler is identical
to its native equivalent.


\section{Interrupts and events}

Interrupts are virtualized by mapping them to \emph{event channels},
which are delivered asynchronously to the target domain using a callback
supplied via the {\bf set\_callbacks} hypercall. A guest OS can map
these events onto its standard interrupt dispatch mechanisms. Xen is
responsible for determining the target domain that will handle each
physical interrupt source. For more details on the binding of event
sources to event channels, see Chapter~\ref{c:devices}.


\section{Time}

Guest operating systems need to be aware of the passage of both real
(or wallclock) time and their own `virtual time' (the time for which
they have been executing). Furthermore, Xen has a notion of time which
is used for scheduling. The following notions of time are provided:

\begin{description}
\item[Cycle counter time.]

This provides a fine-grained time reference. The cycle counter time
is used to accurately extrapolate the other time references. On SMP
machines it is currently assumed that the cycle counter time is
synchronized between CPUs. The current x86-based implementation
achieves this within inter-CPU communication latencies.

\item[System time.]

This is a 64-bit counter which holds the number of nanoseconds that
have elapsed since system boot.

\item[Wall clock time.]

This is the time of day in a Unix-style {\bf struct timeval}
(seconds and microseconds since 1 January 1970, adjusted by leap
seconds). An NTP client hosted by {\it domain 0} can keep this
value accurate.

\item[Domain virtual time.]

This progresses at the same pace as system time, but only while a
domain is executing --- it stops while a domain is de-scheduled.
Therefore the share of the CPU that a domain receives is indicated
by the rate at which its virtual time increases.

\end{description}


Xen exports timestamps for system time and wall-clock time to guest
operating systems through a shared page of memory. Xen also provides
the cycle counter time at the instant the timestamps were calculated,
and the CPU frequency in Hertz. This allows the guest to extrapolate
system and wall-clock times accurately based on the current cycle
counter time.
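As a concrete illustration, the extrapolation can be sketched in C as
below. The structure and field names are simplified stand-ins for the
values Xen exports (the real layout appears in the {\bf
vcpu\_time\_info} structure documented later in this manual), and the
naive multiply-then-divide shown here can overflow for very large cycle
deltas, so real guests use scaled fixed-point arithmetic instead:

```c
#include <stdint.h>

/* Snapshot of the values Xen exports (simplified field names). */
struct time_snapshot {
    uint64_t system_time_ns; /* system time when the snapshot was taken */
    uint64_t tsc_at_snap;    /* cycle counter when the snapshot was taken */
    uint64_t cpu_freq_hz;    /* CPU frequency in Hertz */
};

/* Extrapolate the current system time from the current cycle counter. */
uint64_t current_system_time_ns(const struct time_snapshot *s,
                                uint64_t tsc_now)
{
    uint64_t cycles = tsc_now - s->tsc_at_snap;
    return s->system_time_ns + cycles * 1000000000ULL / s->cpu_freq_hz;
}
```

The same scheme extrapolates wall-clock time from its exported base value.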

Since all time stamps need to be updated and read \emph{atomically},
a version number is also stored in the shared info page, which is
incremented before and after updating the timestamps. Thus a guest can
be sure that it read a consistent state by checking that the two version
numbers are equal and even.
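In C, a reader following this protocol looks like the sketch below. The
structure is a simplified stand-in for the shared-page fields, and the
memory barriers a real guest would need between the reads are omitted
for clarity:

```c
#include <stdint.h>

/* Simplified stand-in for the timestamp fields in the shared page. */
struct time_info {
    uint32_t version;        /* odd while an update is in progress */
    uint64_t system_time_ns; /* nanoseconds since boot */
};

/* Writer (Xen): increment the version before and after the update,
 * so it is odd for exactly the duration of the update. */
void update_time(struct time_info *t, uint64_t now_ns)
{
    t->version++;            /* odd: update in progress */
    t->system_time_ns = now_ns;
    t->version++;            /* even: update complete */
}

/* Reader (guest): retry until two reads of the version are equal
 * and even, guaranteeing a consistent snapshot. */
uint64_t read_time(const struct time_info *t)
{
    uint32_t v;
    uint64_t ns;
    do {
        v = t->version;
        ns = t->system_time_ns;
    } while (v != t->version || (v & 1));
    return ns;
}
```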

Xen includes a periodic ticker which sends a timer event to the
currently executing domain every 10ms. The Xen scheduler also sends a
timer event whenever a domain is scheduled; this allows the guest OS
to adjust for the time that has passed while it has been inactive. In
addition, Xen allows each domain to request that it receive a timer
event sent at a specified system time by using the {\bf
set\_timer\_op} hypercall. Guest OSes may use this timer to
implement timeout values when they block.


\section{Xen CPU Scheduling}

Xen offers a uniform API for CPU schedulers. It is possible to choose
from a number of schedulers at boot and it should be easy to add more.
The SEDF and Credit schedulers are part of the normal Xen
distribution. SEDF will be going away and its use should be
avoided once the credit scheduler has stabilized and become the default.
The Credit scheduler provides proportional fair shares of the
host's CPUs to the running domains. It does this while transparently
load balancing runnable VCPUs across the whole system.

\paragraph*{Note: SMP host support}
Xen has always supported SMP host systems. When using the credit scheduler,
a domain's VCPUs will be dynamically moved across physical CPUs to maximise
domain and system throughput. VCPUs can also be manually restricted to be
mapped only on a subset of the host's physical CPUs, using the pinning
mechanism.


%% More information on the characteristics and use of these schedulers
%% is available in {\bf Sched-HOWTO.txt}.


\section{Privileged operations}

Xen exports an extended interface to privileged domains (viz.\ {\it
Domain 0}). This allows such domains to build and boot other domains
on the server, and provides control interfaces for managing
scheduling, memory, networking, and block devices.

\chapter{Memory}
\label{c:memory}

Xen is responsible for managing the allocation of physical memory to
domains, and for ensuring safe use of the paging and segmentation
hardware.


\section{Memory Allocation}

As well as allocating a portion of physical memory for its own private
use, Xen also reserves a small fixed portion of every virtual address
space. This is located in the top 64MB on 32-bit systems, the top
168MB on PAE systems, and a larger portion in the middle of the
address space on 64-bit systems. Unreserved physical memory is
available for allocation to domains at a page granularity. Xen tracks
the ownership and use of each page, which allows it to enforce secure
partitioning between domains.

Each domain has a maximum and current physical memory allocation. A
guest OS may run a `balloon driver' to dynamically adjust its current
memory allocation up to its limit.


\section{Pseudo-Physical Memory}

Since physical memory is allocated and freed on a page granularity,
there is no guarantee that a domain will receive a contiguous stretch
of physical memory. However, most operating systems do not have good
support for operating in a fragmented physical address space. To aid
porting such operating systems to run on top of Xen, we make a
distinction between \emph{machine memory} and \emph{pseudo-physical
memory}.

Put simply, machine memory refers to the entire amount of memory
installed in the machine, including that reserved by Xen, in use by
various domains, or currently unallocated. We consider machine memory
to comprise a set of 4kB \emph{machine page frames} numbered
consecutively starting from 0. Machine frame numbers mean the same
within Xen or any domain.

Pseudo-physical memory, on the other hand, is a per-domain
abstraction. It allows a guest operating system to consider its memory
allocation to consist of a contiguous range of physical page frames
starting at physical frame 0, despite the fact that the underlying
machine page frames may be sparsely allocated and in any order.

To achieve this, Xen maintains a globally readable {\it
machine-to-physical} table which records the mapping from machine
page frames to pseudo-physical ones. In addition, each domain is
supplied with a {\it physical-to-machine} table which performs the
inverse mapping. Clearly the machine-to-physical table has size
proportional to the amount of RAM installed in the machine, while each
physical-to-machine table has size proportional to the memory
allocation of the given domain.

Architecture-dependent code in guest operating systems can then use
the two tables to provide the abstraction of pseudo-physical memory.
In general, only certain specialized parts of the operating system
(such as page table management) need to understand the difference
between machine and pseudo-physical addresses.
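The use of the two tables can be sketched as below. The array names and
sizes are purely illustrative (a toy machine of 16 frames, a domain of
4); in reality the machine-to-physical table is supplied read-only by
Xen rather than built by the guest:

```c
#define NR_PAGES 4  /* illustrative domain allocation of four pages */

/* Physical-to-machine: per-domain, indexed by pseudo-physical frame
 * number; the machine frames backing this domain are illustrative. */
static unsigned long phys_to_machine[NR_PAGES] = { 7, 2, 9, 4 };

/* Machine-to-physical: global, indexed by machine frame number. */
static unsigned long machine_to_phys[16];

/* Build the inverse mapping for this domain's frames (done here only
 * so the toy example is self-contained). */
void build_m2p(void)
{
    for (unsigned long pfn = 0; pfn < NR_PAGES; pfn++)
        machine_to_phys[phys_to_machine[pfn]] = pfn;
}

unsigned long pfn_to_mfn(unsigned long pfn) { return phys_to_machine[pfn]; }
unsigned long mfn_to_pfn(unsigned long mfn) { return machine_to_phys[mfn]; }
```

A guest's page-table code uses {\sf pfn\_to\_mfn()} when writing page-table
entries, and {\sf mfn\_to\_pfn()} when interpreting them.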


\section{Page Table Updates}

In the default mode of operation, Xen enforces read-only access to
page tables and requires guest operating systems to explicitly request
any modifications. Xen validates all such requests and only applies
updates that it deems safe. This is necessary to prevent domains from
adding arbitrary mappings to their page tables.

To aid validation, Xen associates a type and reference count with each
memory page. A page has one of the following mutually-exclusive types
at any point in time: page directory ({\sf PD}), page table ({\sf
PT}), local descriptor table ({\sf LDT}), global descriptor table
({\sf GDT}), or writable ({\sf RW}). Note that a guest OS may always
create readable mappings of its own memory regardless of its current
type.

%%% XXX: possibly explain more about ref count 'lifecycle' here?
This mechanism is used to maintain the invariants required for safety;
for example, a domain cannot have a writable mapping to any part of a
page table as this would require the page concerned to simultaneously
be of types {\sf PT} and {\sf RW}.

\hypercall{mmu\_update(mmu\_update\_t *req, int count, int *success\_count, domid\_t domid)}

This hypercall is used to make updates to either the domain's
pagetables or to the machine-to-physical mapping table. It supports
submitting a queue of updates, allowing batching for maximal
performance. Explicitly queuing updates using this interface will
cause any outstanding writable pagetable state to be flushed from the
system.
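The following sketch shows how a guest might assemble such a queue. The
two-word request layout, with the update type carried in the low bits
of the pointer word, follows Xen's public headers, but the submission
to the real {\bf mmu\_update} hypercall is left as a stub so the sketch
is self-contained:

```c
#include <stdint.h>

/* Update-type codes carried in the low two bits of the request pointer,
 * as in Xen's public headers. */
#define MMU_NORMAL_PT_UPDATE 0  /* update a page-table entry */
#define MMU_MACHPHYS_UPDATE  1  /* update an entry in the M2P table */

typedef struct mmu_update {
    uint64_t ptr;  /* machine address of the entry; low 2 bits = type */
    uint64_t val;  /* new contents for the entry */
} mmu_update_t;

#define MAX_BATCH 16

static mmu_update_t queue[MAX_BATCH];
static int queued;

/* Queue one page-table-entry update for later submission. */
int queue_pt_update(uint64_t entry_maddr, uint64_t new_val)
{
    if (queued == MAX_BATCH)
        return -1;  /* caller must flush first */
    queue[queued].ptr = (entry_maddr & ~3ULL) | MMU_NORMAL_PT_UPDATE;
    queue[queued].val = new_val;
    queued++;
    return 0;
}

/* Submit the queue; the real hypercall is stubbed out in this sketch. */
int flush_updates(void)
{
    /* e.g. return HYPERVISOR_mmu_update(queue, queued, NULL, DOMID_SELF); */
    queued = 0;
    return 0;
}
```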

\section{Writable Page Tables}

Xen also provides an alternative mode of operation in which guests
have the illusion that their page tables are directly writable. Of
course this is not really the case, since Xen must still validate
modifications to ensure secure partitioning. To this end, Xen traps
any write attempt to a memory page of type {\sf PT} (i.e., one that is
currently part of a page table). If such an access occurs, Xen
temporarily allows write access to that page while at the same time
\emph{disconnecting} it from the page table that is currently in use.
This allows the guest to safely make updates to the page because the
newly-updated entries cannot be used by the MMU until Xen revalidates
and reconnects the page. Reconnection occurs automatically in a
number of situations: for example, when the guest modifies a different
page-table page, when the domain is preempted, or whenever the guest
uses Xen's explicit page-table update interfaces.

Writable pagetable functionality is enabled when the guest requests
it, using a {\bf vm\_assist} hypercall. Writable pagetables do {\em
not} provide full virtualisation of the MMU, so the memory management
code of the guest still needs to be aware that it is running on Xen.
Since the guest's page tables are used directly, it must translate
pseudo-physical addresses to real machine addresses when building page
table entries. The guest may not attempt to map its own pagetables
writably, since this would violate the memory type invariants; page
tables will automatically be made writable by the hypervisor, as
necessary.

\section{Shadow Page Tables}

Finally, Xen also supports a form of \emph{shadow page tables} in
which the guest OS uses an independent copy of page tables which are
unknown to the hardware (i.e.\ which are never pointed to by {\tt
cr3}). Instead Xen propagates changes made to the guest's tables to
the real ones, and vice versa. This is useful for logging page writes
(e.g.\ for live migration or checkpointing). A full version of the
shadow page tables also allows guest OS porting with less effort.


\section{Segment Descriptor Tables}

At start of day a guest is supplied with a default GDT, which does not reside
within its own memory allocation. If the guest wishes to use segments other
than the default `flat' ring-1 and ring-3 segments that this GDT
provides, it must register a custom GDT and/or LDT with Xen, allocated
from its own memory.

The following hypercall is used to specify a new GDT:

\begin{quote}
int {\bf set\_gdt}(unsigned long *{\em frame\_list}, int {\em
entries})

\emph{frame\_list}: An array of up to 14 machine page frames within
which the GDT resides. Any frame registered as a GDT frame may only
be mapped read-only within the guest's address space (e.g., no
writable mappings, no use as a page-table page, and so on). Only 14
pages may be specified because pages 15 and 16 are reserved for
the hypervisor's GDT entries.

\emph{entries}: The number of descriptor-entry slots in the GDT.
\end{quote}
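For illustration, a guest might prepare the frame list as below. The
{\sf pfn\_to\_mfn()} translation here is a hypothetical stand-in for the
guest's physical-to-machine table lookup, and the figure of 512
descriptors per page follows from 8-byte segment descriptors in 4kB
pages:

```c
#define PAGE_SIZE      4096
#define DESCS_PER_PAGE (PAGE_SIZE / 8)  /* 8-byte segment descriptors */
#define MAX_GDT_PAGES  14

/* Illustrative stand-in for the guest's physical-to-machine lookup. */
static unsigned long pfn_to_mfn(unsigned long pfn) { return pfn + 100; }

/* Fill in the machine-frame list for a GDT of 'entries' descriptors
 * held in pages starting at pseudo-physical frame 'first_pfn'.
 * Returns the number of frames used, or -1 if the GDT is too large. */
int prepare_gdt_frames(unsigned long first_pfn, int entries,
                       unsigned long frame_list[MAX_GDT_PAGES])
{
    int pages = (entries + DESCS_PER_PAGE - 1) / DESCS_PER_PAGE;
    if (pages > MAX_GDT_PAGES)
        return -1;
    for (int i = 0; i < pages; i++)
        frame_list[i] = pfn_to_mfn(first_pfn + i);
    return pages;
}
```

The resulting {\sf frame\_list} and {\sf entries} would then be passed
to {\bf set\_gdt}.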

The LDT is updated via the generic MMU update mechanism (i.e., via the
{\bf mmu\_update} hypercall).

\section{Start of Day}

The start-of-day environment for guest operating systems is rather
different to that provided by the underlying hardware. In particular,
the processor is already executing in protected mode with paging
enabled.

{\it Domain 0} is created and booted by Xen itself. For all subsequent
domains, the analogue of the boot-loader is the {\it domain builder},
user-space software running in {\it domain 0}. The domain builder is
responsible for building the initial page tables for a domain and
loading its kernel image at the appropriate virtual address.

\section{VM assists}

Xen provides a number of ``assists'' for guest memory management.
These are available on an ``opt-in'' basis to provide commonly-used
extra functionality to a guest.

\hypercall{vm\_assist(unsigned int cmd, unsigned int type)}

The {\bf cmd} parameter describes the action to be taken, whilst the
{\bf type} parameter describes the kind of assist that is being
referred to. Available commands are as follows:

\begin{description}
\item[VMASST\_CMD\_enable] Enable a particular assist type
\item[VMASST\_CMD\_disable] Disable a particular assist type
\end{description}

And the available types are:

\begin{description}
\item[VMASST\_TYPE\_4gb\_segments] Provide emulated support for
instructions that rely on 4GB segments (such as the techniques used
by some TLS solutions).
\item[VMASST\_TYPE\_4gb\_segments\_notify] Provide a callback to the
guest if the above segment fixups are used: allows the guest to
display a warning message during boot.
\item[VMASST\_TYPE\_writable\_pagetables] Enable writable pagetable
mode --- described above.
\end{description}
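For example, a guest that wants writable pagetables might issue the
hypercall as below. The numeric values shown for the commands and types
match Xen's public headers at the time of writing, but should always be
taken from {\bf xen/include/public/xen.h} in practice; the hypercall
itself is replaced by a recording stub so the sketch is self-contained:

```c
/* Command and type values as defined in xen/include/public/xen.h. */
#define VMASST_CMD_enable               0
#define VMASST_CMD_disable              1
#define VMASST_TYPE_4gb_segments        0
#define VMASST_TYPE_4gb_segments_notify 1
#define VMASST_TYPE_writable_pagetables 2

/* Recording stub standing in for the real hypercall, so the sketch
 * can run without a hypervisor present. */
static unsigned int last_cmd, last_type;
static int HYPERVISOR_vm_assist(unsigned int cmd, unsigned int type)
{
    last_cmd = cmd;
    last_type = type;
    return 0;
}

/* Opt in to writable-pagetable mode. */
int enable_writable_pagetables(void)
{
    return HYPERVISOR_vm_assist(VMASST_CMD_enable,
                                VMASST_TYPE_writable_pagetables);
}
```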


\chapter{Xen Info Pages}

The {\bf Shared info page} is used to share various CPU-related state
between the guest OS and the hypervisor. This information includes VCPU
status, time information and event channel (virtual interrupt) state.
The {\bf Start info page} is used to pass build-time information to
the guest when it boots and when it is resumed from a suspended state.
This chapter documents the fields included in the {\bf
shared\_info\_t} and {\bf start\_info\_t} structures for use by the
guest OS.

\section{Shared info page}

The {\bf shared\_info\_t} is accessed at run time by both Xen and the
guest OS. It is used to pass information relating to the
virtual CPU and virtual machine state between the OS and the
hypervisor.

The structure is declared in {\bf xen/include/public/xen.h}:

\scriptsize
\begin{verbatim}
typedef struct shared_info {
    vcpu_info_t vcpu_info[MAX_VIRT_CPUS];

    /*
     * A domain can create "event channels" on which it can send and receive
     * asynchronous event notifications. There are three classes of event that
     * are delivered by this mechanism:
     *  1. Bi-directional inter- and intra-domain connections. Domains must
     *     arrange out-of-band to set up a connection (usually by allocating
     *     an unbound 'listener' port and advertising that via a storage service
     *     such as xenstore).
     *  2. Physical interrupts. A domain with suitable hardware-access
     *     privileges can bind an event-channel port to a physical interrupt
     *     source.
     *  3. Virtual interrupts ('events'). A domain can bind an event-channel
     *     port to a virtual interrupt source, such as the virtual-timer
     *     device or the emergency console.
     *
     * Event channels are addressed by a "port index". Each channel is
     * associated with two bits of information:
     *  1. PENDING -- notifies the domain that there is a pending notification
     *     to be processed. This bit is cleared by the guest.
     *  2. MASK -- if this bit is clear then a 0->1 transition of PENDING
     *     will cause an asynchronous upcall to be scheduled. This bit is only
     *     updated by the guest. It is read-only within Xen. If a channel
     *     becomes pending while the channel is masked then the 'edge' is lost
     *     (i.e., when the channel is unmasked, the guest must manually handle
     *     pending notifications as no upcall will be scheduled by Xen).
     *
     * To expedite scanning of pending notifications, any 0->1 pending
     * transition on an unmasked channel causes a corresponding bit in a
     * per-vcpu selector word to be set. Each bit in the selector covers a
     * 'C long' in the PENDING bitfield array.
     */
    unsigned long evtchn_pending[sizeof(unsigned long) * 8];
    unsigned long evtchn_mask[sizeof(unsigned long) * 8];

    /*
     * Wallclock time: updated only by control software. Guests should base
     * their gettimeofday() syscall on this wallclock-base value.
     */
    uint32_t wc_version;      /* Version counter: see vcpu_time_info_t. */
    uint32_t wc_sec;          /* Secs  00:00:00 UTC, Jan 1, 1970.  */
    uint32_t wc_nsec;         /* Nsecs 00:00:00 UTC, Jan 1, 1970.  */

    arch_shared_info_t arch;

} shared_info_t;
\end{verbatim}
\normalsize

\begin{description}
\item[vcpu\_info] An array of {\bf vcpu\_info\_t} structures, each of
  which holds either runtime information about a virtual CPU, or is
  ``empty'' if the corresponding VCPU does not exist.
\item[evtchn\_pending] Guest-global array, with one bit per event
  channel. Bits are set if an event is currently pending on that
  channel.
\item[evtchn\_mask] Guest-global array for masking notifications on
  event channels.
\item[wc\_version] Version counter for current wallclock time.
\item[wc\_sec] Whole seconds component of current wallclock time.
\item[wc\_nsec] Nanoseconds component of current wallclock time.
\item[arch] Host architecture-dependent portion of the shared info
  structure.
\end{description}

\subsection{vcpu\_info\_t}

\scriptsize
\begin{verbatim}
typedef struct vcpu_info {
    /*
     * 'evtchn_upcall_pending' is written non-zero by Xen to indicate
     * a pending notification for a particular VCPU. It is then cleared
     * by the guest OS /before/ checking for pending work, thus avoiding
     * a set-and-check race. Note that the mask is only accessed by Xen
     * on the CPU that is currently hosting the VCPU. This means that the
     * pending and mask flags can be updated by the guest without special
     * synchronisation (i.e., no need for the x86 LOCK prefix).
     * This may seem suboptimal because if the pending flag is set by
     * a different CPU then an IPI may be scheduled even when the mask
     * is set. However, note:
     *  1. The task of 'interrupt holdoff' is covered by the per-event-
     *     channel mask bits. A 'noisy' event that is continually being
     *     triggered can be masked at source at this very precise
     *     granularity.
     *  2. The main purpose of the per-VCPU mask is therefore to restrict
     *     reentrant execution: whether for concurrency control, or to
     *     prevent unbounded stack usage. Whatever the purpose, we expect
     *     that the mask will be asserted only for short periods at a time,
     *     and so the likelihood of a 'spurious' IPI is suitably small.
     * The mask is read before making an event upcall to the guest: a
     * non-zero mask therefore guarantees that the VCPU will not receive
     * an upcall activation. The mask is cleared when the VCPU requests
     * to block: this avoids wakeup-waiting races.
     */
    uint8_t evtchn_upcall_pending;
    uint8_t evtchn_upcall_mask;
    unsigned long evtchn_pending_sel;
    arch_vcpu_info_t arch;
    vcpu_time_info_t time;
} vcpu_info_t; /* 64 bytes (x86) */
\end{verbatim}
\normalsize

\begin{description}
\item[evtchn\_upcall\_pending] This is set non-zero by Xen to indicate
  that there are pending events to be received.
\item[evtchn\_upcall\_mask] This is set non-zero to disable all
  interrupts for this CPU for short periods of time. If individual
  event channels need to be masked, the {\bf evtchn\_mask} in the {\bf
  shared\_info\_t} is used instead.
\item[evtchn\_pending\_sel] When an event is delivered to this VCPU, a
  bit is set in this selector to indicate which word of the {\bf
  evtchn\_pending} array in the {\bf shared\_info\_t} contains the
  event in question.
\item[arch] Architecture-specific VCPU info. On x86 this contains the
  virtualized CR2 register (page fault linear address) for this VCPU.
\item[time] Time values for this VCPU.
\end{description}
---|
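The interplay of these three fields can be seen in a sketch of a guest's upcall dispatch loop.  The following is an illustrative simplification, not Linux's actual handler: the structures are trimmed mirrors of the ones described above, the demo handler is hypothetical, and a real guest would use atomic (locked) bit operations and memory barriers where this sketch uses plain loads and stores.

```c
#include <stdint.h>

#define NR_EVENT_WORDS 32
#define BITS_PER_LONG  (8 * (unsigned int)sizeof(unsigned long))

/* Trimmed mirrors of the structures described above. */
struct shared_info {
    unsigned long evtchn_pending[NR_EVENT_WORDS];
    unsigned long evtchn_mask[NR_EVENT_WORDS];
};

struct vcpu_info {
    uint8_t       evtchn_upcall_pending;
    uint8_t       evtchn_upcall_mask;
    unsigned long evtchn_pending_sel;
};

/* Fetch-and-clear; a real guest uses an atomic xchg here. */
static unsigned long grab(unsigned long *p)
{
    unsigned long v = *p;
    *p = 0;
    return v;
}

/* Scan the selector, then the selected pending words, invoking the
 * handler once per pending, unmasked event channel port. */
void do_upcall(struct vcpu_info *v, struct shared_info *s,
               void (*handler)(unsigned int port))
{
    /* Clear the pending flag BEFORE scanning for work, avoiding the
     * set-and-check race described in the comment above. */
    v->evtchn_upcall_pending = 0;

    unsigned long sel = grab(&v->evtchn_pending_sel);
    while (sel != 0) {
        unsigned int word = (unsigned int)__builtin_ctzl(sel);
        sel &= sel - 1;                      /* clear lowest set bit */

        unsigned long bits = s->evtchn_pending[word]
                             & ~s->evtchn_mask[word];
        while (bits != 0) {
            unsigned int bit = (unsigned int)__builtin_ctzl(bits);
            bits &= bits - 1;
            s->evtchn_pending[word] &= ~(1UL << bit);  /* ack the event */
            handler(word * BITS_PER_LONG + bit);
        }
    }
}

/* Tiny demo handler: records the last port delivered. */
static unsigned int last_port = ~0u;
static void record_port(unsigned int port) { last_port = port; }
```

Note that masked events are deliberately left pending: when the guest later clears the mask bit, it must re-check the pending word, exactly as the chapter on event channels requires.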

\subsection{vcpu\_time\_info}

\scriptsize
\begin{verbatim}
typedef struct vcpu_time_info {
    /*
     * Updates to the following values are preceded and followed by an
     * increment of 'version'. The guest can therefore detect updates by
     * looking for changes to 'version'. If the least-significant bit of
     * the version number is set then an update is in progress and the guest
     * must wait to read a consistent set of values.
     * The correct way to interact with the version number is similar to
     * Linux's seqlock: see the implementations of read_seqbegin/read_seqretry.
     */
    uint32_t version;
    uint32_t pad0;
    uint64_t tsc_timestamp;   /* TSC at last update of time vals.  */
    uint64_t system_time;     /* Time, in nanosecs, since boot.    */
    /*
     * Current system time:
     *   system_time + ((tsc - tsc_timestamp) << tsc_shift) * tsc_to_system_mul
     * CPU frequency (Hz):
     *   ((10^9 << 32) / tsc_to_system_mul) >> tsc_shift
     */
    uint32_t tsc_to_system_mul;
    int8_t   tsc_shift;
    int8_t   pad1[3];
} vcpu_time_info_t; /* 32 bytes */
\end{verbatim}
\normalsize

\begin{description}
\item[version] Used to ensure the guest gets consistent time updates.
\item[tsc\_timestamp] Cycle counter timestamp of the last time value;
  can be used, for instance, to extrapolate between updates.
\item[system\_time] Time since boot (nanoseconds).
\item[tsc\_to\_system\_mul] Cycle counter to nanoseconds multiplier
  (used in extrapolating current time).
\item[tsc\_shift] Cycle counter to nanoseconds shift (used in
  extrapolating current time).
\end{description}
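The extrapolation formula and the {\tt version} protocol combine into a consistent-read routine.  The C sketch below is illustrative only (it is not code from Xen or Linux): the structure is a simplified mirror of {\tt vcpu\_time\_info\_t}, the 128-bit multiply relies on a gcc/clang extension, and real guest code must additionally place compiler/CPU barriers around the {\tt version} reads.

```c
#include <stdint.h>

/* Simplified mirror of vcpu_time_info_t (padding omitted). */
struct vcpu_time_info {
    uint32_t version;
    uint64_t tsc_timestamp;
    uint64_t system_time;
    uint32_t tsc_to_system_mul;  /* 32.32 fixed-point ns per scaled tick */
    int8_t   tsc_shift;
};

/* Scale a TSC delta to nanoseconds: shift, then apply the 32.32
 * fixed-point multiplier (64x32 multiply, keep bits 32..95). */
static uint64_t scale_delta(uint64_t delta, uint32_t mul, int8_t shift)
{
    if (shift >= 0)
        delta <<= shift;
    else
        delta >>= -shift;
    /* unsigned __int128 is a gcc/clang extension. */
    return (uint64_t)(((unsigned __int128)delta * mul) >> 32);
}

/* Seqlock-style consistent read, per the 'version' protocol above:
 * retry while an update is in progress (odd version) or the version
 * changed under us. */
uint64_t system_time_ns(const volatile struct vcpu_time_info *t,
                        uint64_t tsc)
{
    uint32_t ver, mul;
    uint64_t stamp, base;
    int8_t shift;
    do {
        ver   = t->version;
        stamp = t->tsc_timestamp;
        base  = t->system_time;
        mul   = t->tsc_to_system_mul;
        shift = t->tsc_shift;
    } while ((ver & 1) || ver != t->version);
    return base + scale_delta(tsc - stamp, mul, shift);
}

/* CPU frequency (Hz), per the formula quoted in the structure. */
uint64_t cpu_freq_hz(uint32_t mul, int8_t shift)
{
    uint64_t f = ((uint64_t)1000000000 << 32) / mul;
    return (shift >= 0) ? (f >> shift) : (f << -shift);
}
```

For example, a multiplier of {\tt 0x80000000} (0.5 in 32.32 fixed point) with a shift of 1 describes a 1\,GHz cycle counter: each tick is doubled and then halved to one nanosecond.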

\subsection{arch\_shared\_info\_t}

On x86, the {\bf arch\_shared\_info\_t} is defined as follows (from
xen/public/arch-x86\_32.h):

\scriptsize
\begin{verbatim}
typedef struct arch_shared_info {
    unsigned long max_pfn;        /* max pfn that appears in table */
    /* Frame containing list of mfns containing list of mfns containing p2m. */
    unsigned long pfn_to_mfn_frame_list_list;
} arch_shared_info_t;
\end{verbatim}
\normalsize

\begin{description}
\item[max\_pfn] The maximum PFN listed in the physical-to-machine
  mapping table (P2M table).
\item[pfn\_to\_mfn\_frame\_list\_list] Machine address of the frame
  that contains the machine addresses of the P2M table frames.
\end{description}

\section{Start info page}

The start info structure is declared as follows (in {\bf
xen/include/public/xen.h}):

\scriptsize
\begin{verbatim}
#define MAX_GUEST_CMDLINE 1024
typedef struct start_info {
    /* THE FOLLOWING ARE FILLED IN BOTH ON INITIAL BOOT AND ON RESUME.    */
    char magic[32];             /* "Xen-<version>.<subversion>".          */
    unsigned long nr_pages;     /* Total pages allocated to this domain.  */
    unsigned long shared_info;  /* MACHINE address of shared info struct. */
    uint32_t flags;             /* SIF_xxx flags.                         */
    unsigned long store_mfn;    /* MACHINE page number of shared page.    */
    uint32_t store_evtchn;      /* Event channel for store communication. */
    unsigned long console_mfn;  /* MACHINE address of console page.       */
    uint32_t console_evtchn;    /* Event channel for console messages.    */
    /* THE FOLLOWING ARE ONLY FILLED IN ON INITIAL BOOT (NOT RESUME).     */
    unsigned long pt_base;      /* VIRTUAL address of page directory.     */
    unsigned long nr_pt_frames; /* Number of bootstrap p.t. frames.       */
    unsigned long mfn_list;     /* VIRTUAL address of page-frame list.    */
    unsigned long mod_start;    /* VIRTUAL address of pre-loaded module.  */
    unsigned long mod_len;      /* Size (bytes) of pre-loaded module.     */
    int8_t cmd_line[MAX_GUEST_CMDLINE];
} start_info_t;
\end{verbatim}
\normalsize

The fields fall into two groups: the first group is filled in whenever
a domain is booted or resumed; the second is only filled in at boot
time.

The always-available group is as follows:

\begin{description}
\item[magic] A text string identifying the Xen version to the guest.
\item[nr\_pages] The number of real machine pages available to the
  guest.
\item[shared\_info] Machine address of the shared info structure,
  allowing the guest to map it during initialisation.
\item[flags] Flags describing optional extra settings to the
  guest.
\item[store\_mfn] Machine address of the Xenstore communications page.
\item[store\_evtchn] Event channel to communicate with the store.
\item[console\_mfn] Machine address of the console data page.
\item[console\_evtchn] Event channel to notify the console backend.
\end{description}

The boot-only group may only safely be referred to during system boot:

\begin{description}
\item[pt\_base] Virtual address of the page directory created for us
  by the domain builder.
\item[nr\_pt\_frames] Number of frames used by the domain builder's
  bootstrap pagetables.
\item[mfn\_list] Virtual address of the list of machine frames this
  domain owns.
\item[mod\_start] Virtual address of any pre-loaded module
  (e.g., a ramdisk).
\item[mod\_len] Size of the pre-loaded module (if any).
\item[cmd\_line] Kernel command line passed by the domain builder.
\end{description}
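As an illustration of how a guest consumes the always-available group, the {\bf magic} string can be validated at start of day before anything else in the structure is trusted.  The helpers below are a hypothetical sketch rather than code from any real guest kernel; the {\tt "Xen-<version>.<subversion>"} format they assume is the one documented in the structure above.

```c
#include <stdio.h>

/* Parse a start_info magic string of the form "Xen-<major>.<minor>".
 * Returns 1 and fills in the version numbers on success, 0 otherwise. */
int parse_xen_magic(const char *magic, int *major, int *minor)
{
    return sscanf(magic, "Xen-%d.%d", major, minor) == 2;
}

/* A guest might then insist on a minimum hypervisor major version
 * before continuing with boot. */
int xen_version_ok(const char *magic, int min_major)
{
    int ma, mi;
    return parse_xen_magic(magic, &ma, &mi) && ma >= min_major;
}
```

A guest booting under the hypervisor described here would see a magic string of {\tt "Xen-3.0"}.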


% by Mark Williamson <mark.williamson@cl.cam.ac.uk>

\chapter{Event Channels}
\label{c:eventchannels}

Event channels are the basic primitive provided by Xen for event
notifications.  An event is the Xen equivalent of a hardware
interrupt.  An event channel essentially stores one bit of
information; the event of interest is signalled by transitioning this
bit from 0 to 1.

Notifications are received by a guest via an upcall from Xen,
indicating when an event arrives (setting the bit).  Further
notifications are masked until the bit is cleared again (therefore,
guests must check the value of the bit after re-enabling event
delivery to ensure no missed notifications).

Event notifications can be masked by setting a flag; this is
equivalent to disabling interrupts and can be used to ensure atomicity
of certain operations in the guest kernel.

\section{Hypercall interface}

\hypercall{event\_channel\_op(evtchn\_op\_t *op)}

The event channel operation hypercall is used for all operations on
event channels / ports.  Operations are distinguished by the value of
the {\bf cmd} field of the {\bf op} structure.  The possible commands
are described below:

\begin{description}

\item[EVTCHNOP\_alloc\_unbound]
  Allocate a new event channel port, ready to be connected to by a
  remote domain.
  \begin{itemize}
  \item Specified domain must exist.
  \item A free port must exist in that domain.
  \end{itemize}
  Unprivileged domains may only allocate their own ports; privileged
  domains may also allocate ports in other domains.
\item[EVTCHNOP\_bind\_interdomain]
  Bind an event channel for interdomain communications.
  \begin{itemize}
  \item Caller domain must have a free port to bind.
  \item Remote domain must exist.
  \item Remote port must be allocated and currently unbound.
  \item Remote port must be expecting the caller domain as the ``remote''.
  \end{itemize}
\item[EVTCHNOP\_bind\_virq]
  Allocate a port and bind a VIRQ to it.
  \begin{itemize}
  \item Caller domain must have a free port to bind.
  \item VIRQ must be valid.
  \item VCPU must exist.
  \item VIRQ must not currently be bound to an event channel.
  \end{itemize}
\item[EVTCHNOP\_bind\_ipi]
  Allocate and bind a port for notifying other virtual CPUs.
  \begin{itemize}
  \item Caller domain must have a free port to bind.
  \item VCPU must exist.
  \end{itemize}
\item[EVTCHNOP\_bind\_pirq]
  Allocate and bind a port to a real IRQ.
  \begin{itemize}
  \item Caller domain must have a free port to bind.
  \item PIRQ must be within the valid range.
  \item Another binding for this PIRQ must not exist for this domain.
  \item Caller must have an available port.
  \end{itemize}
\item[EVTCHNOP\_close]
  Close an event channel (no more events will be received).
  \begin{itemize}
  \item Port must be valid (currently allocated).
  \end{itemize}
\item[EVTCHNOP\_send] Send a notification on an event channel attached
  to a port.
  \begin{itemize}
  \item Port must be valid.
  \item Only valid for interdomain, IPI, or allocated-unbound ports.
  \end{itemize}
\item[EVTCHNOP\_status] Query the status of a port: what kind of port
  it is, whether it is bound, what remote domain is expected, what PIRQ
  or VIRQ it is bound to, what VCPU will be notified, etc.
  Unprivileged domains may only query the state of their own ports.
  Privileged domains may query any port.
\item[EVTCHNOP\_bind\_vcpu] Bind an event channel to a particular VCPU:
  notification upcalls will be received only on that VCPU.
  \begin{itemize}
  \item VCPU must exist.
  \item Port must be valid.
  \item Event channel must be either: allocated but unbound, bound to
    an interdomain event channel, or bound to a PIRQ.
  \end{itemize}

\end{description}
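The preconditions above imply a particular connection handshake: domain A allocates an unbound port naming B as the expected remote, communicates the port number to B out of band (typically via Xenstore), and B then binds one of its own free ports to it.  The toy model below illustrates only that ordering and its precondition checks; the structures and functions are hypothetical simplifications, not Xen's actual {\bf evtchn\_op\_t} interface, and no hypercalls are involved.

```c
#include <stdint.h>

#define NR_PORTS 16

enum port_state { PORT_FREE, PORT_UNBOUND, PORT_INTERDOMAIN };

struct port {
    enum port_state state;
    uint16_t remote_dom;    /* expected/connected peer domain */
    uint16_t remote_port;
};

struct domain {
    uint16_t domid;
    struct port ports[NR_PORTS];
};

/* EVTCHNOP_alloc_unbound (simplified): allocate a free port in 'd',
 * ready to be connected to by 'remote_dom'.  Returns port or -1. */
int alloc_unbound(struct domain *d, uint16_t remote_dom)
{
    for (int p = 0; p < NR_PORTS; p++) {
        if (d->ports[p].state == PORT_FREE) {
            d->ports[p].state = PORT_UNBOUND;
            d->ports[p].remote_dom = remote_dom;
            return p;
        }
    }
    return -1;                       /* no free port in that domain */
}

/* EVTCHNOP_bind_interdomain (simplified): bind a free local port in
 * 'local' to an allocated, unbound port in 'remote' that is expecting
 * the caller as its peer.  Returns the local port or -1. */
int bind_interdomain(struct domain *local, struct domain *remote, int rport)
{
    struct port *r = &remote->ports[rport];
    if (r->state != PORT_UNBOUND || r->remote_dom != local->domid)
        return -1;                   /* precondition violated */
    int lport = alloc_unbound(local, remote->domid);  /* reuse allocator */
    if (lport < 0)
        return -1;
    local->ports[lport].state = PORT_INTERDOMAIN;
    local->ports[lport].remote_port = (uint16_t)rport;
    r->state = PORT_INTERDOMAIN;
    r->remote_port = (uint16_t)lport;
    return lport;
}
```

Note how the model rejects a bind from any domain other than the one named at allocation time, mirroring the ``remote port must be expecting the caller domain'' rule.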

%%
%% grant_tables.tex
%%
%% Made by Mark Williamson
%% Login <mark@maw48>
%%

\chapter{Grant tables}
\label{c:granttables}

Xen's grant tables provide a generic mechanism for memory sharing
between domains.  This shared memory interface underpins the split
device drivers for block and network IO.

Each domain has its own {\bf grant table}.  This is a data structure
that is shared with Xen; it allows the domain to tell Xen what kind of
permissions other domains have on its pages.  Entries in the grant
table are identified by {\bf grant references}.  A grant reference is
an integer, which indexes into the grant table.  It acts as a
capability which the grantee can use to perform operations on the
granter's memory.

This capability-based system allows shared-memory communications
between unprivileged domains.  A grant reference also encapsulates the
details of a shared page, removing the need for a domain to know the
real machine address of a page it is sharing.  This makes it possible
to share memory correctly with domains running in fully virtualised
memory.

\section{Interface}

\subsection{Grant table manipulation}

Creating and destroying grant references is done by direct access to
the grant table.  This removes the need to involve Xen when creating
grant references, modifying access permissions, etc.  The grantee
domain will invoke hypercalls to use the grant references.  Four main
operations can be accomplished by directly manipulating the table:

\begin{description}
\item[Grant foreign access] allocate a new entry in the grant table
  and fill out the access permissions accordingly.  The access
  permissions will be looked up by Xen when the grantee attempts to
  use the reference to map the granted frame.
\item[End foreign access] check that the grant reference is not
  currently in use, then remove the mapping permissions for the frame.
  This prevents further mappings from taking place but does not allow
  forced revocation of existing mappings.
\item[Grant foreign transfer] allocate a new entry in the table
  specifying transfer permissions for the grantee.  Xen will look up
  this entry when the grantee attempts to transfer a frame to the
  granter.
\item[End foreign transfer] remove permissions to prevent a transfer
  occurring in future.  If the transfer is already committed,
  modifying the grant table cannot prevent it from completing.
\end{description}
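A grant-foreign-access operation is, concretely, just a write to a grant table entry.  The sketch below uses a hypothetical mirror of a version-1 grant entry (flags, domid, frame), with flag values modelled on those in {\bf xen/include/public/grant\_table.h} but not guaranteed to match; as the comment notes, real code must write {\tt domid} and {\tt frame} before setting the flags, with a write barrier in between, because Xen may inspect the entry concurrently.

```c
#include <stdint.h>

/* Hypothetical mirror of a v1 grant entry and its flag bits. */
typedef struct grant_entry {
    uint16_t flags;     /* GTF_xxx; inspected and updated by Xen */
    uint16_t domid;     /* domain being granted access           */
    uint32_t frame;     /* machine frame being granted           */
} grant_entry_t;

#define GTF_invalid        0
#define GTF_permit_access  1
#define GTF_readonly       (1U << 2)  /* grantee may only map read-only  */
#define GTF_reading        (1U << 3)  /* set by Xen while mapped to read  */
#define GTF_writing        (1U << 4)  /* set by Xen while mapped to write */

/* Grant foreign access: fill in the entry, making the flags word
 * valid only after the rest of the entry is in place. */
void gnttab_grant_access(grant_entry_t *e, uint16_t domid,
                         uint32_t frame, int readonly)
{
    e->domid = domid;
    e->frame = frame;
    /* A real implementation needs a write barrier here: Xen must not
     * observe GTF_permit_access before domid/frame are valid. */
    e->flags = GTF_permit_access | (readonly ? GTF_readonly : 0);
}

/* End foreign access: refuse if the grantee still has the frame
 * mapped (Xen keeps the GTF_reading/GTF_writing bits up to date).
 * Returns 1 on success, 0 if the reference is still in use --
 * existing mappings cannot be forcibly revoked. */
int gnttab_end_access(grant_entry_t *e)
{
    if (e->flags & (GTF_reading | GTF_writing))
        return 0;
    e->flags = GTF_invalid;
    return 1;
}
```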

\subsection{Hypercalls}

Use of grant references is accomplished via a hypercall.  The grant
table op hypercall takes three arguments:

\hypercall{grant\_table\_op(unsigned int cmd, void *uop, unsigned int count)}

{\bf cmd} indicates the grant table operation of interest.  {\bf uop}
is a pointer to a structure (or an array of structures) describing the
operation to be performed.  The {\bf count} field describes how many
grant table operations are being batched together.

The core logic is situated in {\bf xen/common/grant\_table.c}.  The
grant table operation hypercall can be used to perform the following
actions:

\begin{description}
\item[GNTTABOP\_map\_grant\_ref] Given a grant reference from another
  domain, map the referred page into the caller's address space.
\item[GNTTABOP\_unmap\_grant\_ref] Remove a mapping to a granted frame
  from the caller's address space.  This is used to voluntarily
  relinquish a mapping to a granted page.
\item[GNTTABOP\_setup\_table] Set up the grant table for the calling
  domain.
\item[GNTTABOP\_dump\_table] Debugging operation.
\item[GNTTABOP\_transfer] Given a transfer reference from another
  domain, transfer ownership of a page frame to that domain.
\end{description}
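To illustrate the batching afforded by {\bf count}, the sketch below prepares an array of map operations for one hypercall.  The structure is a hypothetical mirror of the map-grant-reference operation: the field names follow the public headers, but the exact layout and the {\tt GNTMAP\_host\_map} flag value are illustrative assumptions, and the hypercall invocation itself (architecture-specific) is omitted.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical mirror of the map-grant-reference operation. */
typedef struct gnttab_map_grant_ref {
    /* IN parameters. */
    uint64_t host_addr;   /* address at which to create the mapping    */
    uint32_t flags;       /* GNTMAP_xxx                                */
    uint32_t ref;         /* grant reference to map                    */
    uint16_t dom;         /* granter domain id                         */
    /* OUT parameters, filled in by Xen. */
    int16_t  status;
    uint32_t handle;      /* needed later for GNTTABOP_unmap_grant_ref */
} gnttab_map_grant_ref_t;

#define GNTMAP_host_map (1U << 0)   /* illustrative flag value */

/* Prepare a batch of map operations for consecutive grant references;
 * the caller would then issue ONE grant_table_op hypercall with
 * cmd = GNTTABOP_map_grant_ref and count = n. */
void prepare_map_batch(gnttab_map_grant_ref_t *ops, size_t n,
                       uint16_t dom, uint32_t first_ref,
                       uint64_t base_addr, uint64_t page_size)
{
    for (size_t i = 0; i < n; i++) {
        ops[i].host_addr = base_addr + i * page_size;
        ops[i].flags     = GNTMAP_host_map;
        ops[i].ref       = first_ref + (uint32_t)i;
        ops[i].dom       = dom;
        ops[i].status    = 0;
    }
}
```

Batching this way amortises the hypercall cost over many mappings, which matters for the block and network rings described in the next chapter.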
| 896 | |
---|
| 897 | %% |
---|
| 898 | %% xenstore.tex |
---|
| 899 | %% |
---|
| 900 | %% Made by Mark Williamson |
---|
| 901 | %% Login <mark@maw48> |
---|
| 902 | %% |
---|
| 903 | |
---|
| 904 | \chapter{Xenstore} |
---|
| 905 | |
---|
| 906 | Xenstore is the mechanism by which control-plane activities occur. |
---|
| 907 | These activities include: |
---|
| 908 | |
---|
| 909 | \begin{itemize} |
---|
| 910 | \item Setting up shared memory regions and event channels for use with |
---|
| 911 | the split device drivers. |
---|
| 912 | \item Notifying the guest of control events (e.g. balloon driver |
---|
| 913 | requests) |
---|
| 914 | \item Reporting back status information from the guest |
---|
| 915 | (e.g. performance-related statistics, etc). |
---|
| 916 | \end{itemize} |
---|
| 917 | |
---|
| 918 | The store is arranged as a hierachical collection of key-value pairs. |
---|
| 919 | Each domain has a directory hierarchy containing data related to its |
---|
| 920 | configuration. Domains are permitted to register for notifications |
---|
| 921 | about changes in subtrees of the store, and to apply changes to the |
---|
| 922 | store transactionally. |
---|
| 923 | |
---|
| 924 | \section{Guidelines} |
---|
| 925 | |
---|
| 926 | A few principles govern the operation of the store: |
---|
| 927 | |
---|
| 928 | \begin{itemize} |
---|
| 929 | \item Domains should only modify the contents of their own |
---|
| 930 | directories. |
---|
| 931 | \item The setup protocol for a device channel should simply consist of |
---|
| 932 | entering the configuration data into the store. |
---|
| 933 | \item The store should allow device discovery without requiring the |
---|
| 934 | relevant device drivers to be loaded: a Xen ``bus'' should be |
---|
| 935 | visible to probing code in the guest. |
---|
| 936 | \item The store should be usable for inter-tool communications, |
---|
| 937 | allowing the tools themselves to be decomposed into a number of |
---|
| 938 | smaller utilities, rather than a single monolithic entity. This |
---|
| 939 | also facilitates the development of alternate user interfaces to the |
---|
| 940 | same functionality. |
---|
| 941 | \end{itemize} |
---|
| 942 | |
---|
| 943 | \section{Store layout} |
---|
| 944 | |
---|
| 945 | There are three main paths in XenStore: |
---|
| 946 | |
---|
| 947 | \begin{description} |
---|
| 948 | \item[/vm] stores configuration information about domain |
---|
| 949 | \item[/local/domain] stores information about the domain on the local node (domid, etc.) |
---|
| 950 | \item[/tool] stores information for the various tools |
---|
| 951 | \end{description} |
---|
| 952 | |
---|
| 953 | The {\bf /vm} path stores configuration information for a domain. |
---|
| 954 | This information doesn't change and is indexed by the domain's UUID. |
---|
| 955 | A {\bf /vm} entry contains the following information: |
---|
| 956 | |
---|
| 957 | \begin{description} |
---|
| 958 | \item[uuid] uuid of the domain (somewhat redundant) |
---|
| 959 | \item[on\_reboot] the action to take on a domain reboot request (destroy or restart) |
---|
| 960 | \item[on\_poweroff] the action to take on a domain halt request (destroy or restart) |
---|
| 961 | \item[on\_crash] the action to take on a domain crash (destroy or restart) |
---|
| 962 | \item[vcpus] the number of allocated vcpus for the domain |
---|
| 963 | \item[memory] the amount of memory (in megabytes) for the domain Note: appears to sometimes be empty for domain-0 |
---|
| 964 | \item[vcpu\_avail] the number of active vcpus for the domain (vcpus - number of disabled vcpus) |
---|
| 965 | \item[name] the name of the domain |
---|
| 966 | \end{description} |
---|
| 967 | |
---|
| 968 | |
---|
| 969 | {\bf /vm/$<$uuid$>$/image/} |
---|
| 970 | |
---|
| 971 | The image path is only available for Domain-Us and contains: |
---|
| 972 | \begin{description} |
---|
| 973 | \item[ostype] identifies the builder type (linux or vmx) |
---|
| 974 | \item[kernel] path to kernel on domain-0 |
---|
| 975 | \item[cmdline] command line to pass to domain-U kernel |
---|
| 976 | \item[ramdisk] path to ramdisk on domain-0 |
---|
| 977 | \end{description} |
---|
| 978 | |
---|
| 979 | {\bf /local} |
---|
| 980 | |
---|
| 981 | The {\tt /local} path currently only contains one directory, {\tt |
---|
| 982 | /local/domain} that is indexed by domain id. It contains the running |
---|
| 983 | domain information. The reason to have two storage areas is that |
---|
| 984 | during migration, the uuid doesn't change but the domain id does. The |
---|
| 985 | {\tt /local/domain} directory can be created and populated before |
---|
| 986 | finalizing the migration enabling localhost to localhost migration. |
---|
| 987 | |
---|
| 988 | {\bf /local/domain/$<$domid$>$} |
---|
| 989 | |
---|
| 990 | This path contains: |
---|
| 991 | |
---|
| 992 | \begin{description} |
---|
| 993 | \item[cpu\_time] xend start time (this is only around for domain-0) |
---|
| 994 | \item[handle] private handle for xend |
---|
| 995 | \item[name] see /vm |
---|
| 996 | \item[on\_reboot] see /vm |
---|
| 997 | \item[on\_poweroff] see /vm |
---|
| 998 | \item[on\_crash] see /vm |
---|
| 999 | \item[vm] the path to the VM directory for the domain |
---|
| 1000 | \item[domid] the domain id (somewhat redundant) |
---|
| 1001 | \item[running] indicates that the domain is currently running |
---|
| 1002 | \item[memory] the current memory in megabytes for the domain (empty for domain-0?) |
---|
| 1003 | \item[maxmem\_KiB] the maximum memory for the domain (in kilobytes) |
---|
| 1004 | \item[memory\_KiB] the memory allocated to the domain (in kilobytes) |
---|
| 1005 | \item[cpu] the current CPU the domain is pinned to (empty for domain-0?) |
---|
| 1006 | \item[cpu\_weight] the weight assigned to the domain |
---|
| 1007 | \item[vcpu\_avail] a bitmap telling the domain whether it may use a given VCPU |
---|
| 1008 | \item[online\_vcpus] how many vcpus are currently online |
---|
| 1009 | \item[vcpus] the total number of vcpus allocated to the domain |
---|
| 1010 | \item[console/] a directory for console information |
---|
| 1011 | \begin{description} |
---|
| 1012 | \item[ring-ref] the grant table reference of the console ring queue |
---|
| 1013 | \item[port] the event channel being used for the console ring queue (local port) |
---|
| 1014 | \item[tty] the current tty the console data is being exposed of |
---|
| 1015 | \item[limit] the limit (in bytes) of console data to buffer |
---|
| 1016 | \end{description} |
---|
| 1017 | \item[backend/] a directory containing all backends the domain hosts |
---|
| 1018 | \begin{description} |
---|
| 1019 | \item[vbd/] a directory containing vbd backends |
---|
| 1020 | \begin{description} |
---|
| 1021 | \item[$<$domid$>$/] a directory containing vbd's for domid |
---|
| 1022 | \begin{description} |
---|
| 1023 | \item[$<$virtual-device$>$/] a directory for a particular |
---|
| 1024 | virtual-device on domid |
---|
| 1025 | \begin{description} |
---|
| 1026 | \item[frontend-id] domain id of frontend |
---|
| 1027 | \item[frontend] the path to the frontend domain |
---|
| 1028 | \item[physical-device] backend device number |
---|
| 1029 | \item[sector-size] backend sector size |
---|
| 1030 | \item[info] 0 read/write, 1 read-only (is this right?) |
---|
| 1031 | \item[domain] name of frontend domain |
---|
| 1032 | \item[params] parameters for device |
---|
| 1033 | \item[type] the type of the device |
---|
| 1034 | \item[dev] the virtual device (as given by the user) |
---|
| 1035 | \item[node] output from block creation script |
---|
| 1036 | \end{description} |
---|
| 1037 | \end{description} |
---|
| 1038 | \end{description} |
---|
| 1039 | |
---|
| 1040 | \item[vif/] a directory containing vif backends |
---|
| 1041 | \begin{description} |
---|
| 1042 | \item[$<$domid$>$/] a directory containing vif's for domid |
---|
| 1043 | \begin{description} |
---|
| 1044 | \item[$<$vif number$>$/] a directory for each vif |
---|
| 1045 | \item[frontend-id] the domain id of the frontend |
---|
| 1046 | \item[frontend] the path to the frontend |
---|
| 1047 | \item[mac] the mac address of the vif |
---|
| 1048 | \item[bridge] the bridge the vif is connected to |
---|
| 1049 | \item[handle] the handle of the vif |
---|
| 1050 | \item[script] the script used to create/stop the vif |
---|
| 1051 | \item[domain] the name of the frontend |
---|
| 1052 | \end{description} |
---|
| 1053 | \end{description} |
---|
| 1054 | |
---|
| 1055 | \item[vtpm/] a directory containin vtpm backends |
---|
| 1056 | \begin{description} |
---|
| 1057 | \item[$<$domid$>$/] a directory containing vtpm's for domid |
---|
| 1058 | \begin{description} |
---|
| 1059 | \item[$<$vtpm number$>$/] a directory for each vtpm |
---|
| 1060 | \item[frontend-id] the domain id of the frontend |
---|
| 1061 | \item[frontend] the path to the frontend |
---|
| 1062 | \item[instance] the instance of the virtual TPM that is used |
---|
| 1063 | \item[pref{\textunderscore}instance] the instance number as given in the VM configuration file; |
---|
| 1064 | may be different from {\bf instance} |
---|
| 1065 | \item[domain] the name of the domain of the frontend |
---|
| 1066 | \end{description} |
---|
| 1067 | \end{description} |
---|
| 1068 | |
---|
| 1069 | \end{description} |
---|
| 1070 | |
---|
| 1071 | \item[device/] a directory containing the frontend devices for the |
---|
| 1072 | domain |
---|
| 1073 | \begin{description} |
---|
| 1074 | \item[vbd/] a directory containing vbd frontend devices for the |
---|
| 1075 | domain |
---|
| 1076 | \begin{description} |
---|
| 1077 | \item[$<$virtual-device$>$/] a directory containing the vbd frontend for |
---|
| 1078 | virtual-device |
---|
| 1079 | \begin{description} |
---|
| 1080 | \item[virtual-device] the device number of the frontend device |
---|
| 1081 | \item[backend-id] the domain id of the backend |
---|
| 1082 | \item[backend] the path of the backend in the store (/local/domain |
---|
| 1083 | path) |
---|
| 1084 | \item[ring-ref] the grant table reference for the block request |
---|
| 1085 | ring queue |
---|
| 1086 | \item[event-channel] the event channel used for the block request |
---|
| 1087 | ring queue |
---|
| 1088 | \end{description} |
---|
| 1089 | |
---|
| 1090 | \item[vif/] a directory containing vif frontend devices for the |
---|
| 1091 | domain |
---|
| 1092 | \begin{description} |
---|
| 1093 | \item[$<$id$>$/] a directory for vif id frontend device for the domain |
---|
| 1094 | \begin{description} |
---|
| 1095 | \item[backend-id] the backend domain id |
---|
| 1096 | \item[mac] the mac address of the vif |
---|
| 1097 | \item[handle] the internal vif handle |
---|
| 1098 | \item[backend] a path to the backend's store entry |
---|
| 1099 | \item[tx-ring-ref] the grant table reference for the transmission ring queue |
---|
| 1100 | \item[rx-ring-ref] the grant table reference for the receiving ring queue |
---|
| 1101 | \item[event-channel] the event channel used for the two ring queues |
---|
| 1102 | \end{description} |
---|
| 1103 | \end{description} |
---|
| 1104 | |
---|
| 1105 | \item[vtpm/] a directory containing the vtpm frontend device for the |
---|
| 1106 | domain |
---|
| 1107 | \begin{description} |
---|
| 1108 | \item[$<$id$>$] a directory for vtpm id frontend device for the domain |
---|
| 1109 | \begin{description} |
---|
| 1110 | \item[backend-id] the backend domain id |
---|
| 1111 | \item[backend] a path to the backend's store entry |
---|
| 1112 | \item[ring-ref] the grant table reference for the tx/rx ring |
---|
| 1113 | \item[event-channel] the event channel used for the ring |
---|
| 1114 | \end{description} |
---|
| 1115 | \end{description} |
---|
| 1116 | |
---|
| 1117 | \item[device-misc/] miscellanous information for devices |
---|
| 1118 | \begin{description} |
---|
| 1119 | \item[vif/] miscellanous information for vif devices |
---|
| 1120 | \begin{description} |
---|
| 1121 | \item[nextDeviceID] the next device id to use |
---|
| 1122 | \end{description} |
---|
| 1123 | \end{description} |
---|
| 1124 | \end{description} |
---|
| 1125 | \end{description} |
---|
| 1126 | |
---|
| 1127 | \item[security/] access control information for the domain |
---|
| 1128 | \begin{description} |
---|
| 1129 | \item[ssidref] security reference identifier used inside the hypervisor |
---|
| 1130 | \item[access\_control/] security label used by management tools |
---|
| 1131 | \begin{description} |
---|
| 1132 | \item[label] security label name |
---|
| 1133 | \item[policy] security policy name |
---|
| 1134 | \end{description} |
---|
| 1135 | \end{description} |
---|
| 1136 | |
---|
| 1137 | \item[store/] per-domain information for the store |
---|
| 1138 | \begin{description} |
---|
| 1139 | \item[port] the event channel used for the store ring queue |
---|
| 1140 | \item[ring-ref] - the grant table reference used for the store's |
---|
| 1141 | communication channel |
---|
| 1142 | \end{description} |
---|
| 1143 | |
---|
| 1144 | \item[image] - private xend information |
---|
| 1145 | \end{description} |
---|
| 1146 | |
---|
| 1147 | |
\chapter{Devices}
\label{c:devices}

Virtual devices under Xen are provided by a {\bf split device driver}
architecture. The illusion of the virtual device is provided by two
co-operating drivers: the {\bf frontend}, which runs in the
unprivileged domain, and the {\bf backend}, which runs in a domain with
access to the real device hardware (often called a {\bf driver
domain}; in practice domain 0 usually fulfills this function).

The frontend driver appears to the unprivileged guest as if it were a
real device, for instance a block or network device. It receives IO
requests from its kernel as usual; however, since it does not have
access to the physical hardware of the system, it must then issue
requests to the backend. The backend driver is responsible for
receiving these IO requests, verifying that they are safe, and then
issuing them to the real device hardware. The backend driver appears
to its kernel as a normal user of in-kernel IO functionality. When
the IO completes, the backend notifies the frontend that the data is
ready for use; the frontend is then able to report IO completion to
its own kernel.

Frontend drivers are designed to be simple; most of the complexity is
in the backend, which has responsibility for translating device
addresses, verifying that requests are well-formed and do not violate
isolation guarantees, etc.

Split drivers exchange requests and responses in shared memory, with
an event channel for asynchronous notifications of activity. When the
frontend driver comes up, it uses Xenstore to set up a shared memory
frame and an interdomain event channel for communications with the
backend. Once this connection is established, the two can communicate
directly by placing requests / responses into shared memory and then
sending notifications on the event channel. This separation of
notification from data transfer allows message batching, and results
in very efficient device access.

This chapter focuses on some individual split device interfaces
available to Xen guests.

\section{Network I/O}

Virtual network device services are provided by shared memory
communication with a backend domain. From the point of view of other
domains, the backend may be viewed as a virtual ethernet switch
element with each domain having one or more virtual network interfaces
connected to it.

From the point of view of the backend domain itself, the network
backend driver consists of a number of ethernet devices. Each of
these has a logical direct connection to a virtual network device in
another domain. This allows the backend domain to route, bridge,
firewall, etc.\ the traffic to / from the other domains using normal
operating system mechanisms.

\subsection{Backend Packet Handling}

The backend driver is responsible for a variety of actions relating to
the transmission and reception of packets from the physical device.
With regard to transmission, the backend performs these key actions:

\begin{itemize}
\item {\bf Validation:} To ensure that domains do not attempt to
generate invalid (e.g.\ spoofed) traffic, the backend driver may
validate headers, ensuring that source MAC and IP addresses match the
interface that they have been sent from.

Validation functions can be configured using standard firewall rules
({\small{\tt iptables}} in the case of Linux).

\item {\bf Scheduling:} Since a number of domains can share a single
physical network interface, the backend must mediate access when
several domains each have packets queued for transmission. This
general scheduling function subsumes basic shaping or rate-limiting
schemes.

\item {\bf Logging and Accounting:} The backend domain can be
configured with classifier rules that control how packets are
accounted or logged. For example, log messages might be generated
whenever a domain attempts to send a TCP packet containing a SYN.
\end{itemize}

On receipt of incoming packets, the backend acts as a simple
demultiplexer: packets are passed to the appropriate virtual interface
after any necessary logging and accounting have been carried out.

\subsection{Data Transfer}

Each virtual interface uses two ``descriptor rings'', one for
transmit, the other for receive. Each descriptor identifies a block
of contiguous machine memory allocated to the domain.

The transmit ring carries packets to transmit from the guest to the
backend domain. The return path of the transmit ring carries messages
indicating that the contents have been physically transmitted and the
backend no longer requires the associated pages of memory.

To receive packets, the guest places descriptors of unused pages on
the receive ring. The backend will return received packets by
exchanging these pages in the domain's memory with new pages
containing the received data, and passing back descriptors regarding
the new packets on the ring. This zero-copy approach allows the
backend to maintain a pool of free pages to receive packets into, and
then deliver them to appropriate domains after examining their
headers.

% Real physical addresses are used throughout, with the domain
% performing translation from pseudo-physical addresses if that is
% necessary.

If a domain does not keep its receive ring stocked with empty buffers
then packets destined to it may be dropped. This provides some
defence against receive livelock problems because an overloaded domain
will cease to receive further data. Similarly, on the transmit path,
it provides the application with feedback on the rate at which packets
are able to leave the system.

Flow control on rings is achieved by including a pair of producer
indexes on the shared ring page. Each side will maintain a private
consumer index indicating the next outstanding message. In this
manner, the domains cooperate to divide the ring into two message
lists, one in each direction. Notification is decoupled from the
immediate placement of new messages on the ring; the event channel
will be used to generate notification when {\em either} a certain
number of outstanding messages are queued, {\em or} a specified number
of nanoseconds have elapsed since the oldest message was placed on the
ring.
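As an illustration of this scheme, the following C sketch shows how a
frontend might publish requests against a shared producer index while
keeping its consumer index private. The structure and function names
here are illustrative only; they are not the definitions from the Xen
headers.

```c
#include <assert.h>
#include <stdint.h>

/* Ring size must be a power of two so indices can wrap freely and be
 * masked on access. */
#define RING_SIZE 8
#define RING_MASK (RING_SIZE - 1)

struct shared_ring {
    uint32_t req_prod;     /* written by frontend, read by backend */
    uint32_t rsp_prod;     /* written by backend, read by frontend */
    int slot[RING_SIZE];   /* message payloads (simplified) */
};

/* Frontend: queue a request if the ring has space. The frontend's
 * private rsp_cons counts responses it has consumed, so
 * (req_prod - rsp_cons) is the number of slots currently in flight.
 * Returns 1 on success, 0 if the ring is full. */
static int queue_request(struct shared_ring *r, uint32_t rsp_cons, int msg)
{
    if (r->req_prod - rsp_cons == RING_SIZE)
        return 0;                           /* full: would overwrite */
    r->slot[r->req_prod & RING_MASK] = msg;
    r->req_prod++;                          /* publish to the backend */
    return 1;
}
```

The backend symmetrically maintains a private request-consumer index
and publishes {\tt rsp\_prod} as it places responses.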

%% Not sure if my version is any better -- here is what was here
%% before: Synchronization between the backend domain and the guest is
%% achieved using counters held in shared memory that is accessible to
%% both. Each ring has associated producer and consumer indices
%% indicating the area in the ring that holds descriptors that contain
%% data. After receiving {\it n} packets or {\t nanoseconds} after
%% receiving the first packet, the hypervisor sends an event to the
%% domain.

\subsection{Network ring interface}

The network device uses two shared memory rings for communication: one
for transmit, one for receive.

Transmit requests are described by the following structure:

\scriptsize
\begin{verbatim}
typedef struct netif_tx_request {
    grant_ref_t gref;      /* Reference to buffer page */
    uint16_t offset;       /* Offset within buffer page */
    uint16_t flags;        /* NETTXF_* */
    uint16_t id;           /* Echoed in response message. */
    uint16_t size;         /* Packet size in bytes. */
} netif_tx_request_t;
\end{verbatim}
\normalsize

\begin{description}
\item[gref] Grant reference for the network buffer
\item[offset] Offset to data
\item[flags] Transmit flags (currently only NETTXF\_csum\_blank is
supported, to indicate that the protocol checksum field is
incomplete).
\item[id] Echoed to the guest by the backend in the ring-level response
so that the guest can match it to this request
\item[size] Buffer size
\end{description}
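For illustration, the following sketch shows how a frontend might
populate one of these transmit requests. The typedefs and the flag
value are declared locally to mirror the structure above; real guests
include the Xen public headers instead.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t grant_ref_t;          /* illustrative local typedef */
#define NETTXF_csum_blank (1 << 0)     /* checksum not yet filled in */

typedef struct netif_tx_request {
    grant_ref_t gref;     /* Reference to buffer page */
    uint16_t offset;      /* Offset within buffer page */
    uint16_t flags;       /* NETTXF_* */
    uint16_t id;          /* Echoed in response message. */
    uint16_t size;        /* Packet size in bytes. */
} netif_tx_request_t;

/* Build a request for a packet of `len` bytes starting `off` bytes
 * into the granted page; `id` is a private handle the guest later
 * uses to match the tx response back to this request. */
static netif_tx_request_t make_tx_request(grant_ref_t gref, uint16_t off,
                                          uint16_t len, uint16_t id,
                                          int csum_offload)
{
    netif_tx_request_t req;
    req.gref   = gref;
    req.offset = off;
    req.size   = len;
    req.id     = id;
    req.flags  = csum_offload ? NETTXF_csum_blank : 0;
    return req;
}
```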

Each transmit request is followed by a transmit response at some later
time. This is part of the shared-memory communication protocol and
allows the guest to (potentially) retire internal structures related
to the request. It does not imply a network-level response. This
structure is as follows:

\scriptsize
\begin{verbatim}
typedef struct netif_tx_response {
    uint16_t id;
    int16_t status;
} netif_tx_response_t;
\end{verbatim}
\normalsize

\begin{description}
\item[id] Echo of the ID field in the corresponding transmit request.
\item[status] Success / failure status of the transmit request.
\end{description}

Receive requests must be queued by the frontend, accompanied by a
donation of page-frames to the backend. The backend transfers page
frames full of data back to the guest.

\scriptsize
\begin{verbatim}
typedef struct {
    uint16_t id;       /* Echoed in response message. */
    grant_ref_t gref;  /* Reference to incoming granted frame */
} netif_rx_request_t;
\end{verbatim}
\normalsize

\begin{description}
\item[id] Echoed in the response by the backend, so that the frontend
can identify this request.
\item[gref] Transfer reference: the backend will use this reference
to transfer a frame of network data to the guest.
\end{description}

Receive response descriptors are queued for each received frame. Note
that these may only be queued in reply to an existing receive request,
providing an in-built form of traffic throttling.

\scriptsize
\begin{verbatim}
typedef struct {
    uint16_t id;
    uint16_t offset;   /* Offset in page of start of received packet */
    uint16_t flags;    /* NETRXF_* */
    int16_t status;    /* -ve: BLKIF_RSP_* ; +ve: Rx'ed pkt size. */
} netif_rx_response_t;
\end{verbatim}
\normalsize

\begin{description}
\item[id] ID echoed from the original request, used by the guest to
match this response to the original request.
\item[offset] Offset to data within the transferred frame.
\item[flags] Receive flags (currently only NETRXF\_csum\_valid is
supported, to indicate that the protocol checksum field has already
been validated).
\item[status] Success / error status for this operation: negative
values indicate an error, while non-negative values give the size in
bytes of the received packet.
\end{description}
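The status convention can be captured by two small helpers (an
illustrative sketch, not backend code):

```c
#include <assert.h>
#include <stdint.h>

/* Negative status values are error codes; non-negative values give
 * the received packet size in bytes. */
static int rx_is_error(int16_t status) { return status < 0; }

/* Packet size for a successful receive, 0 on error. */
static int rx_pkt_size(int16_t status) { return status < 0 ? 0 : status; }
```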

Note that the receive protocol includes a mechanism for guests to
receive incoming memory frames but there is no explicit transfer of
frames in the other direction. Guests are expected to return memory
to the hypervisor in order to use the network interface. They {\em
must} do this or they will exceed their maximum memory reservation and
will not be able to receive incoming frame transfers. When necessary,
the backend is able to replenish its pool of free network buffers by
claiming some of this free memory from the hypervisor.

\section{Block I/O}

All guest OS disk access goes through the virtual block device (VBD)
interface. This interface allows domains access to portions of block
storage devices visible to the block backend device. The VBD
interface is a split driver, similar to the network interface
described above. A single shared memory ring is used between the
frontend and backend drivers for each virtual device, across which
IO requests and responses are sent.

Any block device accessible to the backend domain, including
network-based block (iSCSI, *NBD, etc.), loopback and LVM/MD devices,
can be exported as a VBD. Each VBD is mapped to a device node in the
guest, specified in the guest's startup configuration.

\subsection{Data Transfer}

The per-(virtual)-device ring between the guest and the block backend
supports two messages:

\begin{description}
\item [{\small {\tt READ}}:] Read data from the specified block
device. The frontend identifies the device and location to read
from and attaches pages for the data to be copied to (typically via
DMA from the device). The backend acknowledges completed read
requests as they finish.

\item [{\small {\tt WRITE}}:] Write data to the specified block
device. This functions essentially as {\small {\tt READ}}, except
that the data moves to the device instead of from it.
\end{description}

%% Rather than copying data, the backend simply maps the domain's
%% buffers in order to enable direct DMA to them. The act of mapping
%% the buffers also increases the reference counts of the underlying
%% pages, so that the unprivileged domain cannot try to return them to
%% the hypervisor, install them as page tables, or any other unsafe
%% behaviour.
%%
%% % block API here

\subsection{Block ring interface}

The block interface is defined by the structures passed over the
shared memory interface. These structures are either requests (from
the frontend to the backend) or responses (from the backend to the
frontend).

The request structure is defined as follows:

\scriptsize
\begin{verbatim}
typedef struct blkif_request {
    uint8_t operation;           /* BLKIF_OP_??? */
    uint8_t nr_segments;         /* number of segments */
    blkif_vdev_t handle;         /* only for read/write requests */
    uint64_t id;                 /* private guest value, echoed in resp */
    blkif_sector_t sector_number;/* start sector idx on disk (r/w only) */
    struct blkif_request_segment {
        grant_ref_t gref;        /* reference to I/O buffer frame */
        /* @first_sect: first sector in frame to transfer (inclusive). */
        /* @last_sect: last sector in frame to transfer (inclusive). */
        uint8_t first_sect, last_sect;
    } seg[BLKIF_MAX_SEGMENTS_PER_REQUEST];
} blkif_request_t;
\end{verbatim}
\normalsize

The fields are as follows:

\begin{description}
\item[operation] operation ID: one of the operations described above
\item[nr\_segments] number of segments for scatter / gather IO
described by this request
\item[handle] identifier for a particular virtual device on this
interface
\item[id] this value is echoed in the response message for this IO;
the guest may use it to identify the original request
\item[sector\_number] start sector on the virtual device for this
request
\item[seg] This array contains structures encoding
scatter-gather IO to be performed:
\begin{description}
\item[gref] The grant reference for the foreign I/O buffer page.
\item[first\_sect] First sector to access within the buffer page (0 to 7).
\item[last\_sect] Last sector to access within the buffer page (0 to 7).
\end{description}
Data will be transferred into frames at an offset determined by the
value of {\tt first\_sect}.
\end{description}
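The sector fields imply simple byte arithmetic. The sketch below
assumes 512-byte sectors (consistent with eight sectors per 4 KiB
page, as the 0 to 7 range above suggests); the constant is an
assumption of this sketch, not a value quoted from the Xen headers.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SIZE 512   /* assumed: 8 x 512 B sectors per 4 KiB page */

/* Byte offset within the granted frame at which the transfer starts. */
static uint32_t seg_byte_offset(uint8_t first_sect)
{
    return (uint32_t)first_sect * SECTOR_SIZE;
}

/* Length in bytes of the transfer; both bounds are inclusive. */
static uint32_t seg_byte_length(uint8_t first_sect, uint8_t last_sect)
{
    return ((uint32_t)last_sect - first_sect + 1) * SECTOR_SIZE;
}
```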

\section{Virtual TPM}

Virtual TPM (VTPM) support provides TPM functionality to each virtual
machine that requests this functionality in its configuration file.
The interface enables domains to access their own private TPM as if it
were a hardware TPM built into the machine.

The virtual TPM interface is implemented as a split driver,
similar to the network and block interfaces described above.
The user domain hosting the frontend exports a character device
{\tt /dev/tpm0} to user-level applications for communicating with the
virtual TPM. This is the same device interface that is also offered if
a hardware TPM is available in the system. The backend provides a
single interface {\tt /dev/vtpm} where the virtual TPM is waiting for
commands from all domains that have located their backend in a given
domain.

\subsection{Data Transfer}

A single shared memory ring is used between the frontend and backend
drivers. TPM requests and responses are sent in pages, where a pointer
to those pages and other information is placed into the ring such that
the backend can map the pages into its memory space using the grant
table mechanism.

The backend driver has been implemented to only accept well-formed
TPM requests. To meet this requirement, the length indicator in the
TPM request must correctly indicate the length of the request.
Otherwise an error message is automatically sent back by the device
driver.

The virtual TPM implementation listens for TPM requests on {\tt
/dev/vtpm}. Since it must be able to apply the TPM request packet to
the virtual TPM instance associated with the virtual machine, a 4-byte
virtual TPM instance identifier is prepended to each packet by the
backend driver (in network byte order) for internal routing of the
request.
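This routing header can be sketched as follows; the helper name is
ours, and the real backend code differs:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Prepend a 4-byte virtual TPM instance identifier, in network
 * (big-endian) byte order, to a TPM command packet. `out` must have
 * room for cmd_len + 4 bytes. Returns the total length written. */
static size_t prepend_instance(uint8_t *out, uint32_t instance,
                               const uint8_t *cmd, size_t cmd_len)
{
    out[0] = (uint8_t)(instance >> 24);   /* most significant byte first */
    out[1] = (uint8_t)(instance >> 16);
    out[2] = (uint8_t)(instance >> 8);
    out[3] = (uint8_t)(instance);
    memcpy(out + 4, cmd, cmd_len);
    return cmd_len + 4;
}
```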

\subsection{Virtual TPM ring interface}

The TPM protocol is a strict request/response protocol, and therefore
only one ring is used: it carries requests from the frontend to the
backend and responses on the reverse path.

The request/response structure is defined as follows:

\scriptsize
\begin{verbatim}
typedef struct {
    unsigned long addr;   /* Machine address of packet. */
    grant_ref_t ref;      /* grant table access reference. */
    uint16_t unused;      /* unused */
    uint16_t size;        /* Packet size in bytes. */
} tpmif_tx_request_t;
\end{verbatim}
\normalsize

The fields are as follows:

\begin{description}
\item[addr] The machine address of the page associated with the TPM
request/response; a request/response may span multiple
pages
\item[ref] The grant table reference associated with the address.
\item[size] The size of the remaining packet; up to
PAGE\_SIZE bytes can be found in the
page referenced by {\tt addr}
\end{description}

The frontend initially allocates several pages whose addresses
are stored in the ring. Only these pages are used for exchange of
requests and responses.


\chapter{Further Information}

If you have questions that are not answered by this manual, the
sources of information listed below may be of interest to you. Note
that bug reports, suggestions and contributions related to the
software (or the documentation) should be sent to the Xen developers'
mailing list (address below).


\section{Other documentation}

If you are mainly interested in using (rather than developing for)
Xen, the \emph{Xen Users' Manual} is distributed in the {\tt docs/}
directory of the Xen source distribution.

% Various HOWTOs are also available in {\tt docs/HOWTOS}.


\section{Online references}

The official Xen web site can be found at:
\begin{quote} {\tt http://www.xensource.com}
\end{quote}

This contains links to the latest versions of all online
documentation, including the latest version of the FAQ.

Information regarding Xen is also available at the Xen Wiki at
\begin{quote} {\tt http://wiki.xensource.com/xenwiki/}\end{quote}
The Xen project uses Bugzilla as its bug tracking system. You'll find
the Xen Bugzilla at
\begin{quote} {\tt http://bugzilla.xensource.com/bugzilla/}\end{quote}


\section{Mailing lists}

There are several mailing lists that are used to discuss Xen-related
topics. The most widely relevant are listed below. An official page of
mailing lists and subscription information can be found at
\begin{quote} {\tt http://lists.xensource.com/} \end{quote}

\begin{description}
\item[xen-devel@lists.xensource.com] Used for development
discussions and bug reports. Subscribe at: \\
{\small {\tt http://lists.xensource.com/xen-devel}}
\item[xen-users@lists.xensource.com] Used for installation and usage
discussions and requests for help. Subscribe at: \\
{\small {\tt http://lists.xensource.com/xen-users}}
\item[xen-announce@lists.xensource.com] Used for announcements only.
Subscribe at: \\
{\small {\tt http://lists.xensource.com/xen-announce}}
\item[xen-changelog@lists.xensource.com] Changelog feed
from the unstable and 2.0 trees; developer oriented. Subscribe at: \\
{\small {\tt http://lists.xensource.com/xen-changelog}}
\end{description}

\appendix


\chapter{Xen Hypercalls}
\label{a:hypercalls}

Hypercalls represent the procedural interface to Xen; this appendix
categorizes and describes the current set of hypercalls.

\section{Invoking Hypercalls}

Hypercalls are invoked in a manner analogous to system calls in a
conventional operating system; a software interrupt is issued which
vectors to an entry point within Xen. On x86/32 machines the
instruction required is {\tt int \$0x82}; the (real) IDT is set up so
that this may only be issued from within ring 1. The particular
hypercall to be invoked is contained in {\tt EAX} --- a list
mapping these values to symbolic hypercall names can be found
in {\tt xen/include/public/xen.h}.

On some occasions a set of hypercalls will be required to carry
out a higher-level function; a good example is when a guest
operating system wishes to context switch to a new process, which
requires updating various pieces of privileged CPU state. As an
optimization for these cases, there is a generic mechanism to issue a
set of hypercalls as a batch:

\begin{quote}
\hypercall{multicall(void *call\_list, int nr\_calls)}

Execute a series of hypervisor calls; {\tt nr\_calls} is the length of
the array of {\tt multicall\_entry\_t} structures pointed to by {\tt
call\_list}. Each entry contains the hypercall operation code followed
by up to 7 word-sized arguments.
\end{quote}

Note that multicalls are provided purely as an optimization; there is
no requirement to use them when first porting a guest operating
system.
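A batch might be assembled as in the following sketch. The entry
layout mirrors the description above (an operation code plus up to 7
word-sized arguments), but the operation numbers and the helper are
illustrative; real guests use the {\tt multicall\_entry\_t} definition
from the Xen public headers.

```c
#include <assert.h>

#define MC_MAX_ARGS 7

/* Illustrative mirror of the multicall entry described above. */
typedef struct multicall_entry {
    unsigned long op;                 /* hypercall operation code */
    unsigned long args[MC_MAX_ARGS];  /* up to 7 word-sized arguments */
} multicall_entry_t;

/* Append one two-argument call to a batch; returns the new length.
 * The batch would then be issued with a single multicall hypercall. */
static int mc_add(multicall_entry_t *list, int n, unsigned long op,
                  unsigned long arg0, unsigned long arg1)
{
    list[n].op = op;
    list[n].args[0] = arg0;
    list[n].args[1] = arg1;
    return n + 1;
}
```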

\section{Virtual CPU Setup}

At start of day, a guest operating system needs to set up the virtual
CPU it is executing on. This includes installing vectors for the
virtual IDT so that the guest OS can handle interrupts, page faults,
etc. However, the very first thing a guest OS must set up is a pair
of hypervisor callbacks: these are the entry points which Xen will
use when it wishes to notify the guest OS of an occurrence.

\begin{quote}
\hypercall{set\_callbacks(unsigned long event\_selector, unsigned long
event\_address, unsigned long failsafe\_selector, unsigned long
failsafe\_address) }

Register the normal (``event'') and failsafe callbacks for
event processing. In each case the code segment selector and
address within that segment are provided. The selectors must
have RPL 1; in XenLinux we simply use the kernel's CS for both
{\bf event\_selector} and {\bf failsafe\_selector}.

The value {\bf event\_address} specifies the address of the guest OS's
event handling and dispatch routine; the {\bf failsafe\_address}
specifies a separate entry point, which is used only if a fault occurs
when Xen attempts to use the normal callback.

\end{quote}

On x86/64 systems the hypercall takes slightly different
arguments. This is because the callback CS does not need to be
specified (since the callbacks are entered via SYSRET), and also
because an entry address needs to be specified for SYSCALLs from
guest user space:

\begin{quote}
\hypercall{set\_callbacks(unsigned long event\_address, unsigned long
failsafe\_address, unsigned long syscall\_address)}
\end{quote}


After installing the hypervisor callbacks, the guest OS can
install a `virtual IDT' by using the following hypercall:

\begin{quote}
\hypercall{set\_trap\_table(trap\_info\_t *table)}

Install one or more entries into the per-domain
trap handler table (essentially a software version of the IDT).
Each entry in the array pointed to by {\bf table} includes the
exception vector number with the corresponding segment selector
and entry point. Most guest OSes can use the same handlers on
Xen as when running on the real hardware.

\end{quote}

A further hypercall is provided for the management of virtual CPUs:

\begin{quote}
\hypercall{vcpu\_op(int cmd, int vcpuid, void *extra\_args)}

This hypercall can be used to bootstrap VCPUs, to bring them up and
down, and to test their current status.

\end{quote}

\section{Scheduling and Timer}

Domains are preemptively scheduled by Xen according to the
parameters installed by domain 0 (see Section~\ref{s:dom0ops}).
In addition, however, a domain may choose to explicitly
control certain behavior with the following hypercall:

\begin{quote}
\hypercall{sched\_op\_new(int cmd, void *extra\_args)}

Request a scheduling operation from the hypervisor. The following
sub-commands are available:

\begin{description}
\item[SCHEDOP\_yield] voluntarily yields the CPU, but leaves the
caller marked as runnable. No extra arguments are passed to this
command.
\item[SCHEDOP\_block] removes the calling domain from the run queue
and causes it to sleep until an event is delivered to it. No extra
arguments are passed to this command.
\item[SCHEDOP\_shutdown] is used to end the calling domain's
execution. The extra argument is a {\bf sched\_shutdown} structure
which indicates the reason why the domain suspended (e.g., for reboot,
halt, power-off).
\item[SCHEDOP\_poll] allows a VCPU to wait on a set of event channels
with an optional timeout (all of which are specified in the {\bf
sched\_poll} extra argument). The semantics are similar to the UNIX
{\bf poll} system call. The caller must have event-channel upcalls
masked when executing this command.
\end{description}
\end{quote}
---|
| 1746 | |
---|
| 1747 | {\bf sched\_op\_new} was not available prior to Xen 3.0.2. Older versions |
---|
| 1748 | provide only the following hypercall: |
---|
| 1749 | |
---|
| 1750 | \begin{quote} |
---|
| 1751 | \hypercall{sched\_op(int cmd, unsigned long extra\_arg)} |
---|
| 1752 | |
---|
| 1753 | This hypercall supports the following subset of {\bf sched\_op\_new} commands: |
---|
| 1754 | |
---|
| 1755 | \begin{description} |
---|
| 1756 | \item[SCHEDOP\_yield] (extra argument is 0). |
---|
| 1757 | \item[SCHEDOP\_block] (extra argument is 0). |
---|
| 1758 | \item[SCHEDOP\_shutdown] (extra argument is numeric reason code). |
---|
| 1759 | \end{description} |
---|
| 1760 | \end{quote} |
---|
| 1761 | |
---|
To aid the implementation of a process scheduler within a guest OS,
Xen provides a virtual programmable timer:

\begin{quote}
\hypercall{set\_timer\_op(uint64\_t timeout)}

Request a timer event to be sent at the specified system time (time
in nanoseconds since system boot).

\end{quote}

Note that calling {\bf set\_timer\_op} prior to blocking via {\bf
sched\_op} allows block-with-timeout semantics.


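The block-with-timeout pattern can be sketched as below. Both hypercalls are stubbed here (they merely record what was requested) so the fragment is self-contained; in a real guest the recorded timeout would arm a one-shot timer event that wakes the blocked domain if nothing else does.

```c
#include <assert.h>

#define SCHEDOP_block 1

/* Stubs standing in for the real hypercalls; they record what was
 * requested rather than entering the hypervisor. */
static unsigned long long pending_timeout;
static int blocked;

static int set_timer_op(unsigned long long timeout_ns)
{ pending_timeout = timeout_ns; return 0; }

static int sched_op(int cmd, unsigned long arg)
{ (void)arg; if (cmd == SCHEDOP_block) blocked = 1; return 0; }

/* Block until an event arrives or the given deadline (ns since boot)
 * passes: arm the one-shot timer first, then block. */
static void block_until(unsigned long long deadline_ns)
{
    set_timer_op(deadline_ns);
    sched_op(SCHEDOP_block, 0);
}
```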
\section{Page Table Management}

Since guest operating systems have read-only access to their page
tables, Xen must be involved when making any changes. The following
multi-purpose hypercall can be used to modify page-table entries,
update the machine-to-physical mapping table, flush the TLB, install
a new page-table base pointer, and more.

\begin{quote}
\hypercall{mmu\_update(mmu\_update\_t *req, int count, int *success\_count)}

Update the page table for the domain; a set of {\bf count} updates are
submitted for processing in a batch, with {\bf success\_count} being
updated to report the number of successful updates.

Each element of {\bf req[]} contains a pointer (address) and value;
the least significant 2 bits of the pointer are used to distinguish
the type of update requested as follows:
\begin{description}

\item[MMU\_NORMAL\_PT\_UPDATE:] update a page directory entry or
page table entry to the associated value; Xen will check that the
update is safe, as described in Chapter~\ref{c:memory}.

\item[MMU\_MACHPHYS\_UPDATE:] update an entry in the
machine-to-physical table. The calling domain must own the machine
page in question (or be privileged).
\end{description}

\end{quote}

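The encoding of the request array can be sketched as follows. The type values (0 and 1) are assumed to match those in {\tt xen/include/public/xen.h}; the helpers mirror how a guest would build a batch entry and how the hypervisor splits the pointer back apart.

```c
#include <assert.h>
#include <stdint.h>

/* Update types, encoded in the low 2 bits of the request pointer
 * (values assumed from the Xen public headers). */
#define MMU_NORMAL_PT_UPDATE  0
#define MMU_MACHPHYS_UPDATE   1

typedef struct { uint64_t ptr; uint64_t val; } mmu_update_t;

/* Build one batched request: machine address of the PTE (or
 * machine-to-physical entry) with the update type folded into the
 * otherwise-unused low bits. */
static mmu_update_t make_update(uint64_t machine_addr, int type,
                                uint64_t new_val)
{
    mmu_update_t u;
    u.ptr = (machine_addr & ~3ULL) | (uint64_t)type;
    u.val = new_val;
    return u;
}

/* The receiving side's view: split pointer and type back apart. */
static int update_type(const mmu_update_t *u) { return (int)(u->ptr & 3); }
static uint64_t update_addr(const mmu_update_t *u) { return u->ptr & ~3ULL; }
```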
Explicitly updating batches of page table entries is extremely
efficient, but can require a number of alterations to the guest
OS. Using the writable page table mode (Chapter~\ref{c:memory}) is
recommended for new OS ports.

Regardless of which page table update mode is being used, however,
there are some occasions (notably handling a demand page fault) where
a guest OS will wish to modify exactly one PTE rather than a
batch, and where that PTE is mapped into the current address space.
This is catered for by the following:

\begin{quote}
\hypercall{update\_va\_mapping(unsigned long va, uint64\_t val,
unsigned long flags)}

Update the currently installed PTE that maps virtual address {\bf va}
to new value {\bf val}. As with {\bf mmu\_update}, Xen checks the
modification is safe before applying it. The {\bf flags} determine
which kind of TLB flush, if any, should follow the update.

\end{quote}

Finally, sufficiently privileged domains may occasionally wish to manipulate
the pages of others:

\begin{quote}
\hypercall{update\_va\_mapping\_otherdomain(unsigned long va, uint64\_t val,
unsigned long flags, domid\_t domid)}

Identical to {\bf update\_va\_mapping} save that the pages being
mapped must belong to the domain {\bf domid}.

\end{quote}

An additional MMU hypercall provides an ``extended command''
interface. This provides additional functionality beyond the basic
table updating commands:

\begin{quote}

\hypercall{mmuext\_op(struct mmuext\_op *op, int count, int *success\_count, domid\_t domid)}

This hypercall is used to perform additional MMU operations. These
include updating {\tt cr3} (or just re-installing it for a TLB flush),
requesting various kinds of TLB flush, flushing the cache, installing
a new LDT, or pinning \& unpinning page-table pages (to ensure their
reference count doesn't drop to zero, which would require a
revalidation of all entries). Some of the operations available are
restricted to domains with sufficient system privileges.

It is also possible for privileged domains to reassign page ownership
via an extended MMU operation, although grant tables are used instead
of this where possible; see Section~\ref{s:idc}.

\end{quote}

Finally, a hypercall interface is exposed to activate and deactivate
various optional facilities provided by Xen for memory management.

\begin{quote}
\hypercall{vm\_assist(unsigned int cmd, unsigned int type)}

Toggle various memory management modes (in particular writable page
tables).

\end{quote}

\section{Segmentation Support}

Xen allows guest OSes to install a custom GDT if they require it;
this is context switched transparently whenever a domain is
[de]scheduled. The following hypercall is effectively a
`safe' version of {\tt lgdt}:

\begin{quote}
\hypercall{set\_gdt(unsigned long *frame\_list, int entries)}

Install a global descriptor table for a domain; {\bf frame\_list} is
an array of up to 16 machine page frames within which the GDT resides,
with {\bf entries} being the actual number of descriptor-entry
slots. All page frames must be mapped read-only within the guest's
address space, and the table must be large enough to contain Xen's
reserved entries (see {\bf xen/include/public/arch-x86\_32.h}).

\end{quote}

Many guest OSes will also wish to install LDTs; this is achieved by
using {\bf mmuext\_op} with an extended command, passing the
linear address of the LDT base along with the number of entries. No
special safety checks are required; Xen needs to perform this task
simply because {\tt lldt} requires CPL 0.

Xen also allows guest operating systems to update just an
individual segment descriptor in the GDT or LDT:

\begin{quote}
\hypercall{update\_descriptor(uint64\_t ma, uint64\_t desc)}

Update the GDT/LDT entry at machine address {\bf ma}; the new
8-byte descriptor is stored in {\bf desc}.
Xen performs a number of checks to ensure the descriptor is
valid.

\end{quote}

Guest OSes can use the above in place of context switching entire
LDTs (or the GDT) when the number of changing descriptors is small.

\section{Context Switching}

When a guest OS wishes to context switch between two processes,
it can use the page table and segmentation hypercalls described
above to perform the bulk of the privileged work. In addition,
however, it will need to invoke Xen to switch the kernel (ring 1)
stack pointer:

\begin{quote}
\hypercall{stack\_switch(unsigned long ss, unsigned long esp)}

Request a kernel stack switch from the hypervisor; {\bf ss} is the new
stack segment and {\bf esp} is the new stack pointer.

\end{quote}

A useful hypercall for context switching allows ``lazy'' save and
restore of floating point state:

\begin{quote}
\hypercall{fpu\_taskswitch(int set)}

This call instructs Xen to set (or clear, according to {\bf set}) the
{\tt TS} bit in the {\tt cr0} control register; when the bit is set,
the next attempt to use floating point will cause a fault which the
guest OS can catch. Typically it will then save/restore the FP state,
and clear the {\tt TS} bit, using the same call.
\end{quote}

This is provided as an optimization only; guest OSes can also choose
to save and restore FP state on all context switches for simplicity.

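The lazy save/restore pattern can be sketched as follows. The hypercall is stubbed (it just tracks a virtual {\tt TS} bit), and the handler name is illustrative; the point is that FP state is only reloaded when a task actually faults on an FP instruction, not on every context switch.

```c
#include <assert.h>

/* Stub for the hypercall: tracks the TS bit in a virtual cr0. */
static int ts_bit;
static void fpu_taskswitch(int set) { ts_bit = set; }

static int fpu_dirty;   /* does the FPU still hold old state? */
static int restores;    /* how many times we actually reloaded */

/* On a context switch, don't touch the FPU: just ask Xen to set TS
 * so the next FP instruction faults. */
static void context_switch(void) { fpu_taskswitch(1); fpu_dirty = 1; }

/* Device-not-available handler: clear TS and (re)load FP state only
 * now that some task has really used the FPU. */
static void do_device_not_available(void)
{
    fpu_taskswitch(0);
    if (fpu_dirty) { restores++; fpu_dirty = 0; }
}
```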
Finally, a hypercall is provided for entering vm86 mode:

\begin{quote}
\hypercall{switch\_vm86}

This allows the guest to run code in vm86 mode, which is needed for
some legacy software.
\end{quote}

\section{Physical Memory Management}

As mentioned previously, each domain has a maximum and current
memory allocation. The maximum allocation, set at domain creation
time, cannot be modified. However, a domain can choose to reduce
and subsequently grow its current allocation by using the
following call:

\begin{quote}
\hypercall{memory\_op(unsigned int op, void *arg)}

Increase or decrease current memory allocation (as determined by
the value of {\bf op}). The available operations are:

\begin{description}
\item[XENMEM\_increase\_reservation] Request an increase in machine
memory allocation; {\bf arg} must point to a {\bf
xen\_memory\_reservation} structure.
\item[XENMEM\_decrease\_reservation] Request a decrease in machine
memory allocation; {\bf arg} must point to a {\bf
xen\_memory\_reservation} structure.
\item[XENMEM\_maximum\_ram\_page] Request the frame number of the
highest-addressed frame of machine memory in the system. {\bf arg}
must point to an {\bf unsigned long} where this value will be
stored.
\item[XENMEM\_current\_reservation] Returns the current memory reservation
of the specified domain.
\item[XENMEM\_maximum\_reservation] Returns the maximum memory reservation
of the specified domain.
\end{description}

\end{quote}

In addition to simply reducing or increasing the current memory
allocation via a `balloon driver', this call is also useful for
obtaining contiguous regions of machine memory when required (e.g.\
for certain PCI devices, or if using superpages).

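A balloon-driver style release of pages can be sketched as below. The structure fields and the operation value are assumptions based on the Xen 3.0 public headers (the real definitions live in {\tt xen/include/public/memory.h}), and the hypercall is stubbed so the fragment is self-contained.

```c
#include <assert.h>
#include <stdint.h>

/* Operation value and field names assumed from the public headers. */
#define XENMEM_decrease_reservation 1

typedef struct {
    unsigned long *extent_start;  /* array of page frame numbers     */
    unsigned long  nr_extents;    /* number of extents in the array  */
    unsigned int   extent_order;  /* log2 pages per extent (0 = 4kB) */
    uint16_t       domid;
} xen_memory_reservation_t;

/* Stub for the hypercall: returns the number of extents "released". */
static unsigned long memory_op(unsigned int op, void *arg)
{
    xen_memory_reservation_t *r = arg;
    return op == XENMEM_decrease_reservation ? r->nr_extents : 0;
}

/* Balloon-driver style release: hand a batch of our own frames back
 * to Xen, shrinking the current reservation. */
static unsigned long balloon_release(unsigned long *frames,
                                     unsigned long n, uint16_t domid)
{
    xen_memory_reservation_t r = { frames, n, 0, domid };
    return memory_op(XENMEM_decrease_reservation, &r);
}
```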
\section{Inter-Domain Communication}
\label{s:idc}

Xen provides a simple asynchronous notification mechanism via
\emph{event channels}. Each domain has a set of end-points (or
\emph{ports}) which may be bound to an event source (e.g.\ a physical
IRQ, a virtual IRQ, or a port in another domain). When a pair of
end-points in two different domains are bound together, a `send'
operation on one will cause an event to be received by the destination
domain.

The control and use of event channels involves the following hypercall:

\begin{quote}
\hypercall{event\_channel\_op(evtchn\_op\_t *op)}

Inter-domain event-channel management; {\bf op} is a discriminated
union which allows the following seven operations:

\begin{description}

\item[alloc\_unbound:] allocate a free (unbound) local
port and prepare for connection from a specified domain.
\item[bind\_virq:] bind a local port to a virtual
IRQ; any particular VIRQ can be bound to at most one port per domain.
\item[bind\_pirq:] bind a local port to a physical IRQ;
once more, a given pIRQ can be bound to at most one port per
domain. Furthermore the calling domain must be sufficiently
privileged.
\item[bind\_interdomain:] construct an interdomain event
channel; in general, the target domain must have previously allocated
an unbound port for this channel, although this can be bypassed by
privileged domains during domain setup.
\item[close:] close an interdomain event channel.
\item[send:] send an event to the remote end of an
interdomain event channel.
\item[status:] determine the current status of a local port.
\end{description}

For more details see
{\bf xen/include/public/event\_channel.h}.

\end{quote}

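The alloc/bind/send handshake can be sketched with a toy model, below. The functions here are illustrative stand-ins for the corresponding {\bf event\_channel\_op} sub-operations (the real interface passes a discriminated union, and the two sides run in different domains); the model simply wires two port numbers together.

```c
#include <assert.h>

/* Toy model of the event-channel operations, standing in for
 * event_channel_op() so the handshake can be shown end to end. */
#define NPORTS 8
static int port_bound[NPORTS];   /* far end a local port is wired to */
static int port_pending[NPORTS]; /* event delivered, not yet handled */
static int next_free = 1;

/* alloc_unbound: reserve a local port for a future connection from
 * the named domain (domid unused in this toy model). */
static int evtchn_alloc_unbound(int remote_domid)
{ (void)remote_domid; return next_free++; }

/* bind_interdomain: connect a fresh local port to the remote one. */
static int evtchn_bind_interdomain(int remote_port)
{
    int local = next_free++;
    port_bound[local] = remote_port;
    port_bound[remote_port] = local;
    return local;
}

/* send: raise the event at whatever the far end of this port is. */
static void evtchn_send(int port) { port_pending[port_bound[port]] = 1; }
```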
Event channels are the fundamental communication primitive between
Xen domains and seamlessly support SMP. However they provide little
bandwidth for communication {\sl per se}, and hence are typically
married with a piece of shared memory to produce effective and
high-performance inter-domain communication.

Safe sharing of memory pages between guest OSes is carried out by
granting access on a per-page basis to individual domains. This is
achieved by using the {\tt grant\_table\_op} hypercall:

\begin{quote}
\hypercall{grant\_table\_op(unsigned int cmd, void *uop, unsigned int count)}

Used to invoke operations on a grant reference, to set up the grant
table, and to dump the table's contents for debugging.

\end{quote}

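A grant offer itself is made by writing an entry in the shared grant table, not by a hypercall. A sketch of that entry and of making a page accessible to another domain follows; the structure layout and flag values are assumptions based on {\tt xen/include/public/grant\_table.h}.

```c
#include <assert.h>
#include <stdint.h>

/* Entry layout and flag values assumed from the public headers. */
#define GTF_permit_access 1        /* grant maps the page            */
#define GTF_readonly      (1 << 2) /* disallow writable mappings     */

typedef struct {
    uint16_t flags;   /* type of grant plus permission bits */
    uint16_t domid;   /* domain being granted access        */
    uint32_t frame;   /* machine frame number being shared  */
} grant_entry_t;

/* Offer one of our frames to another domain.  Writing flags last
 * (after domid/frame) is what makes the entry go live; a real driver
 * would issue a write barrier before that final store. */
static void gnttab_grant_access(grant_entry_t *e, uint16_t domid,
                                uint32_t frame, int readonly)
{
    e->domid = domid;
    e->frame = frame;
    e->flags = (uint16_t)(GTF_permit_access | (readonly ? GTF_readonly : 0));
}
```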
\section{IO Configuration}

Domains with physical device access (i.e.\ driver domains) receive
limited access to certain PCI devices (bus address space and
interrupts). However, many guest operating systems attempt to
determine the PCI configuration by directly accessing the PCI BIOS,
which cannot be allowed for safety reasons.

Instead, Xen provides the following hypercall:

\begin{quote}
\hypercall{physdev\_op(void *physdev\_op)}

Set and query IRQ configuration details, set the system IOPL, and set
the TSS IO bitmap.

\end{quote}


For examples of using {\tt physdev\_op}, see the
Xen-specific PCI code in the Linux sparse tree.

\section{Administrative Operations}
\label{s:dom0ops}

A large number of control operations are available to a sufficiently
privileged domain (typically domain 0). These allow the creation and
management of new domains, for example. A complete list is given
below; for more details on any or all of these, please see
{\tt xen/include/public/dom0\_ops.h}.


\begin{quote}
\hypercall{dom0\_op(dom0\_op\_t *op)}

Administrative domain operations for domain management. The options are:

\begin{description}
\item [DOM0\_GETMEMLIST:] get list of pages used by the domain

\item [DOM0\_SCHEDCTL:] set scheduler parameters

\item [DOM0\_ADJUSTDOM:] adjust scheduling priorities for domain

\item [DOM0\_CREATEDOMAIN:] create a new domain

\item [DOM0\_DESTROYDOMAIN:] deallocate all resources associated
with a domain

\item [DOM0\_PAUSEDOMAIN:] remove a domain from the scheduler run
queue

\item [DOM0\_UNPAUSEDOMAIN:] mark a paused domain as schedulable
once again

\item [DOM0\_GETDOMAININFO:] get statistics about the domain

\item [DOM0\_SETDOMAININFO:] set VCPU-related attributes

\item [DOM0\_MSR:] read or write model-specific registers

\item [DOM0\_DEBUG:] interactively invoke the debugger

\item [DOM0\_SETTIME:] set system time

\item [DOM0\_GETPAGEFRAMEINFO:] get information about a page frame

\item [DOM0\_READCONSOLE:] read console content from hypervisor buffer ring

\item [DOM0\_PINCPUDOMAIN:] pin domain to a particular CPU

\item [DOM0\_TBUFCONTROL:] get and set trace buffer attributes

\item [DOM0\_PHYSINFO:] get information about the host machine

\item [DOM0\_SCHED\_ID:] get the ID of the current Xen scheduler

\item [DOM0\_SHADOW\_CONTROL:] switch between shadow page-table modes

\item [DOM0\_SETDOMAINMAXMEM:] set maximum memory allocation of a domain

\item [DOM0\_GETPAGEFRAMEINFO2:] batched interface for getting
page frame info

\item [DOM0\_ADD\_MEMTYPE:] set MTRRs

\item [DOM0\_DEL\_MEMTYPE:] remove a memory type range

\item [DOM0\_READ\_MEMTYPE:] read MTRR

\item [DOM0\_PERFCCONTROL:] control Xen's software performance
counters

\item [DOM0\_MICROCODE:] update CPU microcode

\item [DOM0\_IOPORT\_PERMISSION:] modify domain permissions for an
IO port range (enable / disable a range for a particular domain)

\item [DOM0\_GETVCPUCONTEXT:] get context from a VCPU

\item [DOM0\_GETVCPUINFO:] get current state for a VCPU

\item [DOM0\_GETDOMAININFOLIST:] batched interface to get domain
info

\item [DOM0\_PLATFORM\_QUIRK:] inform Xen of a platform quirk it
needs to handle (e.g.\ noirqbalance)

\item [DOM0\_PHYSICAL\_MEMORY\_MAP:] get info about dom0's memory
map

\item [DOM0\_MAX\_VCPUS:] change max number of VCPUs for a domain

\item [DOM0\_SETDOMAINHANDLE:] set the handle for a domain

\end{description}
\end{quote}

Most of the above are best understood by looking at the code
implementing them (in {\tt xen/common/dom0\_ops.c}) and in
the user-space tools that use them (mostly in {\tt tools/libxc}).

\section{Access Control Module Hypercalls}
\label{s:acmops}

Hypercalls relating to the management of the Access Control Module are
also restricted to domain 0 access for now. For more details on any or
all of these, please see {\tt xen/include/public/acm\_ops.h}. A
complete list is given below:

\begin{quote}

\hypercall{acm\_op(int cmd, void *args)}

This hypercall can be used to configure the state of the ACM, query
that state, request access control decisions, and dump additional
information.

\begin{description}

\item [ACMOP\_SETPOLICY:] set the access control policy

\item [ACMOP\_GETPOLICY:] get the current access control policy and
status

\item [ACMOP\_DUMPSTATS:] get current access control hook invocation
statistics

\item [ACMOP\_GETSSID:] get security access control information for a
domain

\item [ACMOP\_GETDECISION:] get access decision based on the currently
enforced access control policy

\end{description}
\end{quote}

Most of the above are best understood by looking at the code
implementing them (in {\tt xen/common/acm\_ops.c}) and in the
user-space tools that use them (mostly in {\tt tools/security} and
{\tt tools/python/xen/lowlevel/acm}).


\section{Debugging Hypercalls}

A few additional hypercalls are mainly useful for debugging:

\begin{quote}
\hypercall{console\_io(int cmd, int count, char *str)}

Use Xen to interact with the console; operations are:

\begin{description}
\item[CONSOLEIO\_write:] output {\bf count} characters from buffer {\bf str}.
\item[CONSOLEIO\_read:] input at most {\bf count} characters into buffer {\bf str}.
\end{description}
\end{quote}

A pair of hypercalls allows access to the underlying debug registers:
\begin{quote}
\hypercall{set\_debugreg(int reg, unsigned long value)}

Set debug register {\bf reg} to {\bf value}.

\hypercall{get\_debugreg(int reg)}

Return the contents of the debug register {\bf reg}.
\end{quote}


And finally:
\begin{quote}
\hypercall{xen\_version(int cmd)}

Request Xen version number.
\end{quote}

This is useful to ensure that user-space tools are in sync
with the underlying hypervisor.

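For the basic version command, the hypervisor is generally understood to return the major and minor revision packed into a single word ({\tt xen/include/public/version.h} has the definitive encoding; the helpers below assume major in the upper 16 bits, minor in the lower):

```c
#include <assert.h>

/* Decode the packed version word: major in bits 31:16, minor in
 * bits 15:0 (encoding assumed from the public version headers). */
static int xen_major(unsigned long v) { return (int)((v >> 16) & 0xffff); }
static int xen_minor(unsigned long v) { return (int)(v & 0xffff); }

/* A user-space tool can refuse to run against an unexpected
 * hypervisor revision: */
static int version_ok(unsigned long v, int want_major)
{ return xen_major(v) == want_major; }
```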


\end{document}