=== How the Blkif Drivers Work ===
Andrew Warfield
andrew.warfield@cl.cam.ac.uk

The intent of this is to explain at a fairly detailed level how the
split device drivers work in Xen 1.3 (aka 2.0beta).  The intended
audience for this, I suppose, is anyone who intends to work with the
existing blkif interfaces and wants something to help them get up to
speed with the code in a hurry.  Secondly though, I hope to break out
the general mechanisms that are used in the drivers that are likely to
be necessary to implement other driver interfaces.

As a point of warning before starting, it is worth mentioning that I
anticipate much of the specifics described here changing in the near
future.  There has been talk about making the blkif protocol
a bit more efficient than it currently is.  Keir's addition of grant
tables will change the current remapping code that is used when shared
pages are initially set up.

Also, writing other control interface types will likely need support
from Xend, which at the moment has a steep learning curve... this
should be addressed in the future.

For more information on the driver model as a whole, read the
"Reconstructing I/O" technical report
(http://www.cl.cam.ac.uk/Research/SRG/netos/papers/2004-xenngio.pdf).

==== High-level structure of a split-driver interface ====

Why would you want to write a split driver in the first place?  As Xen
is a virtual machine manager and focuses on isolation as an initial
design principle, it is generally considered unwise to share physical
access to devices across domains.  The reasons for this are obvious:
when device resources are shared, misbehaving code or hardware can
result in the failure of all of the client applications.  Moreover, as
virtual machines in Xen are entire OSs, standard device drivers that
they might use cannot have multiple instantiations for a single piece
of hardware.  In light of all this, the general approach in Xen is to
give a single virtual machine hardware access to a device, and where
other VMs want to share the device, export a higher-level interface to
facilitate that sharing.  If you don't want to share, that's fine.
There are currently Xen users actively exploring running two
completely isolated X-Servers on a Xen host, each with its own video
card, keyboard, and mouse.  In these situations, the guests need only
be given physical access to the necessary devices and left to go on
their own.  However, for devices such as disks and network interfaces,
where sharing is required, the split driver approach is a good
solution.

The structure is like this:

   +--------------------------+  +--------------------------+
   | Domain 0 (privileged)    |  | Domain 1 (unprivileged)  |
   |                          |  |                          |
   | Xend ( Application )     |  |                          |
   | Blkif Backend Driver     |  | Blkif Frontend Driver    |
   | Physical Device Driver   |  |                          |
   +--------------------------+  +--------------------------+
   +--------------------------------------------------------+
   |                X       E       N                       |
   +--------------------------------------------------------+


The Blkif driver is in two parts, which we refer to as the frontend (FE)
and the backend (BE).  Together, they serve to proxy device requests
between the guest operating system in an unprivileged domain, and the
physical device driver in the privileged domain.  An additional benefit
to this approach is that the FE driver can provide a single interface
for a whole class of physical devices.  The blkif interface mounts
IDE, SCSI, and our own VBD-structured disks, independent of the
physical driver underneath.  Moreover, supporting additional OSs only
requires that a new FE driver be written to connect to the existing
backend.

==== Inter-Domain Communication Mechanisms ====

===== Event Channels =====

Before getting into the specifics of the block interface driver, it is
worth discussing the mechanisms that are used to communicate between
domains.  Two mechanisms are used to allow the construction of
high-performance drivers: event channels and shared-memory rings.

Event channels are an asynchronous interdomain notification
mechanism.  Xen allows channels to be instantiated between two
domains, and domains can request that a virtual irq be attached to
notifications on a given channel.  The result of this is that the
frontend domain can send a notification on an event channel, resulting
in an interrupt entry into the backend at a later time.

The event channel between two domains is instantiated in the Xend code
during driver startup (described later).  Xend's channel.py
(tools/python/xen/xend/server/channel.py) defines the function


def eventChannel(dom1, dom2):
    return xc.evtchn_bind_interdomain(dom1=dom1, dom2=dom2)


which maps to xc_evtchn_bind_interdomain() in tools/libxc/xc_evtchn.c,
which in turn generates a hypercall to Xen to patch the event channel
between the domains.  Only a privileged domain can request the
creation of an event channel.

Once the event channel is created in Xend, its ends are passed to both the
front and backend domains over the control channel.  The end that is
passed to a domain is just an integer "port" uniquely identifying the
event channel's local connection to that domain.  An example of this
setup code is in linux-2.6.x/drivers/xen/blkfront/blkfront.c in
blkif_connect(), which receives several status change events as
the driver starts up.  It is passed an event channel end in a
BLKIF_INTERFACE_STATUS_CONNECTED message, and patches it in like this:


   blkif_evtchn = status->evtchn;
   blkif_irq    = bind_evtchn_to_irq(blkif_evtchn);
   if ( (rc = request_irq(blkif_irq, blkif_int,
                          SA_SAMPLE_RANDOM, "blkif", NULL)) )
       printk(KERN_ALERT "blkfront request_irq failed (%ld)\n", rc);


This code associates a virtual irq with the event channel, and
attaches the function blkif_int() as an interrupt handler for that
irq.  blkif_int() simply handles the notification and returns; it does
not need to interact with the channel at all.

An example of generating a notification can also be seen in blkfront.c:


static inline void flush_requests(void)
{
    DISABLE_SCATTERGATHER();
    wmb(); /* Ensure that the backend can see the requests. */
    blk_ring->req_prod = req_prod;
    notify_via_evtchn(blkif_evtchn);
}

notify_via_evtchn() issues a hypercall to set the event waiting flag on
the other domain's end of the channel.

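To make the "flag now, interrupt later" behaviour concrete, here is a
small user-space toy model.  It is only a sketch: real event channels
are implemented by Xen hypercalls, and every name below (evt_port,
bind_port, notify_port, dispatch_pending, count_handler) is made up
for illustration.  It assumes that a notification does nothing more
than set a per-port flag, so that repeated notifications sent before
the receiver runs are delivered as a single event.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_PORTS 8                      /* hypothetical table size */

typedef void (*evt_handler_t)(int port);

struct evt_port {
    bool          pending;              /* "event waiting" flag */
    evt_handler_t handler;              /* bound handler, if any */
};

struct evt_port ports[NR_PORTS];
int handled_count;                      /* incremented by the demo handler */

/* Attach a handler to a port (stand-in for binding a virtual irq). */
void bind_port(int port, evt_handler_t h)
{
    ports[port].handler = h;
}

/* Sender side: only sets the flag; it never calls the handler directly. */
void notify_port(int port)
{
    ports[port].pending = true;
}

/* Receiver side, run "at a later time": deliver each pending event once.
 * Returns the number of handlers that were invoked. */
int dispatch_pending(void)
{
    int delivered = 0;
    for (int p = 0; p < NR_PORTS; p++) {
        if (ports[p].pending && ports[p].handler != NULL) {
            ports[p].pending = false;   /* clear before running the handler */
            ports[p].handler(p);
            delivered++;
        }
    }
    return delivered;
}

/* Demo handler: just records that it ran. */
void count_handler(int port)
{
    (void)port;
    handled_count++;
}
```

Binding count_handler to a port, notifying it twice and then calling
dispatch_pending() once runs the handler a single time, which is why
a handler such as blkif_int() has to look at the shared ring to find
out how much work is actually waiting.
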
===== Communication Rings =====

Event channels are strictly a notification mechanism between domains.
To move large chunks of data back and forth, Xen allows domains to
share pages of memory.  We use communication rings as a means of
managing access to a shared memory page for message passing between
domains.  These rings are not explicitly a mechanism of Xen, which is
only concerned with the actual sharing of the page and not how it is
used.  They are, however, worth discussing as they are used in many
places in the current code and are a useful model for communicating
across a shared page.

A shared page is set up by a frontend guest first allocating a page in
its own address space and then passing the address of that page to the
backend driver.

Consider the following code, also from blkfront.c.  Note:  this code
is in blkif_disconnect().  The driver transitions from STATE_CLOSED
to STATE_DISCONNECTED before becoming CONNECTED.  The state automaton
is in blkif_status().

   blk_ring = (blkif_ring_t *)__get_free_page(GFP_KERNEL);
   blk_ring->req_prod = blk_ring->resp_prod = resp_cons = req_prod = 0;
   ...
   /* Construct an interface-CONNECT message for the domain controller. */
   cmsg.type      = CMSG_BLKIF_FE;
   cmsg.subtype   = CMSG_BLKIF_FE_INTERFACE_CONNECT;
   cmsg.length    = sizeof(blkif_fe_interface_connect_t);
   up.handle      = 0;
   up.shmem_frame = virt_to_machine(blk_ring) >> PAGE_SHIFT;
   memcpy(cmsg.msg, &up, sizeof(up));


blk_ring will be the shared page.  The producer and consumer pointers
are then initialised (these will be discussed soon), and then the
machine address of the page is sent to the backend via a control
channel to Xend.  This control channel itself uses the notification
and shared memory mechanisms described here, but is set up for each
domain automatically at startup.

The backend, which runs in a privileged domain, then takes the page
address and maps it into its own address space (in
linux26/drivers/xen/blkback/interface.c:blkif_connect()):


void blkif_connect(blkif_be_connect_t *connect)
{
   ...
   unsigned long shmem_frame = connect->shmem_frame;
   ...

   if ( (vma = get_vm_area(PAGE_SIZE, VM_IOREMAP)) == NULL )
   {
      connect->status = BLKIF_BE_STATUS_OUT_OF_MEMORY;
      return;
   }

   prot = __pgprot(_PAGE_PRESENT | _PAGE_RW | _PAGE_DIRTY | _PAGE_ACCESSED);
   error = direct_remap_area_pages(&init_mm, VMALLOC_VMADDR(vma->addr),
                                   shmem_frame<<PAGE_SHIFT, PAGE_SIZE,
                                   prot, domid);

   ...

   blkif->blk_ring_base = (blkif_ring_t *)vma->addr;
}

The machine address of the page is passed in the shmem_frame field of
the connect message.  This is then mapped into the virtual address
space of the backend domain, and saved in the blkif structure
representing this particular backend connection.

NOTE:  New mechanisms will be added very shortly to allow domains to
explicitly grant access to their pages to other domains.  This "grant
table" support is in the process of being added to the tree, and will
change the way a shared page is set up.  In particular, it will remove
the need for the remapping domain to be privileged.

Sending data across shared rings:

Shared rings avoid the potential for write interference between
domains in a very cunning way.  A ring is partitioned into a request
and a response region, and domains only work within their own space.
This can be thought of as a double producer-consumer ring -- the ring
is described by four pointers into a circular buffer of fixed-size
records.  Pointers may only advance, and may not pass one another.


                         resp_cons----+
                                      V
           +----+----+----+----+----+----+----+
           |    |    |  free(A)     |RSP1|RSP2|
           +----+----+----+----+----+----+----+
 req_prod->|    |       -------->        |RSP3|
           +----+                        +----+
           |REQ8|                        |    |<-resp_prod
           +----+                        +----+
           |REQ7|                        |    |
           +----+                        +----+
           |REQ6|       <--------        |    |
           +----+----+----+----+----+----+----+
           |REQ5|REQ4|    free(B)   |    |    |
           +----+----+----+----+----+----+----+
  req_cons---------^



By adopting the convention that every request will receive a response,
not all four pointers need be shared and flow control on the ring
becomes very easy to manage.  Each domain manages its own
consumer pointer, and the two producer pointers are visible to both
(xen/include/public/io/blkif.h):


/* NB. Ring size must be small enough for sizeof(blkif_ring_t) <= PAGE_SIZE. */
#define BLKIF_RING_SIZE        64

...

/*
 * We use a special capitalised type name because it is _essential_ that all
 * arithmetic on indexes is done on an integer type of the correct size.
 */
typedef u32 BLKIF_RING_IDX;

/*
 * Ring indexes are 'free running'. That is, they are not stored modulo the
 * size of the ring buffer. The following macro converts a free-running counter
 * into a value that can directly index a ring-buffer array.
 */
#define MASK_BLKIF_IDX(_i) ((_i)&(BLKIF_RING_SIZE-1))

typedef struct {
    BLKIF_RING_IDX req_prod;  /*  0: Request producer. Updated by front-end. */
    BLKIF_RING_IDX resp_prod; /*  4: Response producer. Updated by back-end. */
    union {                   /*  8 */
        blkif_request_t  req;
        blkif_response_t resp;
    } PACKED ring[BLKIF_RING_SIZE];
} PACKED blkif_ring_t;



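The reason the header insists on a fixed-width unsigned index type is
easier to see with a small stand-alone sketch.  The names here
(RING_SIZE, MASK_IDX, ring_idx_t, ring_outstanding, ring_slot) are
stand-ins chosen for illustration rather than the identifiers from
blkif.h, and the sketch assumes the ring size is a power of two, which
is what makes the mask equivalent to a modulo.

```c
#include <assert.h>
#include <stdint.h>

#define RING_SIZE 64u                       /* must be a power of two */
#define MASK_IDX(i) ((i) & (RING_SIZE - 1)) /* same form as MASK_BLKIF_IDX */

typedef uint32_t ring_idx_t;                /* stand-in for BLKIF_RING_IDX */

/* Number of entries produced but not yet consumed.  Unsigned
 * subtraction stays correct even after prod has wrapped past 2^32. */
ring_idx_t ring_outstanding(ring_idx_t prod, ring_idx_t cons)
{
    return prod - cons;
}

/* Array slot that a free-running index refers to. */
ring_idx_t ring_slot(ring_idx_t idx)
{
    return MASK_IDX(idx);
}
```

For example, ring_slot(64) is 0 and ring_slot(130) is 2, while
ring_outstanding(3, 0xFFFFFFFE) is 5 even though the producer index
has wrapped around.  Doing the same arithmetic in a wider or signed
type would give the wrong answer at the wrap point.
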
As shown in the diagram above, the rules for using a shared memory
ring are simple.

1. A ring is full when a domain's producer and consumer pointers
   refer to the same slot while messages are outstanding.  Because
   the indexes are free-running, this means they differ by exactly
   the ring size (e.g. req_prod - resp_cons == BLKIF_RING_SIZE).  In
   this situation, the consumer pointer must be advanced before
   anything more can be produced.  Furthermore, if the consumer
   pointer is equal to the other domain's producer pointer
   (e.g. resp_cons == resp_prod), then the other domain has all the
   buffers.

2. Producer pointers point to the next buffer that will be written to.
   (So blk_ring->ring[MASK_BLKIF_IDX(req_prod)] should not be consumed.)

3. Consumer pointers point to a valid message, so long as they are not
   equal to the associated producer pointer.

4. A domain should only ever write to the message pointed
   to by its producer index, and read from the message at its
   consumer.  More generally, the domain may be thought of as having
   exclusive access to the messages between its consumer and producer,
   and should absolutely not read or write outside this region.

   Thus the front end has exclusive access to the free(A) region
   in the figure above, and the back end driver has exclusive
   access to the free(B) region.

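The four regions of the ring can be written down as simple index
arithmetic.  The following is a sketch under my own assumptions: it
gathers all four indexes into one struct purely for illustration
(in the drivers each side keeps its consumer index private), it
assumes 32-bit free-running indexes, and the struct and function
names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define RING_SIZE 64u

typedef uint32_t ring_idx_t;

/* All four indexes in one place, for illustration only. */
struct ring_view {
    ring_idx_t req_prod;   /* advanced by the front end */
    ring_idx_t req_cons;   /* advanced by the back end  */
    ring_idx_t resp_prod;  /* advanced by the back end  */
    ring_idx_t resp_cons;  /* advanced by the front end */
};

/* Slots the front end may still fill with requests: free(A). */
ring_idx_t free_request_slots(const struct ring_view *r)
{
    return RING_SIZE - (r->req_prod - r->resp_cons);
}

/* Requests written by the front end and not yet read by the back end. */
ring_idx_t pending_requests(const struct ring_view *r)
{
    return r->req_prod - r->req_cons;
}

/* Slots whose request has been read but not yet answered: free(B). */
ring_idx_t in_service_slots(const struct ring_view *r)
{
    return r->req_cons - r->resp_prod;
}

/* Responses written by the back end and not yet read by the front end. */
ring_idx_t pending_responses(const struct ring_view *r)
{
    return r->resp_prod - r->resp_cons;
}
```

As long as the pointers have not passed one another, the four counts
always add up to RING_SIZE, and free_request_slots() returning zero is
the "ring full" condition from rule 1.
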
In general, drivers keep a private copy of their producer pointer and
then set the shared version when they are ready for the other end to
process a set of messages.  Additionally, it is worth paying attention
to the use of memory barriers (rmb/wmb) in the code, to ensure that
rings that are shared across processors behave as expected.

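A minimal user-space sketch of this "private copy, then publish"
pattern follows.  It is not the driver code: it uses a C11 release
fence where the kernel code uses wmb(), it omits the ring-full check
for brevity, and the names shared_ring, producer, req_prod_pvt,
queue_request and flush_batch are assumptions made for the example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define RING_SIZE 64u
#define MASK_IDX(i) ((i) & (RING_SIZE - 1))

struct shared_ring {
    _Atomic uint32_t req_prod;        /* shared producer index */
    uint32_t         slot[RING_SIZE]; /* stand-in for request records */
};

struct producer {
    struct shared_ring *ring;
    uint32_t            req_prod_pvt; /* private copy, not yet visible */
};

/* Write one message into the next slot; the consumer cannot see it yet
 * because the shared index has not moved. */
void queue_request(struct producer *p, uint32_t payload)
{
    p->ring->slot[MASK_IDX(p->req_prod_pvt)] = payload;
    p->req_prod_pvt++;
}

/* Publish the whole batch.  The fence orders the slot writes before the
 * index update, which is the role wmb() plays in flush_requests(). */
void flush_batch(struct producer *p)
{
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&p->ring->req_prod, p->req_prod_pvt,
                          memory_order_relaxed);
}
```

Queuing two requests leaves the shared req_prod untouched; only the
call to flush_batch() moves it forward by two, so the other end either
sees none of the batch or all of it with the slot contents in place.
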
==== Structure of the Blkif Drivers ====

Now that the communications primitives have been discussed, I'll
quickly cover the general structure of the blkif driver.  This is
intended to give a high-level idea of what is going on, in an effort
to make reading the code a more approachable task.

There are three key software components that are involved in the blkif
drivers (not counting Xen itself): the frontend driver, the backend
driver, and Xend, which coordinates their initial connection.  Xend may
also be involved in control-channel signalling in some cases after
startup, for instance to manage reconnection if the backend is restarted.

===== Frontend Driver Structure =====

The frontend domain uses a single event channel and a shared memory
ring to trade control messages with the backend.  These are both set up
during domain startup, which will be discussed shortly.  The shared
memory ring is called blk_ring, and the private ring indexes are
resp_cons and req_prod.  The ring is protected by blkif_io_lock.
Additionally, the frontend keeps a list of outstanding requests in
rec_ring[].  These are uniquely identified by a guest-local id number,
which is associated with each request sent to the backend, and
returned with the matching responses.  Information about the actual
disks is stored in major_info[], of which only the first nr_vbds
entries are valid.  Finally, the global 'recovery' indicates that the
connection between the backend and frontend drivers has been broken
(possibly due to a backend driver crash) and that the frontend is in
recovery mode, in which case it will attempt to reconnect and reissue
outstanding requests.

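One common way to hand out guest-local ids of this kind is to keep the
unused records on a free list and use the array index as the id.  The
sketch below shows that idea only; I am not claiming it is how
rec_ring[] is laid out in blkfront.c, and the names req_record, recs,
rec_alloc and rec_free, as well as the record count, are hypothetical.

```c
#include <assert.h>

#define NR_RECS  64                  /* hypothetical: one record per ring slot */
#define ID_NONE  (-1)

struct req_record {
    int in_use;                      /* nonzero while awaiting a response */
    int next_free;                   /* free-list link when not in use */
    unsigned long sector;            /* example of state kept for reissue */
};

struct req_record recs[NR_RECS];
int free_head = ID_NONE;

/* Chain every record onto the free list. */
void recs_init(void)
{
    for (int i = 0; i < NR_RECS; i++) {
        recs[i].in_use = 0;
        recs[i].next_free = (i + 1 < NR_RECS) ? i + 1 : ID_NONE;
    }
    free_head = 0;
}

/* Take an id for a new request; returns ID_NONE if all are outstanding. */
int rec_alloc(unsigned long sector)
{
    int id = free_head;
    if (id == ID_NONE)
        return ID_NONE;
    free_head = recs[id].next_free;
    recs[id].in_use = 1;
    recs[id].sector = sector;
    return id;
}

/* Called when the response carrying this id arrives. */
void rec_free(int id)
{
    assert(id >= 0 && id < NR_RECS && recs[id].in_use);
    recs[id].in_use = 0;
    recs[id].next_free = free_head;
    free_head = id;
}
```

Because each record stays valid until its response arrives, the
records that are still marked in_use after a backend crash are exactly
the requests that need to be reissued during recovery.
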
The frontend driver is single-threaded and after setup is entered only
through three points:  (1) read/write requests from the XenLinux guest
that it is a part of, (2) interrupts from the backend driver on its
event channel (blkif_int()), and (3) control messages from Xend
(blkif_ctrlif_rx()).

===== Backend Driver Structure =====

The backend driver is slightly more complex as it must manage any
number of concurrent frontend connections.  For each domain it
manages, the backend driver maintains a blkif structure, which
describes all the connection and disk information associated with that
particular domain.  This structure is associated with the interrupt
registration, and allows the backend driver to have immediate context
when it takes a notification from some domain.

All of the blkif structures are stored in a hash table (blkif_hash),
which is indexed by a hash of the domain id and a "handle", really a
per-domain blkif identifier, in case a domain wants to have multiple
connections.

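A lookup keyed on the (domain id, handle) pair might look roughly like
the following.  This is a sketch with assumed details: the bucket
count, the XOR hash and the names blkif_entry, buckets, blkif_hash_fn,
blkif_lookup and blkif_insert are all mine and are not taken from the
backend source.

```c
#include <assert.h>
#include <stddef.h>

#define HASH_SZ 1024                         /* hypothetical bucket count */

struct blkif_entry {
    unsigned int        domid;               /* frontend domain id */
    unsigned int        handle;              /* per-domain interface id */
    struct blkif_entry *hash_next;           /* chain within one bucket */
};

struct blkif_entry *buckets[HASH_SZ];

/* Illustrative hash: mix the two key fields and reduce to a bucket index. */
unsigned int blkif_hash_fn(unsigned int domid, unsigned int handle)
{
    return (domid ^ handle) & (HASH_SZ - 1);
}

/* Find the entry for (domid, handle), or NULL if none is registered. */
struct blkif_entry *blkif_lookup(unsigned int domid, unsigned int handle)
{
    struct blkif_entry *e = buckets[blkif_hash_fn(domid, handle)];
    while (e != NULL && (e->domid != domid || e->handle != handle))
        e = e->hash_next;
    return e;
}

/* Insert a new entry; returns 0 on success, -1 if the key already exists. */
int blkif_insert(struct blkif_entry *e)
{
    unsigned int b = blkif_hash_fn(e->domid, e->handle);
    if (blkif_lookup(e->domid, e->handle) != NULL)
        return -1;
    e->hash_next = buckets[b];
    buckets[b] = e;
    return 0;
}
```

Comparing both fields during the chain walk matters: with this hash,
(domid 1, handle 0) and (domid 0, handle 1) land in the same bucket
and are only told apart by the full key.
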
The per-connection blkif structure is of type blkif_t.  It contains
all of the communication details (event channel, irq, shared memory
ring and indexes), and blk_ring_lock, which is the backend mutex on
the shared ring.  The structure also contains vbd_rb, which is a
red-black tree containing an entry for each device/partition that is
assigned to that domain.  This structure is filled by Xend passing
disk information to the backend at startup, and is protected by
vbd_lock.  Finally, the blkif struct contains a status field, which
describes the state of the connection.

The backend driver spawns a kernel thread at startup
(blkio_schedule()), which handles requests to and from the actual disk
device drivers.  This scheduler thread maintains a list of blkif
structures that have pending requests, and services them round-robin
with a maximum per-round request limit.  blkifs are added to the list
in the interrupt handler (blkif_be_int()) using
add_to_blkdev_list_tail(), and removed in the scheduler loop after
calling do_block_io_op(), which processes a batch of requests.  The
scheduler thread is explicitly activated at several points in the code
using maybe_trigger_blkio_schedule().

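The effect of the per-round limit is that one busy domain cannot
starve the others.  The following sketch shows the policy in
isolation.  It scans a plain array instead of maintaining the list
described above, and the limit value and the names iface,
service_batch and schedule_round are assumptions made for the example.

```c
#include <assert.h>

#define MAX_PER_ROUND 2     /* hypothetical per-round request limit */

struct iface {
    int pending;            /* requests waiting on this connection */
    int serviced;           /* requests handled so far */
};

/* Service up to max requests on one interface.
 * Returns nonzero if the interface still has requests waiting. */
int service_batch(struct iface *f, int max)
{
    int n = (f->pending < max) ? f->pending : max;
    f->pending  -= n;
    f->serviced += n;
    return f->pending > 0;
}

/* One pass over every interface with pending work.  Returns how many
 * interfaces still have work and would therefore stay on the list. */
int schedule_round(struct iface ifs[], int count)
{
    int still_busy = 0;
    for (int i = 0; i < count; i++) {
        if (ifs[i].pending > 0 && service_batch(&ifs[i], MAX_PER_ROUND))
            still_busy++;
    }
    return still_busy;
}
```

With three interfaces holding 5, 1 and 0 requests, the first round
services two requests from the first interface and the single request
from the second, so the lightly loaded interface is finished after one
pass while the busy one takes three.
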
Pending requests between the backend driver and the physical device
drivers use another ring, pending_ring.  Requests are placed in this
ring in the scheduler thread and issued to the device.  A completion
callback, end_block_io_op(), indicates that requests have been serviced
and generates a response on the appropriate blkif ring.  pending_reqs[]
stores a list of outstanding requests with the physical drivers.

So, control entries to the backend are (1) the blkio scheduler thread,
which sends requests to the real device drivers, (2) end_block_io_op(),
which is called as serviced requests complete, (3) blkif_be_int(),
which handles notifications from the frontend drivers in other
domains, and (4) blkif_ctrlif_rx(), which handles control messages
from Xend.

==== Driver Startup ====

Prior to starting a new guest using the frontend driver, the backend
will have been started in a privileged domain.  The backend
initialisation code initialises all of its data structures, such as
the blkif hash table, and starts the scheduler thread as a kernel
thread.  It then sends a driver status up message to let Xend know it
is ready to take frontend connections.

When a new domain that uses the blkif frontend driver is started,
there is a series of interactions between it, Xend, and the specified
backend driver.  These interactions are as follows:

The domain configuration given to Xend will specify the backend domain
and disks that the new guest is to use.  Prior to actually running the
domain, Xend and the backend driver interact to set up the initial
blkif record in the backend.

(1) Xend sends a BLKIF_BE_CREATE message to the backend.

  Backend does blkif_create(), having been passed FE domid and handle.
  It creates and initialises a new blkif struct, and puts it in the
  hash table.
  It then returns a STATUS_OK response to Xend.

(2) Xend sends a BLKIF_BE_VBD_CREATE message to the backend.

  Backend adds a vbd entry in the red-black tree for the
  specified (dom, handle) blkif entry.
  Sends a STATUS_OK response.

(3) Xend sends a BLKIF_BE_VBD_GROW message to the backend.

  Backend takes the physical device information passed in the
  message and assigns it to the newly created vbd struct.

(2) and (3) repeat as any additional devices are added to the domain.

At this point, the backend has enough state to allow the frontend
domain to start.  The domain is run, and eventually gets to the
frontend driver initialisation code.  After setting up the frontend
data structures, this code continues the communications with Xend and
the backend to negotiate a connection:

(4) Frontend sends Xend a BLKIF_FE_DRIVER_STATUS_CHANGED message.

  This message tells Xend that the driver is up.  The init function
  now spin-waits until driver setup is complete in order to prevent
  Linux from attempting to boot before the disks are connected.

(5) Xend sends the frontend an INTERFACE_STATUS_CHANGED message.

  This message specifies that the interface is now disconnected
  (instead of closed).
  The domain updates its state, and allocates the shared blk_ring
  page.

(6) Frontend sends Xend a BLKIF_INTERFACE_CONNECT message.

  This message specifies the domain and handle, and includes the
  address of the newly created page.

(7) Xend sends the backend a BLKIF_BE_CONNECT message.

  The backend fills in the blkif connection information, maps the
  shared page, and binds an irq to the event channel.

(8) Xend sends the frontend an INTERFACE_STATUS_CHANGED message.

  This message takes the frontend driver to a CONNECTED state, at
  which point it binds an irq to the event channel and calls
  xlvbd_init to initialise the individual block devices.

The frontend Linux is still spin-waiting at this point, until all of
the disks have been probed.  Messaging now is directly between the
front and backend domain using the new shared ring and event channel.

(9) The frontend sends a BLKIF_OP_PROBE directly to the backend.

  This message includes a reference to an additional page that the
  backend can use for its reply.  The backend responds with an array
  of the domain's disks (as vdisk_t structs) on the provided page.

The frontend now initialises each disk, calling xlvbd_init_device()
for each one.
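
Seen from the frontend, steps (5) and (8) above amount to a small
state machine.  The sketch below captures only those two transitions.
It is a simplification under my own assumptions: the enum and function
names are hypothetical, recovery and disconnection paths are left out,
and rejecting out-of-order messages is a choice made for the example
rather than something taken from blkif_status().

```c
#include <assert.h>

enum fe_state { FE_CLOSED, FE_DISCONNECTED, FE_CONNECTED };

enum fe_event {
    EV_STATUS_DISCONNECTED,   /* step (5): interface now disconnected */
    EV_STATUS_CONNECTED       /* step (8): interface now connected    */
};

struct frontend {
    enum fe_state state;
    int ring_allocated;       /* set when the shared page is allocated */
    int irq_bound;            /* set when the event channel is bound   */
};

/* Apply one status message; returns 0 if accepted, -1 if out of order. */
int fe_handle_status(struct frontend *fe, enum fe_event ev)
{
    switch (ev) {
    case EV_STATUS_DISCONNECTED:
        if (fe->state != FE_CLOSED)
            return -1;
        fe->ring_allocated = 1;        /* then send the CONNECT message, (6) */
        fe->state = FE_DISCONNECTED;
        return 0;
    case EV_STATUS_CONNECTED:
        if (fe->state != FE_DISCONNECTED)
            return -1;
        fe->irq_bound = 1;             /* then probe the disks, (9) */
        fe->state = FE_CONNECTED;
        return 0;
    }
    return -1;
}
```

Starting from FE_CLOSED, a CONNECTED message is refused until a
DISCONNECTED message has caused the shared page to be allocated, which
mirrors the ordering of steps (5) through (8).
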