
		Xvisor Design Document

This is document gives an overview of the xvisor design. It gives very 
important insight required for writing architecture support for Xvisor.


	 	Chapter 1: Modeling Virtual Machines

A virtual machine (VM) is a software emulation/simulation of a physical machine
(i.e. a computer system) that can executes programs or operating systems.

Virtual machines are separated into two major categories (based on their use): 
  * System Virtual Machine: A system virtual machine provides a complete system
    platform which supports the execution of a complete operating system (OS). 
  * Process Virtual Machine: A process virtual machine is designed to run a 
    single program, which means that it supports a single process. 

An essential characteristic of a virtual machine is that the software running 
inside is limited to the resources and abstractions provided by the virtual 
machine—it cannot break out of its virtual world.

(Citation: Refer Wikipedia page on "Virtual Machine" and "Hypervisor")

Xvisor is a hardware assisted system virtualization software (i.e. implements
system virtual machine with hardware acceleration) running directly on a host
machine (i.e. physical machine/hardware). In short, we can say that Xvisor 
is a Native (or Type-1) Hypervisor (or Virtual Machine Monitor).

We refer system virtual machine instances as "Guest" instances and CPUs of 
system virtual machines as "VCPU" instances in Xvisor. Also, VCPU belonging to
a Guest is refered as "Normal VCPU" and VCPU not belonging to any guest is 
referred as "Orphan VCPU". Xvisor creates Orphan VCPUs for various background 
processing and running managment daemons. 

Any modern CPU architecure has atleast two priviledge modes: User Mode, and
Supervisor Mode. The User Mode has lowest priviledge and Supervisor Mode has
highest priviledge. Xvisor runs Normal VCPUs in User Mode and Orphan VCPUs in
Supervisor Mode (Note: Architecture specific code has to treat Normal and 
Orphan VCPUs differently). Just like any OS, the VCPU context in Xvisor has 
an architecture dependent part and an architecture independent part. 

The architecture dependent part of VCPU context consist of:
  1. User Registers: Registers which are updated by processor in user mode 
     (or unprivileged mode) only. This registers are usually general purpose 
     registers and status flags which are automatically updated by processor
     (e.g. comparision flags, overflow flag, zero flag, etc).
  2. Super Registers: Registers which are updated by processor in supervisor 
     mode (or privileged mode) only. Whenever VCPU tries to read/write such an
     register is updated we get an exception and we can return/update its 
     virtual value. In most of the cases there are also some additional data 
     structures (like MMU context, shadow TLB, shadow page table, ... etc) 
     in the super registers.
(Note: For Orphan VCPUs only User Registers are important)

The architecture independent part of VCPU context consist of:
  1. Unique VCPU number
  2. Unique VCPU Name (Only important for Orphan VCPUs)
  3. Pointer to VCPU Device Tree Node (Not available for Orphan VCPUs)
  4. Scheduling information (e.g. tick callback function, pending ticks, etc)
  5. Virtual interrupt managment information

In addtion to VCPUs, a Guest also has "Guest Address Space". The Guest Address
Space contains a set of "Guest Regions". Each Guest Region has a unique Guest 
Physical Address (i.e. Physical address at which region is accessible to Guest 
VCPUs) and Physical Size (i.e. Size of Guest Region). Further a Guest Region 
can be one of the two types:
  * Real Guest Region: A Real Guest Region gives direct access to a Host 
    Machine Device/Memory (e.g. RAM, UART, etc). It has to be mapped to a
    Host Physical Address (i.e. Physical address in Host Machine).
  * Virtual Guest Region: A Virtual Guest Region gives access to an emulated
    device (e.g. emulated PIC, emulated Timer, etc). This type of region is
    typically linked with an emulated device. The architecture specific code 
    is responsible for redirecting virtual guest region read/write access to
    the Xvisor device emulation framework. 

The figure 1 below gives a clear picture of the System Virtual Machine Model 
implemented by Xvisor.

+--------------------------+    +--------------------------+    +------------+
|          Guest_0         |    |          Guest_N         |    |            |
+--------------------------+    +--------------------------+    | +--------+ |
| +--------+    +--------+ |    | +--------+    +--------+ |    | |        | |
| |        |    |        | |    | |        |    |        | |    | | Orphan | |
| | VCPU_0 | .. | VCPU_M | |    | | VCPU_0 | .. | VCPU_K | |    | | VCPU_R | |
| |        |    |        | |    | |        |    |        | |    | |        | |
| +--------+    +--------+ | .. | +--------+    +--------+ |    | +--------+ |
+--------------------------+    +--------------------------+    |      .     |
|       Address Space      |    |       Address Space      |    |      .     |
|+--------+ +-----+ +-----+|    |+--------+ +-----+ +-----+|    |      .     |
|| Memory | | PIC | | PIT ||    || Memory | | PIC | | PIT ||    | +--------+ |
|+--------+ +-----+ +-----+|    |+--------+ +-----+ +-----+|    | |        | |
|+-----+ +------+ +-----+  |    |+-----+ +------+ +-----+  |    | | Orphan | |
|| ROM | | UART | | LCD |  |    || ROM | | UART | | LCD |  |    | | VCPU_0 | |
|+-----+ +------+ +-----+  |    |+-----+ +------+ +-----+  |    | |        | |
+--------------------------+    +--------------------------+    | +--------+ |
+---------------------------------------------------------------+            |
|                                                                            |
|                 eXtensible Versatile hypervISOR (Xvisor)                   |
|                                                                            |
+----------------------------------------------------------------------------+
+----------------------------------------------------------------------------+
|                                                                            |
|            Host Machine (Host CPU + Host Memory + Host Devices)            |
|                                                                            |
+----------------------------------------------------------------------------+


	 	Chapter 2: Hypervisor Configuration

Xvisor maintians its configuration in the form of Tree data structure called 
"device tree". It is highly inspired from Device Tree Script (DTS) used by
of_platform of Linux kernel.

In Linux, if an architecture (e.g. PowerPC) is using of_platform then at 
booting time Linux kernel will expect a DTB file (Flattened device tree file) 
from the boot loader. The DTB file is binary file generated by compiling a DTS 
(Device tree script) using DTC (Device tree compiler). An of_platform enabled 
Linux kernel only probes those drivers which are compatible or matching to the 
devices mentioned in the device tree populated from DTB file. Unlike Linux
of_platform using DTB file is not mandatory for Xvisor. The Xvisor architecture
specific code (or board specific code more precisely) can populate the device
tree from plain text, XML file, DTB embedded in Xvisor binary, hard coded 
in source code, etc. In simpler words, device tree in Xvisor is just a data
structure used for managing hypervisor configuration.

(Note: For more information on device tree syntax used by PowerPC Linux Kernel 
refer https://www.power.org/resources/downloads/Power_ePAPR_APPROVED_v1.0.pdf)

Even thought Xvisor device tree just a data structure, following constraints
must be ensured while updating/populating Xvisor device tree:
  * Node Name: It must have characters from one of the following only,
      -> digit: [0-9]
      -> lowercase letter: [a-z]
      -> upercase letter: [A-Z]
      -> underscore: _
      -> dash: -
  * Attribute Name: It must have characters from one of the following only,
      -> digit: [0-9]
      -> lowercase letter: [a-z]
      -> upercase letter: [A-Z]
      -> underscore: _
      -> dash: -
      -> hash: #
  * Attribute String Value: A string attribute value must end with NULL 
    character (i.e. '\0' or character value 0). For a string list, each string 
    must be seperated by excatly one NULL character.
  * Attribute 32-bit unsigned Value: A 32-bit integer value must be represented
    in big-endian format or little-endian format based on the endianess of host
    CPU architecture. 
  * Attribute 64-bit unsigned Value: A 64-bit integer value must be represented
    in big-endian format or little-endian format based on the endianess of host
    CPU architecture. 
(Note: The Xvisor architecture specific code must ensure that the above 
constraints are satisfied while populating device tree)
(Note: For standard attributes used by Xvisor refer source code.)

The figure 2 below shows the device tree representation of the hypervisor setup
shown in figure 1 of chapter 1.

  (Root)
+--------+
|        |
+--------+
    |
    |     (General Configuration)
    |          +--------+
    +----------|  vmm   |
    |          +--------+
    |
    |      (Host Configuration)
    |          +--------+
    +----------|  host  |
    |          +--------+
    |              |
    |              |          (Host CPUs)
    |              |          +--------+
    |              |----------|  cpus  |
    |              |          +--------+
    |              |              |
    |              |              |          +--------+
    |              |              +----------|  cpu0  |
    |              |              |          +--------+
    |              |              |              .
    |              |              |              .
    |              |              |              .
    |              |              |          +--------+
    |              |              +----------|  cpuL  |
    |              |                         +--------+
    |              |
    |              |        (Host Hardware)
    |              |          +--------+
    |              +----------|  ....  |
    |                         +--------+
    |
    |     (Preconfigured Guests)
    |          +--------+
    +----------| guests |
               +--------+
                   |
                   |           (Guest)
                   |          +--------+
                   +----------| guest0 |
                   |          +--------+
                   |              |        (Guest VCPUs)
                   |              |          +--------+
                   |              |----------| vcpus  |
                   |              |          +--------+
                   |              |              |
                   |              |              |            (VCPU)
                   |              |              |          +--------+
                   |              |              +----------| vcpu0  |
                   |              |              |          +--------+
                   |              |              |              .
                   |              |              |              .
                   |              |              |              .
                   |              |              |            (VCPU)
                   |              |              |          +--------+
                   |              |              +----------| vcpuM  |
                   |              |                         +--------+
                   |              |
                   |              |     (Guest Address Space)
                   |              |          +--------+
                   |              +----------| aspace |
                   |                         +--------+
                   |                             |
                   |                             |        (Guest Region)
                   |                             |          ----------
                   |                             +----------| Memory |
                   |                             |          ----------
                   |                             |               
                   |                             |        (Guest Region)
                   |                             |          +--------+
                   |                             +----------|  PIC   |
                   |                             |          +--------+
                   |                             |               
                   |                             |        (Guest Region)
                   |                             |          +--------+
                   |                             +----------|  PIT   |
                   |                             |          +--------+
                   |                             |               
                   |                             |        (Guest Region)
                   |                             |          +--------+
                   |                             +----------|  UART  |
                   |                             |          +--------+
                   |                             |               
                   |                             |        (Guest Region)
                   |                             |          +--------+
                   |                             +----------|  LCD   |
                   |                             |          +--------+
                   |                             |               
                   |                             |        (Guest Region)
                   |                             |          ----------
                   |                             +----------|  ROM   |
                   |                                        +--------+
                   |
                   |           (Guest)
                   |          ----------
                   +----------| guestN |
                              ----------
                                  |        (Guest VCPUs)
                                  |          +--------+
                                  +----------| vcpus  |
                                  |          +--------+
                                  |              |
                                  |              |            (VCPU)
                                  |              |          +--------+
                                  |              +----------| vcpu0  |
                                  |              |          +--------+
                                  |              |              .
                                  |              |              .
                                  |              |              .
                                  |              |            (VCPU)
                                  |              |          +--------+
                                  |              +----------| vcpuK  |
                                  |                         +--------+
                                  |
                                  |     (Guest Address Space)
                                  |          +--------+
                                  +----------| aspace |
                                             +--------+
                                                 |
                                                 |        (Guest Region)
                                                 |          +--------+
                                                 +----------| Memory |
                                                 |          +--------+
                                                 |               
                                                 |        (Guest Region)
                                                 |          +--------+
                                                 +----------|  PIC   |
                                                 |          +--------+
                                                 |               
                                                 |        (Guest Region)
                                                 |          +--------+
                                                 +----------|  PIT   |
                                                 |          +--------+
                                                 |               
                                                 |        (Guest Region)
                                                 |          ----------
                                                 +----------|  UART  |
                                                 |          +--------+
                                                 |               
                                                 |        (Guest Region)
                                                 |          +--------+
                                                 +----------|  LCD   |
                                                 |          +--------+
                                                 |               
                                                 |        (Guest Region)
                                                 |          +--------+
                                                 +----------|  ROM   |
                                                            +--------+

By default, Xvisor will always support configuring device tree using DTS.
It also includes a DTC compiler taken from Linux kernel source code and 
a light-weight DTB parsing library (vmm_libfdt) which can be used by
architecture specific code (or board specific code) to populate device
tree for Xvisor.


	 	Chapter 3: Virtual CPU Scheduler

The VCPUs are a black box for Xvisor (i.e. any thing could be running on 
the VCPU) and exception/interrupt is the only way of entering into hypervisor 
mode when a VCPU is running directly on hardware, Hence VCPU scheduler will
always be invoked in exception/interrupt mode.

To simplfy context switching, we expect that the architecture specific 
exception/interrupt handler will save user register on interrupt stack and 
restore user registers from interrupt stack before interrupt return. Due to 
this simplification the user register context of current VCPU is available on
interrupt stack and saving/restoring user registers become simple memory 
copy operations.

To keep lower memory foot print, scheduler expects per host CPU interrupt 
stack in contrast to traditional UNIX systems which have per process 
interrupt/kernel stack. 

The VCPU scheduler use a round-robin strategy for fair share amoung VCPU, 
although VCPUs can have variable length time slices. Xvisor keeps track 
of time consumed by a VCPU through periodic calls to scheduler tick handler
(i.e. vmm_scheduler_tick()) from architecture specific code.

The possible scenarios in which a VCPU scheduler is invoked are:
  1. Just like any other OS, Xvisor also configures an architecture specific 
     timer interrupt. On every timer interrupt the architecture specific code 
     is expected to call scheduler tick handler (i.e. vmm_scheduler_tick()). 
     The tick handler invokes scheduler whenever time slice of a VCPU expires. 
     We call this situation as VCPU premption.
  2. If a VCPU misbehaves (i.e. does some invalid register/memory access) 
     then we get an exception/interrupt. The architecture specific code can 
     detect such situation and halt/pause the responsible VCPU. In this 
     situation the state of VCPU is changing in interrupt mode, hence we must 
     schedule some other VCPU which is runnable.
  3. The VCPU state can also be changed from some hypervisor thread or 
     managment terminal command. If the VCPU is already running when the state
     change occurs from a hypervisor thread  (can happen for SMP systems) then
     the state change will be reflected on next timer tick else the state 
     change will be reflected immediately.

In order to keep in track of VCPU state changes (by architecture specific code
or some hypervisor thread) and emulate interrupts for a VCPU, we must ensure 
that generic interrupt handler (i.e. vmm_scheduler_irq_process) is called at 
the end of every exception/interrupt.

The expected high-level steps involved in architecture specific 
exception/interrupt handler are:
  1. [Assembly] Switch stack to interrupt stack (if this is done automatically
     by hardware then skip this step)
  2. [Assembly] Save all user registers on interrupt stack
  3. [Assembly] Call the high-level c function with argument as pointer to 
     user registers on interrupt stack
  4. [C Code] If exception was generated by a host device then call host 
     interrupt subsystem to handle host device interrupt 
     (i.e. call vmm_host_irq_exec()) with following arguments:
     - cpu exception/interrupt number
     - pointer to user registers on interrupt stack
  5. [C Code] Do architecture specific processing for exception/interrupt.
  6. [C Code] Call generic interrupt handler (i.e. vmm_scheduler_irq_process) 
     with following arguments: 
     - pointer to user registers on interrupt stack
  7. [C Code] Return from high-level C function.
  8. [Assembly] Restore user registers from interrupt stack
  9. [Assembly] Do interrupt/exception return
(Note: To support nested interrupt/exception handler then modify the above 
steps such that generic interrupt handler (i.e. Step 6) is called only once 
in interrupt nesting).

The expected high-level steps involved in architecture specific VCPU context
switching (i.e. vmm_vcpu_regs_switch()) are as follows:
  1. Save user registers from stack (saved by architecture specific 
     exception/interrupt handler) to user registers of current VCPU 
     (i.e. 'uregs' of current VCPU).
  2. Restore user register of next VCPU (i.e. 'uregs' of current VCPU) to stack
     (will be restored by architecture specific exception/interrupt handler).
  3. Switch context of architecture specific CPU resources such as MMU, 
     Floating point subsystem, etc.
(Note: We don't need to save/restore supervisor registers because it is 
expected that whenever VCPU updates a supervisor register xvisor gets an 
exception. In the architecture specific code of exception handler we emulate 
the instruction updating supervisor register and keep the supervisor register 
of a VCPU in sync.)


	 	Chapter 4: Virtual CPU State Transitions

A VCPU can be in excatly one state at any give instance of time. Below is a 
brief description of all possible states:
  * UNKNOWN: VCPU does not belong to any Guest and is not Orphan VCPU. To 
    enforce lower memory foot print, we pre-allocate memory based on maximum 
    number of VCPUs and put them in this state.
  * RESET: VCPU is initialized and is waiting for someone to kick it to READY
    state. To create a new VCPU, the VCPU scheduler picks up a VCPU in UNKNOWN 
    state from pre-allocated VCPUs and intialize it. After initialization the
    newly created VCPU is put in RESET state.
  * READY: VCPU is ready to run on hardware.
  * RUNNING: VCPU is currently running on hardware.
  * PAUSED: VCPU has been stopped and can resume later. A VCPU is set in this 
    state (usually by architecture specific code) when it detects that the VCPU
    is idle and can be scheduled out.
  * HALTED: VCPU has been stopped and cannot resume. A VCPU is set in this 
    state (usually by architecture specific code) when some errorneous access
    is done by that VCPU.

A VCPU state change can occur from various locations such as architecture 
specific code, some hypervisor thread, scheduler, some emulated device, etc. 
Its not possible to describe all possible VCPU state transtion scenarios but, 
the figure 3 below can give a fair idea about it.

                                           +---------+
                               [Reset]     |         |      [Halt]     
                           +---------------|  HALTED |<-----------------+
                           |               |         |                  |
                           |               +---------+                  |
                           |                    A                       |
                           |                    | [Halt]                |
                           V                    |                       |
+---------+ [Create]  +---------+  [Kick]  +---------+ [Scheduler] +---------+
|         |---------->|         |--------->|         |------------>|         |
| UNKNOWN |           |  RESET  |          |  READY  |             | RUNNING |
|         |<----------|         |<---------|         |<------------|         |
+---------+ [Destroy] +---------+  [Reset] +---------+ [Scheduler] +---------+
                        A     A              A     |                 |     |
                        |     |     [Resume] |     | [Pause]         |     |
                        |     |              |     V                 |     |
                        |     |            +---------+               |     |
                        |     |            |         |               |     |
                        |     +------------|  PAUSED |<--------------+     |
                        |        [Reset]   |         |   [Pause]           |
                        |                  +---------+                     |
                        |                                                  |
                        +--------------------------------------------------+
                                             [Reset]


	 	Chapter 5: Virtual CPU Interrupts

TBD.


	 	Chapter 6: Hypervisor Threads (Hyperthreads)

TBD.


