This page describes how processes are scheduled across the different CPUs available on a SEAPATH hypervisor.
SEAPATH default CPU isolation
SEAPATH aims to host virtual machines with real-time needs. To achieve that, process scheduling must be tuned in order to offer the best performance to the VMs.
...
Info |
---|
In the Ansible inventory of the hypervisors, these CPUs are defined by the `isolcpus` variables. |
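As an illustration, an hypervisor entry in the Ansible inventory could define this variable as follows (the host name and CPU range are hypothetical examples; only the `isolcpus` variable name comes from this page):

```yaml
# Hypothetical inventory excerpt - host name and CPU range are examples only.
hypervisors:
  hosts:
    hypervisor1:
      # CPUs removed from the general scheduler, reserved for real-time VMs
      isolcpus: "2-5"
```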
Tuned
The Debian version of SEAPATH uses tuned (https://github.com/redhat-performance/tuned)
...
On Yocto, tuned is not used. Instead, all these configurations are done at compile time.
Scheduling virtual machines
SEAPATH virtual machines are managed by Qemu.
...
By default, all these threads are managed by the Linux scheduler and thus run on the non-isolated cores. They can also be pinned to specific CPUs, which forces them to run there.
Standard virtual machines
For a VM without any performance or real-time needs, there is no need to handle the Qemu threads in any particular way :
All threads will inherit a default priority and scheduling type (TS 19)
All threads will be handled by the Linux scheduler on the non isolated cores
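This default behaviour can be observed on any ordinary process; the QEMU threads of a standard VM show the same `TS` scheduling class (a sketch using standard `ps` fields):

```shell
# Any ordinary process - this shell, like the QEMU threads of a standard
# VM - runs under the default time-sharing policy, shown as class "TS"
# by ps, with no real-time priority (RTPRIO is "-").
ps -o pid,class,rtprio,ni,comm -p $$
```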
Real time virtual machines
For a VM where performance and determinism are needed, here are our recommendations :
...
The vCPU scheduler type has to be FIFO (FF). A real-time priority of 1 is enough.
TODO : put the link to VM configuration wiki page once written
For more information, read page Virtual machines on SEAPATH.
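In the libvirt domain XML, this recommendation translates into a `vcpusched` element (a sketch; the vCPU number is an example):

```xml
<cputune>
  <!-- SCHED_FIFO with real-time priority 1 for vCPU 0, as recommended -->
  <vcpusched vcpus="0" scheduler="fifo" priority="1"/>
</cputune>
```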
Finer control with cgroup (optional)
Implementation in SEAPATH
The Linux kernel uses cgroups to isolate processes. These cgroups work in a hierarchy where each layer restricts the resources a process can access. Systemd also uses this mechanism by grouping its processes into slices.
...
TODO : put the link to the inventories README once written
Utility of slices CPU isolation
Using these slices is useful to get a preset of CPU isolation for virtual machines. When a VM is placed in either the machine-rt or the machine-nort slice, it is automatically scheduled on the CPUs of that slice.
This is particularly useful when deploying many VMs at once.
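As a sketch, the CPUs allowed for such a slice can be set with a systemd drop-in using the standard `AllowedCPUs` resource-control property (the drop-in path and CPU range are examples; the machine-rt slice name comes from this page):

```ini
# /etc/systemd/system/machine-rt.slice.d/cpus.conf (hypothetical drop-in path)
[Slice]
# CPUs that VMs placed in this slice are allowed to run on
AllowedCPUs=2-5
```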
...
Info |
---|
This new isolation layer only protects against very advanced attacks. Because it also has drawbacks (see below), whether or not to activate this feature remains an open question. |
Drawbacks
By activating CPU isolation on the machine slice, the management threads of the VM will be scheduled on the allowed CPU list of the slice. This mechanism implies two things :
...
Info |
---|
The management thread scheduling is handled by the `emulatorpin` field in the libvirt XML. |
TODO : put the link to VM configuration wiki page once written
For more information, read page Virtual machines on SEAPATH.
Specific configurations
NUMA
NUMA (Non-Uniform Memory Access) refers to machines that contain several CPU sockets. Each socket has its own memory and cache, which means that accessing the memory of another socket is much slower than accessing its own.
...
If your system contains more than one NUMA cell, you must be careful to pin all the vCPU threads of a VM on the same NUMA cell. Otherwise, the data transfers between two cells will significantly slow down the VM.
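In the libvirt domain XML this can be sketched by keeping the VM's memory and vCPUs on the same NUMA cell (cell 0 and the CPU numbers are illustrative examples):

```xml
<numatune>
  <!-- keep the VM's memory allocated on NUMA cell 0 -->
  <memory mode="strict" nodeset="0"/>
</numatune>
<cputune>
  <!-- pin both vCPUs on CPUs belonging to cell 0 -->
  <vcpupin vcpu="0" cpuset="2"/>
  <vcpupin vcpu="1" cpuset="3"/>
</cputune>
```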
Hyper-threading
Most modern CPUs support hyper-threading. This option can be enabled in the BIOS and doubles the number of logical CPUs available on the system. However, these additional logical CPUs are not as fast and independent as physical cores.
...
Info |
---|
On most systems, logical CPUs are grouped in numerical order (0 with 1, 2 with 3 …) but this is not always the case. Always refer to `virsh capabilities` to check the exact architecture. |
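Besides `virsh capabilities`, the pairing of logical CPUs can also be checked directly on the host (a sketch using `lscpu`; rows that share the same CORE id are hyper-thread siblings):

```shell
# Print "logical CPU,physical core" pairs; logical CPUs reporting the
# same CORE id are hyper-thread siblings of each other.
lscpu --parse=CPU,CORE | grep -v '^#'
```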
Annex: list of tuned modifications
Below is a list of all the scheduling modifications done by tuned.
...
/sys/module/kvm/parameters/halt_poll_ns = 0
/sys/kernel/ktimer_lockless_check = 1
/sys/kernel/mm/ksm/run = 2

Kernel parameters :
isolcpus=managed_irq,domain,{isolated_cores}
intel_pstate=disable
nosoftlockup
tsc=reliable
nohz=on
nohz_full={isolated_cores}
rcu_nocbs={isolated_cores}
irqaffinity={non_isolated_cores}
processor.max_cstate=1
intel_idle.max_cstate=1
cpufreq.default_governor=performance
rcu_nocb_poll

Kernel thread priorities :
group.ksoftirqd=0:f:2:*:^\[ksoftirqd
group.ktimers=0:f:2:*:^\[ktimers
group.rcuc=0:f:4:*:^\[rcuc
group.rcub=0:f:4:*:^\[rcub
group.ktimersoftd=0:f:3:*:^\[ktimersoftd

Configures irqbalance with the isolated_cores list.
Configures workqueue with the isolated_cores list.

kernel.hung_task_timeout_secs = 600
kernel.nmi_watchdog = 0
kernel.sched_rt_runtime_us = -1
vm.stat_interval = 10
kernel.timer_migration = 0
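For reference, kernel parameters like these are typically applied by a tuned profile through its bootloader plugin. A minimal sketch (this is not SEAPATH's actual profile; the CPU range is an example):

```ini
# Hypothetical tuned profile fragment - not the actual SEAPATH profile.
[variables]
isolated_cores=2-5

[bootloader]
cmdline=isolcpus=managed_irq,domain,${isolated_cores} nohz_full=${isolated_cores} rcu_nocbs=${isolated_cores}
```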
Interrupt Requests
irqbalance is a daemon for SMP (Symmetric shared-memory MultiProcessing) systems that helps balance the CPU load generated by interrupts across all of a system's CPUs. irqbalance identifies the highest volume interrupt sources and isolates each of them to a single unique CPU, so that the load is spread as much as possible over the entire processor set, while minimizing cache miss rates for IRQ handlers.
The irqmask defines the environment variable IRQBALANCE_BANNED_CPUS. It makes irqbalance ignore some CPUs and never assign interrupts to them (more details in the irqbalance manual). This variable is a mask where the first CPU is the least significant bit; the CPUs specified in cpusystem should be set to 0. The workqueuemask is its negation and is used to configure the kernel.
Example:
...
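As a sketch, on a 4-CPU system where CPUs 2 and 3 are isolated, the banned mask can be computed like this (the CPU list is an example):

```shell
# Build IRQBALANCE_BANNED_CPUS for isolated CPUs 2 and 3: bit n (LSB = CPU 0)
# is set to 1 to ban CPU n from receiving balanced interrupts, while the
# system CPUs 0 and 1 stay at 0.
mask=0
for cpu in 2 3; do
  mask=$(( mask | (1 << cpu) ))
done
printf 'IRQBALANCE_BANNED_CPUS=%x\n' "$mask"   # -> IRQBALANCE_BANNED_CPUS=c
```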
Kernel configuration
The cluster is installed with a real-time kernel on each node, so the kernel runs with some parameters:
- isolcpus parameter: isolates some CPUs from the scheduler. It's the value of cpumachinesrt.
- rcu_nocbs parameter: removes one or more CPUs from the candidates for running callbacks. It's the value of cpumachinesrt.
VM configuration
The official documentation on the XML format of libvirt is here.
Resources
In the XML configuration of a virtual machine, a resource partition can be specified to select which slice should be used (more details here). The virtual machine will then only have access to the CPUs associated with that slice.
Possible values:
/machine/nort
/machine/rt
Example, for a real-time virtual machine:
<resource>
  <partition>/machine/rt</partition>
</resource>
CPU tuning
In the project, this element will be used to limit the virtual machine (more details here).
- The emulatorpin element specifies which host physical CPUs the emulator (a subset of the domain not including vCPUs or iothreads) will be pinned to.
- The vcpupin element specifies which of the host's physical CPUs a domain vCPU will be pinned to. It is used to reserve one or more CPUs for a critical virtual machine, so it is important not to use these CPUs for another VM.
- The vcpusched element specifies the scheduler type for a particular vCPU. A priority can also be set. In the project, values greater than 10 are reserved for the host, 10 is reserved for RCU, and values below 10 set the priority of the RT vCPUs among themselves.
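Putting the three elements together, a domain's `<cputune>` section might look like this (the CPU numbers and priority are illustrative):

```xml
<cputune>
  <!-- management/emulator threads stay on the non-isolated CPUs -->
  <emulatorpin cpuset="0-1"/>
  <!-- each vCPU pinned to its own dedicated CPU -->
  <vcpupin vcpu="0" cpuset="2"/>
  <vcpupin vcpu="1" cpuset="3"/>
  <!-- SCHED_FIFO, with a priority below 10 as reserved for RT vCPUs -->
  <vcpusched vcpus="0-1" scheduler="fifo" priority="1"/>
</cputune>
```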