Linux-VServer - User contributions [en]

Memory Limits

2007-01-21T19:44:06Z

Meandtheshell: corrected a typo

A vserver kernel keeps track of many resources used by each guest (context). Some of these relate to memory usage by the guest. You can place limits on these resources to prevent guests from using all the host memory and making the host unusable.

Two resources are particularly important in this regard:

* The '''Resident Set Size''' (<code>rss</code>) is the number of pages currently present in RAM.
* The '''Address Space''' (<code>as</code>) is the total amount of memory (pages) mapped in each process in the context.

Both are measured in '''pages''', which are 4 kB each on Intel machines (i386). So a value of 200000 means a limit of 800,000 kB, or about 781 MB.
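The page arithmetic can be sketched as a pair of shell helpers. This is only an illustration: the function names are made up for this example, and the 4 kB page size is the i386 value quoted above (on a live system, <code>getconf PAGESIZE</code> reports the actual size in bytes).

```shell
# Assumed page size in kB (the i386 value quoted in the text).
PAGE_KB=4

# pages_to_kb <pages>: size in kB of the given number of pages.
pages_to_kb() {
    echo $(( $1 * PAGE_KB ))
}

# kb_to_pages <kB>: number of whole pages that fit into the given size.
# Integer division rounds down, so the limit never exceeds the request.
kb_to_pages() {
    echo $(( $1 / PAGE_KB ))
}

pages_to_kb 200000     # 800000
kb_to_pages 40960      # 10240
```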

Each resource has a '''soft''' and a '''hard limit'''.

* If a guest exceeds the <code>rss</code> hard limit, the kernel will invoke the Out-of-Memory (OOM) killer to kill some process in the guest.
* The <code>rss</code> soft limit is shown inside the guest as the maximum available memory. If a guest exceeds the <code>rss</code> soft limit, it will get an extra "bonus" for the OOM killer (proportional to the oversize).
* If a guest exceeds the <code>as</code> hard limit, memory allocation attempts will return an error, but no process is killed.
* The <code>as</code> soft limit is not currently used. In the future it may be used to penalize guests over that limit, or to force swapping on them.

Bertl explained the difference between '''rss''' and '''as''' with the following example. If two processes share 100 MB of memory, then only 100 MB worth of virtual memory pages can be used at most, so the RSS use of the guest increases by 100 MB. However, two processes are using it, so the AS use increases by 200 MB.

This makes me think that limiting AS is less useful than limiting RSS, since AS doesn't directly reflect the real, limited resources (RAM and swap) on the host whose exhaustion deprives other guests. Bertl says that AS limits can be used to give guests a "gentle" warning that they are running out of memory, but I don't know how much more gentle it is, or how to set it accurately.

For example, 100 processes each mapping a 100 MB file would consume a total of 10 GB of address space (AS), but no more than 100 MB of resources on the host. But if you set the AS limit to 10 GB, then it will not stop one process from allocating 4 GB of RAM, which could kill the host or result in that process being killed by the OOM killer.
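The two examples above boil down to simple arithmetic, sketched here with hypothetical helper names. It models only the shared-mapping case described in the text, not the kernel's actual accounting.

```shell
# as_usage_mb <nproc> <mapping_mb>
# Address space: every process is charged for the full mapping.
as_usage_mb() {
    echo $(( $1 * $2 ))
}

# rss_upper_bound_mb <nproc> <mapping_mb>
# Resident set: shared pages are resident only once, so the bound
# does not grow with the number of processes.
rss_upper_bound_mb() {
    echo "$2"
}

as_usage_mb 2 100            # 200   (two processes sharing 100 MB)
rss_upper_bound_mb 2 100     # 100
as_usage_mb 100 100          # 10000 (the "10 GB" case)
rss_upper_bound_mb 100 100   # 100
```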

You can set the hard limit on a particular context, effective immediately, with this command:

<pre>
/usr/sbin/vlimit -c <xid> --<resource> <value>
</pre>

<xid> is the context ID of the guest, which you can determine with the <code>/usr/sbin/vserver-stat</code> command.

For example, if you want to change the '''rss''' hard limit for the vserver with <xid> 49000, and limit it to 10,000 pages (40 MB), you could use this command:

<pre>
/usr/sbin/vlimit -c 49000 --rss 10000
</pre>

You can change the soft limit instead by adding the -S parameter.

Changes made with the vlimit command are effective only until the vserver is stopped. To make permanent changes, write the value to this file:

<pre>
/etc/vservers/<name>/rlimits/<resource>.hard
</pre>

To set a soft limit, use the same file name without the <code>.hard</code> extension. The <code>rlimits</code> directory is not created by default, so you may need to create it yourself.

Changes to these files take effect only when the vserver is started. To make immediate and permanent changes to a running vserver, you need to run vlimit '''and''' update the rlimits file.
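A minimal sketch of doing both steps together follows. The function names and the <code>CONF_ROOT</code> variable are inventions of this example (the variable only exists so the sketch can be tried against a scratch directory), and it assumes the limit file holds nothing but the numeric value.

```shell
# Real location is /etc/vservers; CONF_ROOT is a knob added for this sketch.
CONF_ROOT="${CONF_ROOT:-/etc/vservers}"

# write_hard_limit <name> <resource> <pages>
# Persist a hard limit; it is read when the vserver is next started.
write_hard_limit() {
    dir="$CONF_ROOT/$1/rlimits"
    mkdir -p "$dir" || return 1      # rlimits/ is not created by default
    echo "$3" > "$dir/$2.hard"
}

# apply_hard_limit <name> <xid> <resource> <pages>
# Persist the limit and also apply it to the running context.
apply_hard_limit() {
    write_hard_limit "$1" "$3" "$4" &&
    /usr/sbin/vlimit -c "$2" "--$3" "$4"
}
```

For the example guest above, this would be called as <code>apply_hard_limit <name> 49000 rss 10000</code>, run as root on the host.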

The safest setting, to prevent any guest from interfering with any other, is to set the total of all RSS hard limits (across all running guests) to be less than the total virtual memory (RAM and swap) on the host. It should be sufficiently less to leave room for processes running on the host, and some disk cache, perhaps 100 MB.

However, this is very conservative, since it assumes the worst case where all guests are using the maximum amount of memory at one time. In practice, you can usually get away with contended resources, i.e. allowing guests to use more than this value.
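The conservative rule can be checked with a few lines of shell. This is a sketch under stated assumptions: the function name is hypothetical, pages are taken to be 4 kB, and the host totals and the reserve are passed in by hand rather than read from the system.

```shell
PAGE_KB=4   # assumed page size in kB

# rss_budget_ok <host_ram_plus_swap_kb> <reserve_kb> <rss_hard_pages>...
# Succeeds if the summed RSS hard limits of all guests fit into the
# host's virtual memory minus the reserve kept for the host itself.
rss_budget_ok() {
    avail_kb=$(( $1 - $2 ))
    shift 2
    total_kb=0
    for pages in "$@"; do
        total_kb=$(( total_kb + pages * PAGE_KB ))
    done
    [ "$total_kb" -le "$avail_kb" ]
}

# 2 GB of RAM plus swap, 100 MB reserve, two guests at 200000 pages each.
if rss_budget_ok 2097152 102400 200000 200000; then
    echo "limits fit"
else
    echo "limits are overcommitted"
fi
```

A third guest with the same limit would push the total past the available memory, which is the overcommitted ("contended") case described above.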

Memory Limits

2007-01-21T19:42:47Z

Meandtheshell: what is "as soft limit" for ...

Paper

2007-01-12T20:02:18Z

Meandtheshell: /* File Attributes */

== Abstract ==

A soft partitioning concept based on ''Security Contexts'' which permits the creation of many independent Virtual Private Servers (VPS) that run simultaneously on a single physical server at full speed, efficiently sharing hardware resources.

A VPS provides an operating environment almost identical to that of a conventional Linux Server. All services, such as ssh, mail, Web and databases, can be started on such a VPS, without (or in special cases with only minimal) modification, just like on any real server.

Each virtual server has its own user account database and root password and is isolated from other virtual servers, except for the fact that they share the same hardware resources.

== Introduction ==

Over the years, computers have become sufficiently powerful to use virtualization to create the illusion of many smaller virtual machines, each running a separate operating system instance.

There are several kinds of Virtual Machines (VMs) which provide similar features, but differ in the degree of abstraction and the methods used for virtualization.

Most of them accomplish what they do by ''emulating'' some real or fictional hardware, which in turn requires ''real'' resources from the Host (the machine running the VMs). This approach, used by most System Emulators (like QEMU, Bochs, ...), allows the emulator to run an arbitrary Guest Operating System, even for a different Architecture (CPU and Hardware). No modifications need to be made to the Guest OS because it isn't aware of the fact that it isn't running on real hardware.

Some System Emulators require small modifications or specialized drivers to be added to Host or Guest to improve performance and minimize the overhead required for the hardware emulation. Although this significantly improves efficiency, there are still large amounts of resources being wasted in caches and mediation between Guest and Host (examples for this approach are UML and Xen).

But suppose you do not want to run many different Operating Systems simultaneously on a single box? Most applications running on a server do not require hardware access or kernel level code, and could easily share a machine with others, if they could be separated and secured...

== The Concept ==

At a basic level, a Linux Server consists of three building blocks: Hardware, Kernel and Applications. The Hardware usually depends on the provider or system maintainer, and, while it has a big influence on the overall performance, it cannot be changed that easily, and will likely differ from one setup to another.

The main purpose of the Kernel is to build an abstraction layer on top of the hardware to allow processes (Applications) to work with and operate on resources (Data) without knowing the details of the underlying hardware. Ideally, those processes would be completely hardware agnostic, by being written in an interpreted language and therefore not requiring any hardware-specific knowledge.

Given that a system has enough resources to drive ten times the number of applications a single Linux server would usually require, why not put ten servers on that box, which will then share the available resources in an efficient manner?

Most server applications (e.g. httpd) assume that they are the only application providing a particular service, and usually also assume a certain filesystem layout and environment. This dictates that similar or identical services running on the same physical server, differing for example only in their addresses, have to be coordinated. This typically requires a great deal of administrative work which can lead to reduced system stability and security.

The basic concept of the Linux-VServer solution is to separate the user-space environment into distinct units (sometimes called Virtual Private Servers) in such a way that each VPS looks and feels like a real server to the processes contained within.

Although different Linux Distributions use (sometimes heavily) patched kernels to provide special support for unusual hardware or extra functionality, most Linux Distributions are not tied to a special kernel.

Linux-VServer uses this fact to allow several distributions to be run simultaneously on a single, shared kernel, without direct access to the hardware, sharing the resources in a very efficient way.

== Existing Infrastructure ==

Recent Linux Kernels already provide many security features that are utilized by Linux-VServer to do its work, especially the Linux Capability System, Resource Limits, File Attributes and the Change Root Environment. The following sections give a short overview of each of these.

=== Linux Capability System ===

In computer science, a capability is a token used by a process to prove that it is allowed to perform an operation on an object. The Linux Capability System is based on "POSIX Capabilities", a somewhat different concept, designed to split up the all powerful root privilege into a set of distinct privileges.

==== POSIX Capabilities ====

A process has three sets of bitmaps called the inheritable (I), permitted (P), and effective (E) capabilities. Each capability is implemented as a bit in each of these bitmaps that is either set or unset.

When a process tries to do a privileged operation, the operating system will check the appropriate bit in the effective set of the process (instead of checking whether the effective uid of the process is 0 as is normally done).

For example, when a process tries to set the clock, the Linux kernel will check that the process has the CAP_SYS_TIME bit (which is currently bit 25) set in its effective set.
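The bit test can be reproduced from user space. The sketch below uses hypothetical helper names and relies on the <code>CapEff</code> line that current Linux kernels expose in <code>/proc/<pid>/status</code> as a hexadecimal mask. The bit number 25 is the one quoted above.

```shell
# has_cap <hex_mask> <bit>: succeed if the given bit is set in the mask.
has_cap() {
    [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}

# current_capeff: print the effective capability mask of this shell,
# taken from the CapEff line of /proc/<pid>/status.
current_capeff() {
    awk '/^CapEff:/ { print $2 }' "/proc/$$/status"
}

CAP_SYS_TIME=25   # bit number quoted in the text

# A mask with only bit 25 set, used here purely for illustration.
if has_cap 0000000002000000 "$CAP_SYS_TIME"; then
    echo "CAP_SYS_TIME is in the effective set"
fi
```

To test the running shell instead of a fixed mask, pass the output of <code>current_capeff</code> as the first argument.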

The permitted set of the process indicates the capabilities the process can use. The process can have capabilities set in the permitted set that are not in the effective set.

This indicates that the process has temporarily disabled this capability. A process is allowed to set a bit in its effective set only if it is available in the permitted set. The distinction between effective and permitted exists so that processes can "bracket" operations that need privilege.

The inheritable capabilities are the capabilities of the current process that should be inherited by a program executed by the current process. The permitted set of a process is masked against the inheritable set during exec(). Nothing special happens during fork() or clone(). Child processes and threads are given an exact copy of the capabilities of the parent process.

The implementation in Linux stopped at this point, whereas POSIX Capabilities[U5] requires the addition of capability sets to files as well, to replace the SUID flag (at least for executables).

==== Capability Overview ====

The list of POSIX Capabilities used with Linux is long, and the 32 available bits are almost used up. While the detailed list of all capabilities can be found in /usr/include/linux/capability.h on most Linux systems, an overview of important capabilities is given here.

{| class="wikitablenowrap"
! [0] CAP_CHOWN
| change file ownership and group.
|-
! [5] CAP_KILL
| send a signal to a process with a different real or effective user ID
|-
! [6] CAP_SETGID
| permit setgid(2), setgroups(2), and forged gids on socket credentials passing
|-
! [7] CAP_SETUID
| permit set*uid(2), and forged uids on socket credentials passing
|-
! [8] CAP_SETPCAP
| transfer/remove any capability in permitted set to/from any pid
|-
! [9] CAP_LINUX_IMMUTABLE
| allow modification of S_IMMUTABLE and S_APPEND file attributes
|-
! [11] CAP_NET_BROADCAST
| permit broadcasting and listening to multicast
|-
! [12] CAP_NET_ADMIN
| permit interface configuration, IP firewall, masquerading, accounting, socket debugging, routing tables, bind to any address, enter promiscuous mode, multicasting, ...
|-
! [13] CAP_NET_RAW
| permit usage of RAW and PACKET sockets
|-
! [16] CAP_SYS_MODULE
| insert and remove kernel modules
|-
! [18] CAP_SYS_CHROOT
| permit chroot(2)
|-
! [19] CAP_SYS_PTRACE
| permit ptrace() of any process
|-
! [21] CAP_SYS_ADMIN
| this list would be too long; it basically allows everything else that is not covered by another capability.
|-
! [22] CAP_SYS_BOOT
| permit reboot(2)
|-
! [23] CAP_SYS_NICE
| allow raising priority and setting priority on other processes, modify scheduling
|-
! [24] CAP_SYS_RESOURCE
| override resource limits, quota, reserved space on fs, ...
|-
! [27] CAP_MKNOD
| permit the privileged aspects of mknod(2)
|}

=== Resource Limits ===

Resources for each process can be limited by specifying a Resource Limit. Similar to the Linux Capabilities, there are two different limits, a Soft Limit and a Hard Limit.

The soft limit is the value that the kernel enforces for the corresponding resource. The hard limit acts as a ceiling for the soft limit: an unprivileged process may only set its soft limit to a value in the range from zero up to the hard limit, and (irreversibly) lower its hard limit. A privileged process may make arbitrary changes to either limit value, as long as the soft limit stays below the hard limit.
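The soft/hard split can be observed from a shell with the <code>ulimit</code> builtin, where <code>-S</code> selects the soft and <code>-H</code> the hard value. The helper name below is made up for this sketch, and it assumes the hard limit on open files is at least 64.

```shell
# Show the soft and hard limits on open file descriptors (RLIMIT_NOFILE).
ulimit -S -n
ulimit -H -n

# lower_soft_nofile <n>
# Set the soft limit inside a subshell, so the calling shell keeps its
# own limits, and print the value that is then in force.
lower_soft_nofile() {
    ( ulimit -S -n "$1" && ulimit -S -n )
}

lower_soft_nofile 64
```

Because only the soft limit is changed, the subshell could raise it again up to the hard limit. Lowering the hard limit with <code>-H</code> would be irreversible for an unprivileged process.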

==== Limit-able Resource Overview ====

The list of all defined resource limits can be found in /usr/include/asm/resource.h on most Linux systems; an overview of relevant resource limits is given here.

{| class="wikitablenowrap"
|-
! [0] RLIMIT_CPU
| CPU time in seconds. The process is sent a SIGXCPU signal after reaching the soft limit, and SIGKILL on reaching the hard limit.
|-
! [4] RLIMIT_CORE
| maximum size of core files generated
|-
! [5] RLIMIT_RSS
| number of pages the process's resident set can consume (the number of virtual pages resident in RAM)
|-
! [6] RLIMIT_NPROC
| The maximum number of processes that can be created for the real user ID of the calling process.
|-
! [7] RLIMIT_NOFILE
| Specifies a value one greater than the maximum file descriptor number that can be opened by this process.
|-
! [8] RLIMIT_MEMLOCK
| The maximum number of virtual memory pages that may be locked into RAM using mlock() and mlockall().
|-
! [9] RLIMIT_AS
| The maximum number of virtual memory pages available to the process (address space limit).
|}

=== File Attributes ===

Originally, this feature was only available with ext2, but now all major filesystems implement a basic set of File Attributes that permit certain properties to be changed. Here again is a short overview of the possible attributes, and what they mean.

{| class="wikitablenowrap"
!
! Macro Name
! Meaning
|-
! s
! SECRM
| When a file with this attribute set is deleted, its blocks are zeroed and written back to the disk.
|-
! u
! UNRM
| When a file with this attribute set is deleted, its contents are saved.
|-
! c
! COMPR
| Files marked with this attribute are automatically compressed on write and uncompressed on read. (not implemented yet)
|-
! i
! IMMUTABLE
| A file with this attribute cannot be modified: it cannot be deleted or renamed, no link can be created to this file and no data can be written to the file.
|-
! a
! APPEND
| Files with this attribute set can only be opened in append mode for writing.
|-
! d
! NODUMP
| If this flag is set, the file is not a candidate for backup with the dump utility.
|-
! S
! SYNC
| Updates to the file contents are done synchronously.
|-
! A
! NOATIME
| Prevents updating the atime record on files when they are accessed or modified.
|-
! t
! NOTAIL
| A file with the t attribute will not have a partial block fragment at the end of the file merged with other files.
|-
! D
! DIRSYNC
| Changes to a directory having this attribute set will be done synchronously.
|}

The first column in the table above gives the flag letters that can be passed to ''chattr'' and that are displayed by ''lsattr''. The following terminal session illustrates this:
<pre>
max@pc1:~$ cd /tmp/
max@pc1:/tmp$ touch my_file
max@pc1:/tmp$ lsattr my_file
------------------ my_file
max@pc1:/tmp$ chattr +a my_file
chattr: Operation not permitted while setting flags on my_file
max@pc1:/tmp$ su
Password:
pc1:/tmp# chattr +a my_file && lsattr my_file
-----a------------ my_file
pc1:/tmp# exit
exit
max@pc1:/tmp$
</pre>

As you might have noticed, root permissions are needed in the example above (the underlying file system was ext3). For more information, run ''man 1 chattr'' on the command line.

Information regarding file attributes can be found in the kernel source code. Every file system uses a subset of all known attributes; which ones are used depends on the file system.

One thing can be said for sure: the file attributes listed in the kernel source code are defined; those not listed are not defined and therefore cannot be used for a particular file system (e.g. ext3, the third extended file system). However, many of the file attributes defined and understood by the kernel have no effect. Most file systems define these flags in a file-system-specific header file within the kernel source tree. They also define a so-called '''User Modifiable Mask''' (the flags the user can change with the ''ioctls'').

The meaning of these flags partly depends on the node type (i.e. dir, inode, fifo, pipe, device), and it is not trivial to say whether a filesystem makes use of a given user modifiable flag. Things like immutable are easy to verify from user space, but how would one verify e.g. NOTAIL from user space? Usually only source code review will show if it is implemented and used.

For example, unless this has changed, COMPR is defined and well understood by ext2/3, but there is no implementation behind it, i.e. nothing is compressed.

=== The chroot(1) Command ===

chroot allows you to run a command with a different directory acting as the root directory. This means that all filesystem lookups are done with '/' referring to the substitute root directory and not to the original one.

While the Linux chroot implementation isn't very secure, it increases the isolation of processes with regards to the filesystem, and, if used properly, can create a filesystem "jail" for a single process or a restricted user, daemon or service.

== Required Modifications ==

This chapter will describe the essential Kernel modifications to implement something like Linux-VServer.

=== Context Separation ===

The separation mentioned in the section "The Concept" requires some modifications to the kernel to allow for the notion of Contexts.
The purpose of this "Context" is to hide all processes outside of its scope, and prohibit any unwanted interaction between a process inside the context and a process belonging to another context.

This separation requires the extension of some existing data structures in order for them to become aware of contexts and to differentiate between identical uids used in different virtual servers.

It also requires the definition of a default context that is used when the host system is booted, and to work around the issues resulting from some false assumptions made by some user-space tools (like pstree) that the init process has to exist and to be running under id '1'.

To simplify administration, the Host Context isn't treated any differently than any other context as far as process isolation is concerned. To allow for process overview, a special Spectator context has been defined to peek at all processes at once.

=== Network Separation ===

While the Context Separation is sufficient to isolate groups of processes, a different kind of separation, or rather a limitation, is required to confine processes to a subset of available network addresses.

Several issues have to be considered when doing so; for example, the fact that bindings to special addresses like IPADDR_ANY or the local host address have to be handled in a very special way.

Currently, Linux-VServer doesn't make use of virtual network devices (and maybe never will) to minimize the resulting overhead. Therefore socket binding and packet transmission have been adjusted.

=== The Chroot Barrier ===

One major problem of the chroot() system used in Linux lies within the fact that this information is volatile, and will be changed on the next chroot() Syscall.

One simple method to escape from a chroot-ed environment is as follows: First, create or open a file and retain the file-descriptor, then chroot into a subdirectory at equal or lower level with regards to the file. This causes the root to be moved down in the filesystem. Next, use fchdir() on the file-descriptor to escape from that new root. This will consequently escape from the old root as well, as this was lost in the last chroot() Syscall.

While early Linux-VServer versions tried to fix this by "funny" methods, recent versions use a special marking, known as the Chroot Barrier, on the parent directory of each VPS to prevent unauthorized modification and escape from confinement.

=== Upper Bound for Caps ===

Because the current Linux Capability system does not implement the filesystem related portions of POSIX Capabilities which would make setuid and setgid executables secure, and because it is much safer to have a secure upper bound for all processes within a context, an additional per-context capability mask has been added to limit all processes belonging to that context to this mask.

The meaning of the individual caps (bits) of the capability bound mask is exactly the same as with the permitted capability set.

=== Resource Isolation ===

Most resources are somewhat shared among the different contexts. Some require more isolation than others, either to avoid security issues or to allow for improved accounting.

Those resources are:

* shared memory, IPC
* user and process IDs
* file xid tagging
* Unix ptys
* sockets

=== Filesystem XID Tagging ===

Although it can be disabled completely, this modification is required for more robust filesystem level security and context isolation. It is also mandatory for Context Disk Limits and Per Context Quota Support on a shared partition.

The concept of adding a context id (xid) to each file to make the context ownership persistent sounds simple, but the actual implementation is non-trivial - mainly because adding this information either requires a change to the on disk representation of the filesystem or the application of some tricks.

One non-intrusive approach to avoid modification of the underlying filesystem is to use the upper bits of existing fields, like those for UID and GID to store the additional XID.

Once context information is available for each inode, it is a logical step to extend the access controls to check against context too.
Currently all inode access restrictions have been extended to check for the context id, with special exceptions for the Host Context and the Spectator Context.

Untagged files belong to the Host Context and are silently treated as if they belong to the current context, which is required for Unification. If such a file is modified from inside a context, it silently migrates to the new one, changing its xid.

The following Tagging Methods are implemented:
{| class="wikitablenowrap"
! UID32/GID32 or EXTERNAL
| This format uses currently unused space within the disk inode to store the context information. As of now, this is only defined for ext2/ext3 but will also be defined for xfs, reiserfs, and jfs as soon as possible. Advantage: Full 32bit uid/gid values.
|-
! UID32/GID16
| This format uses the upper half of the group id to store the context information. This is done transparently, except if the format is changed without prior file conversion. Advantage: works on all 32bit U/GID FSs. Drawback: GID is reduced to 16 bits.
|-
! UID24/GID24
| This format uses the upper quarter of user and group id to store the context information, again transparently. This allows for about 16 million user and group ids, which should suffice for the majority of all applications. Advantage: works on all 32bit U/GID FSs. Drawback: UID and GID are reduced to 24 bits.
|}
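The idea behind the UID32/GID16 format can be illustrated with plain bit arithmetic. This is a sketch of the principle only, with hypothetical function names. It is not the kernel's code, and it assumes the xid occupies exactly the upper 16 bits of a 32-bit group id field.

```shell
# pack_gid16 <xid> <gid>
# Store a 16-bit xid in the upper half of a 32-bit gid field.
pack_gid16() {
    echo $(( (($1 & 0xFFFF) << 16) | ($2 & 0xFFFF) ))
}

# unpack_xid <field>: recover the context id from the upper half.
unpack_xid() {
    echo $(( ($1 >> 16) & 0xFFFF ))
}

# unpack_gid <field>: recover the (now 16-bit) group id from the lower half.
unpack_gid() {
    echo $(( $1 & 0xFFFF ))
}

field=$(pack_gid16 49000 100)
unpack_xid "$field"   # 49000
unpack_gid "$field"   # 100
```

The drawback named in the table is visible here: any group id above 65535 would lose its upper bits when packed.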

== Additional Modifications ==

In addition to the bare minimum, there are a number of modifications that are not mandatory, but have proven extremely useful over time.

=== Context Flags ===

It was very soon discovered that some features require a flag, a kind of switch to turn them on and off separately for each Linux-VServer, so a simple flag-word was added.

This flag-word supports quite a number of flags, a flag-word mask that tells which flags are available, and a special trigger mechanism providing one-time flags that are set on startup and can only be cleared once, usually causing a special action or event.

Here is a list of planned and mostly implemented Context Flags, available in the development branch of Linux-VServer:

{| class="wikitablenowrap"
! [0] VXF_INFO_LOCK
| (legacy, obsoleted)
|-
! [1] VXF_INFO_SCHED
| schedule all processes in a context as if they were one. (legacy, obsoleted)
|-
! [2] VXF_INFO_NPROC
| limit the number of processes in a context to the initial NPROC value. (legacy, obsoleted)
|-
! [3] VXF_INFO_PRIVATE
| do not allow joining this context from outside. (legacy)
|-
! [4] VXF_INFO_INIT
| show the init process with pid '1' (legacy)
|-
! [5] VXF_INFO_HIDE
| (legacy, obsoleted)
|-
! [6] VXF_INFO_ULIMIT
| (legacy, obsoleted)
|-
! [7] VXF_INFO_NSPACE
| (legacy, obsoleted)
|-
! [8] VXF_SCHED_HARD
| activate the Hard CPU scheduling
|-
! [9] VXF_SCHED_PRIO
| use the context token bucket for calculating the process priorities
|-
! [10] VXF_SCHED_PAUSE
| put all processes in this context on the hold queue, not scheduling them any longer
|-
! [16] VXF_VIRT_MEM
| virtualize the memory information so that the VM and RSS limits are used for meminfo and friends
|-
! [17] VXF_VIRT_UPTIME
| virtualize the uptime, beginning with the time of context creation
|-
! [18] VXF_VIRT_CPU
|
|-
! [24] VXF_HIDE_MOUNT
| show empty proc/{pid}/mounts
|-
! [25] VXF_HIDE_NETIF
| hide network interfaces and addresses not permitted by the network context
|}
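The bracketed numbers in the table are bit positions within the flag-word. A short sketch, with a made-up helper name, shows how a set of flags combines into a single word. How the resulting value is handed to the kernel is not covered here.

```shell
# Bit numbers copied from the table above.
VXF_SCHED_HARD=8
VXF_VIRT_MEM=16
VXF_VIRT_UPTIME=17

# flag_word <bit>...
# OR the given bit positions into one word and print it in hex.
flag_word() {
    word=0
    for bit in "$@"; do
        word=$(( word | (1 << bit) ))
    done
    printf '0x%08x\n' "$word"
}

flag_word "$VXF_SCHED_HARD" "$VXF_VIRT_MEM" "$VXF_VIRT_UPTIME"   # 0x00030100
```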

=== Context Capabilities ===

As the Linux Capabilities have almost reached the maximum number that is possible without heavy modifications to the kernel, it was a natural step to add a context-specific capability system.

The Linux-VServer context capability set acts as a mechanism to fine tune existing Linux capabilities. It is not visible to the processes within a context, as they would not know how to modify or verify it.

In general there are two ways to use those capabilities:

* Require one or a number of context capabilities to be set in addition to a given Linux capability, each one controlling a distinct part of the functionality. For example, CAP_NET_RAW could be split into RAW and PACKET sockets, so you could take away each of them separately by not providing the required context capability.

* Consider the context capability sufficient for a specified functionality, even if the Linux Capability says something different. For example, mount() requires CAP_SYS_ADMIN, which adds a dozen other things we do not want, so we define a CCAP_MOUNT to allow mounts for certain contexts.

The difference between the Context Flags and the Context Caps is more an abstract logical separation than a functional one, because they are handled very similarly.
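The two ways of using context capabilities amount to an AND and an OR of two yes/no answers. The sketch below uses hypothetical function names and takes 1 for "capability present" and 0 for "absent". It models the decision logic only, not the kernel's checks.

```shell
# permit_refined <linux_cap> <context_cap>
# First mode: the Linux capability AND the context capability are required.
permit_refined() {
    [ "$1" -eq 1 ] && [ "$2" -eq 1 ]
}

# permit_override <linux_cap> <context_cap>
# Second mode: the context capability alone is sufficient.
permit_override() {
    [ "$1" -eq 1 ] || [ "$2" -eq 1 ]
}

# A context without CAP_SYS_ADMIN but with a mount context capability.
if permit_override 0 1; then
    echo "mount permitted by the context capability"
fi

# A context with the Linux capability but without the refining context capability.
if ! permit_refined 1 0; then
    echo "operation denied: context capability missing"
fi
```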

Again, a list of the Context Capabilities and their purpose:

{| class="wikitablenowrap"
! [0] VXC_SET_UTSNAME
| allow the context to change the host and domain name with the appropriate kernel Syscall
|-
! [1] VXC_SET_RLIMIT
| allow the context to modify the resource limits (within the vserver limits).
|-
! [8] VXC_RAW_ICMP
| allow raw icmp packets in a secure way (this makes ping work from inside)
|-
! [16] VXC_SECURE_MOUNT
| permit secure mounts, which at the moment means that the nodev mount option is added.
|}

=== Context Accounting ===

Some properties of a context are useful to the admin, either for keeping an overview of the resources, to get a feeling for the capacity of the host, or for billing them in some way to a customer.

There are two different kinds of accountable properties, those having a current value which represents the state of the system (for example the speed of a vehicle), and those which monotonically increase over time (like the mileage).

Most of the state type properties also qualify for applying limits, so they are handled specially. This is described in more detail in the following section.

Good candidates for Context Accounting are:

* Amount of CPU Time spent
* Number of Forks done
* Socket Messages by Type
* Network Packets Transmitted and Received

=== Context Limits ===

Most properties related to system resources, be it memory consumption, the number of processes or file handles, or the current network bandwidth, qualify for imposing limits on them.

To provide a general framework for all kinds of limits, Context Limits allow the configuration of three different values for each limit-able resource: the minimum, a soft limit and a hard limit (maximum).

At the time this is written, only the hard limits are supported and not all of them are actually enforced, but here is a list of current and planned Context Limits:

* process limits
* scheduler limits
* memory limits
* per-context disk limits
* per-context user/group quota

Additionally the context limit system keeps track of observed maxima and resource limit hits, to provide some feedback for the administrator.
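
As a rough illustration of such a limit record, the sketch below keeps the three configurable values together with the current usage, the observed maximum and a hit counter. The structure and function names are hypothetical, only the hard limit is enforced (matching the statement above), and treating the hard limit as inclusive is an assumption.

```c
/* Hypothetical per-resource limit record. */
struct demo_limit {
    long min;            /* guaranteed minimum (not enforced here) */
    long soft;           /* soft limit (not enforced here) */
    long hard;           /* hard limit (maximum) */
    long cur;            /* current usage */
    long max_seen;       /* observed maximum */
    unsigned long hits;  /* number of times the hard limit was hit */
};

/* Try to charge 'amount' units; returns 0 on success, -1 if refused. */
static int demo_charge(struct demo_limit *l, long amount)
{
    if (l->cur + amount > l->hard) {
        l->hits++;               /* feedback for the administrator */
        return -1;
    }
    l->cur += amount;
    if (l->cur > l->max_seen)
        l->max_seen = l->cur;
    return 0;
}
```

A refused request leaves the current usage untouched and only increments the hit counter, so the administrator can later see how often a context ran into its limit.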

=== Virtualization ===

One major difference between the Linux-VServer approach and Virtual Machines is that you do not have the virtualization part as a side-effect, so you have to do that by hand where it makes sense.

For example, a Virtual Machine does not need to think about uptime, because naturally the running OS was started somewhere in the past and will not have any problem telling the time it thinks it began running.

A context can also store the time when it was created, but that will be different from the system's uptime, so in addition, there has to be some function which adjusts the values passed from kernel to user-space depending on the context the process belongs to.

This is what is known as Virtualization in Linux-VServer (actually it is more a matter of faking some values passed to and from the kernel to make the processes think that they are on a different machine).

Currently modified for the purpose of Virtualization are:

* System Uptime
* Host and Domain Name
* Machine Type and Kernel Version
* Context Memory Availability
* Context Disk Space

=== Improved Security ===

Proc-FS Security provides a mechanism to protect dynamic entries in the proc filesystem from being seen in every context.
The system consists of three flags for each Proc-FS entry: Admin, Watch and Hide.

The Hide flag enables or disables the entire feature, so any combination with the Hide flag cleared will mean total visibility.
The Admin and Watch flags determine where the hidden entry remains visible; so for example if Admin and Hide are set, the Host Context will be the only one able to see this specific entry.
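
The interplay of the three flags can be sketched as a small decision function. This is only an illustration: the flag values and names are hypothetical, and the assumption that the Watch flag keeps an entry visible to a dedicated watch (spectator) context is not stated in the text above.

```c
/* Hypothetical flag values for a Proc-FS entry. */
#define DEMO_PFS_ADMIN 0x1
#define DEMO_PFS_WATCH 0x2
#define DEMO_PFS_HIDE  0x4

/* Hypothetical classification of the context that looks at the entry. */
enum demo_viewer { DEMO_VIEWER_HOST, DEMO_VIEWER_WATCH, DEMO_VIEWER_GUEST };

/* Returns 1 if the entry is visible to the given viewer, 0 otherwise. */
static int demo_entry_visible(unsigned flags, enum demo_viewer who)
{
    if (!(flags & DEMO_PFS_HIDE))
        return 1;                               /* feature disabled: visible everywhere */
    if (who == DEMO_VIEWER_HOST)
        return (flags & DEMO_PFS_ADMIN) != 0;   /* Admin keeps it visible on the host */
    if (who == DEMO_VIEWER_WATCH)
        return (flags & DEMO_PFS_WATCH) != 0;   /* assumption: Watch applies to a watch context */
    return 0;                                   /* hidden from ordinary guests */
}
```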

=== Kernel Helper ===

For some purposes, it makes sense to have a user-space tool act on behalf of the kernel when a process inside a context requests something usually available on a real server, but naturally not available inside a context.

The best, and currently only, example for this is the Reboot Helper, which handles the reboot() system call invoked from inside a context on behalf of the Kernel. It is executed in Host-side user-space to take appropriate actions: either reboot or just shut down (halt) the specified context.

While the helper is designed to be flexible and handle different things in a similar way, there are no other users of this helper at the moment. It might be replaced by an event interface in the near future.

== Features and Bonus Material ==

=== Unification ===

Because one of the central objectives for Linux-VServer is to reduce the overall resource usage wherever possible, a truly great idea was born to share files between different contexts without interfering with the usual administrative tasks or reducing the level of security created by the isolation.

Files common to more than one context which are unlikely to change, like libraries or binaries, can be hard linked on a shared filesystem, thus reducing the amount of disk space, inode caches, and even memory mappings for shared libraries.

The only drawback is that without additional measures, a malicious context would be able to deliberately or accidentally destroy or modify such shared files, which in turn would harm the other contexts.

One step is to make the shared files immutable by using the Immutable File Attribute (and removing the Linux Capability required to modify this attribute). However an additional attribute is required to allow removal of such immutable shared files, to allow for updates of libraries or executables from inside a context.

Such hard linked, immutable but unlink-able files belonging to more than one context are called unified and the process of finding common files and preparing them in this way is called Unification.

The reason for doing this is reduced resource consumption, not simplified administration. While a typical Linux Server install will consume about 500MB of disk space, 10 unified servers will only need about 700MB and as a bonus use less memory for caching.

=== Private Namespaces ===

A recent addition to the Linux-VServer branch was the introduction of Private Namespaces. This uses the already existing Virtual Filesystem Layer of the Linux kernel to create a separate view of the filesystem for the processes belonging to a context.

The major advantage over the shared namespace used by default is that any modifications to the namespace layout (like mounts) do not affect other contexts, not even the Host Context.

Obviously the drawback of that approach is that entering such a Private Namespace isn't as trivial as changing the root directory, but with proper kernel support this will completely replace the chroot() in the future.

=== The Linux-VServer Proc-FS ===

A structured, dynamically generated subtree of the well-known Proc-FS - actually two of them - has been created to allow for inspecting the different values of Security and Network Contexts.

<pre>
/proc/virtual
.../info

/proc/virtual/<pid>
.../info
.../status
.../sched
.../cvirt
.../cacct
.../limit
</pre>

=== Token Bucket Extensions ===

While the basic idea of Linux-VServer is a peaceful coexistence of all contexts, sharing the common resources in a respectful way, it is sometimes useful to control the resource distribution for resource hungry processes.

The basic principle of a Token Bucket is not very new. It is given here as an example for the Hard CPU Limit. The same principle also applies to scheduler priorities, network bandwidth limitation and resource control in general.

The Hard CPU Limit uses this mechanism in the following way: consider a bucket of a certain size S which is filled with a specified amount of tokens R every interval T, until the bucket is "full" - excess tokens are spilled. At each timer tick, a running process consumes exactly one token from the bucket, unless the bucket is empty, in which case the process is put on a hold queue until the bucket has been refilled with a minimum M of tokens. The process is then rescheduled.
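
The mechanism just described can be simulated in user space. The sketch below is an illustration only, not the kernel implementation: the structure and function names are hypothetical, the field names merely mirror the parameters R, T, M and S from the description, and the process is assumed to want the CPU on every tick.

```c
/* Hypothetical user-space model of a context token bucket. */
struct demo_bucket {
    int fill_rate;    /* R: tokens added ... */
    int interval;     /* T: ... every 'interval' ticks */
    int tokens;       /* current fill level */
    int tokens_min;   /* M: minimum fill level needed to leave the hold queue */
    int tokens_max;   /* S: bucket size, excess tokens are spilled */
    int on_hold;      /* 1 while the context sits on the hold queue */
    long tick;        /* number of timer ticks seen so far */
};

/* Advance one timer tick; returns 1 if the context ran, 0 if it was held. */
static int demo_bucket_tick(struct demo_bucket *b)
{
    b->tick++;
    if (b->tick % b->interval == 0) {          /* refill every T ticks */
        b->tokens += b->fill_rate;
        if (b->tokens > b->tokens_max)
            b->tokens = b->tokens_max;         /* spill excess tokens */
    }
    if (b->on_hold && b->tokens >= b->tokens_min)
        b->on_hold = 0;                        /* refilled to M: reschedule */
    if (b->on_hold)
        return 0;
    if (b->tokens == 0) {                      /* bucket empty: put on hold */
        b->on_hold = 1;
        return 0;
    }
    b->tokens--;                               /* running consumes one token */
    return 1;
}
```

With R = 1, T = 4, M = 2, S = 10 and three initial tokens, the context first burns its saved tokens, is then held until two tokens have accumulated, and from there on runs two ticks out of every eight, which corresponds to the ratio R/T.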

A major advantage of a Token Bucket is that a certain amount of tokens can be accumulated in times of quiescence, which later can be used to burst when resources are required.

Where a per-process Token Bucket would allow for a CPU resource limitation of a single process, a Context Token Bucket allows controlling the CPU usage of all confined processes.

Another approach, which is also implemented, is to use the current fill level of the bucket to adjust the process priority, thus reducing the priority of processes belonging to excessive contexts.

=== Context Disk Limits ===

This Feature requires the use of XID Tagged Files, and allows for independent Disk Limits for different contexts on a shared partition.
The number of inodes and blocks for each filesystem is accounted, if an XID-Hash was added for the Context-Filesystem combo.

Those values, including current usage, maximum and reserved space, will be shown for filesystem queries, creating the illusion that the shared filesystem has a different usage and size, for each context.

=== Per-Context Quota ===

Similar to the Context Disk Limits, Per-Context Quota uses separate quota hashes for different Contexts on a shared filesystem. This is not required to allow for Linux-VServer quota on separate partitions.

=== The VRoot Proxy Device ===

Quota operations (ioctls) require some access to the block device, which for security reasons is not available inside a VPS.

=== Stealth ===

For some applications, for example the preparation of a honey-pot or an especially realistic imitation of a real server for educational purposes, it can make sense to make the context indistinguishable from a real server.

However, since other freely available alternatives like QEMU or UML are much better at this, and require much less effort, this is not a central issue in Linux-VServer development.

== Linux-VServer Security ==

Now that we know what the Linux-VServer framework provides and how some features work, let's have a word on security, because you should not rely on the framework to be secure by definition. Instead, you should know exactly what you are doing.

=== Secure Capabilities ===

Currently the following Linux Capabilities are considered secure for VPS use. If others are added, it will probably open some security hole.

* CAP_CHOWN
* CAP_DAC_OVERRIDE
* CAP_DAC_READ_SEARCH
* CAP_FOWNER
* CAP_FSETID
* CAP_KILL
* CAP_SETGID
* CAP_SETUID
* CAP_NET_BIND_SERVICE
* CAP_SYS_CHROOT
* CAP_SYS_PTRACE
* CAP_SYS_BOOT
* CAP_SYS_TTY_CONFIG
* CAP_LEASE

CAP_NET_RAW, for example, is not considered secure, even though it is often used to allow the broken ping command to work. Better alternatives exist, such as the userspace ping command poink[U7] or the VXC_RAW_ICMP Context Capability.

=== The Chroot Barrier ===

Ensuring that the Barrier flag is set on the parent directory of each VPS is vital if you do not want VPS root to escape from the confinement and walk your Host's root filesystem.

=== Secure Device Nodes ===

The /dev directory of a VPS should not contain more than the following devices and the one directory for the unix pts tree.

* c 1 7 full
* c 1 3 null
* c 5 2 ptmx
* c 1 8 random
* c 5 0 tty
* c 1 9 urandom
* c 1 5 zero
* d pts

Of course other device nodes like console, mem and kmem, even block and character devices can be added, but some expertise is required in order to ensure no security holes are opened.

=== Secure Proc-FS Entries ===

There has been no detailed evaluation of secure and insecure entries in the proc filesystem, but there have been some incidents where unprotected (not protected via Linux Capabilities) writable proc entries caused mayhem.

For example, /proc/sysrq-trigger is something which should not be accessible inside a VPS without a very good reason.

== Field of Application ==

The primary goal of this project is to create virtual servers sharing the same machine. A virtual server operates like a normal Linux server. It runs normal services such as telnet, mail servers, web servers, and SQL servers.

=== Administrative Separation ===

This allows a clever provider to sell something called a Virtual Private Server, which uses fewer resources than other virtualization techniques, which in turn allows putting more units on a single machine.

The list of providers doing so is relatively long, and so this is rightfully considered the main area of application.

=== Service Separation ===

Separating different or similar services which otherwise would interfere with each other, either because they are poorly designed or because they are simply incapable of peaceful coexistence for whatever reason, can be easily done with Linux-VServer.

But even on the old-fashioned real server machines, putting some extremely exposed or untrusted, because unknown or proprietary, services into some kind of jail can improve maintainability and security a lot.

=== Enhancing Security ===

While it can be interesting to run several virtual servers in one box, there is one concept potentially more generally useful. Imagine a physical server running a single virtual server. The goal is to isolate the main environment from any service and any network. You boot in the main environment, start very few services and then continue in the virtual server.

The service in the main environment would be:

* Unreachable from the network.
* Able to log messages from the virtual server in a secure way. The virtual server would be unable to change or erase the logs. Even a cracked virtual server would not be able to edit the log.
* Able to run intrusion detection facilities, potentially spying on the state of the virtual server without being accessible or noticed. For example, tripwire could run there and it would be impossible to circumvent its operation or trick it.

Another option is to put the firewall in a virtual server, and pull in the DMZ, containing each service in a separate VPS. On proper configuration, this setup can reduce the number of required machines drastically, without impacting performance.

=== Easy Maintenance ===

One key feature of a virtual server is the independence from the actual hardware. Most hardware issues are irrelevant for a virtual server installation.

The main server acts as a host and takes care of all the details. The virtual server is just a client and ignores all the details. As such, the client can be moved to another physical server with very few manipulations.

For example, to move the virtual server from one physical computer to another, it is sufficient to do the following:

* shut down the running server
* copy it over to the other machine
* copy the configuration
* start the virtual server on the new machine

No adjustments to user setup, password database or hardware configuration are required, as long as both machines are binary compatible.

=== Fail-over Scenarios ===

Pushing the limit a little further, replication technology could be used to keep an up-to-the-minute copy of the filesystem of a running Virtual Server. This would permit a very fast fail-over if the running server goes offline for whatever reason.

All the known methods to accomplish this, from network replication via rsync or drbd, through network devices or shared disk arrays, to distributed filesystems, can be utilized to reduce the down-time and improve overall efficiency.

=== For Testing ===

Consider a software tool or package which should be built for several versions of a specific distribution (Mandrake 8.2, 9.0, 9.1, 9.2, 10.0) or even for different distributions.

This is easily solved with Linux-VServer. Given plenty of disk space, the different distributions can be installed and running side by side, simplifying the task of switching from one to another.

Of course this can be accomplished by chroot() alone, but with Linux-VServer it's a much more realistic simulation.

== Performance and Stability ==

''(work in progress)''

=== Impact of Linux-VServer on the Host ===

seems to be 0% ...

=== Overhead inside a Context ===

seems to be less than 2% ...

=== Size of the Kernel Patch ===

Comparison of the different patches ...

{| class="wikitablenowrap"
! patch
! hunks
! lines added (+)
! lines removed (-)
|-
| patch-2.4.24-vs1.00.diff
| 178
| 1112
| 135
|-
| patch-2.4.24-vs1.20.diff
| 216
| 2035
| 178
|-
| patch-2.4.24-vs1.26.diff
| 225
| 2118
| 180
|-
| patch-2.4.25-vs1.27.diff
| 252
| 2166
| 201
|-
| patch-2.4.26-vs1.28.diff
| 254
| 2183
| 202
|-
| patch-2.6.6-vs1.9.0.diff
| 494
| 5699
| 303
|-
| patch-2.6.6-vs1.9.1.diff
| 497
| 5878
| 307
|-
| patch-2.6.7-vs1.9.2.diff
| 618
| 6836
| 348
|-
| uml-patch-2.4.26-1.diff
| 449
| 36885
| 48
|}

== Non Intel i386 Hardware ==

Linux-VServer was designed to be mostly architecture agnostic, therefore only a small part, the syscall definition itself, is architecture specific. Nevertheless some architectures have private copies of basically architecture independent code for whatever reason, and therefore small modifications are often required.

The following architectures are supported and some of them are even tested:

* alpha
* ia32 / ia64 / xbox
* x86_64 (AMD64)
* mips / mips64
* hppa / hppa64
* ppc / ppc64
* sparc / sparc64
* s390
* uml

Adding a new architecture is relatively simple although extensive testing is required to make sure that every feature is working as expected (and of course, the hardware ;).

== Linux Kernel Intro ==

While almost all of the described features reside in the Linux Kernel, nifty Userspace Tools are required to activate and control the new functionality.

Those Userspace Tools in general communicate with the Linux Kernel via System Calls (or Syscall for short).
This chapter gives a short overview of how Linux Kernel and User Space are organized and how Syscalls, a simple method of communication between processes and kernel, work.

=== Kernel and User Space ===

In Linux and similar Operating Systems, User and Kernel Space are separated, and the address space is divided into two parts. Kernel space is where the kernel code resides, and user space is where the user programs live. Of course, a given user program can't write to kernel memory or to another program's memory area.

Unfortunately, this is also the case for kernel code. Kernel code can't write to user space either. What does this mean? Well, when a given hardware driver wants to write data bytes to a program in user memory, it can't do it directly, but rather it must use specific kernel functions instead. Also, when parameters are passed by address to a kernel function, the kernel function can not read the parameters directly. It must use other kernel functions to read each byte of the parameters.

Of course, there are some helpers which do the transfer to and from user space.

<pre>
unsigned long copy_to_user(void __user *to, const void *from, unsigned long n);
unsigned long copy_from_user(void *to, const void __user *from, unsigned long n);
</pre>

get_user() and put_user() get or put the given byte, word, or long from or to user memory. These are macros, and they rely on the type of the argument to determine the number of bytes to transfer.

=== Linux Syscalls ===

Most libc calls rely on system calls, which are the simplest kernel functions a user program can call.

These system calls are implemented in the kernel itself or in loadable kernel modules, which are little chunks of dynamically link-able kernel code.

Linux system calls are implemented through a multiplexor invoked via a software interrupt. On i386 Linux, this interrupt is int 0x80. When the 'int 0x80' instruction is executed, control is given to the kernel (or, more accurately, to the _system_call() function), and the actual demultiplexing process occurs.

How does _system_call() work?

First, all registers are saved and the content of the %eax register is checked against the global system calls table, which enumerates all system calls and their addresses.

This table can be accessed with the extern void *sys_call_table[] variable. A given number and memory address in this table corresponds to each system call.

System call numbers can be found in /usr/include/sys/syscall.h.

They are of the form SYS_systemcallname. If the system call is not implemented, the corresponding cell in the sys_call_table is 0, and an error is returned.

Otherwise, the system call actually exists and the corresponding entry in the table is the memory address of the system call code.
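
The table lookup described above can be modelled in user space with an array of function pointers. This is a simplified sketch, not kernel code: the table, the handlers and the dispatcher are hypothetical, and every handler takes a single argument for brevity.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical handler type: one argument, one result. */
typedef long (*demo_syscall_fn)(long arg);

static long demo_sys_double(long arg) { return 2 * arg; }
static long demo_sys_negate(long arg) { return -arg; }

#define DEMO_NR_SYSCALLS 4

/* Unimplemented slots stay NULL (0), as in the description above. */
static demo_syscall_fn demo_call_table[DEMO_NR_SYSCALLS] = {
    [1] = demo_sys_double,
    [3] = demo_sys_negate,
};

/* Demultiplexer: 'nr' plays the role of the value passed in %eax. */
static long demo_system_call(unsigned long nr, long arg)
{
    if (nr >= DEMO_NR_SYSCALLS || demo_call_table[nr] == NULL)
        return -ENOSYS;                  /* not implemented */
    return demo_call_table[nr](arg);     /* jump to the handler */
}
```

Calling <code>demo_system_call(1, 21)</code> reaches the first handler, while an empty or out-of-range slot yields <code>-ENOSYS</code>.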

== Kernel Side Implementation ==

While this chapter is mainly of interest to kernel developers, it might be fun to take a small peek behind the curtain to get a glimpse of how everything really works.

=== The Syscall Command Switch ===

For a long time Linux-VServer used a few different Syscalls to accomplish different aspects of the work, but very soon the number of required commands grew large, and the Syscalls started to have magic values, selecting the desired behavior.

Not too long ago, a single syscall was reserved for Linux-VServer, and while the opinion on that might differ from developer to developer, it was generally considered a good decision not to have more than one syscall.

The advantage of different Syscalls would be simpler handling of the Syscalls on different architectures; however, this hasn't been a problem so far, as the data passed to and from the kernel has strongly typed fields conforming to the C99 types.

Regardless, the availability of one system call required the creation of a multiplexor, which decides, based on some selector, what specific command is to be executed, and then passes on the remaining arguments to that command, which does the actual work.

<pre>
extern asmlinkage long
sys_vserver(uint32_t cmd, uint32_t id, void __user *data)
</pre>

The Linux-VServer syscall is passed three arguments regardless of what actual command is specified: a command (cmd), a number (id), and a user-space data-structure of yet unknown size.

To allow for some structure for debugging purposes and some kind of command versioning, the cmd is split into three parts: the lower 12 bits contain a version number, then 4 bits are reserved, and the upper 16 bits are divided into an 8 bit command and a 6 bit category, again reserving 2 bits for the future.

There are 64 Categories with up to 256 commands in each category, allowing for 4096 revisions of each command, which is far more than will ever be required.
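
The bit layout can be expressed with a few macros. The sketch below assumes that the command occupies bits 16 to 23 and the category bits 24 to 29; the description above fixes the field widths but not this order, so the order is an assumption, and the <code>DEMO_*</code> names are hypothetical.

```c
#include <stdint.h>

/*
 * Assumed layout of the 32 bit cmd word:
 *   bits  0-11  version   (4096 revisions)
 *   bits 12-15  reserved
 *   bits 16-23  command   (256 per category)
 *   bits 24-29  category  (64 categories)
 *   bits 30-31  reserved
 */
#define DEMO_CMD(cat, cmd, ver)                   \
    ((((uint32_t)(cat) & 0x3F) << 24) |           \
     (((uint32_t)(cmd) & 0xFF) << 16) |           \
     ((uint32_t)(ver) & 0xFFF))

#define DEMO_CATEGORY(c)  (((c) >> 24) & 0x3F)
#define DEMO_COMMAND(c)   (((c) >> 16) & 0xFF)
#define DEMO_VERSION(c)   ((c) & 0xFFF)
```

Under these assumptions, command 1 of category 52 at version 0 encodes to <code>0x34010000</code>, and the three accessor macros recover the individual fields again.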

Here is an overview of the categories already defined, and their numerical value:

<pre>
Syscall Matrix V2.6

       |VERSION|CREATE |MODIFY |MIGRATE|CONTROL|EXPERIM| |SPECIAL|SPECIAL|
       |STATS  |DESTROY|ALTER  |CHANGE |LIMIT  |TEST   | |       |       |
       |INFO   |SETUP  |       |MOVE   |       |       | |       |       |
-------+-------+-------+-------+-------+-------+-------+ +-------+-------+
SYSTEM |VERSION|VSETUP |VHOST  |       |       |       | |DEVICES|       |
HOST   |     00|     01|     02|     03|     04|     05| |     06|     07|
-------+-------+-------+-------+-------+-------+-------+ +-------+-------+
CPU    |       |VPROC  |PROCALT|PROCMIG|PROCTRL|       | |SCHED. |       |
PROCESS|     08|     09|     10|     11|     12|     13| |     14|     15|
-------+-------+-------+-------+-------+-------+-------+ +-------+-------+
MEMORY |       |       |       |       |       |       | |SWAP   |       |
       |     16|     17|     18|     19|     20|     21| |     22|     23|
-------+-------+-------+-------+-------+-------+-------+ +-------+-------+
NETWORK|       |VNET   |NETALT |NETMIG |NETCTL |       | |SERIAL |       |
       |     24|     25|     26|     27|     28|     29| |     30|     31|
-------+-------+-------+-------+-------+-------+-------+ +-------+-------+
DISK   |       |       |       |       |       |       | |INODE  |       |
VFS    |     32|     33|     34|     35|     36|     37| |     38|     39|
-------+-------+-------+-------+-------+-------+-------+ +-------+-------+
OTHER  |       |       |       |       |       |       | |VINFO  |       |
       |     40|     41|     42|     43|     44|     45| |     46|     47|
=======+=======+=======+=======+=======+=======+=======+ +=======+=======+
SPECIAL|       |       |       |       |FLAGS  |       | |       |       |
       |     48|     49|     50|     51|     52|     53| |     54|     55|
-------+-------+-------+-------+-------+-------+-------+ +-------+-------+
SPECIAL|       |       |       |       |RLIMIT |SYSCALL| |       |COMPAT |
       |     56|     57|     58|     59|     60|TEST 61| |     62|     63|
-------+-------+-------+-------+-------+-------+-------+ +-------+-------+
</pre>

The definition of those Commands is simplified by some macros, so for example the commands to get and set the Context Flags are defined like this:

<pre>
#define VCMD_get_cflags VC_CMD(FLAGS, 1, 0)
#define VCMD_set_cflags VC_CMD(FLAGS, 2, 0)

extern int vc_get_cflags(uint32_t, void __user *);
extern int vc_set_cflags(uint32_t, void __user *);
</pre>

Note that the command itself is not passed to the actual command implementation, only the id and the pointer to user-space data.
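
A user-space model of such a command switch might look as follows. It is a sketch under stated assumptions: the selector values, the fixed-size flag array and all <code>demo_*</code> names are hypothetical, and plain pointers replace the <code>__user</code> pointers of the real syscall.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical selector values (assumed encoding: category 52, commands 1 and 2). */
#define DEMO_CMD_GET_FLAGS 0x34010000u
#define DEMO_CMD_SET_FLAGS 0x34020000u

#define DEMO_MAX_CTX 4
static uint64_t demo_flags[DEMO_MAX_CTX];   /* one flag word per context id */

/* The handlers receive only the id and the data pointer, never the cmd. */
static int demo_get_flags(uint32_t id, void *data)
{
    *(uint64_t *)data = demo_flags[id];
    return 0;
}

static int demo_set_flags(uint32_t id, void *data)
{
    demo_flags[id] = *(const uint64_t *)data;
    return 0;
}

/* Multiplexor: selects the handler based on cmd and passes on the rest. */
static long demo_vserver(uint32_t cmd, uint32_t id, void *data)
{
    if (id >= DEMO_MAX_CTX || data == NULL)
        return -EINVAL;
    switch (cmd) {
    case DEMO_CMD_GET_FLAGS:
        return demo_get_flags(id, data);
    case DEMO_CMD_SET_FLAGS:
        return demo_set_flags(id, data);
    default:
        return -ENOSYS;                     /* unknown command */
    }
}
```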

=== Utilized Data Structures ===

There are many different data structures used by different parts of the implementation; while only a few examples are given here, all utilized structures can be found in the source.

==== The Context Data Structure ====

The Context Data Structure consists of a few fields required to manage the contexts, and handle context destruction, as well as future hierarchical contexts.

Logically separated sections of that structure, like for the scheduler or the context limits are defined in separate structures, and incorporated into the main one.

<pre>
struct vx_info {
    struct list_head vx_list;       /* linked list of contexts */
    xid_t vx_id;                    /* context id */
    atomic_t vx_refcount;           /* refcount */
    struct vx_info *vx_parent;      /* parent context */

    struct namespace *vx_namespace; /* private namespace */
    struct fs_struct *vx_fs;        /* private namespace fs */
    uint64_t vx_flags;              /* context flags */
    uint64_t vx_bcaps;              /* bounding caps (system) */
    uint64_t vx_ccaps;              /* context caps (vserver) */

    pid_t vx_initpid;               /* PID of fake init process */

    struct _vx_limit limit;         /* vserver limits */
    struct _vx_sched sched;         /* vserver scheduler */
    struct _vx_cvirt cvirt;         /* virtual/bias stuff */
    struct _vx_cacct cacct;         /* context accounting */

    char vx_name[65];               /* vserver name */
};
</pre>

Here, as an example, is the Scheduler Substructure:
<pre>
struct _vx_sched {
    spinlock_t tokens_lock;         /* lock for this structure */

    int fill_rate;                  /* Fill rate: add X tokens ... */
    int interval;                   /* Divisor: ... each Y jiffies */
    atomic_t tokens;                /* current number of tokens */
    int tokens_min;                 /* Limit: minimum for unhold */
    int tokens_max;                 /* Limit: no more than N tokens */
    uint32_t jiffies;               /* bias: integral multiple of Y */

    uint64_t ticks;                 /* token tick events */
    cpumask_t cpus_allowed;         /* cpu mask for context */
};
</pre>

The main idea behind this separation is that each substructure belongs to a logically distinct part of the implementation which provides an init and cleanup function for this structure, thus simplifying maintainability and readability of those structures.

==== The Scheduler Command Data ====

As an example of a data structure used to control a specific part of the context from user-space, here is a scheduler command and the data structure used to set the properties:

<pre>
#define VCMD_set_sched VC_CMD(SCHED, 1, 2)

struct vcmd_set_sched_v2 {
    int32_t fill_rate;              /* Fill rate: add X tokens ... */
    int32_t interval;               /* Divisor: ... each Y jiffies */
    int32_t tokens;                 /* current number of tokens */
    int32_t tokens_min;             /* Limit: minimum for unhold */
    int32_t tokens_max;             /* Limit: no more than N tokens */
    uint64_t cpu_mask;              /* Mask: allowed cpus */
};
</pre>

==== Example Accounting: Sockets ====

Basically all the accounting and limit code is defined as macros or inline functions capable of handling the different resources, hiding the underlying implementation wherever possible.

<pre>
#define vx_acc_sock(v,f,p,s) \
    __vx_acc_sock((v), (f), (p), (s), __FILE__, __LINE__)

static inline void __vx_acc_sock(struct vx_info *vxi,
    int family, int pos, int size, char *file, int line)
{
    if (vxi) {
        int type = vx_sock_type(family);

        atomic_inc(&vxi->cacct.sock[type][pos].count);
        atomic_add(size, &vxi->cacct.sock[type][pos].total);
    }
}

#define vx_sock_recv(sk,s) \
    vx_acc_sock((sk)->sk_vx_info, (sk)->sk_family, 0, (s))
#define vx_sock_send(sk,s) \
    vx_acc_sock((sk)->sk_vx_info, (sk)->sk_family, 1, (s))
#define vx_sock_fail(sk,s) \
    vx_acc_sock((sk)->sk_vx_info, (sk)->sk_family, 2, (s))
</pre>

And this general definition is then used where appropriate, for example in the __sock_sendmsg() function like this:

<pre>
len = sock->ops->sendmsg(iocb, sock, msg, size);
if (sock->sk) {
    if (len == size)
        vx_sock_send(sock->sk, size);
    else
        vx_sock_fail(sock->sk, size);
}
</pre>

==== Example Limits: Virtual Memory ====

<pre>
#define vx_pages_avail(m, p, r) \
    __vx_pages_avail((m)->mm_vx_info, (r), (p), __FILE__, __LINE__)

static inline int __vx_pages_avail(struct vx_info *vxi,
    int res, int pages, char *file, int line)
{
    if (!vxi)
        return 1;
    if (vxi->limit.rlim[res] == RLIM_INFINITY)
        return 1;
    if (atomic_read(&vxi->limit.res[res]) +
        pages < vxi->limit.rlim[res])
        return 1;
    return 0;
}

#define vx_vmpages_avail(m,p)  vx_pages_avail(m, p, RLIMIT_AS)
#define vx_vmlocked_avail(m,p) vx_pages_avail(m, p, RLIMIT_MEMLOCK)
#define vx_rsspages_avail(m,p) vx_pages_avail(m, p, RLIMIT_RSS)
</pre>

Again, these limits are tested at certain places, for example in copy_process():

<pre>
/* check vserver memory */
if (p->mm && !(clone_flags & CLONE_VM)) {
    if (vx_vmpages_avail(p->mm, p->mm->total_vm))
        vx_pages_add(p->mm->mm_vx_info,
            RLIMIT_AS, p->mm->total_vm);
    else
        goto bad_fork_free;
}
</pre>

==== Example Virtualization: Uptime ====

<pre>
/* subtract the context's bias from the system uptime */
void vx_vsi_uptime(struct timespec *uptime)
{
    struct vx_info *vxi = current->vx_info;

    set_normalized_timespec(uptime,
        uptime->tv_sec - vxi->cvirt.bias_tp.tv_sec,
        uptime->tv_nsec - vxi->cvirt.bias_tp.tv_nsec);
    return;
}

/* at the call site, only when uptime virtualization is enabled */
if (vx_flags(VXF_VIRT_UPTIME, 0))
    vx_vsi_uptime(&uptime);
</pre>
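
The effect of the bias subtraction can be tried out in user space. The sketch below is an illustration only: <code>struct demo_timespec</code> and <code>demo_virtual_uptime()</code> are hypothetical stand-ins, and the normalization step handles just the single borrow that can occur when both inputs are already normalized.

```c
#define DEMO_NSEC_PER_SEC 1000000000L

/* Hypothetical stand-in for the kernel's struct timespec. */
struct demo_timespec {
    long sec;
    long nsec;
};

/*
 * Virtual uptime = host uptime minus the bias recorded at context creation.
 * Assumes both inputs are normalized (0 <= nsec < one second).
 */
static struct demo_timespec demo_virtual_uptime(struct demo_timespec host,
                                                struct demo_timespec bias)
{
    struct demo_timespec r;

    r.sec = host.sec - bias.sec;
    r.nsec = host.nsec - bias.nsec;
    if (r.nsec < 0) {               /* borrow one second */
        r.sec--;
        r.nsec += DEMO_NSEC_PER_SEC;
    }
    return r;
}
```

A host that has been up for 1000.2 seconds and a context created at 400.7 seconds would thus report an uptime of 599.5 seconds inside the context.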

== Future Directions ==

''(work in progress)''

=== Hierarchical Contexts ===

=== Security Branch ===

=== Stealth Branch ===

Paper

2007-01-12T19:57:21Z

Meandtheshell: I added content and restructured some parts.

== Abstract ==

A soft partitioning concept based on ''Security Contexts'' which permits the creation of many independent Virtual Private Servers (VPS) that run simultaneously on a single physical server at full speed, efficiently sharing hardware resources.

A VPS provides an operating environment almost identical to that of a conventional Linux Server. All services, such as ssh, mail, Web and databases, can be started on such a VPS, without (or in special cases with only minimal) modification, just like on any real server.

Each virtual server has its own user account database and root password and is isolated from other virtual servers, except for the fact that they share the same hardware resources.

== Introduction ==

Over the years, computers have become sufficiently powerful to use virtualization to create the illusion of many smaller virtual machines, each running a separate operating system instance.

There are several kinds of Virtual Machines (VMs) which provide similar features, but differ in the degree of abstraction and the methods used for virtualization.

Most of them accomplish what they do by ''emulating'' some real or fictional hardware, which in turn requires ''real'' resources from the Host (the machine running the VMs). This approach, used by most System Emulators (like QEMU, Bochs, ...), allows the emulator to run an arbitrary Guest Operating System, even for a different Architecture (CPU and Hardware). No modifications need to be made to the Guest OS because it isn't aware of the fact that it isn't running on real hardware.

Some System Emulators require small modifications or specialized drivers to be added to Host or Guest to improve performance and minimize the overhead required for the hardware emulation. Although this significantly improves efficiency, there are still large amounts of resources being wasted in caches and mediation between Guest and Host (examples for this approach are UML and Xen).

But suppose you do not want to run many different Operating Systems simultaneously on a single box? Most applications running on a server do not require hardware access or kernel level code, and could easily share a machine with others, if they could be separated and secured...

== The Concept ==

At a basic level, a Linux Server consists of three building blocks: Hardware, Kernel and Applications. The Hardware usually depends on the provider or system maintainer, and, while it has a big influence on the overall performance, it cannot be changed that easily, and will likely differ from one setup to another.

The main purpose of the Kernel is to build an abstraction layer on top of the hardware to allow processes (Applications) to work with and operate on resources (Data) without knowing the details of the underlying hardware. Ideally, those processes would be completely hardware agnostic, by being written in an interpreted language and therefore not requiring any hardware-specific knowledge.

Given that a system has enough resources to drive ten times the number of applications a single Linux server would usually require, why not put ten servers on that box, which will then share the available resources in an efficient manner?

Most server applications (e.g. httpd) assume that they are the only application providing a particular service, and usually also assume a certain filesystem layout and environment. This dictates that similar or identical services running on the same physical server but differing, for example, only in their addresses have to be coordinated. This typically requires a great deal of administrative work, which can lead to reduced system stability and security.

The basic concept of the Linux-VServer solution is to separate the user-space environment into distinct units (sometimes called Virtual Private Servers) in such a way that each VPS looks and feels like a real server to the processes contained within.

Although different Linux Distributions use (sometimes heavily) patched kernels to provide special support for unusual hardware or extra functionality, most Linux Distributions are not tied to a special kernel.

Linux-VServer uses this fact to allow several distributions to run simultaneously on a single, shared kernel, without direct access to the hardware, and to share the resources in a very efficient way.

== Existing Infrastructure ==

Recent Linux Kernels already provide many security features that Linux-VServer uses to do its work, in particular the Linux Capability System, Resource Limits, File Attributes and the Change Root Environment. The following sections give a short overview of each of these.

=== Linux Capability System ===

In computer science, a capability is a token used by a process to prove that it is allowed to perform an operation on an object. The Linux Capability System is based on "POSIX Capabilities", a somewhat different concept, designed to split up the all powerful root privilege into a set of distinct privileges.

==== POSIX Capabilities ====

A process has three sets of bitmaps called the inheritable(I), permitted(P), and effective(E) capabilities. Each capability is implemented as a bit in each of these bitmaps that is either set or unset.

When a process tries to do a privileged operation, the operating system will check the appropriate bit in the effective set of the process (instead of checking whether the effective uid of the process is 0 as is normally done).

For example, when a process tries to set the clock, the Linux kernel will check that the process has the CAP_SYS_TIME bit (which is currently bit 25) set in its effective set.

The permitted set of the process indicates the capabilities the process can use. The process can have capabilities set in the permitted set that are not in the effective set.

This indicates that the process has temporarily disabled this capability. A process is allowed to set a bit in its effective set only if it is available in the permitted set. The distinction between effective and permitted exists so that processes can "bracket" operations that need privilege.

The inheritable capabilities are the capabilities of the current process that should be inherited by a program executed by the current process. The permitted set of a process is masked against the inheritable set during exec(). Nothing special happens during fork() or clone(). Child processes and threads are given an exact copy of the capabilities of the parent process.

The implementation in Linux stopped at this point, whereas POSIX Capabilities[U5] require the addition of capability sets to files too, to replace the SUID flag (at least for executables).
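The interplay of the three sets can be sketched as a small model. This is an illustration of the rules described above, not kernel code: the class and method names are made up for the example, and the exec() rule is reduced to the single masking step mentioned in the text (the assumption that the effective set is also clipped to the new permitted set is ours).

```python
from dataclasses import dataclass

CAP_SYS_TIME = 1 << 25  # bit number quoted in the text above


@dataclass
class CapSets:
    """Toy model of the inheritable, permitted and effective bitmaps."""
    inheritable: int = 0
    permitted: int = 0
    effective: int = 0

    def raise_cap(self, cap: int) -> None:
        """Enable a capability; only allowed if it is in the permitted set."""
        if cap & ~self.permitted:
            raise PermissionError("capability not in permitted set")
        self.effective |= cap

    def drop_cap(self, cap: int) -> None:
        """Temporarily disable a capability (it stays permitted)."""
        self.effective &= ~cap

    def may(self, cap: int) -> bool:
        """The check made for a privileged operation: look at the effective set."""
        return (self.effective & cap) == cap

    def after_exec(self) -> "CapSets":
        """Simplified exec(): permitted is masked against inheritable.

        Clipping the effective set to the new permitted set is an
        assumption of this sketch.
        """
        permitted = self.permitted & self.inheritable
        return CapSets(self.inheritable, permitted, self.effective & permitted)
```

A process that "brackets" a privileged operation would call <code>raise_cap()</code> just before setting the clock and <code>drop_cap()</code> right after it.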

==== Capability Overview ====

The list of POSIX Capabilities used with Linux is long, and the 32 available bits are almost used up. While the detailed list of all capabilities can be found in /usr/include/linux/capability.h on most Linux systems, an overview of important capabilities is given here.

{| class="wikitablenowrap"
! [0] CAP_CHOWN
| change file ownership and group.
|-
! [5] CAP_KILL
| send a signal to a process with a different real or effective user ID
|-
! [6] CAP_SETGID
| permit setgid(2), setgroups(2), and forged gids on socket credentials passing
|-
! [7] CAP_SETUID
| permit set*uid(2), and forged uids on socket credentials passing
|-
! [8] CAP_SETPCAP
| transfer/remove any capability in permitted set to/from any pid
|-
! [9] CAP_LINUX_IMMUTABLE
| allow modification of S_IMMUTABLE and S_APPEND file attributes
|-
! [11] CAP_NET_BROADCAST
| permit broadcasting and listening to multicast
|-
! [12] CAP_NET_ADMIN
| permit interface configuration, IP firewall, masquerading, accounting, socket debugging, routing tables, bind to any address, enter promiscuous mode, multicasting, ...
|-
! [13] CAP_NET_RAW
| permit usage of RAW and PACKET sockets
|-
! [16] CAP_SYS_MODULE
| insert and remove kernel modules
|-
! [18] CAP_SYS_CHROOT
| permit chroot(2)
|-
! [19] CAP_SYS_PTRACE
| permit ptrace() of any process
|-
! [21] CAP_SYS_ADMIN
| this list would be too long; it basically allows everything else that is not covered by another capability.
|-
! [22] CAP_SYS_BOOT
| permit reboot(2)
|-
! [23] CAP_SYS_NICE
| allow raising priority and setting priority on other processes, modify scheduling
|-
! [24] CAP_SYS_RESOURCE
| override resource limits, quota, reserved space on fs, ...
|-
! [27] CAP_MKNOD
| permit the privileged aspects of mknod(2)
|}

=== Resource Limits ===

Resources for each process can be limited by specifying a Resource Limit. Similar to the Linux Capabilities, there are two different limits, a Soft Limit and a Hard Limit.

The soft limit is the value that the kernel enforces for the corresponding resource. The hard limit acts as a ceiling for the soft limit: an unprivileged process may only set its soft limit to a value in the range from zero up to the hard limit, and (irreversibly) lower its hard limit. A privileged process may make arbitrary changes to either limit value, as long as the soft limit does not exceed the hard limit.
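The soft/hard relationship can be tried out from user space. The following sketch uses Python's standard <code>resource</code> module (a wrapper around getrlimit/setrlimit); the helper name <code>set_soft_limit</code> is made up for this example.

```python
import resource


def set_soft_limit(res, new_soft):
    """Change only the soft limit of a resource, leaving the hard limit alone.

    An unprivileged process may pick any soft value up to the hard limit.
    The explicit check below only gives a clearer error message; setrlimit
    itself rejects a soft limit above the hard limit.
    """
    _soft, hard = resource.getrlimit(res)
    if hard != resource.RLIM_INFINITY and new_soft > hard:
        raise ValueError("soft limit may not exceed the hard limit")
    resource.setrlimit(res, (new_soft, hard))
    return resource.getrlimit(res)
```

For example, <code>set_soft_limit(resource.RLIMIT_CORE, 0)</code> disables core files for the current process and its children without touching the ceiling, so the soft limit can be raised again later.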

==== Limit-able Resource Overview ====

The list of all defined resource limits can be found in /usr/include/asm/resource.h on most Linux systems; an overview of relevant resource limits is given here.

{| class="wikitablenowrap"
|-
! [0] RLIMIT_CPU
| CPU time in seconds. The process is sent a SIGXCPU signal after reaching the soft limit, and SIGKILL on reaching the hard limit.
|-
! [4] RLIMIT_CORE
| maximum size of core files generated
|-
! [5] RLIMIT_RSS
| number of pages the process's resident set can consume (the number of virtual pages resident in RAM)
|-
! [6] RLIMIT_NPROC
| The maximum number of processes that can be created for the real user ID of the calling process.
|-
! [7] RLIMIT_NOFILE
| Specifies a value one greater than the maximum file descriptor number that can be opened by this process.
|-
! [8] RLIMIT_MEMLOCK
| The maximum number of virtual memory pages that may be locked into RAM using mlock() and mlockall().
|-
! [9] RLIMIT_AS
| The maximum number of virtual memory pages available to the process (address space limit).
|}

=== File Attributes ===

Originally, this feature was only available with ext2, but now all major filesystems implement a basic set of File Attributes that permit certain properties to be changed. Here again is a short overview of the possible attributes, and what they mean.

{| class="wikitablenowrap"
! chattr/lsattr option
! Macro Name
! Meaning
|-
! s
! SECRM
| When a file with this attribute set is deleted, its blocks are zeroed and written back to the disk.
|-
! u
! UNRM
| When a file with this attribute set is deleted, its contents are saved.
|-
! c
! COMPR
| Files marked with this attribute are automatically compressed on write and uncompressed on read. (not implemented yet)
|-
! i
! IMMUTABLE
| A file with this attribute cannot be modified: it cannot be deleted or renamed, no link can be created to this file and no data can be written to the file.
|-
! a
! APPEND
| Files with this attribute set can only be opened in append mode for writing.
|-
! d
! NODUMP
| If this flag is set, the file is not a candidate for backup with the dump utility.
|-
! S
! SYNC
| Updates to the file contents are done synchronously.
|-
! A
! NOATIME
| Prevents updating the atime record on files when they are accessed or modified.
|-
! t
! NOTAIL
| A file with the t attribute will not have a partial block fragment at the end of the file merged with other files.
|-
! D
! DIRSYNC
| Changes to a directory having this attribute set will be done synchronously.
|}

The first column in the table above lists the command-line options that can be supplied to ''lsattr'' and ''chattr''. The following screen dump illustrates their use:
<pre>
max@pc1:~$ cd /tmp/
max@pc1:/tmp$ touch my_file
max@pc1:/tmp$ lsattr my_file
------------------ my_file
max@pc1:/tmp$ chattr +a my_file
chattr: Operation not permitted while setting flags on my_file
max@pc1:/tmp$ su
Password:
pc1:/tmp# chattr +a my_file && lsattr my_file
-----a------------ my_file
pc1:/tmp# exit
exit
max@pc1:/tmp$
</pre>

As the example shows, root permissions were needed to set the append attribute (the underlying filesystem was ext3). For more information, run ''man 1 chattr''.
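The letters printed by ''lsattr'' correspond to bits in a per-inode flag word. The sketch below decodes such a flag word into the letters and macro names of the table above. The bit values are quoted from memory of the kernel header <code>linux/fs.h</code> and should be treated as assumptions to verify against your kernel source; the function names are made up for the example.

```python
# (lsattr letter, macro name, flag bit), in the order of the table above.
# Bit values are assumptions recalled from <linux/fs.h>; verify before use.
FILE_ATTRS = [
    ("s", "SECRM", 0x00000001),
    ("u", "UNRM", 0x00000002),
    ("c", "COMPR", 0x00000004),
    ("i", "IMMUTABLE", 0x00000010),
    ("a", "APPEND", 0x00000020),
    ("d", "NODUMP", 0x00000040),
    ("S", "SYNC", 0x00000008),
    ("A", "NOATIME", 0x00000080),
    ("t", "NOTAIL", 0x00008000),
    ("D", "DIRSYNC", 0x00010000),
]


def attr_letters(flags):
    """Return the chattr/lsattr letters for every bit set in flags."""
    return "".join(letter for letter, _name, bit in FILE_ATTRS if flags & bit)


def attr_names(flags):
    """Return the macro names for every bit set in flags."""
    return [name for _letter, name, bit in FILE_ATTRS if flags & bit]
```

The flag word itself would come from the filesystem's get-flags ''ioctl''; that call is left out here because its request number depends on the architecture.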

Information regarding file attributes can be found in the kernel source code. Every file system uses a subset of all known attributes (which are used depends on the file system).

One thing can be said for sure: the file attributes listed in the kernel source code are defined, and those that are not listed are not defined and therefore cannot be used on a particular filesystem (e.g. ext3, the third extended filesystem). However, many of the file attributes defined and understood by the kernel have no effect. Most filesystems define these flags in a filesystem-specific header file within the kernel source tree. They also define a so-called '''User Modifiable Mask''' (the flags the user can change with the ''ioctls'').

The meaning of these flags partly depends on the node type (i.e. dir, inode, fifo, pipe, device), and it is not trivial to tell whether a filesystem makes use of a given user-modifiable flag. Things like immutable are easy to verify from user space, but how would one verify e.g. NOTAIL from user space? Usually only a source code review will show whether a flag is implemented and used.

For example, unless this has changed, COMPR is defined and well understood by ext2/3, but there is no implementation behind it, i.e. nothing is compressed.

=== The chroot(1) Command ===

chroot allows you to run a command with a different directory acting as the root directory. This means that all filesystem lookups are done with '/' referring to the substitute root directory and not to the original one.

While the Linux chroot implementation isn't very secure, it increases the isolation of processes with regards to the filesystem, and, if used properly, can create a filesystem "jail" for a single process or a restricted user, daemon or service.

== Required Modifications ==

This chapter will describe the essential Kernel modifications to implement something like Linux-VServer.

=== Context Separation ===

The separation mentioned in the Concepts section requires some modifications to the kernel to allow for the notion of Contexts.
The purpose of this "Context" is to hide all processes outside of its scope, and prohibit any unwanted interaction between a process inside the context and a process belonging to another context.

This separation requires the extension of some existing data structures in order for them to become aware of contexts and to differentiate between identical uids used in different virtual servers.

It also requires the definition of a default context that is used when the host system is booted, and to work around the issues resulting from some false assumptions made by some user-space tools (like pstree) that the init process has to exist and to be running under id '1'.

To simplify administration, the Host Context isn't treated any differently than any other context as far as process isolation is concerned. To allow for process overview, a special Spectator context has been defined to peek at all processes at once.
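The visibility rule described above can be sketched as a small model: a process sees only processes of its own context, the Host Context follows the same rule, and the Spectator context sees everything. The numeric context ids (0 for the host, 1 for the spectator) and the function names are assumptions made for this illustration.

```python
HOST_XID = 0       # assumed id of the default (host) context
SPECTATOR_XID = 1  # assumed id of the spectator context


def can_see(observer_xid, target_xid):
    """A process is visible only from its own context or from the spectator."""
    return observer_xid == SPECTATOR_XID or observer_xid == target_xid


def visible_pids(observer_xid, processes):
    """Filter a {pid: xid} table down to what the observer may see."""
    return sorted(pid for pid, xid in processes.items()
                  if can_see(observer_xid, xid))
```

Note that the host context gets no special treatment in <code>can_see()</code>, which mirrors the statement that it is isolated like any other context.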

=== Network Separation ===

While the Context Separation is sufficient to isolate groups of processes, a different kind of separation, or rather a limitation, is required to confine processes to a subset of available network addresses.

Several issues have to be considered when doing so; for example, bindings to special addresses like INADDR_ANY or the local host address have to be handled in a very special way.

Currently, Linux-VServer doesn't make use of virtual network devices (and maybe never will) to minimize the resulting overhead. Therefore socket binding and packet transmission have been adjusted.

=== The Chroot Barrier ===

One major problem of the chroot() mechanism used in Linux is that the root directory information is volatile: it is simply replaced on the next chroot() Syscall.

One simple method to escape from a chroot-ed environment is as follows: First, create or open a file and retain the file-descriptor, then chroot into a subdirectory at equal or lower level with regards to the file. This causes the root to be moved down in the filesystem. Next, use fchdir() on the file-descriptor to escape from that new root. This will consequently escape from the old root as well, as this was lost in the last chroot() Syscall.

While early Linux-VServer versions tried to fix this by "funny" methods, recent versions use a special marking, known as the Chroot Barrier, on the parent directory of each VPS to prevent unauthorized modification and escape from confinement.

=== Upper Bound for Caps ===

Because the current Linux Capability system does not implement the filesystem related portions of POSIX Capabilities which would make setuid and setgid executables secure, and because it is much safer to have a secure upper bound for all processes within a context, an additional per-context capability mask has been added to limit all processes belonging to that context to this mask.

The meaning of the individual caps (bits) of the capability bound mask is exactly the same as with the permitted capability set.

=== Resource Isolation ===

Most resources are somewhat shared among the different contexts. Some require more isolation than others, either to avoid security issues or to allow for improved accounting.

Those resources are:

* shared memory, IPC
* user and process IDs
* file xid tagging
* Unix ptys
* sockets

=== Filesystem XID Tagging ===

Although it can be disabled completely, this modification is required for more robust filesystem level security and context isolation. It is also mandatory for Context Disk Limits and Per Context Quota Support on a shared partition.

The concept of adding a context id (xid) to each file to make the context ownership persistent sounds simple, but the actual implementation is non-trivial - mainly because adding this information either requires a change to the on disk representation of the filesystem or the application of some tricks.

One non-intrusive approach to avoid modification of the underlying filesystem is to use the upper bits of existing fields, like those for UID and GID to store the additional XID.

Once context information is available for each inode, it is a logical step to extend the access controls to check against context too.
Currently all inode access restrictions have been extended to check for the context id, with special exceptions for the Host Context and the Spectator Context.

Untagged files belong to the Host Context and are silently treated as if they belong to the current context, which is required for Unification. If such a file is modified from inside a context, it silently migrates to the new one, changing its xid.

The following Tagging Methods are implemented:
{| class="wikitablenowrap"
! UID32/GID32 or EXTERNAL
| This format uses currently unused space within the disk inode to store the context information. As of now, this is only defined for ext2/ext3 but will be also defined for xfs, reiserfs, and jfs as soon as possible. Advantage: Full 32bit uid/gid values.
|-
! UID32/GID16
| This format uses the upper half of the group id to store the context information. This is done transparently, except if the format is changed without prior file conversion. Advantage: works on all 32bit U/GID FSs. Drawback: GID is reduced to 16 bits.
|-
! UID24/GID24
| This format uses the upper quarter of user and group id to store the context information, again transparently. This allows for about 16 million user and group ids, which should suffice for the majority of all applications. Advantage: works on all 32bit U/GID FSs. Drawback: UID and GID are reduced to 24 bits.
|}
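The two bit-stealing formats can be illustrated with plain integer arithmetic. This is a sketch only: it assumes a 16-bit xid, and for UID24/GID24 it assumes that the high byte of the xid goes into the top 8 bits of the UID and the low byte into the top 8 bits of the GID. The actual kernel layout may differ, and the function names are made up.

```python
def pack_uid32_gid16(uid, gid, xid):
    """UID32/GID16: store a 16-bit xid in the upper half of the on-disk GID."""
    if gid >> 16 or xid >> 16:
        raise ValueError("gid and xid must fit into 16 bits")
    return uid, (xid << 16) | gid


def unpack_uid32_gid16(disk_uid, disk_gid):
    """Return (uid, gid, xid) from the on-disk pair."""
    return disk_uid, disk_gid & 0xFFFF, disk_gid >> 16


def pack_uid24_gid24(uid, gid, xid):
    """UID24/GID24: split the xid over the top 8 bits of UID and GID.

    The byte order (high byte in UID, low byte in GID) is an assumption.
    """
    if uid >> 24 or gid >> 24 or xid >> 16:
        raise ValueError("uid/gid must fit into 24 bits, xid into 16 bits")
    return ((xid >> 8) << 24) | uid, ((xid & 0xFF) << 24) | gid


def unpack_uid24_gid24(disk_uid, disk_gid):
    """Return (uid, gid, xid) from the on-disk pair."""
    xid = ((disk_uid >> 24) << 8) | (disk_gid >> 24)
    return disk_uid & 0xFFFFFF, disk_gid & 0xFFFFFF, xid
```

Because the stolen bits live in ordinary UID/GID fields, a filesystem mounted without tagging support would show such files with very large, meaningless owner ids, which is why the format must not be changed without converting the files first.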

== Additional Modifications ==

In addition to the bare minimum, there are a number of modifications that are not mandatory, but have proven extremely useful over time.

=== Context Flags ===

It was very soon discovered that some features require a flag, a kind of switch to turn them on and off separately for each Linux-VServer, so a simple flag-word was added.

This flag-word supports quite a number of flags, a flag-word mask that tells which flags are available, and a special trigger mechanism providing one-time flags, which are set on startup and can only be cleared once, usually causing a special action or event.

Here is a list of planned and mostly implemented Context Flags, available in the development branch of Linux-VServer:

{| class="wikitablenowrap"
! [0] VXF_INFO_LOCK
| (legacy, obsoleted)
|-
! [1] VXF_INFO_SCHED
| schedule all processes in a context as if they were one. (legacy, obsoleted)
|-
! [2] VXF_INFO_NPROC
| limit the number of processes in a context to the initial NPROC value. (legacy, obsoleted)
|-
! [3] VXF_INFO_PRIVATE
| do not allow joining this context from outside. (legacy)
|-
! [4] VXF_INFO_INIT
| show the init process with pid '1' (legacy)
|-
! [5] VXF_INFO_HIDE
| (legacy, obsoleted)
|-
! [6] VXF_INFO_ULIMIT
| (legacy, obsoleted)
|-
! [7] VXF_INFO_NSPACE
| (legacy, obsoleted)
|-
! [8] VXF_SCHED_HARD
| activate the Hard CPU scheduling
|-
! [9] VXF_SCHED_PRIO
| use the context token bucket for calculating the process priorities
|-
! [10] VXF_SCHED_PAUSE
| put all processes in this context on the hold queue, not scheduling them any longer
|-
! [16] VXF_VIRT_MEM
| virtualize the memory information so that the VM and RSS limits are used for meminfo and friends
|-
! [17] VXF_VIRT_UPTIME
| virtualize the uptime, beginning with the time of context creation
|-
! [18] VXF_VIRT_CPU
|
|-
! [24] VXF_HIDE_MOUNT
| show empty /proc/{pid}/mounts
|-
! [25] VXF_HIDE_NETIF
| hide network interfaces and addresses not permitted by the network context
|}
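A flag-word combined with a mask is commonly updated as shown below. The bit positions are taken from the table above; the update rule (only bits selected by the mask are changed) and the function name are assumptions of this sketch, not a description of the actual system call.

```python
# Bit positions from the table above.
VXF_SCHED_HARD = 1 << 8
VXF_SCHED_PRIO = 1 << 9
VXF_VIRT_MEM = 1 << 16
VXF_VIRT_UPTIME = 1 << 17


def update_flags(current, new_flags, mask):
    """Replace only the bits selected by mask; all other bits keep their value."""
    return (current & ~mask) | (new_flags & mask)
```

With this rule a caller can switch one feature on or off for a context without having to know, or accidentally disturb, the state of the other flags.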

=== Context Capabilities ===

As the Linux Capabilities have almost reached the maximum number that is possible without heavy modifications to the kernel, it was a natural step to add a context-specific capability system.

The Linux-VServer context capability set acts as a mechanism to fine tune existing Linux capabilities. It is not visible to the processes within a context, as they would not know how to modify or verify it.

In general there are two ways to use those capabilities:

* Require one or a number of context capabilities to be set in addition to a given Linux capability, each one controlling a distinct part of the functionality. For example, CAP_NET_ADMIN could be split into RAW and PACKET sockets, so you could take away each of them separately by not providing the required context capability.

* Consider the context capability sufficient for a specified functionality, even if the Linux Capability says something different. For example, mount() requires CAP_SYS_ADMIN, which adds a dozen other things we do not want, so we define a CCAP_MOUNT to allow mounts for certain contexts.

The difference between the Context Flags and the Context Caps is more an abstract logical separation than a functional one, because they are handled very similarly.

Again, a list of the Context Capabilities and their purpose:

{| class="wikitablenowrap"
! [0] VXC_SET_UTSNAME
| allow the context to change the host and domain name with the appropriate kernel Syscall
|-
! [1] VXC_SET_RLIMIT
| allow the context to modify the resource limits (within the vserver limits).
|-
! [8] VXC_RAW_ICMP
| allow raw icmp packets in a secure way (this makes ping work from inside)
|-
! [16] VXC_SECURE_MOUNT
| permit secure mounts, which at the moment means that the nodev mount option is added.
|}
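The two usage patterns described above differ only in how the Linux capability and the context capability are combined. The following sketch makes that explicit; the function names and <code>CCAP_RAW_SOCKET</code> are hypothetical, while the other bit positions come from the tables in this document.

```python
CAP_NET_ADMIN = 1 << 12      # from the Linux capability table
CAP_SYS_ADMIN = 1 << 21      # from the Linux capability table
VXC_SECURE_MOUNT = 1 << 16   # from the context capability table
CCAP_RAW_SOCKET = 1 << 30    # hypothetical context capability for the example


def allowed_restrictive(caps, ccaps, cap, ccap):
    """Pattern 1: the Linux capability AND the context capability are required."""
    return bool(caps & cap) and bool(ccaps & ccap)


def allowed_permissive(caps, ccaps, cap, ccap):
    """Pattern 2: the context capability alone is sufficient."""
    return bool(caps & cap) or bool(ccaps & ccap)
```

In the first pattern the context capability narrows what a Linux capability grants; in the second it grants one specific operation without handing out the broad Linux capability.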

=== Context Accounting ===

Some properties of a context are useful to the admin, either for keeping an overview of the resources, to get a feeling for the capacity of the host, or for billing them in some way to a customer.

There are two different kinds of accountable properties, those having a current value which represents the state of the system (for example the speed of a vehicle), and those which monotonically increase over time (like the mileage).

Most of the state-type properties also qualify for applying some limits, so they are handled specially. This is described in more detail in the following section.

Good candidates for Context Accounting are:

* Amount of CPU Time spent
* Number of Forks done
* Socket Messages by Type
* Network Packets Transmitted and Received

=== Context Limits ===

Most properties related to system resources, be it the memory consumption, the number of processes or file handles, or the current network bandwidth, qualify for imposing limits on them.

To provide a general framework for all kinds of limits, Context Limits allow the configuration of three different values for each limit-able resource: the minimum, a soft limit and a hard limit (maximum).

At the time this is written, only the hard limits are supported and not all of them are actually enforced, but here is a list of current and planned Context Limits:

* process limits
* scheduler limits
* memory limits
* per-context disk limits
* per-context user/group quota

Additionally the context limit system keeps track of observed maxima and resource limit hits, to provide some feedback for the administrator.
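A limit with this kind of bookkeeping can be modelled in a few lines. The class below is a sketch under the assumptions stated in the text: only the hard limit is enforced, while the minimum and soft values are merely stored. The class and attribute names are made up for the illustration.

```python
class ContextLimit:
    """Toy model of one limit-able resource of a context."""

    def __init__(self, hard, soft=None, minimum=0):
        self.minimum = minimum  # stored, not enforced in this sketch
        self.soft = soft        # stored, not enforced in this sketch
        self.hard = hard
        self.current = 0        # current usage
        self.maximum = 0        # highest usage observed so far
        self.hits = 0           # how often the hard limit was hit

    def charge(self, amount=1):
        """Try to use more of the resource; refuse and count a hit if over limit."""
        if self.current + amount > self.hard:
            self.hits += 1
            return False
        self.current += amount
        self.maximum = max(self.maximum, self.current)
        return True

    def release(self, amount=1):
        """Give back part of the resource."""
        self.current = max(0, self.current - amount)
```

The <code>maximum</code> and <code>hits</code> counters are the feedback for the administrator: a limit that is hit often is probably too tight, and an observed maximum far below the limit suggests it could be lowered.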

=== Virtualization ===

One major difference between the Linux-VServer approach and Virtual Machines is that virtualization does not come as a side effect, so it has to be done by hand where it makes sense.

For example, a Virtual Machine does not need to think about uptime, because the guest OS was naturally started at some point in the past and has no trouble reporting how long it has been running.

A context can also store the time when it was created, but that will differ from the system's uptime, so in addition there has to be some function that adjusts the values passed from kernel to user space depending on the context the process belongs to.

This is what Linux-VServer calls Virtualization (actually it is more a matter of faking some values passed to and from the kernel to make the processes think that they are on a different machine).
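For the uptime example, the adjustment amounts to a subtraction. The sketch below assumes that the context creation time is recorded on the same clock as the host uptime (seconds since host boot); the function and parameter names are made up.

```python
def virtual_uptime(host_uptime, context_created_at, virtualize=True):
    """Return the uptime a process inside the context should be shown.

    host_uptime        -- seconds since the host booted
    context_created_at -- host uptime at the moment the context was created
    virtualize         -- corresponds to a per-context switch; if off, the
                          real host value is passed through unchanged
    """
    if not virtualize:
        return host_uptime
    return max(0.0, host_uptime - context_created_at)
```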

Currently modified for the purpose of Virtualization are:

* System Uptime
* Host and Domain Name
* Machine Type and Kernel Version
* Context Memory Availability
* Context Disk Space

=== Improved Security ===

Proc-FS Security provides a mechanism to protect dynamic entries in the proc filesystem from being seen in every context.
The system consists of three flags for each Proc-FS entry: Admin, Watch and Hide.

The Hide flag enables or disables the entire feature, so any combination with the Hide flag cleared will mean total visibility.
The Admin and Watch flags determine where the hidden entry remains visible; so for example if Admin and Hide are set, the Host Context will be the only one able to see this specific entry.
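The flag logic can be written down as a small decision function. The bit values and context ids are hypothetical, and the reading that Watch keeps an entry visible to the Spectator context is an assumption of this sketch; only the Admin/Hide combination is spelled out in the text.

```python
HOST_XID = 0       # assumed id of the host context
SPECTATOR_XID = 1  # assumed id of the spectator context

# Hypothetical bit values for the three per-entry flags.
FLAG_ADMIN = 1 << 0
FLAG_WATCH = 1 << 1
FLAG_HIDE = 1 << 2


def entry_visible(flags, xid):
    """Decide whether a Proc-FS entry is visible from context xid."""
    if not flags & FLAG_HIDE:
        return True   # feature disabled: total visibility
    if flags & FLAG_ADMIN and xid == HOST_XID:
        return True   # hidden, but still visible to the host
    if flags & FLAG_WATCH and xid == SPECTATOR_XID:
        return True   # hidden, but still visible to the spectator (assumed)
    return False
```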

=== Kernel Helper ===

For some purposes, it makes sense to have a user-space tool act on behalf of the kernel when a process inside a context requests something usually available on a real server, but naturally not available inside a context.

The best, and currently only, example for this is the Reboot Helper, which handles the reboot() system call invoked from inside a context on behalf of the Kernel. It is executed in Host-side user space to take appropriate action: either reboot or just shut down (halt) the specified context.

While the helper is designed to be flexible and to handle different things in a similar way, there are no other users of this helper at the moment. It might be replaced by an event interface in the near future.

== Features and Bonus Material ==

=== Unification ===

Because one of the central objectives for Linux-VServer is to reduce the overall resource usage wherever possible, a truly great idea was born to share files between different contexts without interfering with the usual administrative tasks or reducing the level of security created by the isolation.

Files common to more than one context that are unlikely to change, like libraries or binaries, can be hard linked on a shared filesystem, thus reducing the amount of disk space, inode caches, and even memory mappings for shared libraries.

The only drawback is that without additional measures, a malicious context would be able to deliberately or accidentally destroy or modify such shared files, which in turn would harm the other contexts.

One step is to make the shared files immutable by using the Immutable File Attribute (and removing the Linux Capability required to modify this attribute). However an additional attribute is required to allow removal of such immutable shared files, to allow for updates of libraries or executables from inside a context.

Such hard linked, immutable but unlink-able files belonging to more than one context are called unified and the process of finding common files and preparing them in this way is called Unification.

The reason for doing this is reduced resource consumption, not simplified administration. While a typical Linux Server install will consume about 500MB of disk space, 10 unified servers will only need about 700MB and as a bonus use less memory for caching.
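The hard-linking step of Unification can be sketched as follows. This is a simplified illustration, not the project's tool: it only compares file contents (a real tool would also have to compare ownership, mode and other metadata), it does not set the immutable and unlink attributes because that requires root, and the function name is made up.

```python
import filecmp
import os


def unify(reference_root, guest_root):
    """Hard link files in guest_root to identical files in reference_root.

    Both trees must live on the same filesystem. Returns the number of
    files that were replaced by hard links.
    """
    linked = 0
    for dirpath, _dirs, files in os.walk(reference_root):
        rel = os.path.relpath(dirpath, reference_root)
        for name in files:
            ref = os.path.join(dirpath, name)
            dup = os.path.join(guest_root, rel, name)
            # Only regular files that exist in both trees are candidates.
            if os.path.islink(ref) or os.path.islink(dup):
                continue
            if not os.path.isfile(dup):
                continue
            if os.path.samefile(ref, dup):
                continue  # already unified
            if not filecmp.cmp(ref, dup, shallow=False):
                continue  # contents differ, keep the private copy
            # Link under a temporary name, then atomically replace the copy.
            tmp = dup + ".unify-tmp"
            os.link(ref, tmp)
            os.replace(tmp, dup)
            linked += 1
    return linked
```

Running the function a second time on the same trees finds nothing left to do, because unified files already share an inode.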

=== Private Namespaces ===

A recent addition to the Linux-VServer branch was the introduction of Private Namespaces. This uses the already existing Virtual Filesystem Layer of the Linux kernel to create a separate view of the filesystem for the processes belonging to a context.

The major advantage over the shared namespace used by default is that any modifications to the namespace layout (like mounts) do not affect other contexts, not even the Host Context.

Obviously the drawback of that approach is that entering such a Private Namespace isn't as trivial as changing the root directory, but with proper kernel support this will completely replace the chroot() in the future.

=== The Linux-VServer Proc-FS ===

A structured, dynamically generated subtree of the well-known Proc-FS - actually two of them - has been created to allow for inspecting the different values of Security and Network Contexts.

<pre>
/proc/virtual
.../info

/proc/virtual/<pid>
.../info
.../status
.../sched
.../cvirt
.../cacct
.../limit
</pre>

=== Token Bucket Extensions ===

While the basic idea of Linux-VServer is a peaceful coexistence of all contexts, sharing the common resources in a respectful way, it is sometimes useful to control the resource distribution for resource hungry processes.

The basic principle of a Token Bucket is not very new. It is given here as an example for the Hard CPU Limit. The same principle also applies to scheduler priorities, network bandwidth limitation and resource control in general.

The Hard CPU Limit uses this mechanism in the following way: consider a bucket of a certain size S which is filled with a specified amount of tokens R every interval T, until the bucket is "full" - excess tokens are spilled. At each timer tick, a running process consumes exactly one token from the bucket, unless the bucket is empty, in which case the process is put on a hold queue until the bucket has been refilled with a minimum M of tokens. The process is then rescheduled.

A major advantage of a Token Bucket is that a certain amount of tokens can be accumulated in times of quiescence, which later can be used to burst when resources are required.

Where a per-process Token Bucket would allow for a CPU resource limitation of a single process, a Context Token Bucket allows the CPU usage of all confined processes to be controlled.

Another approach, which is also implemented, is to use the current fill level of the bucket to adjust the process priority, thus reducing the priority of processes belonging to excessive contexts.
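The Hard CPU Limit mechanism described above can be simulated tick by tick. The sketch below follows the description (size S, fill amount R every interval T, minimum M before leaving the hold queue); the class and attribute names are made up, and the order of refill and consumption within one tick is an assumption.

```python
class TokenBucket:
    """Tick-driven model of a context token bucket."""

    def __init__(self, size, fill_rate, interval, tokens_min, tokens=0):
        self.size = size              # S: bucket capacity
        self.fill_rate = fill_rate    # R: tokens added per interval
        self.interval = interval      # T: ticks between refills
        self.tokens_min = tokens_min  # M: tokens needed to leave the hold queue
        self.tokens = tokens
        self.ticks = 0
        self.on_hold = False

    def tick(self, wants_cpu=True):
        """Advance one timer tick; return True if the context got to run."""
        self.ticks += 1
        if self.ticks % self.interval == 0:
            # Refill; excess tokens are spilled.
            self.tokens = min(self.size, self.tokens + self.fill_rate)
        if self.on_hold and self.tokens >= self.tokens_min:
            self.on_hold = False      # enough tokens again: reschedule
        if not wants_cpu or self.on_hold:
            return False
        if self.tokens == 0:
            self.on_hold = True       # bucket empty: put on the hold queue
            return False
        self.tokens -= 1              # running consumes one token per tick
        return True
```

With R=1 and T=4 a busy context converges towards roughly a quarter of the CPU, while an idle context accumulates up to S tokens that it can later spend in a burst.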

=== Context Disk Limits ===

This Feature requires the use of XID Tagged Files, and allows for independent Disk Limits for different contexts on a shared partition.
The number of inodes and blocks for each filesystem is accounted, if an XID-Hash was added for the Context-Filesystem combo.

Those values, including current usage, maximum and reserved space, will be shown for filesystem queries, creating the illusion that the shared filesystem has a different usage and size, for each context.

=== Per-Context Quota ===

Similar to the Context Disk Limits, Per-Context Quota uses separate quota hashes for different Contexts on a shared filesystem. This is not required to allow for Linux-VServer quota on separate partitions.

=== The VRoot Proxy Device ===

Quota operations (ioctls) require some access to the block device, which for security reasons is not available inside a VPS.

=== Stealth ===

For some applications, for example the preparation of a honey-pot or an especially realistic imitation of a real server for educational purposes, it can make sense to make the context indistinguishable from a real server.

However, since other freely available alternatives like QEMU or UML are much better at this, and require much less effort, this is not a central issue in Linux-VServer development.

== Linux-VServer Security ==

Now that we know what the Linux-VServer framework provides and how some features work, let's have a word on security, because you should not rely on the framework to be secure by definition. Instead, you should know exactly what you are doing.

=== Secure Capabilities ===

Currently the following Linux Capabilities are considered secure for VPS use. If others are added, it will probably open some security hole.

* CAP_CHOWN
* CAP_DAC_OVERRIDE
* CAP_DAC_READ_SEARCH
* CAP_FOWNER
* CAP_FSETID
* CAP_KILL
* CAP_SETGID
* CAP_SETUID
* CAP_NET_BIND_SERVICE
* CAP_SYS_CHROOT
* CAP_SYS_PTRACE
* CAP_SYS_BOOT
* CAP_SYS_TTY_CONFIG
* CAP_LEASE

CAP_NET_RAW, for example, is not considered secure. It is often granted to make the broken ping command work, but there are better alternatives, like the userspace ping command poink[U7] or the VXC_RAW_ICMP Context Capability.
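The list above translates directly into a per-context capability bound mask. The following sketch builds that mask from the capability bit numbers as defined in linux/capability.h; the helper names are made up, and how the mask is handed to the kernel is outside the scope of the example.

```python
# Bit numbers as defined in <linux/capability.h>.
SECURE_CAPS = {
    "CAP_CHOWN": 0,
    "CAP_DAC_OVERRIDE": 1,
    "CAP_DAC_READ_SEARCH": 2,
    "CAP_FOWNER": 3,
    "CAP_FSETID": 4,
    "CAP_KILL": 5,
    "CAP_SETGID": 6,
    "CAP_SETUID": 7,
    "CAP_NET_BIND_SERVICE": 10,
    "CAP_SYS_CHROOT": 18,
    "CAP_SYS_PTRACE": 19,
    "CAP_SYS_BOOT": 22,
    "CAP_SYS_TTY_CONFIG": 26,
    "CAP_LEASE": 28,
}
CAP_NET_RAW = 1 << 13  # deliberately not part of the secure set


def bounding_mask(caps=SECURE_CAPS):
    """Combine the listed capabilities into one bitmask."""
    mask = 0
    for bit in caps.values():
        mask |= 1 << bit
    return mask


def apply_bound(permitted, bound):
    """Limit a process's permitted set to the context's upper bound."""
    return permitted & bound
```

Even a process that starts with every capability bit set ends up without CAP_NET_RAW once the bound is applied.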

=== The Chroot Barrier ===

Ensuring that the Barrier flag is set on the parent directory of each VPS is vital if you do not want VPS root to escape from the confinement and walk your Host's root filesystem.

=== Secure Device Nodes ===

The /dev directory of a VPS should not contain more than the following devices and the one directory for the unix pts tree.

* c 1 7 full
* c 1 3 null
* c 5 2 ptmx
* c 1 8 random
* c 5 0 tty
* c 1 9 urandom
* c 1 5 zero
* d pts

Of course other device nodes like console, mem and kmem, even block and character devices can be added, but some expertise is required in order to ensure no security holes are opened.

=== Secure Proc-FS Entries ===

There has been no detailed evaluation of secure and unsecure entries in the proc filesystem, but there have been some incidents where unprotected (not protected via Linux Capabilities) writable proc entries caused mayhem.

For example, /proc/sysrq-trigger is something which should not be accessible inside a VPS without a very good reason.

== Field of Application ==

The primary goal of this project is to create virtual servers sharing the same machine. A virtual server operates like a normal Linux server. It runs normal services such as telnet, mail servers, web servers, and SQL servers.

=== Administrative Separation ===

This allows a clever provider to sell something called a Virtual Private Server, which uses fewer resources than other virtualization techniques, which in turn allows more units to be put on a single machine.

The list of providers doing so is relatively long, and so this is rightfully considered the main area of application.

=== Service Separation ===

Separating different or similar services which otherwise would interfere with each other, either because they are poorly designed or because they are simply incapable of peaceful coexistence for whatever reason, can be easily done with Linux-VServer.

But even on old-fashioned real server machines, putting extremely exposed or untrusted services (untrusted because they are unknown or proprietary) into some kind of jail can improve maintainability and security a lot.

=== Enhancing Security ===

While it can be interesting to run several virtual servers in one box, there is one concept potentially more generally useful. Imagine a physical server running a single virtual server. The goal is to isolate the main environment from any service and any network. You boot in the main environment, start very few services and then continue in the virtual server.

The services in the main environment would be:

* Unreachable from the network.
* Able to log messages from the virtual server in a secure way. The virtual server would be unable to change or erase the logs; even a cracked virtual server would not be able to edit them.
* Able to run intrusion detection facilities, potentially spying on the state of the virtual server without being accessible or noticed. For example, tripwire could run there and it would be impossible to circumvent its operation or trick it.

Another option is to put the firewall in a virtual server, and pull in the DMZ, containing each service in a separate VPS. With proper configuration, this setup can drastically reduce the number of required machines without impacting performance.

=== Easy Maintenance ===

One key feature of a virtual server is the independence from the actual hardware. Most hardware issues are irrelevant for a virtual server installation.

The main server acts as a host and takes care of all the details. The virtual server is just a client and ignores all the details. As such, the client can be moved to another physical server with very few manipulations.

For example, to move the virtual server from one physical computer to another, it is sufficient to do the following:

* shut down the running server
* copy it over to the other machine
* copy the configuration
* start the virtual server on the new machine

No adjustments to user setup, password database or hardware configuration are required, as long as both machines are binary compatible.

=== Fail-over Scenarios ===

Pushing the limit a little further, replication technology could be used to keep an up-to-the-minute copy of the filesystem of a running Virtual Server. This would permit a very fast fail-over if the running server goes offline for whatever reason.

All the known methods to accomplish this can be utilized to reduce the down-time and improve overall efficiency: network replication via rsync or drbd, network devices, shared disk arrays, or distributed filesystems.

=== For Testing ===

Consider a software tool or package which should be built for several versions of a specific distribution (Mandrake 8.2, 9.0, 9.1, 9.2, 10.0) or even for different distributions.

This is easily solved with Linux-VServer. Given plenty of disk space, the different distributions can be installed and running side by side, simplifying the task of switching from one to another.

Of course this can be accomplished by chroot() alone, but with Linux-VServer it's a much more realistic simulation.

== Performance and Stability ==

''(work in progress)''

=== Impact of Linux-VServer on the Host ===

seems to be 0% ...

=== Overhead inside a Context ===

seems to be less than 2% ...

=== Size of the Kernel Patch ===

Comparison of the different patches ...

{| class="wikitablenowrap"
! patch
! hunks
! +
! -
|-
| patch-2.4.24-vs1.00.diff
| 178
| 1112
| 135
|-
| patch-2.4.24-vs1.20.diff
| 216
| 2035
| 178
|-
| patch-2.4.24-vs1.26.diff
| 225
| 2118
| 180
|-
| patch-2.4.25-vs1.27.diff
| 252
| 2166
| 201
|-
| patch-2.4.26-vs1.28.diff
| 254
| 2183
| 202
|-
| patch-2.6.6-vs1.9.0.diff
| 494
| 5699
| 303
|-
| patch-2.6.6-vs1.9.1.diff
| 497
| 5878
| 307
|-
| patch-2.6.7-vs1.9.2.diff
| 618
| 6836
| 348
|-
| uml-patch-2.4.26-1.diff
| 449
| 36885
| 48
|}

== Non Intel i386 Hardware ==

Linux-VServer was designed to be mostly architecture agnostic, therefore only a small part, the syscall definition itself, is architecture specific. Nevertheless some architectures have private copies of basically architecture independent code for whatever reason, and therefore small modifications are often required.

The following architectures are supported and some of them are even tested:

* alpha
* ia32 / ia64 / xbox
* x86_64 (AMD64)
* mips / mips64
* hppa / hppa64
* ppc / ppc64
* sparc / sparc64
* s390
* uml

Adding a new architecture is relatively simple although extensive testing is required to make sure that every feature is working as expected (and of course, the hardware ;).

== Linux Kernel Intro ==

While almost all of the described features reside in the Linux Kernel, nifty Userspace Tools are required to activate and control the new functionality.

Those Userspace Tools in general communicate with the Linux Kernel via System Calls (or Syscalls for short).
This chapter will give a short overview of how Linux Kernel and User Space are organized and how Syscalls, a simple method of communication between processes and kernel, work.

=== Kernel and User Space ===

In Linux and similar Operating Systems, User and Kernel Space are separated, and the address space is divided into two parts. Kernel space is where the kernel code resides, and user space is where the user programs live. Of course, a given user program can't write to kernel memory or to another program's memory area.

Unfortunately, this is also the case for kernel code. Kernel code can't write to user space either. What does this mean? Well, when a given hardware driver wants to write data bytes to a program in user memory, it can't do it directly, but rather it must use specific kernel functions instead. Also, when parameters are passed by address to a kernel function, the kernel function can not read the parameters directly. It must use other kernel functions to read each byte of the parameters.

Of course, there are some helpers which do the transfer to and from user space.

<pre>
copy_to_user(void *to, const void *from, long n);
copy_from_user(void *to, const void *from, long n);
</pre>

get_user() and put_user() get or put the given byte, word, or long from or to user memory. These are macros, and they rely on the type of the argument to determine the number of bytes to transfer.

=== Linux Syscalls ===

Most libc calls rely on system calls, which are the simplest kernel functions a user program can call.

These system calls are implemented in the kernel itself or in loadable kernel modules, which are little chunks of dynamically link-able kernel code.

Linux system calls are implemented through a multiplexor that is entered via a software interrupt. On i386 Linux, this interrupt is int 0x80. When the 'int 0x80' instruction is executed, control is given to the kernel (or, more accurately, to the _system_call() function), and the actual demultiplexing process occurs.

How does _system_call() work?

First, all registers are saved and the content of the %eax register is checked against the global system calls table, which enumerates all system calls and their addresses.

This table can be accessed with the extern void *sys_call_table[] variable. A given number and memory address in this table corresponds to each system call.

System call numbers can be found in /usr/include/sys/syscall.h.

They are of the form SYS_systemcallname. If the system call is not implemented, the corresponding cell in the sys_call_table is 0, and an error is returned.

Otherwise, the system call actually exists and the corresponding entry in the table is the memory address of the system call code.
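The table lookup described above can be modelled in plain user-space C. This is only a sketch under stated assumptions: the <code>demo_*</code> names, the one-argument handler type and the four-entry table are invented for illustration, and the real kernel dispatch is written in assembly. The one grounded detail is that Linux reports an unimplemented system call with <code>-ENOSYS</code>.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical handler type; real system calls take up to six arguments. */
typedef long (*demo_syscall_fn)(long arg);

static long demo_getpid(long arg) { (void)arg; return 4242; }
static long demo_double(long arg) { return 2 * arg; }

#define DEMO_NR_SYSCALLS 4

/* Slots 1 and 3 stay NULL (0) to model unimplemented system calls. */
static const demo_syscall_fn demo_call_table[DEMO_NR_SYSCALLS] = {
	[0] = demo_getpid,
	[2] = demo_double,
};

/* Model of the demultiplexer: 'nr' plays the role of the %eax register. */
long demo_system_call(unsigned long nr, long arg)
{
	if (nr >= DEMO_NR_SYSCALLS || demo_call_table[nr] == NULL)
		return -ENOSYS;		/* empty cell: return an error */
	return demo_call_table[nr](arg);	/* jump to the stored address */
}
```

Calling <code>demo_system_call(2, 21)</code> reaches <code>demo_double()</code>, while an empty or out-of-range slot yields <code>-ENOSYS</code>.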

== Kernel Side Implementation ==

While this chapter is mainly of interest to kernel developers, it might be fun to take a small peek behind the curtain to get a glimpse of how everything really works.

=== The Syscall Command Switch ===

For a long time Linux-VServer used a few different Syscalls to accomplish different aspects of the work, but very soon the number of required commands grew large, and the Syscalls started to have magic values, selecting the desired behavior.

Not too long ago, a single syscall was reserved for Linux-VServer, and while the opinion on that might differ from developer to developer, it was generally considered a good decision not to have more than one syscall.

The advantage of different Syscalls would be simpler handling of the Syscalls on different architectures; however, this hasn't been a problem so far, as the data passed to and from the kernel has strongly typed fields conforming to the C99 types.

Regardless, the availability of one system call required the creation of a multiplexor, which decides, based on some selector, what specific command is to be executed, and then passes on the remaining arguments to that command, which does the actual work.

<pre>
extern asmlinkage long
sys_vserver(uint32_t cmd, uint32_t id, void __user *data)
</pre>

The Linux-VServer syscall is passed three arguments regardless of what actual command is specified: a command (cmd), a number (id), and a user-space data structure of as yet unknown size.

To allow for some structure for debugging purposes and some kind of command versioning, the cmd is split into three parts: the lower 12 bits contain a version number, then 4 bits are reserved; the upper 16 bits are divided into an 8-bit command and a 6-bit category, again reserving 2 bits for the future.

There are 64 Categories with up to 256 commands in each category, allowing for 4096 revisions of each command, which is far more than will ever be required.
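The following sketch packs and unpacks such a command word. The text fixes the field widths but not the exact position of command and category within the upper 16 bits, so the layout used here (category in bits 24 to 29, command in bits 16 to 23) is an assumption, and the <code>demo_*</code> helpers are illustrative names rather than the project's macros.

```c
#include <stdint.h>

/*
 * Assumed layout of the 32-bit command word:
 *   [31:30] reserved   [29:24] category   [23:16] command
 *   [15:12] reserved   [11:0]  version
 */
static inline uint32_t demo_vc_cmd(uint32_t category, uint32_t command,
				   uint32_t version)
{
	return ((category & 0x3Fu) << 24) |
	       ((command & 0xFFu) << 16) |
	       (version & 0xFFFu);
}

static inline uint32_t demo_vc_category(uint32_t cmd)
{
	return (cmd >> 24) & 0x3Fu;	/* 6 bits: 64 categories */
}

static inline uint32_t demo_vc_command(uint32_t cmd)
{
	return (cmd >> 16) & 0xFFu;	/* 8 bits: 256 commands */
}

static inline uint32_t demo_vc_version(uint32_t cmd)
{
	return cmd & 0xFFFu;		/* 12 bits: 4096 revisions */
}
```

Under this assumed layout, category 52, command 1, version 0 packs to <code>0x34010000</code>.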

Here is an overview of the categories already defined, and their numerical value:

<pre>
Syscall Matrix V2.6

       |VERSION|CREATE |MODIFY |MIGRATE|CONTROL|EXPERIM| |SPECIAL|SPECIAL|
       |STATS  |DESTROY|ALTER  |CHANGE |LIMIT  |TEST   | |       |       |
       |INFO   |SETUP  |       |MOVE   |       |       | |       |       |
-------+-------+-------+-------+-------+-------+-------+ +-------+-------+
SYSTEM |VERSION|VSETUP |VHOST  |       |       |       | |DEVICES|       |
HOST   |     00|     01|     02|     03|     04|     05| |     06|     07|
-------+-------+-------+-------+-------+-------+-------+ +-------+-------+
CPU    |       |VPROC  |PROCALT|PROCMIG|PROCTRL|       | |SCHED. |       |
PROCESS|     08|     09|     10|     11|     12|     13| |     14|     15|
-------+-------+-------+-------+-------+-------+-------+ +-------+-------+
MEMORY |       |       |       |       |       |       | |SWAP   |       |
       |     16|     17|     18|     19|     20|     21| |     22|     23|
-------+-------+-------+-------+-------+-------+-------+ +-------+-------+
NETWORK|       |VNET   |NETALT |NETMIG |NETCTL |       | |SERIAL |       |
       |     24|     25|     26|     27|     28|     29| |     30|     31|
-------+-------+-------+-------+-------+-------+-------+ +-------+-------+
DISK   |       |       |       |       |       |       | |INODE  |       |
VFS    |     32|     33|     34|     35|     36|     37| |     38|     39|
-------+-------+-------+-------+-------+-------+-------+ +-------+-------+
OTHER  |       |       |       |       |       |       | |VINFO  |       |
       |     40|     41|     42|     43|     44|     45| |     46|     47|
=======+=======+=======+=======+=======+=======+=======+ +=======+=======+
SPECIAL|       |       |       |       |FLAGS  |       | |       |       |
       |     48|     49|     50|     51|     52|     53| |     54|     55|
-------+-------+-------+-------+-------+-------+-------+ +-------+-------+
SPECIAL|       |       |       |       |RLIMIT |SYSCALL| |       |COMPAT |
       |     56|     57|     58|     59|     60|TEST 61| |     62|     63|
-------+-------+-------+-------+-------+-------+-------+ +-------+-------+
</pre>

The definition of those Commands is simplified by some macros, so for example the commands to get and set the Context Flags are defined like this:

<pre>
#define VCMD_get_cflags VC_CMD(FLAGS, 1, 0)
#define VCMD_set_cflags VC_CMD(FLAGS, 2, 0)

extern int vc_get_cflags(uint32_t, void __user *);
extern int vc_set_cflags(uint32_t, void __user *);
</pre>

Note that the command itself is not passed to the actual command implementation, only the id and the pointer to user-space data.
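A user-space model of the command switch might look as follows. It is a sketch, not the kernel code: the <code>demo_*</code> names, the single global flag word and the bit layout of the command are assumptions, and a real implementation would copy the data with copy_from_user() and copy_to_user() instead of dereferencing the pointer directly.

```c
#include <errno.h>
#include <stdint.h>

/* Assumed layout: category in bits 24-29, command in 16-23, version in 0-11. */
#define DEMO_CMD(cat, cmd, ver) \
	((((uint32_t)(cat) & 0x3F) << 24) | \
	 (((uint32_t)(cmd) & 0xFF) << 16) | ((uint32_t)(ver) & 0xFFF))

#define DEMO_CAT_FLAGS	52	/* FLAGS cell of the syscall matrix */
#define DEMO_get_cflags	DEMO_CMD(DEMO_CAT_FLAGS, 1, 0)
#define DEMO_set_cflags	DEMO_CMD(DEMO_CAT_FLAGS, 2, 0)

static uint64_t demo_flags;	/* stand-in for one context's flag word */

/* The handlers only see the id and the data pointer, never the cmd. */
static int demo_get_cflags(uint32_t id, void *data)
{
	(void)id;
	*(uint64_t *)data = demo_flags;
	return 0;
}

static int demo_set_cflags(uint32_t id, void *data)
{
	(void)id;
	demo_flags = *(const uint64_t *)data;
	return 0;
}

/* The multiplexor: select the handler from cmd, pass on the rest. */
long demo_sys_vserver(uint32_t cmd, uint32_t id, void *data)
{
	switch (cmd) {
	case DEMO_get_cflags:
		return demo_get_cflags(id, data);
	case DEMO_set_cflags:
		return demo_set_cflags(id, data);
	default:
		return -ENOSYS;	/* unknown command or version */
	}
}
```

Because the version is part of the selector, a later revision of a command gets its own <code>case</code> and can coexist with the old one.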

=== Utilized Data Structures ===

There are many different data structures used by different parts of the implementation; while only a few examples are given here, all utilized structures can be found in the source.

==== The Context Data Structure ====

The Context Data Structure consists of a few fields required to manage the contexts, and handle context destruction, as well as future hierarchical contexts.

Logically separated sections of that structure, such as those for the scheduler or the context limits, are defined in separate structures and incorporated into the main one.

<pre>
struct vx_info {
	struct list_head vx_list;	/* linked list of contexts */
	xid_t vx_id;			/* context id */
	atomic_t vx_refcount;		/* refcount */
	struct vx_info *vx_parent;	/* parent context */

	struct namespace *vx_namespace;	/* private namespace */
	struct fs_struct *vx_fs;	/* private namespace fs */
	uint64_t vx_flags;		/* context flags */
	uint64_t vx_bcaps;		/* bounding caps (system) */
	uint64_t vx_ccaps;		/* context caps (vserver) */

	pid_t vx_initpid;		/* PID of fake init process */

	struct _vx_limit limit;		/* vserver limits */
	struct _vx_sched sched;		/* vserver scheduler */
	struct _vx_cvirt cvirt;		/* virtual/bias stuff */
	struct _vx_cacct cacct;		/* context accounting */

	char vx_name[65];		/* vserver name */
};
</pre>

Here, as an example, is the Scheduler Substructure:
<pre>
struct _vx_sched {
	spinlock_t tokens_lock;		/* lock for this structure */

	int fill_rate;			/* Fill rate: add X tokens ... */
	int interval;			/* Divisor: ... each Y jiffies */
	atomic_t tokens;		/* current number of tokens */
	int tokens_min;			/* Limit: minimum for unhold */
	int tokens_max;			/* Limit: no more than N tokens */
	uint32_t jiffies;		/* bias: integral multiple of Y */

	uint64_t ticks;			/* token tick events */
	cpumask_t cpus_allowed;		/* cpu mask for context */
};
</pre>

The main idea behind this separation is that each substructure belongs to a logically distinct part of the implementation which provides an init and cleanup function for this structure, thus simplifying maintainability and readability of those structures.

==== The Scheduler Command Data ====

As an example of a data structure used to control a specific part of the context from user-space, here is a scheduler command and the data structure it uses to set the properties:

<pre>
#define VCMD_set_sched VC_CMD(SCHED, 1, 2)

struct vcmd_set_sched_v2 {
	int32_t fill_rate;	/* Fill rate: add X tokens ... */
	int32_t interval;	/* Divisor: ... each Y jiffies */
	int32_t tokens;		/* current number of tokens */
	int32_t tokens_min;	/* Limit: minimum for unhold */
	int32_t tokens_max;	/* Limit: no more than N tokens */
	uint64_t cpu_mask;	/* Mask: allowed cpus */
};
</pre>
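The field comments describe a token bucket, which can be sketched in user-space C as below. The refill rule and the two limits follow those comments. The rest is assumed for illustration: that a running context is charged one token per tick, that it is put on hold once the bucket is empty, and all of the <code>demo_*</code> names.

```c
#include <stdbool.h>

struct demo_sched {
	int fill_rate;	/* add this many tokens ... */
	int interval;	/* ... every 'interval' jiffies */
	int tokens;	/* current number of tokens */
	int tokens_min;	/* tokens needed before a held context resumes */
	int tokens_max;	/* bucket capacity */
	bool on_hold;	/* true while the context may not run */
};

/* Advance time by 'elapsed' jiffies; leftover jiffies are dropped for brevity. */
void demo_refill(struct demo_sched *s, int elapsed)
{
	int intervals = elapsed / s->interval;

	s->tokens += intervals * s->fill_rate;
	if (s->tokens > s->tokens_max)
		s->tokens = s->tokens_max;
	if (s->on_hold && s->tokens >= s->tokens_min)
		s->on_hold = false;
}

/* Charge one tick of CPU time; returns false if the context must wait. */
bool demo_tick(struct demo_sched *s)
{
	if (s->on_hold || s->tokens <= 0) {
		s->on_hold = true;
		return false;
	}
	s->tokens--;
	return true;
}
```

With <code>tokens_min</code> larger than one, a context that has run dry does not resume on the first new token but only after a small reserve has built up, which avoids rapid hold/unhold flapping.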

==== Example Accounting: Sockets ====

Basically all the accounting and limit code is defined as macros or inline functions capable of handling the different resources, hiding the underlying implementation wherever possible.

<pre>
#define vx_acc_sock(v,f,p,s) \
	__vx_acc_sock((v), (f), (p), (s), __FILE__, __LINE__)

static inline void __vx_acc_sock(struct vx_info *vxi,
	int family, int pos, int size, char *file, int line)
{
	if (vxi) {
		int type = vx_sock_type(family);

		atomic_inc(&vxi->cacct.sock[type][pos].count);
		atomic_add(size, &vxi->cacct.sock[type][pos].total);
	}
}

#define vx_sock_recv(sk,s) \
	vx_acc_sock((sk)->sk_vx_info, (sk)->sk_family, 0, (s))
#define vx_sock_send(sk,s) \
	vx_acc_sock((sk)->sk_vx_info, (sk)->sk_family, 1, (s))
#define vx_sock_fail(sk,s) \
	vx_acc_sock((sk)->sk_vx_info, (sk)->sk_family, 2, (s))
</pre>

And this general definition is then used where appropriate, for example in the __sock_sendmsg() function like this:

<pre>
len = sock->ops->sendmsg(iocb, sock, msg, size);
if (sock->sk) {
	if (len == size)
		vx_sock_send(sock->sk, size);
	else
		vx_sock_fail(sock->sk, size);
}
</pre>

==== Example Limits: Virtual Memory ====

<pre>
#define vx_pages_avail(m, p, r) \
	__vx_pages_avail((m)->mm_vx_info, (r), (p), __FILE__, __LINE__)

static inline int __vx_pages_avail(struct vx_info *vxi,
	int res, int pages, char *file, int line)
{
	if (!vxi)
		return 1;
	if (vxi->limit.rlim[res] == RLIM_INFINITY)
		return 1;
	if (atomic_read(&vxi->limit.res[res]) +
		pages < vxi->limit.rlim[res])
		return 1;
	return 0;
}

#define vx_vmpages_avail(m,p)	vx_pages_avail(m, p, RLIMIT_AS)
#define vx_vmlocked_avail(m,p)	vx_pages_avail(m, p, RLIMIT_MEMLOCK)
#define vx_rsspages_avail(m,p)	vx_pages_avail(m, p, RLIMIT_RSS)
</pre>

Again, the test against those limits is then placed where appropriate, for example in copy_process():

<pre>
/* check vserver memory */
if (p->mm && !(clone_flags & CLONE_VM)) {
	if (vx_vmpages_avail(p->mm, p->mm->total_vm))
		vx_pages_add(p->mm->mm_vx_info,
			RLIMIT_AS, p->mm->total_vm);
	else
		goto bad_fork_free;
}
</pre>
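The check-then-charge pattern above can be tried out in user space with a small model. The <code>demo_*</code> names and the use of <code>ULONG_MAX</code> as a stand-in for RLIM_INFINITY are assumptions made for this sketch; the comparison itself mirrors the strict "less than" of the excerpt.

```c
#include <limits.h>

#define DEMO_UNLIMITED ULONG_MAX	/* stand-in for RLIM_INFINITY */

struct demo_limit {
	unsigned long used;	/* pages currently accounted to the context */
	unsigned long max;	/* configured limit in pages */
};

/* Returns 1 if 'pages' more pages would still fit under the limit. */
int demo_pages_avail(const struct demo_limit *l, unsigned long pages)
{
	if (l->max == DEMO_UNLIMITED)
		return 1;
	return l->used + pages < l->max;
}

/* Check first, then account; returns 0 on success, -1 if refused. */
int demo_charge(struct demo_limit *l, unsigned long pages)
{
	if (!demo_pages_avail(l, pages))
		return -1;
	l->used += pages;
	return 0;
}
```

Because the comparison is strict, a context with a limit of 100 pages and 90 in use may take 9 more pages but not 10, so the counter never reaches the limit itself.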

==== Example Virtualization: Uptime ====

<pre>
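void vx_vsi_uptime(struct timespec *uptime, struct timespec *idle)
{
	struct vx_info *vxi = current->vx_info;

	/* subtract the per-context bias from the host uptime */
	set_normalized_timespec(uptime,
		uptime->tv_sec - vxi->cvirt.bias_tp.tv_sec,
		uptime->tv_nsec - vxi->cvirt.bias_tp.tv_nsec);
	/* the adjustment of idle is not shown in this excerpt */
	return;
}

/* call site: adjust only if uptime virtualization is enabled */
if (vx_flags(VXF_VIRT_UPTIME, 0))
	vx_vsi_uptime(&uptime, &idle);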
</pre>
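The arithmetic behind this can be sketched outside the kernel. The code below assumes that both inputs are already normalized (nanoseconds between 0 and 999,999,999), so a single borrow is enough; <code>demo_timespec</code> and <code>demo_virt_uptime()</code> are illustrative names, not kernel interfaces.

```c
#define DEMO_NSEC_PER_SEC 1000000000L

struct demo_timespec {
	long long sec;
	long nsec;
};

/* Guest uptime = host uptime - bias, with nsec kept in [0, 1e9). */
struct demo_timespec demo_virt_uptime(struct demo_timespec host,
				      struct demo_timespec bias)
{
	struct demo_timespec r;

	r.sec = host.sec - bias.sec;
	r.nsec = host.nsec - bias.nsec;
	if (r.nsec < 0) {		/* borrow one second */
		r.nsec += DEMO_NSEC_PER_SEC;
		r.sec -= 1;
	}
	return r;
}
```

For a host uptime of 1000.2 s and a bias of 400.7 s, this yields 599.5 s as the uptime seen inside the context.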

== Future Directions ==

''(work in progress)''

=== Hierarchical Contexts ===

=== Security Branch ===

=== Stealth Branch ===

Paper

2006-09-25T13:38:04Z

Meandtheshell: grammar

== Abstract ==

A soft partitioning concept based on ''Security Contexts'' which permits the creation of many independent Virtual Private Servers (VPS) that run simultaneously on a single physical server at full speed, efficiently sharing hardware resources.

A VPS provides an operating environment almost identical to that of a conventional Linux Server. All services, such as ssh, mail, Web and databases, can be started on such a VPS, without (or in special cases with only minimal) modification, just like on any real server.

Each virtual server has its own user account database and root password and is isolated from other virtual servers, except for the fact that they share the same hardware resources.

== Introduction ==

Over the years, computers have become sufficiently powerful to use virtualization to create the illusion of many smaller virtual machines, each running a separate operating system instance.

There are several kinds of Virtual Machines (VMs) which provide similar features, but differ in the degree of abstraction and the methods used for virtualization.

Most of them accomplish what they do by ''emulating'' some real or fictional hardware, which in turn requires ''real'' resources from the Host (the machine running the VMs). This approach, used by most System Emulators (like QEMU, Bochs, ...), allows the emulator to run an arbitrary Guest Operating System, even for a different Architecture (CPU and Hardware). No modifications need to be made to the Guest OS because it isn't aware of the fact that it isn't running on real hardware.

Some System Emulators require small modifications or specialized drivers to be added to Host or Guest to improve performance and minimize the overhead required for the hardware emulation. Although this significantly improves efficiency, there are still large amounts of resources being wasted in caches and mediation between Guest and Host (examples for this approach are UML and Xen).

But suppose you do not want to run many different Operating Systems simultaneously on a single box? Most applications running on a server do not require hardware access or kernel level code, and could easily share a machine with others, if they could be separated and secured...

== The Concept ==

At a basic level, a Linux Server consists of three building blocks: Hardware, Kernel and Applications. The Hardware usually depends on the provider or system maintainer, and, while it has a big influence on the overall performance, it cannot be changed that easily, and will likely differ from one setup to another.

The main purpose of the Kernel is to build an abstraction layer on top of the hardware to allow processes (Applications) to work with and operate on resources (Data) without knowing the details of the underlying hardware. Ideally, those processes would be completely hardware agnostic, by being written in an interpreted language and therefore not requiring any hardware-specific knowledge.

Given that a system has enough resources to drive ten times the number of applications a single Linux server would usually require, why not put ten servers on that box, which will then share the available resources in an efficient manner?

Most server applications (e.g. httpd) will assume that they are the only application providing a particular service, and usually will also assume a certain filesystem layout and environment. This dictates that similar or identical services running on the same physical server, differing for example only in their addresses, have to be coordinated. This typically requires a great deal of administrative work, which can lead to reduced system stability and security.

The basic concept of the Linux-VServer solution is to separate the user-space environment into distinct units (sometimes called Virtual Private Servers) in such a way that each VPS looks and feels like a real server to the processes contained within.

Although different Linux Distributions use (sometimes heavily) patched kernels to provide special support for unusual hardware or extra functionality, most Linux Distributions are not tied to a special kernel.

Linux-VServer uses this fact to allow several distributions to be run simultaneously on a single, shared kernel, without direct access to the hardware, and to share the resources in a very efficient way.

== Existing Infrastructure ==

Recent Linux Kernels already provide many security features that are utilized by Linux-VServer to do its work, especially the Linux Capability System, Resource Limits, File Attributes and the Change Root Environment. The following sections will give a short overview of each of these.

=== Linux Capability System ===

In computer science, a capability is a token used by a process to prove that it is allowed to perform an operation on an object. The Linux Capability System is based on "POSIX Capabilities", a somewhat different concept, designed to split up the all powerful root privilege into a set of distinct privileges.

==== POSIX Capabilities ====

A process has three bitmaps, called the inheritable (I), permitted (P), and effective (E) capability sets. Each capability is implemented as a bit in each of these bitmaps that is either set or unset.

When a process tries to do a privileged operation, the operating system will check the appropriate bit in the effective set of the process (instead of checking whether the effective uid of the process is 0 as is normally done).

For example, when a process tries to set the clock, the Linux kernel will check that the process has the CAP_SYS_TIME bit (which is currently bit 25) set in its effective set.

The permitted set of the process indicates the capabilities the process can use. The process can have capabilities set in the permitted set that are not in the effective set.

This indicates that the process has temporarily disabled this capability. A process is allowed to set a bit in its effective set only if it is available in the permitted set. The distinction between effective and permitted exists so that processes can "bracket" operations that need privilege.

The inheritable capabilities are the capabilities of the current process that should be inherited by a program executed by the current process. The permitted set of a process is masked against the inheritable set during exec(). Nothing special happens during fork() or clone(). Child processes and threads are given an exact copy of the capabilities of the parent process.

The implementation in Linux stopped at this point, whereas POSIX Capabilities[U5] requires the addition of capability sets to files too, to replace the SUID flag (at least for executables).
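The rules described in this section can be modelled with three 32-bit words. This is a simplified sketch that follows the description given here, not the kernel's implementation: the <code>demo_*</code> names are invented, and trimming the effective set to the new permitted set on exec() is an assumption added to keep the model consistent.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_CAP_SYS_TIME	25	/* bit number quoted in the text */
#define DEMO_CAP_MASK(bit)	(UINT32_C(1) << (bit))

struct demo_caps {
	uint32_t inheritable;
	uint32_t permitted;
	uint32_t effective;
};

/* A privileged operation checks only the effective set. */
bool demo_capable(const struct demo_caps *c, int bit)
{
	return (c->effective & DEMO_CAP_MASK(bit)) != 0;
}

/* A bit may be raised in the effective set only if it is permitted. */
bool demo_cap_raise(struct demo_caps *c, int bit)
{
	if (!(c->permitted & DEMO_CAP_MASK(bit)))
		return false;
	c->effective |= DEMO_CAP_MASK(bit);
	return true;
}

/* Temporarily disable a capability; it stays in the permitted set. */
void demo_cap_drop(struct demo_caps *c, int bit)
{
	c->effective &= ~DEMO_CAP_MASK(bit);
}

/* exec(): the permitted set is masked against the inheritable set. */
struct demo_caps demo_caps_after_exec(struct demo_caps old)
{
	struct demo_caps n = old;

	n.permitted &= old.inheritable;
	n.effective &= n.permitted;	/* assumption: effective never exceeds permitted */
	return n;
}
```

"Bracketing" a privileged operation then means calling <code>demo_cap_raise()</code> immediately before it and <code>demo_cap_drop()</code> immediately after it.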

==== Capability Overview ====

The list of POSIX Capabilities used with Linux is long, and the 32 available bits are almost used up. While the detailed list of all capabilities can be found in /usr/include/linux/capability.h on most Linux systems, an overview of important capabilities is given here.

{| class="wikitablenowrap"
! [0] CAP_CHOWN
| change file ownership and group.
|-
! [5] CAP_KILL
| send a signal to a process with a different real or effective user ID
|-
! [6] CAP_SETGID
| permit setgid(2), setgroups(2), and forged gids on socket credentials passing
|-
! [7] CAP_SETUID
| permit set*uid(2), and forged uids on socket credentials passing
|-
! [8] CAP_SETPCAP
| transfer/remove any capability in permitted set to/from any pid
|-
! [9] CAP_LINUX_IMMUTABLE
| allow modification of S_IMMUTABLE and S_APPEND file attributes
|-
! [11] CAP_NET_BROADCAST
| permit broadcasting and listening to multicast
|-
! [12] CAP_NET_ADMIN
| permit interface configuration, IP firewall, masquerading, accounting, socket debugging, routing tables, bind to any address, enter promiscuous mode, multicasting, ...
|-
! [13] CAP_NET_RAW
| permit usage of RAW and PACKET sockets
|-
! [16] CAP_SYS_MODULE
| insert and remove kernel modules
|-
! [18] CAP_SYS_CHROOT
| permit chroot(2)
|-
! [19] CAP_SYS_PTRACE
| permit ptrace() of any process
|-
! [21] CAP_SYS_ADMIN
| this list would be too long, it basically allows to do everything else, not mentioned in another capability.
|-
! [22] CAP_SYS_BOOT
| permit reboot(2)
|-
! [23] CAP_SYS_NICE
| allow raising priority and setting priority on other processes, modify scheduling
|-
! [24] CAP_SYS_RESOURCE
| override resource limits, quota, reserved space on fs, ...
|-
! [27] CAP_MKNOD
| permit the privileged aspects of mknod(2)
|}

=== Resource Limits ===

Resources for each process can be limited by specifying a Resource Limit. Similar to the Linux Capabilities, there are two different limits, a Soft Limit and a Hard Limit.

The soft limit is the value that the kernel enforces for the corresponding resource. The hard limit acts as a ceiling for the soft limit: an unprivileged process may only set its soft limit to a value in the range from zero up to the hard limit, and (irreversibly) lower its hard limit. A privileged process may make arbitrary changes to either limit value, as long as the soft limit does not exceed the hard limit.
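From user space, both values are read and written with the standard getrlimit() and setrlimit() calls. The helper below is a minimal sketch: <code>demo_set_soft_limit()</code> is a name made up for this example, and it only changes the soft limit while leaving the hard limit untouched.

```c
#include <sys/resource.h>

/* Set the soft limit of 'resource'; returns 0 on success, -1 on failure. */
int demo_set_soft_limit(int resource, rlim_t new_soft)
{
	struct rlimit rl;

	if (getrlimit(resource, &rl) != 0)
		return -1;
	/* the soft limit may not exceed the hard limit */
	if (rl.rlim_max != RLIM_INFINITY && new_soft > rl.rlim_max)
		return -1;
	rl.rlim_cur = new_soft;		/* rl.rlim_max stays as it was */
	return setrlimit(resource, &rl);
}
```

For example, <code>demo_set_soft_limit(RLIMIT_CORE, 0)</code> disables core dumps for the calling process and its children, and an unprivileged process can later raise the value again up to the hard limit.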

==== Limit-able Resource Overview ====

The list of all defined resource limits can be found in /usr/include/asm/resource.h on most Linux systems; an overview of relevant resource limits is given here.

{| class="wikitablenowrap"
|-
! [0] RLIMIT_CPU
| CPU time in seconds. process is sent a SIGXCPU signal after reaching the soft limit, and SIGKILL on hard limit.
|-
! [4] RLIMIT_CORE
| maximum size of core files generated
|-
! [5] RLIMIT_RSS
| number of pages the process's resident set can consume (the number of virtual pages resident in RAM)
|-
! [6] RLIMIT_NPROC
| The maximum number of processes that can be created for the real user ID of the calling process.
|-
! [7] RLIMIT_NOFILE
| Specifies a value one greater than the maximum file descriptor number that can be opened by this process.
|-
! [8] RLIMIT_MEMLOCK
| The maximum number of virtual memory pages that may be locked into RAM using mlock() and mlockall().
|-
! [9] RLIMIT_AS
| The maximum number of virtual memory pages available to the process (address space limit).
|}

=== File Attributes ===

Originally, this feature was only available with ext2, but now all major filesystems implement a basic set of File Attributes that permit certain properties to be changed. Here again is a short overview of the possible attributes, and what they mean.

{| class="wikitablenowrap"
! s SECRM
| When a file with this attribute set is deleted, its blocks are zeroed and written back to the disk.
|-
! u UNRM
| When a file with this attribute set is deleted, its contents are saved.
|-
! c COMPR
| files marked with this attribute are automatically compressed on write and uncompressed on read. (not implemented yet)
|-
! i IMMUTABLE
| A file with this attribute cannot be modified: it cannot be deleted or renamed, no link can be created to this file and no data can be written to the file.
|-
! a APPEND
| files with this attribute set can only be opened in append mode for writing.
|-
! d NODUMP
| if this flag is set, the file is not a candidate for backup with the dump utility.
|-
! S SYNC
| updates to the file contents are done synchronously.
|-
! A NOATIME
| prevents updating the atime record on files when they are accessed or modified.
|-
! t NOTAIL
| A file with the t attribute will not have a partial block fragment at the end of the file merged with other files.
|-
! D DIRSYNC
| changes to a directory having this attribute set will be done synchronously.
|}

=== The chroot(1) Command ===

chroot allows you to run a command with a different directory acting as the root directory. This means that all filesystem lookups are done with '/' referring to the substitute root directory and not to the original one.

While the Linux chroot implementation isn't very secure, it increases the isolation of processes with regard to the filesystem, and, if used properly, can create a filesystem "jail" for a single process or a restricted user, daemon or service.

== Required Modifications ==

This chapter will describe the essential Kernel modifications to implement something like Linux-VServer.

=== Context Separation ===

The separation mentioned in the Concepts section requires some modifications to the kernel to allow for the notion of Contexts.
The purpose of this "Context" is to hide all processes outside of its scope, and prohibit any unwanted interaction between a process inside the context and a process belonging to another context.

This separation requires the extension of some existing data structures in order for them to become aware of contexts and to differentiate between identical uids used in different virtual servers.

It also requires the definition of a default context that is used when the host system is booted, and to work around the issues resulting from some false assumptions made by some user-space tools (like pstree) that the init process has to exist and to be running under id '1'.

To simplify administration, the Host Context isn't treated any differently than any other context as far as process isolation is concerned. To allow for process overview, a special Spectator context has been defined to peek at all processes at once.
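The visibility rule described in this section can be written down as a small predicate. The sketch is purely illustrative: the function name and the numeric ids chosen for the Host and Spectator contexts are assumptions, and the real kernel performs such checks in several places.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t demo_xid_t;

#define DEMO_XID_HOST		0	/* assumed id of the default (host) context */
#define DEMO_XID_SPECTATOR	1	/* assumed id of the Spectator context */

/* May a process in context 'observer' see a process in context 'target'? */
bool demo_can_see(demo_xid_t observer, demo_xid_t target)
{
	if (observer == DEMO_XID_SPECTATOR)
		return true;		/* the Spectator peeks at everything */
	return observer == target;	/* everyone else: own context only */
}
```

Note that the Host Context gets no special case: under this model, <code>demo_can_see(DEMO_XID_HOST, 42)</code> is false, matching the statement that the host is isolated like any other context.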

=== Network Separation ===

While the Context Separation is sufficient to isolate groups of processes, a different kind of separation, or rather a limitation, is required to confine processes to a subset of available network addresses.

Several issues have to be considered when doing so; for example, bindings to special addresses like INADDR_ANY or the local host address have to be handled in a very special way.

Currently, Linux-VServer doesn't make use of virtual network devices (and maybe never will) to minimize the resulting overhead. Therefore socket binding and packet transmission have been adjusted.

=== The Chroot Barrier ===

One major problem of the chroot() mechanism used in Linux is that the root directory information is volatile: it is simply replaced on the next chroot() Syscall.

One simple method to escape from a chroot-ed environment is as follows: First, create or open a file and retain the file-descriptor, then chroot into a subdirectory at equal or lower level with regards to the file. This causes the root to be moved down in the filesystem. Next, use fchdir() on the file-descriptor to escape from that new root. This will consequently escape from the old root as well, as this was lost in the last chroot() Syscall.

While early Linux-VServer versions tried to fix this by "funny" methods, recent versions use a special marking, known as the Chroot Barrier, on the parent directory of each VPS to prevent unauthorized modification and escape from confinement.

=== Upper Bound for Caps ===

Because the current Linux Capability system does not implement the filesystem related portions of POSIX Capabilities which would make setuid and setgid executables secure, and because it is much safer to have a secure upper bound for all processes within a context, an additional per-context capability mask has been added to limit all processes belonging to that context to this mask.

The meaning of the individual caps (bits) of the capability bound mask is exactly the same as with the permitted capability set.
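The effect of such a bound can be pictured as a bitwise AND between a process's capability set and the per-context mask. The following is a minimal sketch, not the kernel code: the helper name <code>vx_bounded_caps</code> and the 64-bit width are assumptions made for illustration, while the capability numbers are those from <code>linux/capability.h</code>.

```c
#include <stdint.h>

/* Capability numbers as defined in <linux/capability.h>. */
#define CAP_CHOWN      0
#define CAP_NET_RAW   13
#define CAP_SYS_ADMIN 21

#define CAP_BIT(n) (UINT64_C(1) << (n))

/*
 * Hypothetical helper: whatever a process is permitted to do, the
 * context bound removes every capability that is not in the mask.
 */
static uint64_t vx_bounded_caps(uint64_t permitted, uint64_t bcaps)
{
    return permitted & bcaps;
}
```

With a bound that contains only CAP_CHOWN, a process that holds CAP_SYS_ADMIN in its permitted set still ends up without it.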

=== Resource Isolation ===

Most resources are somewhat shared among the different contexts. Some require more additional isolation than others, either to avoid security issues or to allow for improved accounting.

Those resources are:

* shared memory, IPC
* user and process IDs
* file xid tagging
* Unix ptys
* sockets

=== Filesystem XID Tagging ===

Although it can be disabled completely, this modification is required for more robust filesystem level security and context isolation. It is also mandatory for Context Disk Limits and Per Context Quota Support on a shared partition.

The concept of adding a context id (xid) to each file to make the context ownership persistent sounds simple, but the actual implementation is non-trivial - mainly because adding this information either requires a change to the on disk representation of the filesystem or the application of some tricks.

One non-intrusive approach to avoid modification of the underlying filesystem is to use the upper bits of existing fields, like those for UID and GID to store the additional XID.

Once context information is available for each inode, it is a logical step to extend the access controls to check against context too.
Currently all inode access restrictions have been extended to check for the context id, with special exceptions for the Host Context and the Spectator Context.

Untagged files belong to the Host Context and are silently treated as if they belong to the current context, which is required for Unification. If such a file is modified from inside a context, it silently migrates to the new one, changing its xid.

The following Tagging Methods are implemented:
{| class="wikitablenowrap"
! UID32/GID32 or EXTERNAL
| This format uses currently unused space within the disk inode to store the context information. As of now, this is only defined for ext2/ext3 but will be also defined for xfs, reiserfs, and jfs as soon as possible. Advantage: Full 32bit uid/gid values.
|-
! UID32/GID16
| This format uses the upper half of the group id to store the context information. This is done transparently, except if the format is changed without prior file conversion. Advantage: works on all 32bit U/GID FSs. Drawback: GID is reduced to 16 bits.
|-
! UID24/GID24
| This format uses the upper quarter of user and group id to store the context information, again transparently. This allows for about 16 million user and group ids, which should suffice for the majority of all applications. Advantage: works on all 32bit U/GID FSs. Drawback: UID and GID are reduced to 24 bits.
|}
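As an illustration of the UID24/GID24 idea, the sketch below packs a 16 bit xid into the upper 8 bits of the uid and gid fields and recovers it again. The type and function names are hypothetical, and which half of the xid goes into which field is an assumption; only the general scheme (upper quarter of each id holds context information) comes from the description above.

```c
#include <stdint.h>

#define ID24_MASK UINT32_C(0x00FFFFFF)

/* uid/gid pair as it would be stored in the inode (hypothetical type). */
struct disk_ids {
    uint32_t uid;
    uint32_t gid;
};

/*
 * Keep the low 24 bits of uid and gid; store the high byte of the xid
 * in the top byte of uid and the low byte in the top byte of gid.
 */
static struct disk_ids tag_uid24_gid24(uint32_t uid, uint32_t gid, uint16_t xid)
{
    struct disk_ids d;

    d.uid = (uid & ID24_MASK) | ((uint32_t)(xid & 0xFF00) << 16);
    d.gid = (gid & ID24_MASK) | ((uint32_t)(xid & 0x00FF) << 24);
    return d;
}

/* Reassemble the xid from the top bytes of both fields. */
static uint16_t xid_from_disk(struct disk_ids d)
{
    return (uint16_t)(((d.uid >> 16) & 0xFF00) | ((d.gid >> 24) & 0x00FF));
}

/* The visible ids are the low 24 bits. */
static uint32_t uid_from_disk(struct disk_ids d) { return d.uid & ID24_MASK; }
static uint32_t gid_from_disk(struct disk_ids d) { return d.gid & ID24_MASK; }
```

The round trip is lossless as long as uid and gid fit into 24 bits, which is the drawback named in the table.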

== Additional Modifications ==

In addition to the bare minimum, there are a number of modifications that are not mandatory, but have proven extremely useful over time.

=== Context Flags ===

It was very soon discovered that some features require a flag, a kind of switch to turn them on and off separately for each Linux-VServer, so a simple flag-word was added.

This flag-word supports quite a number of flags, a flag-word mask that tells which flags are available, and a special trigger mechanism providing one-time flags, set on startup, that can only be cleared once, usually causing a special action or event.

Here is a list of planned and mostly implemented Context Flags, available in the development branch of Linux-VServer:

{| class="wikitablenowrap"
! [0] VXF_INFO_LOCK
| (legacy, obsoleted)
|-
! [1] VXF_INFO_SCHED
| schedule all processes in a context as if they were one. (legacy, obsoleted)
|-
! [2] VXF_INFO_NPROC
| limit the number of processes in a context to the initial NPROC value. (legacy, obsoleted)
|-
! [3] VXF_INFO_PRIVATE
| do not allow joining this context from outside. (legacy)
|-
! [4] VXF_INFO_INIT
| show the init process with pid '1' (legacy)
|-
! [5] VXF_INFO_HIDE
| (legacy, obsoleted)
|-
! [6] VXF_INFO_ULIMIT
| (legacy, obsoleted)
|-
! [7] VXF_INFO_NSPACE
| (legacy, obsoleted)
|-
! [8] VXF_SCHED_HARD
| activate the Hard CPU scheduling
|-
! [9] VXF_SCHED_PRIO
| use the context token bucket for calculating the process priorities
|-
! [10] VXF_SCHED_PAUSE
| put all processes in this context on the hold queue, not scheduling them any longer
|-
! [16] VXF_VIRT_MEM
| virtualize the memory information so that the VM and RSS limits are used for meminfo and friends
|-
! [17] VXF_VIRT_UPTIME
| virtualize the uptime, beginning with the time of context creation
|-
! [18] VXF_VIRT_CPU
|
|-
! [24] VXF_HIDE_MOUNT
| show empty /proc/{pid}/mounts
|-
! [25] VXF_HIDE_NETIF
| hide network interfaces and addresses not permitted by the network context
|}
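The bit numbers in the table map directly onto a 64 bit flag-word. The sketch below shows how such flags could be tested and updated under a mask. The helper names are hypothetical, and treating the mask as "the flags this request may touch" is an assumption for illustration rather than the documented kernel behaviour.

```c
#include <stdint.h>

/* Bit positions taken from the table of Context Flags. */
#define VXF_SCHED_HARD   (UINT64_C(1) << 8)
#define VXF_SCHED_PRIO   (UINT64_C(1) << 9)
#define VXF_VIRT_MEM     (UINT64_C(1) << 16)
#define VXF_VIRT_UPTIME  (UINT64_C(1) << 17)
#define VXF_HIDE_NETIF   (UINT64_C(1) << 25)

/* Hypothetical helper: change only the flags selected by mask. */
static uint64_t vx_update_flags(uint64_t current, uint64_t flags, uint64_t mask)
{
    return (current & ~mask) | (flags & mask);
}

/* Hypothetical helper: non-zero if the given flag is set. */
static int vx_flag_set(uint64_t flagword, uint64_t flag)
{
    return (flagword & flag) != 0;
}
```

Flags outside the mask are left untouched, so a caller can switch individual features on or off without knowing the rest of the flag-word.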

=== Context Capabilities ===

As the Linux Capabilities have almost reached the maximum number that is possible without heavy modifications to the kernel, it was a natural step to add a context-specific capability system.

The Linux-VServer context capability set acts as a mechanism to fine tune existing Linux capabilities. It is not visible to the processes within a context, as they would not know how to modify or verify it.

In general there are two ways to use those capabilities:

* Require one or a number of context capabilities to be set in addition to a given Linux capability, each one controlling a distinct part of the functionality. For example the CAP_NET_ADMIN could be split into RAW and PACKET sockets, so you could take away each of them separately by not providing the required context capability.

* Consider the context capability sufficient for a specified functionality, even if the Linux Capability says something different. For example mount() requires CAP_SYS_ADMIN which adds a dozen other things we do not want, so we define a CCAP_MOUNT to allow mounts for certain contexts.

The difference between the Context Flags and the Context Caps is more an abstract logical separation than a functional one, because they are handled very similarly.

Again, a list of the Context Capabilities and their purpose:

{| class="wikitablenowrap"
! [0] VXC_SET_UTSNAME
| allow the context to change the host and domain name with the appropriate kernel Syscall
|-
! [1] VXC_SET_RLIMIT
| allow the context to modify the resource limits (within the vserver limits).
|-
! [8] VXC_RAW_ICMP
| allow raw icmp packets in a secure way (this makes ping work from inside)
|-
! [16] VXC_SECURE_MOUNT
| permit secure mounts, which at the moment means that the nodev mount option is added.
|}
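The two ways of using context capabilities can be sketched as two small predicates. This is an illustration only: the structure, the helper names and the bit <code>VXC_EXAMPLE_RAW_SOCKET</code> are hypothetical, while the <code>VXC_SECURE_MOUNT</code> bit number comes from the table and the Linux capability numbers from <code>linux/capability.h</code>.

```c
#include <stdint.h>

/* Linux capability numbers from <linux/capability.h>. */
#define CAP_NET_ADMIN 12
#define CAP_SYS_ADMIN 21
#define CAP_BIT(n) (UINT64_C(1) << (n))

/* Context capability bit from the table of Context Capabilities. */
#define VXC_SECURE_MOUNT       (UINT64_C(1) << 16)
/* Hypothetical bit, used only to illustrate splitting CAP_NET_ADMIN. */
#define VXC_EXAMPLE_RAW_SOCKET (UINT64_C(1) << 32)

/* Hypothetical per-context state. */
struct ctx {
    uint64_t caps;   /* Linux capabilities */
    uint64_t ccaps;  /* context capabilities */
};

static int has_cap(const struct ctx *c, int cap)
{
    return (c->caps & CAP_BIT(cap)) != 0;
}

/* First way: the context capability is required in addition. */
static int ccap_required(const struct ctx *c, int cap, uint64_t ccap)
{
    return has_cap(c, cap) && (c->ccaps & ccap) != 0;
}

/* Second way: the context capability alone is sufficient. */
static int ccap_sufficient(const struct ctx *c, int cap, uint64_t ccap)
{
    return has_cap(c, cap) || (c->ccaps & ccap) != 0;
}
```

In the second form a context without CAP_SYS_ADMIN can still be granted mounts, which is the situation described for mount() above.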

=== Context Accounting ===

Some properties of a context are useful to the admin, either for keeping an overview of the resources, to get a feeling for the capacity of the host, or for billing them in some way to a customer.

There are two different kinds of accountable properties, those having a current value which represents the state of the system (for example the speed of a vehicle), and those which monotonically increase over time (like the mileage).

Most of the state type of properties also qualify for applying some limits, so they are handled specially. This is described in more detail in the following section.

Good candidates for Context Accounting are:

* Amount of CPU Time spent
* Number of Forks done
* Socket Messages by Type
* Network Packets Transmitted and Received

=== Context Limits ===

Most properties related to system resources, be it the memory consumption, the number of processes or file-handles, or the current network bandwidth, qualify for imposing limits on them.

To provide a general framework for all kinds of limits, Context Limits allow the configuration of three different values for each limit-able resource: the minimum, a soft limit and a hard limit (maximum).

At the time this is written, only the hard limits are supported and not all of them are actually enforced, but here is a list of current and planned Context Limits:

* process limits
* scheduler limits
* memory limits
* per-context disk limits
* per-context user/group quota

Additionally the context limit system keeps track of observed maxima and resource limit hits, to provide some feedback for the administrator.
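A limit record along these lines could look as follows. This is a sketch under assumptions: the structure layout and function names are hypothetical, and, matching the remark that only hard limits are supported at the moment, only the hard limit is enforced while the minimum and soft limit are merely stored.

```c
#include <stdint.h>

/* Hypothetical per-resource limit record. */
struct ctx_limit {
    int64_t  min;       /* minimum (stored, not enforced here) */
    int64_t  soft;      /* soft limit (stored, not enforced here) */
    int64_t  hard;      /* hard limit (maximum) */
    int64_t  cur;       /* current usage */
    int64_t  max_seen;  /* observed maximum */
    uint64_t hits;      /* number of times the hard limit was hit */
};

/* Returns 1 if the charge was accepted, 0 if it would exceed the hard limit. */
static int limit_charge(struct ctx_limit *l, int64_t amount)
{
    if (l->cur + amount > l->hard) {
        l->hits++;
        return 0;
    }
    l->cur += amount;
    if (l->cur > l->max_seen)
        l->max_seen = l->cur;
    return 1;
}

/* Give back previously charged resources. */
static void limit_release(struct ctx_limit *l, int64_t amount)
{
    l->cur -= amount;
}
```

The <code>max_seen</code> and <code>hits</code> fields are the kind of feedback mentioned above: they let an administrator see how close a context came to its limit and how often it ran into it.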

=== Virtualization ===

One major difference between the Linux-VServer approach and Virtual Machines is that you do not have the virtualization part as a side-effect, so you have to do that by hand where it makes sense.

For example, a Virtual Machine does not need to think about uptime, because naturally the running OS was started somewhere in the past and will not have any problem telling the time it thinks it began running.

A context can also store the time when it was created, but that will be different from the system's uptime, so in addition, there has to be some function which adjusts the values passed from kernel to user-space depending on the context the process belongs to.

This is what Linux-VServer calls Virtualization (it is really a matter of faking some values passed to and from the kernel so that processes think they are on a different machine).
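For the uptime case the adjustment amounts to a subtraction. A minimal sketch, assuming the context records the host uptime (in seconds) at the moment it was created; the structure and function names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical per-context virtualization data. */
struct ctx_virt {
    uint64_t created_at;  /* host uptime in seconds at context creation */
};

/* Uptime as seen from inside the context. */
static uint64_t ctx_uptime(const struct ctx_virt *c, uint64_t host_uptime)
{
    return host_uptime - c->created_at;
}
```

A process inside the context would be handed this value instead of the host uptime whenever it asks the kernel.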

Currently modified for the purpose of Virtualization are:

* System Uptime
* Host and Domain Name
* Machine Type and Kernel Version
* Context Memory Availability
* Context Disk Space

=== Improved Security ===

Proc-FS Security provides a mechanism to protect dynamic entries in the proc filesystem from being seen in every context.
The system consists of three flags for each Proc-FS entry: Admin, Watch and Hide.

The Hide flag enables or disables the entire feature, so any combination with the Hide flag cleared will mean total visibility.
The Admin and Watch flags determine where the hidden entry remains visible; so for example if Admin and Hide are set, the Host Context will be the only one able to see this specific entry.
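The visibility rule can be written as a small predicate. The sketch below is an interpretation, not the kernel code: the flag values and names are hypothetical, and the mapping of the Watch flag to the Spectator Context is an assumption (the text only states the Admin case explicitly).

```c
/* Hypothetical flag values for a Proc-FS entry. */
enum {
    PFS_ADMIN = 1,
    PFS_WATCH = 2,
    PFS_HIDE  = 4
};

/* Who is looking at the entry. */
enum viewer {
    VIEW_HOST,
    VIEW_SPECTATOR,
    VIEW_GUEST
};

/* Returns 1 if the entry is visible to the given viewer. */
static int proc_entry_visible(unsigned flags, enum viewer who)
{
    if (!(flags & PFS_HIDE))
        return 1;                       /* feature disabled: total visibility */
    if ((flags & PFS_ADMIN) && who == VIEW_HOST)
        return 1;                       /* hidden, but kept for the Host Context */
    if ((flags & PFS_WATCH) && who == VIEW_SPECTATOR)
        return 1;                       /* assumption: Watch keeps it for the Spectator */
    return 0;
}
```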

=== Kernel Helper ===

For some purposes, it makes sense to have a user-space tool to act on behalf of the kernel, when a process inside a context requests something usually available on a real server, but naturally not available inside a context.

The best, and currently only, example for this is the Reboot Helper, which handles the reboot() system call invoked from inside a context on behalf of the Kernel. It is executed in Host side user-space to take appropriate actions: either reboot or just shut down (halt) the specified context.

While the helper is designed to be flexible and handle different things in a similar way, there are no other users of this helper at the moment. It might be replaced by an event interface in the near future.

== Features and Bonus Material ==

=== Unification ===

Because one of the central objectives for Linux-VServer is to reduce the overall resource usage wherever possible, a truly great idea was born to share files between different contexts without interfering with the usual administrative tasks or reducing the level of security created by the isolation.

Files common to more than one context which are unlikely to change, like libraries or binaries, can be hard linked on a shared filesystem, thus reducing the amount of disk space, inode caches, and even memory mappings for shared libraries.

The only drawback is that without additional measures, a malicious context would be able to deliberately or accidentally destroy or modify such shared files, which in turn would harm the other contexts.

One step is to make the shared files immutable by using the Immutable File Attribute (and removing the Linux Capability required to modify this attribute). However an additional attribute is required to allow removal of such immutable shared files, to allow for updates of libraries or executables from inside a context.

Such hard linked, immutable but unlink-able files belonging to more than one context are called unified and the process of finding common files and preparing them in this way is called Unification.

The reason for doing this is reduced resource consumption, not simplified administration. While a typical Linux Server install will consume about 500MB of disk space, 10 unified servers will only need about 700MB and as a bonus use less memory for caching.

=== Private Namespaces ===

A recent addition to the Linux-VServer branch was the introduction of Private Namespaces. This uses the already existing Virtual Filesystem Layer of the Linux kernel to create a separate view of the filesystem for the processes belonging to a context.

The major advantage over the shared namespace used by default is that any modifications to the namespace layout (like mounts) do not affect other contexts, not even the Host Context.

Obviously the drawback of that approach is that entering such a Private Namespace isn't as trivial as changing the root directory, but with proper kernel support this will completely replace the chroot() in the future.

=== The Linux-VServer Proc-FS ===

A structured, dynamically generated subtree of the well-known Proc-FS - actually two of them - has been created to allow for inspecting the different values of Security and Network Contexts.

<pre>
/proc/virtual
.../info

/proc/virtual/<pid>
.../info
.../status
.../sched
.../cvirt
.../cacct
.../limit
</pre>

=== Token Bucket Extensions ===

While the basic idea of Linux-VServer is a peaceful coexistence of all contexts, sharing the common resources in a respectful way, it is sometimes useful to control the resource distribution for resource hungry processes.

The basic principle of a Token Bucket is not very new. It is given here as an example for the Hard CPU Limit. The same principle also applies to scheduler priorities, network bandwidth limitation and resource control in general.

The Hard CPU Limit uses this mechanism in the following way: consider a bucket of a certain size S which is filled with a specified amount of tokens R every interval T, until the bucket is "full" - excess tokens are spilled. At each timer tick, a running process consumes exactly one token from the bucket, unless the bucket is empty, in which case the process is put on a hold queue until the bucket has been refilled with a minimum M of tokens. The process is then rescheduled.
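The mechanism can be simulated in a few lines. The following is a user-space sketch and not the scheduler code: the structure and function names are hypothetical, and the order of operations within one tick (refill first, then consume) is an assumption. The fields follow the letters used above: S, R, T and M.

```c
/* Hypothetical token bucket state for one context. */
struct bucket {
    int  tokens;    /* current fill level */
    int  size;      /* S: bucket size */
    int  rate;      /* R: tokens added per interval */
    int  interval;  /* T: refill interval in ticks */
    int  min;       /* M: tokens needed to leave the hold queue */
    int  on_hold;   /* non-zero while the process is on hold */
    long ticks;     /* ticks seen so far */
};

/* Advance by one timer tick; returns 1 if the process ran, 0 if it was held. */
static int bucket_tick(struct bucket *b)
{
    b->ticks++;
    if (b->ticks % b->interval == 0) {
        b->tokens += b->rate;
        if (b->tokens > b->size)
            b->tokens = b->size;        /* excess tokens are spilled */
    }
    if (b->on_hold) {
        if (b->tokens < b->min)
            return 0;                   /* still waiting for M tokens */
        b->on_hold = 0;                 /* rescheduled */
    }
    if (b->tokens == 0) {
        b->on_hold = 1;                 /* bucket empty: put on hold */
        return 0;
    }
    b->tokens--;                        /* a running process consumes one token */
    return 1;
}

/* Run for nticks ticks and count how many of them the process ran. */
static int bucket_run(struct bucket *b, int nticks)
{
    int ran = 0;

    for (int i = 0; i < nticks; i++)
        ran += bucket_tick(b);
    return ran;
}
```

Over a long run the process gets roughly R/T of the CPU, while the initial fill level determines how long it can burst before being put on hold.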

A major advantage of a Token Bucket is that a certain amount of tokens can be accumulated in times of quiescence, which later can be used to burst when resources are required.

Where a per-process Token Bucket would allow for a CPU resource limitation of a single process, a Context Token Bucket makes it possible to control the CPU usage of all confined processes.

Another approach, which is also implemented, is to use the current fill level of the bucket to adjust the process priority, thus reducing the priority of processes belonging to excessive contexts.

=== Context Disk Limits ===

This Feature requires the use of XID Tagged Files, and allows for independent Disk Limits for different contexts on a shared partition.
The number of inodes and blocks for each filesystem is accounted, if an XID-Hash was added for the Context-Filesystem combo.

Those values, including current usage, maximum and reserved space, will be shown for filesystem queries, creating the illusion that the shared filesystem has a different usage and size, for each context.

=== Per-Context Quota ===

Similar to the Context Disk Limits, Per-Context Quota uses separate quota hashes for different Contexts on a shared filesystem. This is not required to allow for Linux-VServer quota on separate partitions.

=== The VRoot Proxy Device ===

Quota operations (ioctls) require some access to the block device, which for security reasons is not available inside a VPS.

=== Stealth ===

For some applications, for example the preparation of a honey-pot or an especially realistic imitation of a real server for educational purposes, it can make sense to make the context indistinguishable from a real server.

However, since other freely available alternatives like QEMU or UML are much better at this, and require much less effort, this is not a central issue in Linux-VServer development.

== Linux-VServer Security ==

Now that we know what the Linux-VServer framework provides and how some features work, let's have a word on security, because you should not rely on the framework to be secure by definition. Instead, you should know exactly what you are doing.

=== Secure Capabilities ===

Currently the following Linux Capabilities are considered secure for VPS use. If others are added, it will probably open some security hole.

* CAP_CHOWN
* CAP_DAC_OVERRIDE
* CAP_DAC_READ_SEARCH
* CAP_FOWNER
* CAP_FSETID
* CAP_KILL
* CAP_SETGID
* CAP_SETUID
* CAP_NET_BIND_SERVICE
* CAP_SYS_CHROOT
* CAP_SYS_PTRACE
* CAP_SYS_BOOT
* CAP_SYS_TTY_CONFIG
* CAP_LEASE

CAP_NET_RAW, for example, is not considered secure. It is often used to allow the broken ping command to work, but there are better alternatives like the userspace ping command poink[U7] or the VXC_RAW_ICMP Context Capability.

=== The Chroot Barrier ===

Ensuring that the Barrier flag is set on the parent directory of each VPS is vital if you do not want VPS root to escape from the confinement and walk your Host's root filesystem.

=== Secure Device Nodes ===

The /dev directory of a VPS should not contain more than the following devices and the one directory for the unix pts tree.

* c 1 7 full
* c 1 3 null
* c 5 2 ptmx
* c 1 8 random
* c 5 0 tty
* c 1 9 urandom
* c 1 5 zero
* d pts

Of course other device nodes like console, mem and kmem, even block and character devices can be added, but some expertise is required in order to ensure no security holes are opened.

=== Secure Proc-FS Entries ===

There has been no detailed evaluation of secure and insecure entries in the proc filesystem, but there have been some incidents where unprotected (not protected via Linux Capabilities) writable proc entries caused mayhem.

For example, /proc/sysrq-trigger is something which should not be accessible inside a VPS without a very good reason.

== Field of Application ==

The primary goal of this project is to create virtual servers sharing the same machine. A virtual server operates like a normal Linux server. It runs normal services such as telnet, mail servers, web servers, and SQL servers.

=== Administrative Separation ===

This allows a clever provider to sell something called a Virtual Private Server, which uses fewer resources than other virtualization techniques, which in turn allows more units to be placed on a single machine.

The list of providers doing so is relatively long, and so this is rightfully considered the main area of application.

=== Service Separation ===

Separating different or similar services which otherwise would interfere with each other, either because they are poorly designed or because they are simply incapable of peaceful coexistence for whatever reason, can be easily done with Linux-VServer.

But even on the old-fashioned real server machines, putting some extremely exposed or untrusted, because unknown or proprietary, services into some kind of jail can improve maintainability and security a lot.

=== Enhancing Security ===

While it can be interesting to run several virtual servers in one box, there is one concept potentially more generally useful. Imagine a physical server running a single virtual server. The goal is to isolate the main environment from any service, any network. You boot in the main environment, start very few services and then continue in the virtual server.

The service in the main environment would be:

* Unreachable from the network.
* Able to log messages from the virtual server in a secure way. The virtual server would be unable to change/erase the logs. Even a cracked virtual server would not be able to edit the log.
* Able to run intrusion detection facilities, potentially spying on the state of the virtual server without being accessible or noticed. For example, tripwire could run there and it would be impossible to circumvent its operation or trick it.

Another option is to put the firewall in a virtual server, and pull in the DMZ, containing each service in a separate VPS. On proper configuration, this setup can reduce the number of required machines drastically, without impacting performance.

=== Easy Maintenance ===

One key feature of a virtual server is the independence from the actual hardware. Most hardware issues are irrelevant for a virtual server installation.

The main server acts as a host and takes care of all the details. The virtual server is just a client and ignores all the details. As such, the client can be moved to another physical server with very few manipulations.

For example, to move the virtual server from one physical computer to another, it is sufficient to do the following:

* shut down the running server
* copy it over to the other machine
* copy the configuration
* start the virtual server on the new machine

No adjustments to user setup, password database or hardware configuration are required, as long as both machines are binary compatible.

=== Fail-over Scenarios ===

Pushing the limit a little further, replication technology could be used to keep an up-to-the-minute copy of the filesystem of a running Virtual Server. This would permit a very fast fail-over if the running server goes offline for whatever reason.

All the known methods to accomplish this, from network replication via rsync or drbd, through network devices or shared disk arrays, to distributed filesystems, can be utilized to reduce the down-time and improve overall efficiency.

=== For Testing ===

Consider a software tool or package which should be built for several versions of a specific distribution (Mandrake 8.2, 9.0, 9.1, 9.2, 10.0) or even for different distributions.

This is easily solved with Linux-VServer. Given plenty of disk space, the different distributions can be installed and running side by side, simplifying the task of switching from one to another.

Of course this can be accomplished by chroot() alone, but with Linux-VServer it's a much more realistic simulation.

== Performance and Stability ==

''(work in progress)''

=== Impact of Linux-VServer on the Host ===

seems to be 0% ...

=== Overhead inside a Context ===

seems to be less than 2% ...

=== Size of the Kernel Patch ===

Comparison of the different patches ...

{| class="wikitablenowrap"
! patch
! hunks
! +
! -
|-
| patch-2.4.24-vs1.00.diff
| 178
| 1112
| 135
|-
| patch-2.4.24-vs1.20.diff
| 216
| 2035
| 178
|-
| patch-2.4.24-vs1.26.diff
| 225
| 2118
| 180
|-
| patch-2.4.25-vs1.27.diff
| 252
| 2166
| 201
|-
| patch-2.4.26-vs1.28.diff
| 254
| 2183
| 202
|-
| patch-2.6.6-vs1.9.0.diff
| 494
| 5699
| 303
|-
| patch-2.6.6-vs1.9.1.diff
| 497
| 5878
| 307
|-
| patch-2.6.7-vs1.9.2.diff
| 618
| 6836
| 348
|-
| uml-patch-2.4.26-1.diff
| 449
| 36885
| 48
|}

== Non Intel i386 Hardware ==

Linux-VServer was designed to be mostly architecture agnostic, therefore only a small part, the syscall definition itself, is architecture specific. Nevertheless some architectures have private copies of basically architecture independent code for whatever reason, and therefore small modifications are often required.

The following architectures are supported and some of them are even tested:

* alpha
* ia32 / ia64 / xbox
* x86_64 (AMD64)
* mips / mips64
* hppa / hppa64
* ppc / ppc64
* sparc / sparc64
* s390
* uml

Adding a new architecture is relatively simple although extensive testing is required to make sure that every feature is working as expected (and of course, the hardware ;).

== Linux Kernel Intro ==

While almost all of the described features reside in the Linux Kernel, nifty Userspace Tools are required to activate and control the new functionality.

Those Userspace Tools in general communicate with the Linux Kernel via System Calls (or Syscall for short).
This chapter will give a short overview of how Linux Kernel and User Space are organized and how Syscalls, a simple method of communication between processes and kernel, work.

=== Kernel and User Space ===

In Linux and similar Operating Systems, User and Kernel Space are separated, and the address space is divided into two parts. Kernel space is where the kernel code resides, and user space is where the user programs live. Of course, a given user program can't write to kernel memory or to another program's memory area.

Unfortunately, this is also the case for kernel code. Kernel code can't write to user space either. What does this mean? Well, when a given hardware driver wants to write data bytes to a program in user memory, it can't do it directly, but rather it must use specific kernel functions instead. Also, when parameters are passed by address to a kernel function, the kernel function can not read the parameters directly. It must use other kernel functions to read each byte of the parameters.

Of course, there are some helpers which do the transfer to and from user space.

<pre>
unsigned long copy_to_user(void __user *to, const void *from, unsigned long n);
unsigned long copy_from_user(void *to, const void __user *from, unsigned long n);
</pre>

get_user() and put_user() get or put the given byte, word, or long from or to user memory. These are macros, and they rely on the type of the argument to determine the number of bytes to transfer.

=== Linux Syscalls ===

Most libc calls rely on system calls, which are the simplest kernel functions a user program can call.

These system calls are implemented in the kernel itself or in loadable kernel modules, which are little chunks of dynamically link-able kernel code.

Linux system calls are implemented through a multiplexor that is entered via a software interrupt. On i386 Linux, this interrupt is int 0x80. When the 'int 0x80' instruction is executed, control is given to the kernel (or, more accurately, to the _system_call() function), and the actual demultiplexing process occurs.

How does _system_call() work ?

First, all registers are saved and the content of the %eax register is checked against the global system calls table, which enumerates all system calls and their addresses.

This table can be accessed with the extern void *sys_call_table[] variable. A given number and memory address in this table corresponds to each system call.

System call numbers can be found in /usr/include/sys/syscall.h.

They are of the form SYS_systemcallname. If the system call is not implemented, the corresponding cell in the sys_call_table is 0, and an error is returned.

Otherwise, the system call actually exists and the corresponding entry in the table is the memory address of the system call code.
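From user space, the numbered interface can be exercised directly with the generic <code>syscall()</code> wrapper of the C library, which places the system call number and arguments where the kernel expects them and then enters the kernel using whatever mechanism the platform provides (not necessarily int 0x80 on current systems). A minimal sketch; the function name <code>raw_getpid</code> is made up for this example.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Invoke getpid by its system call number instead of going through
 * the dedicated libc wrapper.
 */
static long raw_getpid(void)
{
    return syscall(SYS_getpid);
}
```

The value returned is the same process id that <code>getpid()</code> reports, since both end up in the same entry of the system call table.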

== Kernel Side Implementation ==

While this chapter is mainly of interest to kernel developers, it might be fun to take a small peek behind the curtain to get a glimpse of how everything really works.

=== The Syscall Command Switch ===

For a long time Linux-VServer used a few different Syscalls to accomplish different aspects of the work, but very soon the number of required commands grew large, and the Syscalls started to have magic values, selecting the desired behavior.

Not too long ago, a single syscall was reserved for Linux-VServer, and while the opinion on that might differ from developer to developer, it was generally considered a good decision not to have more than one syscall.

The advantage of different Syscalls would be simpler handling of the Syscalls on different architectures; however, this hasn't been a problem so far, as the data passed to and from the kernel has strongly typed fields conforming to the C99 types.

Regardless, the availability of one system call required the creation of a multiplexor, which decides, based on some selector, what specific command is to be executed, and then passes on the remaining arguments to that command, which does the actual work.

<pre>
extern asmlinkage long
sys_vserver(uint32_t cmd, uint32_t id, void __user *data)
</pre>

The Linux-VServer syscall is passed three arguments regardless of what actual command is specified: a command (cmd), a number (id), and a user-space data-structure of yet unknown size.

To allow for some structure for debugging purposes and some kind of command versioning, the cmd is split into several fields: the lower 12 bits contain a version number, then 4 bits are reserved, and the upper 16 bits are divided into an 8 bit command and a 6 bit category, again reserving 2 bits for the future.

There are 64 Categories with up to 256 commands in each category, allowing for 4096 revisions of each command, which is far more than will ever be required.

Here is an overview of the categories already defined, and their numerical value:

<pre>
Syscall Matrix V2.6

       |VERSION|CREATE |MODIFY |MIGRATE|CONTROL|EXPERIM| |SPECIAL|SPECIAL|
       |STATS  |DESTROY|ALTER  |CHANGE |LIMIT  |TEST   | |       |       |
       |INFO   |SETUP  |       |MOVE   |       |       | |       |       |
-------+-------+-------+-------+-------+-------+-------+ +-------+-------+
SYSTEM |VERSION|VSETUP |VHOST  |       |       |       | |DEVICES|       |
HOST   |     00|     01|     02|     03|     04|     05| |     06|     07|
-------+-------+-------+-------+-------+-------+-------+ +-------+-------+
CPU    |       |VPROC  |PROCALT|PROCMIG|PROCTRL|       | |SCHED. |       |
PROCESS|     08|     09|     10|     11|     12|     13| |     14|     15|
-------+-------+-------+-------+-------+-------+-------+ +-------+-------+
MEMORY |       |       |       |       |       |       | |SWAP   |       |
       |     16|     17|     18|     19|     20|     21| |     22|     23|
-------+-------+-------+-------+-------+-------+-------+ +-------+-------+
NETWORK|       |VNET   |NETALT |NETMIG |NETCTL |       | |SERIAL |       |
       |     24|     25|     26|     27|     28|     29| |     30|     31|
-------+-------+-------+-------+-------+-------+-------+ +-------+-------+
DISK   |       |       |       |       |       |       | |INODE  |       |
VFS    |     32|     33|     34|     35|     36|     37| |     38|     39|
-------+-------+-------+-------+-------+-------+-------+ +-------+-------+
OTHER  |       |       |       |       |       |       | |VINFO  |       |
       |     40|     41|     42|     43|     44|     45| |     46|     47|
=======+=======+=======+=======+=======+=======+=======+ +=======+=======+
SPECIAL|       |       |       |       |FLAGS  |       | |       |       |
       |     48|     49|     50|     51|     52|     53| |     54|     55|
-------+-------+-------+-------+-------+-------+-------+ +-------+-------+
SPECIAL|       |       |       |       |RLIMIT |SYSCALL| |       |COMPAT |
       |     56|     57|     58|     59|     60|TEST 61| |     62|     63|
-------+-------+-------+-------+-------+-------+-------+ +-------+-------+
</pre>

The definition of those Commands is simplified by some macros, so for example the commands to get and set the Context Flags are defined like this:

<pre>
#define VCMD_get_cflags VC_CMD(FLAGS, 1, 0)
#define VCMD_set_cflags VC_CMD(FLAGS, 2, 0)

extern int vc_get_cflags(uint32_t, void __user *);
extern int vc_set_cflags(uint32_t, void __user *);
</pre>

Note that the command itself is not passed to the actual command implementation, only the id and the pointer to user-space data.
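To make the encoding and the command switch concrete, here is a self-contained user-space sketch. The exact bit positions chosen for the <code>VC_CMD</code> macro (category in bits 24 to 29, command in bits 16 to 23, version in bits 0 to 11) are an assumption consistent with the field widths described above, the category number 52 for FLAGS is read from the matrix, and <code>demo_vserver</code> together with the two stand-in handlers is hypothetical code that only mimics the shape of the real multiplexor.

```c
#include <errno.h>
#include <stdint.h>

#define VC_CAT_FLAGS 52  /* category number read from the Syscall Matrix */

/* Assumed layout: 6 bit category, 8 bit command, 12 bit version. */
#define VC_CMD(c, i, v) \
    ((((uint32_t)(VC_CAT_##c) & 0x3F) << 24) | \
     (((uint32_t)(i) & 0xFF) << 16) | \
     ((uint32_t)(v) & 0xFFF))

#define VC_CATEGORY(cmd) (((cmd) >> 24) & 0x3F)
#define VC_COMMAND(cmd)  (((cmd) >> 16) & 0xFF)
#define VC_VERSION(cmd)  ((cmd) & 0xFFF)

#define VCMD_get_cflags VC_CMD(FLAGS, 1, 0)
#define VCMD_set_cflags VC_CMD(FLAGS, 2, 0)

/* Stand-in state and handlers; the real ones live in the kernel. */
static uint64_t demo_flags;

static int vc_get_cflags(uint32_t id, void *data)
{
    (void)id;
    *(uint64_t *)data = demo_flags;
    return 0;
}

static int vc_set_cflags(uint32_t id, void *data)
{
    (void)id;
    demo_flags = *(uint64_t *)data;
    return 0;
}

/* Hypothetical multiplexor: select the handler, pass on id and data only. */
static long demo_vserver(uint32_t cmd, uint32_t id, void *data)
{
    switch (cmd) {
    case VCMD_get_cflags:
        return vc_get_cflags(id, data);
    case VCMD_set_cflags:
        return vc_set_cflags(id, data);
    default:
        return -ENOSYS;  /* unknown command or unsupported version */
    }
}
```

Because the version is part of the command word, a later revision of a command gets a different numeric value and can be dispatched to a different handler without breaking old callers.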

=== Utilized Data Structures ===

There are many different data structures used by different parts of the implementation; while only a few examples are given here, all utilized structures can be found in the source.

==== The Context Data Structure ====

The Context Data Structure consists of a few fields required to manage the contexts, and handle context destruction, as well as future hierarchical contexts.

Logically separate sections of that structure, such as those for the scheduler or the context limits, are defined in separate structures and embedded in the main one.

<pre>
struct vx_info {
        struct list_head vx_list;       /* linked list of contexts */
        xid_t vx_id;                    /* context id */
        atomic_t vx_refcount;           /* refcount */
        struct vx_info *vx_parent;      /* parent context */

        struct namespace *vx_namespace; /* private namespace */
        struct fs_struct *vx_fs;        /* private namespace fs */
        uint64_t vx_flags;              /* context flags */
        uint64_t vx_bcaps;              /* bounding caps (system) */
        uint64_t vx_ccaps;              /* context caps (vserver) */

        pid_t vx_initpid;               /* PID of fake init process */

        struct _vx_limit limit;         /* vserver limits */
        struct _vx_sched sched;         /* vserver scheduler */
        struct _vx_cvirt cvirt;         /* virtual/bias stuff */
        struct _vx_cacct cacct;         /* context accounting */

        char vx_name[65];               /* vserver name */
};
</pre>

As an example, here is the Scheduler Substructure:
<pre>
struct _vx_sched {
        spinlock_t tokens_lock;         /* lock for this structure */

        int fill_rate;                  /* Fill rate: add X tokens ... */
        int interval;                   /* Divisor: ... each Y jiffies */
        atomic_t tokens;                /* current number of tokens */
        int tokens_min;                 /* Limit: minimum for unhold */
        int tokens_max;                 /* Limit: no more than N tokens */
        uint32_t jiffies;               /* bias: integral multiple of Y */

        uint64_t ticks;                 /* token tick events */
        cpumask_t cpus_allowed;         /* cpu mask for context */
};
</pre>

The main idea behind this separation is that each substructure belongs to a logically distinct part of the implementation, which provides an init and a cleanup function for its structure. This keeps the structures easier to maintain and to read.

==== The Scheduler Command Data ====

As an example of a data structure used to control a specific part of the context from user-space, here is a scheduler command and the data structure used to set its properties:

<pre>
#define VCMD_set_sched VC_CMD(SCHED, 1, 2)

struct vcmd_set_sched_v2 {
        int32_t fill_rate;              /* Fill rate: add X tokens ... */
        int32_t interval;               /* Divisor: ... each Y jiffies */
        int32_t tokens;                 /* current number of tokens */
        int32_t tokens_min;             /* Limit: minimum for unhold */
        int32_t tokens_max;             /* Limit: no more than N tokens */
        uint64_t cpu_mask;              /* Mask: allowed cpus */
};
</pre>

==== Example Accounting: Sockets ====

Nearly all of the accounting and limit code is defined as macros or inline functions capable of handling the different resources, hiding the underlying implementation wherever possible.

<pre>
#define vx_acc_sock(v,f,p,s) \
        __vx_acc_sock((v), (f), (p), (s), __FILE__, __LINE__)

static inline void __vx_acc_sock(struct vx_info *vxi,
        int family, int pos, int size, char *file, int line)
{
        if (vxi) {
                int type = vx_sock_type(family);

                atomic_inc(&vxi->cacct.sock[type][pos].count);
                atomic_add(size, &vxi->cacct.sock[type][pos].total);
        }
}

#define vx_sock_recv(sk,s) \
        vx_acc_sock((sk)->sk_vx_info, (sk)->sk_family, 0, (s))
#define vx_sock_send(sk,s) \
        vx_acc_sock((sk)->sk_vx_info, (sk)->sk_family, 1, (s))
#define vx_sock_fail(sk,s) \
        vx_acc_sock((sk)->sk_vx_info, (sk)->sk_family, 2, (s))
</pre>

This general definition is then used where appropriate, for example in the __sock_sendmsg() function:

<pre>
len = sock->ops->sendmsg(iocb, sock, msg, size);
if (sock->sk) {
        if (len == size)
                vx_sock_send(sock->sk, size);
        else
                vx_sock_fail(sock->sk, size);
}
</pre>

==== Example Limits: Virtual Memory ====

<pre>
#define vx_pages_avail(m, p, r) \
        __vx_pages_avail((m)->mm_vx_info, (r), (p), __FILE__, __LINE__)

static inline int __vx_pages_avail(struct vx_info *vxi,
        int res, int pages, char *file, int line)
{
        if (!vxi)
                return 1;
        if (vxi->limit.rlim[res] == RLIM_INFINITY)
                return 1;
        if (atomic_read(&vxi->limit.res[res]) +
                pages < vxi->limit.rlim[res])
                return 1;
        return 0;
}

#define vx_vmpages_avail(m,p)   vx_pages_avail(m, p, RLIMIT_AS)
#define vx_vmlocked_avail(m,p)  vx_pages_avail(m, p, RLIMIT_MEMLOCK)
#define vx_rsspages_avail(m,p)  vx_pages_avail(m, p, RLIMIT_RSS)
</pre>

Again, these limits are tested at certain places, for example in copy_process():

<pre>
/* check vserver memory */
if (p->mm && !(clone_flags & CLONE_VM)) {
        if (vx_vmpages_avail(p->mm, p->mm->total_vm))
                vx_pages_add(p->mm->mm_vx_info,
                        RLIMIT_AS, p->mm->total_vm);
        else
                goto bad_fork_free;
}
</pre>

==== Example Virtualization: Uptime ====

<pre>
void vx_vsi_uptime(struct timespec *uptime, struct timespec *idle)
{
        struct vx_info *vxi = current->vx_info;

        /* subtract the per-context bias; idle is not adjusted in this excerpt */
        set_normalized_timespec(uptime,
                uptime->tv_sec - vxi->cvirt.bias_tp.tv_sec,
                uptime->tv_nsec - vxi->cvirt.bias_tp.tv_nsec);
        return;
}

/* ... and at the call site: */
if (vx_flags(VXF_VIRT_UPTIME, 0))
        vx_vsi_uptime(&uptime, &idle);
</pre>

== Future Directions ==

''(work in progress)''

=== Hierarchical Contexts ===

=== Security Branch ===

=== Stealth Branch ===

Paper

2006-09-25T13:35:48Z

Meandtheshell: grammar

== Abstract ==

Linux-VServer is a soft partitioning concept based on ''Security Contexts'' that permits the creation of many independent Virtual Private Servers (VPS) that run simultaneously on a single physical server at full speed, efficiently sharing hardware resources.

A VPS provides an operating environment almost identical to that of a conventional Linux Server. All services, such as ssh, mail, Web and databases, can be started on such a VPS, without (or in special cases with only minimal) modification, just like on any real server.

Each virtual server has its own user account database and root password and is isolated from other virtual servers, except for the fact that they share the same hardware resources.

== Introduction ==

Over the years, computers have become sufficiently powerful to use virtualization to create the illusion of many smaller virtual machines, each running a separate operating system instance.

There are several kinds of Virtual Machines (VMs) which provide similar features, but differ in the degree of abstraction and the methods used for virtualization.

Most of them accomplish what they do by ''emulating'' some real or fictional hardware, which in turn requires ''real'' resources from the Host (the machine running the VMs). This approach, used by most System Emulators (like QEMU, Bochs, ...), allows the emulator to run an arbitrary Guest Operating System, even for a different Architecture (CPU and Hardware). No modifications need to be made to the Guest OS because it isn't aware of the fact that it isn't running on real hardware.

Some System Emulators require small modifications or specialized drivers to be added to Host or Guest to improve performance and minimize the overhead required for the hardware emulation. Although this significantly improves efficiency, large amounts of resources are still wasted in caches and in mediation between Guest and Host (examples of this approach are UML and Xen).

But suppose you do not want to run many different Operating Systems simultaneously on a single box? Most applications running on a server do not require hardware access or kernel level code, and could easily share a machine with others, if they could be separated and secured...

== The Concept ==

At a basic level, a Linux Server consists of three building blocks: Hardware, Kernel and Applications. The Hardware usually depends on the provider or system maintainer, and, while it has a big influence on the overall performance, it cannot be changed that easily, and will likely differ from one setup to another.

The main purpose of the Kernel is to build an abstraction layer on top of the hardware to allow processes (Applications) to work with and operate on resources (Data) without knowing the details of the underlying hardware. Ideally, those processes would be completely hardware agnostic, by being written in an interpreted language and therefore not requiring any hardware-specific knowledge.

Given that a system has enough resources to drive ten times the number of applications a single Linux server would usually require, why not put ten servers on that box, which will then share the available resources in an efficient manner?

Most server applications (e.g. httpd) assume that they are the only application providing a particular service, and usually also assume a certain filesystem layout and environment. This dictates that similar or identical services running on the same physical server, differing for example only in their addresses, have to be coordinated. This typically requires a great deal of administrative work, which can lead to reduced system stability and security.

The basic concept of the Linux-VServer solution is to separate the user-space environment into distinct units (sometimes called Virtual Private Servers) in such a way that each VPS looks and feels like a real server to the processes contained within.

Although different Linux Distributions use (sometimes heavily) patched kernels to provide special support for unusual hardware or extra functionality, most Linux Distributions are not tied to a special kernel.

Linux-VServer uses this fact to allow several distributions to run simultaneously on a single, shared kernel, without direct access to the hardware, and to share the resources in a very efficient way.

== Existing Infrastructure ==

Recent Linux Kernels already provide many security features that are utilized by Linux-VServer to do its work, in particular the Linux Capability System, Resource Limits, File Attributes and the Change Root Environment. The following sections give a short overview of each of these.

=== Linux Capability System ===

In computer science, a capability is a token used by a process to prove that it is allowed to perform an operation on an object. The Linux Capability System is based on "POSIX Capabilities", a somewhat different concept, designed to split up the all-powerful root privilege into a set of distinct privileges.

==== POSIX Capabilities ====

A process has three sets of bitmaps called the inheritable(I), permitted(P), and effective(E) capabilities. Each capability is implemented as a bit in each of these bitmaps that is either set or unset.

When a process tries to do a privileged operation, the operating system will check the appropriate bit in the effective set of the process (instead of checking whether the effective uid of the process is 0 as is normally done).

For example, when a process tries to set the clock, the Linux kernel will check that the process has the CAP_SYS_TIME bit (which is currently bit 25) set in its effective set.

The permitted set of the process indicates the capabilities the process can use. The process can have capabilities set in the permitted set that are not in the effective set; this indicates that the process has temporarily disabled those capabilities. A process is allowed to set a bit in its effective set only if it is available in the permitted set. The distinction between effective and permitted exists so that processes can "bracket" operations that need privilege.

The inheritable capabilities are the capabilities of the current process that should be inherited by a program executed by the current process. The permitted set of a process is masked against the inheritable set during exec(). Nothing special happens during fork() or clone(). Child processes and threads are given an exact copy of the capabilities of the parent process.
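
The interplay of the three sets can be sketched with plain bit operations. This is a simplified model of the rules described above, not the kernel's implementation: the structure <code>cap_sets</code> and the helper names are hypothetical, and clearing effective bits that are no longer permitted after exec() is an assumption made to keep the model consistent.

```c
#include <stdbool.h>
#include <stdint.h>

#define CAP_SYS_TIME   25                      /* bit number given in the text */
#define CAP_MASK(cap)  (UINT32_C(1) << (cap))

/* Hypothetical per-process capability state: one 32-bit bitmap per set. */
struct cap_sets {
    uint32_t inheritable;
    uint32_t permitted;
    uint32_t effective;
};

/* A privilege check consults only the effective set. */
static bool cap_is_effective(const struct cap_sets *c, int cap)
{
    return (c->effective & CAP_MASK(cap)) != 0;
}

/* Raising a capability is allowed only if it is in the permitted set. */
static bool cap_raise(struct cap_sets *c, int cap)
{
    if (!(c->permitted & CAP_MASK(cap)))
        return false;
    c->effective |= CAP_MASK(cap);
    return true;
}

/* Temporarily disable a capability; it stays in the permitted set. */
static void cap_lower(struct cap_sets *c, int cap)
{
    c->effective &= ~CAP_MASK(cap);
}

/* Simplified exec() step: permitted is masked against inheritable. */
static void cap_on_exec(struct cap_sets *c)
{
    c->permitted &= c->inheritable;
    c->effective &= c->permitted;   /* assumption: drop what is no longer permitted */
}
```

A process would "bracket" a privileged operation by calling <code>cap_raise()</code> before it and <code>cap_lower()</code> after it.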

The implementation in Linux stopped at this point, whereas POSIX Capabilities[U5] requires the addition of capability sets to files too, to replace the SUID flag (at least for executables).

==== Capability Overview ====

The list of POSIX Capabilities used with Linux is long, and the 32 available bits are almost used up. While the detailed list of all capabilities can be found in /usr/include/linux/capability.h on most Linux systems, an overview of important capabilities is given here.

{| class="wikitablenowrap"
! [0] CAP_CHOWN
| change file ownership and group.
|-
! [5] CAP_KILL
| send a signal to a process with a different real or effective user ID
|-
! [6] CAP_SETGID
| permit setgid(2), setgroups(2), and forged gids on socket credentials passing
|-
! [7] CAP_SETUID
| permit set*uid(2), and forged uids on socket credentials passing
|-
! [8] CAP_SETPCAP
| transfer/remove any capability in permitted set to/from any pid
|-
! [9] CAP_LINUX_IMMUTABLE
| allow modification of S_IMMUTABLE and S_APPEND file attributes
|-
! [11] CAP_NET_BROADCAST
| permit broadcasting and listening to multicast
|-
! [12] CAP_NET_ADMIN
| permit interface configuration, IP firewall, masquerading, accounting, socket debugging, routing tables, bind to any address, enter promiscuous mode, multicasting, ...
|-
! [13] CAP_NET_RAW
| permit usage of RAW and PACKET sockets
|-
! [16] CAP_SYS_MODULE
| insert and remove kernel modules
|-
! [18] CAP_SYS_CHROOT
| permit chroot(2)
|-
! [19] CAP_SYS_PTRACE
| permit ptrace() of any process
|-
! [21] CAP_SYS_ADMIN
| the list would be too long; it basically allows everything that is not covered by another capability.
|-
! [22] CAP_SYS_BOOT
| permit reboot(2)
|-
! [23] CAP_SYS_NICE
| allow raising priority and setting priority on other processes, modify scheduling
|-
! [24] CAP_SYS_RESOURCE
| override resource limits, quota, reserved space on fs, ...
|-
! [27] CAP_MKNOD
| permit the privileged aspects of mknod(2)
|}

=== Resource Limits ===

Resources for each process can be limited by specifying a Resource Limit. Similar to the Linux Capabilities, there are two different limits, a Soft Limit and a Hard Limit.

The soft limit is the value that the kernel enforces for the corresponding resource. The hard limit acts as a ceiling for the soft limit: an unprivileged process may only set its soft limit to a value in the range from zero up to the hard limit, and (irreversibly) lower its hard limit. A privileged process may make arbitrary changes to either limit value, as long as the soft limit does not exceed the hard limit.
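
From user-space, both values are read and written with the standard getrlimit(2) and setrlimit(2) calls. The following minimal sketch changes only the soft limit and clamps it to the hard limit; the helper name <code>set_soft_limit</code> is made up for this example.

```c
#include <sys/resource.h>

/*
 * Set the soft limit of a resource for the calling process, leaving the
 * hard limit untouched. The requested value is clamped to the hard limit,
 * since an unprivileged process may not exceed it.
 * Returns 0 on success, -1 on failure (errno is set by the system call).
 */
static int set_soft_limit(int resource, rlim_t new_soft)
{
    struct rlimit rl;

    if (getrlimit(resource, &rl) != 0)
        return -1;
    if (rl.rlim_max != RLIM_INFINITY && new_soft > rl.rlim_max)
        new_soft = rl.rlim_max;         /* the hard limit is the ceiling */
    rl.rlim_cur = new_soft;
    return setrlimit(resource, &rl);
}
```

For example, <code>set_soft_limit(RLIMIT_CORE, 0)</code> disables core files for the calling process while still allowing the soft limit to be raised again later, up to the hard limit.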

==== Limit-able Resource Overview ====

The list of all defined resource limits can be found in /usr/include/asm/resource.h on most Linux systems; an overview of relevant resource limits is given here.

{| class="wikitablenowrap"
|-
! [0] RLIMIT_CPU
| CPU time in seconds. The process is sent a SIGXCPU signal after reaching the soft limit, and SIGKILL on reaching the hard limit.
|-
! [4] RLIMIT_CORE
| maximum size of core files generated
|-
! [5] RLIMIT_RSS
| number of pages the process's resident set can consume (the number of virtual pages resident in RAM)
|-
! [6] RLIMIT_NPROC
| The maximum number of processes that can be created for the real user ID of the calling process.
|-
! [7] RLIMIT_NOFILE
| Specifies a value one greater than the maximum file descriptor number that can be opened by this process.
|-
! [8] RLIMIT_MEMLOCK
| The maximum number of virtual memory pages that may be locked into RAM using mlock() and mlockall().
|-
! [9] RLIMIT_AS
| The maximum number of virtual memory pages available to the process (address space limit).
|}

=== File Attributes ===

Originally, this feature was only available with ext2, but now all major filesystems implement a basic set of File Attributes that permit certain properties to be changed. Here again is a short overview of the possible attributes, and what they mean.

{| class="wikitablenowrap"
! s SECRM
| When a file with this attribute set is deleted, its blocks are zeroed and written back to the disk.
|-
! u UNRM
| When a file with this attribute set is deleted, its contents are saved.
|-
! c COMPR
| files marked with this attribute are automatically compressed on write and uncompressed on read. (not implemented yet)
|-
! i IMMUTABLE
| A file with this attribute cannot be modified: it cannot be deleted or renamed, no link can be created to this file and no data can be written to the file.
|-
! a APPEND
| files with this attribute set can only be opened in append mode for writing.
|-
! d NODUMP
| if this flag is set, the file is not a candidate for backup with the dump utility.
|-
! S SYNC
| updates to the file contents are done synchronously.
|-
! A NOATIME
| prevents updating the atime record on files when they are accessed or modified.
|-
! t NOTAIL
| A file with the t attribute will not have a partial block fragment at the end of the file merged with other files.
|-
! D DIRSYNC
| changes to a directory having this attribute set will be done synchronously.
|}

=== The chroot(1) Command ===

chroot allows you to run a command with a different directory acting as the root directory. This means that all filesystem lookups are done with '/' referring to the substitute root directory and not to the original one.

While the Linux chroot implementation isn't very secure, it increases the isolation of processes with regard to the filesystem and, if used properly, can create a filesystem "jail" for a single process or a restricted user, daemon or service.

== Required Modifications ==

This chapter will describe the essential Kernel modifications to implement something like Linux-VServer.

=== Context Separation ===

The separation mentioned in the section "The Concept" requires some modifications to the kernel to allow for the notion of Contexts.
The purpose of this "Context" is to hide all processes outside of its scope, and prohibit any unwanted interaction between a process inside the context and a process belonging to another context.

This separation requires the extension of some existing data structures in order for them to become aware of contexts and to differentiate between identical uids used in different virtual servers.

It also requires the definition of a default context that is used when the host system is booted, and to work around the issues resulting from some false assumptions made by some user-space tools (like pstree) that the init process has to exist and to be running under id '1'.

To simplify administration, the Host Context isn't treated any differently than any other context as far as process isolation is concerned. To allow for process overview, a special Spectator context has been defined to peek at all processes at once.

=== Network Separation ===

While the Context Separation is sufficient to isolate groups of processes, a different kind of separation, or rather a limitation, is required to confine processes to a subset of available network addresses.

Several issues have to be considered when doing so; for example, bindings to special addresses like INADDR_ANY or the local host address have to be handled in a very special way.

To minimize overhead, Linux-VServer currently doesn't make use of virtual network devices (and maybe never will). Instead, socket binding and packet transmission have been adjusted.

=== The Chroot Barrier ===

One major problem of the chroot() mechanism used in Linux is that the root directory information is volatile: it is replaced on the next chroot() Syscall.

One simple method to escape from a chroot-ed environment is as follows: First, create or open a file and retain the file-descriptor, then chroot into a subdirectory at equal or lower level with regard to the file. This causes the root to be moved down in the filesystem. Next, use fchdir() on the file-descriptor to escape from that new root. This will consequently escape from the old root as well, as this was lost in the last chroot() Syscall.

While early Linux-VServer versions tried to fix this by "funny" methods, recent versions use a special marking, known as the Chroot Barrier, on the parent directory of each VPS to prevent unauthorized modification and escape from confinement.

=== Upper Bound for Caps ===

Because the current Linux Capability system does not implement the filesystem related portions of POSIX Capabilities which would make setuid and setgid executables secure, and because it is much safer to have a secure upper bound for all processes within a context, an additional per-context capability mask has been added to limit all processes belonging to that context to this mask.

The meaning of the individual caps (bits) of the capability bound mask is exactly the same as with the permitted capability set.
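
The effect of such a bound can be sketched as a single mask operation. The function name <code>vx_bounded_caps</code> and the place where the mask would be applied are assumptions for illustration; the bit number of CAP_SYS_MODULE is taken from the capability overview.

```c
#include <stdint.h>

#define CAP_SYS_MODULE 16   /* bit number from the capability overview */

/*
 * Capabilities a process can actually make use of inside a context:
 * whatever the process holds, limited to the per-context bounding mask.
 * Even a setuid-root executable cannot exceed the mask.
 */
static uint32_t vx_bounded_caps(uint32_t task_caps, uint32_t ctx_bcaps)
{
    return task_caps & ctx_bcaps;
}
```

A context whose mask lacks the CAP_SYS_MODULE bit can therefore never load kernel modules, regardless of which user a process runs as.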

=== Resource Isolation ===

Most resources are somewhat shared among the different contexts. Some require more isolation than others, either to avoid security issues or to allow for improved accounting.

Those resources are:

* shared memory, IPC
* user and process IDs
* file xid tagging
* Unix ptys
* sockets

=== Filesystem XID Tagging ===

Although it can be disabled completely, this modification is required for more robust filesystem level security and context isolation. It also is mandatory for Context Disk Limits and Per Context Quota Support on a shared partition.

The concept of adding a context id (xid) to each file to make the context ownership persistent sounds simple, but the actual implementation is non-trivial, mainly because adding this information requires either a change to the on-disk representation of the filesystem or the application of some tricks.

One non-intrusive approach to avoid modification of the underlying filesystem is to use the upper bits of existing fields, like those for UID and GID, to store the additional XID.

Once context information is available for each inode, it is a logical step to extend the access controls to check against context too.
Currently all inode access restrictions have been extended to check for the context id, with special exceptions for the Host Context and the Spectator Context.

Untagged files belong to the Host Context and are silently treated as if they belong to the current context, which is required for Unification. If such a file is modified from inside a context, it silently migrates to the new one, changing its xid.

The following Tagging Methods are implemented:
{| class="wikitablenowrap"
! UID32/GID32 or EXTERNAL
| This format uses currently unused space within the disk inode to store the context information. As of now, this is only defined for ext2/ext3 but will be also defined for xfs, reiserfs, and jfs as soon as possible. Advantage: Full 32bit uid/gid values.
|-
! UID32/GID16
| This format uses the upper half of the group id to store the context information. This is done transparently, except if the format is changed without prior file conversion. Advantage: works on all 32bit U/GID FSs. Drawback: GID is reduced to 16 bits.
|-
! UID24/GID24
| This format uses the upper quarter of user and group id to store the context information, again transparently. This allows for about 16 million user and group ids, which should suffice for the majority of all applications. Advantage: works on all 32bit U/GID FSs. Drawback: UID and GID are reduced to 24 bits.
|}
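
The two bit-stealing formats can be sketched as follows. The function names and the <code>disk_ids</code> structure are hypothetical, and for UID24/GID24 the choice of which stolen byte holds the high and which the low half of the xid is an assumption; the table only states that the upper quarter of each id is used.

```c
#include <stdint.h>

/* The uid/gid pair as it would be stored in the on-disk inode. */
struct disk_ids {
    uint32_t uid;
    uint32_t gid;
};

/* UID32/GID16: the upper half of the stored gid carries the xid. */
static struct disk_ids tag_gid16(uint32_t uid, uint32_t gid, uint16_t xid)
{
    struct disk_ids d = { uid, (gid & 0xFFFF) | ((uint32_t)xid << 16) };
    return d;
}

static uint16_t xid_from_gid16(struct disk_ids d)
{
    return (uint16_t)(d.gid >> 16);
}

/*
 * UID24/GID24: the upper quarter (8 bits) of uid and gid each carry one
 * byte of the xid. Assumed split: high byte in uid, low byte in gid.
 */
static struct disk_ids tag_id24(uint32_t uid, uint32_t gid, uint16_t xid)
{
    struct disk_ids d = {
        (uid & 0xFFFFFF) | ((uint32_t)(xid >> 8) << 24),
        (gid & 0xFFFFFF) | ((uint32_t)(xid & 0xFF) << 24),
    };
    return d;
}

static uint16_t xid_from_id24(struct disk_ids d)
{
    return (uint16_t)(((d.uid >> 24) << 8) | (d.gid >> 24));
}
```

When reading an inode, the kernel would extract the xid and mask the stolen bits off again before handing uid and gid to the rest of the system, which is why the change is transparent to applications.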

== Additional Modifications ==

In addition to the bare minimum, there are a number of modifications that are not mandatory, but have proven extremely useful over time.

=== Context Flags ===

It was very soon discovered that some features require a flag, a kind of switch to turn them on and off separately for each Linux-VServer, so a simple flag-word was added.

This flag-word supports quite a number of flags, a flag-word mask that indicates which flags are available, and a special trigger mechanism. The trigger mechanism provides one-time flags that are set on startup and can only be cleared once, usually causing a special action or event.
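
One common way to combine a flag-word with a mask is to let the mask select which bits an update may touch. The sketch below illustrates that idea only; the function names are hypothetical and the actual semantics of the Linux-VServer mask may differ. The bit numbers are those given in the list of Context Flags in this section.

```c
#include <stdint.h>

#define VXF_SCHED_HARD   (UINT64_C(1) << 8)
#define VXF_VIRT_UPTIME  (UINT64_C(1) << 17)

/* Update only the flags selected by mask; all other bits stay as they are. */
static uint64_t vx_update_flags(uint64_t flags, uint64_t new_flags,
                                uint64_t mask)
{
    return (flags & ~mask) | (new_flags & mask);
}

/* Test whether a given flag is currently set. */
static int vx_flag_is_set(uint64_t flags, uint64_t flag)
{
    return (flags & flag) != 0;
}
```

With this scheme a caller can switch a single feature on or off without first reading the whole flag-word and without disturbing flags it does not know about.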

Here is a list of planned and mostly implemented Context Flags, available in the development branch of Linux-VServer:

{| class="wikitablenowrap"
! [0] VXF_INFO_LOCK
| (legacy, obsoleted)
|-
! [1] VXF_INFO_SCHED
| schedule all processes in a context as if they were one. (legacy, obsoleted)
|-
! [2] VXF_INFO_NPROC
| limit the number of processes in a context to the initial NPROC value. (legacy, obsoleted)
|-
! [3] VXF_INFO_PRIVATE
| do not allow joining this context from outside. (legacy)
|-
! [4] VXF_INFO_INIT
| show the init process with pid '1' (legacy)
|-
! [5] VXF_INFO_HIDE
| (legacy, obsoleted)
|-
! [6] VXF_INFO_ULIMIT
| (legacy, obsoleted)
|-
! [7] VXF_INFO_NSPACE
| (legacy, obsoleted)
|-
! [8] VXF_SCHED_HARD
| activate the Hard CPU scheduling
|-
! [9] VXF_SCHED_PRIO
| use the context token bucket for calculating the process priorities
|-
! [10] VXF_SCHED_PAUSE
| put all processes in this context on the hold queue, not scheduling them any longer
|-
! [16] VXF_VIRT_MEM
| virtualize the memory information so that the VM and RSS limits are used for meminfo and friends
|-
! [17] VXF_VIRT_UPTIME
| virtualize the uptime, beginning with the time of context creation
|-
! [18] VXF_VIRT_CPU
|
|-
! [24] VXF_HIDE_MOUNT
| show empty proc/{pid}/mounts
|-
! [25] VXF_HIDE_NETIF
| hide network interfaces and addresses not permitted by the network context
|}

=== Context Capabilities ===

As the Linux Capabilities have almost reached the maximum number that is possible without heavy modifications to the kernel, it was a natural step to add a context-specific capability system.

The Linux-VServer context capability set acts as a mechanism to fine tune existing Linux capabilities. It is not visible to the processes within a context, as they would not know how to modify or verify it.

In general there are two ways to use those capabilities:

* Require one or a number of context capabilities to be set in addition to a given Linux capability, each one controlling a distinct part of the functionality. For example, CAP_NET_RAW could be split into RAW and PACKET sockets, so you could take away each of them separately by not providing the required context capability.

* Consider the context capability sufficient for a specified functionality, even if the Linux Capability says something different. For example, mount() requires CAP_SYS_ADMIN, which adds a dozen other things we do not want, so we define a CCAP_MOUNT to allow mounts for certain contexts.

The difference between the Context Flags and the Context Caps is more an abstract logical separation than a functional one, because they are handled very similarly.
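
The two ways of using context capabilities amount to an AND and an OR of two bitmap tests. The sketch below is a simplified model: the structure <code>ctx_task</code> and both helper names are hypothetical, while the bit numbers come from the capability tables in this document.

```c
#include <stdbool.h>
#include <stdint.h>

#define CAP_NET_RAW       13   /* Linux capability bits */
#define CAP_SYS_ADMIN     21
#define VXC_RAW_ICMP       8   /* context capability bits */
#define VXC_SECURE_MOUNT  16

#define CTX_BIT(n) (UINT64_C(1) << (n))

/* Hypothetical view of a process: its Linux caps and its context's caps. */
struct ctx_task {
    uint32_t cap_effective;
    uint64_t ccaps;
};

/* First way: the context capability is required in addition to the Linux one. */
static bool may_with_both(const struct ctx_task *t, int cap, int ccap)
{
    return (t->cap_effective & CTX_BIT(cap)) && (t->ccaps & CTX_BIT(ccap));
}

/* Second way: the context capability alone is considered sufficient. */
static bool may_with_either(const struct ctx_task *t, int cap, int ccap)
{
    return (t->cap_effective & CTX_BIT(cap)) || (t->ccaps & CTX_BIT(ccap));
}
```

The first form narrows an existing Linux capability, the second grants one specific piece of functionality without handing out the broad Linux capability that normally guards it.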

Again, a list of the Context Capabilities and their purpose:

{| class="wikitablenowrap"
! [0] VXC_SET_UTSNAME
| allow the context to change the host and domain name with the appropriate kernel Syscall
|-
! [1] VXC_SET_RLIMIT
| allow the context to modify the resource limits (within the vserver limits).
|-
! [8] VXC_RAW_ICMP
| allow raw icmp packets in a secure way (this makes ping work from inside)
|-
! [16] VXC_SECURE_MOUNT
| permit secure mounts, which at the moment means that the nodev mount option is added.
|}

=== Context Accounting ===

Some properties of a context are useful to the admin, whether for keeping an overview of the resources, for getting a feeling for the capacity of the host, or for billing them in some way to a customer.

There are two different kinds of accountable properties, those having a current value which represents the state of the system (for example the speed of a vehicle), and those which monotonically increase over time (like the mileage).

Most of the state-type properties also qualify for applying some limits, so they are handled specially. This is described in more detail in the following section.

Good candidates for Context Accounting are:

* Amount of CPU Time spent
* Number of Forks done
* Socket Messages by Type
* Network Packets Transmitted and Received

=== Context Limits ===

Most properties related to system resources, be it memory consumption, the number of processes or file-handles, or the current network bandwidth, qualify for imposing limits on them.

To provide a general framework for all kinds of limits, Context Limits allow the configuration of three different values for each limit-able resource: the minimum, a soft limit and a hard limit (maximum).

At the time of writing, only the hard limits are supported and not all of them are actually enforced, but here is a list of current and planned Context Limits:

* process limits
* scheduler limits
* memory limits
* per-context disk limits
* per-context user/group quota

Additionally the context limit system keeps track of observed maxima and resource limit hits, to provide some feedback for the administrator.
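
A per-resource record with the three configurable values plus the feedback counters could look like the sketch below. All names are hypothetical; only the hard limit is enforced, matching the state described above, and a request that reaches the hard limit exactly is accepted here, which is a choice made for this example.

```c
#include <stdbool.h>

#define CTX_LIMIT_INFINITY (~0UL)

/* Hypothetical limit record for one resource of one context. */
struct ctx_limit {
    unsigned long min;       /* configured minimum (not enforced here) */
    unsigned long soft;      /* configured soft limit (not enforced here) */
    unsigned long hard;      /* configured hard limit (maximum) */
    unsigned long cur;       /* current usage */
    unsigned long max_seen;  /* observed maximum, feedback for the admin */
    unsigned long hits;      /* how often the hard limit was hit */
};

/* Try to account for amount more units; refuse if the hard limit is exceeded. */
static bool ctx_limit_charge(struct ctx_limit *l, unsigned long amount)
{
    if (l->hard != CTX_LIMIT_INFINITY &&
        (amount > l->hard || l->cur > l->hard - amount)) {
        l->hits++;
        return false;
    }
    l->cur += amount;
    if (l->cur > l->max_seen)
        l->max_seen = l->cur;
    return true;
}

/* Release previously charged units. */
static void ctx_limit_uncharge(struct ctx_limit *l, unsigned long amount)
{
    l->cur = amount > l->cur ? 0 : l->cur - amount;
}
```

The <code>max_seen</code> and <code>hits</code> counters let an administrator judge whether a limit is set too tightly or is never approached at all.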

=== Virtualization ===

One major difference between the Linux-VServer approach and Virtual Machines is that you do not have the virtualization part as a side-effect, so you have to do that by hand where it makes sense.

For example, a Virtual Machine does not need to think about uptime, because the running OS was naturally started at some point in the past and has no problem telling when it thinks it began running.

A context can also store the time when it was created, but that will differ from the system's uptime, so in addition there has to be some function that adjusts the values passed from kernel to user-space depending on the context the process belongs to.

This is what Linux-VServer calls Virtualization (it is really more a matter of faking some values passed to and from the kernel to make the processes think that they are on a different machine).

Currently modified for the purpose of Virtualization are:

* System Uptime
* Host and Domain Name
* Machine Type and Kernel Version
* Context Memory Availability
* Context Disk Space

=== Improved Security ===

Proc-FS Security provides a mechanism to protect dynamic entries in the proc filesystem from being seen in every context.
The system consists of three flags for each Proc-FS entry: Admin, Watch and Hide.

The Hide flag enables or disables the entire feature, so any combination with the Hide flag cleared will mean total visibility.
The Admin and Watch flags determine where the hidden entry remains visible; so for example if Admin and Hide are set, the Host Context will be the only one able to see this specific entry.
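
The visibility rule can be sketched as a small decision function. The names are hypothetical, and the mapping of the Watch flag to the Spectator Context is an assumption; the text only states explicitly what Hide does and what Admin combined with Hide does.

```c
#include <stdbool.h>

/* Hypothetical encoding of the three per-entry flags. */
enum {
    PFS_ADMIN = 1 << 0,
    PFS_WATCH = 1 << 1,
    PFS_HIDE  = 1 << 2
};

/* The kind of context that is looking at the Proc-FS entry. */
enum ctx_kind { CTX_HOST, CTX_SPECTATOR, CTX_GUEST };

static bool procfs_entry_visible(unsigned flags, enum ctx_kind viewer)
{
    if (!(flags & PFS_HIDE))
        return true;                /* feature disabled: total visibility */
    if ((flags & PFS_ADMIN) && viewer == CTX_HOST)
        return true;                /* hidden, but visible to the Host Context */
    if ((flags & PFS_WATCH) && viewer == CTX_SPECTATOR)
        return true;                /* assumption: Watch exposes it to the Spectator */
    return false;
}
```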

=== Kernel Helper ===

For some purposes, it makes sense to have a user-space tool act on behalf of the kernel when a process inside a context requests something usually available on a real server, but naturally not available inside a context.

The best, and currently only, example of this is the Reboot Helper, which handles the reboot() system call invoked from inside a context on behalf of the Kernel. It is executed in Host-side user-space to take appropriate action: either reboot or just shut down (halt) the specified context.

While the helper is designed to be flexible and handle different things in a similar way, there are no other users of this helper at the moment. It might be replaced by an event interface in the near future.

== Features and Bonus Material ==

=== Unification ===

Because one of the central objectives for Linux-VServer is to reduce the overall resource usage wherever possible, a truly great idea was born to share files between different contexts without interfering with the usual administrative tasks or reducing the level of security created by the isolation.

Files common to more than one context that are unlikely to change, like libraries or binaries, can be hard linked on a shared filesystem, thus reducing the amount of disk space, inode caches, and even memory mappings for shared libraries.

The only drawback is that, without additional measures, a context would be able to destroy or modify such shared files, deliberately or accidentally, which in turn would harm the other contexts.

One step is to make the shared files immutable by using the Immutable File Attribute (and removing the Linux Capability required to modify this attribute). However an additional attribute is required to allow removal of such immutable shared files, to allow for updates of libraries or executables from inside a context.

Such hard linked, immutable but unlink-able files belonging to more than one context are called unified and the process of finding common files and preparing them in this way is called Unification.
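
The hard-linking half of this process can be sketched in a few lines of C. This is only an illustration, not the code used by the Linux-VServer tools: <code>ex_unify()</code> and <code>ex_same_inode()</code> are hypothetical names, the two files are assumed to live on the same filesystem and to be already known identical, and setting the immutable and unlink attributes (which needs extra privileges) is left out.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Replace `dup` with a hard link to `master` (hypothetical helper).
 * Returns 0 on success, -1 on error. */
static int ex_unify(const char *master, const char *dup)
{
    char tmp[4096];
    int n;

    assert(master != NULL && dup != NULL);
    n = snprintf(tmp, sizeof tmp, "%s.ex-unify", dup);
    if (n < 0 || (size_t)n >= sizeof tmp)
        return -1;
    /* Link under a temporary name first, so `dup` never disappears. */
    if (link(master, tmp) != 0)
        return -1;
    /* Atomically put the new link in place of the duplicate. */
    if (rename(tmp, dup) != 0) {
        unlink(tmp);
        return -1;
    }
    /* rename() does nothing if both names already share an inode;
     * in that case the temporary name still exists, so drop it. */
    unlink(tmp);
    return 0;
}

/* Return 1 if both paths refer to the same inode, 0 if not, -1 on error. */
static int ex_same_inode(const char *a, const char *b)
{
    struct stat sa, sb;

    if (stat(a, &sa) != 0 || stat(b, &sb) != 0)
        return -1;
    return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
}
```

After a successful <code>ex_unify()</code> both names share one inode, so the data is stored (and cached) only once.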

The reason for doing this is reduced resource consumption, not simplified administration. While a typical Linux Server install will consume about 500MB of disk space, 10 unified servers will only need about 700MB and as a bonus use less memory for caching.

=== Private Namespaces ===

A recent addition to the Linux-VServer branch was the introduction of Private Namespaces. This uses the already existing Virtual Filesystem Layer of the Linux kernel to create a separate view of the filesystem for the processes belonging to a context.

The major advantage over the shared namespace used by default is that any modifications to the namespace layout (like mounts) do not affect other contexts, not even the Host Context.

Obviously the drawback of that approach is that entering such a Private Namespace isn't as trivial as changing the root directory, but with proper kernel support this will completely replace the chroot() in the future.

=== The Linux-VServer Proc-FS ===

A structured, dynamically generated subtree of the well-known Proc-FS - actually two of them - has been created to allow for inspecting the different values of Security and Network Contexts.

<pre>
/proc/virtual
.../info

/proc/virtual/<pid>
.../info
.../status
.../sched
.../cvirt
.../cacct
.../limit
</pre>

=== Token Bucket Extensions ===

While the basic idea of Linux-VServer is a peaceful coexistence of all contexts, sharing the common resources in a respectful way, it is sometimes useful to control the resource distribution for resource hungry processes.

The basic principle of a Token Bucket is not very new. It is given here as an example for the Hard CPU Limit. The same principle also applies to scheduler priorities, network bandwidth limitation and resource control in general.

The Hard CPU Limit uses this mechanism in the following way: consider a bucket of a certain size S which is filled with a specified amount of tokens R every interval T, until the bucket is "full" - excess tokens are spilled. At each timer tick, a running process consumes exactly one token from the bucket, unless the bucket is empty, in which case the process is put on a hold queue until the bucket has been refilled with a minimum M of tokens. The process is then rescheduled.

A major advantage of a Token Bucket is that a certain amount of tokens can be accumulated in times of quiescence, which later can be used to burst when resources are required.
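
The mechanism can be modelled in user space with a small C sketch. It is not the kernel implementation: the <code>ex_</code> names are hypothetical, and the order of refill and consumption within one tick is an assumption made for the illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space model of the bucket described above. */
struct ex_bucket {
    int tokens;      /* current fill level */
    int size;        /* S: bucket capacity */
    int fill_rate;   /* R: tokens added per interval */
    int interval;    /* T: ticks between refills */
    int tokens_min;  /* M: fill level needed to leave the hold queue */
    int since_fill;  /* ticks elapsed since the last refill */
    bool on_hold;    /* true while the context sits on the hold queue */
};

/* Advance the model by one timer tick.
 * Returns true if the context was allowed to run during this tick. */
static bool ex_bucket_tick(struct ex_bucket *b)
{
    assert(b->interval > 0);

    /* Refill R tokens every T ticks; excess tokens are spilled. */
    if (++b->since_fill >= b->interval) {
        b->since_fill = 0;
        b->tokens += b->fill_rate;
        if (b->tokens > b->size)
            b->tokens = b->size;
    }

    /* A held context is released once M tokens have accumulated. */
    if (b->on_hold && b->tokens >= b->tokens_min)
        b->on_hold = false;
    if (b->on_hold)
        return false;

    if (b->tokens == 0) {   /* bucket empty: put the context on hold */
        b->on_hold = true;
        return false;
    }

    b->tokens--;            /* running costs one token per tick */
    return true;
}

/* Count how many of the next `ticks` ticks the context may run. */
static int ex_bucket_run(struct ex_bucket *b, int ticks)
{
    int ran = 0;

    for (int i = 0; i < ticks; i++)
        if (ex_bucket_tick(b))
            ran++;
    return ran;
}
```

Starting with a full bucket, the context first bursts through the saved tokens and then settles at roughly R/T of the available ticks.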

Where a per-process Token Bucket would limit the CPU usage of a single process, a Context Token Bucket makes it possible to control the CPU usage of all confined processes.

Another approach, which is also implemented, is to use the current fill level of the bucket to adjust the process priority, thus reducing the priority of processes belonging to excessive contexts.

=== Context Disk Limits ===

This Feature requires the use of XID Tagged Files, and allows for independent Disk Limits for different contexts on a shared partition.
The number of inodes and blocks for each filesystem is accounted, if an XID-Hash was added for the Context-Filesystem combo.

Those values, including current usage, maximum and reserved space, will be shown for filesystem queries, creating the illusion that the shared filesystem has a different usage and size, for each context.

=== Per-Context Quota ===

Similar to the Context Disk Limits, Per-Context Quota uses separate quota hashes for different Contexts on a shared filesystem. This is not required to allow for Linux-VServer quota on separate partitions.

=== The VRoot Proxy Device ===

Quota operations (ioctls) require some access to the block device, which for security reasons is not available inside a VPS.

=== Stealth ===

For some applications, for example the preparation of a honey-pot or an especially realistic imitation of a real server for educational purposes, it can make sense to make the context indistinguishable from a real server.

However, since other freely available alternatives like QEMU or UML are much better at this, and require much less effort, this is not a central issue in Linux-VServer development.

== Linux-VServer Security ==

Now that we know what the Linux-VServer framework provides and how some features work, let's have a word on security, because you should not rely on the framework to be secure by definition. Instead, you should know exactly what you are doing.

=== Secure Capabilities ===

Currently the following Linux Capabilities are considered secure for VPS use. If others are added, it will probably open some security hole.

* CAP_CHOWN
* CAP_DAC_OVERRIDE
* CAP_DAC_READ_SEARCH
* CAP_FOWNER
* CAP_FSETID
* CAP_KILL
* CAP_SETGID
* CAP_SETUID
* CAP_NET_BIND_SERVICE
* CAP_SYS_CHROOT
* CAP_SYS_PTRACE
* CAP_SYS_BOOT
* CAP_SYS_TTY_CONFIG
* CAP_LEASE

CAP_NET_RAW, for example, is not considered secure, even though it is often used to allow the broken ping command to work; there are better alternatives, like the userspace ping command poink[U7] or the VXC_RAW_ICMP Context Capability.
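
As an illustration, such a whitelist can be expressed as a bit mask built from the constants in <code>&lt;linux/capability.h&gt;</code>. The names <code>EX_CAP_BIT</code>, <code>ex_secure_caps</code> and <code>ex_cap_is_secure()</code> are hypothetical; how the Linux-VServer tools actually store and apply the set is not shown here.

```c
#include <assert.h>
#include <stdint.h>
#include <linux/capability.h>

/* One bit per capability number (hypothetical helper macro). */
#define EX_CAP_BIT(c) ((uint64_t)1 << (c))

/* The capabilities listed above as secure for VPS use. */
static const uint64_t ex_secure_caps =
    EX_CAP_BIT(CAP_CHOWN)            | EX_CAP_BIT(CAP_DAC_OVERRIDE) |
    EX_CAP_BIT(CAP_DAC_READ_SEARCH)  | EX_CAP_BIT(CAP_FOWNER)       |
    EX_CAP_BIT(CAP_FSETID)           | EX_CAP_BIT(CAP_KILL)         |
    EX_CAP_BIT(CAP_SETGID)           | EX_CAP_BIT(CAP_SETUID)       |
    EX_CAP_BIT(CAP_NET_BIND_SERVICE) | EX_CAP_BIT(CAP_SYS_CHROOT)   |
    EX_CAP_BIT(CAP_SYS_PTRACE)       | EX_CAP_BIT(CAP_SYS_BOOT)     |
    EX_CAP_BIT(CAP_SYS_TTY_CONFIG)   | EX_CAP_BIT(CAP_LEASE);

/* Return 1 if capability number `cap` is on the list, 0 otherwise. */
static int ex_cap_is_secure(int cap)
{
    assert(cap >= 0 && cap < 64);
    return (int)((ex_secure_caps >> cap) & 1);
}
```

A check of this kind makes it easy to spot a guest configuration that grants something outside the list, such as CAP_NET_RAW.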

=== The Chroot Barrier ===

Ensuring that the Barrier flag is set on the parent directory of each VPS is vital if you do not want VPS root to escape from the confinement and walk your Host's root filesystem.

=== Secure Device Nodes ===

The /dev directory of a VPS should not contain more than the following devices and the one directory for the unix pts tree.

* c 1 7 full
* c 1 3 null
* c 5 2 ptmx
* c 1 8 random
* c 5 0 tty
* c 1 9 urandom
* c 1 5 zero
* d pts

Of course other device nodes like console, mem and kmem, even block and character devices can be added, but some expertise is required in order to ensure no security holes are opened.

=== Secure Proc-FS Entries ===

There has been no detailed evaluation of secure and insecure entries in the proc filesystem, but there have been some incidents where unprotected (not protected via Linux Capabilities) writable proc entries caused mayhem.

For example, /proc/sysrq-trigger is something which should not be accessible inside a VPS without a very good reason.

== Field of Application ==

The primary goal of this project is to create virtual servers sharing the same machine. A virtual server operates like a normal Linux server. It runs normal services such as telnet, mail servers, web servers, and SQL servers.

=== Administrative Separation ===

This allows a clever provider to sell something called a Virtual Private Server, which uses fewer resources than other virtualization techniques, which in turn allows more units to be put on a single machine.

The list of providers doing so is relatively long, and so this is rightfully considered the main area of application.

=== Service Separation ===

Separating different or similar services which otherwise would interfere with each other, either because they are poorly designed or because they are simply incapable of peaceful coexistence for whatever reason, can be easily done with Linux-VServer.

But even on old-fashioned real server machines, putting extremely exposed or untrusted services (untrusted because they are unknown or proprietary) into some kind of jail can improve maintainability and security a lot.

=== Enhancing Security ===

While it can be interesting to run several virtual servers in one box, there is one concept potentially more generally useful. Imagine a physical server running a single virtual server. The goal is to isolate the main environment from any service and any network. You boot in the main environment, start very few services and then continue in the virtual server.

The service in the main environment would be:

* Unreachable from the network.
* Able to log messages from the virtual server in a secure way. The virtual server would be unable to change/erase the logs. Even a cracked virtual server would not be able to edit the log.
* Able to run intrusion detection facilities, potentially spying on the state of the virtual server without being accessible or noticed. For example, tripwire could run there and it would be impossible to circumvent its operation or trick it.

Another option is to put the firewall in a virtual server, and pull in the DMZ, containing each service in a separate VPS. On proper configuration, this setup can reduce the number of required machines drastically, without impacting performance.

=== Easy Maintenance ===

One key feature of a virtual server is the independence from the actual hardware. Most hardware issues are irrelevant for a virtual server installation.

The main server acts as a host and takes care of all the details. The virtual server is just a client and ignores all the details. As such, the client can be moved to another physical server with very few manipulations.

For example, to move the virtual server from one physical computer to another, it is sufficient to do the following:

* shut down the running server
* copy it over to the other machine
* copy the configuration
* start the virtual server on the new machine

No adjustments to user setup, password database or hardware configuration are required, as long as both machines are binary compatible.

=== Fail-over Scenarios ===

Pushing the limit a little further, replication technology could be used to keep an up-to-the-minute copy of the filesystem of a running Virtual Server. This would permit a very fast fail-over if the running server goes offline for whatever reason.

All the known methods to accomplish this, from network replication via rsync or drbd, through network devices or shared disk arrays, to distributed filesystems, can be utilized to reduce the down-time and improve overall efficiency.

=== For Testing ===

Consider a software tool or package which should be built for several versions of a specific distribution (Mandrake 8.2, 9.0, 9.1, 9.2, 10.0) or even for different distributions.

This is easily solved with Linux-VServer. Given plenty of disk space, the different distributions can be installed and running side by side, simplifying the task of switching from one to another.

Of course this can be accomplished by chroot() alone, but with Linux-VServer it's a much more realistic simulation.

== Performance and Stability ==

''(work in progress)''

=== Impact of Linux-VServer on the Host ===

seems to be 0% ...

=== Overhead inside a Context ===

seems to be less than 2% ...

=== Size of the Kernel Patch ===

Comparison of the different patches ...

{| class="wikitablenowrap"
! patch
! hunks
! lines added (+)
! lines removed (-)
|-
| patch-2.4.24-vs1.00.diff
| 178
| 1112
| 135
|-
| patch-2.4.24-vs1.20.diff
| 216
| 2035
| 178
|-
| patch-2.4.24-vs1.26.diff
| 225
| 2118
| 180
|-
| patch-2.4.25-vs1.27.diff
| 252
| 2166
| 201
|-
| patch-2.4.26-vs1.28.diff
| 254
| 2183
| 202
|-
| patch-2.6.6-vs1.9.0.diff
| 494
| 5699
| 303
|-
| patch-2.6.6-vs1.9.1.diff
| 497
| 5878
| 307
|-
| patch-2.6.7-vs1.9.2.diff
| 618
| 6836
| 348
|-
| uml-patch-2.4.26-1.diff
| 449
| 36885
| 48
|}

== Non Intel i386 Hardware ==

Linux-VServer was designed to be mostly architecture agnostic, therefore only a small part, the syscall definition itself, is architecture specific. Nevertheless some architectures have private copies of basically architecture independent code for whatever reason, and therefore small modifications are often required.

The following architectures are supported and some of them are even tested:

* alpha
* ia32 / ia64 / xbox
* x86_64 (AMD64)
* mips / mips64
* hppa / hppa64
* ppc / ppc64
* sparc / sparc64
* s390
* uml

Adding a new architecture is relatively simple although extensive testing is required to make sure that every feature is working as expected (and of course, the hardware ;).

== Linux Kernel Intro ==

While almost all of the described features reside in the Linux Kernel, nifty Userspace Tools are required to activate and control the new functionality.

Those Userspace Tools in general communicate with the Linux Kernel via System Calls (or Syscalls for short).
This chapter gives a short overview of how Linux Kernel and User Space are organized and how Syscalls, a simple method of communication between processes and kernel, work.

=== Kernel and User Space ===

In Linux and similar Operating Systems, User and Kernel Space are separated, and the address space is divided into two parts. Kernel space is where the kernel code resides, and user space is where the user programs live. Of course, a given user program can't write to kernel memory or to another program's memory area.

Unfortunately, this is also the case for kernel code. Kernel code can't write to user space either. What does this mean? Well, when a given hardware driver wants to write data bytes to a program in user memory, it can't do it directly, but rather it must use specific kernel functions instead. Also, when parameters are passed by address to a kernel function, the kernel function cannot read the parameters directly. It must use other kernel functions to read each byte of the parameters.

Of course, there are some helpers which do the transfer to and from user space.

<pre>
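/* both return the number of bytes that could NOT be copied (0 on success) */
unsigned long copy_to_user(void __user *to, const void *from, unsigned long n);
unsigned long copy_from_user(void *to, const void __user *from, unsigned long n);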
</pre>

get_user() and put_user() get or put the given byte, word, or long from or to user memory. They are macros, and rely on the type of the argument to determine the number of bytes to transfer.

=== Linux Syscalls ===

Most libc calls rely on system calls, which are the simplest kernel functions a user program can call.

These system calls are implemented in the kernel itself or in loadable kernel modules, which are little chunks of dynamically link-able kernel code.

Linux system calls are implemented through a multiplexor that is entered via a software interrupt. On i386 Linux, this interrupt is int 0x80. When the 'int 0x80' instruction is executed, control is given to the kernel (or, more accurately, to the _system_call() function), and the actual demultiplexing process occurs.

How does _system_call() work?

First, all registers are saved and the content of the %eax register is checked against the global system calls table, which enumerates all system calls and their addresses.

This table can be accessed with the extern void *sys_call_table[] variable. A given number and memory address in this table corresponds to each system call.

System call numbers can be found in /usr/include/sys/syscall.h.

They are of the form SYS_systemcallname. If the system call is not implemented, the corresponding cell in the sys_call_table is 0, and an error is returned.

Otherwise, the system call actually exists and the corresponding entry in the table is the memory address of the system call code.
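
From user space, the same numbering can be observed with the generic <code>syscall()</code> wrapper provided by the C library. The following is a minimal sketch; <code>ex_raw_getpid()</code> is a hypothetical name chosen for the example.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Ask the kernel for our PID by system call number,
 * bypassing the getpid() library wrapper. */
static long ex_raw_getpid(void)
{
    return syscall(SYS_getpid);
}
```

The value returned is the same one the usual <code>getpid()</code> call reports, because both end up in the same table entry.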

== Kernel Side Implementation ==

While this chapter is mainly of interest to kernel developers, it might be fun to take a small peek behind the curtain to get a glimpse of how everything really works.

=== The Syscall Command Switch ===

For a long time Linux-VServer used a few different Syscalls to accomplish different aspects of the work, but very soon the number of required commands grew large, and the Syscalls started to have magic values, selecting the desired behavior.

Not too long ago, a single syscall was reserved for Linux-VServer, and while the opinion on that might differ from developer to developer, it was generally considered a good decision not to have more than one syscall.

The advantage of different Syscalls would be simpler handling of the Syscalls on different architectures; however, this hasn't been a problem so far, as the data passed to and from the kernel has strongly typed fields conforming to the C99 types.

Regardless, the availability of one system call required the creation of a multiplexor, which decides, based on some selector, what specific command is to be executed, and then passes on the remaining arguments to that command, which does the actual work.

<pre>
extern asmlinkage long
sys_vserver(uint32_t cmd, uint32_t id, void __user *data)
</pre>

The Linux-VServer syscall is passed three arguments regardless of what actual command is specified: a command (cmd), a number (id), and a user-space data-structure of yet unknown size.

To allow for some structure for debugging purposes and some kind of command versioning, the cmd is split into three parts: the lower 12 bits contain a version number, then 4 bits are reserved, and the upper 16 bits are divided into an 8 bit command and a 6 bit category, again reserving 2 bits for the future.

There are 64 Categories with up to 256 commands in each category, allowing for 4096 revisions of each command, which is far more than will ever be required.
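
The layout can be written down as a handful of C macros. This is a sketch under assumptions: the <code>EX_</code> names are hypothetical, and placing the category above the command (bits 24 to 29 and 16 to 23) is an assumption that the text above does not spell out.

```c
#include <stdint.h>

/* Assumed positions inside the 32-bit cmd word:
 *   bits  0..11  version
 *   bits 12..15  reserved
 *   bits 16..23  command
 *   bits 24..29  category
 *   bits 30..31  reserved
 */
#define EX_CAT_SHIFT 24
#define EX_CMD_SHIFT 16

/* Build a cmd word from category, command and version. */
#define EX_CMD(cat, cmd, ver) \
    ((((uint32_t)(cat) & 0x3F) << EX_CAT_SHIFT) | \
     (((uint32_t)(cmd) & 0xFF) << EX_CMD_SHIFT) | \
     ((uint32_t)(ver) & 0xFFF))

/* Take a cmd word apart again. */
#define EX_CATEGORY(c) (((uint32_t)(c) >> EX_CAT_SHIFT) & 0x3F)
#define EX_COMMAND(c)  (((uint32_t)(c) >> EX_CMD_SHIFT) & 0xFF)
#define EX_VERSION(c)  ((uint32_t)(c) & 0xFFF)
```

With these definitions, category 52, command 1, version 0 packs into 0x34010000, and the three extraction macros recover the original values.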

Here is an overview of the categories already defined, and their numerical value:

<pre>
        Syscall Matrix V2.6

       |VERSION|CREATE |MODIFY |MIGRATE|CONTROL|EXPERIM| |SPECIAL|SPECIAL|
       |STATS  |DESTROY|ALTER  |CHANGE |LIMIT  |TEST   | |       |       |
       |INFO   |SETUP  |       |MOVE   |       |       | |       |       |
-------+-------+-------+-------+-------+-------+-------+ +-------+-------+
SYSTEM |VERSION|VSETUP |VHOST  |       |       |       | |DEVICES|       |
HOST   |     00|     01|     02|     03|     04|     05| |     06|     07|
-------+-------+-------+-------+-------+-------+-------+ +-------+-------+
CPU    |       |VPROC  |PROCALT|PROCMIG|PROCTRL|       | |SCHED. |       |
PROCESS|     08|     09|     10|     11|     12|     13| |     14|     15|
-------+-------+-------+-------+-------+-------+-------+ +-------+-------+
MEMORY |       |       |       |       |       |       | |SWAP   |       |
       |     16|     17|     18|     19|     20|     21| |     22|     23|
-------+-------+-------+-------+-------+-------+-------+ +-------+-------+
NETWORK|       |VNET   |NETALT |NETMIG |NETCTL |       | |SERIAL |       |
       |     24|     25|     26|     27|     28|     29| |     30|     31|
-------+-------+-------+-------+-------+-------+-------+ +-------+-------+
DISK   |       |       |       |       |       |       | |INODE  |       |
VFS    |     32|     33|     34|     35|     36|     37| |     38|     39|
-------+-------+-------+-------+-------+-------+-------+ +-------+-------+
OTHER  |       |       |       |       |       |       | |VINFO  |       |
       |     40|     41|     42|     43|     44|     45| |     46|     47|
=======+=======+=======+=======+=======+=======+=======+ +=======+=======+
SPECIAL|       |       |       |       |FLAGS  |       | |       |       |
       |     48|     49|     50|     51|     52|     53| |     54|     55|
-------+-------+-------+-------+-------+-------+-------+ +-------+-------+
SPECIAL|       |       |       |       |RLIMIT |SYSCALL| |       |COMPAT |
       |     56|     57|     58|     59|     60|TEST 61| |     62|     63|
-------+-------+-------+-------+-------+-------+-------+ +-------+-------+
</pre>

The definition of those Commands is simplified by some macros, so for example the commands to get and set the Context Flags are defined like this:

<pre>
#define VCMD_get_cflags VC_CMD(FLAGS, 1, 0)
#define VCMD_set_cflags VC_CMD(FLAGS, 2, 0)

extern int vc_get_cflags(uint32_t, void __user *);
extern int vc_set_cflags(uint32_t, void __user *);
</pre>

Note that the command itself is not passed to the actual command implementation, only the id and the pointer to user-space data.
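
The command switch itself can be pictured with a user-space model like the one below. It is a sketch, not the kernel code: all <code>ex_</code> and <code>EX_</code> names are hypothetical, the bit positions are assumed, the contexts are a plain array, and the data argument is simplified to a single 64-bit flag word.

```c
#include <errno.h>
#include <stdint.h>

/* Assumed packing: 6-bit category, 8-bit command, 12-bit version. */
#define EX_CMD(cat, cmd, ver) \
    ((((uint32_t)(cat) & 0x3F) << 24) | \
     (((uint32_t)(cmd) & 0xFF) << 16) | \
     ((uint32_t)(ver) & 0xFFF))

#define EX_CAT_FLAGS   52   /* FLAGS category from the matrix above */
#define EX_get_cflags  EX_CMD(EX_CAT_FLAGS, 1, 0)
#define EX_set_cflags  EX_CMD(EX_CAT_FLAGS, 2, 0)

#define EX_NCONTEXTS   4
static uint64_t ex_cflags[EX_NCONTEXTS];   /* pretend per-context flags */

/* The handlers only see the id and the data pointer, not the command. */
static long ex_get_cflags(uint32_t id, void *data)
{
    *(uint64_t *)data = ex_cflags[id];
    return 0;
}

static long ex_set_cflags(uint32_t id, void *data)
{
    ex_cflags[id] = *(const uint64_t *)data;
    return 0;
}

/* Multiplexor: select the handler from cmd, pass on the rest. */
static long ex_vserver(uint32_t cmd, uint32_t id, void *data)
{
    if (id >= EX_NCONTEXTS)
        return -ESRCH;          /* no such context in this model */

    switch (cmd) {
    case EX_get_cflags:
        return ex_get_cflags(id, data);
    case EX_set_cflags:
        return ex_set_cflags(id, data);
    default:
        return -ENOSYS;         /* unknown command or version */
    }
}
```

Because the version is part of the cmd word, a new revision of a command simply becomes another case label, and old callers keep working.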

=== Utilized Data Structures ===

There are many different data structures used by different parts of the implementation; while only a few examples are given here, all utilized structures can be found in the source.

==== The Context Data Structure ====

The Context Data Structure consists of a few fields required to manage the contexts, and handle context destruction, as well as future hierarchical contexts.

Logically separated sections of that structure, such as those for the scheduler or the context limits, are defined in separate structures and incorporated into the main one.

<pre>
struct vx_info {
        struct list_head vx_list;       /* linked list of contexts */
        xid_t vx_id;                    /* context id */
        atomic_t vx_refcount;           /* refcount */
        struct vx_info *vx_parent;      /* parent context */

        struct namespace *vx_namespace; /* private namespace */
        struct fs_struct *vx_fs;        /* private namespace fs */
        uint64_t vx_flags;              /* context flags */
        uint64_t vx_bcaps;              /* bounding caps (system) */
        uint64_t vx_ccaps;              /* context caps (vserver) */

        pid_t vx_initpid;               /* PID of fake init process */

        struct _vx_limit limit;         /* vserver limits */
        struct _vx_sched sched;         /* vserver scheduler */
        struct _vx_cvirt cvirt;         /* virtual/bias stuff */
        struct _vx_cacct cacct;         /* context accounting */

        char vx_name[65];               /* vserver name */
};
</pre>

Here, as an example, is the Scheduler Substructure:
<pre>
struct _vx_sched {
        spinlock_t tokens_lock;  /* lock for this structure */

        int fill_rate;           /* Fill rate: add X tokens ... */
        int interval;            /* Divisor: ... each Y jiffies */
        atomic_t tokens;         /* current number of tokens */
        int tokens_min;          /* Limit: minimum for unhold */
        int tokens_max;          /* Limit: no more than N tokens */
        uint32_t jiffies;        /* bias: integral multiple of Y */

        uint64_t ticks;          /* token tick events */
        cpumask_t cpus_allowed;  /* cpu mask for context */
};
</pre>

The main idea behind this separation is that each substructure belongs to a logically distinct part of the implementation which provides an init and cleanup function for this structure, thus simplifying maintainability and readability of those structures.

==== The Scheduler Command Data ====

As an example for the data structure used to control a specific part of the context from user-space, here is a scheduler command and the utilized data structure to set the properties:

<pre>
#define VCMD_set_sched VC_CMD(SCHED, 1, 2)

struct vcmd_set_sched_v2 {
        int32_t fill_rate;   /* Fill rate: add X tokens ... */
        int32_t interval;    /* Divisor: ... each Y jiffies */
        int32_t tokens;      /* current number of tokens */
        int32_t tokens_min;  /* Limit: minimum for unhold */
        int32_t tokens_max;  /* Limit: no more than N tokens */
        uint64_t cpu_mask;   /* Mask: allowed cpus */
};
</pre>
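
To get a feeling for these values, the long-run CPU share implied by fill_rate and interval can be worked out in a few lines. This is a sketch that assumes one token is consumed per tick, as in the Token Bucket description, and a single CPU; <code>struct ex_sched</code> and <code>ex_cpu_share_percent()</code> are hypothetical names that only mirror two of the fields above.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-in for the two fields that determine the share. */
struct ex_sched {
    int32_t fill_rate;  /* tokens added ...        */
    int32_t interval;   /* ... every this many jiffies */
};

/* Long-run share of one CPU in percent, rounded down.
 * A context cannot use more than the whole CPU, so clamp at 100. */
static int ex_cpu_share_percent(const struct ex_sched *s)
{
    int pct;

    assert(s->fill_rate >= 0 && s->interval > 0);
    pct = (int)((100 * (int64_t)s->fill_rate) / s->interval);
    return pct > 100 ? 100 : pct;
}
```

For example, a fill rate of 1 token every 4 jiffies corresponds to a sustained share of 25%, while tokens_max only decides how long a burst above that share can last.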

==== Example Accounting: Sockets ====

Basically all the accounting and limit handling is defined as macros or inline functions capable of handling the different resources, hiding the underlying implementation wherever possible.

<pre>
#define vx_acc_sock(v,f,p,s) \
        __vx_acc_sock((v), (f), (p), (s), __FILE__, __LINE__)

static inline void __vx_acc_sock(struct vx_info *vxi,
        int family, int pos, int size, char *file, int line)
{
        if (vxi) {
                int type = vx_sock_type(family);

                atomic_inc(&vxi->cacct.sock[type][pos].count);
                atomic_add(size, &vxi->cacct.sock[type][pos].total);
        }
}

#define vx_sock_recv(sk,s) \
        vx_acc_sock((sk)->sk_vx_info, (sk)->sk_family, 0, (s))
#define vx_sock_send(sk,s) \
        vx_acc_sock((sk)->sk_vx_info, (sk)->sk_family, 1, (s))
#define vx_sock_fail(sk,s) \
        vx_acc_sock((sk)->sk_vx_info, (sk)->sk_family, 2, (s))
</pre>

And this general definition is then used where appropriate, for example in the __sock_sendmsg() function like this:

<pre>
len = sock->ops->sendmsg(iocb, sock, msg, size);
if (sock->sk) {
        if (len == size)
                vx_sock_send(sock->sk, size);
        else
                vx_sock_fail(sock->sk, size);
}
</pre>

==== Example Limits: Virtual Memory ====

<pre>
#define vx_pages_avail(m, p, r) \
        __vx_pages_avail((m)->mm_vx_info, (r), (p), __FILE__, __LINE__)

static inline int __vx_pages_avail(struct vx_info *vxi,
        int res, int pages, char *file, int line)
{
        if (!vxi)
                return 1;
        if (vxi->limit.rlim[res] == RLIM_INFINITY)
                return 1;
        if (atomic_read(&vxi->limit.res[res]) +
                pages < vxi->limit.rlim[res])
                return 1;
        return 0;
}

#define vx_vmpages_avail(m,p)   vx_pages_avail(m, p, RLIMIT_AS)
#define vx_vmlocked_avail(m,p)  vx_pages_avail(m, p, RLIMIT_MEMLOCK)
#define vx_rsspages_avail(m,p)  vx_pages_avail(m, p, RLIMIT_RSS)
</pre>

Again, these limits are tested at certain places, for example here in copy_process():

<pre>
/* check vserver memory */
if (p->mm && !(clone_flags & CLONE_VM)) {
        if (vx_vmpages_avail(p->mm, p->mm->total_vm))
                vx_pages_add(p->mm->mm_vx_info,
                        RLIMIT_AS, p->mm->total_vm);
        else
                goto bad_fork_free;
}
</pre>

==== Example Virtualization: Uptime ====

<pre>
void vx_vsi_uptime(struct timespec *uptime)
{
        struct vx_info *vxi = current->vx_info;

        /* subtract the per-context bias from the host uptime */
        set_normalized_timespec(uptime,
                uptime->tv_sec - vxi->cvirt.bias_tp.tv_sec,
                uptime->tv_nsec - vxi->cvirt.bias_tp.tv_nsec);
        return;
}

/* at the call site: only adjust if uptime virtualization is enabled */
if (vx_flags(VXF_VIRT_UPTIME, 0))
        vx_vsi_uptime(&uptime);
</pre>
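
The arithmetic behind this virtualization can be tried out in user space. The sketch below is not kernel code: <code>ex_virt_uptime()</code> is a hypothetical name, both inputs are assumed to be normalized already, and the borrow step stands in for what set_normalized_timespec() does in the kernel.

```c
#include <assert.h>
#include <time.h>

#define EX_NSEC_PER_SEC 1000000000L

/* Return host uptime minus the context's bias as a normalized timespec. */
static struct timespec ex_virt_uptime(struct timespec host,
                                      struct timespec bias)
{
    struct timespec out;

    assert(host.tv_nsec >= 0 && host.tv_nsec < EX_NSEC_PER_SEC);
    assert(bias.tv_nsec >= 0 && bias.tv_nsec < EX_NSEC_PER_SEC);

    out.tv_sec = host.tv_sec - bias.tv_sec;
    out.tv_nsec = host.tv_nsec - bias.tv_nsec;
    if (out.tv_nsec < 0) {          /* borrow one second */
        out.tv_sec--;
        out.tv_nsec += EX_NSEC_PER_SEC;
    }
    return out;
}
```

A guest started 400.9 seconds after the host booted would thus see an uptime of 599.2 seconds when the host reports 1000.1 seconds.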

== Future Directions ==

''(work in progress)''

=== Hierarchical Contexts ===

=== Security Branch ===

=== Stealth Branch ===

Problematic Programs

2006-09-22T13:01:19Z

Meandtheshell: formating

Some programs do things that might work on a normal host but not inside a V-Server. This is often not a fault of V-Server itself: the programs do automagic things which fail, and no proper error handling is done. Sometimes the actions also need special rights which are not permitted by default in V-Servers. Allowing CAPs is often not necessary, since those special CAPs are only required once (e.g. when the program initializes the directories/settings/whatever).

=== OpenGroupware Apache Module ===
If your V-Server doesn't have access to localhost, then the connection to the OpenGroupware server will fail with an "Internal Server Error". The apache module for OpenGroupware, called mod_ngobjweb, uses a hardcoded "127.0.0.1" IP address in the source (handler.c line 339). You need to change this line to the IP address that should be used (the IP of the V-Server that runs the OpenGroupware server).

=== Hylafax (with CAPI) ===
If you want to run hylafax in a V-Server, you will run into a CAP and device problem which can be easily solved. First you need your capi20 devices in your V-Server. They can't be created by ./MAKEDEV (it requires special CAPs), so copy the devices into the V-Server like this (command run on the host):<pre>cp -aR /dev/capi* /vservers/your_vserver/dev</pre>

Now hylafax can access your CAPI ISDN card but will exit after a few seconds. The problem is that it tries to create a /dev/null node in the hylafax chroot. This fails because of missing CAPs, so let's help hylafax again by copying the node into the hylafax chroot in the V-Server, like this (command run on the host):<pre>cp -aR /dev/null /vservers/your_vserver/var/spool/hylafax/dev</pre>
All right, now hylafax should have CAPI access and run properly.

=== Links inside screen inside a V-Server ===
For unknown reasons, links crashes systematically when it runs inside a screen session inside a V-Server and the session was started outside the V-Server. (please elaborate!)

=== screen inside a VServer ===

<pre>[root@ge root]# vserver zoe enter

zoe:/# screen
Cannot open your terminal '/dev/pts/5' - please check.

zoe:/# strace screen
...
stat64("/dev/pts/5", {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 5), ...}) = 0
open("/dev/pts/5", O_RDWR|O_NONBLOCK) = -1 EACCES (Permission denied)</pre>

This is neither a bug nor an issue with screen; it just shows that a vserver context is not allowed to mess with host terminals. Either use ssh/telnet to reach the 'guest', or start the screen session before you do the 'enter' (i.e. on the host).

=== OpenLDAP Startup ===
slapd needs name resolution available in order to start up, otherwise it appears to hang. Make sure you have working DNS (or whatever) available to your vserver before starting one with slapd. This behavior is confirmed in my setup, no confirmation from others yet. My Setup: vservers all bind to an interface on a DMZ-like network segment, BIND runs on a vserver. slapd would hang at startup if the BIND vserver had not been started first.

=== rndc ===
Bind's rndc has a hardcoded 127.0.0.1 somewhere, so any rndc command will fail with "connection refused". You should have a reachable localhost address defined in /etc/hosts; then you can use the command <pre>rndc -s localhost</pre> Alternatively, create an rndc.conf and set the default-server option, so that '-s localhost' isn't necessary.

=== Asterisk ===
Since some version of Asterisk (at least since 1.0.2), it no longer runs in a V-Server out of the box. On start it fails with: "Unable to set high priority".
This can be solved by allowing CAP_SYS_NICE for that V-Server. You can also run Asterisk without the realtime priority: just pass the '-p' command line argument to disable the real-time priority. Good doc on setting up Asterisk devices in the vserver: http://www.telephreak.org/papers/vpa/

=== Open/FreeSwan ===
Fails because of writing to /proc (requires patch) TODO: write me

=== Samba ===
Oplocks don't work, as smbd insists on receiving break requests from 127.0.0.1.
Just patch source/smbd/oplock.c (commenting out the paranoid code):
<pre>
--- oplock.c.orig 2005-02-14 14:27:51.000000000 +0200
+++ oplock.c 2005-02-02 12:27:50.000000000 +0200
@@ -181,12 +181,14 @@
return False;
}

+#if 0
/* Validate message from address (must be localhost). */
if(from.sin_addr.s_addr != htonl(INADDR_LOOPBACK)) {
DEBUG(0,("receive_local_message: invalid 'from' address \
(was %lx should be 127.0.0.1)\n", (long)from.sin_addr.s_addr));
return False;
}
+#endif

/* Setup the message header */
SIVAL(buffer,OPBRK_CMD_LEN_OFFSET,msg_len);
</pre>
Or, if you don't want to patch the samba source code, you can disable oplocks in Samba and it will work too!

Just put the following in your smb.conf:
<pre>
kernel oplocks = no
oplocks = no
</pre>
Note: The Vserver using Samba should also listen on the broadcast address. As a result, you will not be able to have two samba servers in the same net (on the same broadcast).

==== Samba from Debian 3.1 ====

The samba deb in sarge (3.1) provided file sharing. The only oddity observed is that the vserver guest running samba did not appear in a windows box's 'My Network Places'

Use a WINS server. The SMB browsing protocol relies heavily on broadcasts on the local net, which are problematic with vservers. WINS resolution on the other hand is unicast and works flawlessly under vserver.

==== Samba printer and file server with cups ====
Samba runs correctly in a Mandriva (Mdk) 10.1 Vserver (apart from the above oplock problem?). First, edit your ''/etc/sysconfig/network'' file, and set ''networking'' to ''yes'' (this will also solve problems for other services):
<pre>
# cat /etc/sysconfig/network
NETWORKING=yes
</pre>
Some more tweaking is needed in ''/etc/smb.conf'':
<pre>
# cat /etc/smb.conf
...
# YOUR VSERVER IP/MASK HERE
interfaces = xxx.xxx.xxx.xxx/mask
...
</pre>

But if you're using Samba + Cups to provide printing for Windows clients, AND if you want to use the ''Point and Print'' feature, there is more: in the ''[printers]'' section of your ''smb.conf'', you should have the ''use client driver'' directive set to ''no'', or the driver upload procedure will fail!
<pre>
# cat /etc/smb.conf
...
use client driver = no
...
</pre>
So, here is a full ''smb.conf'' file:
<pre>
# cat /etc/samba/smb.conf | awk '!/^$/ && !/^\s*(#|;)/ {print $0}'
[global]
workgroup = MYDOMAIN
netbios name = MYHOSTNAME
server string = MYCOMMENT (Samba %v)
printcap name = cups
load printers = yes
printing = cups
printer admin = @adm
log file = /var/log/samba/log.%m
max log size = 50
map to guest = bad user
security = domain
password server = *
encrypt passwords = yes
smb passwd file = /etc/samba/smbpasswd
username map = /etc/samba/smbusers
idmap uid = 10000-20000
idmap gid = 10000-20000
socket options = TCP_NODELAY SO_RCVBUF=8192 SO_SNDBUF=8192
interfaces = 127. MYVSERVERIP/MYVSERVERMASK
wins server = MYWINSIP
dns proxy = no
# for french users:
dos charset = 850
unix charset = ISO8859-1
[homes]
comment = Home Directories
browseable = no
writable = no
[printers]
comment = All Printers
path = /var/spool/samba
browseable = no
guest ok = no
writable = no
printable = yes
create mode = 0700
print command = lpr-cups -P %p -o raw %s -r # using client side printer drivers.
use client driver = no
[print$]
path = /var/lib/samba/printers
browseable = yes
write list = @adm root
guest ok = yes
inherit permissions = yes
</pre>
...And a working smbusers:
<pre>
# Unix_name = SMB_name1 SMB_name2 ...
root = administrator MYDOMAIN\administrator
nobody = guest pcguest smbguest
</pre>

=== Cups print server ===
Symptoms: The Cups init script exits with:
<pre>
Starting CUPS printing system: cupsd: Child exited with status 98!
</pre>
And the logs (''/var/log/cups/error_log'') show:
<pre>
E [date:hour...] StartListening: Unable to bind socket for address 0.0.0.0:631 - Address already in use.
</pre>
...Or something like this.

With a correct ''cupsd.conf'' file (tested with version 1.1.21-0.rc1.7mdk on Mandrake 10.1, now Mandriva), it works. All that is needed is to remove references to ''127.0.0.1'' or ''localhost'' from the file, and to leave the ''Listen'' directive unset:
<pre>
LogLevel info
TempDir /var/spool/cups/tmp
# No 'Listen' directive !
Port 631
BrowseAddress @LOCAL
BrowseDeny All
BrowseAllow @LOCAL
BrowseOrder deny,allow
<Location />
Order Deny,Allow
Deny From All
Allow From @LOCAL
</Location>
<Location /admin>
AuthType Basic
AuthClass System
Order Deny,Allow
Deny From All
Allow From YOUR_NETWORK_ADDRESS/YOUR_NETMASK # Example: 172.16.0.0/24
# Or
Allow From @LOCAL
</Location>
</pre>
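Before restarting Cups, it may help to scan the file for leftover loopback references. The following is only a minimal sketch and not part of the tested setup: the function name ''check_cupsd_conf'' is made up for illustration, and the default path ''/etc/cups/cupsd.conf'' is an assumption that may differ on your distribution.
<pre>
#!/bin/sh
# Hypothetical helper: report non-comment lines in a cupsd.conf that
# reference the loopback address or use a 'Listen' directive.
check_cupsd_conf() {
    conf=${1:-/etc/cups/cupsd.conf}   # assumed default location
    # Drop comment lines first, then search what is left.
    if grep -v '^[[:space:]]*#' "$conf" | grep -E -i '127\.0\.0\.1|localhost|^[[:space:]]*Listen[[:space:]]'; then
        echo "the lines above should be removed from $conf" >&2
        return 1
    fi
    echo "no loopback references or Listen directives found"
}
</pre>
Calling ''check_cupsd_conf'' without an argument checks the assumed default file; a non-zero exit status means that something still needs to be removed.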
Then you'll need to modify the ''/etc/init.d/cups'' script and comment out any section referring to ''127.0.0.1'' lookup and configuration. This section exists at least on Mandrake 10.1 and is pretty long (lines 35 to 55 and/or 79); additionally, four "''else...if''" lines far below (lines 161 to 164) must be commented out!

Remember to stop any Cupsd running in the host server, or to start it via a wrapper ''/etc/init.d/v_cups'' script:
<pre>
#!/bin/sh
# chkconfig: 2345 15 60
# description: Wrapper to start cups bound to a single IP
USR_LIB_VSERVER=/usr/lib/util-vserver
exec $USR_LIB_VSERVER/vsysvwrapper cups $*
</pre>
Do not forget to give a password to the root user if you want to be able to manage your printers from the web interface (http://yourcupsvserver:631)!
<pre>
# passwd root
...
</pre>
If you use Mandriva 10.1 (and maybe some other distros), you'll need to add the printer drivers for Cups and reload it:
<pre>
# urpmi --root /vservers/yourcupsvserver/ cups-drivers
# /etc/init.d/cups reload
</pre>
…It added 67 MB of packages for me.

Then use ''/etc/init.d/v_cups (re)start'' to launch Cups on the host server.
You will now be able to make Cupsd start in the vserver, but more tweaking of the ACLs may be necessary to avoid authentication problems...

=== Bind9 on Debian GNU/Linux Woody (3.0) and Sarge (3.1) ===
The named provided by the bind9 binary packages fails to start because it is compiled with Linux capabilities (CAPs) support.

The Debian way is to build your own package without CAPs:
<pre>
su -
cd /usr/src
apt-get build-dep bind9
apt-get source bind9
cd bind9-x.x.x
vi debian/rules
</pre>
Insert the following line after "./configure --prefix=/usr \":
<pre>
--disable-linux-caps \
</pre>
On an NPTL-enabled system you also have to replace
<pre>
--enable-threads \
</pre>
with
<pre>
--disable-threads \
</pre>
or bind might refuse to run as any user other than root.

Save the file and go ahead with compiling/installing:
<pre>
dpkg-buildpackage
dpkg -i ../bind9-x.x.x.deb
echo "bind9 hold" | dpkg --set-selections
</pre>
The last line sets the package "on hold", so it is not touched by the update process. You have to take care of security holes yourself now!

The Xs in "bind9-x.x.x" denote the version number of bind9. Alternatively you can allow the CAP_SYS_RESOURCE for that V-Server. The best way would be to fix bind, which is somehow broken when it comes to capabilities. Daniel Hokka Zakrisson repaired it. His patch is to be found here:

[bind-9.3.2-caps-when-available.patch]

So, if you recompile, it would be the cleanest way to apply that patch. Thanks Daniel! It would be also nice, if someone submits that patch to the bind people or maybe to your distribution's package maintainers in the first step.

Get my [vserver-guest-ready Debian bind9 package] for Debian Sid guests. Feedback welcome: aj@net-lab.net

=== Zimbra Mail ===
Zimbra consists of many applications (including Postfix, MySQL, OpenLDAP and more) which try to take over the interfaces and depend a lot on binding to 127.0.0.1. It is not hard to change, but there are a couple of tricks, documented here: http://wiki.zimbra.com/index.php?title=Install_VServer

=== xine ===
Won't start, and shows no error message.

"xine --verbose" shows this.
<pre>
ERROR: Could not determine network interfaces, you must use a interfaces config line
</pre>
This happens if you have the xineplug_inp_smb.so plugin. Delete it and everything is fine.

=== 127.0.0.1 issues ===
I had problems with an application that wanted me to access it on 127.0.0.1 and AS 127.0.0.1 to be able to do its configuration. A simple tweak solved the problem. I renamed the default interface directory "0" in /etc/vservers/server/interfaces to "1" and created interface 0 as :
<pre>
dev lo
ip 127.0.0.1
mask 255.0.0.0
name lo
</pre>
Now interface "1" is the default interface created by the vserver build script, with a local address like 192.168.1.2, and interface "0" is the loopback. I can now telnet to 127.0.0.1, and the application sees that I'm connecting to 127.0.0.1 from 127.0.0.1.
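
The renaming and the four files can be created from the host with a few commands. This is only a sketch: the helper name ''add_loopback_interface'' is hypothetical, and it assumes that "0" is the only existing interface directory and that no directory "1" exists yet.
<pre>
#!/bin/sh
# Hypothetical helper: move interface 0 to 1 and create a loopback interface 0.
add_loopback_interface() {
    ifdir=$1                      # e.g. /etc/vservers/server/interfaces
    if [ ! -d "$ifdir/0" ]; then
        echo "no interface 0 in $ifdir" >&2
        return 1
    fi
    if [ -e "$ifdir/1" ]; then
        echo "$ifdir/1 already exists, not touching anything" >&2
        return 1
    fi
    mv "$ifdir/0" "$ifdir/1"
    mkdir "$ifdir/0"
    # Same four values as listed above, one file per setting.
    echo lo        > "$ifdir/0/dev"
    echo 127.0.0.1 > "$ifdir/0/ip"
    echo 255.0.0.0 > "$ifdir/0/mask"
    echo lo        > "$ifdir/0/name"
}
</pre>
After something like ''add_loopback_interface /etc/vservers/server/interfaces'', the guest presumably has to be restarted for the new interface layout to take effect.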

Compiling nagios-1.4 within a vserver requires this, otherwise it hangs during the configure with "checking for ICMP ping syntax..."

=== Hula-project ===
Does not want to start. TODO: add more information.

=== Postfix ===

==== Postfix 2.1.5 (Debian Sarge) ====
On a vserver with two interfaces (lo and eth0), and a postfix 2.1.5 listening on lo, postfix can't send emails : "Invalid argument"... Setting smtp_bind_address (http://www.postfix.org/postconf.5.html#smtp_bind_address) to the external address solves the issue.

==== Postfix Policy Daemon ====
Running a Debian 3.1 Sarge with Backports I have several issues with the postfix-policyd because it wants to set the rlimits.

Log returns:
<pre>
cannot set rlimit: Operation not permitted
</pre>
Strace tells us:
<pre>
setrlimit(RLIMIT_NOFILE, {rlim_cur=4097, rlim_max=4097}) = -1 EPERM (Operation not permitted)
</pre>
Output on the Host
<pre>
# ulimit -Ha
...
-n: file descriptors 1024
...
</pre>
That's too little...
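
The comparison between the hard limit and the value the daemon asks for can also be scripted. The following is a small sketch, not something shipped with postfix-policyd: the function name ''limit_is_sufficient'' is invented here, and 4097 is simply the value taken from the strace output above.
<pre>
#!/bin/sh
# Hypothetical helper: succeed if the hard open-files limit covers the
# number of descriptors the daemon wants to set.
limit_is_sufficient() {
    required=$1
    hard=${2:-$(ulimit -Hn)}      # optional second argument overrides the live value
    if [ "$hard" = "unlimited" ]; then
        return 0
    fi
    [ "$hard" -ge "$required" ]
}

if limit_is_sufficient 4097; then
    echo "hard limit is high enough for the policy daemon"
else
    echo "hard limit $(ulimit -Hn) is below 4097" >&2
fi
</pre>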

Solution:

The app again has a built-in need to use CAP_SYS_RESOURCE (which is bad (tm)) so in the guest do:
<pre>
# ulimit -HS -n 8192
# ulimit -Ha
.. shows us now the correct 8192 instead of 1024.

# vserver $yourVserverName restart
# vserver $yourVserverName enter

$yourVserverName # ulimit -Ha
...
-n: file descriptors 8192
...
</pre>
Everything should be fine now !

Problematic Programs

2006-09-22T12:54:42Z

Meandtheshell: I ported the remaining content - page is now entirely part of the new wiki

Frequently Asked Questions

2006-09-20T14:35:17Z

Meandtheshell: I made s/master/host/ in order to keep up the unified naming

<div style="margin: 2em auto 2em auto; padding: 10px; background-color: #F9ECCD; border: 1px solid #004433; text-align: center;">
[[Image:Icon-Caution.png|left]]
We are currently migrating to MediaWiki from our old installation, but not all content has been migrated yet. Take a look at the [[Wiki Team]] page for instructions on how to help, or look at the [http://oldwiki.linux-vserver.org old wiki] to find the information not migrated yet.

'''To ease migration we created a [[List of old Documentation pages]].'''
</div>

CURRENTLY THE CONTENT OF THE OLD WIKI FAQ (AND MORE) IS BEING MIGRATED TO THIS PAGE (TASK: DERJOHN)

__TOC__

{{Question|Question=What is a 'Guest'?||Details=To talk about stuff, we need some naming. The physical machine is called 'Host' and the 'main' context running the Host Distro is called 'Host Context'. The virtual machine/distro is called 'Guest' and basically is a Distribution (Userspace) running inside a 'Guest Context'.|Signature=derjohn}}

{{Question|Question=What kind of Operating System (OS) can I run as guest?||Details=
A: With VServer you can only run Linux guests. The trick is that a guest does not run a kernel of its own (as XEN and UML do); it merely uses a virtualized host kernel interface. VServer offers so-called security contexts which make it possible to separate guests from each other, i.e. they cannot get data from each other. Imagine it as a chroot environment with much more security and features.|Signature=derjohn}}

{{Question|Question=Which distributions did you test?||Details=
A: Some. Check out the wiki for ready-made guest images. But you can easily build your own guest images, e.g. with Debian's debootstrap. Check out ((step-by-step Guide 2.6)) to see how to do that.|Signature=derjohn}}

{{Question|Question=Is VServer comparable to XEN/UML/QEMU?||Details=
A: Nope. XEN/UML/QEMU and VServer are just good friends. Because you ask, you probably know what XEN/UML/QEMU are. In contrast to XEN/UML/QEMU, VServer does not "emulate" any hardware you run a kernel on. You can run a VServer kernel in a XEN/UML/QEMU guest. This is confirmed to work at least with Linux 2.6/vs2.0.|Signature=derjohn}}

{{Question|Question=Is VServer secure?||Details=
A: We hope so. It should be at least as secure as Linux is. We consider it much much more secure though.|Signature=derjohn}}

{{Question|Question=Performance?||Details=
A: For a single guest, we basically have native performance. Some tests showed insignificant overhead (about 1-2%) others ran faster than on an unpatched kernel. This is IMVHO significantly less than other solutions waste, especially if you have more than a single guest (because of the resource sharing).|Signature=derjohn}}

{{Question|Question=Is SMP Supported?||Details=
A: Yes, on all SMP capable kernel architectures.|Signature=derjohn}}

{{Question|Question=Resource sharing?||Details=
A: Yes ....
* memory: Dynamically.
* CPU usage: Dynamically (token bucket)|Signature=derjohn}}

{{Question|Question=Resource limiting?||Details=
A: Yes, you can set maximum limits per guest, but you can only offer guaranteed resource availability with some ticks at the time. There is the possibility to ulimit and to rlimit. Rlimit is a new feature of kernel 2.6/vs2.0.|Signature=derjohn}}

{{Question|Question=Disk I/O limiting? Is that possible?||Details=
A: Well, since vs2.1.1 linux-vserver supports a mechanism called 'I/O scheduling', which appeared in the 2.6 mainline some time ago. The mainline kernel offers several I/O schedulers:

<pre>
# cat /sys/block/hdc/queue/scheduler
noop [anticipatory] deadline cfq
</pre>

The default is anticipatory a.k.a. "AS". When running several guests on a host you probably want the I/O performance shared in a fair way among the different guests. The kernel comes with a "completely fair queueing" scheduler, CFQ, which can do that. (More on schedulers can be found at http://lwn.net/Articles/114770/)

This is how to set the scheduler to "cfq" manually:
<pre>
root# echo "cfq" > /sys/block/hdc/queue/scheduler
root# cat /sys/block/hdc/queue/scheduler
noop anticipatory deadline [cfq]
</pre>

Keep in mind that you have to do it on all physical discs. So if you run an md-softraid, do it to all physical /dev/hdXYZ discs!
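
To avoid forgetting a disk, the same write can be done in a loop. This is a sketch under assumptions: the function name ''set_io_scheduler'' is made up here, and it simply tries every block device that exposes a ''queue/scheduler'' file, which also includes virtual devices where the write may be rejected or have no effect.
<pre>
#!/bin/sh
# Hypothetical helper: select one I/O scheduler for every block device.
set_io_scheduler() {
    sched=$1
    sysroot=${2:-/sys}            # second argument only exists to ease testing
    for f in "$sysroot"/block/*/queue/scheduler; do
        [ -w "$f" ] || continue   # skip devices without a writable scheduler file
        echo "$sched" > "$f" || echo "could not set $sched on $f" >&2
    done
}

# Run as root on the host:
# set_io_scheduler cfq
</pre>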

If you run Debian there is a predefined way to set the /sys values at boot-time:

<pre>
# apt-get install sysfsutils
[...]

# cat /etc/sysfs.conf | grep cfq
block/sda/queue/scheduler = cfq
block/sdc/queue/scheduler = cfq

# /etc/init.d/sysfsutils restart
</pre>

For non-vserver processes and CFQ you can set by which key the kernel decides about the fairness:

<pre>
cat /sys/block/hdc/queue/iosched/key_type
pgid [tgid] uid gid
</pre>
Hint: The 'key_type'-feature has been removed in the mainline kernel recently. Don't look for it any longer :(

The default is tgid, which means sharing fairly among process groups. Think of every guest as being treated like its own process group. It's not possible to set a scheduler strategy within a guest. All processes belonging to the same guest are treated like "noop" within the guest. So: If you run apache and some ftp-server within the _same_ guest, there is no fair scheduling between them, but there is fair scheduling between the whole guest and all other guests.

And: It's possible to tune the scheduler parameters in several ways. Have a look at /sys/block/hdc/queue/....

You need a very recent version of VS devel, e.g. 2.1.1-rc18 can do it. Some older versions have that feature too; then it got lost and was reinvented. So: Go and get an rc18 - only in 'devel', not stable!|Signature=derjohn}}

{{Question|Question=Why isn't there a device /dev/bla within a guest?||Details=
A: Device nodes allow Userspace to access hardware (or virtual resources). Creating a device node inside the guest's namespace will give access to that device, so for security reasons, the number of 'given' devices is small.|Signature=derjohn}}

{{Question|Question=What is Unification (vunify)?||Details=
A: Unification is Hard Links on Steroids. Guests can 'share' common files (usually binaries and libraries) in a secure way, by creating hard links with special properties (immutable but unlinkable (removable)). The tool to identify common files and to unify them is called vunify.|Signature=derjohn}}

{{Question|Question=What is vhashify?||Details=
A: The successor of vunify, a tool which does unification based on hash values (which allows to find common files in arbitrary paths.)|Signature=derjohn}}

{{Question|Question=How do I manage a multi-guest setup with vhashify?||Details=
A: For 'vhashify', just do these once:

<pre>
mkdir /etc/vservers/.defaults/apps/vunify/hash /vservers/.hash
ln -0s /vservers/.hash /etc/vservers/.defaults/apps/vunify/hash/root
</pre>

Then, do this one line per vserver:

<pre>
mkdir /etc/vservers/<vservername>/apps/vunify # vhashify reuses vunify configuration
</pre>

The command 'ln' creates a link between two files. "ln -s" creates a symbolic link -- two files are linked by name. "ln -0s" uses a Vserver extension to create a unified link.|Signature=derjohn}}

{{Question|Question=With which VS version should I begin?||Details=
A: If you are new to VServer I recommend trying 2.0.+. Take "alpha utils" version 0.30.210. A well-running version of it recently appeared in Debian Sid (it's a .210 at the time of writing).|Signature=derjohn}}

{{Question|Question=Is there a way to implement "user/group quota" per VServer?||Details=
A: Yes, but not on a shared partition for now. You need to put the guest on a separate partition, setup a vroot device (to make the quota access secure), copy that into the guest, and adjust the mtab line inside the guest.|Signature=derjohn}}

{{Question|Question=What about "Quota" for a context?||Details=
A: Context quotas are now called Disk Limits (so that we can tell them apart from the user/group quotas :). They are supported out of the box (with vs2.0) for all major filesystems (Ext2/3, ReiserFS, XFS, JFS)|Signature=derjohn}}

{{Question|Question=Does it support IPv6?||Details=
A: Currently not. Some developer has to move his ... to reimplement this functionality from the V4 code (I read that on the ML ;)). Will probably be superseded by the ngnet (next generation networking) soon. There is a Wiki page regarding this: http://linux-vserver.org/IPv6|Signature=derjohn}}

{{Question|Question=I can't do all I want with the network interfaces inside the guest?||Details=
A: For now the networking is 'Host Business' -- the host is a router, and each guest is a server. You can set the capability ICMP_RAW in the context of the guest, or even the capability CAP_NET_RAW (which would even allow to sniff interfaces of other guests!). Likely to change with ngnet. |Signature=derjohn}}

{{Question|Question=Is there a web-based interface for vserver that will allow creation/deletion/configuration etc. of vserver guests?||Details=
A. [Update] Errrh, there is http://OpenVPS.org which is a set of scripts with a web-interface for webhosters/ISPs.
A. [Update] Errrh, there is http://Openvcp.org which is a distributed system (agent!) with a web-interface, with which you can build/remove guests! cool stuff! beta, try out!|Signature=derjohn}}

{{Question|Question=What is old-style and new-style config?||Details=
A. Old-style config refers to a single text-file that contains all the configuration settings. With new-style config the configuration is split into several directories and files. You should probably go for new-style config if you are asking.|Signature=derjohn}}

{{Question|Question=What is the "great flower page"?||Details=
A. Well, this page contains all configuration options for vserver in version > 1.9 (I think .. I joined Linux-VServer in version 2, so I don't know for sure). The name of the page is derived from the stylesheet(s) it contains: It displays background pictures of a very great flower, so regard it as highly optimized. It was designed by a non-designer, who asks us to create a better one. I played with the thought of creating a completely new theme for that page - but actually we all got used to the name "great flower page", so we stick to it. If you are unable to read it clearly, feel invited to join the IRC channel #vserver, we may tell you how to ;)|Signature=derjohn}}

{{Question|Question=How do I add several IPs to a vserver? ||Details=
A: First of all a single guest vserver only supports up to 16 IPs (There is a 64-IP patch available, which is in "derjohn's kernel", you need extra util-vserver anyway).
Here is a little helper-script that adds a list of IPs defined in a text file, one per line.
<pre>
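#!/bin/bash
# Reads one IP per line from ./myiplist and creates one interface
# directory per IP (2, 3, ...) in the current directory.
j=1
while read -r ip; do
    [ -n "$ip" ] || continue   # skip empty lines
    j=$((j+1))
    mkdir "$j"
    echo "$ip" > "$j/ip"
    echo "$ip" > "$j/ip-old"
    echo "24" > "$j/prefix"
done < myiplist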
</pre>|Signature=derjohn}}

{{Question|Question=If my host has only a single public IP, can I use RFC1918 IPs (e.g. 192.168.foo.bar) for the guest vservers?||Details=
A: Yes, use iptables with SNAT to masquerade it.
<pre>
iptables -t nat -I POSTROUTING -s $VSERVER_NETZ ! -d $VSERVER_NETZ -j SNAT --to $EXT_IP
</pre>
See: HowtoPrivateNetworking and
http://www.tgunkel.de/it/software/doc/linux_server#h3-Vserver_Masquerading_SNAT (THX, [MUPPETS]Gonzo)|Signature=derjohn}}

{{Question|Question=If I shut down my vserver guest, the whole Internet interface ethX on the host is shut down. What happened? ||Details=
A: When you shut down a guest (''i.e. vserver foo stop''), the IP is brought down on the host also. If this IP happens to be the primary IP of the host, the kernel will not only bring down the primary IP, but also all secondary IP addresses. But in very recent kernels, there is an option ''settable'' which prevents that nasty feature. It's called "alias promotion". You may set it via sysctl by adding ''net.ipv4.conf.all.promote_secondaries=1'' in /etc/sysctl.conf or via sysctl command line.|Signature=derjohn}}

{{Question|Question=On Debian Sarge (stable) only util-vserver 0.30-204 is available, which has been reported to be buggy (I didn't check the version for a longer time). How do I compile a local version of alpha util-vserver .210 on Debian?||Details=
A:
<pre>
apt-get build-dep util-vserver

./configure --prefix=/usr/local/ --enable-release \
--mandir=/usr/local/share/man \
--infodir=/usr/local/share/info \
--sysconfdir=/etc --enable-dietlibc \
--localstatedir=/var \
--with-vrootdir=/var/lib/vservers

make

make install-distribution
(Which does a make install + setting a symlink ln -s /usr/local/lib/util-vserver/vshelper /sbin/vshelper )

</pre>

To test which version you are running:
<pre>
# which vserver
/usr/local/sbin/vserver

</pre>

This should point to ..local...

If you don't want to build it yourself: On www.backports.org there are backported (for sarge) linux-images (2.6.16) with vserver-patch enabled and an updated util-vserver package as well.
|Signature=derjohn}}

{{Question|Question=I use derjohn's kernel or a different kernel with a more-than-16-IPs-per-guest-patch and can't use more than 16 IPs. Why?||Details=
A: You need to patch util-vserver, too. So you obviously need to recompile util-vserver (see above). In the util-vserver directory there are header files in the ./kernel/ directory. Patch like this:

<pre>
kernel/network.h:#define NB_IPV4ROOT 64
</pre>

BTW: The initial patches can be found here: http://vserver.13thfloor.at/Experimental/VARIOUS/util-vserver-0.30.196-net64.diff.bz2 and http://vserver.13thfloor.at/Experimental/VARIOUS/delta-2.6.9-vs1.9.3-net64.diff
|Signature=unknown}}

{{Question|Question=I run a Debian host and want to build an Ubuntu guest. Howto?||Details=
A: Simple ;) Assume you want to build a breezy guest on a sid host with IP 192.168.0.2 and hostname vubuntu, then do:
<pre>
vserver vubuntu build --force -m debootstrap --hostname vubuntu.myvservers.net --netdev eth0 --interface 192.168.0.2/24 \
--context 42 -- -d breezy -m http://de.archive.ubuntu.com/ubuntu
</pre>

[UPDATE] Currently there are problems in building breezy under unclear circumstances, which seem to have to do with udev. If the above didn't work, try:
<pre>
vserver vubuntu build --force -m debootstrap --hostname vubuntu.myvservers.net --netdev eth0 --interface 192.168.0.2/24 \
--context 42 -- -d breezy -m http://de.archive.ubuntu.com/ubuntu -- --exclude=udev
</pre>
In very recent versions of the utils, the problem should not occur anymore (it has to do with the 'secure-mount' if you look in the MLs)

Well, sid's debootstrap knows how to bootstrap Ubuntu linux. Make sure to have a current debootstrap package:
<pre>
apt-get update
apt-get install debootstrap
</pre>
The knowledge how to build ubuntu 'breezy badger' (which you probably want to be your guest at the time of writing) has been added recently.|Signature=derjohn}}

{{Question|Question=How do I make a vserver guest start by default?||Details=
A: At least on Debian, I can tell you how to do it with the new-style config. If your guest is called "derjohn" and you want it to be started somewhere at the end of your boot process, then do:
<pre>
echo "default" > /etc/vservers/derjohn/apps/init/mark
</pre>
If you want to start it earlier, please read the init script "/etc/init.d/vserver-default" to find out how to do it. In most cases you don't need to change this. On Debian the vservers are started at "90", so after most other stuff is up (networking etc.).

Besides that I created a small helper script for managing the autostart foo: ((vserver-autostart))|Signature=derjohn}}

{{Question|Question=My host works, but when I start a guest it says that it has a problem with chbind.||Details=
A: You are probably using util-vserver <= 0.30.209, which does use dynamic network contexts internally (with 0.30.210 this changed). So if you compiled your kernel without dynamic contexts, you may start guests, but you can't use the network context. The solution is either to switch to the .210 utils (or Hollow's toolset) or to compile the kernel with dynamic network contexts.
SE Keyword: invalid option `nid' testme.sh|Signature=derjohn}}

{{Question|Question=When I try to ssh to the guest, I log into the host, even if I installed sshd on the guest. What's wrong here?||Details=
A: Look at /etc/ssh/sshd_config of the host:

<pre>
Port 22
# Use these options to restrict which interfaces/protocols sshd will bind to
#ListenAddress ::
</pre>

And now change the setting to
<pre>
Port 22
# Use these options to restrict which interfaces/protocols sshd will bind to
ListenAddress your.hosts.ip.here # not the guests IP!
</pre>

Then '/etc/init.d/ssh restart' on the host, after that on the guest (if you did apt-get install ssh on the guest already.)

Do I have to explain more? The host's sshd binds to all available IP addresses on port 22 (the host even 'sees' all addresses of the guests!). So if the guest starts its sshd, it can't bind to port 22 any more. You need to change that setting only on the host.
(BTW: A similar approach has to be done for a lot of daemons, e.g. Apache. If the daemon does not support an explicit bind, you may use the chbind command to 'hide' IP addresses from the daemon before starting.)|Signature=derjohn}}

{{Question|Question=I did everything right, but the application foo does not start. What's up there?||Details=
A: Before asking on the IRC channel, please check out the 'problematic programs' page:
[[Problematic Programs]]|Signature=derjohn}}

{{Question|Question=Bind9 does not like to start in my guest.||Details=
A: Check out the 'problematic programs' page:
((ProblematicPrograms)) and/or get my [http://linux-vserver.derjohn.de/bind9-packages/bind9-capacheck_9.3.2-2_i386.deb vserver-guest-ready Debian package] for Debian Sid guests from that URL: http://linux-vserver.derjohn.de/bind9-packages/bind9-capacheck_9.3.2-2_i386.deb and check out the [http://linux-vserver.derjohn.de/bind9-packages/README.txt readme]. (Hint: This is fresh stuff. Then give me feedback.)

[UPDATE] Since VServer Devel 2.1.1-rc18 you do not need to patch the userland tools anymore. The capabilities are masked.|Signature=derjohn}}

{{Question|Question=Which guest vservers are running?||Details=
A: <code>vserver-stat</code>. Example output:
<pre>
CTX PROC VSZ RSS userTIME sysTIME UPTIME NAME
0 77 965.1M 334.6M 14m14s18 2m28s69 1h33m46 root server
49152 7 14M 5.2M 0m00s40 0m00s30 1h30m15 chiffon
</pre>|Signature=derjohn}}

{{Question|Question=How can I reboot/halt guests?||Details=
A: It depends.
For vserver with legacy-interfaces support, you have to replace <code>/sbin/halt</code> in guests with vreboot and start rebootmgr in host. You also need to have a dummy <guest>.conf file in /etc/vservers for each guest. Please have a look at /etc/init.d/rebootmgr.
Vserver with native interface utilizes /dev/initctl. No changes are needed in guests. Just make sure that REBOOT capability is adjusted in guests.|Signature=derjohn}}

{{Question|Question=Do I really need the legacy-interfaces? What are these legacy-interfaces?||Details=
A: Since vserver is an ongoing project, new features might replace old ones, and some might still be under development. Legacy-interfaces are available for backward compatibility (and might be removed someday). See Q: How can I reboot/halt guests?|Signature=derjohn}}

{{Question|Question= I have a vserver running on a Linux kernel with preemption. Is VServer "preempt" safe?||Details=
A: There are no known issues about running vserver on a preemption enabled kernel. I would like to add, that the vserver kernelhackers would probably exclude that option in 'make menuconfig' if there would be an incompatibility. Just my $.02 :)|Signature=derjohn}}

{{Question|Question=Is this a new project? When was it started?||Details=
A: The first public occurrence of linux-vserver was Oct 2001. The initial mail can be found here: http://www.cs.helsinki.fi/linux/linux-kernel/2001-40/1065.html
So you can expect a mature software product which does its magic quite well (And hey, we have a version > 2.0 ! )|Signature=derjohn}}

{{Question|Question=Can I run an OpenVPN Server in a guest?||Details=
A: Yes. I don't want to provide an in-depth OpenVPN tutorial, but want to show how I made OpenVPN work in a guest as server. I was not able to run it with a tun device, due to a buglet in util-vserver and kernel when it comes to setting an IP address on a point-to-point link: If you add "ip addr add <ip> peer <mypeer> dev tun0" there is no way to map the tun0 interface into a guest, not even with a 'nodev' option. (bug confirmed to be reproducible by daniel_hoczac)

First of all you have to prepare the host with a persistent tuntap interface in tap-mode. The tools we need come from the uml-utilities.
Then you need to create a device /dev/net/tun, which the OpenVPN userspace daemon reads. We'll assume 10.10.10.100 is the server IP and 10.10.10.101 is the client IP. To be cool we choose a /31 netmask (255.255.255.254), so we have a net without broadcast and don't waste IPs :)

On the host do:
<pre>
# apt-get install uml-utilities
# cd /var/lib/vservers/<myopenvpnserver>/dev/
# ./MAKEDEV tun
(creates the dev/net/tun device accessible by the guest - even a tap interface needs /dev/net/tun !)
# tunctl -t tap0
(creates the network device 'tap0' persistently)
</pre>

Then add the ip to the guest:
<pre>
# cat /etc/vservers/<myopenvpnserver>/interfaces/1/ip
10.10.10.100
# cat /etc/vservers/<myopenvpnserver>/interfaces/1/prefix
31
# cat /etc/vservers/<myopenvpnserver>/interfaces/1/dev
tap0
(This kind of config brings the IP up when the vserver is started - only the tap0 interface has to exist already, see above!)
</pre>

Here is a sample config for the guest (which is acting as a server):

Install OpenVPN package on server and client, in the Debian case:
<pre>
# apt-get install openvpn
</pre>

The server's conf looks like that:
<pre>
# port and interface specs

# behave like a ssl-webserver
port 443
proto tcp-server

# tap device? (keep in mind you need /dev/net/tun !)
dev tap0

# now the ips we will use for the tunnel
ifconfig 10.10.10.100 255.255.255.254
ifconfig-noexec

# the server part

# Keep VPN connections, even if the client IP changes
float

# use compression (may also even obfuscate content filters)
comp-lzo

# use a static key - create it with 'openvpn --genkey --secret static.key'
secret static.key

# dont reload the key after a SIGUSR1
persist-key

# check alive all 10 secs
keepalive 10 60

# verbosity level (from 1 to 9, 9 is max log level)
verb 4
status openvpn-status.log
</pre>

The client's conf may look like this (this example even makes the tunnel the client's default route):
<pre>
# cat /etc/openvpn/client.conf
# port and interface specs

# the following is not necessary, if you bring up openvpn via Debian's init script:
daemon ovpn-my-clients-name

# behave like a ssl-webserver
port 443
proto tcp-client
remote <insert-the-guest-primary-public-ip-here>
# what device, tun or tap?
dev tap

# now the ips we will use for the tunnel
ifconfig 10.10.10.101 255.255.255.254

# Keep VPN connections, even if the client IP changes
float
mssfix

# use compression (may also even obfuscate content filters)
comp-lzo

# use a static key
secret static.key

# dont reload the key after a SIGUSR1
persist-key

# check alive all 10 secs
keepalive 10 60

# verbosity level (from 1 to 9, 9 is max log level)
verb 4

# set the default route
route-gateway 10.10.10.100
redirect-gateway def1
# to add special routes you can do it within the openvpn client conf:
# route <dest> <mask> <gateway>

# if you need to connect via proxy (like squid)
# http-proxy s p [up] [auth] : Connect to remote host through an HTTP proxy at
# address s and port p. If proxy authentication is required,
# up is a file containing username/password on 2 lines, or
# 'stdin' to prompt from console. Add auth='ntlm' if
# the proxy requires NTLM authentication.

# http-proxy s p [up] [auth]

# http-proxy-option type [parm] : Set extended HTTP proxy options.
# Repeat to set multiple options.
# VERSION version (default=1.0)
# AGENT user-agent

# http-proxy-option type [parm]
</pre>

In the next lesson I will talk about OpenVPN's server mode, which can deal with multiple clients connecting to one IP and one port (i.e. you only need one guest for tons of 'roadwarriors'), TLS connections and PKI.

Contributions welcome. :)|Signature=derjohn}}

{{Question|Question=32 vs 64 Bit? What should I take?||Details=
A: If you have the choice, make the host a 64 bit one. You can run a guest as 32 bit or as 64 bit on a 64 bit host. To run it as 32 bit, you need to compile the x86_64 (a.k.a. AMD64) kernel with the following options:

<pre>
[*] Kernel support for ELF binaries
<M> Kernel support for MISC binaries
[*] IA32 Emulation <---- without that, the entire 32bit API is not present
<M> IA32 a.out support
</pre>

You can force the guest to behave like a 32 bit environment like this:
<pre>
echo linux_32bit > /etc/vservers/$NAME/personality
echo i686 > /etc/vservers/$NAME/uts/machine
</pre>
(thanks cehteh for the hint!)

You can also force debootstrap to put 32 bit binaries into the guest with 'export ARCH=i386':
<pre>
export ARCH=i386 ; vserver build ....
</pre>|Signature=derjohn}}

{{Question|Question=I want to (re)mount a partition in a running guest ... but the guest has no rights (capability) to (re)mount?||Details=
A: I'll explain with an example: the /tmp partition within the guest is too small, which is likely the case if you stay with the 16 MB default (vserver build mounts /tmp as a 16 MB tmpfs!).
<pre>
# vnamespace -e XID mount -t tmpfs -o remount,size=256m,mode=1777 none /var/lib/vservers/<guest>/tmp/
</pre>
Be warned that the guest will not recognize the change, as the /etc/mtab file is not updated when you mount like this. To permanently change the mount, edit /etc/vservers/<guest>/fstab on the host.|Signature=derjohn}}

{{Question|Question=How do I limit a guest's RAM? I want to prevent OOM situations on the host!||Details=
A: First you can read [http://linux-vserver.org/Memory+Allocation].
If you want a recipe, do this:
1. Check the size of memory pages. On x86 and x86_64 it is usually 4 KB per page.
2. Create /etc/vservers/<guest>/rlimits/
3. Check your physical memory size on the host in kilobytes, e.g. with "free -k". maxram = kilobytes/pagesize.
4. Limit the guest's physical RAM to a value smaller than maxram:

<pre>
echo %%insertYourPagesHereSmallerThanMaxram%% > /etc/vservers/<guest>/rlimits/rss
</pre>

5. Check your swapspace, e.g. with 'swapon -s'. maxswap = swapkilobytes/pagesize.
6. Limit the guest's maximum number of as pages to a value smaller than (maxram+maxswap):

<pre>
echo %%desiredvalue%% > /etc/vservers/<guest>/rlimits/as
</pre>

It should be clear that this can still lead to OOM situations. Example: You have two guests and your as limit per guest is greater than 50% of (maxram+maxswap). If both guests request their maximum at the same point in time, there will not be enough memory.|Signature=derjohn}}

{{Question|Question=Where can I get newer versions of VServer as ready-made packages for Debian?||Details=
A: Here you go: http://linux-vserver.derjohn.de/ . There is also some stuff on backports.org, but my kernels are always 'devel' branch.|Signature=derjohn}}

{{Question|Question=Can I use iptables ?||Details=
Yes but right now only on the host (rootserver). Please realize that all traffic is local and will not touch the forward chain.|Signature=BeginnerFAQ}}

{{Question|Question=Trying to connect to a vserver from the host or another vserver on the same host fails||Details=
strace shows
<pre>
sin_addr=inet_addr("xx.xx.xx.xx")}, yy) = -1 EINVAL (Invalid argument)
</pre>
A: The host/guest cannot communicate with another guest on the same host.
* check all netmasks on all interfaces (do they overlap) ?
* check policy routing (disable it temporarily) ?
* check that lo is up (Networking within a host/guest always uses lo interface)
|Signature=CommonProblems}}

{{Question|Question=#1 ERROR: capset(): Operation not permitted||Details=capabilities are not enabled in kernel-setup
please check that CONFIG_SECURITY_CAPABILITIES is loaded or included in the kernel. ( check with "cat /path_to_kernel/.config | grep -i cap ")
(2.6.11.5-vs-1.9.5 + 0.30-205)|Signature=IrcQuestions}}

{{Question|Question=How can I make 'vserver start' mount the root filesystem||Details=
mount it via /etc/vservers/vserver-name/fstab, make sure to set the option 'dev' e.g.:
<pre>/dev/drbd0 / xfs rw,dev 0 0</pre>
util-vserver 210 won't be able to find some scripts for the reboot, add into /etc/vservers/vserver-name/apps/init/cmd.stop
<pre>/etc/init.d/rc
6</pre>
|Signature=AdrianReyer}}

Frequently Asked Questions

2006-09-20T13:09:53Z

Meandtheshell: fixed a formating bug ...

<div style="margin: 2em auto 2em auto; padding: 10px; background-color: #F9ECCD; border: 1px solid #004433; text-align: center;">
[[Image:Icon-Caution.png|left]]
We currently migrate to MediaWiki from our old installation, but not all content has been migrated yet. Take a look at the [[Wiki Team]] page for instructions how to help or look at the [http://oldwiki.linux-vserver.org old wiki] to find the information not migrated yet.

'''To ease migration we created a [[List of old Documentation pages]].'''
</div>

CURRENTLY THE CONTENT OF THE OLD WIKI FAQ (AND MORE) IS BEING MIGRATED TO THIS PAGE (TASK: DERJOHN)

__TOC__

{{Question|Question=What is a 'Guest'?||Details=To talk about stuff, we need some naming. The physical machine is called 'Host' and the 'main' context running the Host Distro is called 'Host Context'. The virtual machine/distro is called 'Guest' and basically is a Distribution (Userspace) running inside a 'Guest Context'.|Signature=derjohn}}

{{Question|Question=What kind of Operating System (OS) can I run as guest?||Details=
A: With VServer you can only run Linux guests. The trick is that a guest does not run a kernel of its own (as XEN and UML guests do); it merely uses a virtualized host kernel interface. VServer offers so-called security contexts, which make it possible to separate guests from each other, i.e. they cannot get data from each other. Imagine it as a chroot environment with much more security and features.|Signature=derjohn}}

{{Question|Question=Which distributions did you test?||Details=
A: Some. Check out the wiki for ready-made guest images. You can also easily build your own guest images, e.g. with Debian's debootstrap. Check out ((step-by-step Guide 2.6)) to see how to do that.|Signature=derjohn}}

{{Question|Question=Is VServer comparable to XEN/UML/QEMU?||Details=
A: Nope. XEN/UML/QEMU and VServer are just good friends. Because you ask, you probably know what XEN/UML/QEMU are. In contrast to XEN/UML/QEMU, VServer does not "emulate" any hardware you run a kernel on. You can run a VServer kernel in a XEN/UML/QEMU guest. This is confirmed to work at least with Linux 2.6/vs2.0.|Signature=derjohn}}

{{Question|Question=Is VServer secure?||Details=
A: We hope so. It should be at least as secure as Linux is. We consider it much, much more secure though.|Signature=derjohn}}

{{Question|Question=Performance?||Details=
A: For a single guest, we basically have native performance. Some tests showed insignificant overhead (about 1-2%); others ran faster than on an unpatched kernel. This is IMVHO significantly less than other solutions waste, especially if you have more than a single guest (because of the resource sharing).|Signature=derjohn}}

{{Question|Question=Is SMP Supported?||Details=
A: Yes, on all SMP capable kernel architectures.|Signature=derjohn}}

{{Question|Question=Resource sharing?||Details=
A: Yes ....
* memory: Dynamically.
* CPU usage: Dynamically (token bucket)|Signature=derjohn}}

{{Question|Question=Resource limiting?||Details=
A: Yes, you can set maximum limits per guest, but you can only offer guaranteed resource availability with some ticks at the time. There is the possibility to ulimit and to rlimit. Rlimit is a new feature of kernel 2.6/vs2.0.|Signature=derjohn}}

{{Question|Question=Disk I/O limiting? Is that possible?||Details=
A: Well, since vs2.1.1 linux-vserver supports a mechanism called 'I/O scheduling', which appeared in the 2.6 mainline some time ago. The mainline kernel offers several I/O schedulers:

<pre>
# cat /sys/block/hdc/queue/scheduler
noop [anticipatory] deadline cfq
</pre>

The default is anticipatory a.k.a. "AS". When running several guests on a host you probably want the I/O performance shared in a fair way among the different guests. The kernel comes with a "completely fair queueing" scheduler, CFQ, which can do that. (More on schedulers can be found at http://lwn.net/Articles/114770/)

This is how to set the scheduler to "cfq" manually:
<pre>
root# echo "cfq" > /sys/block/hdc/queue/scheduler
root# cat /sys/block/hdc/queue/scheduler
noop anticipatory deadline [cfq]
</pre>

Keep in mind that you have to do it on all physical discs. So if you run an md-softraid, do it to all physical /dev/hdXYZ discs!
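
Setting the scheduler on every disc can be scripted. The following is only a minimal sketch and not part of util-vserver: the function name <code>set_cfq_all</code> and its optional sysfs-root argument are made up for illustration, and it only touches devices whose scheduler file actually lists cfq.

```shell
#!/bin/sh
# Hypothetical helper: select cfq on every block device that offers it.
# $1 is the sysfs root (assumed default: /sys); another root allows a dry run.
set_cfq_all() {
    root="${1:-/sys}"
    for f in "$root"/block/*/queue/scheduler; do
        # Skip missing or read-only entries (also covers an unmatched glob).
        if [ -w "$f" ]; then
            case "$(cat "$f")" in
                *cfq*) echo cfq > "$f" ;;   # only switch if cfq is offered
            esac
        fi
    done
}
```

Run <code>set_cfq_all</code> as root to act on the real /sys, or pass a scratch directory with the same layout first to see what it would change.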

If you run Debian there is a predefined way to set the /sys values at boot-time:

<pre>
# apt-get install sysfsutils
[...]

# cat /etc/sysfs.conf | grep cfq
block/sda/queue/scheduler = cfq
block/sdc/queue/scheduler = cfq

# /etc/init.d/sysfsutils restart
</pre>

For non-vserver processes and CFQ you can set by which key the kernel decides about the fairness:

<pre>
cat /sys/block/hdc/queue/iosched/key_type
pgid [tgid] uid gid
</pre>
Hint: The 'key_type'-feature has been removed in the mainline kernel recently. Don't look for it any longer :(

The default is tgid, which means to share fairly among process groups. Think of every guest as being treated like its own process group. It's not possible to set a scheduler strategy within a guest. All processes belonging to the same guest are treated like "noop" within the guest. So: If you run apache and some ftp-server within the _same_ guest, there is no fair scheduling between them, but there is fair scheduling between the whole guest and all other guests.

And: It's possible to tune the scheduler parameters in several ways. Have a look at /sys/block/hdc/queue/....

You need a very recent version of VS devel, e.g. 2.1.1-rc18 can do it. Some older versions had that feature too, then it got lost and was reinvented. So: Go and get an rc18 - only in 'devel', not stable!|Signature=derjohn}}

{{Question|Question=Why isn't there a device /dev/bla? within a guest||Details=
A: Device nodes allow Userspace to access hardware (or virtual resources). Creating a device node inside the guest's namespace will give access to that device, so for security reasons, the number of 'given' devices is small.|Signature=derjohn}}

{{Question|Question=What is Unification (vunify)?||Details=
A: Unification is Hard Links on Steroids. Guests can 'share' common files (usually binaries and libraries) in a secure way, by creating hard links with special properties (immutable but unlinkable (removable)). The tool to identify common files and to unify them is called vunify.|Signature=derjohn}}

{{Question|Question=What is vhashify?||Details=
A: The successor of vunify, a tool which does unification based on hash values (which allows to find common files in arbitrary paths.)|Signature=derjohn}}

{{Question|Question=How do I manage a multi-guest setup with vhashify?||Details=
A: For 'vhashify', just do these once:

<pre>
mkdir /etc/vservers/.defaults/apps/vunify/hash /vservers/.hash
ln -0s /vservers/.hash /etc/vservers/.defaults/apps/vunify/hash/root
</pre>

Then, do this one line per vserver:

<pre>
mkdir /etc/vservers/<vservername>/apps/vunify # vhashify reuses vunify configuration
</pre>

The command 'ln' creates a link between two files. "ln -s" creates a symbolic link -- two files are linked by name. "ln -0s" uses a VServer extension to create a unified link.|Signature=derjohn}}

{{Question|Question=With which VS version should I begin?||Details=
A: If you are new to VServer I recommend trying 2.0.+. Take the "alpha utils" version 0.30.210. A well-running version of them recently appeared in Debian Sid. (It's a .210 at the time of writing).|Signature=derjohn}}

{{Question|Question=is there a way to implement "user/group quota" per VServer?||Details=
A: Yes, but not on a shared partition for now. You need to put the guest on a separate partition, setup a vroot device (to make the quota access secure), copy that into the guest, and adjust the mtab line inside the guest.|Signature=derjohn}}

{{Question|Question=what about "Quota" for a context?||Details=
A: Context quotas are now called Disk Limits (so that we can tell them apart from the user/group quotas :). They are supported out of the box (with vs2.0) for all major filesystems (Ext2/3, ReiserFS, XFS, JFS)|Signature=derjohn}}

{{Question|Question=Does it support IPv6?||Details=
A: Currently not. Some developer has to move his ... to reimplement this functionality from the V4 code (I read that on the ML ;)). Will probably be superseded by the ngnet (next generation networking) soon. There is a Wiki page regarding this: http://linux-vserver.org/IPv6|Signature=derjohn}}

{{Question|Question=I can't do all I want with the network interfaces inside the guest?||Details=
A: For now the networking is 'Host Business' -- the host is a router, and each guest is a server. You can set the capability ICMP_RAW in the context of the guest, or even the capability CAP_NET_RAW (which would even allow to sniff interfaces of other guests!). Likely to change with ngnet. |Signature=derjohn}}

{{Question|Question=Is there a web-based interface for vserver that will allow creation/deletion/configuration etc. of vserver guests?||Details=
A. [Update] Errrh, there is http://OpenVPS.org which is a set of scripts with a web-interface for webhosters/ISPs.
A. [Update] Errrh, there is http://Openvcp.org which is a distributed system (agent!) with a web-interface, with which you can build/remove guests! cool stuff! beta, try out!|Signature=derjohn}}

{{Question|Question=What is old-style and new-style config?||Details=
A. Old-style config refers to a single text-file that contains all the configuration settings. With new-style config the configuration is split into several directories and files. You should probably go for new-style config if you are asking.|Signature=derjohn}}

{{Question|Question=What is the "great flower page"?||Details=
A. Well, this page contains all configuration options for vserver in version > 1.9 (I think .. I joined Linux-VServer in version 2, so I don't know for sure). The name of the page is derived from the stylesheet(s) it contains: It displays background pictures of a very great flower, so regard it as highly optimized. It was designed by a non-designer, who asks us to create a better one. I played with the thought of creating a completely new theme for that page - but actually we all got used to the name "great flower page", so we stick to it. If you are unable to read it clearly, feel invited to join the IRC channel #vserver, we may tell you how to ;)|Signature=derjohn}}

{{Question|Question=How do I add several IPs to a vserver? ||Details=
A: First of all a single guest vserver only supports up to 16 IPs (There is a 64-IP patch available, which is in "derjohn's kernel", you need extra util-vserver anyway).
Here is a little helper-script that adds a list of IPs defined in a text file, one per line.
<pre>
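#!/bin/bash
# Creates one numbered interface directory per address in ./myiplist.
j=1
for i in $(cat myiplist); do
    j=$((j+1))                # numbering therefore starts at 2
    mkdir "$j"
    echo "$i" > "$j/ip"
    echo "$i" > "$j/ip-old"
    echo "24" > "$j/prefix"   # every address gets a /24 prefix
done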
</pre>|Signature=derjohn}}

{{Question|Question=If my host has only a single public IP, can I use RFC1918 IPs (e.g. 192.168.foo.bar) for the guest vservers?||Details=
A: Yes, use iptables with SNAT to masquerade it.
<pre>
iptables -t nat -I POSTROUTING -s $VSERVER_NETZ ! -d $VSERVER_NETZ -j SNAT --to $EXT_IP
</pre>
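
The rule above leaves <code>$VSERVER_NETZ</code> and <code>$EXT_IP</code> undefined. As a rough sketch, they could be filled in like this; the addresses are made-up example values and <code>snat_rule</code> is a hypothetical helper that only prints the command so it can be reviewed before it is run as root.

```shell
#!/bin/sh
# Example values only (assumptions): replace them with your own network.
VSERVER_NETZ="192.168.0.0/24"   # private network the guests live in
EXT_IP="203.0.113.10"           # the single public address of the host

# Hypothetical helper: print the SNAT rule for a guest network ($1)
# and a public address ($2) instead of applying it.
snat_rule() {
    echo "iptables -t nat -I POSTROUTING -s $1 ! -d $1 -j SNAT --to $2"
}

snat_rule "$VSERVER_NETZ" "$EXT_IP"
```

Traffic between guests is left alone by the "! -d" part; only packets leaving the private network get the public source address.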
See: HowtoPrivateNetworking and
http://www.tgunkel.de/it/software/doc/linux_server#h3-Vserver_Masquerading_SNAT (THX, [MUPPETS]Gonzo)|Signature=derjohn}}

{{Question|Question=If I shut down my vserver guest, the whole Internet interface ethX on the host is shut down. What happened? ||Details=
A: When you shut down a guest (''i.e. vserver foo stop''), the IP is brought down on the host also. If this IP happens to be the primary IP of the host, the kernel will not only bring down the primary IP, but also all secondary IP addresses. In very recent kernels, however, there is a settable option which prevents that nasty behaviour. It's called "alias promotion". You may set it via sysctl by adding ''net.ipv4.conf.all.promote_secondaries=1'' to /etc/sysctl.conf or via the sysctl command line.|Signature=derjohn}}

{{Question|Question=On Debian Sarge (stable) only util-vserver 0.30-204 is available, which has been reported to be buggy (I haven't checked that version for a long time). How do I compile a local version of alpha util-vserver .210 on Debian?||Details=
A:
<pre>
apt-get build-dep util-vserver

./configure --prefix=/usr/local/ --enable-release \
--mandir=/usr/local/share/man \
--infodir=/usr/local/share/info \
--sysconfdir=/etc --enable-dietlibc \
--localstatedir=/var \
--with-vrootdir=/var/lib/vservers

make

make install-distribution
(Which does a make install + setting a symlink ln -s /usr/local/lib/util-vserver/vshelper /sbin/vshelper )

</pre>

To test which version you are running:
<pre>
# which vserver
/usr/local/sbin/vserver

</pre>

This should point to ..local...

If you don't want to build it yourself: On www.backports.org there are backported (for sarge) linux-images (2.6.16) with the vserver patch enabled and an updated util-vserver package as well.
|Signature=derjohn}}

{{Question|Question=I use derjohn's kernel or a different kernel with a more-than-16-IPs-per-guest-patch and can't use more than 16 IPs. Why?||Details=
A: You need to patch util-vserver, too. So you obviously need to recompile util-vserver (see above). In the util-vserver directory there are header files in the ./kernel/ directory. Patch like this:

<pre>
kernel/network.h:#define NB_IPV4ROOT 64
</pre>

BTW: The initial patches can be found here: http://vserver.13thfloor.at/Experimental/VARIOUS/util-vserver-0.30.196-net64.diff.bz2 and http://vserver.13thfloor.at/Experimental/VARIOUS/delta-2.6.9-vs1.9.3-net64.diff
|Signature=unknown}}

{{Question|Question=I run a Debian host and want to build an Ubuntu guest. Howto?||Details=
A: Simple ;) Assume you want to build a breezy guest on a sid host with IP 192.168.0.2 and hostname vubuntu, then do:
<pre>
vserver vubuntu build --force -m debootstrap --hostname vubuntu.myvservers.net --netdev eth0 --interface 192.168.0.2/24 \
--context 42 -- -d breezy -m http://de.archive.ubuntu.com/ubuntu
</pre>

[UPDATE] Currently there are problems in building breezy under unclear circumstances, which seem to be related to udev. If the above didn't work, try:
<pre>
vserver vubuntu build --force -m debootstrap --hostname vubuntu.myvservers.net --netdev eth0 --interface 192.168.0.2/24 \
--context 42 -- -d breezy -m http://de.archive.ubuntu.com/ubuntu -- --exclude=udev
</pre>
In very recent versions of the utils, the problem should not occur anymore (it has to do with the 'secure-mount' if you look in the MLs)

Well, sid's debootstrap knows how to bootstrap Ubuntu linux. Make sure to have a current debootstrap package:
<pre>
apt-get update
apt-get install debootstrap
</pre>
The knowledge how to build ubuntu 'breezy badger' (which you probably want to be your guest at the time of writing) has been added recently.|Signature=derjohn}}

{{Question|Question=How do I make a vserver guest start by default?||Details=
A: At least on Debian, I can tell you how to do it with the new-style config. If your guest is called "derjohn" and you want it to be started somewhere at the end of your boot process, then do:
<pre>
echo "default" > /etc/vservers/derjohn/apps/init/mark
</pre>
If you want to start it earlier, please read the init script "/etc/init.d/vserver-default" to find out how to do it. In most cases you don't need to change this. On Debian the vservers are started at "90", so after most other stuff is up (networking etc.).

Besides that I created a small helper script for managing the autostart foo: ((vserver-autostart))|Signature=derjohn}}

{{Question|Question=My host works, but when I start a guest it says that it has a problem with chbind.||Details=
A: You are probably using util-vserver <= 0.30.209, which uses dynamic network contexts internally (this changed with 0.30.210). So if you compiled your kernel without dynamic contexts, you can start guests, but you can't use the network context. The solution is either to switch to the .210 utils (or Hollow's toolset) or to compile the kernel with dynamic network contexts.
SE Keyword: invalid option `nid' testme.sh|Signature=derjohn}}

{{Question|Question=When I try to ssh to the guest, I log into the host, even if I installed sshd on the guest. What's wrong here?||Details=
A: Look at /etc/ssh/sshd_config of the host:

<pre>
Port 22
# Use these options to restrict which interfaces/protocols sshd will bind to
#ListenAddress ::
</pre>

And now change the setting to
<pre>
Port 22
# Use these options to restrict which interfaces/protocols sshd will bind to
ListenAddress your.hosts.ip.here # not the guests IP!
</pre>

Then run '/etc/init.d/ssh restart' on the host and, after that, on the guest (if you already did apt-get install ssh on the guest).

To explain: the host's sshd binds to all available IP addresses on port 22 (the host 'sees' all addresses of the guests, too!). So when the guest starts its sshd, it can't bind to port 22 any more. You need to change that setting only on the host.
(BTW: A similar approach is needed for a lot of daemons, e.g. Apache. If the daemon does not support an explicit bind, you may use the chbind command to 'hide' IP addresses from the daemon before starting.)|Signature=derjohn}}

{{Question|Question=I did everything right, but the application foo does not start. What's up there?||Details=
A: Before asking on the IRC channel, please check out the 'problematic programs' page:
[[Problematic Programs]]|Signature=derjohn}}

{{Question|Question=Bind9 does not like to start in my guest.||Details=
A: Check out the 'problematic programs' page:
((ProblematicPrograms)) and/or get my [http://linux-vserver.derjohn.de/bind9-packages/bind9-capacheck_9.3.2-2_i386.deb vserver-guest-ready Debian package] for Debian Sid guests from that URL: http://linux-vserver.derjohn.de/bind9-packages/bind9-capacheck_9.3.2-2_i386.deb and check out the [http://linux-vserver.derjohn.de/bind9-packages/README.txt readme]. (Hint: This is fresh stuff. Then give me feedback.)

[UPDATE] Since VServer Devel 2.1.1-rc18 you do not need to patch the userland tools anymore. The capabilities are masked.|Signature=derjohn}}

{{Question|Question=Which guest vservers are running?||Details=
A: <code>vserver-stat</code>. Example output:
<pre>
CTX PROC VSZ RSS userTIME sysTIME UPTIME NAME
0 77 965.1M 334.6M 14m14s18 2m28s69 1h33m46 root server
49152 7 14M 5.2M 0m00s40 0m00s30 1h30m15 chiffon
</pre>|Signature=derjohn}}

{{Question|Question=How can I reboot/halt guests?||Details=
A: It depends.
For a vserver with legacy-interfaces support, you have to replace <code>/sbin/halt</code> in the guests with vreboot and start rebootmgr on the host. You also need to have a dummy <guest>.conf file in /etc/vservers for each guest. Please have a look at /etc/init.d/rebootmgr.
A vserver with the native interface utilizes /dev/initctl. No changes are needed in the guests. Just make sure that the REBOOT capability is adjusted in the guests.|Signature=derjohn}}

{{Question|Question=Do I really need the legacy-interfaces? What are these legacy-interfaces?||Details=
A: Since vserver is an ongoing project, new features might replace old ones, and some might still be in development. Legacy interfaces are available for backward compatibility (and might be removed someday). See Q: How can I reboot/halt guests?|Signature=derjohn}}

{{Question|Question= I have a vserver running on a Linux kernel with preemption. Is VServer "preempt" safe?||Details=
A: There are no known issues with running vserver on a preemption-enabled kernel. I would like to add that the vserver kernel hackers would probably exclude that option in 'make menuconfig' if there were an incompatibility. Just my $.02 :)|Signature=derjohn}}

{{Question|Question=Is this a new project? When was it started?||Details=
A: The first public occurrence of linux-vserver was in Oct 2001. The initial mail can be found here: http://www.cs.helsinki.fi/linux/linux-kernel/2001-40/1065.html
So you can expect a mature software product which does its magic quite well (And hey, we have a version > 2.0 ! )|Signature=derjohn}}

{{Question|Question=Can I run an OpenVPN Server in a guest?||Details=
A: Yes. I don't want to provide an in-depth OpenVPN tutorial, but I want to show how I made OpenVPN work in a guest as a server. I was not able to run it with a tun device, due to a buglet in util-vserver and the kernel when it comes to setting an IP address on a point-to-point link: If you add "ip addr add <ip> peer <mypeer> dev tun0" there is no way to map the tun0 interface into a guest, not even with a 'nodev' option. (bug confirmed to be reproducible by daniel_hoczac)

First of all you have to prepare the host with a persistent tuntap interface in tap-mode. The tools we need come from the uml-utilities.
Then you need to create a device /dev/net/tun, which the OpenVPN userspace daemon reads. We'll assume 10.10.10.100 is the server IP and 10.10.10.101 is the client IP - to be cool we choose a /31 netmask (255.255.255.254), so we have a net without broadcast and don't waste IPs :)

On the host do:
<pre>
# apt-get install uml-utilities
# cd /var/lib/vservers/<myopenvpnserver>/dev/
# ./MAKEDEV tun
(creates the dev/net/tun device accessible by the guest - even a tap interface needs /dev/net/tun !)
# tunctl -t tap0
(creates the network device 'tap0' persistently)
</pre>

Then add the ip to the guest:
<pre>
# cat /etc/vservers/<myopenvpnserver>/interfaces/1/ip
10.10.10.100
# cat /etc/vservers/<myopenvpnserver>/interfaces/1/prefix
31
# cat /etc/vservers/<myopenvpnserver>/interfaces/1/dev
tap0
(This kind of config brings the IP up when the vserver is started - only the tap0 interface has to exist already, see above!)
</pre>

Here is a sample config for the guest (which is acting as a server):

Install OpenVPN package on server and client, in the Debian case:
<pre>
# apt-get install openvpn
</pre>

The server's conf looks like that:
<pre>
# port and interface specs

# behave like a ssl-webserver
port 443
proto tcp-server

# tap device? (keep in mind you need /dev/net/tun !)
dev tap0

# now the ips we will use for the tunnel
ifconfig 10.10.10.100 255.255.255.254
ifconfig-noexec

# the server part

# Keep VPN connections, even if the client IP changes
float

# use compression (may also even obfuscate content filters)
comp-lzo

# use a static key - create it with 'openvpn --genkey --secret static.key'
secret static.key

# dont reload the key after a SIGUSR1
persist-key

# check alive all 10 secs
keepalive 10 60

# verbosity level (from 1 to 9, 9 is max log level)
verb 4
status openvpn-status.log
</pre>

The client's conf may look like this (this example even makes the tunnel the client's default route):
<pre>
# cat /etc/openvpn/client.conf
# port and interface specs

# the following is not necessary, if you bring up openvpn via Debian's init script:
daemon ovpn-my-clients-name

# behave like a ssl-webserver
port 443
proto tcp-client
remote %%%<insert-the-guest-primary-public-ip-here>%%%%
# what device, tun or tap?
dev tap

# now the ips we will use for the tunnel
ifconfig 10.10.10.101 255.255.255.254

# Keep VPN connections, even if the client IP changes
float
mssfix

# use compression (may also even obfuscate content filters)
comp-lzo

# use a static key
secret static.key

# dont reload the key after a SIGUSR1
persist-key

# check alive all 10 secs
keepalive 10 60

# verbosity level (from 1 to 9, 9 is max log level)
verb 4

# set the default route
route-gateway 10.10.10.100
redirect-gateway def1
# to add special routes you can do it within the openvpn client conf:
# route <dest> <mask> <gateway>

# if you need to connect via proxy (like squid)
# http-proxy s p [up] [auth] : Connect to remote host through an HTTP proxy at
# address s and port p. If proxy authentication is required,
# up is a file containing username/password on 2 lines, or
# 'stdin' to prompt from console. Add auth='ntlm' if
# the proxy requires NTLM authentication.

# http-proxy s p [up] [auth]

# http-proxy-option type [parm] : Set extended HTTP proxy options.
# Repeat to set multiple options.
# VERSION version (default=1.0)
# AGENT user-agent

# http-proxy-option type [parm]
</pre>

In the next lesson I will talk about OpenVPN's server mode, which can deal with multiple clients connecting to one IP and one port (i.e. you only need one guest for tons of 'roadwarriors'), TLS connections and PKI.

Contributions welcome. :)|Signature=derjohn}}

{{Question|Question=32 vs 64 Bit? What should I take?||Details=
A: If you have the choice, make the host a 64-bit one. You can run a guest as 32-bit or as 64-bit on a 64-bit host. To run it as 32-bit, you need to compile the x86_64 (a.k.a. AMD64) kernel with the following options:

<pre>
[*] Kernel support for ELF binaries
<M> Kernel support for MISC binaries
[*] IA32 Emulation <---- without that, the entire 32bit API is not present
<M> IA32 a.out support
</pre>

You can force the guest to behave like a 32-bit environment like this:
<pre>
echo linux_32bit > /etc/vservers/$NAME/personality
echo i686 > /etc/vservers/$NAME/uts/machine
</pre>
(thanks cehteh for the hint!)

You can also force debootstrap to put 32-bit binaries into the guest with 'export ARCH=i386':
<pre>
export ARCH=i386 ; vserver build ....
</pre>|Signature=derjohn}}

{{Question|Question=I want to (re)mount a partition in a running guest ... but the guest has no rights (capability) to (re)mount?||Details=
A: I'll explain with an example. Say the /tmp partition within the guest is too small, which is likely the case if you stay with the 16 MB default (vserver build mounts /tmp as a 16 MB tmpfs!).
<pre>
# vnamespace -e XID mount -t tmpfs -o remount,size=256m,mode=1777 none /var/lib/vservers/<guest>/tmp/
</pre>
Be warned that the guest will not recognize the change, as the /etc/mtab file is not updated when you mount like this. To change the mount permanently, edit /etc/vservers/<guest>/fstab on the host.|Signature=derjohn}}

{{Question|Question=How do I limit a guest's RAM? I want to prevent OOM situations on the host!||Details=
A: First you can read [http://linux-vserver.org/Memory+Allocation].
If you want a recipe, do this:
1. Check the size of memory pages. On x86 and x86_64 it is usually 4 KB per page.
2. Create /etc/vservers/<guest>/rlimits/
3. Check your physical memory size on the host in kilobytes, e.g. with "free -k". maxram = kilobytes/pagesize.
4. Limit the guest's physical RAM to a value smaller than maxram:

<pre>
echo %%insertYourPagesHereSmallerThanMaxram%% > /etc/vservers/<guest>/rlimits/rss
</pre>

5. Check your swapspace, e.g. with 'swapon -s'. maxswap = swapkilobytes/pagesize.
6. Limit the guest's maximum number of as pages to a value smaller than (maxram+maxswap):

<pre>
echo %%desiredvalue%% > /etc/vservers/<guest>/rlimits/as
</pre>
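
The arithmetic in steps 1 to 6 can be sketched in shell. This is only an illustration under stated assumptions: <code>kb_to_pages</code> and <code>host_page_limits</code> are made-up helper names, and the totals are read from a Linux /proc/meminfo (which reports kB) instead of from "free" and "swapon -s".

```shell
#!/bin/sh
# Hypothetical helper: kilobytes -> pages (integer division, rounds down).
kb_to_pages() {
    echo $(( $1 / $2 ))
}

# Print the upper bounds for the rss and as limits of this host, in pages.
# Assumes /proc/meminfo with MemTotal and SwapTotal given in kB.
host_page_limits() {
    page_kb=$(( $(getconf PAGESIZE) / 1024 ))
    ram_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
    swap_kb=$(awk '/^SwapTotal:/ {print $2}' /proc/meminfo)
    maxram=$(kb_to_pages "$ram_kb" "$page_kb")
    maxswap=$(kb_to_pages "$swap_kb" "$page_kb")
    echo "rss limit must be below $maxram pages"
    echo "as limit must be below $(( maxram + maxswap )) pages"
}

if [ -r /proc/meminfo ]; then
    host_page_limits
fi
```

For example, with 1 GB of RAM (1048576 kB) and 4 kB pages, <code>kb_to_pages 1048576 4</code> gives 262144, so the rss value has to be smaller than that.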

It should be clear that this can still lead to OOM situations. Example: You have two guests and your as limit per guest is greater than 50% of (maxram+maxswap). If both guests request their maximum at the same point in time, there will not be enough memory.|Signature=derjohn}}

{{Question|Question=Where can I get newer versions of VServer as ready-made packages for Debian?||Details=
A: Here you go: http://linux-vserver.derjohn.de/ . There is also some stuff on backports.org, but my kernels are always 'devel' branch.|Signature=derjohn}}

{{Question|Question=Can I use iptables ?||Details=
Yes but right now only on the host (rootserver). Please realize that all traffic is local and will not touch the forward chain.|Signature=BeginnerFAQ}}

{{Question|Question=Trying to connect to a vserver from the host or another vserver on the same host fails||Details=
strace shows
<pre>
sin_addr=inet_addr("xx.xx.xx.xx")}, yy) = -1 EINVAL (Invalid argument)
</pre>
A: The vserver/master cannot communicate with another vserver on the same host.
* check all netmasks on all interfaces (do they overlap) ?
* check policy routing (disable it temporarily) ?
* check that lo is up (Networking within a host/vserver always uses lo interface)
|Signature=CommonProblems}}

{{Question|Question=#1 ERROR: capset(): Operation not permitted||Details=capabilities are not enabled in kernel-setup
please check that CONFIG_SECURITY_CAPABILITIES is loaded or included in the kernel. ( check with "cat /path_to_kernel/.config | grep -i cap ")
(2.6.11.5-vs-1.9.5 + 0.30-205)|Signature=IrcQuestions}}

{{Question|Question=How can I make 'vserver start' mount the root filesystem||Details=
mount it via /etc/vservers/vserver-name/fstab, make sure to set the option 'dev' e.g.:
<pre>/dev/drbd0 / xfs rw,dev 0 0</pre>
util-vserver 210 won't be able to find some scripts for the reboot, add into /etc/vservers/vserver-name/apps/init/cmd.stop
<pre>/etc/init.d/rc
6</pre>
|Signature=AdrianReyer}}

Problematic Programs

2006-09-20T11:15:09Z

Meandtheshell: ported more stuff but still not finished yet ...

Frequently Asked Questions

2006-09-19T18:08:53Z

Meandtheshell: made an internal Link to "Probelmatic Programs" (note the blank since "ProblematicPrograms" links to the old wiki)

<div style="margin: 2em auto 2em auto; padding: 10px; background-color: #F9ECCD; border: 1px solid #004433; text-align: center;">
[[Image:Icon-Caution.png|left]]
We currently migrate to MediaWiki from our old installation, but not all content has been migrated yet. Take a look at the [[Wiki Team]] page for instructions how to help or look at the [http://oldwiki.linux-vserver.org old wiki] to find the information not migrated yet.

'''To ease migration we created a [[List of old Documentation pages]].'''
</div>

CURRENTLY THE CONTENT OF THE OLD WIKI FAQ (AND MORE) IS BEING MIGRATED TO THIS PAGE (TASK: DERJOHN)

__TOC__

{{Question|Question=What is a 'Guest'?||Details=To talk about stuff, we need some naming. The physical machine is called 'Host' and the 'main' context running the Host Distro is called 'Host Context'. The virtual machine/distro is called 'Guest' and basically is a Distribution (Userspace) running inside a 'Guest Context'.|Signature=derjohn}}

{{Question|Question=What kind of Operating System (OS) can I run as guest?||Details=
A: With VServer you can only run Linux guests. The trick is that a guest does not run a kernel of its own (as XEN and UML do); it merely uses a virtualized host kernel interface. VServer offers so-called security contexts, which make it possible to separate guests from each other, i.e. they cannot get data from each other. Imagine it as a chroot environment with much more security and many more features.|Signature=derjohn}}

{{Question|Question=Which distributions did you test?||Details=
A: Some. Check out the wiki for ready-made guest images. But you can easily build your own guest images, e.g. with Debian's debootstrap. Check out ((step-by-step Guide 2.6)) to see how to do that.|Signature=derjohn}}

{{Question|Question=Is VServer comparable to XEN/UML/QEMU?||Details=
A: Nope. XEN/UML/QEMU and VServer are just good friends. Because you ask, you probably know what XEN/UML/QEMU are. In contrast to XEN/UML/QEMU, VServer does not "emulate" any hardware to run a kernel on. You can run a VServer kernel in a XEN/UML/QEMU guest. This is confirmed to work at least with Linux 2.6/vs2.0.|Signature=derjohn}}

{{Question|Question=Is VServer secure?||Details=
A: We hope so. It should be at least as secure as Linux is. We consider it much, much more secure though.|Signature=derjohn}}

{{Question|Question=Performance?||Details=
A: For a single guest, we basically have native performance. Some tests showed insignificant overhead (about 1-2%), while others ran faster than on an unpatched kernel. This is IMVHO significantly less than other solutions waste, especially if you have more than a single guest (because of the resource sharing).|Signature=derjohn}}

{{Question|Question=Is SMP Supported?||Details=
A: Yes, on all SMP capable kernel architectures.|Signature=derjohn}}

{{Question|Question=Resource sharing?||Details=
A: Yes ....
* memory: Dynamically.
* CPU usage: Dynamically (token bucket)|Signature=derjohn}}

{{Question|Question=Resource limiting?||Details=
A: Yes, you can set maximum limits per guest, but at the moment you can only offer guaranteed resource availability with some tricks. Both ulimit and rlimit can be used. Rlimit is a new feature of kernel 2.6/vs2.0.|Signature=derjohn}}

{{Question|Question=Disk I/O limiting? Is that possible?||Details=
A: Well, since vs2.1.1 linux-vserver supports a mechanism called 'I/O scheduling', which appeared in the 2.6 mainline some time ago. The mainline kernel offers several I/O schedulers:

<pre>
# cat /sys/block/hdc/queue/scheduler
noop [anticipatory] deadline cfq
</pre>

The default is anticipatory a.k.a. "AS". When running several guests on a host you probably want the I/O performance shared in a fair way among the different guests. The kernel comes with a "completely fair queueing" scheduler, CFQ, which can do that. (More on schedulers can be found at http://lwn.net/Articles/114770/)

This is how to set the scheduler to "cfq" manually:
<pre>
root# echo "cfq" > /sys/block/hdc/queue/scheduler
root# cat /sys/block/hdc/queue/scheduler
noop anticipatory deadline [cfq]
</pre>

Keep in mind that you have to do this on all physical disks. So if you run an md software RAID, do it for all physical /dev/hdXYZ disks!

If you run Debian there is a predefined way to set the /sys values at boot-time:

<pre>
# apt-get install sysfsutils
[...]

# cat /etc/sysfs.conf | grep cfq
block/sda/queue/scheduler = cfq
block/sdc/queue/scheduler = cfq

# /etc/init.d/sysfsutils restart
</pre>

For non-vserver processes and CFQ you can set by which key the kernel decides about the fairness:

<pre>
cat /sys/block/hdc/queue/iosched/key_type
pgid [tgid] uid gid
</pre>
Hint: The 'key_type'-feature has been removed in the mainline kernel recently. Don't look for it any longer :(

The default is tgid, which means sharing fairly among process groups. Think of every guest as being treated like its own process group. It is not possible to set a scheduler strategy within a guest: all processes belonging to the same guest are treated as if "noop" were in effect within the guest. So if you run apache and an ftp server within the _same_ guest, there is no fair scheduling between them, but there is fair scheduling between the whole guest and all other guests.

And: It's possible to tune the scheduler parameters in several ways. Have a look at /sys/block/hdc/queue/....

You need a very recent version of VS devel, e.g. 2.1.1-rc18 can do it. Some older versions had that feature too, then it got lost and was reinvented. So go and get rc18, which is available only in 'devel', not stable!|Signature=derjohn}}
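The advice above to switch every physical disk to CFQ can be sketched as a small shell function. This is only an illustration: the function name <code>set_scheduler</code> and the <code>SYSFS_ROOT</code> variable are made up for this sketch, and the disk names in the usage note are examples rather than a recommendation for your hardware.

```shell
#!/bin/bash
# Hypothetical helper (not part of util-vserver): write one I/O scheduler
# name into the sysfs scheduler file of each listed disk.
# SYSFS_ROOT is an assumption of this sketch; it defaults to /sys and can be
# pointed at a scratch directory for a dry run.
set_scheduler() {
    local sched="$1"
    shift
    local root="${SYSFS_ROOT:-/sys}"
    local dev file
    for dev in "$@"; do
        file="$root/block/$dev/queue/scheduler"
        if [ ! -e "$file" ]; then
            # Unknown disk name: report it and carry on with the next one.
            echo "skipping $dev: $file not found" >&2
            continue
        fi
        echo "$sched" > "$file" || return 1
    done
}
```

For an md software RAID built from hda and hdc, calling <code>set_scheduler cfq hda hdc</code> as root would write "cfq" to the scheduler files of both disks. Reading the files back afterwards shows whether the kernel accepted the value.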

{{Question|Question=Why isn't there a device /dev/bla? within a guest||Details=
A: Device nodes allow Userspace to access hardware (or virtual resources). Creating a device node inside the guest's namespace will give access to that device, so for security reasons, the number of 'given' devices is small.|Signature=derjohn}}

{{Question|Question=What is Unification (vunify)?||Details=
A: Unification is Hard Links on Steroids. Guests can 'share' common files (usually binaries and libraries) in a secure way, by creating hard links with special properties (immutable but unlinkable (removable)). The tool to identify common files and to unify them is called vunify.|Signature=derjohn}}

{{Question|Question=What is vhashify?||Details=
A: The successor of vunify, a tool which does unification based on hash values (which makes it possible to find common files in arbitrary paths).|Signature=derjohn}}

{{Question|Question=How do I manage a multi-guest setup with vhashify?||Details=
A: For 'vhashify', just do these once:

<pre>
mkdir /etc/vservers/.defaults/apps/vunify/hash /vservers/.hash
ln -0s /vservers/.hash /etc/vservers/.defaults/apps/vunify/hash/root
</pre>

Then, do this one line per vserver:

<pre>
mkdir /etc/vservers/<vservername>/apps/vunify # vhashify reuses vunify configuration
</pre>

The command 'ln' creates a link between two files. "ln -s" creates a symbolic link, i.e. two files are linked by name. "ln -0s" uses a Vserver extension to create a unified link.|Signature=derjohn}}

{{Question|Question=With which VS version should I begin?||Details=
A: If you are new to VServer I recommend trying 2.0.+. Take the "alpha utils" version 0.30.210. A well-running version of them recently appeared in Debian Sid (it's a .210 at the time of writing).|Signature=derjohn}}

{{Question|Question=is there a way to implement "user/group quota" per VServer?||Details=
A: Yes, but not on a shared partition for now. You need to put the guest on a separate partition, setup a vroot device (to make the quota access secure), copy that into the guest, and adjust the mtab line inside the guest.|Signature=derjohn}}

{{Question|Question=what about "Quota" for a context?||Details=
A: Context quotas are now called Disk Limits (so that we can tell them apart from the user/group quotas :). They are supported out of the box (with vs2.0) for all major filesystems (Ext2/3, ReiserFS, XFS, JFS)|Signature=derjohn}}

{{Question|Question=Does it support IPv6?||Details=
A: Currently not. Some developer has to move his ... to reimplement this functionality from the V4 code (I read that on the ML ;)). Will probably be superseded by the ngnet (next generation networking) soon. There is a Wiki page regarding this: http://linux-vserver.org/IPv6|Signature=derjohn}}

{{Question|Question=I can't do all I want with the network interfaces inside the guest?||Details=
A: For now the networking is 'Host Business' -- the host is a router, and each guest is a server. You can set the capability ICMP_RAW in the context of the guest, or even the capability CAP_NET_RAW (which would even allow to sniff interfaces of other guests!). Likely to change with ngnet. |Signature=derjohn}}

{{Question|Question=Is there a web-based interface for vserver that will allow creation/deletion/configuration etc. of vserver guests?||Details=
A. [Update] Errrh, there is http://OpenVPS.org which is a set of scripts with a web-interface for webhosters/ISPs.
A. [Update] Errrh, there is http://Openvcp.org which is a distributed system (agent!) with a web-interface, with which you can build/remove guests! cool stuff! beta, try out!|Signature=derjohn}}

{{Question|Question=What is old-style and new-style config?||Details=
A. Old-style config refers to a single text-file that contains all the configuration settings. With new-style config the configuration is split into several directories and files. You should probably go for new-style config if you are asking.|Signature=derjohn}}

{{Question|Question=What is the "great flower page"?||Details=
A. Well, this page contains all configuration options for vserver in version > 1.9 (I think .. I joined Linux-VServer in version 2, so I don't know for sure). The name of the page is derived from the stylesheet(s) it contains: It displays background pictures of a very great flower, so regard it as highly optimized. It was designed by a non-designer, who asks us to create a better one. I played with the thought of creating a completely new theme for that page, but actually we all got used to the name "great flower page", so we stick to it. If you are unable to read it clearly, feel invited to join the IRC channel #vserver, we may tell you how to ;)|Signature=derjohn}}

{{Question|Question=How do I add several IPs to a vserver? ||Details=
A: First of all a single guest vserver only supports up to 16 IPs (There is a 64-IP patch available, which is in "derjohn's kernel", you need extra util-vserver anyway).
Here is a little helper-script that adds a list of IPs defined in a text file, one per line.
<pre>
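#!/bin/bash
# Create one numbered directory per IP listed in ./myiplist (one IP per line).
# The directories are created in the current directory, numbered from 2 upwards.
j=1
while read -r ip; do
    # skip empty lines
    if [ -z "$ip" ]; then continue; fi
    j=$((j + 1))
    mkdir "$j"
    echo "$ip" > "$j/ip"
    echo "$ip" > "$j/ip-old"
    # every address gets a /24 prefix
    echo "24" > "$j/prefix"
done < myiplist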
</pre>|Signature=derjohn}}

{{Question|Question=If my host has only a single public IP, can I use RFC1918 IPs (e.g. 192.168.foo.bar) for the guest vservers?||Details=
A: Yes, use iptables with SNAT to masquerade it.
<pre>
iptables -t nat -I POSTROUTING -s $VSERVER_NETZ ! -d $VSERVER_NETZ -j SNAT --to $EXT_IP
</pre>
See: HowtoPrivateNetworking and
http://www.tgunkel.de/it/software/doc/linux_server#h3-Vserver_Masquerading_SNAT (THX, [MUPPETS]Gonzo)|Signature=derjohn}}

{{Question|Question=If I shut down my vserver guest, the whole Internet interface ethX on the host is shut down. What happened? ||Details=
A: When you shut down a guest (''i.e. vserver foo stop''), its IP is brought down on the host as well. If this IP happens to be the primary IP of the host, the kernel will bring down not only the primary IP but also all secondary IP addresses. Very recent kernels have a setting that prevents this nasty behavior; it is called "alias promotion". You can enable it by adding ''net.ipv4.conf.all.promote_secondaries=1'' to /etc/sysctl.conf or via the sysctl command line.|Signature=derjohn}}

{{Question|Question=On Debian Sarge (stable) only util-vserver 0.30-204 is available, which has been reported to be buggy (I haven't checked that version for a long time). How do I compile a local version of alpha util-vserver .210 on Debian?||Details=
A:
<pre>
apt-get build-dep util-vserver

./configure --prefix=/usr/local/ --enable-release \
--mandir=/usr/local/share/man \
--infodir=/usr/local/share/info \
--sysconfdir=/etc --enable-dietlibc \
--localstatedir=/var \
--with-vrootdir=/var/lib/vservers

make

make install-distribution
(Which does a make install + setting a symlink ln -s /usr/local/lib/util-vserver/vshelper /sbin/vshelper )

</pre>

To test which version you are running:
<pre>
# which vserver
/usr/local/sbin/vserver

</pre>

This should point to a path under /usr/local.

If you don't want to build it yourself: www.backports.org offers backported (for sarge) linux-images (2.6.16) with the vserver patch enabled, and an updated util-vserver package as well.
|Signature=derjohn}}

{{Question|Question=I use derjohn's kernel or a different kernel with a more-than-16-IPs-per-guest patch and can't use more than 16 IPs. Why?||Details=
A: You need to patch util-vserver, too. So you obviously need to recompile util-vserver (see above). In the util-vserver directory there are header files in the ./kernel/ directory. Patch like this:

<pre>
kernel/network.h:#define NB_IPV4ROOT 64
</pre>

BTW: The initial patches can be found here: http://vserver.13thfloor.at/Experimental/VARIOUS/util-vserver-0.30.196-net64.diff.bz2 and http://vserver.13thfloor.at/Experimental/VARIOUS/delta-2.6.9-vs1.9.3-net64.diff
|Signature=unknown}}

{{Question|Question=I run a Debian host and want to build an Ubuntu guest. Howto?||Details=
A: Simple ;) Assume you want to build a breezy guest on a sid host with IP 192.168.0.2 and hostname vubuntu, then do:
<pre>
vserver vubuntu build --force -m debootstrap --hostname vubuntu.myvservers.net --netdev eth0 --interface 192.168.0.2/24 \
--context 42 -- -d breezy -m http://de.archive.ubuntu.com/ubuntu
</pre>

[UPDATE] Currently there are problems building breezy under unclear circumstances, which seem to be related to udev. If the above didn't work, try:
<pre>
vserver vubuntu build --force -m debootstrap --hostname vubuntu.myvservers.net --netdev eth0 --interface 192.168.0.2/24 \
--context 42 -- -d breezy -m http://de.archive.ubuntu.com/ubuntu -- --exclude=udev
</pre>
In very recent versions of the utils, the problem should not occur anymore (it has to do with the 'secure-mount' if you look in the MLs)

Well, sid's debootstrap knows how to bootstrap Ubuntu linux. Make sure to have a current debootstrap package:
<pre>
apt-get update
apt-get install debootstrap
</pre>
The knowledge how to build ubuntu 'breezy badger' (which you probably want to be your guest at the time of writing) has been added recently.|Signature=derjohn}}

{{Question|Question=How do I make a vserver guest start by default?||Details=
A: At least on Debian, I can tell you how to do it with the new-style config. If your guest is called "derjohn" and you want it to be started somewhere at the end of your boot process, then do:
<pre>
echo "default" > /etc/vservers/derjohn/apps/init/mark
</pre>
If you want to start it earlier, please read the init script "/etc/init.d/vserver-default" to find out how to do it. In most cases you don't need to change this. On Debian the vservers are started at "90", so after most other stuff is up (networking etc.).

Besides that I created a small helper script for managing the autostart foo: ((vserver-autostart))|Signature=derjohn}}
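As a rough illustration of the mark file described above, the following sketch wraps it in two shell functions. It is not the vserver-autostart script mentioned in the answer; the function names and the <code>VSERVER_CONF</code> variable are assumptions made for this example only.

```shell
#!/bin/bash
# Hypothetical helpers, not the vserver-autostart script mentioned above.
# VSERVER_CONF is an assumption of this sketch (default: /etc/vservers).

# Mark a guest for automatic start by writing "default" to its init mark file.
mark_autostart() {
    local guest="$1" conf="${VSERVER_CONF:-/etc/vservers}"
    if [ ! -d "$conf/$guest" ]; then
        echo "no such guest configuration: $conf/$guest" >&2
        return 1
    fi
    mkdir -p "$conf/$guest/apps/init" || return 1
    echo "default" > "$conf/$guest/apps/init/mark"
}

# Print the names of all guests whose mark file contains "default".
list_autostart() {
    local conf="${VSERVER_CONF:-/etc/vservers}" mark guest
    for mark in "$conf"/*/apps/init/mark; do
        [ -f "$mark" ] || continue
        if [ "$(cat "$mark")" = "default" ]; then
            guest="${mark#"$conf"/}"   # strip the configuration directory
            echo "${guest%%/*}"        # keep only the guest name
        fi
    done
}
```

Calling <code>mark_autostart derjohn</code> has the same effect as the echo command shown in the answer, and <code>list_autostart</code> gives a quick overview of which guests carry the mark.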

{{Question|Question=My host works, but when I start a guest it says that it has a problem with chbind.||Details=
A: You are probably using util-vserver <= 0.30.209, which uses dynamic network contexts internally (this changed with 0.30.210). So if you compiled your kernel without dynamic contexts, you can start guests, but you can't use the network context. The solution is either to switch to the .210 utils (or Hollow's toolset) or to compile the kernel with dynamic network contexts.
SE Keyword: invalid option `nid' testme.sh|Signature=derjohn}}

{{Question|Question=When I try to ssh to the guest, I log into the host, even if I installed sshd on the guest. What's wrong here?||Details=
A: Look at /etc/ssh/sshd_config of the host:

<pre>
Port 22
# Use these options to restrict which interfaces/protocols sshd will bind to
#ListenAddress ::
</pre>

And now change the setting to
<pre>
Port 22
# Use these options to restrict which interfaces/protocols sshd will bind to
# the host's IP, not the guest's IP!
ListenAddress your.hosts.ip.here
</pre>

Then '/etc/init.d/ssh restart' on the host, after that on the guest (if you did apt-get install ssh on the guest already.)

The reason: the host's sshd binds to all available IP addresses on port 22 (the host 'sees' all addresses of the guests, too!). So when the guest starts its own sshd, it can't bind to port 22 any more. You need to change that setting only on the host.
(BTW: A similar approach is needed for a lot of daemons, e.g. Apache. If the daemon does not support an explicit bind, you may use the chbind command to 'hide' IP addresses from the daemon before starting it.)|Signature=derjohn}}

{{Question|Question=I did everything right, but the application foo does not start. What's up there?||Details=
A: Before asking on the IRC channel, please check out the 'problematic programs' page:
[[Problematic Programs]]|Signature=derjohn}}

{{Question|Question=Bind9 does not like to start in my guest.||Details=
A: Check out the 'problematic programs' page:
((ProblematicPrograms)) and/or get my [http://linux-vserver.derjohn.de/bind9-packages/bind9-capacheck_9.3.2-2_i386.deb vserver-guest-ready Debian package] for Debian Sid guests from this URL: http://linux-vserver.derjohn.de/bind9-packages/bind9-capacheck_9.3.2-2_i386.deb and check out the [http://linux-vserver.derjohn.de/bind9-packages/README.txt readme]. (Hint: This is fresh stuff. Please give me feedback.)

[UPDATE] Since VServer Devel 2.1.1-rc18 you do not need to patch the userland tools anymore. The capabilities are masked.|Signature=derjohn}}

{{Question|Question=Which guest vservers are running?||Details=
A: {{vserver-stat}}. Example output:
<pre>
CTX PROC VSZ RSS userTIME sysTIME UPTIME NAME
0 77 965.1M 334.6M 14m14s18 2m28s69 1h33m46 root server
49152 7 14M 5.2M 0m00s40 0m00s30 1h30m15 chiffon
</pre>|Signature=derjohn}}

{{Question|Question=How can I reboot/halt guests?||Details=
A: It depends.
For vserver with legacy-interfaces support, you have to replace {{/sbin/halt}} in guests with vreboot and start rebootmgr in host. You also need to have a dummy <guest>.conf file in /etc/vservers for each guest. Please have a look at /etc/init.d/rebootmgr.
Vserver with native interface utilizes /dev/initctl. No changes are needed in guests. Just make sure that REBOOT capability is adjusted in guests.|Signature=derjohn}}

{{Question|Question=Do I really need the legacy-interfaces? What are these legacy-interfaces?||Details=
A: Since vserver is an ongoing project, new features might replace old ones, and some might still be in development. Legacy-interfaces are available for backward compatibility (and might be removed someday). See Q: How can I reboot/halt guests?|Signature=derjohn}}

{{Question|Question= I have a vserver running on a Linux kernel with preemption. Is VServer "preempt" safe?||Details=
A: There are no known issues about running vserver on a preemption enabled kernel. I would like to add, that the vserver kernelhackers would probably exclude that option in 'make menuconfig' if there would be an incompatibility. Just my $.02 :)|Signature=derjohn}}

{{Question|Question=Is this a new project? When was it started?||Details=
A: The first public occurrence of linux-vserver was in Oct 2001. The initial mail can be found here: http://www.cs.helsinki.fi/linux/linux-kernel/2001-40/1065.html
So you can expect a mature software product which does its magic quite well (And hey, we have a version > 2.0 ! )|Signature=derjohn}}

{{Question|Question=Can I run an OpenVPN Server in a guest?||Details=
A: Yes. I don't want to provide an in-depth OpenVPN tutorial, but I want to show how I made OpenVPN work as a server in a guest. I was not able to run it with a tun device, due to a buglet in util-vserver and the kernel when it comes to setting an IP address on a point-to-point link: if you add "ip addr add <ip> peer <mypeer> dev tun0", there is no way to map the tun0 interface into a guest, not even with a 'nodev' option. (bug confirmed to be reproducible by daniel_hoczac)

First of all you have to prepare the host with a persistent tuntap interface in tap-mode. The tools we need come from the uml-utilities.
Then you need to create a device /dev/net/tun, which the OpenVPN userspace daemon reads. We'll assume 10.10.10.100 is the server IP and 10.10.10.101 is the client IP. To be cool we choose a /31 netmask (255.255.255.254), so we have a net without broadcast and don't waste IPs :)

On the host do:
<pre>
# apt-get install uml-utilities
# cd /var/lib/vservers/<myopenvpnserver>/dev/
# ./MAKEDEV tun
(creates the dev/net/tun device accessible by the guest - even a tap interface needs /dev/net/tun !)
# tunctl -t tap0
(creates the network device 'tap0' persistently)
</pre>

Then add the ip to the guest:
<pre>
# cat /etc/vservers/<myopenvpnserver>/interfaces/1/ip
10.10.10.100
# cat /etc/vservers/<myopenvpnserver>/interfaces/1/prefix
31
# cat /etc/vservers/<myopenvpnserver>/interfaces/1/dev
tap0
(This kind of config brings the ip when the vserver is started - only the tap0 interface has to exist already, see above!)
</pre>
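The prefix file above holds 31, while the OpenVPN configuration further down uses the dotted netmask 255.255.255.254; both describe the same /31 network. A minimal sketch of the conversion, using a made-up function name <code>prefix_to_netmask</code>:

```shell
#!/bin/bash
# Hypothetical helper: convert an IPv4 prefix length (0-32) into a dotted
# netmask, e.g. the value stored in interfaces/1/prefix.
prefix_to_netmask() {
    local prefix="$1" mask octets="" shift_by
    # Set the top <prefix> bits of a 32-bit word.
    mask=$(( (0xFFFFFFFF << (32 - prefix)) & 0xFFFFFFFF ))
    # Print the four octets from most to least significant.
    for shift_by in 24 16 8 0; do
        octets="$octets$(( (mask >> shift_by) & 255 ))."
    done
    echo "${octets%.}"
}
```

With this, <code>prefix_to_netmask 31</code> prints 255.255.255.254, the netmask passed to the ifconfig option on both the server and the client side.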

Here is a sample config for the guest (which is acting as a server):

Install OpenVPN package on server and client, in the Debian case:
<pre>
# apt-get install openvpn
</pre>

The server's conf looks like that:
<pre>
# port and interface specs

# behave like a ssl-webserver
port 443
proto tcp-server

# tap device? (keep in mind you need /dev/net/tun !)
dev tap0

# now the ips we will use for the tunnel
ifconfig 10.10.10.100 255.255.255.254
ifconfig-noexec

# the server part

# Keep VPN connections, even if the client IP changes
float

# use compression (may also even obfuscate content filters)
comp-lzo

# use a static key - create it with 'openvpn --genkey --secret static.key'
secret static.key

# dont reload the key after a SIGUSR1
persist-key

# check alive all 10 secs
keepalive 10 60

# verbosity level (from 1 to 9, 9 is max log level)
verb 4
status openvpn-status.log
</pre>

The client's conf may look like that (This example even makes the tunnel the clients default address):
<pre>
# cat /etc/openvpn/client.conf
# port and interface specs

# the following is not necessary, if you bring up openvpn via Debian's init script:
daemon ovpn-my-clients-name

# behave like a ssl-webserver
port 443
proto tcp-client
remote %%%<insert-the-guest-primary-public-ip-here>%%%%
# what device, tun or tap?
dev tap

# now the ips we will use for the tunnel
ifconfig 10.10.10.101 255.255.255.254

# Keep VPN connections, even if the client IP changes
float
mssfix

# use compression (may also even obfuscate content filters)
comp-lzo

# use a static key
secret static.key

# dont reload the key after a SIGUSR1
persist-key

# check alive all 10 secs
keepalive 10 60

# verbosity level (from 1 to 9, 9 is max log level)
verb 4

# set the default route
route-gateway 10.10.10.100
redirect-gateway def1
# to add special routes you can do it within the openvpn client conf:
# route <dest> <mask> <gateway>

# if you need to connect via proxy (like squid)
# http-proxy s p [up] [auth] : Connect to remote host through an HTTP proxy at
# address s and port p. If proxy authentication is required,
# up is a file containing username/password on 2 lines, or
# 'stdin' to prompt from console. Add auth='ntlm' if
# the proxy requires NTLM authentication.

# http-proxy s p [up] [auth]

# http-proxy-option type [parm] : Set extended HTTP proxy options.
# Repeat to set multiple options.
# VERSION version (default=1.0)
# AGENT user-agent

# http-proxy-option type [parm]
</pre>

In the next lesson I will talk about OpenVPN's server mode, which can deal with multiple clients connecting to one IP and one port (i.e. you only need one guest for tons of 'roadwarriors'), TLS connections and PKI.

Contributions welcome. :)|Signature=derjohn}}

{{Question|Question=32 vs 64 Bit? What should I take?||Details=
A: If you have the choice, make the host a 64 bit one. You can run a guest as 32 bit or as 64 bit on a 64 bit host. To run it as 32 bit, you need to compile the x86_64 (a.k.a. AMD64) kernel with the following options:

<pre>
[*] Kernel support for ELF binaries
<M> Kernel support for MISC binaries
[*] IA32 Emulation <---- without that, the entire 32bit API is not present
<M> IA32 a.out support
</pre>

You can force the guest to behave like a 32 bit environment like this:
<pre>
echo linux_32bit > /etc/vservers/$NAME/personality
echo i686 > /etc/vservers/$NAME/uts/machine
</pre>
(thanks cehteh for the hint!)

You can also force debootstrap to put 32 bit binaries into the guest by 'export ARCH=i386':
<pre>
export ARCH=i386 ; vserver build ....
</pre>|Signature=derjohn}}

{{Question|Question=I want to (re)mount a partition in a running guest ... but the guest has no rights (capability) to (re)mount?||Details=
A: Take as an example a /tmp partition within the guest that is too small, which is likely the case if you stay with the 16 MB default (vserver build mounts /tmp as a 16 MB tmpfs!).
<pre>
# vnamespace -e XID mount -t tmpfs -o remount,size=256m,mode=1777 none /var/lib/vservers/<guest>/tmp/
</pre>
Be warned that the guest will not recognize the change, as the /etc/mtab file is not updated when you mount like this. To permanently change the mount, edit /etc/vservers/<guest>/fstab on the host.|Signature=derjohn}}

{{Question|Question=How do I limit a guests RAM? I want to prevent OOM situations on the host!||Details=
A: First you can read [http://linux-vserver.org/Memory+Allocation].
If you want a recipe, do this:
1. Check the size of memory pages. On x86 and x86_64 it is usually 4 KB per page.
2. Create /etc/vservers/<guest>/rlimits/
3. Check the physical memory size of the host, e.g. with "free -m" (which reports megabytes, so multiply by 1024 to get kilobytes). maxram = kilobytes/pagesize (page size in kilobytes).
4. Limit the guest's physical RAM to a value smaller than maxram:

<pre>
echo %%insertYourPagesHereSmallerThanMaxram%% > /etc/vservers/<guest>/rlimits/rss
</pre>

5. Check your swap space, e.g. with 'swapon -s'. maxswap = swapkilobytes/pagesize.
6. Limit the guest's maximum number of as pages to a value smaller than (maxram+maxswap):

<pre>
echo %%desiredvalue%% > /etc/vservers/<guest>/rlimits/as
</pre>

It should be clear that this can still lead to OOM situations. Example: you have two guests and the as limit per guest is greater than 50% of (maxram+maxswap). If both guests request their maximum at the same point in time, there will not be enough memory.|Signature=derjohn}}
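The arithmetic in the recipe above can be sketched with two small shell functions. The function names are made up for this illustration, and the 4 KB page size is only the default the recipe mentions; "getconf PAGESIZE" prints the actual page size in bytes.

```shell
#!/bin/bash
# Hypothetical helpers for the recipe above; the names are made up.

# Convert megabytes to pages. The page size in KB defaults to 4, which the
# recipe gives as the usual x86/x86_64 value.
mb_to_pages() {
    local mb="$1" page_kb="${2:-4}"
    echo $(( mb * 1024 / page_kb ))
}

# Succeed (exit status 0) if the per-guest limits, given in pages, add up to
# more than the total number of pages available on the host.
limits_overcommitted() {
    local total="$1" sum=0 limit
    shift
    for limit in "$@"; do
        sum=$(( sum + limit ))
    done
    [ "$sum" -gt "$total" ]
}
```

For example, <code>mb_to_pages 800</code> prints 204800, a value that could be written to a guest's rlimits/rss file. With 1000 MB of RAM plus swap (256000 pages), <code>limits_overcommitted 256000 150000 150000</code> succeeds, which is the two-guest situation described in the answer: each limit is above 50% of the total, so both guests cannot reach their maximum at the same time.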

{{Question|Question=Where can I get newer versions of VServer as ready-made packages for Debian?||Details=
A: Here you go: http://linux-vserver.derjohn.de/ . There is also some stuff on backports.org, but my kernels are always 'devel' branch.|Signature=derjohn}}

{{Question|Question=Can I use iptables ?||Details=
Yes but right now only on the host (rootserver). Please realize that all traffic is local and will not touch the forward chain.|Signature=BeginnerFAQ}}

{{Question|Question=Try to connect to a vserver from the master or another vserver on the same host fails with
:: strace shows:
<pre>
sin_addr=inet_addr("xx.xx.xx.xx")}, yy) = -1 EINVAL (Invalid argument)
</pre>

||Details=
A: The vserver/master cannot communicate with another vserver on the same host.
* Check the netmasks on all interfaces (do they overlap?).
* Check policy routing (disable it temporarily).
* Check that lo is up (networking within a host/vserver always uses the lo interface).
|Signature=CommonProblems}}

{{Question|Question=#1 ERROR: capset(): Operation not permitted||Details=Capabilities are not enabled in the kernel configuration.
Please check that CONFIG_SECURITY_CAPABILITIES is built into the kernel or loaded as a module (check with "grep -i cap /path_to_kernel/.config").
(2.6.11.5-vs-1.9.5 + 0.30-205)|Signature=IrcQuestions}}

{{Question|Question=How can I make 'vserver start' mount the root filesystem||Details=
Mount it via /etc/vservers/vserver-name/fstab and make sure to set the option 'dev', e.g.:
<pre>/dev/drbd0 / xfs rw,dev 0 0</pre>
util-vserver 210 won't be able to find some of the scripts needed for the reboot, so add the following to /etc/vservers/vserver-name/apps/init/cmd.stop:
<pre>/etc/init.d/rc
6</pre>
|Signature=AdrianReyer}}

Problematic Programs

2006-09-19T18:05:21Z

Meandtheshell: Template:ProblematicPrograms moved to Problematic Programs: I started to fill in content so it isn't a template anymore :)

Template:ProblematicPrograms

2006-09-19T18:05:21Z

Meandtheshell: Template:ProblematicPrograms moved to Problematic Programs: I started to fill in content so it isn't a template anymore :)

#REDIRECT [[Problematic Programs]]

Frequently Asked Questions

2006-09-19T18:01:24Z

Meandtheshell:

<div style="margin: 2em auto 2em auto; padding: 10px; background-color: #F9ECCD; border: 1px solid #004433; text-align: center;">
[[Image:Icon-Caution.png|left]]
We are currently migrating to MediaWiki from our old installation, but not all content has been migrated yet. Take a look at the [[Wiki Team]] page for instructions on how to help, or look at the [http://oldwiki.linux-vserver.org old wiki] to find information that has not been migrated yet.

'''To ease migration we created a [[List of old Documentation pages]].'''
</div>

CURRENTLY THE CONTENT OF THE OLD WIKI FAQ (AND MORE) IS BEING MIGRATED TO THIS PAGE (TASK: DERJOHN)

__TOC__

{{Question|Question=What is a 'Guest'?||Details=To talk about stuff, we need some naming. The physical machine is called 'Host' and the 'main' context running the Host Distro is called 'Host Context'. The virtual machine/distro is called 'Guest' and basically is a Distribution (Userspace) running inside a 'Guest Context'.|Signature=derjohn}}

{{Question|Question=What kind of Operating System (OS) can I run as guest?||Details=
A: With VServer you can only run Linux guests. The trick is that a guest does not run a kernel of its own (as XEN and UML do); it merely uses a virtualized host kernel interface. VServer offers so-called security contexts, which make it possible to separate guests from each other, i.e. they cannot get data from each other. Imagine it as a chroot environment with much more security and many more features.|Signature=derjohn}}

{{Question|Question=Which distributions did you test?||Details=
A: Some. Check out the wiki for ready-made guest images. But you can easily build your own guest images, e.g. with Debian's debootstrap. Check out ((step-by-step Guide 2.6)) to see how to do that.|Signature=derjohn}}

{{Question|Question=Is VServer comparable to XEN/UML/QEMU?||Details=
A: Nope. XEN/UML/QEMU and VServer are just good friends. Because you ask, you probably know what XEN/UML/QEMU are. In contrast to XEN/UML/QEMU, VServer does not "emulate" any hardware to run a kernel on. You can run a VServer kernel in a XEN/UML/QEMU guest. This is confirmed to work at least with Linux 2.6/vs2.0.|Signature=derjohn}}

{{Question|Question=Is VServer secure?||Details=
A: We hope so. It should be at least as secure as Linux is. We consider it much, much more secure though.|Signature=derjohn}}

{{Question|Question=Performance?||Details=
A: For a single guest, we basically have native performance. Some tests showed insignificant overhead (about 1-2%), while others ran faster than on an unpatched kernel. This is IMVHO significantly less than other solutions waste, especially if you have more than a single guest (because of the resource sharing).|Signature=derjohn}}

{{Question|Question=Is SMP Supported?||Details=
A: Yes, on all SMP capable kernel architectures.|Signature=derjohn}}

{{Question|Question=Resource sharing?||Details=
A: Yes ....
* memory: Dynamically.
* CPU usage: Dynamically (token bucket)|Signature=derjohn}}

{{Question|Question=Resource limiting?||Details=
A: Yes, you can set maximum limits per guest, but at the moment you can only offer guaranteed resource availability with some tricks. Both ulimit and rlimit can be used. Rlimit is a new feature of kernel 2.6/vs2.0.|Signature=derjohn}}

{{Question|Question=Disk I/O limiting? Is that possible?||Details=
A: Well, since vs2.1.1 linux-vserver supports a mechanism called 'I/O scheduling', which appeared in the 2.6 mainline some time ago. The mainline kernel offers several I/O schedulers:

<pre>
# cat /sys/block/hdc/queue/scheduler
noop [anticipatory] deadline cfq
</pre>

The default is anticipatory a.k.a. "AS". When running several guests on a host you probably want the I/O performance shared in a fair way among the different guests. The kernel comes with a "completely fair queueing" scheduler, CFQ, which can do that. (More on schedulers can be found at http://lwn.net/Articles/114770/)

This is how to set the scheduler to "cfq" manually:
<pre>
root# echo "cfq" > /sys/block/hdc/queue/scheduler
root# cat /sys/block/hdc/queue/scheduler
noop anticipatory deadline [cfq]
</pre>

Keep in mind that you have to do it on all physical discs. So if you run an md-softraid, do it to all physical /dev/hdXYZ discs!

If you run Debian there is a predefined way to set the /sys values at boot-time:

<pre>
# apt-get install sysfsutils
[...]

# cat /etc/sysfs.conf | grep cfq
block/sda/queue/scheduler = cfq
block/sdc/queue/scheduler = cfq

# /etc/init.d/sysfsutils restart
</pre>

For non-vserver processes and CFQ you can set by which key the kernel decides about the fairness:

<pre>
cat /sys/block/hdc/queue/iosched/key_type
pgid [tgid] uid gid
</pre>
Hint: The 'key_type'-feature has been removed in the mainline kernel recently. Don't look for it any longer :(

The default is tgid, which means to share fairly among process groups. Think of every guest as being treated like its own process group. It's not possible to set a scheduler strategy within a guest. All processes belonging to the same guest are treated like "noop" within the guest. So: If you run apache and some ftp-server within the _same_ guest, there is no fair scheduling between them, but there is fair scheduling between the whole guest and all other guests.

And: It's possible to tune the scheduler parameters in several ways. Have a look at /sys/block/hdc/queue/....

You need a very recent version of VS devel, e.g. 2.1.1-rc18 can do it. Some older versions had that feature too, then it got lost and was reinvented. So: Go and get an rc18 - only in 'devel', not stable!|Signature=derjohn}}
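
As a sketch of the "do it on all physical discs" advice in the answer above, the following shell function writes a scheduler name into every queue/scheduler file it finds. The function name set_scheduler and its optional second argument (a sysfs root, so that it can be tried on a scratch directory first) are made up for this example and are not part of any VServer tool. Devices that do not accept the chosen scheduler will simply report a write error.

```shell
# set_scheduler SCHED [SYSFS_ROOT]
# Writes SCHED into queue/scheduler of every block device found below
# SYSFS_ROOT/block. SYSFS_ROOT defaults to /sys; the parameter is an
# assumption of this sketch so the function can be tested without root.
set_scheduler() {
    sched=$1
    root=${2:-/sys}
    for f in "$root"/block/*/queue/scheduler; do
        # Skip the unexpanded glob when no device matches.
        [ -e "$f" ] || continue
        echo "$sched" > "$f"
    done
}

# Usage (as root on the host):
#   set_scheduler cfq
```

Afterwards, "cat /sys/block/hdc/queue/scheduler" should show the chosen scheduler in square brackets, as in the manual example.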

{{Question|Question=Why isn't there a device /dev/bla within a guest?||Details=
A: Device nodes allow Userspace to access hardware (or virtual resources). Creating a device node inside the guest's namespace will give access to that device, so for security reasons, the number of 'given' devices is small.|Signature=derjohn}}

{{Question|Question=What is Unification (vunify)?||Details=
A: Unification is Hard Links on Steroids. Guests can 'share' common files (usually binaries and libraries) in a secure way, by creating hard links with special properties (immutable but unlinkable (removable)). The tool to identify common files and to unify them is called vunify.|Signature=derjohn}}

{{Question|Question=What is vhashify?||Details=
A: The successor of vunify, a tool which does unification based on hash values (which allows it to find common files in arbitrary paths).|Signature=derjohn}}

{{Question|Question=How do I manage a multi-guest setup with vhashify?||Details=
A: For 'vhashify', just do these once:

<pre>
mkdir -p /etc/vservers/.defaults/apps/vunify/hash /vservers/.hash
ln -0s /vservers/.hash /etc/vservers/.defaults/apps/vunify/hash/root
</pre>

Then, do this one line per vserver:

<pre>
mkdir /etc/vservers/<vservername>/apps/vunify # vhashify reuses vunify configuration
</pre>

The command 'ln' creates a link between two files. "ln -s" creates a symbolic link -- two files are linked by name. "ln -0s" uses a Vserver extension to create a unified link.|Signature=derjohn}}
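
The "one line per vserver" step above can be wrapped in a loop when there are many guests. This is only a sketch: the function name enable_vhashify is invented here, and it assumes that every non-hidden directory below the configuration directory is a guest that should use vhashify.

```shell
# enable_vhashify [CONFDIR]
# Creates apps/vunify below every guest configuration directory in CONFDIR
# (default /etc/vservers). Entries starting with a dot, such as .defaults,
# are not matched by the * glob and are therefore left alone.
enable_vhashify() {
    confdir=${1:-/etc/vservers}
    for guest in "$confdir"/*/; do
        # Skip the unexpanded glob when CONFDIR holds no directories.
        [ -d "$guest" ] || continue
        mkdir -p "${guest}apps/vunify"
    done
}

# Usage (as root on the host):
#   enable_vhashify
```

Guests that should not be hashified need to be excluded by hand, since the loop does not distinguish between them.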

{{Question|Question=With which VS version should I begin?||Details=
A: If you are new to VServer I recommend trying 2.0.+. Take "alpha utils" version 0.30.210. A well-running version of it recently appeared in Debian Sid (it's a .210 at the time of writing).|Signature=derjohn}}

{{Question|Question=is there a way to implement "user/group quota" per VServer?||Details=
A: Yes, but not on a shared partition for now. You need to put the guest on a separate partition, setup a vroot device (to make the quota access secure), copy that into the guest, and adjust the mtab line inside the guest.|Signature=derjohn}}

{{Question|Question=what about "Quota" for a context?||Details=
A: Context quotas are now called Disk Limits (so that we can tell them apart from the user/group quotas :). They are supported out of the box (with vs2.0) for all major filesystems (Ext2/3, ReiserFS, XFS, JFS)|Signature=derjohn}}

{{Question|Question=Does it support IPv6?||Details=
A: Currently not. Some developer has to move his ... to reimplement this functionality from the V4 code (I read that on the ML ;)). Will probably be superseded by the ngnet (next generation networking) soon. There is a Wiki page regarding this: http://linux-vserver.org/IPv6|Signature=derjohn}}

{{Question|Question=I can't do all I want with the network interfaces inside the guest?||Details=
A: For now the networking is 'Host Business' -- the host is a router, and each guest is a server. You can set the capability ICMP_RAW in the context of the guest, or even the capability CAP_NET_RAW (which would even allow to sniff interfaces of other guests!). Likely to change with ngnet. |Signature=derjohn}}

{{Question|Question=Is there a web-based interface for vserver that will allow creation/deletion/configuration etc. of vserver guests?||Details=
A. [Update] Errrh, there is http://OpenVPS.org which is a set of scripts with a web-interface for webhosters/ISPs.
A. [Update] Errrh, there is http://Openvcp.org which is a distributed system (agent!) with a web-interface, with which you can build/remove guests! cool stuff! beta, try out!|Signature=derjohn}}

{{Question|Question=What is old-style and new-style config?||Details=
A. Old-style config refers to a single text-file that contains all the configuration settings. With new-style config the configuration is split into several directories and files. You should probably go for new-style config if you are asking.|Signature=derjohn}}

{{Question|Question=What is the "great flower page"?||Details=
A. Well, this page contains all configuration options for vserver in version > 1.9 (I think .. I joined Linux-VServer in version 2, so I don't know for sure). The name of the page is derived from the stylesheet(s) it contains: It displays background pictures of a very great flower, so regard it as highly optimized. It was designed by a non-designer, who asks us to create a better one. I played with the thought of creating a completely new theme for that page - but actually we all got used to the name "great flower page", so we stick to it. If you are unable to read it clearly, feel invited to join the IRC channel #vserver, we may tell you how to ;)|Signature=derjohn}}

{{Question|Question=How do I add several IPs to a vserver? ||Details=
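A: First of all, a single guest only supports up to 16 IPs (a 64-IP patch is available and included in "derjohn's kernel"; you need a patched util-vserver as well).
Here is a little helper script that adds a list of IPs defined in a text file (myiplist), one per line. Run it inside /etc/vservers/<guest>/interfaces/.
<pre>
#!/bin/bash
# Create one interface directory per address listed in ./myiplist.
# Numbering starts at 2, exactly as in the original version of this script.
j=1
while read -r ip; do
    [ -n "$ip" ] || continue      # skip empty lines
    j=$((j+1))
    mkdir "$j"
    echo "$ip" > "$j/ip"
    echo "$ip" > "$j/ip-old"
    echo "24" > "$j/prefix"       # /24 netmask for every address
done < myiplist
</pre>|Signature=derjohn}}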

{{Question|Question=If my host has only a single public IP, can I use RFC1918 IPs (e.g. 192.168.foo.bar) for the guest vservers?||Details=
A: Yes, use iptables with SNAT to masquerade it.
<pre>
iptables -t nat -I POSTROUTING -s $VSERVER_NETZ ! -d $VSERVER_NETZ -j SNAT --to $EXT_IP
</pre>
See: HowtoPrivateNetworking and
http://www.tgunkel.de/it/software/doc/linux_server#h3-Vserver_Masquerading_SNAT (THX, [MUPPETS]Gonzo)|Signature=derjohn}}
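
The rule in the answer above relies on two shell variables that are not defined there. As a small sketch, the function below (snat_rule, a name invented for this example) prints the complete command for a given guest network and external address, so it can be reviewed before it is run. The addresses in the usage comment are placeholders from documentation ranges, not values from this FAQ.

```shell
# snat_rule NET EXT_IP
# Prints (does not run) the SNAT rule for the given guest network and
# external address.
snat_rule() {
    net=$1
    ext_ip=$2
    echo "iptables -t nat -I POSTROUTING -s $net ! -d $net -j SNAT --to $ext_ip"
}

# Example with placeholder values:
#   snat_rule 192.168.0.0/24 203.0.113.10
```

Printing the rule first makes it easy to check that the guest network and the external address have not been swapped before the rule is applied as root.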

{{Question|Question=If I shut down my vserver guest, the whole Internet interface ethX on the host is shut down. What happened? ||Details=
A: When you shut down a guest (''i.e. vserver foo stop''), the IP is brought down on the host also. If this IP happens to be the primary IP of the host, the kernel will not only bring down the primary IP, but also all secondary IP addresses. But in very recent kernels, there is a settable option which prevents that nasty behaviour. It's called "alias promotion". You may set it via sysctl by adding ''net.ipv4.conf.all.promote_secondaries=1'' to /etc/sysctl.conf or via the sysctl command line.|Signature=derjohn}}
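
A minimal sketch of the sysctl.conf variant: the function below appends the setting only if the exact line is not there yet, so it can be run repeatedly. The function name and the optional file argument are assumptions made for this example.

```shell
# enable_promote_secondaries [FILE]
# Appends the setting to FILE (default /etc/sysctl.conf) unless an identical
# line is already present.
enable_promote_secondaries() {
    file=${1:-/etc/sysctl.conf}
    line='net.ipv4.conf.all.promote_secondaries=1'
    # -x: match the whole line, -F: fixed string, -q: no output
    grep -qxF "$line" "$file" 2>/dev/null || echo "$line" >> "$file"
}

# Usage (as root on the host):
#   enable_promote_secondaries
```

The file is only read at boot or when "sysctl -p" is run; to change the running kernel immediately, "sysctl -w net.ipv4.conf.all.promote_secondaries=1" can be used.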

{{Question|Question=On Debian Sarge (stable) only util-vserver 0.30-204 is available, which has been reported to be buggy (I haven't checked the version for a while). How do I compile a local version of alpha util-vserver .210 on Debian?||Details=
A:
<pre>
apt-get build-dep util-vserver

./configure --prefix=/usr/local/ --enable-release \
--mandir=/usr/local/share/man \
--infodir=/usr/local/share/info \
--sysconfdir=/etc --enable-dietlibc \
--localstatedir=/var \
--with-vrootdir=/var/lib/vservers

make

make install-distribution
(Which does a make install + setting a symlink ln -s /usr/local/lib/util-vserver/vshelper /sbin/vshelper )

</pre>

To test which version you are running:
<pre>
# which vserver
/usr/local/sbin/vserver

</pre>

This should point to a path below /usr/local.

If you don't want to build it yourself: On www.backports.org there are backported (for sarge) linux-images (2.6.16) with the vserver patch enabled and an updated util-vserver package as well.
|Signature=derjohn}}

{{Question|Question=I use derjohn's kernel or a different kernel with a more-than-16-IPs-per-guest-patch and can't use more than 16 IPs. Why?||Details=
A: You need to patch util-vserver, too. So you obviously need to recompile util-vserver (see above). In the util-vserver directory there are header files in the ./kernel/ directory. Patch like this:

<pre>
kernel/network.h:#define NB_IPV4ROOT 64
</pre>

BTW: The initial patches can be found here: http://vserver.13thfloor.at/Experimental/VARIOUS/util-vserver-0.30.196-net64.diff.bz2 and http://vserver.13thfloor.at/Experimental/VARIOUS/delta-2.6.9-vs1.9.3-net64.diff
|Signature=unknown}}

{{Question|Question=I run a Debian host and want to build an Ubuntu guest. Howto?||Details=
A: Simple ;) Assume you want to build a breezy guest on a sid host with IP 192.168.0.2 and hostname vubuntu, then do:
<pre>
vserver vubuntu build --force -m debootstrap --hostname vubuntu.myvservers.net --netdev eth0 --interface 192.168.0.2/24 \
--context 42 -- -d breezy -m http://de.archive.ubuntu.com/ubuntu
</pre>

[UPDATE] Currently there are problems in building breezy under unclear circumstances, which seem to be related to udev. If the above didn't work, try:
<pre>
vserver vubuntu build --force -m debootstrap --hostname vubuntu.myvservers.net --netdev eth0 --interface 192.168.0.2/24 \
--context 42 -- -d breezy -m http://de.archive.ubuntu.com/ubuntu -- --exclude=udev
</pre>
In very recent versions of the utils, the problem should not occur anymore (it has to do with the 'secure-mount' if you look in the MLs)

Well, sid's debootstrap knows how to bootstrap Ubuntu linux. Make sure to have a current debootstrap package:
<pre>
apt-get update
apt-get install debootstrap
</pre>
The knowledge how to build ubuntu 'breezy badger' (which you probably want to be your guest at the time of writing) has been added recently.|Signature=derjohn}}

{{Question|Question=How do I make a vserver guest start by default?||Details=
A: At least on Debian, I can tell you how to do it with the new-style config. If your guest is called "derjohn" and you want it to be started somewhere at the end of your boot process, then do:
<pre>
echo "default" > /etc/vservers/derjohn/apps/init/mark
</pre>
If you want to start it earlier, please read the init script "/etc/init.d/vserver-default" to find out how to do it. In most cases you don't need to change this. On Debian the vservers are started at "90", so after most other stuff is up (networking etc.).

Besides that I created a small helper script for managing the autostart foo: ((vserver-autostart))|Signature=derjohn}}

{{Question|Question=My host works, but when I start a guest it says that it has a problem with chbind.||Details=
A: You are probably using util-vserver <= 0.30.209, which uses dynamic network contexts internally (this changed with 0.30.210). So if you compiled your kernel without dynamic contexts, you may start guests, but you can't use the network context. The solution is either to switch to the .210 utils (or Hollow's toolset) or to compile the kernel with dynamic network contexts.
SE Keyword: invalid option `nid' testme.sh|Signature=derjohn}}

{{Question|Question=When I try to ssh to the guest, I log into the host, even if I installed sshd on the guest. What's wrong here?||Details=
A: Look at /etc/ssh/sshd_config of the host:

<pre>
Port 22
# Use these options to restrict which interfaces/protocols sshd will bind to
#ListenAddress ::
</pre>

And now change the setting to
<pre>
Port 22
# Use these options to restrict which interfaces/protocols sshd will bind to
ListenAddress your.hosts.ip.here # not the guests IP!
</pre>

Then '/etc/init.d/ssh restart' on the host, after that on the guest (if you did apt-get install ssh on the guest already.)

The explanation: the host's sshd binds to all available IP addresses on port 22 (the host even 'sees' all addresses of the guests!). So when the guest starts its sshd, it can't bind to port 22 any more. You need to change that setting only on the host.
(BTW: A similar approach is needed for a lot of daemons, e.g. Apache. If the daemon does not support an explicit bind, you may use the chbind command to 'hide' IP addresses from the daemon before starting it.)|Signature=derjohn}}

{{Question|Question=I did everything right, but the application foo does not start. What's up there?||Details=
A: Before asking on the IRC channel, please check out the 'problematic programs' page:
[[Template:ProblematicPrograms]]|Signature=derjohn}}

{{Question|Question=Bind9 does not like to start in my guest.||Details=
A: Check out the 'problematic programs' page:
((ProblematicPrograms)) and/or get my [http://linux-vserver.derjohn.de/bind9-packages/bind9-capacheck_9.3.2-2_i386.deb vserver-guest-ready Debian package] for Debian Sid guests from this URL: http://linux-vserver.derjohn.de/bind9-packages/bind9-capacheck_9.3.2-2_i386.deb and check out the [http://linux-vserver.derjohn.de/bind9-packages/README.txt readme]. (Hint: This is fresh stuff. Then give me feedback.)

[UPDATE] Since VServer Devel 2.1.1-rc18 you do not need to patch the userland tools anymore. The capabilities are masked.|Signature=derjohn}}

{{Question|Question=Which guest vservers are running?||Details=
A: <code>vserver-stat</code>. Example output:
<pre>
CTX PROC VSZ RSS userTIME sysTIME UPTIME NAME
0 77 965.1M 334.6M 14m14s18 2m28s69 1h33m46 root server
49152 7 14M 5.2M 0m00s40 0m00s30 1h30m15 chiffon
</pre>|Signature=derjohn}}

{{Question|Question=How can I reboot/halt guests?||Details=
A: It depends.
For vserver with legacy-interfaces support, you have to replace <code>/sbin/halt</code> in guests with vreboot and start rebootmgr on the host. You also need to have a dummy <guest>.conf file in /etc/vservers for each guest. Please have a look at /etc/init.d/rebootmgr.
Vserver with native interface utilizes /dev/initctl. No changes are needed in guests. Just make sure that the REBOOT capability is adjusted in guests.|Signature=derjohn}}

{{Question|Question=Do I really need the legacy-interfaces? What are these legacy-interfaces?||Details=
A: Since vserver is an ongoing project, new features might replace old ones, and some might still be in development. Legacy-interfaces are available for backward compatibility (and might be removed someday). See Q: How can I reboot/halt guests?|Signature=derjohn}}

{{Question|Question= I have a vserver running on a Linux kernel with preemption. Is VServer "preempt" safe?||Details=
A: There are no known issues with running vserver on a preemption-enabled kernel. I would like to add that the vserver kernel hackers would probably exclude that option in 'make menuconfig' if there were an incompatibility. Just my $.02 :)|Signature=derjohn}}

{{Question|Question=Is this a new project? When was it started?||Details=
A: The first public occurrence of linux-vserver was in Oct 2001. The initial mail can be found here: http://www.cs.helsinki.fi/linux/linux-kernel/2001-40/1065.html
So you can expect a mature software product which does its magic quite well (And hey, we have a version > 2.0 ! )|Signature=derjohn}}

{{Question|Question=Can I run an OpenVPN Server in a guest?||Details=
A: Yes. I don't want to provide an in-depth OpenVPN tutorial, but want to show how I made OpenVPN work in a guest as a server. I was not able to run it with a tun device, due to a buglet in util-vserver and the kernel when it comes to setting an IP address on a point-to-point link: If you add "ip addr add <ip> peer <mypeer> dev tun0" there is no way to map the tun0 interface into a guest, not even with a 'nodev' option. (bug confirmed to be reproducible by daniel_hoczac)

First of all you have to prepare the host with a persistent tuntap interface in tap-mode. The tools we need come from the uml-utilities.
Then you need to create a device /dev/net/tun, which the OpenVPN userspace daemon reads. We'll assume 10.10.10.100 is the server IP and 10.10.10.101 is the client IP. To be cool we choose a /31 netmask (255.255.255.254), so we have a net without broadcast and don't waste IPs :)

On the host do:
<pre>
# apt-get install uml-utilities
# cd /var/lib/vservers/<myopenvpnserver>/dev/
# ./MAKEDEV tun
(creates the dev/net/tun device accessible by the guest - even a tap interface needs /dev/net/tun !)
# tunctl -t tap0
(creates the network device 'tap0' persistently)
</pre>

Then add the ip to the guest:
<pre>
# cat /etc/vservers/<myopenvpnserver>/interfaces/1/ip
10.10.10.100
# cat /etc/vservers/<myopenvpnserver>/interfaces/1/prefix
31
# cat /etc/vservers/<myopenvpnserver>/interfaces/1/dev
tap0
(This kind of config brings up the IP when the vserver is started - only the tap0 interface has to exist already, see above!)
</pre>

Here is a sample config for the guest (which is acting as a server):

Install OpenVPN package on server and client, in the Debian case:
<pre>
# apt-get install openvpn
</pre>

The server's conf looks like that:
<pre>
# port and interface specs

# behave like a ssl-webserver
port 443
proto tcp-server

# tap device? (keep in mind you need /dev/net/tun !)
dev tap0

# now the ips we will use for the tunnel
ifconfig 10.10.10.100 255.255.255.254
ifconfig-noexec

# the server part

# Keep VPN connections, even if the client IP changes
float

# use compression (may also even obfuscate content filters)
comp-lzo

# use a static key - create it with 'openvpn --genkey --secret static.key'
secret static.key

# dont reload the key after a SIGUSR1
persist-key

# check alive every 10 secs
keepalive 10 60

# verbosity level (from 1 to 9, 9 is max log level)
verb 4
status openvpn-status.log
</pre>

The client's conf may look like this (this example even makes the tunnel the client's default route):
<pre>
# cat /etc/openvpn/client.conf
# port and interface specs

# the following is not necessary, if you bring up openvpn via Debian's init script:
daemon ovpn-my-clients-name

# behave like a ssl-webserver
port 443
proto tcp-client
remote <insert-the-guest-primary-public-ip-here>
# what device, tun or tap?
dev tap

# now the ips we will use for the tunnel
ifconfig 10.10.10.101 255.255.255.254

# Keep VPN connections, even if the client IP changes
float
mssfix

# use compression (may also even obfuscate content filters)
comp-lzo

# use a static key
secret static.key

# dont reload the key after a SIGUSR1
persist-key

# check alive every 10 secs
keepalive 10 60

# verbosity level (from 1 to 9, 9 is max log level)
verb 4

# set the default route
route-gateway 10.10.10.100
redirect-gateway def1
# to add special routes you can do it within the openvpn client conf:
# route <dest> <mask> <gateway>

# if you need to connect via proxy (like squid)
# http-proxy s p [up] [auth] : Connect to remote host through an HTTP proxy at
# address s and port p. If proxy authentication is required,
# up is a file containing username/password on 2 lines, or
# 'stdin' to prompt from console. Add auth='ntlm' if
# the proxy requires NTLM authentication.

# http-proxy s p [up] [auth]

# http-proxy-option type [parm] : Set extended HTTP proxy options.
# Repeat to set multiple options.
# VERSION version (default=1.0)
# AGENT user-agent

# http-proxy-option type [parm]
</pre>

In the next lesson I will talk about OpenVPN's server mode, which can deal with multiple clients connecting to one IP and one port (i.e. you only need one guest for tons of 'roadwarriors'), TLS connections and PKI.

Contributions welcome. :)|Signature=derjohn}}

{{Question|Question=32 vs 64 Bit? What should I take?||Details=
A: If you have the choice, make the host a 64-bit one. You can run a guest as 32-bit or as 64-bit on a 64-bit host. To run it as 32-bit, you need to compile the x86_64 (a.k.a. AMD64) kernel with the following options:

<pre>
[*] Kernel support for ELF binaries
<M> Kernel support for MISC binaries
[*] IA32 Emulation <---- without that, the entire 32bit API is not present
<M> IA32 a.out support
</pre>

You can force the guest to behave like a 32-bit environment like this:
<pre>
echo linux_32bit > /etc/vservers/$NAME/personality
echo i686 > /etc/vservers/$NAME/uts/machine
</pre>
(thanks cehteh for the hint!)

You can also force debootstrap to put 32-bit binaries into the guest by setting 'export ARCH=i386':
<pre>
export ARCH=i386 ; vserver build ....
</pre>|Signature=derjohn}}

{{Question|Question=I want to (re)mount a partition in a running guest ... but the guest has no rights (capability) to (re)mount?||Details=
A: I'll explain with an example: the /tmp partition within the guest is too small, which is likely if you stay with the 16 MB default (vserver build mounts /tmp as a 16 MB tmpfs!).
<pre>
# vnamespace -e XID mount -t tmpfs -o remount,size=256m,mode=1777 none /var/lib/vservers/<guest>/tmp/
</pre>
Be warned that the guest will not recognize the change, as the /etc/mtab file is not updated when you mount like this. To permanently change the mount, edit /etc/vservers/<guest>/fstab on the host.|Signature=derjohn}}

{{Question|Question=How do I limit a guest's RAM? I want to prevent OOM situations on the host!||Details=
A: First you can read [http://linux-vserver.org/Memory+Allocation].
If you want a recipe, do this:
1. Check the size of memory pages. On x86 and x86_64 it is usually 4 KB per page.
2. Create /etc/vservers/<guest>/rlimits/
3. Check your physical memory size on the host in kilobytes, e.g. with "free -k". maxram = kilobytes/pagesize.
4. Limit the guest's physical RAM to a value smaller than maxram:

<pre>
echo %%insertYourPagesHereSmallerThanMaxram%% > /etc/vservers/<guest>/rlimits/rss
</pre>

5. Check your swap space, e.g. with 'swapon -s'. maxswap = swapkilobytes/pagesize.
6. Limit the guest's maximum number of as pages to a value smaller than (maxram+maxswap):

<pre>
echo %%desiredvalue%% > /etc/vservers/<guest>/rlimits/as
</pre>

It should be clear that this can still lead to OOM situations. Example: You have two guests and your as limit per guest is greater than 50% of (maxram+maxswap). If both guests request their maximum at the same point in time, there will not be enough memory.|Signature=derjohn}}
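
The kilobytes/pagesize arithmetic from steps 3 and 5 of the recipe above can be sketched as a tiny shell helper. The name kb_to_pages is made up for this example, and the figures in the comments are hypothetical, not measurements.

```shell
# kb_to_pages KILOBYTES [PAGE_KB]
# Converts a size in kB to a number of pages (integer division, rounds down).
# PAGE_KB defaults to 4, the usual page size on x86 and x86_64.
kb_to_pages() {
    echo $(( $1 / ${2:-4} ))
}

# Hypothetical host with 1 GB of RAM and 512 MB of swap, 4 kB pages:
#   kb_to_pages 1048576    (maxram in pages)
#   kb_to_pages 524288     (maxswap in pages)
```

The page size in bytes can be checked with "getconf PAGESIZE". The values written to the rss and as files should then stay below the computed maxram and maxram+maxswap respectively.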

{{Question|Question=Where can I get newer versions of VServer as ready-made packages for Debian?||Details=
A: Here you go: http://linux-vserver.derjohn.de/ . There is also some stuff on backports.org, but my kernels are always 'devel' branch.|Signature=derjohn}}

{{Question|Question=Can I use iptables ?||Details=
Yes but right now only on the host (rootserver). Please realize that all traffic is local and will not touch the forward chain.|Signature=BeginnerFAQ}}

{{Question|Question=Try to connect to a vserver from the master or another vserver on the same host fails with
:: strace shows:
<pre>
sin_addr=inet_addr("xx.xx.xx.xx")}, yy) = -1 EINVAL (Invalid argument)
</pre>

||Details=
A: The vserver/master cannot communicate with another vserver on same host.
* check all netmasks on all interfaces (do they overlap) ?
* check policy routing (disable it temporarily) ?
* check that lo is up (Networking within a host/vserver always uses lo interface)
|Signature=CommonProblems}}

{{Question|Question=#1 ERROR: capset(): Operation not permitted||Details=capabilities are not enabled in kernel-setup
please check that CONFIG_SECURITY_CAPABILITIES is loaded or included in the kernel. ( check with "cat /path_to_kernel/.config | grep -i cap ")
(2.6.11.5-vs-1.9.5 + 0.30-205)|Signature=IrcQuestions}}

{{Question|Question=How can I make 'vserver start' mount the root filesystem||Details=
mount it via /etc/vservers/vserver-name/fstab, make sure to set the option 'dev' e.g.:
<pre>/dev/drbd0 / xfs rw,dev 0 0</pre>
util-vserver 210 won't be able to find some scripts for the reboot, add into /etc/vservers/vserver-name/apps/init/cmd.stop
<pre>/etc/init.d/rc
6</pre>
|Signature=AdrianReyer}}

Problematic Programs

2006-09-19T17:48:26Z

Meandtheshell: I started to port over things from the correspondend oldwiki site

Frequently Asked Questions

2006-09-19T15:44:01Z

Meandtheshell: minor formating

<div style="margin: 2em auto 2em auto; padding: 10px; background-color: #F9ECCD; border: 1px solid #004433; text-align: center;">
[[Image:Icon-Caution.png|left]]
We are currently migrating to MediaWiki from our old installation, but not all content has been migrated yet. Take a look at the [[Wiki Team]] page for instructions on how to help, or look at the [http://oldwiki.linux-vserver.org old wiki] to find the information not migrated yet.

'''To ease migration we created a [[List of old Documentation pages]].'''
</div>

CURRENTLY THE CONTENT OF THE OLD WIKI FAQ (AND MORE) IS BEING MIGRATED TO THIS PAGE (TASK: DERJOHN)

__TOC__

{{Question|Question=What is a 'Guest'?||Details=To talk about stuff, we need some naming. The physical machine is called 'Host' and the 'main' context running the Host Distro is called 'Host Context'. The virtual machine/distro is called 'Guest' and basically is a Distribution (Userspace) running inside a 'Guest Context'.|Signature=derjohn}}

{{Question|Question=What kind of Operating System (OS) can I run as guest?||Details=
A: With VServer you can only run Linux guests. The trick is that a guest does not run a kernel on its own (as XEN and UML do), it merely uses a virtualized host kernel-interface. VServer offers so-called security contexts which make it possible to separate guests from each other, i.e. they cannot get data from each other. Imagine it as a chroot environment with much more security and features.|Signature=derjohn}}

{{Question|Question=Which distributions did you test?||Details=
A: Some. Check out the wiki for ready-made guest images. You can also easily build your own guest images, e.g. with Debian's debootstrap. See ((step-by-step Guide 2.6)) for how to do that.|Signature=derjohn}}

{{Question|Question=Is VServer comparable to XEN/UML/QEMU?||Details=
A: Nope. XEN/UML/QEMU and VServer are just good friends. Because you ask, you probably know what XEN/UML/QEMU are. In contrast to XEN/UML/QEMU, VServer does not "emulate" any hardware to run a kernel on. You can run a VServer kernel in a XEN/UML/QEMU guest. This is confirmed to work at least with Linux 2.6/vs2.0.|Signature=derjohn}}

{{Question|Question=Is VServer secure?||Details=
A: We hope so. It should be at least as secure as Linux is. We consider it much, much more secure though.|Signature=derjohn}}

{{Question|Question=Performance?||Details=
A: For a single guest, we basically have native performance. Some tests showed insignificant overhead (about 1-2%), others ran faster than on an unpatched kernel. This is IMVHO significantly less than the overhead of other solutions, especially if you have more than a single guest (because of the resource sharing).|Signature=derjohn}}

{{Question|Question=Is SMP Supported?||Details=
A: Yes, on all SMP capable kernel architectures.|Signature=derjohn}}

{{Question|Question=Resource sharing?||Details=
A: Yes ....
* memory: Dynamically.
* CPU usage: Dynamically (token bucket)|Signature=derjohn}}

{{Question|Question=Resource limiting?||Details=
A: Yes, you can set maximum limits per guest, but you can only offer guaranteed resource availability with some ticks at the time. There is the possibility to ulimit and to rlimit. Rlimit is a new feature of kernel 2.6/vs2.0.|Signature=derjohn}}

{{Question|Question=Disk I/O limiting? Is that possible?||Details=
A: Well, since vs2.1.1 linux-vserver supports a mechanism called 'I/O scheduling', which appeared in the 2.6 mainline some time ago. The mainline kernel offers several I/O schedulers:

<pre>
# cat /sys/block/hdc/queue/scheduler
noop [anticipatory] deadline cfq
</pre>

The default is anticipatory a.k.a. "AS". When running several guests on a host you probably want the I/O performance shared in a fair way among the different guests. The kernel comes with a "completely fair queueing" scheduler, CFQ, which can do that. (More on schedulers can be found at http://lwn.net/Articles/114770/)

This is how to set the scheduler to "cfq" manually:
<pre>
root# echo "cfq" > /sys/block/hdc/queue/scheduler
root# cat /sys/block/hdc/queue/scheduler
noop anticipatory deadline [cfq]
</pre>

Keep in mind that you have to do it on all physical discs. So if you run an md-softraid, do it to all physical /dev/hdXYZ discs!

If you run Debian there is a predefined way to set the /sys values at boot-time:

<pre>
# apt-get install sysfsutils
[...]

# cat /etc/sysfs.conf | grep cfq
block/sda/queue/scheduler = cfq
block/sdc/queue/scheduler = cfq

# /etc/init.d/sysfsutils restart
</pre>

For non-vserver processes and CFQ you can set by which key the kernel decides about the fairness:

<pre>
cat /sys/block/hdc/queue/iosched/key_type
pgid [tgid] uid gid
</pre>
Hint: The 'key_type'-feature has been removed in the mainline kernel recently. Don't look for it any longer :(

The default is tgid, which means to share fairly among process groups. Think of every guest as being treated like its own process group. It's not possible to set a scheduler strategy within a guest. All processes belonging to the same guest are treated like "noop" within the guest. So: If you run apache and some ftp-server within the _same_ guest, there is no fair scheduling between them, but there is fair scheduling between the whole guest and all other guests.

And: It's possible to tune the scheduler parameters in several ways. Have a look at /sys/block/hdc/queue/....

You need a very recent version of VS devel, e.g. 2.1.1-rc18 can do it. Some older versions had that feature too, then it got lost and was reinvented. So: Go and get an rc18 - only in 'devel', not stable!|Signature=derjohn}}

{{Question|Question=Why isn't there a device /dev/bla within a guest?||Details=
A: Device nodes allow Userspace to access hardware (or virtual resources). Creating a device node inside the guest's namespace will give access to that device, so for security reasons, the number of 'given' devices is small.|Signature=derjohn}}

{{Question|Question=What is Unification (vunify)?||Details=
A: Unification is Hard Links on Steroids. Guests can 'share' common files (usually binaries and libraries) in a secure way, by creating hard links with special properties (immutable but unlinkable (removable)). The tool to identify common files and to unify them is called vunify.|Signature=derjohn}}

{{Question|Question=What is vhashify?||Details=
A: The successor of vunify, a tool which does unification based on hash values (which allows it to find common files in arbitrary paths).|Signature=derjohn}}

{{Question|Question=How do I manage a multi-guest setup with vhashify?||Details=
A: For 'vhashify', just do these once:

<pre>
mkdir -p /etc/vservers/.defaults/apps/vunify/hash /vservers/.hash
ln -0s /vservers/.hash /etc/vservers/.defaults/apps/vunify/hash/root
</pre>

Then, do this one line per vserver:

<pre>
mkdir /etc/vservers/<vservername>/apps/vunify # vhashify reuses vunify configuration
</pre>
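If you have many guests, the per-vserver step above could be scripted. This is only a sketch: the function name prepare_vunify_dirs is made up for illustration, and it assumes that every non-hidden directory below the config root is a guest.

```shell
#!/bin/sh
# Hypothetical helper (not part of util-vserver): create the apps/vunify
# directory for every guest found below the config root.
# $1 = config root, defaults to /etc/vservers
prepare_vunify_dirs() {
    confdir=${1:-/etc/vservers}
    for guest in "$confdir"/*/; do
        # Skip the unexpanded pattern if there are no guests at all.
        [ -d "$guest" ] || continue
        # "*" does not match hidden entries, so .defaults is left alone.
        mkdir -p "${guest}apps/vunify"
    done
}

# Usage (as root on the host):
#   prepare_vunify_dirs
```

Passing a scratch directory as the first argument lets you try it out without touching /etc/vservers.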

The command 'ln' creates a link between two files. "ln -s" creates a symbolic link -- two files are linked by name. "ln -0s" uses a Vserver extension to create a unified link.|Signature=derjohn}}

{{Question|Question=With which VS version should I begin?||Details=
A: If you are new to VServer, I recommend trying 2.0.+. Take the "alpha utils" version 0.30.210. A well-running version of it recently appeared in Debian Sid (it's a .210 at the time of writing).|Signature=derjohn}}

{{Question|Question=is there a way to implement "user/group quota" per VServer?||Details=
A: Yes, but not on a shared partition for now. You need to put the guest on a separate partition, set up a vroot device (to make the quota access secure), copy that into the guest, and adjust the mtab line inside the guest.|Signature=derjohn}}

{{Question|Question=what about "Quota" for a context?||Details=
A: Context quotas are now called Disk Limits (so that we can tell them apart from the user/group quotas :). They are supported out of the box (with vs2.0) for all major filesystems (Ext2/3, ReiserFS, XFS, JFS)|Signature=derjohn}}

{{Question|Question=Does it support IPv6?||Details=
A: Currently not. Some developer has to move his ... to reimplement this functionality from the V4 code (I read that on the ML ;)). Will probably be superseded by the ngnet (next generation networking) soon. There is a Wiki page regarding this: http://linux-vserver.org/IPv6|Signature=derjohn}}

{{Question|Question=I can't do all I want with the network interfaces inside the guest?||Details=
A: For now the networking is 'Host Business' -- the host is a router, and each guest is a server. You can set the capability ICMP_RAW in the context of the guest, or even the capability CAP_NET_RAW (which would even allow sniffing the interfaces of other guests!). This is likely to change with ngnet.|Signature=derjohn}}

{{Question|Question=Is there a web-based interface for vserver that will allow creation/deletion/configuration etc. of vserver guests?||Details=
A. [Update] Errrh, there is http://OpenVPS.org which is a set of scripts with a web-interface for webhosters/ISPs.
A. [Update] Errrh, there is http://Openvcp.org which is a distributed system (agent!) with a web-interface, with which you can build/remove guests! cool stuff! beta, try out!|Signature=derjohn}}

{{Question|Question=What is old-style and new-style config?||Details=
A. Old-style config refers to a single text-file that contains all the configuration settings. With new-style config the configuration is split into several directories and files. You should probably go for new-style config if you are asking.|Signature=derjohn}}

{{Question|Question=What is the "great flower page"?||Details=
A. Well, this page contains all configuration options for vserver in version > 1.9 (I think .. I joined Linux-VServer in version 2, so I don't know for sure). The name of the page is derived from the stylesheet(s) it contains: It displays background pictures of a very great flower, so regard it as highly optimized. It was designed by a non-designer, who asks us to create a better one. I played with the thought of creating a completely new theme for that page, but actually we all got used to the name "great flower page", so we stick to it. If you are unable to read it clearly, feel invited to join the IRC channel #vserver, we may tell you how to ;)|Signature=derjohn}}

{{Question|Question=How do I add several IPs to a vserver? ||Details=
A: First of all, a single guest vserver only supports up to 16 IPs (there is a 64-IP patch available, which is in "derjohn's kernel"; you need an extra util-vserver for it anyway).
Here is a little helper script that adds a list of IPs defined in a text file, one per line.
<pre>
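#!/bin/bash
# Creates one interface directory per IP listed in ./myiplist (one IP per line).
j=1
for i in $(cat myiplist); do
    j=$((j+1))               # directories are numbered 2, 3, ...
    mkdir "$j"
    echo "$i" > "$j/ip"
    echo "$i" > "$j/ip-old"
    echo "24" > "$j/prefix"  # prefix length; adjust to your network
done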
</pre>|Signature=derjohn}}

{{Question|Question=If my host has only a single public IP, can I use RFC1918 IPs (e.g. 192.168.foo.bar) for the guest vservers?||Details=
A: Yes, use iptables with SNAT to masquerade it.
<pre>
iptables -t nat -I POSTROUTING -s $VSERVER_NETZ ! -d $VSERVER_NETZ -j SNAT --to $EXT_IP
</pre>
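As a rough illustration of the two variables used above, here is a sketch with made-up example values (the network and the public address are assumptions, not taken from any real setup). The helper snat_rule is hypothetical and only prints the rule, so you can check it before running it as root:

```shell
#!/bin/sh
# Example values only (assumptions): replace them with your own.
VSERVER_NETZ="192.168.0.0/24"   # RFC1918 network used by the guests
EXT_IP="203.0.113.10"           # the host's single public IP

# Hypothetical helper: print the SNAT rule instead of applying it.
# $1 = guest network, $2 = external IP
snat_rule() {
    echo "iptables -t nat -I POSTROUTING -s $1 ! -d $1 -j SNAT --to $2"
}

snat_rule "$VSERVER_NETZ" "$EXT_IP"
```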
See: HowtoPrivateNetworking and
http://www.tgunkel.de/it/software/doc/linux_server#h3-Vserver_Masquerading_SNAT (THX, [MUPPETS]Gonzo)|Signature=derjohn}}

{{Question|Question=If I shut down my vserver guest, the whole Internet interface ethX on the host is shut down. What happened? ||Details=
A: When you shut down a guest (''i.e. vserver foo stop''), its IP is also brought down on the host. If this IP happens to be the primary IP of the host, the kernel will bring down not only the primary IP but also all secondary IP addresses. Very recent kernels have a ''settable'' option called "alias promotion" that prevents this nasty behaviour. You can set it via sysctl by adding ''net.ipv4.conf.all.promote_secondaries=1'' to /etc/sysctl.conf or via the sysctl command line.|Signature=derjohn}}

{{Question|Question=On Debian Sarge (stable) only util-vserver 0.30-204 is available, which has been reported to be buggy (I haven't checked that version for a long time). How do I compile a local version of alpha util-vserver .210 on Debian?||Details=
A:
<pre>
apt-get build-dep util-vserver

./configure --prefix=/usr/local/ --enable-release \
--mandir=/usr/local/share/man \
--infodir=/usr/local/share/info \
--sysconfdir=/etc --enable-dietlibc \
--localstatedir=/var \
--with-vrootdir=/var/lib/vservers

make

make install-distribution
(Which does a make install + setting a symlink ln -s /usr/local/lib/util-vserver/vshelper /sbin/vshelper )

</pre>

To test which version you are running:
<pre>
# which vserver
/usr/local/sbin/vserver

</pre>

This should point to ..local...

If you don't want to build it yourself: On www.backports.org there are backported (for sarge) linux-images (2.6.16) with the vserver patch enabled and an updated util-vserver package as well.
|Signature=derjohn}}

{{Question|Question=I use derjohn's kernel or a different kernel with a more-than-16-IPs-per-guest patch and can't use more than 16 IPs. Why?||Details=
A: You need to patch util-vserver, too. So you obviously need to recompile util-vserver (see above). In the util-vserver directory there are header files in the ./kernel/ directory. Patch like this:

<pre>
kernel/network.h:#define NB_IPV4ROOT 64
</pre>

BTW: The initial patches can be found here: http://vserver.13thfloor.at/Experimental/VARIOUS/util-vserver-0.30.196-net64.diff.bz2 and http://vserver.13thfloor.at/Experimental/VARIOUS/delta-2.6.9-vs1.9.3-net64.diff
|Signature=unknown}}

{{Question|Question=I run a Debian host and want to build an Ubuntu guest. Howto?||Details=
A: Simple ;) Assume you want to build a breezy guest on a sid host with IP 192.168.0.2 and hostname vubuntu, then do:
<pre>
vserver vubuntu build --force -m debootstrap --hostname vubuntu.myvservers.net --netdev eth0 --interface 192.168.0.2/24 \
--context 42 -- -d breezy -m http://de.archive.ubuntu.com/ubuntu
</pre>

[UPDATE] Currently there are problems in building breezy under unclear circumstances, which seem to have to do with udev. If the above didn't work, try:
<pre>
vserver vubuntu build --force -m debootstrap --hostname vubuntu.myvservers.net --netdev eth0 --interface 192.168.0.2/24 \
--context 42 -- -d breezy -m http://de.archive.ubuntu.com/ubuntu -- --exclude=udev
</pre>
In very recent versions of the utils, the problem should not occur anymore (it has to do with the 'secure-mount' if you look in the MLs)

Well, sid's debootstrap knows how to bootstrap Ubuntu linux. Make sure to have a current debootstrap package:
<pre>
apt-get update
apt-get install debootstrap
</pre>
The knowledge how to build ubuntu 'breezy badger' (which you probably want to be your guest at the time of writing) has been added recently.|Signature=derjohn}}

{{Question|Question=How do I make a vserver guest start by default?||Details=
A: At least on Debian, I can tell you how to do it with the new-style config. If your guest is called "derjohn" and you want it to be started somewhere at the end of your boot process, then do:
<pre>
echo "default" > /etc/vservers/derjohn/apps/init/mark
</pre>
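If several guests should start by default, the same step can be wrapped in a small shell function. This is just a sketch: mark_default is a made-up name, and the second argument (the config root) exists only so that the function can be tried out on a scratch directory.

```shell
#!/bin/sh
# Hypothetical helper: write the "default" mark for one guest.
# $1 = guest name, $2 = config root (defaults to /etc/vservers)
mark_default() {
    confdir=${2:-/etc/vservers}
    mkdir -p "$confdir/$1/apps/init"
    echo "default" > "$confdir/$1/apps/init/mark"
}

# Usage (as root on the host; guest names are examples):
#   for g in derjohn vubuntu; do mark_default "$g"; done
```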
If you want to start it earlier, please read the init script "/etc/init.d/vserver-default" to find out how to do it. In most cases you don't need to change this. On Debian the vservers are started at "90", so after most other stuff is up (networking etc.).

Besides that I created a small helper script for managing the autostart foo: [[vserver-autostart]]|Signature=derjohn}}

{{Question|Question=My host works, but when I start a guest it says that it has a problem with chbind.||Details=
A: You are probably using util-vserver <= 0.30.209, which uses dynamic network contexts internally (with 0.30.210 this changed). So if you compiled your kernel without dynamic contexts, you can start guests, but you can't use the network context. The solution is either to switch to the .210 utils (or Hollow's toolset) or to compile the kernel with dynamic network contexts.
SE Keyword: invalid option `nid' testme.sh|Signature=derjohn}}

{{Question|Question=When I try to ssh to the guest, I log into the host, even if I installed sshd on the guest. What's wrong here?||Details=
A: Look at /etc/ssh/sshd_config of the host:

<pre>
Port 22
# Use these options to restrict which interfaces/protocols sshd will bind to
#ListenAddress ::
</pre>

And now change the setting to
<pre>
Port 22
# Use these options to restrict which interfaces/protocols sshd will bind to
ListenAddress your.hosts.ip.here # not the guests IP!
</pre>

Then run '/etc/init.d/ssh restart' on the host, and after that on the guest (if you have already run apt-get install ssh on the guest).

Do I have to explain more? By default the host's sshd binds to all available IP addresses on port 22 (the host even 'sees' all addresses of the guests!). So when the guest starts its sshd, it can't bind to port 22 any more. You need to change that setting only on the host.
(BTW: A similar approach is needed for a lot of daemons, e.g. Apache. If the daemon does not support an explicit bind, you may use the chbind command to 'hide' IP addresses from the daemon before starting it.)|Signature=derjohn}}

{{Question|Question=I did everything right, but the application foo does not start. What's up there?||Details=
A: Before asking on the IRC channel, please check out the 'problematic programs' page:
{{ProblematicPrograms}}|Signature=derjohn}}

{{Question|Question=Bind9 does not like to start in my guest.||Details=
A: Check out the 'problematic programs' page:
[[ProblematicPrograms]] and/or get my [http://linux-vserver.derjohn.de/bind9-packages/bind9-capacheck_9.3.2-2_i386.deb vserver-guest-ready Debian package] for Debian Sid guests from that URL: http://linux-vserver.derjohn.de/bind9-packages/bind9-capacheck_9.3.2-2_i386.deb and check out the [http://linux-vserver.derjohn.de/bind9-packages/README.txt readme]. (Hint: This is fresh stuff. Please give me feedback.)

[UPDATE] Since VServer Devel 2.1.1-rc18 you do not need to patch the userland tools anymore. The capabilities are masked.|Signature=derjohn}}

{{Question|Question=Which guest vservers are running?||Details=
A: <code>vserver-stat</code>. Example output:
<pre>
CTX PROC VSZ RSS userTIME sysTIME UPTIME NAME
0 77 965.1M 334.6M 14m14s18 2m28s69 1h33m46 root server
49152 7 14M 5.2M 0m00s40 0m00s30 1h30m15 chiffon
</pre>|Signature=derjohn}}

{{Question|Question=How can I reboot/halt guests?||Details=
A: It depends.
For a vserver with legacy-interfaces support, you have to replace <code>/sbin/halt</code> in the guests with vreboot and start rebootmgr on the host. You also need a dummy <guest>.conf file in /etc/vservers for each guest. Please have a look at /etc/init.d/rebootmgr.
A vserver with the native interface uses /dev/initctl. No changes are needed in the guests. Just make sure that the REBOOT capability is adjusted in the guests.|Signature=derjohn}}

{{Question|Question=Do I really need the legacy-interfaces? What are these legacy-interfaces?||Details=
A: Since vserver is an ongoing project, new features might replace old ones, and some might still be in development. Legacy interfaces are available for backward compatibility (and might be removed someday). See Q: How can I reboot/halt guests?|Signature=derjohn}}

{{Question|Question= I have a vserver running on a Linux kernel with preemption. Is VServer "preempt" safe?||Details=
A: There are no known issues with running vserver on a preemption-enabled kernel. I would like to add that the vserver kernel hackers would probably exclude that option in 'make menuconfig' if there were an incompatibility. Just my $.02 :)|Signature=derjohn}}

{{Question|Question=Is this a new project? When was it started?||Details=
A: The first public occurrence of linux-vserver was Oct 2001. The initial mail can be found here: http://www.cs.helsinki.fi/linux/linux-kernel/2001-40/1065.html
So you can expect a mature software product which does its magic quite well (and hey, we have a version > 2.0!)|Signature=derjohn}}

{{Question|Question=Can I run an OpenVPN Server in a guest?||Details=
A: Yes. I don't want to provide an in-depth OpenVPN tutorial, but I want to show how I made OpenVPN work as a server in a guest. I was not able to run it with a tun device, due to a buglet in util-vserver and the kernel when it comes to setting an IP address on a point-to-point link: if you add "ip addr add <ip> peer <mypeer> dev tun0", there is no way to map the tun0 interface into a guest, not even with a 'nodev' option. (Bug confirmed to be reproducible by daniel_hoczac.)

First of all you have to prepare the host with a persistent tuntap interface in tap mode. The tools we need come from uml-utilities.
Then you need to create a device /dev/net/tun, which the OpenVPN userspace daemon reads. We'll assume 10.10.10.100 is the server IP and 10.10.10.101 is the client IP. To be cool, we choose a /31 netmask (255.255.255.254), so we have a net without broadcast and don't waste IPs :)

On the host do:
<pre>
# apt-get install uml-utilities
# cd /var/lib/vservers/<myopenvpnserver>/dev/
# ./MAKEDEV tun
(creates the dev/net/tun device accessible by the guest; even a tap interface needs /dev/net/tun!)
# tunctl -t tap0
(creates the network device 'tap0' persistently)
</pre>

Then add the ip to the guest:
<pre>
# cat /etc/vservers/<myopenvpnserver>/interfaces/1/ip
10.10.10.100
# cat /etc/vservers/<myopenvpnserver>/interfaces/1/prefix
31
# cat /etc/vservers/<myopenvpnserver>/interfaces/1/dev
tap0
(This kind of config brings up the IP when the vserver is started. Only the tap0 interface has to exist already, see above!)
</pre>

Here is a sample config for the guest (which is acting as a server).

First install the OpenVPN package on server and client, in the Debian case:
<pre>
# apt-get install openvpn
</pre>

The server's conf looks like that:
<pre>
# port and interface specs

# behave like a ssl-webserver
port 443
proto tcp-server

# tap device? (keep in mind you need /dev/net/tun !)
dev tap0

# now the ips we will use for the tunnel
ifconfig 10.10.10.100 255.255.255.254
ifconfig-noexec

# the server part

# Keep VPN connections, even if the client IP changes
float

# use compression (may also even obfuscate content filters)
comp-lzo

# use a static key - create it with 'openvpn --genkey --secret static.key'
secret static.key

# dont reload the key after a SIGUSR1
persist-key

# check alive all 10 secs
keepalive 10 60

# verbosity level (from 1 to 9, 9 is max log level)
verb 4
status openvpn-status.log
</pre>

The client's conf may look like this (this example even makes the tunnel the client's default route):
<pre>
# cat /etc/openvpn/client.conf
# port and interface specs

# the following is not necessary, if you bring up openvpn via Debian's init script:
daemon ovpn-my-clients-name

# behave like a ssl-webserver
port 443
proto tcp-client
remote %%%<insert-the-guest-primary-public-ip-here>%%%%
# what device, tun or tap?
dev tap

# now the ips we will use for the tunnel
ifconfig 10.10.10.101 255.255.255.254

# Keep VPN connections, even if the client IP changes
float
mssfix

# use compression (may also even obfuscate content filters)
comp-lzo

# use a static key
secret static.key

# dont reload the key after a SIGUSR1
persist-key

# check alive all 10 secs
keepalive 10 60

# verbosity level (from 1 to 9, 9 is max log level)
verb 4

# set the default route
route-gateway 10.10.10.100
redirect-gateway def1
# to add special routes you can do it within the openvpn client conf:
# route <dest> <mask> <gateway>

# if you need to connect via proxy (like squid)
# http-proxy s p [up] [auth] : Connect to remote host through an HTTP proxy at
# address s and port p. If proxy authentication is required,
# up is a file containing username/password on 2 lines, or
# 'stdin' to prompt from console. Add auth='ntlm' if
# the proxy requires NTLM authentication.

# http-proxy s p [up] [auth]

# http-proxy-option type [parm] : Set extended HTTP proxy options.
# Repeat to set multiple options.
# VERSION version (default=1.0)
# AGENT user-agent

# http-proxy-option type [parm]
</pre>

In the next lesson I will talk about OpenVPN's server mode, which can deal with multiple clients connecting to one IP and one port (i.e. you only need one guest for tons of 'roadwarriors'), TLS connections and PKI.

Contributions welcome. :)|Signature=derjohn}}

{{Question|Question=32 vs 64 Bit? What should I take?||Details=
A: If you have the choice, make the host a 64-bit one. You can run a guest as 32 bit or as 64 bit on a 64-bit host. To run it as 32 bit, you need to compile the x86_64 (a.k.a. AMD64) kernel with the following options:

<pre>
[*] Kernel support for ELF binaries
<M> Kernel support for MISC binaries
[*] IA32 Emulation <---- without that, the entire 32bit API is not present
<M> IA32 a.out support
</pre>

You can force the guest to behave like a 32-bit environment like this:
<pre>
echo linux_32bit > /etc/vservers/$NAME/personality
echo i686 > /etc/vservers/$NAME/uts/machine
</pre>
(thanks cehteh for the hint!)

You can also force debootstrap to put 32-bit binaries into the guest with 'export ARCH=i386':
<pre>
export ARCH=i386 ; vserver build ....
</pre>|Signature=derjohn}}

{{Question|Question=I want to (re)mount a partition in a running guest ... but the guest has no rights (capability) to (re)mount?||Details=
A: I'll explain with an example. Suppose the /tmp partition within the guest is too small, which is likely if you stay with the 16 MB default (vserver build mounts /tmp as a 16 MB tmpfs!).
<pre>
# vnamespace -e XID mount -t tmpfs -o remount,size=256m,mode=1777 none /var/lib/vservers/<guest>/tmp/
</pre>
Be warned that the guest will not recognize the change, as the /etc/mtab file is not updated when you mount like this. To permanently change the mount, edit /etc/vservers/<guest>/fstab on the host.|Signature=derjohn}}

{{Question|Question=How do I limit a guest's RAM? I want to prevent OOM situations on the host!||Details=
A: First you can read [http://linux-vserver.org/Memory+Allocation].
If you want a recipe, do this:
1. Check the size of memory pages. On x86 and x86_64 it is usually 4 KB per page.
2. Create /etc/vservers/<guest>/rlimits/
3. Check your physical memory size on the host, e.g. with "free -k" (which reports kilobytes). maxram = kilobytes/pagesize, with the page size also in kilobytes.
4. Limit the guest's physical RAM to a value smaller than maxram:

<pre>
echo %%insertYourPagesHereSmallerThanMaxram%% > /etc/vservers/<guest>/rlimits/rss
</pre>

5. Check your swapspace, e.g. with 'swapon -s'. maxswap = swapkilobytes/pagesize.
6. Limit the guest's maximum number of as pages to a value smaller than (maxram+maxswap):

<pre>
echo %%desiredvalue%% > /etc/vservers/<guest>/rlimits/as
</pre>
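The page arithmetic in steps 3 and 5 can be sketched in shell. The function name kb_to_pages is made up for this example; the numbers in the call are just sample values (800000 kB with 4 KB pages).

```shell
#!/bin/sh
# Hypothetical helper: convert an amount of memory into a page count.
# $1 = memory in kilobytes, $2 = page size in bytes
kb_to_pages() {
    echo $(( $1 * 1024 / $2 ))
}

# Sample values: 800000 kB of RAM, 4096-byte pages.
kb_to_pages 800000 4096   # prints 200000

# On a real host the inputs could be taken from the system, e.g.:
#   pagesize=$(getconf PAGESIZE)
#   memkb=$(awk '/^MemTotal:/ { print $2 }' /proc/meminfo)
#   kb_to_pages "$memkb" "$pagesize"
```

The result is an upper bound (maxram); the value you write into the rss file should be smaller than that.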

It should be clear that this can still lead to OOM situations. Example: you have two guests and the as limit of each guest is greater than 50% of (maxram+maxswap). If both guests request their maximum at the same time, there will not be enough memory.|Signature=derjohn}}

{{Question|Question=Where can I get newer versions of VServer as ready-made packages for Debian?||Details=
A: Here you go: http://linux-vserver.derjohn.de/ . There is also some stuff on backports.org, but my kernels are always 'devel' branch.|Signature=derjohn}}

{{Question|Question=Can I use iptables ?||Details=
Yes but right now only on the host (rootserver). Please realize that all traffic is local and will not touch the forward chain.|Signature=BeginnerFAQ}}

{{Question|Question=Trying to connect to a vserver from the master or another vserver on the same host fails with
:: strace shows:
<pre>
sin_addr=inet_addr("xx.xx.xx.xx")}, yy) = -1 EINVAL (Invalid argument)
</pre>

||Details=
A: The vserver/master cannot communicate with another vserver on same host.
* check all netmasks on all interfaces (do they overlap?)
* check policy routing (disable it temporarily)
* check that lo is up (Networking within a host/vserver always uses lo interface)
|Signature=CommonProblems}}

{{Question|Question=#1 ERROR: capset(): Operation not permitted||Details=Capabilities are not enabled in the kernel setup.
Please check that CONFIG_SECURITY_CAPABILITIES is loaded (as a module) or built into the kernel (check with "cat /path_to_kernel/.config | grep -i cap").
(2.6.11.5-vs-1.9.5 + 0.30-205)|Signature=IrcQuestions}}

{{Question|Question=How can I make 'vserver start' mount the root filesystem||Details=
Mount it via /etc/vservers/vserver-name/fstab and make sure to set the option 'dev', e.g.:
<pre>/dev/drbd0 / xfs rw,dev 0 0</pre>
util-vserver 210 won't be able to find some scripts for the reboot, so add the following to /etc/vservers/vserver-name/apps/init/cmd.stop:
<pre>/etc/init.d/rc
6</pre>
|Signature=AdrianReyer}}

Usage Scenarios

2006-09-18T19:42:37Z

Meandtheshell: fixed typo

For many people, a virtual server may look like a great toy: Very high geekness factor. It looks cool, but probably not for everyone. '''Wrong!'''

== Usage scenarios ==

=== Consolidation and Separation ===

As the hardware evolves, it is tempting to put more and more tasks on a server. Linux can handle it. Linux is reliable and so on. At some point, you end up with so much stuff and so many people fiddling in the same box that you worry about updating things.

Vservers address this. The same box runs multiple vservers and each one does the job it is supposed to do. If you need to upgrade to php5 for a given project you do so and only that project is affected.

Also, you can give the root password of a vserver to one administrator and he will be able to perform updates, restart services and so on without having to know about every other project hosted on a server.

=== Resource Independence ===

Since vservers are only guests on the hardware they are using, they are not aware of the specifics: They do not contain disk configurations, kernels or network configurations.

Once you have found that a project is using more resources than expected, you can move it to another box without having to fiddle here and there. A vserver is just a directory inside the host server. You tar it, copy it to another box and restart it there.

An administrator can move vservers around without his users knowing what he does.

=== Experimenting and Upgrading ===

You have this typical problem. You intend to upgrade a site either with a new package (new PHP5) or new features (a new version of the Web applications). After having tested all this on the development machine, you are ready to update the production server. Having some experience, you do it properly:

* First a good backup of the server
* Then you perform all the upgrades and install the new applications
* Fire: the new server is online and kicking.
* 2 hours later, you realise that something does not work as expected.
* To make it worse, it works fine on the development machine.
* Now it is 2 am and this has to work by 8am. Hum.

We have all experienced this. Another solution to this problem would be to install the new production server on new hardware, but this is not as easy, as you have to clone the first server (most people are not comfortable doing this) or you do not have the hardware.

Using vserver, all this is very easy:

* You stop the production vserver
* You clone it (Cloning a vserver takes 1 minute unless it has a lot of data).
* You perform the upgrades in the new vserver and give it life.

Later you find it does not work as expected and you can't immediately fix it.

* You turn off the new vserver and assign it a new IP number.
* You start both the old and new. Now the old is still online and you can fiddle with the new (on a different IP address) to fix the problem.

=== Development and Testing ===

You are working on a new project. At this point you have no idea of the resources it will need (load). In general, you have a meeting and make a decision about the new hardware. But in this case we have no idea. So instead, we put the new project on a vserver on the first available linux box, including the workstation of the developer. And we develop. Once the thing is ready, we clone the vserver on some production server and give it a go.

Later once we know how popular the new service is, we can move it again to a more powerful server, as needed, without any fiddling (moving accounts, installing packages, and so on)

As a good example, Herbert has 33 vservers on his notebook allowing him to test various projects and various distributions from rh6.2 to FC3 and Debian.

=== Distribution Independence ===

We are often talking about our preferred distribution. Should we use FC, Debian or something else ? Should we give a spin to the latest and greatest ?

With vservers, the choice of a distribution is less important. When you select a distribution, you expect it will do the following

* Good hardware support/detection
* Good package technology/updates
* Good package selection
* Reliable packages

The choice is important because every service running on a box will be using the same distribution. Most distributions out there are good and reliable. But they have flaws. For example, a distribution - say XXX - is great but is not delivering the latest and greatest PHP. Now because you have decided to use XXX for some projects, it does not prevent you from using XXY for other projects. So instead of moving to XXY for everything, you move to it as you see fit, project per project.

== Other considerations ==

* '''Virtual Private Servers are running on the same kernel as the host:''' Unlike other VM solutions, vservers don't require additional memory or processing power.

* '''There are no special daemons running:''' A vserver running crond, sshd, httpd and sendmail uses the same resources as a normal Linux server running these services.

* '''No pre-allocated disk space needed:''' Vservers generally share the disk space with the host, so there is no need to pre-allocate disk space for each vserver only to find out later that your disk is full, yet each vserver is only using a tiny portion of its allocated space.

* '''Resource sharing:''' Since vservers can share binaries and libraries without interfering, a second vserver generally costs only 40-100 megs of disk space. Most of this space is a copy of the packaging database.

* '''Independent updates:''' Vservers are updated independently even if they share binaries with other vservers.

* '''32-/64-bit independence:''' 32-bit distribution vservers run normally on a 64-bit linux host, but faster, sometimes a lot.

* '''Admin tools work inside a vserver as usual''': A vserver feels like a real server from within and can be used in the same ways.

* '''A cracked vserver can't reach the host server:''' The host server may be used to safely investigate the cracked server (and fix it), something almost impossible with a typical linux installation.

== See Also ==

* [[Overview]]

Usage Scenarios

2006-09-18T19:41:54Z

Meandtheshell: fixed typo

For many people, virtual server may look like a great toy: Very high geekness factor. It looks cool, but probably not for everyone. '''Wrong!'''

== Usage scenarios ==

=== Consolidation and Seperation ===

As the hardware evolves, it is tempting to put more and more tasks on a server. Linux can handle it. Linux is reliable and so on. At some point, you end up with so much stuff and so many people fiddling in the same box that you worry about updating things.

Vservers address this. The same box runs multiple vservers and each one does the job it is supposed to do. If you need to upgrade to php5 for a given project you do so and only that project is affected.

Also, you can give the root password of a vserver to one administrator and he will be able to perform updates, restart services and so on without having to know about every other project hosted on a server.

=== Resource Independance ===

Since vservers are only guests on the hardware they are using, they are not aware of the specifics: They do not contain disk configurations, kernels or network configurations.

Once you have found that a project is using more resource than expected, you can move it to another box without having to fiddle here and there. A vserver is just a directory inside the host server. You tar it and copy it to another box and restart it there.

An administrator can move vservers around without his users knowing what he does.

=== Experimenting and Upgrading ===

You have this typical problem. You intend to upgrade a site either with new package (new PHP5) or new features (new version of the Web applications). After having tested all this on the development machine, you are ready to update the production server. Having some experience you do it properly

* First a good backup of the server
* Then you perform all the upgrades and install the new applications
* Fire: the new server is online and kicking.
* 2 hours later, you realise that something does not work as expected.
* To make it worse, it works fine on the development machine.
* Now it is 2 am and this has to work by 8am. Hum.

We have all experienced this. Another solution to this problem would be to install the new production server on new hardware, but this is not as easy, as you have to clone the first server (most people are not confortable doing this) or you do not have the hardware.

Using vserver, all this is very easy

* You stop the production vserver
* You clone it (Cloning a vserver takes 1 minute unless it has a lot of data).
* You perform the upgrades in the new vserver and give it life.

Later you find it does not work as expected and you can't immediately fix it.

* You turn off the new vserver and assign it a new IP number.
* You start both the old and new. Now the old is still online and you can fiddle with the new (on a different IP address) to fix the problem.

=== Development and Testing ===

You are working on a new project. At this point you have no idea of the resources it will need (load). In general, you have a meeting and make a decision about the new hardware. But in this case we have no idea. So instead, we put the new project on a vserver on the first available linux box, including the workstation of the developer. And we develop. Once the thing is ready, we clone the vserver on some production server and give it a go.

Later once we know how popular the new service is, we can move it again to a more powerful server, as needed, without any fiddling (moving accounts, installing packages, and so on)

As a good example, Herbert has 33 vservers on his notebook allowing him to test various projects and various distributions from rh6.2 to FC3 and Debian.

=== Distribution Independence ===

We are often talking about our preferred distribution. Should we use FC, Debian or something else ? Should we give a spin to the latest and greatest ?

With vservers, the choice of a distribution is less important. When you select a distribution, you expect it will do the following

* Good hardware support/detection
* Good package technology/updates
* Good package selection
* Reliable packages

Without vservers, the choice is important because every service running on a box uses the same distribution. Most distributions out there are good and reliable, but they have flaws. For example, a distribution, say XXX, is great but does not deliver the latest and greatest PHP. With vservers, having decided to use XXX for some projects does not prevent you from using XXY for other projects. So instead of moving to XXY for everything, you move to it as you see fit, project by project.

== Other considerations ==

* '''Virtual Private Servers run on the same kernel as the host:''' Unlike other VM solutions, there is no second kernel or emulated hardware to run, so vservers don't require additional memory or processing power.

* '''There are no special daemons running:''' A vserver running crond, sshd, httpd and sendmail uses the same resources as a normal Linux server running these services.

* '''No pre-allocated disk space needed:''' Vservers generally share the disk space with the host, so there is no need to pre-allocate disk space for each vserver only to find out later that your disk is full, yet each vserver is using only a tiny portion of its allocated space.

* '''Resource sharing:''' Since vservers can share binaries and libraries without interfering with each other, a second vserver generally costs only 40-100 MB of disk space. Most of this space is a copy of the packaging database.

* '''Independent updates:''' Vservers are updated independently even if they share binaries with other vservers.

* '''32-/64-bit independence:''' Vservers with a 32-bit distribution run normally on a 64-bit Linux host, only faster, sometimes by a lot.

* '''Admin tools work inside a vserver as usual:''' A vserver feels like a real server from within and can be used in the same ways.

* '''A cracked vserver can't reach the host server:''' The host server may be used to safely investigate the cracked vserver (and fix it), something almost impossible with a typical Linux installation.

== See Also ==

* [[Overview]]