Chapter Excerpt

Using Xen PyGRUB, ionice to manage storage and disks

Solutions Provider Takeaway: This chapter excerpt gives valuable information on Xen PyGRUB, a bootloader that solutions providers can use when virtualizing their customers' hardware. Learn how to regulate disk access by using ionice and how to manage storage in a shared hosting environment. You will also find out how to back up domUs and provide remote access to the domU.

About the book:
This chapter excerpt on Hosting untrusted users under Xen: Lessons from the trenches is taken from The Book of Xen: A practical guide for the system administrator. This book advises solutions providers on the best practices for Xen installation, networking, memory management and virtualized storage. You'll also find information on virtual hosting, installing and managing multiple guests, easily migrating systems and troubleshooting common Xen issues.

Storage in a Shared Hosting Environment

As with so much else in system administration, a bit of planning can save a lot of trouble. Figure out beforehand where you're going to store pristine filesystem images, where configuration files go, and where customer data will live.

For pristine images, there are a lot of conventions -- some people use /diskimages, some use /opt/xen, /var/xen or similar, some use a subdirectory of /home. Pick one and stick with it.

Configuration files should, without exception, go in /etc/xen. If you don't give xm create a full path, it'll look for the file in /etc/xen. Don't disappoint it.
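
For example, assuming a config file named example.cfg (a placeholder name) saved in /etc/xen, these two commands do the same thing:

# xm create /etc/xen/example.cfg
# xm create example.cfg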

As for customer data, we recommend that serious hosting providers use LVM. This allows greater flexibility and manageability than blktap-mapped files while maintaining good performance. Chapter 4 covers the details of working with LVM (or at least enough to get started), as well as many other available storage options and their advantages. Here we're confining ourselves to lessons that we've learned from our adventures in shared hosting.

Regulating Disk Access with ionice

One common problem with VPS hosting is that customers -- or your own housekeeping processes, like backups -- will use enough I/O bandwidth to slow down everyone on the machine. Furthermore, I/O isn't really affected by the scheduler tweaks discussed earlier. A domain can request data, hand off the CPU, and save its credits until it's notified of the data's arrival.

Although you can't set hard limits on disk access rates as you can with the network QoS, you can use the ionice command to prioritize the different domains into subclasses, with a syntax like:

# ionice -p <PID> -c <class> -n <priority within class>

Here -n is the knob you'll ordinarily want to twiddle. It can range from 0 to 7, with lower numbers taking precedence.

We recommend always specifying 2 for the class. Other classes exist -- 3 is idle and 1 is realtime -- but idle is extremely conservative, while realtime is so aggressive as to have a good chance of locking up the system. The within-class priority is aimed at proportional allocation, and is thus much more likely to be what you want.

Let's look at ionice in action. Here we'll test ionice with two different domains, one with the highest normal priority, the other with the lowest.

First, ionice only works with the CFQ I/O scheduler. To check that you're using the CFQ scheduler, run this command in the dom0:

# cat /sys/block/[sh]d[a-z]*/queue/scheduler

noop anticipatory deadline [cfq]

noop anticipatory deadline [cfq]

The word in brackets is the selected scheduler. If it's not [cfq], reboot with the kernel parameter elevator=cfq.
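
If a reboot is inconvenient, kernels that ship CFQ also let you switch the active scheduler at runtime by writing to the same sysfs file, one disk at a time (sda here is just an example):

# echo cfq > /sys/block/sda/queue/scheduler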

Next we find the processes we want to ionice. Because we're using tap:aio devices in this example, the dom0 process is tapdisk. If we were using phy: devices, it'd be [xvd <domain id> <device specifier>].

# ps aux | grep tapdisk
root  1054  0.5  0.0  13588  556  ?  Sl  05:45  0:10  tapdisk
/dev/xen/tapctrlwrite1 /dev/xen/tapctrlread1
root  1172  0.6  0.0  13592  560  ?  Sl  05:45  0:10  tapdisk
/dev/xen/tapctrlwrite2 /dev/xen/tapctrlread2

Now we can ionice our domains. Note that the numbers of the tapctrl devices correspond to the order the domains were started in, not the domain ID.

# ionice -p 1054 -c 2 -n 7

# ionice -p 1172 -c 2 -n 0
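
To double-check that the settings took, you can query each process with ionice and only the -p option; with the util-linux version of the tool, the output looks roughly like this (shown as what we'd expect, not a captured transcript):

# ionice -p 1054
best-effort: prio 7
# ionice -p 1172
best-effort: prio 0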

To test ionice, let's run a couple of Bonnie++ processes and time them. (After Bonnie++ finishes, we dd a load file, just to make sure that conditions for the other domain remain unchanged.)

prio 7 domU tmp # /usr/bin/time -v bonnie++ -u 1 && dd if=/dev/urandom of=load

prio 0 domU tmp # /usr/bin/time -v bonnie++ -u 1 && dd if=/dev/urandom of=load

In the end, according to the wall clock, the domU with priority 0 took 3:32.33 to finish, while the priority 7 domU needed 5:07.98. As you can see, the ionice priorities provide an effective way to do proportional I/O allocation.

The best way to apply ionice is probably to look at CPU allocations and convert them into priority classes. Domains with the highest CPU allocation get priority 1, next highest priority 2, and so on. Processes in the dom0 should be ioniced as appropriate. This will ensure a reasonable priority, but not allow big domUs to take over the entirety of the I/O bandwidth.
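
As a rough sketch of that policy -- not a turnkey script; the PIDs and the exact priority values are placeholders, and we're assuming tap:aio-backed domains so the dom0-side process is tapdisk -- you might run something like this in a root shell:

for pid in $(pgrep tapdisk); do ionice -p "$pid" -c 2 -n 6; done   # low default for everyone
ionice -p 1172 -c 2 -n 1                                           # highest CPU share, highest I/O priority
ionice -p 1054 -c 2 -n 3                                           # smaller domain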

Backing Up DomUs

As a service provider, one rapidly learns that customers don't do their own backups. When a disk fails (not if -- when), customers will expect you to have complete backups of their data, and they'll be very sad if you don't. So let's talk about backups.

Of course, you already have a good idea how to back up physical machines. There are two aspects to backing up Xen domains: First, there's the domain's virtual disk, which we want to back up just as we would a real machine's disk. Second, there's the domain's running state, which can be saved and restored from the dom0. Ordinarily, our use of backup refers purely to the disk, as it would with physical machines, but with the advantage that we can use domain snapshots to pause the domain long enough to get a clean disk image.

We use xm save and LVM snapshots to back up both the domain's storage and running state. LVM snapshots aren't a good way of implementing full copy-on-write because they handle the "out of snapshot space" case poorly, but they're excellent if you want to preserve a filesystem state long enough to make a consistent backup.

Our implementation copies the entire disk image using either a plain cp (in the case of file-backed domUs) or dd (for phy: devices). This is because we very much want to avoid mounting a possibly unclean filesystem in the dom0, which can cause the entire machine to panic. Besides, if we do a raw device backup, domU administrators will be able to use filesystems (such as ZFS on an OpenSolaris domU) that the dom0 cannot read.

An appropriate script to do as we've described might be:

#!/usr/bin/perl
my (@disks, @stores, @files, @lvs);

$domain = $ARGV[0];

my $destdir = "/var/backup/xen/${domain}/";
system "mkdir -p $destdir";

open(FILE, "/etc/xen/$domain");
while (<FILE>) {
    if (m/^disk/) {
        # keep only the contents of the [ ... ] list on the disk = line
        s/.*\[\s*([^\]]+)\s*\].*/$1/;
        @disks = split(/[,]/);

        # discard elements without a :, since they can't be
        # backing store specifiers
        while ($disks[$n]) {
            $disks[$n] =~ s/['"]//g;
            push(@stores, "$disks[$n]") if ("$disks[$n]" =~ m/:/);
            $n++;
        }
        $n = 0;

        # split on : and take only the last field if the first
        # is a recognized device specifier.
        while ($stores[$n]) {
            @tmp = split(/:/, $stores[$n]);
            if (($tmp[0] =~ m/file/i) || ($tmp[0] =~ m/tap/i)) {
                push(@files, $tmp[$#tmp]);
            }
            elsif ($tmp[0] =~ m/phy/i) {
                push(@lvs, $tmp[$#tmp]);
            }
            $n++;
        }
    }
}
close FILE;

print "xm save $domain $destdir/${domain}.xmsave\n";
system("xm save $domain $destdir/${domain}.xmsave");

foreach (@files) {
    print "copying $_\n";
    system("cp $_ ${destdir}");
}
foreach $lv (@lvs) {
    system("lvcreate --size 1024m --snapshot --name ${lv}_snap $lv");
}
system("xm restore $destdir/${domain}.xmsave && gzip $destdir/${domain}.xmsave");

foreach $lv (@lvs) {
    # flatten the LV path into a filename for the compressed image
    $lvfile = $lv;
    $lvfile =~ s/\//_/g;
    print "backing up $lv\n";
    system("dd if=${lv}_snap | gzip -c > $destdir/${lvfile}.gz");
    system("lvremove ${lv}_snap");
}

Save it as, say, /usr/sbin/backup_domains.sh and tell cron to execute the script at appropriate intervals.
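
For instance, a root crontab entry in the dom0 along these lines (the schedule and domain names are only examples) would back up two domains nightly, staggered so the dd passes don't overlap:

30 3 * * * /usr/sbin/backup_domains.sh lsc
30 4 * * * /usr/sbin/backup_domains.sh horatio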

This script works by saving each domain, copying file-based storage, and snapshotting LVs. When that's accomplished, it restores the domain, backs up the save file, and backs up the snapshots via dd.

Note that users will see a brief hiccup in service while the domain is paused and snapshotted. We measured downtime of less than three minutes to get a consistent backup of a domain with a gigabyte of RAM -- well within acceptable parameters for most applications. However, doing a bit-for-bit copy of an entire disk may also degrade performance somewhat. We suggest doing backups at off-peak hours.

To view other scripts in use at prgmr.com, go to http://book.xen.prgmr.com/.

Remote Access to the DomU

The story on normal access for VPS users is deceptively simple: The Xen VM is exactly like a normal machine at the colocation facility. They can SSH into it (or, if you're providing Windows, rdesktop). However, when problems come up, the user is going to need some way of accessing the machine at a lower level, as if they were sitting at their VPS's console.

For that, we provide a console server that they can SSH into. The easiest thing to do is to use the dom0 as their console server and sharply limit their accounts.

Note: Analogously, we feel that any colocated machine should have a serial console attached to it. We discuss our reasoning and the specifics of using Xen with a serial console in Chapter 14.

An Emulated Serial Console

Xen already provides basic serial console functionality via xm. You can access a guest's console by typing xm console <domain> within the dom0. Issue commands, then type CTRL-] to exit from the serial console when you're done.

The problem with this approach is that xm has to run from the dom0 with effective UID 0. While this is reasonable enough in an environment with trusted domU administrators, it's not a great idea when you're giving an account to anyone with $5. Dealing with untrusted domU admins, as in a VPS hosting situation, requires some additional work to limit access using ssh and sudo.

First, configure sudo. Edit /etc/sudoers and append, for each user:

<username> ALL=NOPASSWD:/usr/sbin/xm console <vm name>

Next, for each user, we create a ~/.ssh/authorized_keys file like this:

no-agent-forwarding,no-X11-forwarding,no-port-forwarding,command="sudo xm
console <vm name>" ssh-rsa <key> [comment]

This line allows the user to log in with his key. Once he's logged in, sshd connects to the named domain console and automatically presents it to him, thus keeping domU administrators out of the dom0. Also, note the options that start with no. They're important. We're not in the business of providing shell accounts. This is purely a console server -- we want people to use their domUs rather than the dom0 for standard SSH stuff. These settings will allow users to access their domains' consoles via SSH in a way that keeps their access to the dom0 at a minimum.

A Menu for the Users

Of course, letting each user access his console is really just the beginning. By changing the command field in authorized_keys to a custom script, we can provide a menu with a startling array of features!

Here's a sample script that we call xencontrol. Put it somewhere in the filesystem -- say /usr/bin/xencontrol -- and then set the line in authorized_keys to call xencontrol rather than xm console.

#!/bin/bash
DOM="$1"
cat << EOF
`sudo /usr/sbin/xm list $DOM`

Options for $DOM
1. console
2. create/start
3. shutdown
4. destroy/hard shutdown
5. reboot
6. exit
EOF
printf "> "
read X
case "$X" in
*1*) sudo /usr/sbin/xm console "$DOM" ;;
*2*) sudo /usr/sbin/xm create -c "$DOM" ;;
*3*) sudo /usr/sbin/xm shutdown "$DOM" ;;
*4*) sudo /usr/sbin/xm destroy "$DOM" ;;
*5*) sudo /usr/sbin/xm reboot "$DOM" ;;
esac

When the user logs in via SSH, the SSH daemon runs this script in place of the user's login shell (which we recommend setting to /bin/false or its equivalent on your platform). The script then echoes some status information, an informative message, and a list of options. When the user enters a number, it runs the appropriate command (which we've allowed the user to run by configuring sudo).
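
Concretely, the only change from the earlier authorized_keys line is the command= value, which now points at the wrapper script and passes the domain name as its argument (the domain name and key are placeholders, as before):

no-agent-forwarding,no-X11-forwarding,no-port-forwarding,command="/usr/bin/xencontrol <vm name>" ssh-rsa <key> [comment]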

PyGRUB, a Bootloader for DomUs

Up until now, the configurations that we've described, by and large, have specified the domU's boot configuration in the config file, using the kernel, ramdisk, and extra lines. However, there is an alternative method, which specifies a bootloader line in the config file and in turn uses that to load a kernel from the domU's filesystem.

The bootloader most commonly used is PyGRUB, or Python GRUB. The best way to explain PyGRUB is probably to step back and examine the program it's based on, GRUB, the GRand Unified Bootloader. GRUB itself is a traditional bootloader -- a program that sits in a location on the hard drive where the BIOS can load and execute it, which then itself loads and executes a kernel.

PyGRUB, therefore, is like GRUB for a domU. The Xen domain builder usually loads an OS kernel directly from the dom0 filesystem when the virtual machine is started (therefore acting like a bootloader itself). Instead, it can load PyGRUB, which then acts as a bootloader and loads the kernel from the domU filesystem.

PyGRUB is useful because it allows a more perfect separation between the administrative duties of the dom0 and the domU. When virtualizing the data center, you want to hand off virtual hardware to the customer. PyGRUB more effectively virtualizes the hardware. In particular, this means the customer can change his own kernel without the intervention of the dom0 administrator.

Note: PyGRUB has been mentioned as a possible security risk because it reads an untrusted filesystem directly from the dom0. PV-GRUB (see "PV-GRUB: A Safer Alternative to PyGRUB?" below), which loads a trusted paravirtualized kernel from the dom0 and then uses that to load and jump to the domU kernel, should improve this situation.

Note: PV-GRUB: A SAFER ALTERNATIVE TO PYGRUB?
PV-GRUB is an excellent reason to upgrade to Xen 3.3. The problem with PyGRUB is that while it's a good simulation of a bootloader, it has to mount the domU partition in the dom0, and it interacts with the domU filesystem. This has led to at least one remote-execution exploit. PV-GRUB avoids the problem by loading an executable that is, quite literally, a paravirtualized version of the GRUB bootloader, which then runs entirely within the domU.

This also has some other advantages. You can actually load the PV-GRUB binary from within the domU, meaning that you can load your first menu.lst from a read-only partition and have it fall through to a user partition, which then means that unlike my PyGRUB setup, users can never mess up their menu.lst to the point where they can't get into their rescue image.

Note that Xen creates a domain in either 32- or 64-bit mode, and it can't switch later on. This means that a 64-bit PV-GRUB can't load 32-bit Linux kernels, and vice versa.

Our PV-GRUB setup at prgmr.com starts with a normal xm config file, but with no bootloader and a kernel= line that points to PV-GRUB, instead of the domU kernel.

kernel = "/usr/lib/xen/boot/pv-grub-x86_64.gz"

extra = "(hd0,0)/boot/grub/menu.lst"

disk = ['phy:/dev/denmark/horatio,xvda,w','phy:/dev/denmark/rescue,xvde,r']

Note that we call the architecture-specific binary for PV-GRUB. The 32-bit (PAE) version is pv-grub-x86_32.

This is enough to load a regular menu.lst, but what about this indestructible rescue image of which I spoke? Here's how we do it on the new prgmr.com Xen 3.3 servers. In the xm config file:

kernel = "/usr/lib/xen/boot/pv-grub-x86_64.gz"
extra = "(hd1,0)/boot/grub/menu.lst"
disk = ['phy:/dev/denmark/horatio,xvda,w','phy:/dev/denmark/rescue,xvde,r']

Then, in /boot/grub/menu.lst on the rescue disk:

default=0
timeout=5

title Xen domain boot
        root (hd1)
        kernel /boot/pv-grub-x86_64.gz (hd0,0)/boot/grub/menu.lst

title CentOS-rescue (2.6.18-53.1.14.el5xen)
        root (hd1)
        kernel /boot/vmlinuz-2.6.18-53.1.14.el5xen ro root=LABEL=RESCUE
        initrd /boot/initrd-2.6.18-53.1.14.el5xen.img

title CentOS installer
        root (hd1)
        kernel /boot/centos-5.1-installer-vmlinuz
        initrd /boot/centos-5.1-installer-initrd.img

title NetBSD installer
        root (hd1)
        kernel /boot/netbsd-INSTALL_XEN3_DOMU.gz

The first entry is the normal boot, with 64-bit PV-GRUB. The rest are various types of rescue and install boots. Note that we specify (hd1) for the rescue entries; in this case, the second disk is the rescue disk.

The normal boot loads PV-GRUB and the user's /boot/grub/menu.lst from (hd0,0). Our default user-editable menu.lst looks like this:

default=0
timeout=5
title CentOS (2.6.18-92.1.6.el5xen)
        root (hd0,0)
        kernel /boot/vmlinuz-2.6.18-92.1.6.el5xen console=xvc0 root=LABEL=PRGMRDISK1 ro
        initrd /boot/initrd-2.6.18-92.1.6.el5xen.img

PV-GRUB only runs on Xen 3.3 and above, and it seems that Red Hat has no plans to backport PV-GRUB to the version of Xen that is used by RHEL 5.x.

Making PyGRUB Work

The domain's filesystem will need to include a /boot directory with the appropriate files, just like a regular GRUB setup. We usually make a separate block device for /boot, which we present to the domU as the first disk entry in its config file.

To try PyGRUB, add a bootloader= line to the domU config file:

bootloader = "/usr/bin/pygrub"
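
With that line in place, starting the domain with a console attached should drop you into PyGRUB's boot menu before the kernel loads (the domain name here is just an example):

# xm create -c lsc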

Of course, this being Xen, it may not be as simple as that. If you're using Debian, make sure that you have libgrub, e2fslibs-dev, and reiserfslibs-dev installed. (Red Hat Enterprise Linux and related distros use PyGRUB with their default Xen setup, and they include the necessary libraries with the Xen packages.)

Even with these libraries installed, it may fail to work without some manual intervention. Older versions of PyGRUB expect the virtual disk to have a partition table rather than a raw filesystem. If you have trouble, this may be the culprit.

With modern versions of PyGRUB, it is unnecessary to have a partition table on the domU's virtual disk.

Self-Support with PyGRUB

At prgmr.com, we give domU administrators the ability to repair and customize their own systems, which also saves us a lot of effort installing and supporting different distros. To accomplish this, we use PyGRUB and see to it that every customer has a bootable read-only rescue image they can boot into if their OS install goes awry. The domain config file for a customer who doesn't want us to do mirroring looks something like the following.

bootloader = "/usr/bin/pygrub"

memory = 512
name = "lsc"
vif = [ 'vifname=lsc,ip=38.99.2.47,mac=aa:00:00:50:20:2f,bridge=xenbr0' ] 

disk = [
                            'phy:/dev/verona/lsc_boot,sda,w', 
                            'phy:/dev/verona_left/lsc,sdb,w', 
                            'phy:/dev/verona_right/lsc,sdc,w', 
                            'file://var/images/centos_ro_rescue.img,sdd,r'
] 

Note that we're now exporting four disks to the virtual host: a /boot partition on virtual sda, reserved for PyGRUB; two disks for user data, sdb and sdc; and a read-only CentOS install as sdd.

A sufficiently technical user, with this setup and console access, needs almost no help from the dom0 administrator. He or she can change the operating system, boot a custom kernel, set up a software RAID, and boot the CentOS install to fix his setup if anything goes wrong.

Setting Up the DomU for PyGRUB

The only other important bit to make this work is a valid /grub/menu.lst, which looks remarkably like the menu.lst in a regular Linux install. Our default looks like this and is stored on the disk exported as sda:

default=0
timeout=15

title centos
              root (hd0,0) 
              kernel /boot/vmlinuz-2.6.18-53.1.6.el5xen console=xvc0 root=/dev/sdb ro
              initrd /boot/initrd-2.6.18-53.1.6.el5xen.XenU.img

title generic kernels
              root (hd0,0) 
              kernel /boot/vmlinuz-2.6-xen root=/dev/sdb
              module /boot/initrd-2.6-xen

title rescue-disk
              root (hd0,0) 
              kernel /boot/vmlinuz-2.6.18-53.1.6.el5xen console=xvc0 root=LABEL=RESCUE ro
              initrd /boot/initrd-2.6.18-53.1.6.el5xen.XenU.img

Note: /boot/grub/menu.lst is frequently symlinked to either /boot/grub/grub.conf or /etc/grub.conf. /boot/grub/menu.lst is still the file that matters.

As with native Linux, if you use a separate partition for /boot, you'll need to either make a symlink named boot at the top of that partition that points back to ., or write the kernel paths relative to that partition (that is, /vmlinuz-... rather than /boot/vmlinuz-...).
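
For example, with whatever device holds the domU's /boot filesystem mounted in the dom0 (the device is a placeholder, in the document's usual angle-bracket style), the symlink trick looks like this; afterward, (hd0,0)/boot/vmlinuz-... and (hd0,0)/vmlinuz-... refer to the same file:

# mount <boot device> /mnt
# ln -s . /mnt/boot
# umount /mnt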

Here, the first and default entry is the CentOS distro kernel. The second entry is a generic Xen kernel, and the third choice is a read-only rescue image. Just like with native Linux, you can also specify devices by label rather than disk number.

Note: Working with partitions on virtual disks
In a standard configuration, partition 1 may be /boot, with partition 2 as /. In that case, partition 1 would have the configuration files and kernels in the same format as for normal GRUB.

It's straightforward to create these partitions on an LVM device using fdisk. Doing so for a file is a bit harder. First, attach the file to a loop, using losetup:

# losetup /dev/loop1 claudius.img

Then create two partitions in the usual way, using your favorite partition editor:

# fdisk /dev/loop1

Then, whether you're using an LVM device or loop file, use kpartx to create device nodes from the partition table in that device:

# kpartx -av /dev/loop1

Device nodes will be created under /dev/mapper, named for the parent device plus a partition number -- here, loop1p1 and loop1p2. Make a filesystem of your preferred type on the new partitions:

# mke2fs /dev/mapper/loop1p1
# mke2fs -j /dev/mapper/loop1p2

# mount /dev/mapper/loop1p2 /mnt
# mount /dev/mapper/loop1p1 /mnt/boot

Copy your filesystem image into /mnt, make sure valid GRUB support files are in /mnt/boot (just like a regular GRUB setup), and you are done.
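
A minimal sketch of that final step, assuming a pristine image tree unpacked under /opt/xen/images/centos-pristine (a placeholder path, following the conventions mentioned at the start of this chapter), and cleaning up the loop device afterward:

# rsync -a /opt/xen/images/centos-pristine/ /mnt/
# ls /mnt/boot/grub/menu.lst
# umount /mnt/boot /mnt
# kpartx -d /dev/loop1
# losetup -d /dev/loop1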

Wrap-Up

This chapter discussed things that we've learned from our years of relying on Xen. Mostly, that relates to how to partition and allocate resources between independent, uncooperative virtual machines, with a particular slant toward VPS hosting. We've described why you might host VPSs on Xen; specific allocation issues for CPU, disk, memory, and network access; backup methods; and letting customers perform self-service with scripts and PyGRUB.

Note that there's some overlap between this chapter and some of the others. For example, we mention a bit about network configuration, but we go into far more detail on networking in Chapter 5, Networking. We describe xm save in the context of backups, but we talk a good deal more about it and how it relates to migration in Chapter 9. Xen hosting's been a lot of fun. It hasn't made us rich, but it's presented a bunch of challenges and given us a chance to do some neat stuff.



Printed with permission from No Starch Press Inc . Copyright 2009. The Book of Xen: A Practical Guide for the System Administrator by Chris Takemura and Luke S. Crawford. For more information about this title and other similar books, please visit No Starch Press Inc.


This was first published in February 2010
