Igor Olemskoi: practical notes on Linux CentOS system administration

Archive for the 'kernel' tag

On vSwap and 042stab04x kernel improvements (reprint)

vSwap

The best feature of the new (RHEL6-based) 042 series of the OpenVZ kernels is definitely vSwap. The short story is, we used to have 22 user beancounter parameters which every seasoned OpenVZ user knows by heart. Each of these parameters is there for a reason, but 22 knobs are a bit too complex to manage for a mere mortal, especially bearing in mind that
  • many of them are interdependent;
  • the sum of all limits should not exceed the resources of a given physical server.
Keeping this configuration optimal (or even consistent) is quite a challenging task even for a senior OpenVZ admin (with the probable exception of an ex-airline pilot). This complexity is the main reason why there are multiple articles and blog entries complaining that OpenVZ is worse than Xen, or that it is not suitable for hosting Java apps. We do have some workarounds to mitigate this complexity.

This is still not the way to go. While we think highly of our users, we do not expect all of them to be ex-airline pilots. To solve the complexity, the number of per-container knobs and handles should be reduced to some decent number, or at least most of these knobs should be optional.

We worked on that for a few years, and the end result is called vSwap (where V is for Vendetta, oh, pardon me, Virtual).

The vSwap concept is as simple as a rectangle. For each container, there are only two required parameters: the memory size (known as physpages) and the swap size (swappages). Almost everyone (not only an admin, but even an advanced end user) knows what RAM and swap are. On a physical server, if there is not enough memory, the system starts to swap out memory pages to disk, then swap in some other pages, which results in severe performance degradation but keeps the system from failing miserably.

It's about the same with vSwap, except that
  • RAM and swap are configured on a per container basis;
  • no I/O is performed until it is really necessary (this is why swap is virtual).
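
To make the two knobs concrete, here is a minimal sketch that turns human-friendly sizes into page counts for physpages and swappages. It assumes 4 KiB pages (true on x86, but check your platform); the function names are made up for illustration and are not part of any OpenVZ tool.

```shell
# Convert a size in MiB to a number of 4 KiB pages.
# Assumption: the page size is 4 KiB.
mib_to_pages() {
    echo $(( $1 * 1024 / 4 ))
}

# Print the two vSwap parameters for a hypothetical container,
# given RAM and swap sizes in MiB.
vswap_params() {
    ram_mib=$1
    swap_mib=$2
    echo "physpages=$(mib_to_pages "$ram_mib")"
    echo "swappages=$(mib_to_pages "$swap_mib")"
}
```

For example, `vswap_params 512 1024` reports the page counts for a container with 512 MiB of RAM and 1 GiB of swap. How the values are then applied (container config file or vzctl options) depends on your vzctl version, so check its man page.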

Some vSwap internals

Now, there are only two knobs per container on the dashboard, namely RAM and swap, and all the complexity is hidden under the hood. I am going to describe just a bit of those undercover mechanics and explain what the "Reworked VSwap kernel memory accounting" line from the 042stab040.1 kernel changelog stands for.

The biggest problem is, RAM for containers is not just RAM. First of all, there is a need to distinguish between
  • the user memory,
  • the kernel memory,
  • the page cache,
  • and the directory entry cache.

The user memory is more or less clear: it is simply the memory that programs allocate for themselves to run. It is relatively easy to account for, and it is relatively simple to limit (but read on).

The kernel memory is a really complex thing. It is the memory that the kernel allocates for itself in order for programs in a particular container to run. This includes a lot of stuff I'd rather not dive into if I want to keep this piece an article and not a tome. Having said that, two particular kernel memory types are worth explaining.

First is the page cache, the kernel mechanism that caches disk contents in memory (that would be unused otherwise) to minimize I/O. When a program reads some data from a disk, that data is read into the page cache first, and when a program writes to a disk, the data goes to the page cache (and is eventually written, or flushed, to disk). In case of repeated disk access (which happens quite often), data is taken from the page cache, not from the real disk, which greatly improves overall system performance, since a disk is much slower than RAM. Now, some of the page cache is used on behalf of a container, and this amount must be charged to "RAM used by this container" (i.e. physpages).

Second is the directory entry cache (dcache for short), yet another sort of cache and another sort of kernel memory. Disk contents form a tree of files and directories, and such a tree is quite tall and wide. In order to read the contents of, say, the /bin/sh file, the kernel has to read the root (/) directory, find the 'bin' entry in it, read the /bin directory, find the 'sh' entry in it, and finally read the file. Although these operations are not very complex, there is a multitude of them, they take time, and they are repeated often for most of the "popular" files. In order to improve performance, the kernel keeps directory entries in memory; this is what the dcache is for. The memory used by the dcache should also be accounted and limited, since otherwise it is easily exploitable (not only by root, but also by an ordinary user, since any user is free to change into directories and read files).
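
The /bin/sh walk described above can be spelled out as a toy script. This is only an illustration of the lookup sequence, not of how the kernel implements it; `walk_path` is a made-up name.

```shell
# Print the directory lookups needed to reach a file, one per line.
# Each "lookup" line is the kind of step a dcache hit would save.
walk_path() {
    path=$1
    dir=/
    # Split the path on "/" into its components.
    old_ifs=$IFS
    IFS=/
    set -- $path
    IFS=$old_ifs
    for name in "$@"; do
        # Skip the empty field produced by the leading "/".
        [ -n "$name" ] || continue
        echo "lookup '$name' in $dir"
        case $dir in
            /) dir=/$name ;;
            *) dir=$dir/$name ;;
        esac
    done
    echo "read $dir"
}
```

Running `walk_path /bin/sh` lists one lookup per path component followed by the final read, which is why deep and frequently used paths benefit so much from cached directory entries.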

Now, the physical memory of a container is the sum of its user memory, the kernel memory, the page cache and the dcache. Technically, dcache is accounted into the kernel memory, then kernel memory is accounted into the physical memory, but it's not overly important.
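
As a back-of-the-envelope illustration of that accounting, the sketch below adds the pieces up the way the paragraph describes: dcache goes into kernel memory, and kernel memory goes into the physical memory total. All numbers are page counts, and the function name and argument order are invented for this example.

```shell
# Sum a container's physical memory usage, in pages.
# Arguments: user pages, kernel pages excluding dcache,
#            dcache pages, page cache pages.
container_physpages() {
    user=$1
    kernel_other=$2
    dcache=$3
    pagecache=$4
    # dcache is accounted as part of kernel memory ...
    kernel=$(( kernel_other + dcache ))
    # ... and kernel memory is accounted into the physical total.
    echo $(( user + kernel + pagecache ))
}
```

With 1000 user pages, 200 other kernel pages, 50 dcache pages and 300 page cache pages, `container_physpages 1000 200 50 300` yields the single figure that is compared against the physpages limit.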

Improvements in the new 042stab04x kernels

Better reclamation and memory balancing

What to do if a container hits its physical memory limit? Free some pages by writing their contents to the abovementioned virtual swap. Well, not quite yet. Remember that there is also a page cache and a dcache, so the kernel can easily discard some of the pages from these caches, which is way cheaper than swapping out.

The process of finding some free memory is known as reclamation. The kernel needs to decide very carefully when to start reclamation, how many pages to reclaim and exactly which ones in every particular situation, and when it is the right time to swap out rather than discard some of the cache contents.

Remember, we have four types of memory (kernel, user, dcache and page cache) and only one knob which limits the sum of all these. It would be easier for the kernel, but not for the user, to have separate limits for each type of memory. But, for the user's convenience and simplicity, the kernel only has one knob for these four parameters, so it needs to balance between those four. One major improvement in the 042stab040 kernel is that such balancing is now performed better.
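
To show why discarding cache comes before swapping, here is a deliberately naive model: take as many pages as possible from the caches and swap out only the remainder. The real kernel balances far more carefully than this; the function and its output format are made up for illustration.

```shell
# Naive reclaim plan, in pages.
# Arguments: pages needed, discardable cache pages (page cache + dcache).
reclaim_plan() {
    need=$1
    cache=$2
    # Discarding cache is cheap, so use it first.
    if [ "$need" -le "$cache" ]; then
        from_cache=$need
    else
        from_cache=$cache
    fi
    # Whatever is still missing has to go to (virtual) swap.
    to_swap=$(( need - from_cache ))
    echo "discard=$from_cache swap=$to_swap"
}
```

In this model a request for 100 pages with 300 cache pages available never touches swap, while a request for 500 pages has to swap out the remaining 200.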

Stricter memory limit

During the lifetime of a container, the kernel might face a situation when it needs more kernel memory, or user memory, or perhaps more dcache entries, and the memory for the container is tight (i.e. close to the limit), so it needs to either reclaim or swap. The problem is that there are some situations when neither reclamation nor swapping is possible, so the kernel can either fail miserably (say, by killing a process) or go beyond the limit and hope that everything will be fine and mommy won't notice. Another big improvement in the 042stab040 kernel is that it reduces the number of such situations; in other words, the new kernel obeys the memory limit more strictly.

Polishing

Finally, the kernel is now in a pretty good shape, so we can afford some polishing, minor optimizations, and fine tuning. Such polishing was performed in a few subsystems, including checkpointing, user beancounters, netfilter, kernel NFS server and VZ disk quota.

Some numbers

In total, there are 53 new patches in 042stab040.1, compared to the previous 039 kernels. On top of that, 042stab042.1 adds another 30. We hope that the end result is improved stability and performance.

14.11.2011

Announcing rhel6-testing kernel branch/repo (reprint)

Instead of having a nice drink in a bar, I spent this Friday night splitting the RHEL6-based OpenVZ kernel branch/repository into two, so now we have 'rhel6' and 'rhel6-testing' branches/repos. Let me explain why.

When we made an initial port of OpenVZ to the RHEL6 kernel and released the first kernel (in October 2010, named 042test001), I created a repository named openvz-kernel-rhel6 (or just rhel6), and this repository was marked as «development, unstable». Then, after almost a year, we announced it as «testing» and then, finally, «stable» (in August 2011, named 042stab035.1).

After that, all the kernels in that repository were supposed to be stable, because they are incremental improvements of the kernel we call stable. In theory they are. In practice, of course, there can always be new bugs (introduced both by us and by the Red Hat folks releasing the kernel updates we rebase to). Thus a kernel update from a repo which is supposed to be stable can do bad things.

Better late than never, I have fixed the situation tonight by basically renaming the «rhel6» repository to «rhel6-testing», and creating a new repository called just «rhel6». For now, I put 042stab037.1 (which is the latest kernel which has passed our internal QA) into rhel6 (aka stable), while all the other kernels, up to and including 042stab039.3, are in the rhel6-testing repo.

Now, very similar to what we do with RHEL5 kernels, all the new fresh-from-the-build-farm kernels will appear in the rhel6-testing repo, at about the same time they go to internal QA. Then, the kernels that receive QA approval will appear in the rhel6 (aka -stable) repo. What it means for you as a user is that you can now choose whether to stay at the bleeding edge and have the latest stuff, or to take a conservative approach and have less frequent and delayed updates, but be more confident about kernel quality and stability.
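
Choosing between the two channels could look roughly like the sketch below. The stable repository id comes from the post (openvz-kernel-rhel6); the testing id openvz-kernel-rhel6-testing is my assumption by analogy, so check the ids in your OpenVZ yum repository setup file before relying on it. `openvz_repo_id` is a made-up helper.

```shell
# Map a channel name to a yum repository id.
# Assumption: the testing repo id is the stable id plus "-testing".
openvz_repo_id() {
    case $1 in
        stable)  echo "openvz-kernel-rhel6" ;;
        testing) echo "openvz-kernel-rhel6-testing" ;;
        *)       echo "unknown channel: $1" >&2; return 1 ;;
    esac
}

# Example (not run here): see which updates the testing channel offers.
#   yum --enablerepo="$(openvz_repo_id testing)" list updates
```

The `--enablerepo` option only affects that single yum invocation, so a conservative setup can leave the testing repository disabled in the repo file and opt in explicitly when needed.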

A few links:
* Stable RHEL6-based OpenVZ kernels
* Testing RHEL6-based OpenVZ kernels
* OpenVZ yum repository setup file
* Official announcement of rhel6-testing

15.10.2011

LinuxCon Europe 2011 in Prague is coming (reprint)

And we are coming to Prague, too! This time, there will be as many as six people and two talks from us, plus we will hold a memory cgroup controller meeting.

The following OpenVZ/Parallels people are coming:

  • James Bottomley, Parallels virtualization CTO
  • Kir Kolyshkin, OpenVZ project manager
  • Pavel Emelyanov, OpenVZ kernel team leader (he's also taking part in Linux Kernel Summit)
  • Glauber Costa, OpenVZ kernel developer
  • Maxim Patlasov, OpenVZ kernel developer
  • Andrey Vagin, OpenVZ kernel developer

Two talks will be presented. Since linuxsymposium.org site is currently down, let me quote talk descriptions here.

1. Container in a file by Maxim Patlasov.

One of the feature differences between hypervisors and containers is the ability to store a virtual machine image in a single file, since most containers exist as a chroot within the host OS rather than as fully independent entities. However, the ability to save and restore state in a machine image file is invaluable in managing virtual machine life cycles in the data centre.

This talk will début a new loopback device which gives all the advantages of virtual machine images by storing the container in a file while preserving the benefits of sharing significant portions with the host OS. We will compare and contrast the technology with the traditional loopback device, and describe some changes to the ext4 filesystem which make it more friendly to new loopback device needs.

This talk will be technical in nature but should be accessible to people interested in cloud, virtualisation and container technologies.

2. OpenVZ and Linux kernel testing by Andrey Vagin.

One of the less appealing but very important parts of software development is testing. This talk tries to summarize our 10+ years of experience in Linux kernel testing (including OpenVZ and Red Hat Enterprise Linux kernels). An overall description of our test system is provided, followed by details on some of the interesting test cases developed. Finally, a few anecdotal cases of bugs found will be presented.

In a sense, the talk is an answer to Andrew Morton's question from 2007: «I'm curious. For the past few months, people@openvz.org have discovered (and fixed) an ongoing stream of obscure but serious and quite long-standing bugs. How are you discovering these bugs?»

The talk is of interest to those concerned about kernel quality, and in general to people doing development and testing.

Finally, there will be a memcg meeting. Since LinuxCon will be right after the Kernel Summit, a number of kernel guys will still be there so anyone interested in cgroups can come. This meeting is a continuation of our recent discussion at Linux Plumbers (see etherpad and presentations).

See you all in Prague in less than a month!

RHEL6 goes stable! (reprint)

Guys, I am very proud to inform you that today we mark RHEL6 kernel branch as stable. Below is a copy-paste from the relevant announce@ post. I personally highly recommend RHEL6-based OpenVZ kernel to everyone — it is a major step forward compared to RHEL5.

In other news, Parallels has just released Virtuozzo Containers for Linux 4.7, bringing the same cool stuff (VSwap et al) to commercial customers. Despite being only a «dot» (or «minor») release, this product incorporates an impressive number of man-hours from the best Parallels engineers.

== Stable: RHEL6 ==

This is to announce that RHEL6-based kernel branch (starting from kernel 042stab035.1) is now marked as stable, and it is now the recommended branch to use.

We are not aware of any major bugs or show-stoppers in this kernel. As always, we still recommend testing any new kernels before rolling them out to production.

New features of the RHEL6-based kernel branch (as compared to the previous stable kernel branch, RHEL5) include better performance, better scalability (especially on high-end SMP systems), and better resource management (notably, vSwap support; see http://wiki.openvz.org/VSwap).

RHEL6 kernels can be downloaded from http://wiki.openvz.org/Download/kernel/rhel6

== Frozen: 2.6.27, 2.6.32 ==

Also, from now we no longer maintain the following kernel branches:

* 2.6.27
* 2.6.32

No more new releases of the above kernels are expected. Existing users (if any) are recommended to switch to other (maintained) branches, such as RHEL6-2.6.32 or RHEL5-2.6.18.

This change does not affect vendor OpenVZ kernels (such as Debian or Ubuntu) — those will still be supported for the lifetime of their distributions via the usual means (i.e. bugzilla.openvz.org).

== Development: none ==

Currently, there are no non-stable kernels in development. Eventually we will port to Linux kernel 3.x, but it might not happen this year. Instead, we are currently focused on bringing more of OpenVZ features to mainstream Linux kernels.

Regards, OpenVZ team.

30.08.2011

Linux Plumbers: Containers and CGroups miniconf (reprint)

We have finally filed a number of proposals for the upcoming Containers and CGroups miniconf to be held during the Linux Plumbers Conference, 7 to 9 September 2011 in Santa Rosa, CA.

From those proposals, one can clearly see what our plans are regarding mainline integration. In a few words: dcache management, memory and CPU cgroup controller improvements, container enter, improved /proc virtualization, checkpoint/restart [mostly] in userspace (about which I have blogged recently), and making vzctl work with mainline kernel containers. Oh, and the interesting loopback-like block device to hold a container filesystem (a.k.a. ploop).

Quite a lot of interesting stuff, what do you think?

Checkpoint/restart (mostly) in user space (reprint)

There is a good article at lwn.net about one of our latest developments.

We have had checkpoint/restart (CPT) and live migration in OpenVZ for ages (well, OK, since 2007 or so), allowing containers to be freely moved between physical servers without any service interruption. It is a great feature which is valued by our users. The problem is that we can't merge it upstream, i.e. into the vanilla kernel.

Various people from our team worked on that, and they all gave up. Then Oren Laadan tried very hard to merge his CPT implementation; unfortunately, it didn't work out very well either. The thing is, checkpointing is complex, and the patch implementing it is very intrusive.

Recently, our kernel team leader Pavel Emelyanov got a new idea of moving most of the checkpointing complexity out of the kernel and into user space, thus minimizing the amount of in-kernel changes needed. In about two weeks he wrote a working prototype. So far the reaction is mostly positive, and he's going to submit a second RFC version for review to lkml.

For more details, read the lwn.net article. After all, while I am sitting next to Pavel, Mr. Corbet's ability to explain complex stuff in simple terms is way better than mine.

Xen 4.1 on Fedora 15 with Linux 3.0 (reprint)

If you haven't noticed already, full Xen dom0 support was added in the Linux 3.0 kernel. This means there's no longer a need to drag patches forward from old kernels and work from special branches and git repositories when building a kernel for dom0.

Something else you might not have noticed is that the Fedora kernel team has quietly slipped Linux 3.0 into Fedora 15's update channels in disguise. Click that link, scroll down, and you'll see «Rebase to 3.0. Version reports as 2.6.40 for compatibility with older userspace.» Although I'm not a fan of calling something what it isn't (2.6.40 doesn't exist on kernel.org), I can understand some of the reasoning behind the choice.

This change makes the Xen installation on Fedora 15 pretty trivial. To get started, update your kernel to the latest if you're not already on Fedora's 2.6.40 kernels:

yum -y upgrade kernel

We need three more packages (quite a few dependencies will roll in with them):

yum -y install xen libvirt python-virtinst

The xen package reels in the hypervisor itself along with libraries and command line tools (like xl and xm). Libvirt gives us easy access to VM management with the virsh command and python-virtinst gives us the handy virt-install command to make OS installations easy.

Once those packages are installed, we need to make some adjustments in your grub configuration. Open /boot/grub/menu.lst in your text editor of choice and add something like this at the bottom:

title Fedora + Xen (2.6.40-4.fc15.x86_64)
        root (hd0,1)
        kernel /boot/xen.gz
        module /boot/vmlinuz-2.6.40-4.fc15.x86_64 ro root=/dev/sda1
        module /boot/initramfs-2.6.40-4.fc15.x86_64.img

Ensure that the root (hd0,1) is applicable to your system (adjust it if it isn't). Also, check the kernel version to ensure it matches your installed kernel and adjust the root= portion to match your root volume. Flip the default line to a value which will boot your new grub entry and ensure the timeout is set to a reasonable number if you need to temporarily switch back to your original grub entry at boot time. (Hey, we all make mistakes.)
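
One way to catch a version typo before rebooting is to check that the files named in the grub entry really exist. This is a small sketch, assuming the vmlinuz-VERSION and initramfs-VERSION.img naming used in the entry above; `kernel_files_present` is a made-up helper name.

```shell
# Succeed only if both the kernel image and its initramfs exist
# for the given version under the given boot directory.
kernel_files_present() {
    boot_dir=$1
    version=$2
    [ -f "$boot_dir/vmlinuz-$version" ] &&
        [ -f "$boot_dir/initramfs-$version.img" ]
}

# Example (not run here):
#   kernel_files_present /boot 2.6.40-4.fc15.x86_64 && echo "entry looks sane"
```

If the check fails, compare the version string in the grub entry with the output of `ls /boot` and fix the entry before rebooting.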

I take one extra precaution and change the UPDATEDEFAULT=yes line to no in /etc/sysconfig/kernel. This ensures that future kernel updates don't trample the entry you've just made. Keep in mind that you'll need to manually update your grub configuration when you do kernel upgrades later.
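
If you prefer not to open an editor, the same change can be scripted. A minimal sketch, assuming the line reads exactly UPDATEDEFAULT=yes; the function name is made up, and a .bak copy of the file is kept next to it.

```shell
# Switch UPDATEDEFAULT from yes to no in the given file
# (defaults to /etc/sysconfig/kernel), keeping a .bak backup.
disable_updatedefault() {
    file=${1:-/etc/sysconfig/kernel}
    sed -i.bak 's/^UPDATEDEFAULT=yes$/UPDATEDEFAULT=no/' "$file"
}
```

Run it as root with no arguments to edit the real file, or pass a path to try it on a copy first.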

Cross your fingers and reboot. If your system doesn't reboot properly, reboot it again and choose your old kernel from the grub menu. Double-check your configuration for fat-fingering and give it another try. If your system boots and pings but you have no output via a monitor, don't fret. There's a patch for the problem which should appear soon in Linux 3.0. The impatient can snag a kernel source RPM, add the patch file, and build a local kernel (or you can download my local build from when I did it).

Log in and verify that you booted into the dom0:

[root@xenbox ~]# xm dmesg | head -n 5
 __  __            _  _    _   _   ____     __      _ ____
 \ \/ /___ _ __   | || |  / | / | |___ \   / _| ___/ | ___|
  \  // _ \ '_ \  | || |_ | | | |__ __) | | |_ / __| |___ \
  /  \  __/ | | | |__   _|| |_| |__/ __/ _|  _| (__| |___) |
 /_/\_\___|_| |_|    |_|(_)_(_)_| |_____(_)_|  \___|_|____/
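
A scriptable alternative to eyeballing the banner is to look at the Xen capabilities file. This sketch assumes xenfs is mounted at /proc/xen, where the control domain reports control_d in /proc/xen/capabilities; `is_dom0` is a made-up helper name.

```shell
# Succeed if the capabilities file marks this system as the
# control domain (dom0). The path can be overridden for testing.
is_dom0() {
    caps=${1:-/proc/xen/capabilities}
    [ -r "$caps" ] && grep -q 'control_d' "$caps"
}

# Example (not run here):
#   is_dom0 && echo "running as dom0"
```

On a plain, non-Xen boot the file is missing, so the function simply returns failure.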

Once you're done with that, make sure libvirtd is running:

/etc/init.d/libvirtd start; chkconfig libvirtd on

Try installing a VM:

virt-install \
  --paravirt \
  --name=testvm \
  --ram=512 \
  --vcpus=4 \
  --file /dev/vmstorage/testvm \
  --graphics vnc,port=5905 --noautoconsole \
  --autostart --noreboot \
  --location=http://mirrors.kernel.org/debian/dists/squeeze/main/installer-amd64/

You should have a VM installation underway pretty quickly and it will be visible via port 5905 on the local host. Enjoy the power and freedom of your brand new type 1 hypervisor.

Xen 4.1 on Fedora 15 with Linux 3.0 is a post from: Major Hayden's Racker Hacker blog.

Thanks for following the blog via the RSS feed. Please don't copy my posts or quote portions of them without attribution.

Keep all old kernels when upgrading via yum (reprint)

Some might call me paranoid, but I get nervous when my package manager automatically removes a kernel. I logged into my Fedora 15 VM this morning and found this:

================================================================================
 Package        Arch           Version                   Repository        Size
================================================================================
Installing:
 kernel         x86_64         2.6.35.13-92.fc14         updates           22 M
Removing:
 kernel         x86_64         2.6.35.11-83.fc14         @updates         104 M
 
Transaction Summary
================================================================================
Install       1 Package(s)
Remove        1 Package(s)

Fedora 15's default behavior is to keep three kernels: the latest one and the two previous versions. However, this behavior may be counter-productive if you compile your own modules, or if you have compatibility issues with subsequent kernel versions.

You can change how yum handles kernel packages with some simple changes to your /etc/yum.conf. The installonly_limit option controls how many old packages are kept:

installonly_limit Number of packages listed in installonlypkgs to keep installed at the same time. Setting to 0 disables this feature. Default is '0'.

I disabled the functionality altogether by setting installonly_limit to 0:

#installonly_limit=3
installonly_limit=0

It's important to keep in mind that you will need to purge these packages from your system yourself now. Kernel packages can occupy a fair amount of disk space, so make a note to go back and clean them up when you no longer need them.
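
When cleanup time comes, a small filter can list the candidates for removal. This is a sketch, not a replacement for rpm's own version comparison: it relies on GNU `sort -V`, which orders typical kernel package names correctly but is not guaranteed to match rpm in every corner case. `old_kernels` is a made-up name.

```shell
# Read kernel package names on stdin and print all but the newest N.
# Argument: how many of the newest packages to keep.
old_kernels() {
    keep=$1
    sort -V | awk -v keep="$keep" '
        { line[NR] = $0 }
        END { for (i = 1; i <= NR - keep; i++) print line[i] }
    '
}

# Example (not run here): list everything except the two newest kernels.
#   rpm -q kernel | old_kernels 2
```

Review the list by hand before passing anything to `yum remove`, and make sure the kernel reported by `uname -r` is not on it.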

Keep all old kernels when upgrading via yum is a post from: Major Hayden's Racker Hacker blog.


Kernel 2.6.27 repin aka "Unexpected return" (reprint)

You probably thought we had abandoned the 2.6.27 kernel branch. Well, we ourselves thought we had (although it was not yet officially announced). Then, all of a sudden, kernel 2.6.27-repin.1 is released, rebasing to the latest upstream kernel (2.6.27.57) and fixing OpenVZ bug #1593.

The thing is, this kernel is named after Ilya Repin, a leading Russian painter and sculptor of the Peredvizhniki artistic school. One of his best paintings is called «Unexpected Return», and I happened to enjoy the original in the Tretyakov Gallery here in Moscow a couple of weeks ago. So here it is: the unexpected return of the 2.6.27 kernel. It took Ilya 4 years to finish the painting; it took Pavel 6 months to release the fix. Better late than never, that is.

Please enjoy: Ilya Repin. Unexpected return. 1884—1888.

26.01.2011