You probably thought we had abandoned the 2.6.27 kernel branch. Well, we thought so ourselves (although it had not yet been officially announced). Then, all of a sudden, , rebasing to the latest upstream kernel (2.6.27.57), and fixing .
The thing is, this kernel is named after Ilya Repin, a leading Russian painter and sculptor of the Peredvizhniki artistic school. One of his best paintings is called «Unexpected Return», and I happened to enjoy the original in the Tretyakov Gallery here in Moscow a couple of weeks ago. So here it is: the unexpected return of the 2.6.27 kernel. It took Ilya 4 years to finish the painting; it took Pavel 6 months to release the fix. Better late than never.
I have added vswap configuration samples to vzctl git. Basically, you set physpages and swappages and leave every other beancounter at unlimited. For example, this is what ve-vswap-256m.conf-sample looks like:
# UBC parameters (in form of barrier:limit)
PHYSPAGES="0:256M"
SWAPPAGES="0:512M"
KMEMSIZE="unlimited"
LOCKEDPAGES="unlimited"
PRIVVMPAGES="unlimited"
SHMPAGES="unlimited"
NUMPROC="unlimited"
VMGUARPAGES="unlimited"
OOMGUARPAGES="unlimited"
NUMTCPSOCK="unlimited"
NUMFLOCK="unlimited"
NUMPTY="unlimited"
NUMSIGINFO="unlimited"
TCPSNDBUF="unlimited"
TCPRCVBUF="unlimited"
OTHERSOCKBUF="unlimited"
DGRAMRCVBUF="unlimited"
NUMOTHERSOCK="unlimited"
DCACHESIZE="unlimited"
NUMFILE="unlimited"
NUMIPTENT="unlimited"
# Disk quota parameters (in form of softlimit:hardlimit)
DISKSPACE="1G"
DISKINODES="200000"
QUOTATIME="0"
# CPU fair scheduler parameter
CPUUNITS="1000"
As you can see, physpages (i.e. RAM size) is set to 256 megabytes and swappages (i.e. swap size) is set to 512 megabytes, while all the other beancounters are unlimited. Wow, it's never been easier to configure your containers!
Now we can put this to use with a RHEL6-based kernel. This is what we see from inside the container:
[root@localhost ~]# vzctl enter 103
entered into CT 103
[root@localhost /]# free
             total       used       free     shared    buffers     cached
Mem:        262144      23936     238208          0          0      10968
-/+ buffers/cache:      12968     249176
Swap:       524288          0     524288
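The totals reported by free follow directly from the limits in the sample config, since free prints kilobytes by default. Here is a minimal sketch of that arithmetic. The size_to_kb helper is hypothetical (it is not part of vzctl) and only handles the K, M and G suffixes:

```shell
#!/bin/sh
# Hypothetical helper, not part of vzctl: convert a size such as "256M"
# into kilobytes, the unit free(1) prints by default.
size_to_kb() {
    value=${1%[KMG]}      # numeric part, e.g. "256"
    suffix=${1#"$value"}  # trailing unit letter, e.g. "M"
    case "$suffix" in
        K) echo "$value" ;;
        M) echo $((value * 1024)) ;;
        G) echo $((value * 1024 * 1024)) ;;
        *) echo "unsupported size: $1" >&2; return 1 ;;
    esac
}

size_to_kb 256M   # limit part of PHYSPAGES="0:256M"
size_to_kb 512M   # limit part of SWAPPAGES="0:512M"
```

The two results are the Mem and Swap totals shown in the free output above.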
Hard CPU limit (the ability to specify that you don't want a container to use more than X per cent of CPU no matter what) is back in the latest RHEL6-based kernel, .
The feature was only available in the stable (i.e. RHEL4- and RHEL5-based) kernels and was missing from all of our development kernels from 2.6.20 to 2.6.32. So although it was always there in the stable branches, it feels like it's back.
To use the CPU limit feature, set the limit using vzctl set $CTID --cpulimit X, where X is a percentage of one single CPU. For example, if you have a single 2 GHz CPU and want container 123 to use no more than 1 GHz, use vzctl set 123 --cpulimit 50. If you have a 2 GHz quad-core system and want the container to use no more than 4 GHz, use vzctl set 123 --cpulimit 200. Well, in the second case it might be better to just use --cpus 2. Anyway, see the vzctl man page.
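The --cpulimit value in those two examples is simply the CPU speed you want, divided by the speed of one core and expressed in per cent. A rough sketch of that calculation follows. The cpulimit_percent helper is made up for illustration, takes both speeds in MHz, and uses integer arithmetic:

```shell
#!/bin/sh
# Hypothetical helper: cpulimit_percent TARGET_MHZ CORE_MHZ
# Prints the target speed as a percentage of one core.
cpulimit_percent() {
    echo $(( $1 * 100 / $2 ))
}

cpulimit_percent 1000 2000   # 1 GHz on a 2 GHz core
cpulimit_percent 4000 2000   # 4 GHz on 2 GHz cores

# Possible use on a host with vzctl installed (not run here):
#   vzctl set 123 --cpulimit "$(cpulimit_percent 1000 2000)"
```

The two calls print 50 and 200, matching the values used above.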
We have just released . It is shaping up pretty well: as you can see from the changelog, it's not just bug fixes but also performance improvements. If you haven't tried it yet, I suggest you do so today! Do not postpone this until 2011; after all, this is what will become the next stable OpenVZ kernel.
A RHEL6 kernel needs an appropriate (i.e. recent) Linux distribution. If you don't want the latest Fedora releases, can't afford RHEL6, and are tired of waiting for CentOS 6, I suggest you go with Scientific Linux 6 (SL6). This is yet another RHEL6 clone, developed and used by CERN, Fermilab and other similar institutions.
While SL6 is still in its infancy ( and plan to release beta 1 on Jan 7, 2011), it is worth trying since it's based on a very stable set of sources from RHEL6. Repositories and stuff are available from
One of the most interesting topics I've seen so far during my training this week is SystemTap. In short, SystemTap allows you to dig out a bunch of details about your running system relatively easily. It takes scripts, converts them to C, builds a kernel module, and then runs the code from your script.
HOLD IT: The steps below are definitely not meant for those who are new to Linux. Utilizing SystemTap on a production system is a bad idea — it can chew up significant resources while it runs and it can also cause a running system to kernel panic if you're not careful with the packages you install.
These instructions will work well with Fedora, CentOS and Red Hat Enterprise Linux. Luckily, the SystemTap folks put together some instructions for and as well.
Before you can start working with SystemTap on your RPM-based distribution, you'll need to get some prerequisites together:
WHOA THERE: Ensure that the kernel-devel and kernel-debuginfo* packages that you install via yum match up with your running kernel. If there's a newer kernel available from your yum repo, yum will pull that one. If it's been a while since you updated, you'll either need to upgrade your current kernel to the latest and reboot or you'll need to hunt down the corresponding kernel-devel and kernel-debuginfo* packages from a repository. Installing the wrong package version can lead to kernel panics. Also, bear in mind that the debuginfo packages are quite large: almost 200MB in Red Hat/CentOS and almost 300MB in Fedora.
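One way to sanity-check the match before running anything is to compare the installed package versions against the running kernel release. The sketch below is only an illustration: has_matching_package is a hypothetical helper, and it assumes the package's version-release string is either identical to the output of uname -r or is followed by a dot and the architecture. Kernel flavours with other suffixes would need extra handling.

```shell
#!/bin/sh
# Hypothetical helper: has_matching_package RUNNING_RELEASE
# Reads installed "version-release" strings on stdin, one per line, and
# succeeds if one equals the running release or is followed by ".<arch>".
has_matching_package() {
    running=$1
    while read -r installed; do
        case "$running" in
            "$installed"|"$installed".*) return 0 ;;
        esac
    done
    return 1
}

# Demonstration with sample strings instead of live system data:
printf '%s\n' 2.6.32-71.el6 |
    has_matching_package 2.6.32-71.el6.x86_64 && echo "versions match"

# On a real system the inputs could come from rpm and uname (not run here):
#   rpm -q --qf '%{VERSION}-%{RELEASE}\n' kernel-devel |
#       has_matching_package "$(uname -r)" || echo "kernel-devel mismatch"
```

The same check can be repeated for the kernel-debuginfo package.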
You can't write the script in just any language. SystemTap uses an odd syntax to get things going:
The iotime.stp script dumps out the reads and writes occurring on the system in real time. After starting the script above, I accessed my cacti instance on the server and immediately started seeing some reads as apache began picking up PHP files to parse.
Consider a situation in which you need to decrease interrupts on a Linux machine. This is vital for laptops and systems that need to remain in low power states. Some might suggest powertop
On this particular system, it's pretty obvious that the ethernet interface is causing a lot of interrupts.
If you want more examples, keep hunting around in the systemtap-testsuite package (remember rpm -ql systemtap-testsuite) or review the on SystemTap's site.
Thanks again to Phil Hopkins at Rackspace for giving us a detailed explanation of system profiling during training.
The that I wrote recently will need some adjustments as I've fallen hard for the performance and reliability of DRBD and OCFS2. As a few of my sites were gaining in popularity, I noticed that GlusterFS simply couldn't keep up. High I/O latency and broken replication threw a wrench into my love affair with GlusterFS and I knew there had to be a better option.
Diagram of two web nodes with a replicated filesystem using DRBD & OCFS2
I've shared my configuration with my coworkers and I've received many good questions about it. Let's get to the Q&A:
How does the performance compare to GlusterFS?
On Gluster's best days, the data throughput speeds were quite good, but the latency to retrieve the data was often much too high. Page loads on this site were taking upwards of 3-4 seconds with GlusterFS latency accounting for well over 75% of the delays. For small files, GlusterFS's performance was about 20-25x slower than accessing the disk natively. The performance hit for DRBD and OCFS2 is usually between 1.5-3x for small files and difficult to notice for large file transfers.
Couldn't you keep the data separate and then sync it with rsync?
Everyone knows that rsync can be a resource-consuming monster, and it seems wasteful to call rsync via a cron job to keep my data in sync. There are some periods of the day when the actual data in the web root rarely changes. There are other times when it changes rapidly and I'd end up with nodes out of sync for a few minutes.
To get the just-in-time synchronization that I want, I'd have to run rsync at least once a minute. If the data isn't changing over a long period, rsync would end up crushing the disk and consuming CPU for no reason. DRBD only syncs data when data changes. Also, all reads with DRBD are done locally. This makes it a highly efficient and effective choice for instant synchronization.
Why OCFS2? Isn't that overkill?
When you use DRBD in dual-primary mode, it's functionally equivalent to having a raw storage device (like a SAN) mounted in two places. If you threw an ext4 filesystem onto a LUN on your SAN and then mounted it on two different servers, you'd be in bad shape very quickly. Non-clustered filesystems like ext3 or ext4 can't handle being mounted in more than one environment.
OCFS2 is built primarily to be mounted in more than one place and it comes with its own distributed lock manager (DLM). The configuration files for OCFS2 are extremely simple and you mount it like any other filesystem. It's been part of the mainline Linux kernel since 2.6.16.
What happens when you lose one of the nodes?
The configuration shown above can operate with just one node in an emergency. When the failed node comes back online, DRBD will resync the block device and you can mount the OCFS2 filesystem as you normally would.
You're using an Oracle product? Really?
You've got me there. I'm not a fan of how they treat the open source community with regard to some of their projects, but the OCFS2 filesystem is robust, free, and it meets my needs.