Игорь Олемской — практические заметки по системному администрированию Linux CentOS

Архив тега ‘kernel’

Kernel 2.6.27 repin aka "Unexpected return" (перепечатка)

Комментариев нет

You probably thought we have abandoned 2.6.27 kernel branch. Well, we ourselves thought we did (although it was not yet officially announced). Then, out of a sudden, kernel 2.6.27-repin.1 is released, rebasing to latest upstream kernel (2.6.27.57), and fixing OpenVZ bug #1593.

The thing is, this kernel is called after Ilya Repin, a leading Russian painter and sculptor of the Peredvizhniki artistic school. One of his best paintings is called «Unexpected Return», and I happen to enjoy the original in Tretyakov Gallery here in Moscow a couple of weeks ago. So here it is: the unexpected return of 2.6.27 kernel. It took Ilya 4 years to finish the painting, it took Pavel 6 months to release the fix. Better late than never, that is.

Please enjoy: Ilya Repin. Unexpected return. 1884—1888.

Kernel 2.6.27 repin aka "Unexpected return" (перепечатка)

Комментариев нет

You probably thought we have abandoned 2.6.27 kernel branch. Well, we ourselves thought we did (although it was not yet officially announced). Then, out of a sudden, kernel 2.6.27-repin.1 is released, rebasing to latest upstream kernel (2.6.27.57), and fixing OpenVZ bug #1593.

The thing is, this kernel is called after Ilya Repin, a leading Russian painter and sculptor of the Peredvizhniki artistic school. One of his best paintings is called «Unexpected Return», and I happen to enjoy the original in Tretyakov Gallery here in Moscow a couple of weeks ago. So here it is: the unexpected return of 2.6.27 kernel. It took Ilya 4 years to finish the painting, it took Pavel 6 months to release the fix. Better late than never, that is.

Please enjoy: Ilya Repin. Unexpected return. 1884—1888.

26.01.2011

news from the VSwap front (перепечатка)

Комментариев нет

I have added vswap confguration samples to vzctl git. Basically, you set physpages and swappages and leave every other beancounter at unlimited. For example, this is how ve-vswap-256m-conf.sample looks like:

# UBC parameters (in form of barrier:limit)
PHYSPAGES="0:256M"
SWAPPAGES="0:512M"
KMEMSIZE="unlimited"
LOCKEDPAGES="unlimited"
PRIVVMPAGES="unlimited"
SHMPAGES="unlimited"
NUMPROC="unlimited"
VMGUARPAGES="unlimited"
OOMGUARPAGES="unlimited"
NUMTCPSOCK="unlimited"
NUMFLOCK="unlimited"
NUMPTY="unlimited"
NUMSIGINFO="unlimited"
TCPSNDBUF="unlimited"
TCPRCVBUF="unlimited"
OTHERSOCKBUF="unlimited"
DGRAMRCVBUF="unlimited"
NUMOTHERSOCK="unlimited"
DCACHESIZE="unlimited"
NUMFILE="unlimited"
NUMIPTENT="unlimited"

# Disk quota parameters (in form of softlimit:hardlimit)
DISKSPACE="1G"
DISKINODES="200000"
QUOTATIME="0"
# CPU fair scheduler parameter
CPUUNITS="1000"

As you can see, physpages (ie RAM size) is set to 256 megabytes, while swappages (ie swap size) is set to 512 megabytes, all the other beancounters are unlimited. Wow, it's never been easier to configure your containers!

Now, we can utilize this stuff using RHEL6 based kernel. This is what we see from inside the container:

[root@localhost ~]# vzctl enter 103
entered into CT 103
[root@localhost /]# free
             total       used       free     shared    buffers     cached
Mem:        262144      23936     238208          0          0      10968
-/+ buffers/cache:      12968     249176
Swap:       524288          0     524288
[root@localhost /]# cat /proc/user_beancounters
Version: 2.5
       uid  resource                     held              maxheld              barrier                limit              failcnt
      103:  kmemsize                  4722976              4853726  9223372036854775807  9223372036854775807                    0
            lockedpages                     0                    0  9223372036854775807  9223372036854775807                    0
            privvmpages                  4296                13875  9223372036854775807  9223372036854775807                    0
            shmpages                       31                   31  9223372036854775807  9223372036854775807                    0
            dummy                           0                    0                    0                    0                    0
            numproc                        33                   33  9223372036854775807  9223372036854775807                    0
            physpages                    5984                 5985                    0                65536                    0
            vmguarpages                     0                    0  9223372036854775807  9223372036854775807                    0
            oomguarpages                 2696                 2696  9223372036854775807  9223372036854775807                    0
            numtcpsock                      4                    4  9223372036854775807  9223372036854775807                    0
            numflock                        5                    6  9223372036854775807  9223372036854775807                    0
            numpty                          1                    1  9223372036854775807  9223372036854775807                    0
            numsiginfo                     12                   18  9223372036854775807  9223372036854775807                    0
            tcpsndbuf                   69760                    0  9223372036854775807  9223372036854775807                    0
            tcprcvbuf                   65536                    0  9223372036854775807  9223372036854775807                    0
            othersockbuf                 2312                10768  9223372036854775807  9223372036854775807                    0
            dgramrcvbuf                     0                    0  9223372036854775807  9223372036854775807                    0
            numothersock                   51                   53  9223372036854775807  9223372036854775807                    0
            dcachesize                1172451              1172451  9223372036854775807  9223372036854775807                    0
            numfile                       370                  390  9223372036854775807  9223372036854775807                    0
            dummy                           0                    0                    0                    0                    0
            dummy                           0                    0                    0                    0                    0
            dummy                           0                    0                    0                    0                    0
            numiptent                      14                   14  9223372036854775807  9223372036854775807                    0

[root@localhost /]# cat /proc/meminfo
MemTotal:         262144 kB
MemFree:          238208 kB
Cached:            10968 kB
Active:            16956 kB
Inactive:           1384 kB
Active(anon):       6352 kB
Inactive(anon):     1020 kB
Active(file):      10604 kB
Inactive(file):      364 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:        524288 kB
SwapFree:         524288 kB
Dirty:                 0 kB
AnonPages:          7364 kB
Mapped:             3416 kB
Shmem:               124 kB
Slab:               4012 kB
SReclaimable:       1088 kB
SUnreclaim:         2924 kB

news from the VSwap front (перепечатка)

Комментариев нет

I have added vswap confguration samples to vzctl git. Basically, you set physpages and swappages and leave every other beancounter at unlimited. For example, this is how ve-vswap-256m-conf.sample looks like:

# UBC parameters (in form of barrier:limit)
PHYSPAGES="0:256M"
SWAPPAGES="0:512M"
KMEMSIZE="unlimited"
LOCKEDPAGES="unlimited"
PRIVVMPAGES="unlimited"
SHMPAGES="unlimited"
NUMPROC="unlimited"
VMGUARPAGES="unlimited"
OOMGUARPAGES="unlimited"
NUMTCPSOCK="unlimited"
NUMFLOCK="unlimited"
NUMPTY="unlimited"
NUMSIGINFO="unlimited"
TCPSNDBUF="unlimited"
TCPRCVBUF="unlimited"
OTHERSOCKBUF="unlimited"
DGRAMRCVBUF="unlimited"
NUMOTHERSOCK="unlimited"
DCACHESIZE="unlimited"
NUMFILE="unlimited"
NUMIPTENT="unlimited"

# Disk quota parameters (in form of softlimit:hardlimit)
DISKSPACE="1G"
DISKINODES="200000"
QUOTATIME="0"
# CPU fair scheduler parameter
CPUUNITS="1000"

As you can see, physpages (ie RAM size) is set to 256 megabytes, while swappages (ie swap size) is set to 512 megabytes, all the other beancounters are unlimited. Wow, it's never been easier to configure your containers!

Now, we can utilize this stuff using RHEL6 based kernel. This is what we see from inside the container:

[root@localhost ~]# vzctl enter 103
entered into CT 103
[root@localhost /]# free
             total       used       free     shared    buffers     cached
Mem:        262144      23936     238208          0          0      10968
-/+ buffers/cache:      12968     249176
Swap:       524288          0     524288
[root@localhost /]# cat /proc/user_beancounters
Version: 2.5
       uid  resource                     held              maxheld              barrier                limit              failcnt
      103:  kmemsize                  4722976              4853726  9223372036854775807  9223372036854775807                    0
            lockedpages                     0                    0  9223372036854775807  9223372036854775807                    0
            privvmpages                  4296                13875  9223372036854775807  9223372036854775807                    0
            shmpages                       31                   31  9223372036854775807  9223372036854775807                    0
            dummy                           0                    0                    0                    0                    0
            numproc                        33                   33  9223372036854775807  9223372036854775807                    0
            physpages                    5984                 5985                    0                65536                    0
            vmguarpages                     0                    0  9223372036854775807  9223372036854775807                    0
            oomguarpages                 2696                 2696  9223372036854775807  9223372036854775807                    0
            numtcpsock                      4                    4  9223372036854775807  9223372036854775807                    0
            numflock                        5                    6  9223372036854775807  9223372036854775807                    0
            numpty                          1                    1  9223372036854775807  9223372036854775807                    0
            numsiginfo                     12                   18  9223372036854775807  9223372036854775807                    0
            tcpsndbuf                   69760                    0  9223372036854775807  9223372036854775807                    0
            tcprcvbuf                   65536                    0  9223372036854775807  9223372036854775807                    0
            othersockbuf                 2312                10768  9223372036854775807  9223372036854775807                    0
            dgramrcvbuf                     0                    0  9223372036854775807  9223372036854775807                    0
            numothersock                   51                   53  9223372036854775807  9223372036854775807                    0
            dcachesize                1172451              1172451  9223372036854775807  9223372036854775807                    0
            numfile                       370                  390  9223372036854775807  9223372036854775807                    0
            dummy                           0                    0                    0                    0                    0
            dummy                           0                    0                    0                    0                    0
            dummy                           0                    0                    0                    0                    0
            numiptent                      14                   14  9223372036854775807  9223372036854775807                    0

[root@localhost /]# cat /proc/meminfo
MemTotal:         262144 kB
MemFree:          238208 kB
Cached:            10968 kB
Active:            16956 kB
Inactive:           1384 kB
Active(anon):       6352 kB
Inactive(anon):     1020 kB
Active(file):      10604 kB
Inactive(file):      364 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:        524288 kB
SwapFree:         524288 kB
Dirty:                 0 kB
AnonPages:          7364 kB
Mapped:             3416 kB
Shmem:               124 kB
Slab:               4012 kB
SReclaimable:       1088 kB
SUnreclaim:         2924 kB

25.01.2011

cpulimit is back in RHEL6 based kernel (перепечатка)

Комментариев нет

Hard CPU limit (ability to specify that you don't want this container to use more than X per cent of CPU no matter what) is back in latest RHEL6-based kernel, 042test006.1, which has just been released.

The feature was only available for the stable (i.e RHEL4 and RHEL5-based) kernels, and was missing from all of our development kernels from 2.6.20 to 2.6.32. So while it was always there in stable branches, the feeling is like it's back.

In order to use CPU limit feature, set the limit using vzctl set $CTID --cpulimit X, where X is in per cent of one single CPU. For example, if you have single 2 GHz CPU and want container 123 to use no more than 1 GHz, use vzctl set 123 --cpulimit 50. If you have 2 GHz quad-core system and want to use no more than 4 GHz, use vzctl set 123 --cpulimit 200. Well, in the second case it might be better to just use --cpus 2. Anyways, see vzctl man page.

cpulimit is back in RHEL6 based kernel (перепечатка)

Комментариев нет

Hard CPU limit (ability to specify that you don't want this container to use more than X per cent of CPU no matter what) is back in latest RHEL6-based kernel, 042test006.1, which has just been released.

The feature was only available for the stable (i.e RHEL4 and RHEL5-based) kernels, and was missing from all of our development kernels from 2.6.20 to 2.6.32. So while it was always there in stable branches, the feeling is like it's back.

In order to use CPU limit feature, set the limit using vzctl set $CTID --cpulimit X, where X is in per cent of one single CPU. For example, if you have single 2 GHz CPU and want container 123 to use no more than 1 GHz, use vzctl set 123 --cpulimit 50. If you have 2 GHz quad-core system and want to use no more than 4 GHz, use vzctl set 123 --cpulimit 200. Well, in the second case it might be better to just use --cpus 2. Anyways, see vzctl man page.

20.01.2011

RHEL6-based kernel: try today not next year! (перепечатка)

Комментариев нет

We have just released a new RHEL6-based kernel, 042test005. It is shaping up pretty good — as you can see from the changelog, it's not just bug fixes but also performance improvements. If you haven't tried it yet, I suggest to do it today! Do not postpone this until 2011 — after all, this is what will become the next stable OpenVZ kernel.

RHEL6 kernel needs an appropriate (i.e. recent) Linux distribution. If you don't want latest Fedora releases, can't afford RHEL6, and tired of waiting for CentOS 6, I suggest you go with Scientific Linux 6 (SL6). This is yet another RHEL6 clone developed and used by CERN, Fermilabs and other similar institutions.

While SL6 is still at its infancy (they have recently released alpha 3 and plan to release beta 1 at Jan 7 2011), it it worth trying since it's based on a very stable set of sources from RHEL6. Repositories and stuff are available from http://ftp.scientificlinux.org/linux/scientific/6rolling/

RHEL6-based kernel: try today not next year! (перепечатка)

Комментариев нет

We have just released a new RHEL6-based kernel, 042test005. It is shaping up pretty good — as you can see from the changelog, it's not just bug fixes but also performance improvements. If you haven't tried it yet, I suggest to do it today! Do not postpone this until 2011 — after all, this is what will become the next stable OpenVZ kernel.

RHEL6 kernel needs an appropriate (i.e. recent) Linux distribution. If you don't want latest Fedora releases, can't afford RHEL6, and tired of waiting for CentOS 6, I suggest you go with Scientific Linux 6 (SL6). This is yet another RHEL6 clone developed and used by CERN, Fermilabs and other similar institutions.

While SL6 is still at its infancy (they have recently released alpha 3 and plan to release beta 1 at Jan 7 2011), it it worth trying since it's based on a very stable set of sources from RHEL6. Repositories and stuff are available from http://ftp.scientificlinux.org/linux/scientific/6rolling/

Tap into your Linux system with SystemTap (перепечатка)

Комментариев нет

One of the most interesting topics I've seen so far during my RHCA training at Rackspace this week is SystemTap. In short, SystemTap allows you to dig out a bunch of details about your running system relatively easily. It takes scripts, converts them to C, builds a kernel module, and then runs the code within your script.

HOLD IT: The steps below are definitely not meant for those who are new to Linux. Utilizing SystemTap on a production system is a bad idea — it can chew up significant resources while it runs and it can also cause a running system to kernel panic if you're not careful with the packages you install.

These instructions will work well with Fedora, CentOS and Red Hat Enterprise Linux. Luckily, the SystemTap folks put together some instructions for Debian and Ubuntu as well.

Before you can start working with SystemTap on your RPM-based distribution, you'll need to get some prerequisites together:

yum install gcc systemtap systemtap-runtime systemtap-testsuite kernel-devel
yum --enablerepo=*-debuginfo install kernel-debuginfo kernel-debuginfo-common

WHOA THERE: Ensure that the kernel-devel and kernel-debuginfo* packages that you install via yum match up with your running kernel. If there's a newer kernel available from your yum repo, yum will pull that one. If it's been a while since you updated, you'll either need to upgrade your current kernel to the latest and reboot or you'll need to hunt down the corresponding kernel-devel and kernel-debuginfo* packages from a repository. Installing the wrong package version can lead to kernel panics. Also, bear in mind that the debuginfo packages are quite large: almost 200MB in Red Hat/CentOS and almost 300MB in Fedora.

You can't write the script in just any language. SystemTap uses an odd syntax to get things going:

#! /usr/bin/env stap
probe begin { println("hello world") exit () }

Just run the script with stap:

# stap -v helloworld.stp
Pass 1: parsed user script and 73 library script(s) using 94380virt/21988res/2628shr kb, in 140usr/30sys/167real ms.
Pass 2: analyzed script: 1 probe(s), 1 function(s), 0 embed(s), 0 global(s) using 94776virt/22516res/2692shr kb, in 10usr/0sys/5real ms.
Pass 3: using cached /root/.systemtap/cache/bc/stap_bc368822da380b943d4e845ee15ed047_773.c
Pass 4: using cached /root/.systemtap/cache/bc/stap_bc368822da380b943d4e845ee15ed047_773.ko
Pass 5: starting run.
hello world
Pass 5: run completed in 0usr/20sys/285real ms.

The systemtap-testsuite package gives you a tubload of extremely handy SystemTap scripts. For example:

# cd /usr/share/systemtap/testsuite/systemtap.examples/io/
# stap iotime.stp
15138470 6351 (httpd) access /usr/share/cacti/index.php read: 0 write: 0
15142243 6351 (httpd) access /usr/share/cacti/include/auth.php read: 0 write: 0
15143780 6351 (httpd) access /usr/share/cacti/include/global.php read: 0 write: 0
15144099 6351 (httpd) access /etc/cacti/db.php read: 0 write: 0
15187641 6351 (httpd) access /usr/share/cacti/lib/adodb/adodb.inc.php read: 106486 write: 0
15187664 6351 (httpd) iotime /usr/share/cacti/lib/adodb/adodb.inc.php time: 218
15194965 6351 (httpd) access /usr/share/cacti/lib/adodb/adodb-time.inc.php read: 0 write: 0
15195692 6351 (httpd) access /usr/share/cacti/lib/adodb/adodb-iterator.inc.php read: 0 write: 0
   ... output continues ...

The iotime.stp script dumps out the reads and writes occurring on the system in real time. After starting the script above, I accessed my cacti instance on the server and immediately started seeing some reads as apache began picking up PHP files to parse.

Consider a situation in which you need to decrease interrupts on a Linux machine. This is vital for laptops and systems that need to remain in low power states. Some might suggest powertop for that, but why not give SystemTap a try?

# cd /usr/share/systemtap/testsuite/systemtap.examples/interrupt/
# stap interrupts-by-dev.stp
        ohci_hcd:usb3 :      1
        ohci_hcd:usb4 :      1
            hda_intel :      1
                 eth0 :      2
                 eth0 :      2
                 eth0 :      2
                 eth0 :      2
                 eth0 :      2
                 eth0 :      2

On this particular system, it's pretty obvious that the ethernet interface is causing a lot of interrupts.

If you want more examples, keep hunting around in the systemtap-testsuite package (remember rpm -ql systemtap-testsuite) or review the giant list of examples on SystemTap's site.

Thanks again to Phil Hopkins at Rackspace for giving us a detailed explanation of system profiling during training.

Tap into your Linux system with SystemTap is a post from: Major Hayden's Racker Hacker blog.

c0b6ad7e-f251-11df-b20b-4040336e00ef

Keep web servers in sync with DRBD and OCFS2 (перепечатка)

Комментариев нет

The guide to redundant cloud hosting that I wrote recently will need some adjustments as I've fallen hard for the performance and reliability of DRBD and OCFS2. As a few of my sites were gaining in popularity, I noticed that GlusterFS simply couldn't keep up. High I/O latency and broken replication threw a wrench into my love affair with GlusterFS and I knew there had to be a better option.

DRBD, OCFS2, apache, varnish, and LVS

Diagram of two web nodes with a replicated filesystem using DRBD & OCFS2

I've shared my configuration with my coworkers and I've received many good questions about it. Let's get to the Q&A:

How does the performance compare to GlusterFS?
On Gluster's best days, the data throughput speeds were quite good, but the latency to retrieve the data was often much too high. Page loads on this site were taking upwards of 3-4 seconds with GlusterFS latency accounting for well over 75% of the delays. For small files, GlusterFS's performance was about 20-25x slower than accessing the disk natively. The performance hit for DRBD and OCFS2 is usually between 1.5-3x for small files and difficult to notice for large file transfers.

Couldn't you keep the data separate and then sync it with rsync?
Everyone knows that rsync can be a resource consuming monster and it seems wasteful to call rsync via a cron job to keep my data in sync. There are some periods of the day where the actual data on the web root rarely changes. There are other times where it changes rapidly and I'd end up with nodes out of sync for a few minutes.

To get the just-in-time synchronization that I want, I'd have to run rsync at least once a minute. If the data isn't changing over a long period, rsync would end up crushing the disk and consuming CPU for no reason. DRBD only syncs data when data changes. Also, all reads with DRBD are done locally. This makes is a highly efficient and effective choice for instant synchronization.

Why OCFS2? Isn't that overkill?
When you use DRBD in dual-primary mode, it's functionally equivalent to having a raw storage device (like a SAN) mounted in two places. If you threw an ext4 filesystem onto a LUN on your SAN and then mounted it on two different servers, you'd be in bad shape very quickly. Non-clustered filesystems like ext3 or ext4 can't handle being mounted in more than one environment.

OCFS2 is built primarily to be mounted in more than one place and it comes with its own distributed locking manager (DLM). The configuration files for OCFS2 are extremely simple and you mount it like any other filesystem. It's been part of the mainline Linux kernel since 2.6.19.

What happens when you lose one of the nodes?
The configuration shown above can operate with just one node in an emergency. When the failed node comes back online, DRBD will resync the block device and you can mount the OCFS2 filesystem as you normally would.

You're using an Oracle product? Really?
You've got me there. I'm not a fan of how they treat the open source community with regards to some of their projects, but the OCFS2 filesystem is robust, free, and it meets my needs.

Where's the how-to?
It's coming soon! Stay tuned.

Keep web servers in sync with DRBD and OCFS2 is a post from: Major Hayden's Racker Hacker blog.

c0b6ad7e-f251-11df-b20b-4040336e00ef