Igor Olemskoi: practical notes on Linux CentOS system administration

Tag archive: 'glusterfs'

Xen Summit 2011: My Takeaways (repost)

Quite a few people who couldn't make it to Xen Summit 2011 this year asked me to write a post summarizing my takeaways from the event. I'm not generally one to back down from peer pressure, so read on if you're interested in the discussions at this year's Summit.

The feeling I had at last year's summit is that Xen was on the verge of losing traction in the market. Very few distributions still had Xen support going forward and much of the discussion was around the lack of dom0 support in upstream Linux kernels. Distribution vendors were hesitant to drag patches forward into modern kernels and this made it much more difficult to get Xen working for many people.

This year was quite different. The number of attendees was up, the venue was much better, and there was an obvious buzz of energy in the room. As many of the presenters noted, this excitement stemmed from the upstream dom0 support in Linux 3.0. This inclusion is a huge win and it helps to drive Xen forward since the developers don't have to worry about dragging patches forward. They can focus on improving performance, adding features, and tightening security.

Many of the discussions this year focused on security and performance. Ian Pratt discussed Xen's ability to view memory pages of virtual machines via an API to detect malware running inside the instance. Memory pages could be identified and marked as not executable or applications could be triggered when a VM attempts to touch a particular memory page. Also, the whole VM could be frozen if needed.

There's also a big push to bring code out of the dom0 and push it into utility VMs. Driver domains could manage the network or I/O infrastructure and this would further reduce the amount of privileged code actively running in dom0. There is already very little code required for the Xen hypervisor itself (much, much less than the Linux kernel; I'm looking at you, KVM) and this reduces the attack surface for potential compromises of the hypervisor. Some projects even aim to restart driver domains multiple times per minute to ensure that any malicious code injected into those virtual machines can't exist for long periods.

Pradeep Vincent from Amazon talked about how Amazon uses Xen and the pain points they have with its current architecture. Much of his discussion was around scaling problems (and we see many of the same issues at Rackspace). Higher performance could easily be gained by multi-threaded operations in dom0 when attaching block devices and creating virtual network interfaces. He also saw some areas for performance gains in the pvops I/O code.

Quite a few of the talks centered on the ARM architecture and what Xen is able to do on those systems after Samsung published their port in 2008. HVM is on the way for ARM and it might even show up in Xen 4.2. Some demos of Xen on mobile phones from Samsung were amazing. They showed how an attacker could compromise the web browser on the phone with a keylogger, but that application was running in a VM. Once the user switched back to the phone's main menu, the keylogger couldn't access the keystrokes any longer. After that, a simple close of the browser killed the VM and destroyed the malicious code.

Xen 4.2 should be available in early 2012 and the feature list is staggering. Improvements to libxenlight, pvops performance (even in HVM), and guest memory sharing should be available with the new release. Nested virtualization (run a hypervisor inside a hypervisor) is also coming in Xen 4.2 and I'm sure Xzibit will be a huge fan. This should streamline hypervisor testing, allow for embedded hypervisor options and extend the capabilities of client hypervisors. Remus should be available in 4.2 as well, but it might be marked as experimental. OVMF will be added as a BIOS option for UEFI (along with the standard SeaBIOS) and this should allow for Mac OS X guests. UEFI allows Windows to boot faster since it switches to PV mode sooner and it allows for simpler platform certification for software vendors.

Mike McClurg's presentation on XCP was pretty important to me since Rackspace is a big consumer of XenServer. If you're not familiar with XCP, it's basically open-source XenServer which runs on bleeding edge (and sometimes unstable) components. XCP 1.5 and XenServer 6 should be available in November with Xen 4.1 and Linux 2.6.32. GPU passthrough, up to 1TB RAM, and disaster recovery will be available. Another goal for the XCP team is to work closely with OpenStack via Project Olympus. Mike's vision is to have XCP become the configuration of choice for open source clouds. Project Kronos was also extremely interesting. It's essentially XCP's XenAPI stack running on Debian and Ubuntu. You'd be able to install either OS on a physical server and run XCP's services on it for a fully OSS hypervisor.

Konrad Wilk gave an update on Linux pvops and it appears there is a shift to get Xen working well on a desktop. This includes 3D graphics support, S3/hibernate capabilities and various bug fixes. There's also a push to get PV functionality into HVM and get HVM functionality into PV. Driver/device domains were discussed again in Patrick Colp's talk and he had plenty of graphs showing performance changes when regularly restarting device domains. The performance dips were almost negligible with 10-second restarts and the security gains were significant.

There were several other great presentations on other topics like GlusterFS, OpenStack Nova, and Linpicker (from the NSA!). If these types of things interest you, keep your eyes peeled for Xen Summit 2012 next year. The weather in the Bay Area is well worth the trip. ;)

Xen Summit 2011: My Takeaways is a post from: Major Hayden's Racker Hacker blog.

Thanks for following the blog via the RSS feed. Please don't copy my posts or quote portions of them without attribution.

FUDCon 2011: Day One (repost)

The first day of FUDCon 2011 in Tempe is coming to a close tonight and I'm completely exhausted. As promised, I'll try to summarize the day and cover the talks which I attended.

The day started out with Jared Smith's "State of Fedora" address. The audio has already been posted on the wiki, but the speech was very positive overall. He talked about some of the struggles that have happened in the past and how they'll probably happen again in some form or another. It was pretty inspirational and you could obviously tell that people in the room were energized by it.

After the address, all of the talks were pitched in BarCamp format. It was a very efficient and entertaining way to create a schedule for the conference. Everyone had 15-20 seconds to present their talk and then they had to rush outside to post their topic on the wall. We all had the opportunity to go outside and vote for the talks that sounded interesting. Once the votes were tallied, the schedule was set and the conference was fully underway.

The first talk for me was about Marek Goldmann's BoxGrinder. (Note: If you Google for BoxGrinder, make sure that you enter it as a single word. You'll get some wild unrelated results if you use two words.) In short, BoxGrinder gives you the ability to have a kickstart-ish method for automatically building images for virtual machine environments. It's completely plugin-based, so you can have different platform and delivery plugins depending on where your VM needs to be deployed. For example, you could deploy a VM with BoxGrinder that is in a format for VMware (platform) and is delivered to the target server via SFTP (delivery). The public cloud plugins are only compatible with Amazon's products, but I'm eager to change that during one of the upcoming hackfests.

The Sheepdog talk started up right after lunch and although it was interesting, I think it left most people with quite a few questions when it was over. However, I think people are generally apprehensive when anyone tries to do anything innovative with storage. Losing data due to a bug is a big concern and many of the questions went deeper into data safety than performance and functionality.

Next up was Dave Malcolm's talk about the different implementations of Python. This was definitely an eye-opening talk for my coworker and me. Dave covered CPython, Jython, PyPy and various other implementations and compared their advantages and disadvantages. I'm still pretty new to Python (I'm clutching on to Ruby, PHP and Perl still), but this talk really had me thinking about which implementations are best for a particular environment or task. It was quite a bit of fun to learn about some of the deep underpinnings of Python and how they differ depending on the specific implementation.

Jeff Darcy's talk about CloudFS was very intriguing. I've been a fan of GlusterFS recently, but I eventually moved away due to a lack of enterprise features and degrading performance. Jeff is working to add in encryption and authentication without rewriting the filesystem itself. There are quite a few tricky problems involved in the encryption portion due to partial writes and general security during the handshake process. CloudFS could potentially be a network filesystem which could be shared by multiple tenants with their own individual namespaces and segregated UIDs. This could be a big win for providers as they could offer up large amounts of storage in an organized fashion without too many management headaches.

We wrapped up the day of talks with Chris Lalancette's presentation about Deltacloud. In short, it's a bag of daemons that allow you to manage multiple public or private clouds. Everything from image management to provisioning is included in the project. Questions were raised about whether another application was needed since vendor-specific libraries are abundant and libcloud offers many of the same features in a simpler package.

Tonight's social event was FUDPub at ASU's Memorial Union building. The food and drinks were excellent (thanks to Rackspace!) and it was a great opportunity to relax and talk with other Fedora users and developers. We had the opportunity to meet people from around the world while playing round after round of bowling and billiards. The discussions were extremely valuable, but as I said before, it was quite tiring.

I've compiled the FUDCon photos I've taken into a Flickr photo set.

That's the end of today's summary. I'll try to keep this going tomorrow as well. Thanks for reading!

FUDCon 2011: Day One is a post from: Major Hayden's Racker Hacker blog.

Switching from GlusterFS to DRBD and OCFS2 (repost)

As my uptime reports have shown, and as some of you have reported, my blog's load time has increased steadily over the past few weeks. It turns out that one of my VMs was on a physical machine that had some trouble and I was reaching a point where GlusterFS's replicate functionality couldn't meet my performance needs.

Instead of using GlusterFS as I had before in my redundant cloud hosting guide, I decided to use DRBD in dual-primary mode with OCFS2 as the clustering filesystem on top of it. The performance is quite good so far:

[Figure: Pingdom response time graph for rackerhacker.com]

I switched over the DNS late last night and the response time has fallen from the two to three second range (during times of low load) to right around one second per request. In addition to the reduced load times, I can support higher concurrency without significant performance degradation.

Don't worry — I'll make a detailed post on this topic later along with a guide on how to set it up yourself.

Switching from GlusterFS to DRBD and OCFS2 is a post from: Major Hayden's Racker Hacker blog.

Very unscientific GlusterFS benchmarks (repost)

I've been getting requests for GlusterFS benchmarks from every direction lately and I've been a bit slow on getting them done. You may suspect that you know the cause of the delays, and you're probably correct. ;-)

Quite a few different sites argue that the default GlusterFS performance translator configuration from glusterfs-volgen doesn't allow for good performance. You can find other sites which say you should stick with the defaults that come from the script. I decided to run some simple tests to see which was true in my environment.

Here's the testbed:

  • GlusterFS 3.0.5 running on RHEL 5.4 Xen guests with ext3 filesystems
  • one GlusterFS client and two GlusterFS servers are running in separate Xen guests
  • cluster/replicate translator is being used to keep the servers in sync
  • the instances are served by a gigabit network

It's about time for some pretty graphs, isn't it?

[Figure, left: iozone re-reader benchmark results with default glusterfs translators from glusterfs-volgen]
[Figure, right: iozone re-reader benchmark results with no glusterfs translators]

The test run on the left used default stock client and server volume files as they come from glusterfs-volgen. The test run on the right used a client volume file with no performance translators (the server volume file was untouched). Between each test run, the GlusterFS mount was unmounted and remounted. I repeated this process four times (for a total of five runs) and averaged the data.
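The averaging step can be sketched in shell. This is only a minimal sketch and not the script behind the graphs: the helper name average_runs and the sample numbers in the comment are made up for illustration. It expects one throughput figure (KB/sec) per line, one line per test run.

```shell
# Average one KB/sec figure per line (one line per test run).
# average_runs is a hypothetical helper, not part of iozone or GlusterFS.
average_runs() {
    awk '
        NF { sum += $1; runs++ }          # ignore blank lines
        END {
            if (runs == 0) { print "no data"; exit 1 }
            printf "%.1f\n", sum / runs
        }
    '
}

# Example with five made-up results, one per run:
#   printf "%s\n" 41000 42000 40000 43000 44000 | average_runs
```

Unmounting and remounting between runs, as described above, stays a manual step here; the function only handles the arithmetic.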

You'll have to forgive the color mismatches and the lack of labeling on the legend (that's KB/sec transferred) as I'm far from an Excel expert.

The graphs show that running without any translators at all will drastically hinder read caching in GlusterFS, exactly as I expected. Without any translators, the performance is very even across the board. Since my instances had 256MB of RAM each, the io-cache translator was limited to about 51MB of cache. That's reflected in the graph on the left: look for the vertical red/blue divider between the 32MB and 64MB file sizes. I'll be playing around with that value soon to see how it can improve performance for large and small files.

Keep in mind that this test was very unscientific and your results may vary depending on your configuration. While I hope to have more detailed benchmarks soon, this should help some of the folks who have been asking for something basic and easy to understand.

Very unscientific GlusterFS benchmarks is a post from: Major Hayden's Racker Hacker blog.

One month with GlusterFS in production (repost)

As many of you might have noticed from my previous GlusterFS blog post and my various tweets, I've been working with GlusterFS in production for my personal hosting needs for just over a month. I've also been learning quite a bit from some of the folks in the #gluster channel on Freenode. On a few occasions I've even been able to help out with some configuration problems from other users.

There has been quite a bit of interest in GlusterFS as of late and I've been inundated with questions from coworkers, other system administrators and developers. Most folks want to know about its reliability and performance in demanding production environments. I'll try to do my best to cover the big points in this post.

First off, here's how I'm using it in production: I have two web nodes that keep content in sync for various web sites. They each run a GlusterFS server instance and they also mount their GlusterFS share. I'm using the replicate translator to keep both web nodes in sync with client side replication.

Here are my impressions after a month:

I/O speed is often tied heavily to network throughput
This one may seem obvious, but it's not always true in all environments. If you deal with a lot of small files like I do, a 40 Mbit/sec link between the Xen guests is plenty. Adding extra throughput didn't add any performance to my servers. However, if you wrangle large files on your servers regularly, you may want to consider higher throughput links between your servers. I was able to push just under 900 Mbit/sec by using dd to create a large file within a GlusterFS mount.
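A large-file write test along those lines could look like the sketch below. It is not the exact command I ran: the function names, the default size and the one-second timing resolution are all choices made for illustration, and the target path is whatever file you pick inside your GlusterFS mount.

```shell
# Convert a byte count and a duration in seconds to Mbit/sec.
throughput_mbit() {
    awk -v bytes="$1" -v secs="$2" 'BEGIN {
        if (secs <= 0) { print "duration must be positive"; exit 1 }
        printf "%.1f\n", bytes * 8 / secs / 1000000
    }'
}

# Write size_mb megabytes of zeros to a target file with dd and print the
# rate. Timing uses whole seconds, so use a file large enough to take a while.
dd_write_test() {
    dd_target=$1
    dd_size_mb=${2:-1024}
    dd_start=$(date +%s)
    # conv=fsync asks dd to flush the file before exiting, so buffered data
    # is not left out of the timing.
    dd if=/dev/zero of="$dd_target" bs=1M count="$dd_size_mb" conv=fsync 2>/dev/null || return 1
    dd_end=$(date +%s)
    dd_elapsed=$((dd_end - dd_start))
    [ "$dd_elapsed" -gt 0 ] || dd_elapsed=1   # avoid dividing by zero on tiny files
    throughput_mbit $((dd_size_mb * 1024 * 1024)) "$dd_elapsed"
}

# Example (path is a placeholder):
#   dd_write_test /mnt/glusterfs/ddtest.bin 2048
```

Remember to delete the test file afterwards, since with replicate it takes up space on every server.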

Network and I/O latency are big factors for small file performance
If you have a busy network and the latency creeps up from time to time, you'll find that your small file performance will drop significantly (especially with the replicate translator). Without getting too nerdy (you're welcome to read the technical document on replication), replication is an intensive process. When a file is accessed, the client goes around to each server node to ensure that it not only has a copy of the file being read, but that it has the correct copy. If a server didn't save a copy of a file (due to disk failure or the server being offline when the file was written), it has to be synced across the network from one of the good nodes.

When you write files on replicated servers, the client has to roll through the same process first. Once that's done, it has to lock the file, write to the change log, then do the write operation, drop the change log entries, and then unlock the file. All of those operations must be done on all of the servers. High latency networks will wreak havoc on this process and cause it to take longer than it should.

Even with a fast, low-latency network between your servers, slow disks can still be a problem. If the client is waiting on the server nodes' disks to write data, the read and write performance will suffer. I've tested this in environments with fast networks and very busy RAID arrays. Even if the network was very underutilized, slow disks could cut performance drastically.

Monitoring GlusterFS isn't easy
When the client has communication problems with the server nodes, some weird things can happen. I've seen situations where the client loses connections to the servers (see the next section on reliability) and the client mount simply hangs. In other situations, the client has been knocked offline entirely and the process was missing from the process tree by the time I logged in. Your monitoring will need to ensure that the mount is active and is responding in a timely fashion.

There's a handy script which allows you to monitor GlusterFS mounts via nagios that Ian Rogers put together. Also, you can get some historical data with acrollet's munin-glusterfs plugin.
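If you'd rather roll your own check, the two conditions above (the mount is active, and it responds in time) can be sketched as follows. This is not Ian Rogers's script, just a minimal illustration: the function name, the five second default and the optional mounts-file argument are assumptions of mine, and the exit codes follow the usual nagios convention of 0 for OK and 2 for CRITICAL.

```shell
# Check that a mount point is listed in the mounts table and that a simple
# metadata call on it returns within a time limit.
# check_glusterfs_mount is a hypothetical helper, not part of GlusterFS.
check_glusterfs_mount() {
    cm_mount=$1
    cm_limit=${2:-5}
    cm_table=${3:-/proc/mounts}

    # Field 2 of each line in /proc/mounts is the mount point.
    if ! awk -v mp="$cm_mount" '$2 == mp { found = 1 } END { exit !found }' "$cm_table"; then
        echo "CRITICAL: $cm_mount is not mounted"
        return 2
    fi

    # A hung client blocks on stat, so timeout turns a hang into a failure.
    if ! timeout "$cm_limit" stat "$cm_mount" > /dev/null 2>&1; then
        echo "CRITICAL: $cm_mount did not respond within ${cm_limit}s"
        return 2
    fi

    echo "OK: $cm_mount is mounted and responding"
    return 0
}

# Example: check_glusterfs_mount /mnt/glusterfs 5
```

The mount-table lookup catches the case where the client process has died, and the timed stat catches the case where the mount is still listed but hangs.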

GlusterFS 3.x is pretty reliable
When I first started working with GlusterFS, I was using a version from the 2.x tree. The Fedora package maintainer hadn't updated the package in quite some time, but I figured it should work well enough for my needs. I found that the small file performance was lacking and the nodes often had communication issues when many files were being accessed or written simultaneously. This improved when I built my own RPMs of 3.0.4 (and later 3.0.5) and began using those instead.

I did some failure testing by hard cycling the server and client nodes and found some interesting results. First off, abruptly pulling clients had no effect on the other clients or the server nodes. The connection eventually timed out and the servers logged the timeout as expected.

Abruptly pulling servers led to some mixed results. In the 2.x branch, I saw client hangs and timeouts when I abruptly removed a server. This appears to be mostly corrected in the 3.x branch. If you're using replicate, it's important to keep in mind that the first server volume listed in your client's volume file is the one that will be coordinating the file and directory locking. Should that one fall offline quickly, you'll see a hiccup in performance for a brief moment and the next server will be used for coordinating the locking. When your original server comes back up, the locking coordination will shift back.

Conclusion
I'm really impressed with how much GlusterFS can do with the simplicity of how it operates. Sure, you can get better performance and more features (sometimes) from something like Lustre or GFS2, but the amount of work required to stand up that kind of cluster isn't trivial. GlusterFS really only requires that your kernel have FUSE support (it's been in mainline kernels since 2.6.14).
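A quick way to confirm the FUSE requirement on a given machine is to look for it in the kernel's filesystem list. The sketch below assumes the usual /proc/filesystems layout; the function name and the optional file argument are mine, added so the check can be pointed at a sample file.

```shell
# Return 0 if "fuse" is listed as a supported filesystem.
# has_fuse_support is a hypothetical helper name.
has_fuse_support() {
    fs_list=${1:-/proc/filesystems}
    # The filesystem name is the last field on each line; match it exactly
    # so that fuseblk or fusectl alone do not count.
    awk '$NF == "fuse" { found = 1 } END { exit !found }' "$fs_list"
}

# If FUSE is built as a module, it may not be listed until it is loaded:
#   modprobe fuse && has_fuse_support && echo "FUSE is available"
```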

There are some things that GlusterFS really needs in order to succeed:

  • Documentation — The current documentation is often out of date and confusing. I've even found instances where the documentation contradicts itself. While there are some good technical documents about the design of some translators, they really ought to do some more work there.
  • Statistics gathering — It's very difficult to find out what GlusterFS is doing and where it can be optimized. Profiling your environment to find your bottlenecks is nearly impossible with the 2.x and 3.x branches. It doesn't make it easier when some of the performance translators actually decrease performance.
  • Community involvement — This ties back into the documentation part a little, but it would be nice to see more participation from Gluster employees on IRC and via the mailing lists. They're a little better with mailing list responses than other companies I've seen, but there is still room for improvement.

If you're considering GlusterFS for your servers but you still have more questions, feel free to leave a comment or find me on Freenode (I'm 'rackerhacker').

One month with GlusterFS in production is a post from: Major Hayden's Racker Hacker blog.

GlusterFS on the cheap with Rackspace's Cloud Servers or Slicehost (repost)

High availability is certainly not a new concept, but if there's one thing that frustrates me with high availability VM setups, it's storage. If you don't mind going active-passive, you can set up DRBD, toss your favorite filesystem on it, and you're all set.

If you want to go active-active, or if you want multiple nodes active at the same time, you need to use a clustered filesystem like GFS2, OCFS2 or Lustre. These are certainly good options to consider but they're not trivial to implement. They usually rely on additional systems and scripts to provide reliable fencing and STONITH capabilities.

What about the rest of us who want multiple active VMs with simple replicated storage that doesn't require any additional elaborate systems? This is where GlusterFS really shines. GlusterFS can ride on top of whichever filesystem you prefer, and that's a huge win for those who want a simple solution. However, that means that it has to use FUSE, and that will limit your performance.

Let's get this thing started!

Consider a situation where you want to run a WordPress blog on two VMs with load balancers out front. You'll probably want to use GlusterFS's replicated volume mode (RAID 1-ish) so that the same files are on both nodes all of the time. To get started, build two small Slicehost slices or Rackspace Cloud Servers. I'll be using Fedora 13 in this example, but the instructions for other distributions should be very similar.

First things first — be sure to set a new root password and update all of the packages on the system. This should go without saying, but it's important to remember. We can clear out the default iptables ruleset since we will make a customized set later:

# iptables -F
# /etc/init.d/iptables save
iptables: Saving firewall rules to /etc/sysconfig/iptables:        [  OK  ]

GlusterFS communicates over the network, so we will want to ensure that traffic only moves over the private network between the instances. We will need to add the private IPs and a special hostname for each instance to /etc/hosts on both instances. I'll call mine gluster1 and gluster2:

10.xx.xx.xx gluster1
10.xx.xx.xx gluster2
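If you script your builds, the entries can be added without creating duplicates on a second run. This is a minimal sketch: add_host_entry is a made-up helper name, and the addresses in the usage comment are placeholders for your own private IPs.

```shell
# Append "ip name" to a hosts file unless the name is already present.
# add_host_entry is a hypothetical helper; supply your real private IPs.
add_host_entry() {
    he_ip=$1
    he_name=$2
    he_file=${3:-/etc/hosts}
    # Look for the hostname as a whole field on any non-comment line.
    if awk -v name="$he_name" '
            /^[ \t]*#/ { next }
            { for (i = 2; i <= NF; i++) if ($i == name) found = 1 }
            END { exit !found }' "$he_file" 2>/dev/null; then
        echo "$he_name already present in $he_file"
    else
        printf '%s %s\n' "$he_ip" "$he_name" >> "$he_file"
        echo "added $he_name to $he_file"
    fi
}

# Run on both instances, substituting the real addresses:
#   add_host_entry <private IP of first instance> gluster1
#   add_host_entry <private IP of second instance> gluster2
```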

You're now ready to install the required packages on both instances:

yum install glusterfs-client glusterfs-server glusterfs-common glusterfs-devel

Make the directories for the GlusterFS volumes on each instance:

mkdir -p /export/store1

We're ready to make the configuration files for our storage volumes. Since we want the same files on each instance, we will use the --raid 1 option. This only needs to be run on the first node:

# glusterfs-volgen --name store1 --raid 1 gluster1:/export/store1 gluster2:/export/store1
Generating server volfiles.. for server 'gluster2'
Generating server volfiles.. for server 'gluster1'
Generating client volfiles.. for transport 'tcp'

Once that's done, you'll have four new files:

  • booster.fstab — you won't need this file
  • gluster1-store1-export.vol — server-side configuration file for the first instance
  • gluster2-store1-export.vol — server-side configuration file for the second instance
  • store1-tcp.vol — client side configuration file for GlusterFS clients

Copy the gluster1-store1-export.vol file to /etc/glusterfs/glusterfsd.vol on your first instance. Then, copy gluster2-store1-export.vol to /etc/glusterfs/glusterfsd.vol on your second instance. The store1-tcp.vol should be copied to /etc/glusterfs/glusterfs.vol on both instances.
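The copy steps can be wrapped in a small function so the same commands work on either node. Treat this as a sketch under the file names shown above: install_volfiles is a hypothetical helper, and it assumes the generated files are in the current directory unless told otherwise.

```shell
# Copy the generated volume files into place for one node.
# node is "gluster1" or "gluster2"; src is where glusterfs-volgen wrote its
# output; etc defaults to /etc/glusterfs as used in the steps above.
install_volfiles() {
    iv_node=$1
    iv_src=${2:-.}
    iv_etc=${3:-/etc/glusterfs}
    mkdir -p "$iv_etc" || return 1
    # Server-side file for this node only.
    cp "$iv_src/${iv_node}-store1-export.vol" "$iv_etc/glusterfsd.vol" || return 1
    # Client-side file, identical on both nodes.
    cp "$iv_src/store1-tcp.vol" "$iv_etc/glusterfs.vol" || return 1
}

# On the first instance:
#   install_volfiles gluster1
# For the second instance, copy its two files over first, for example:
#   scp gluster2-store1-export.vol store1-tcp.vol gluster2:
# and then run there:
#   install_volfiles gluster2
```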

At this point, you're ready to start the GlusterFS servers on each instance:

/etc/init.d/glusterfsd start

You can now mount the GlusterFS volume on both instances:

mkdir -p /mnt/glusterfs
glusterfs /mnt/glusterfs/

You should now be able to see the new GlusterFS volume in both instances:

# df -h /mnt/glusterfs
Filesystem            Size  Used Avail Use% Mounted on
/etc/glusterfs/glusterfs.vol
                      9.4G  831M  8.1G  10% /mnt/glusterfs

As a test, you can create a file on your first instance and verify that your second instance can read the data:

[root@gluster1 ~]# echo "We're testing GlusterFS" > /mnt/glusterfs/test.txt
.....
[root@gluster2 ~]# cat /mnt/glusterfs/test.txt
We're testing GlusterFS

If you remove that file on your second instance, it should disappear from your first instance as well.

Obviously, this is a very simple and basic implementation of GlusterFS. You can increase performance by making dedicated VMs just for serving data and you can adjust the default performance options when you mount a GlusterFS volume. Limiting access to the GlusterFS servers is also a good idea.
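One way to limit access is a pair of iptables rules per server: accept the GlusterFS port from the private IPs of your nodes and drop it for everyone else. The sketch below only prints the commands so you can review them before running anything. The function name is made up, the addresses in the usage comment are placeholders, and the port is an assumption on my part: I believe 6996 is the default for this GlusterFS release, but check the port in your own volume files before relying on it.

```shell
# Print (not run) iptables rules that accept the GlusterFS server port from
# the listed source IPs and drop it for everyone else.
# gluster_firewall_rules is a hypothetical helper.
# Usage: gluster_firewall_rules PORT IP [IP...]
gluster_firewall_rules() {
    fw_port=$1
    shift
    for fw_ip in "$@"; do
        echo "iptables -A INPUT -p tcp -s $fw_ip --dport $fw_port -j ACCEPT"
    done
    # Appended last, so the ACCEPT rules above are matched first.
    echo "iptables -A INPUT -p tcp --dport $fw_port -j DROP"
}

# Example (ASSUMPTION: port 6996; replace the placeholders with real IPs):
#   gluster_firewall_rules 6996 <gluster1 private IP> <gluster2 private IP>
# After reviewing the output, pipe it to sh and then save the ruleset:
#   /etc/init.d/iptables save
```

Because each instance in this setup is also a client of its own server, list both private IPs on both instances.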

If you want to read more, I'd recommend reading the GlusterFS Technical FAQ and the GlusterFS User Guide.


Thank you for your e-mails! I'll be expanding on this post later with some sample benchmarks and additional tips/tricks, so please stay tuned.

GlusterFS on the cheap with Rackspace's Cloud Servers or Slicehost is a post from: Major Hayden's Racker Hacker blog.
