Have you ever seen a directory that weighs 1 GB? I mean the size of the directory itself rather than the files in it, i.e. a listing of file names that takes up 1 GB, with around 13 million inodes inside.
At a deletion rate of about 100 files per second and more than 10 million files, you might as well go to bed.
Further digging showed that the directory held PHP session files. On Debian Lenny, the cleanup of old session files is for some reason disabled by default:
hosted-by:~> grep -B5 session.gc_probability /etc/php5/apache2/php.ini
; This is disabled in the Debian packages, due to the strict permissions
; on /var/lib/php5. Instead of setting this here, see the cronjob at
; /etc/cron.d/php5, which uses the session.gc_maxlifetime setting below.
; php scripts using their own session.save_path should make sure garbage
; collection is enabled by setting session.gc_probability
;session.gc_probability = 0
Egregious sloppiness. Most likely, though, the real cause is the non-standard session.save_path that ISP Manager sets: the Debian cron job only cleans /var/lib/php5, so sessions stored elsewhere are never purged.
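A minimal sketch of a cleanup that could be run from cron for a custom session directory. The function name purge_old_sessions is hypothetical, and the sess_ prefix assumes PHP's default file-based session handler; the directory and age are whatever your session.save_path and session.gc_maxlifetime are set to.

```shell
# Hypothetical helper: delete PHP session files older than a given age.
# $1 = session directory (your custom session.save_path)
# $2 = maximum age in minutes
purge_old_sessions() {
    dir=$1
    max_age_min=$2
    # -maxdepth 1     : do not descend into subdirectories
    # -name 'sess_*'  : only touch files that look like PHP session files
    # -mmin +N        : last modified more than N minutes ago
    find "$dir" -maxdepth 1 -type f -name 'sess_*' -mmin "+$max_age_min" -delete
}
```

Letting find delete the files itself avoids building a 13-million-entry argument list, which is where a plain rm with a wildcard would fail.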
As promised in one of my earlier posts about dual-primary DRBD and OCFS2, I've compiled a step-by-step guide for Fedora. These instructions should be somewhat close to what you would use on CentOS or Red Hat Enterprise Linux. However, CentOS and Red Hat don't provide some of the packages needed, so you will need to use third-party software repositories.
In this guide, I'll be using two Fedora 14 instances with separate public and private networks. The instances are called server1 and server2 to make things easier to follow.
NOTE: All of the instructions below should be done on both servers unless otherwise specified.
First, we need to set up DRBD with two primary nodes. I'll be using loop files for this setup since I don't have access to raw partitions.
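A minimal sketch of preparing a loop file as the backing store. The helper name make_backing_file, the image path, the size, and the loop device number are all assumptions for illustration.

```shell
# Hypothetical helper: create a zero-filled backing file for a loop device.
# $1 = path of the image file
# $2 = size in MiB
make_backing_file() {
    dd if=/dev/zero of="$1" bs=1M count="$2" 2>/dev/null
}

# Example (run as root; path, size, and loop device are assumptions):
#   make_backing_file /drbd-loop.img 1024
#   losetup /dev/loop0 /drbd-loop.img
```

The loop device (here /dev/loop0) is then what the DRBD resource would use as its disk.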
The net section is telling DRBD to do the following:
allow-two-primaries — Generally, DRBD has a primary and a secondary node. In this case, we will allow both nodes to have the filesystem mounted at the same time. Do this only with a clustered filesystem. If you do this with a non-clustered filesystem like ext2/ext3/ext4 or reiserfs, you will have data corruption. Seriously!
after-sb-0pri discard-zero-changes — DRBD detected a split-brain scenario, but none of the nodes think they're a primary. DRBD will take the newest modifications and apply them to the node that didn't have any changes.
after-sb-1pri discard-secondary — DRBD detected a split-brain scenario, but one node is the primary and the other is the secondary. In this case, DRBD will decide that the secondary node is the victim and it will sync data from the primary to the secondary automatically.
after-sb-2pri disconnect — DRBD detected a split-brain scenario, but it can't figure out which node has the right data. It tries to protect the consistency of both nodes by disconnecting the DRBD volume entirely. You'll have to tell DRBD which node has the valid data in order to reconnect the volume. Use extreme caution if you find yourself in this scenario.
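For reference, the four options described above would sit in a net section roughly like the one this sketch prints. It assumes DRBD 8.3-style syntax; the helper name is hypothetical, and the rest of the r0 resource definition (devices, disks, addresses) is deliberately left out.

```shell
# Print a DRBD 8.3-style "net" section with the options described above.
# The output belongs inside the "resource r0 { ... }" block.
print_drbd_net_section() {
    cat <<'EOF'
net {
    allow-two-primaries;
    after-sb-0pri discard-zero-changes;
    after-sb-1pri discard-secondary;
    after-sb-2pri disconnect;
}
EOF
}
```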
If you'd like to read about DRBD split-brain behavior in more detail, see the DRBD documentation.
I generally turn off the usage reporting functionality in DRBD within /etc/drbd.d/global_common.conf:
global {
usage-count no;
}
Now we can create the volume and start DRBD:
drbdadm create-md r0
/etc/init.d/drbd start && chkconfig drbd on
You may see some errors about having two primaries when neither is up to date. That can be fixed by running the following command on the primary node only:
drbdsetup /dev/drbd0 primary -o
If you run cat /proc/drbd on the secondary node, you should see the DRBD sync running:
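If you would rather script the check than watch the output, here is a minimal sketch. The function name drbd_in_sync is hypothetical; it assumes the DRBD 8.x status format, where each device line carries a ds:<local>/<peer> disk-state field.

```shell
# Succeed when every DRBD device in the status file reports both
# sides as UpToDate.
# $1 = status file (defaults to /proc/drbd)
drbd_in_sync() {
    status_file=${1:-/proc/drbd}
    # Count device lines, then count the ones that are fully synced.
    total=$(grep -c 'ds:' "$status_file" || true)
    synced=$(grep -c 'ds:UpToDate/UpToDate' "$status_file" || true)
    [ "$total" -gt 0 ] && [ "$total" -eq "$synced" ]
}

# Example: poll until the initial sync has finished.
#   until drbd_in_sync; do sleep 10; done
```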
We're now ready to move on to configuring OCFS2. Only one package is needed:
yum -y install ocfs2-tools
Ensure that you have your servers and their private IP addresses in /etc/hosts before proceeding. Create the /etc/ocfs2 directory and place the following configuration in /etc/ocfs2/cluster.conf (adjust the server names and IP addresses):
cluster:
	node_count = 2
	name = web

node:
	ip_port = 7777
	ip_address = 10.181.76.0
	number = 1
	name = server1
	cluster = web

node:
	ip_port = 7777
	ip_address = 10.181.76.1
	number = 2
	name = server2
	cluster = web
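Before starting the cluster, it can help to confirm that both node names from cluster.conf really are in /etc/hosts. A minimal sketch, with a hypothetical helper name:

```shell
# Succeed if the given hosts file has an entry for the given name.
# $1 = hosts file
# $2 = host name to look for
hosts_has_name() {
    awk -v name="$2" '
        /^[[:space:]]*#/ { next }    # skip comment lines
        { for (i = 2; i <= NF; i++) if ($i == name) found = 1 }
        END { exit !found }
    ' "$1"
}

# Example:
#   for n in server1 server2; do
#       hosts_has_name /etc/hosts "$n" || echo "missing: $n"
#   done
```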
Now it's time to configure OCFS2. Run service o2cb configure and follow the prompts. Use the defaults for all of the responses except for two questions:
Answer "y" to "Load O2CB driver on boot"
Answer "web" to "Cluster to start on boot"
Start OCFS2 and enable it at boot up:
chkconfig o2cb on && chkconfig ocfs2 on
/etc/init.d/o2cb start && /etc/init.d/ocfs2 start
Create an OCFS2 partition on the primary node only:
mkfs.ocfs2 -L "web" /dev/drbd0
Mount the volumes and configure them to automatically mount at boot time. You might be wondering why I do the mounting within /etc/rc.local. I chose to go that route since mounting via fstab was often unreliable for me due to the incorrect ordering of events at boot time. Using rc.local allows the mounts to work properly upon every reboot.
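A minimal sketch of the mount setup described above. The append_once helper is hypothetical, and the fstab options shown (noauto so that boot does not try to mount early, noatime) are assumptions rather than required values.

```shell
# Append a line to a file only if that exact line is not already there,
# so the setup can be re-run safely.
# $1 = file
# $2 = line to add
append_once() {
    grep -qxF -- "$2" "$1" 2>/dev/null || printf '%s\n' "$2" >> "$1"
}

# Example (run as root on both nodes):
#   mkdir -p /mnt/storage
#   append_once /etc/fstab '/dev/drbd0 /mnt/storage ocfs2 noauto,noatime 0 0'
#   append_once /etc/rc.local 'mount /dev/drbd0'
#   mount /dev/drbd0
```

With noauto in fstab, the entry only supplies the mount point and options; the actual mount happens when rc.local runs, after DRBD and O2CB are up.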
At this point, you should be all done. If you want to test OCFS2, copy a file into your /mnt/storage mount on one node and check that it appears on the other node. If you remove it, it should be gone instantly on both nodes. This is a great opportunity to test reboots of both machines to ensure that everything comes up properly at boot time.
This is a post from Major Hayden's blog.
The guide that I wrote recently will need some adjustments as I've fallen hard for the performance and reliability of DRBD and OCFS2. As a few of my sites were gaining in popularity, I noticed that GlusterFS simply couldn't keep up. High I/O latency and broken replication threw a wrench into my love affair with GlusterFS and I knew there had to be a better option.
Diagram of two web nodes with a replicated filesystem using DRBD & OCFS2
I've shared my configuration with my coworkers and I've received many good questions about it. Let's get to the Q&A:
How does the performance compare to GlusterFS?
On Gluster's best days, the data throughput speeds were quite good, but the latency to retrieve the data was often much too high. Page loads on this site were taking upwards of 3-4 seconds with GlusterFS latency accounting for well over 75% of the delays. For small files, GlusterFS's performance was about 20-25x slower than accessing the disk natively. The performance hit for DRBD and OCFS2 is usually between 1.5-3x for small files and difficult to notice for large file transfers.
Couldn't you keep the data separate and then sync it with rsync?
Everyone knows that rsync can be a resource consuming monster and it seems wasteful to call rsync via a cron job to keep my data in sync. There are some periods of the day where the actual data on the web root rarely changes. There are other times where it changes rapidly and I'd end up with nodes out of sync for a few minutes.
To get the just-in-time synchronization that I want, I'd have to run rsync at least once a minute. If the data isn't changing over a long period, rsync would end up crushing the disk and consuming CPU for no reason. DRBD only syncs data when data changes. Also, all reads with DRBD are done locally. This makes it a highly efficient and effective choice for instant synchronization.
Why OCFS2? Isn't that overkill?
When you use DRBD in dual-primary mode, it's functionally equivalent to having a raw storage device (like a SAN) mounted in two places. If you threw an ext4 filesystem onto a LUN on your SAN and then mounted it on two different servers, you'd be in bad shape very quickly. Non-clustered filesystems like ext3 or ext4 can't handle being mounted in more than one environment.
OCFS2 is built primarily to be mounted in more than one place and it comes with its own distributed locking manager (DLM). The configuration files for OCFS2 are extremely simple and you mount it like any other filesystem. It's been part of the mainline Linux kernel since 2.6.16.
What happens when you lose one of the nodes?
The configuration shown above can operate with just one node in an emergency. When the failed node comes back online, DRBD will resync the block device and you can mount the OCFS2 filesystem as you normally would.
You're using an Oracle product? Really?
You've got me there. I'm not a fan of how they treat the open source community with regards to some of their projects, but the OCFS2 filesystem is robust, free, and it meets my needs.
As my uptime reports have shown, and as some of you have reported, my blog's load time has increased steadily over the past few weeks. It turns out that one of my VMs was on a physical machine that had some trouble and I was reaching a point where GlusterFS's replicate functionality couldn't meet my performance needs.
Instead of using GlusterFS as I had before, I decided to use DRBD in dual-primary mode with OCFS2 as the clustering filesystem on top of it. The performance is quite good so far:
Pingdom Response Time Graph for rackerhacker.com
I switched over the DNS late last night and the response time has fallen from the two to three second range (during times of low load) to right around one second per request. In addition to the reduced load times, I can support higher concurrency without significant performance degradation.
Don't worry — I'll make a detailed post on this topic later along with a guide on how to set it up yourself.
Today, on my 28th birthday, I'm finally delivering on a promise to my readers which I made about two months ago. I've written a guide on how to host a web application redundantly in a cloud environment. While it's still a bit of a rough draft, it should be a good starting point for those who haven't worked in virtualized environments before. Also, it may show some of the more experienced systems administrators a new way to do things.
The guide:
As always, if you find anything in the guide that needs improvement, I'm all ears.
High availability is certainly not a new concept, but if there's one thing that frustrates me with high availability VM setups, it's storage. If you don't mind going active-passive, you can set up DRBD, toss your favorite filesystem on it, and you're all set.
If you want to go active-active, or if you want multiple nodes active at the same time, you need to use a clustered filesystem. Clustered filesystems are certainly good options to consider, but they're not trivial to implement. They usually rely on additional systems and scripts to operate reliably.
What about the rest of us who want multiple active VMs with simple replicated storage that doesn't require any additional elaborate systems? This is where GlusterFS really shines. GlusterFS can ride on top of whichever filesystem you prefer, and that's a huge win for those who want a simple solution. However, that means that it has to use FUSE, and that will limit your performance.
Let's get this thing started!
Consider a situation where you want to run a WordPress blog on two VM's with load balancers out front. You'll probably want to use GlusterFS's replicated volume mode (RAID 1-ish) so that the same files are on both nodes all of the time. To get started, build two small Slicehost slices or Rackspace Cloud Servers. I'll be using Fedora 13 in this example, but the instructions for other distributions should be very similar.
First things first — be sure to set a new root password and update all of the packages on the system. This should go without saying, but it's important to remember. We can clear out the default iptables ruleset since we will make a customized set later:
# iptables -F
# /etc/init.d/iptables save
iptables: Saving firewall rules to /etc/sysconfig/iptables: [ OK ]
GlusterFS communicates over the network, so we will want to ensure that traffic only moves over the private network between the instances. We will need to add the private IP's and a special hostname for each instance to /etc/hosts on both instances. I'll call mine gluster1 and gluster2:
10.xx.xx.xx gluster1
10.xx.xx.xx gluster2
You're now ready to install the required packages on both instances:
Make the directories for the GlusterFS volumes on each instance:
mkdir -p /export/store1
We're ready to make the configuration files for our storage volumes. Since we want the same files on each instance, we will use the --raid 1 option. This only needs to be run on the first node:
# glusterfs-volgen --name store1 --raid 1 gluster1:/export/store1 gluster2:/export/store1
Generating server volfiles.. for server 'gluster2'
Generating server volfiles.. for server 'gluster1'
Generating client volfiles.. for transport 'tcp'
Once that's done, you'll have four new files:
booster.fstab — you won't need this file
gluster1-store1-export.vol — server-side configuration file for the first instance
gluster2-store1-export.vol — server-side configuration file for the second instance
store1-tcp.vol — client side configuration file for GlusterFS clients
Copy the gluster1-store1-export.vol file to /etc/glusterfs/glusterfsd.vol on your first instance. Then, copy gluster2-store1-export.vol to /etc/glusterfs/glusterfsd.vol on your second instance. The store1-tcp.vol should be copied to /etc/glusterfs/glusterfs.vol on both instances.
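The copy step can be sketched as a small helper. The function name install_volfiles is hypothetical; the file names are the ones glusterfs-volgen generated above.

```shell
# Copy the generated volfiles into place for one instance.
# $1 = instance name (gluster1 or gluster2)
# $2 = directory holding the generated files
# $3 = target config directory (/etc/glusterfs on a real system)
install_volfiles() {
    cp "$2/$1-store1-export.vol" "$3/glusterfsd.vol" &&
    cp "$2/store1-tcp.vol" "$3/glusterfs.vol"
}

# Example, run on the first instance from the directory with the files:
#   install_volfiles gluster1 . /etc/glusterfs
```

On the second instance the same call with gluster2 as the first argument picks up that server's volfile, once the generated files have been copied over.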
At this point, you're ready to start the GlusterFS servers on each instance:
/etc/init.d/glusterfsd start
You can now mount the GlusterFS volume on both instances:
mkdir -p /mnt/glusterfs
glusterfs /mnt/glusterfs/
You should now be able to see the new GlusterFS volume in both instances:
# df -h /mnt/glusterfs
Filesystem Size Used Avail Use% Mounted on
/etc/glusterfs/glusterfs.vol
9.4G 831M 8.1G 10% /mnt/glusterfs
As a test, you can create a file on your first instance and verify that your second instance can read the data:
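A minimal sketch of that test. The helper name write_marker and the marker file name are made up for illustration.

```shell
# Write a small marker file into the mounted volume and print its path.
# $1 = mount point (/mnt/glusterfs in this guide)
write_marker() {
    marker="$1/replication-test.txt"
    printf 'written on %s\n' "$(uname -n)" > "$marker" && printf '%s\n' "$marker"
}

# On the first instance:
#   write_marker /mnt/glusterfs
# On the second instance:
#   cat /mnt/glusterfs/replication-test.txt
```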
If you remove that file on your second instance, it should disappear from your first instance as well.
Obviously, this is a very simple and basic implementation of GlusterFS. You can increase performance by making dedicated VM's just for serving data and you can adjust the default performance options when you mount a GlusterFS volume. Limiting access to the GlusterFS servers is also a good idea.
If you want to read more, I'd recommend reading the GlusterFS documentation.
Thank you for your e-mails! I'll be expanding on this post later with some sample benchmarks and additional tips/tricks, so please stay tuned.