October 25, 2017
A Very Odd Couple: Apache Web Server Response Time and IPFilter
As those who have been with me on this miserable excuse for a website for a while can likely attest, I am a fairly unabashed fan of *nix systems. I have been running FreeBSD at home for more than sixteen years and, for several years during that period, it was the only operating system I used. No Windows; no Macs; and sometimes no graphical user interfaces. Back in my truly hardcore days (i.e., Berkeley, circa 2002), I didn’t even bother to own a router; why waste the money on some piece of shit hardware (and back then, they really were pieces of shit, even more so than now) when I could route and NAT all the traffic through a computer I had built? Hell, I even flirted with OpenBSD, for crying out loud!
Of course, not all is beer and skittles when it comes to open source software: hardware incompatibility, broken device drivers, lacking (or worse, inaccurate) documentation, and maddeningly cryptic bugs will all sap your will to live if you spend enough time with it. And so it shouldn’t come as a surprise that, over the years, my once unwavering dedication to all things *nix began to waver. I gave up running my own homebrew router sometime in 2004, and tearfully bade farewell to hosting my own e-mail in 2007.
So much the worse that I did, as those sysadmin skills would have undoubtedly come in handy in debugging one of the most cryptic bugs with my systems that I have encountered to date: massive Apache webserver (httpd) performance degradation resulting from an IPFilter glitch. For those interested, more on the problem, the symptoms by which it manifested itself, and the eventual solution is below.
The Problem
For many months, I had been experiencing a major and quite inexplicable problem with the response time of the webserver that hosts this site (and a few others, including my photo gallery and my personal site). (For background, this site is hosted on a colocated server running FreeBSD (what else?) and uses Apache.) The sites, especially image downloads, were quite literally crawling.
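(For those who like numbers with their anecdotes: an easy way to quantify "crawling" is to ask curl for its timing breakdown. A rough sketch with a placeholder URL, not a measurement from the actual debugging:)
# print connect time and total transfer time for one request (URL is a placeholder)
curl -o /dev/null -s -w 'connect: %{time_connect}s  total: %{time_total}s\n' http://www.example.org/some/page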
Because the response time issue generally resolved itself (at least briefly) when I restarted httpd, I initially (and for many months of sporadic attempts at debugging thereafter) suspected that it had something to do with memory leaks. If Apache was swapping to disk because of memory leaks, that would, of course, be fatal for response time, and, as I noted last year, the crusty old software on which this site still runs (i.e., mod_perl) is quite prone to memory leaks. But no matter how much I tweaked or tuned Apache, I couldn’t get any detectable change in response time.
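(The standard prescription for leaky mod_perl children is to recycle them aggressively via the prefork MPM settings. Purely as a sketch, with made-up values rather than my actual configuration, that sort of tuning looks like this in an Apache 2.2-era httpd.conf:)
# httpd.conf (prefork MPM): recycle children before leaked memory piles up
<IfModule mpm_prefork_module>
    StartServers          5
    MinSpareServers       5
    MaxSpareServers      10
    MaxClients           50
    MaxRequestsPerChild 500
</IfModule>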
By early 2017, I had given up on trying to debug the issue, and instead decided to focus on finally killing off mod_perl and launching the new version of this site (v3), which has been in on-and-off development for nearly a decade without ever quite making it to the finish line. So, essentially, until I found the time to completely rewrite the site, it would just be non-functional (along with all other sites hosted on the machine). Not quite the ideal result.
A Breakthrough
A breakthrough in this seemingly intractable problem came about quite randomly a couple weeks ago. I was installing some software updates on the server, and because the port management software was having trouble pulling down some patches, I had temporarily disabled the firewall on the machine. By complete chance, I also happened to load a page on my photo gallery when the firewall was off and voilà! It loaded in an instant (including several images)!
Could it be that the firewall was killing Apache? How? Why?! I turned the firewall back on, and loaded the same page again; it took several seconds to load. What the fuck?
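(For the record, flipping IPFilter off and on again on FreeBSD is a one-liner either way; something along these lines, assuming the stock rc.d script and the usual ipf binary:)
service ipfilter stop    # unload the filter entirely
service ipfilter start   # load it back with the configured rule set
ipf -D                   # or: disable packet filtering in place
ipf -E                   # ...and re-enable it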
So, at that point, I was pretty sure I had found the source of the problem, but I still had no clue as to what the solution could be. Back in the halcyon days of running my own router, I had spent a lot of time playing around with my firewall configuration, but by this point, my firewall rule set was well established; it had not changed in years and there had never been an issue with Apache before. What was going on?
A Most Obscure Bug
Hours of research and numerous wild goose chases led nowhere. I had again hit a brick wall and was now considering dumping IPFilter in favor of another firewall (such as IPFW or PF). But the idea of having to rewrite and test all the existing rules in a completely different piece of software was daunting, and I was quite reluctant to do it.
As a last ditch effort, I decided to do a deep dive into the network traffic coming in and out of the machine to try to deduce what was getting messed up. While scanning through netstat and ipfstat dumps after accessing discrete pages on my website, I noticed something unusual:
996 block out log first quick on [Interface] all
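(That line comes from the rule listing with per-rule hit counters; something like the following standard ipfstat invocation produces it, though exact flags can vary by version:)
ipfstat -hio    # list inbound and outbound rules, each prefixed with its hit count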
Huh? Why were so many packets trying to get out but getting blocked? DNS queries? No. DNS queries were explicitly permitted in the firewall. Digging through the logs generated by ipmon, I then came across log entries like the following, coming in rapidly when I accessed a page:
Oct 25 19:41:47 hegemon ipmon[97188]: 19:49:52.149937 2x [Interface] @0:21 b [Server IP address],80 -> [Client IP address],59222 PR tcp len 20 4396 -A OUT bad
The "bad" flag at the end of those entries meant that outgoing packets were somehow getting corrupted, and then blocked by the firewall. Turning off the default rule that blocked all outgoing packets that did not meet certain criteria confirmed it; the site was responsive when that rule was not in effect.
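(For anyone retracing these steps: the usual way to test a change like that is to comment out the offending rule and reload the whole set. A sketch, assuming the conventional FreeBSD rules file location:)
ipf -Fa -f /etc/ipf.rules    # flush all loaded rules, then reload from the (edited) file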
The Solution
Finally, we had gotten to the $64,000 question: why the fuck were outgoing packets getting corrupted?! I eventually stumbled across what I think is likely the answer with some googling. In a 2007 (!) e-mail thread on the freebsd-sparc64 mailing list (!!), someone had encountered a similar error in a release candidate of FreeBSD 6.3 (!!!), and one of the responses suggested disabling checksum offloading capabilities for the hardware interface as a fix:
ifconfig [Interface] -txcsum -rxcsum
PC load letter? Nevertheless, I ran that command, re-enabled the default rule blocking outgoing packets, and started everything up again. And like a charm, it all worked.
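(To make the change survive a reboot, the usual FreeBSD move is to fold the flags into the interface line in /etc/rc.conf; a sketch with a hypothetical em0 interface and a documentation-range address standing in for the real ones:)
# /etc/rc.conf (interface name and address are placeholders)
ifconfig_em0="inet 192.0.2.10 netmask 255.255.255.0 -txcsum -rxcsum"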
Postscript
So, what happened here? Why was it working fine before and suddenly stopped working?
The first part of the puzzle is somewhat simple: sometime in March 2016, my colocation service upgraded the hardware running my server, and in the course of that process, the hardware network interface changed. Presumably the previous interface did not have checksum offloading capabilities (or those capabilities were not enabled, or the issue with IPFilter simply did not manifest itself with that particular interface). Because I was sure that the server response time issues were caused by memory leaks in mod_perl, I never connected the emergence of the problem with the upgrade of the server and its network interface.
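(Whether a given interface has offloading switched on is easy enough to check; with a hypothetical em0 interface, the options line in ifconfig output tells the story:)
ifconfig em0 | grep options    # look for TXCSUM and RXCSUM in the options=<...> list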
As to what the issue is with IPFilter and checksum offloading capabilities, frankly, I don’t have the faintest idea. The 2007 e-mail thread suggested a bug in IPFilter, but I have not been able to confirm either the existence of the bug or any purported solution via some light Google research. It does seem odd that a bug that surfaced in 2007 would still be around, but perhaps it is not impossible if IPFilter is not being actively developed.
So, that’s it? No magnificent explanation? Too bad, so sad? Recent experiences painfully wading through the innards of obscure systems notwithstanding, I really have no desire to figure this one out. Frankly, if IPFilter is no longer being developed, I think it’s probably a sign that I should move my firewalls to a different system, not figure out this fucking obscure bug. Even I have my limits.
But just in case anyone else comes across this issue, I thought I would document it for posterity. You’re welcome, dear readers.
(Note that interface names and IP addresses have been redacted for privacy throughout.)
Posted by Streeter | October 31, 2017 14:47:35 -0700 | Permalink