A tale of a resurrected bug
Crap, it's not working...
I had just finished moving one of our larger Plone deployments to a new host. Everything had gone, with one or two minor exceptions, exceedingly smoothly, and I was looking forward to setting aside the laptop for the rest of the evening. (Yes, that does happen occasionally ;) ...But then I got an IM...
[our client]: 22:06:50 i am getting:
Bad gateway The proxy server received an invalid response from an upstream server
Well, shit. That's weird... Normally I might have expected this 502 error from squid when Zope is restarting, but in this case Zope is directly behind apache, and is running fine. And the error's only occurring on one particular page...what the heck? Also... why is this a problem now when it definitely wasn't happening on the old server? My entire software stack is installed from a buildout with pinned versions, and I copied the apache configuration, so what variables could be left?
I could probably have already known at this point that the evening was shot. But what to do?
Time for some inspection...
Okay, well obviously Apache doesn't like something it's getting from Zope. Maybe a Zope bug of some sort? Let's check the apache error log and see what it tells us:
[Fri Jan 23 22:12:05 2009] [warn] proxy: bad HTTP/1.1 header returned by /@@sub_trial (POST)
Gee, thanks for all the detail, apache. Well, at least this confirms my suspicions that some header from Zope is wrong. But what could it be? Time to hit Zope directly and see what we can see in the headers. Open up the port in the firewall, enter the Virtual Host Monster URL to the problem page, view the headers in Firebug...what have we got?
Date: Mon, 26 Jan 2009 08:35:33 GMT Server: Zope/(Zope 2.10.5-final, python 2.4.4, linux2) ZServer/1.1 Plone/3.0.3 Content-Type: text/html; charset=utf-8 Location: https://www.xxxxx.org/thankyou_trial Set-Cookie: statusmessages="A+RXZWxjb21lISBZb3UgYXJlIG5vdyBsb2dnZWQgaW4uaW5mbwTEWW91ciBUcmlhbCBTdWJzY3JpcHRpb24gSXMgTm93IEFjdGl2ZS5pbmZv"; Path=/ Via: 1.1 xxxxx.org Vary: Accept-Encoding,User-Agent Content-Encoding: gzip Content-Length: 20 Keep-Alive: timeout=15, max=100 Connection: Keep-Alive
Well, the page loads fine when I hit the site directly in Zope, and those headers look pretty innocuous. At this point my only guess was that the gzipping was screwing things up somehow, and thus followed a half-hour-long digression of confirming the gzip settings in CacheFu, confirming the mod_deflate settings in apache, and finally disabling gzipping entirely in Firefox. All of which served only to tell me in the end that the problem had nothing to do with gzipping.
Pull out the big guns: Google
At this point I did what any self-respecting software professional would do: I typed the error message into Google. This is where things really start to get interesting...
A query of
zope "bad HTTP/1.1 header" netted me a few irrelevant-looking results. But aha, also an old e-mail thread that describes a similar error, and that has a link to a Plone trac ticket that...well, "Plone Hotfix 20071106 breaks long status messages" doesn't exactly sound relevant, but look, there's a comment that mentions the bug results in a 502 gateway error when running behind apache -- which matches my experience -- and it even says the bug is still present in Plone 3.0.3, the version I'm using on this particular site. And the description of the issue -- a problem with the statusmessage cookie header -- sounds reasonably close to the diagnosis I was guessing at. Hey, maybe we're on to something!
But something is still not right. The comments on the ticket also say that a newer version of the hotfix that introduced the bug, PloneHotfix20071106, was released to fix the issue all the way back in November 2007. And my buildout already includes this updated version -- it's right there as:
(Note the "-2"). Hmm, now that I think about it, maybe I have a vague recollection of encountering this same issue sometime in the distant past, and updating that hotfix in the buildout. But, obviously the fix is not working now. So what's going on?
And then it hit me.
David Glick: 23:14:50 holy crap, I think the real problem here is related to the recent plone.org upgrade David Glick: 23:14:56 no shit :)
Let me explain. During the recent plone.org upgrade, all the product tarball files stored in PloneSoftwareCenter were migrated from being stored in the ZODB to being stored in a big directory on the filesystem using FileSystemStorage and collective.psc.externalstorage. (This is part of the efforts to make PSC function as a pypi-style repository, a cool ability which I hope Alex Clark and Tarek Ziad?? will publicize before too long.) Now, the thing I noticed that made me think "aha!" is that the filename listed for the hotfix tarball in my buildout has no version suffix. And it didn't take me much longer to check the older release of that hotfix and see that it has the exact same tarball filename. And then, knowing what I did about the plone.org upgrade, it didn't take much imagination to think that the two releases must now be accessing the same file on the filesystem, and the "-2" release was getting the wrong data. And this was quickly confirmed by inspection of the actual Hotfix code that had been fetched into my buildout.
To express my feelings at this point, echoing the immortal words of our Zope Pope:
Case closed? You betcha. I stopped using the broken tarball and checked out the correct version of the hotfix from the svn tag instead. After restarting Zope, no more 502 error when I hit the form. And I followed up soon thereafter by replacing the hotfix tarball with the most recent release so that others shouldn't run into this same issue.
Note, however, that there's no guarantee at this point that other packages aren't experiencing a similar issue, so keep your wits about you when installing old-style product tarballs from plone.org. However, I have filed a report and trust that Alex or Tarek will be able to work on it before too long.
Of course, it goes without saying that I was rather lucky to have the familiarity with the plone.org upgrade needed to correctly diagnose the cause of this issue. But otherwise, just another day in the life of a Plone hacker...