Sunday 16 November 2014

The 10 commandments of maintainable web services

Here is a list of the ten core elements a development-to-deployment infrastructure needs in order to provide a stable service for your web applications. Together they minimize the time wasted on bugs and issues unrelated to functional development, and slash maintenance time and cost compared to systems without them. I guess it could also be called automation, automation, automation ...

It should be noted that just because an application is a legacy one, it does not mean that this infrastructure cannot be retrofitted to it. *
  1. Standard environment
    A set of consistently built and upgraded deployment phase environments - dev, demo, train and prod - for the full application stack, e.g. app server, cache, web server and storage. All development and deployment is done on these entirely standard (ideally config managed / virtualised) cloned environments. If random desktop / laptop computers must be used, then ideally a VirtualBox build should be provided for dev, to match the deployment ones.
    For web applications the server side will be a single environment, but if client side software is involved it may require multiple standard environments for build and test.
  2. Automated build
    Run one command or press one button to create a full application stack instance on any of the deployment environments, including production. This should cover everything above the standard environment and ideally include storage too (see data automation). Each developer can build any number of deployment instances in the same automated fashion. Builds should be remotely runnable so they can be plugged into Continuous Integration (C.I.) servers etc.
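    As a rough illustration only, a one-command build might be a single Fabric task along these lines (the repository URL, directory layout and service name here are hypothetical, not taken from any real project):

    from fabric.api import cd, run, sudo

    def build(instance='dev', tag='master'):
        """Build a complete application stack instance from a tagged release."""
        site_dir = '/srv/%s/mysite' % instance   # hypothetical layout
        run('git clone -b %s https://example.org/mysite.git %s' % (tag, site_dir))
        with cd(site_dir):
            run('virtualenv env && env/bin/pip install -r requirements.txt')
            run('env/bin/python manage.py migrate --noinput')
            run('env/bin/python manage.py collectstatic --noinput')
        sudo('service apache2 restart')

    A C.I. job can then run the same task remotely, e.g. fab -H demo.example.org build:demo,1.2.0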
  3. Automated release management
    Particularly important is that no manual tasks are needed to deploy to production. A push button, C.I. driven deploy should be used, where each deployment is retained in a full log accompanied by a summary deployment note, the related software packages' release history and the source tag. This full logging of changes ties into software service change management concepts. If unforeseen dependency or system issues develop much later, they can then potentially be traced back through the highly detailed, timestamped change logging that this provides.
    Automating the roll-out means that you should also automate the roll back. You hopefully will test well enough not to need a safety net, but not bothering to use one is reckless.
    Another common loophole is that release management only covers the application layer. The standard environment, storage etc. are all part of the stack; changes to them are also releases, and need the same release management controls in place.
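    As a companion to the automated build, the rollback can be equally small - a sketch along these lines, assuming a symlinked releases layout (again, every name here is hypothetical):

    from fabric.api import run, sudo

    def rollback(instance='prod', previous_tag='1.4.1'):
        """Redeploy the last known good release recorded in the C.I. deployment log."""
        # Switching a 'current' symlink back to the previous release is atomic
        run('ln -sfn /srv/%s/releases/%s /srv/%s/current'
            % (instance, previous_tag, instance))
        sudo('service apache2 restart')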
  4. Revision control of the entire application stack
    Everything in the application should be versioned: all the source code of course, but also all the deployment and automation code. The third party components should all have their own versions (if not, download them, version them and deploy from your own local repository). That includes the application specific environment configuration, eg. Apache virtual host configuration.
    Build automation should allow a tag to be specified (or a date, or any previous release - as logged via C.I.)
    The code dependency stack should also be versioned - so the version of every component is pinned for a system release. Language specific build tools such as Maven, Pip, Ant, Phing, Bundler, Buildout etc. provide this. The standard environment(s) should also be versioned via their config management tool.
  5. Integrated documentation
    Core documentation should be written and versioned with the source code; each package should at least have a README and a release HISTORY tied to each production release's version number. These need to be kept up to date with the rest of the source. Separate wikis for fuller / less technical docs are fine - but documentation of changes in functional specification needs to use the same version control as the code, and unless all your code has rigorous processes around a version-control-integrated issue tracker, that is most reliably done by putting the documentation in the code.
    Ideally the language's packaging tools should have a system to extract embedded documentation and comments into HTML on a software repository server - for easy reference.
    Automation to keep the web documentation up to date should be implemented. 
  6. Software upgrade process
    Major version platform upgrades should always be performed within a year of the release date, and security patches within a month at most - ideally the former within a few months and the latter within a few days. Any longer and code divergence can make the upgrade hill too big a cost to climb, or leave systems and data open to compromise. Major language / framework upgrades (as well as releases) should not require significant system outages. They may not be automated to set up, but they should be automated to roll over between versions - so even without a multi-server load balanced layer in part of your application stack, downtime should still be under a minute at most, e.g. an Apache or database restart.
  7. Automated testing
    They may not provide great coverage, but a minimal test suite is a necessity to allow confirmation that the automation infrastructure is working.
    Good test coverage means that complex functional errors or regressions can be written as tests and added to periodic builds - so ensuring that future releases are free of them - but a set of minimal functional or black box tests is sufficient for basic confirmation that automated environment upgrades, or minor application fix releases, do not cause critical failures. These tests can also be tied to monitoring / timed load testing - to check upcoming releases for performance regressions.
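    For example, at its most minimal a Django application could carry a couple of black box smoke tests like the sketch below (the URLs are hypothetical), run from the C.I. build after every automated release or environment upgrade:

    from django.test import TestCase

    class SmokeTests(TestCase):
        """Minimal black box checks that the stack still serves its key pages."""

        def test_home_page_renders(self):
            self.assertEqual(self.client.get('/').status_code, 200)

        def test_login_page_renders(self):
            # hypothetical URL - any critical page will do
            self.assertEqual(self.client.get('/accounts/login/').status_code, 200)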
  8. Data automation
    This involves data fixtures, automated schema generation and synchronisation.
    With an object relational mapper (ORM) now standard in today's web applications, your system should have a full data abstraction layer, even in the most micro of web application frameworks. In turn that means today's application code should contain within it the means to generate all of the data layer. Ideally ORMs should provide the means to fully abstract the database implementation, to generate that implementation within a range of RDBMS, and to generate data fixtures for it - for building populated new development instances or for testing.
    As standard the test harness will setup and teardown the data layers.
    More mature ORMs will also have schema migration tools. These are essential for fully automated release management, since invariably a significant release will involve a change to the data schema, or at least a new entry in the database. A synchronisation tool will tend to use meta-programming to automatically generate the migration code that synchronises the schema - that migration is then released (or rolled back) as part of the code release, keeping the data storage in the release management loop. Any data modification (DML) that the application requires can be added to the DDL of the schema migration. These tools will also have introspection code to detect that data migration is required if connected to a previous version of the database. Bespoke applications may not have such a tool, but at worst they should have data creation and migration code written and packaged with newly released versions - manual database tinkering around the time of the code release is not acceptable.
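    To make that concrete, a Django 1.7 style migration bundles the DDL and any required DML together in one versioned, releasable (and reversible) file - a minimal sketch, with a hypothetical survey app and field:

    from django.db import migrations, models

    def add_default_category(apps, schema_editor):
        # DML packaged with the DDL: seed the new column (hypothetical model / field)
        Survey = apps.get_model('survey', 'Survey')
        Survey.objects.filter(category='').update(category='general')

    class Migration(migrations.Migration):

        dependencies = [('survey', '0003_auto')]

        operations = [
            migrations.AddField('survey', 'category',
                                models.CharField(max_length=50, default='')),
            migrations.RunPython(add_default_category),
        ]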
  9. Package management
    Application layer package management will always be language specific, but any language should offer it. Ideally a package repository should be maintained for each language your services use. These may be core to the language like PyPi and RubyGems or for languages without them in the core there are commercial offerings like Nexus for Java.
    This caters for version dependency management and reliable upgrade. Of course to use a package manager fully, you should package all your application source code. Ad hoc scripts, framework app archives, raw class and resource bundles etc. - just say no. If you are going to release your code rather than chuck it over the wall ... package it and version it. So all your code should be in jars, eggs, gems - or whatever your language likes to call them.
    Not only that you should apply the same rules to splitting up packages as you apply to splitting up code into classes. Some packages may be dependent on others - but each separate component of the application should be a different package - to allow it to be separately version controlled and released. To encourage encapsulation and hence allow for packages to be reused, retired or replaced without replacing the whole application's code base.
    (NB: Environment package management will be operating system specific and that should be implemented as part of the standard environment config management layer - no building from source here!)
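    Packaging a component is rarely more than a few lines of metadata - for example a minimal Python egg needs little beyond a setup.py like the sketch below (the names, version and pinned dependency are invented for illustration):

    from setuptools import setup, find_packages

    setup(
        name='mysite-accounts',        # hypothetical component of the application
        version='1.4.2',
        packages=find_packages(),
        include_package_data=True,
        install_requires=[
            'Django>=1.6,<1.7',        # pin the dependency stack per release
        ],
    )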
  10. Monitoring
    One of the most important issues with logging and error notification is the cry wolf factor. You need to ensure that you draw the line in the right place for what counts as a critical error - ie. one that generates notifications to people. You can have over-reporting initially if it makes you hammer down on all those bugs to get to a reasonable level. But the one thing that makes monitoring ineffective is over-reporting: if a system has been emailing you a hundred stack traces a day for the last month - or the critical log is equally verbose - you filter the emails and ignore the log. You need critical bug notifications to be rare enough that you jump straight on fixing them when they are sent. Of course don't overdo it either; ideally you should never be in the position where the only reason you know a service is down is because an end user has phoned up to tell you. If your monitoring is good enough it will always beat the end user to it, for all but the most involved functional errors.
    You also need standard uptime monitoring such as Nagios or the like to notify if services have failed completely (unable to send application layer errors) for each of the layers - web, storage, cache, environment.
    Plus load logging for each, response time logging, etc. Most importantly you need to retain the logging over time and hence be able to look back at problems vs. change management data (see automated release management) to be able to diagnose many service issues and ideally predict and forestall them.
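    As a small illustration of drawing that line, the standard Python logging module can keep everything in the application log but only email people for genuinely critical errors - a sketch, with made-up addresses and paths:

    import logging
    from logging.handlers import SMTPHandler

    logger = logging.getLogger('mysite')      # hypothetical application logger
    logger.setLevel(logging.INFO)

    # Everything goes to the application log for later diagnosis ...
    logger.addHandler(logging.FileHandler('/var/log/mysite/app.log'))

    # ... but only genuinely critical errors email a human, so notifications
    # stay rare enough to be jumped on immediately.
    mail_handler = SMTPHandler(mailhost='localhost',
                               fromaddr='monitor@example.org',
                               toaddrs=['oncall@example.org'],
                               subject='CRITICAL error on mysite')
    mail_handler.setLevel(logging.CRITICAL)
    logger.addHandler(mail_handler)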

Walk the walk

So do I have the ten commandments in place for all our production systems in my current work place? In part: we have them for all our Python Django web applications (although some are a bit sparse in places - eg. monitoring, and release management below the application layer). But our Java architecture only has packaged components, although work is being done on new Java Spring systems to provide automated build, ideally some tests, and the need for monitoring is recognized. Hopefully we will tick all ten boxes for it too, eventually, so that we have as solidly maintainable a Java Spring platform as we have with our Python Django infrastructure.

However the concern is perhaps as much with all our legacy or outsourced systems integration code. These have none of these components and no realistic likelihood of getting them. Hence there is a huge support burden as a result, diverting time away from putting this infrastructure in place and leading to unreliable services. Add to that the way platforms can be frozen whilst still in use, as with our legacy Python (Zope) architecture, and then rot and lose the maintenance infrastructure that they had (our old CMS went live with half of the above features - now it has none), and the picture becomes a little bleak. Here the answer is perhaps to start to implement much more hard nosed rules wrt. retiring systems that have replacements, whether or not those replacements fully cover the same functional space. Essentially this is a management issue, not a technical one.

With a much reduced set of critical legacy systems and appropriate resourcing it would be possible to retrofit the commandments to them, and bring all services up to a similar level of quality control.

However the problem is greatly exacerbated by 'new' legacy bought-in systems. By this I mean third party supplier systems that we run and have to maintain (eg. regular upgrades, performance monitoring etc.) that lack most of the above features. Unfortunately that appears to be true of all the smaller suppliers' systems procured recently - ie. companies with under 10 core developers. Perhaps because most of them are providing products that actually are legacy, ie. have not been written, or fully rewritten, in the last 6 years (for the full rant on this topic see the ten commandments of software procurement!)

* Fixing the legacy and external systems

There are plenty of configuration management and shell framework tools that can be applied to automate even the messiest old legacy systems. The key rule here is that you don't need to write any of the infrastructure in the legacy code base. So use your standard C.I. server, shell framework and config management tools - don't add more procedural platform specific code (e.g. raw shell scripts).
Modern automation tools should all be pretty platform independent - although if running Windows and Unix you may be better off using a different shell framework for each, eg. Fabric and PowerShell, and possibly the same goes for config management tools.

If the code contains closed source compiled components with no versioning, then the binaries can still be put into version control and release numbers assigned. At worst decompilation tools can be used - if there is no other reasonable way to fix or replace the components.

Similarly black box testing tools can be applied to any software, and if none of the technical team know what that code is doing, end users can provide a basic functional spec of what it's meant to do; these few basic stories can then be used to create some minimal BDD tests.
Data in / data out dumps and comparisons can also be used as a basis for manually maintained fixtures (a simple sketch of such a comparison follows below). Legacy components can be split up and packaging added to them ... but much more work along this line of legacy code re-factoring and we start to raise the question of whether respecify / rewrite / replace would be more cost effective.
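A dump comparison needs nothing more exotic than a few lines of Python - a sketch, assuming two CSV exports taken either side of a change (the file names are made up):

import csv

def rows(path):
    """Load a CSV dump as a set of rows for order-independent comparison."""
    with open(path) as dump:
        return {tuple(row) for row in csv.reader(dump)}

before = rows('export_before.csv')
after = rows('export_after.csv')

print('removed:', before - after)
print('added:', after - before)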

Wednesday 29 October 2014

Fixing third party Django packages for Python 3

With the release of Django 1.7 it could be argued that the balance has finally tipped towards Python 3 being its preferred platform. Given that Python 2.7 is the last of the 2.* line, it's probably time we all thought about moving to Python 3 for our Django deployments.

The problem is those pesky third party package developers, because unless you are a determined wheel reinventor (unlikely if you use Django!) you are bound to have a range of third party eggs in your Django sites. As one of those pesky third party developers myself, it is about time I added Python 3 compatibility to my Django open source packages.

There are a number of resources related to porting Python from 2 to 3, including some specifically for Django, but hopefully this post may still prove useful as a summarised approach for doing it for your Django projects or third party packages. Hopefully it isn't too much work, and if you have been writing Python as long as me, it may also get you out of any legacy syntax habits you have.

So let's get started; the first thing is to set up Django 1.7 with Python 3.
For repeatable builds we want pip and virtualenv - if they are not there already.
On a Linux platform such as Ubuntu you will have python3 installed as standard (although not yet the default python), so if you just add pip3 that lets you add the rest ...

Install Python 3 and Django for testing


sudo apt-get install python3-pip
(OR sudo easy_install3 pip)
sudo pip3 install virtualenv



So now you can run virtualenv with python3 in addition to the default python (2.*)

virtualenv --python=python3 myenv3
cd myenv3
bin/pip install django


Then add a src directory to hold the egg you want to make compatible with Python 3 - here an example from git (of course you can do this in one pip line if the source is in git)


mkdir src
git clone https://github.com/django-pesky src/django-pesky
bin/pip install -e src/django-pesky


Then run the django-pesky tests (assuming nobody uses an egg without any tests!) -
the command to run pesky's tests may be something like the following ...

bin/django-admin.py test pesky.tests --settings=pesky.settings
One rather disconcerting thing that you will notice with tests is that the default assertEqual message is truncated in Python 3 where it wasn't in Python 2, with a count of the missing characters given in square brackets, eg.

AssertionError: Lists differ: ['Failed to open file /home/jango/myenv/sr[85 chars]tem'] != []


Common Python 2 to Python 3 errors


And wait for those errors. The most common ones are:

  1. print statement without brackets
  2. except Error as err (NOT except Error, err)
  3. File open and file methods differ.
    Text files require better quality encoding - so more files default to bytes because strings in Python 3 are all stored in unicode
    (On the down side this may need more work for initial encoding clean up *,
    but on the plus side functional errors due to bad encoding are less likely to occur)
  4. There is no unicode() built-in in Python 3 since all strings are now unicode - ie. it has become str() - and hence strings no longer need the u'string' marker 
  5. Since unicode() is no longer available, the __unicode__ method is not used for a Django model's default representation. Hence just using
    def __str__(self):
            return self.name
    is the future proofed method. I actually found that models with __unicode__ and __str__ methods may not return any representation, rather than the __str__ one being used, as one might assume, in Django 1.7 and Python 3
  6. dictionary has_key has gone, must use in (if key in dict)

* I found more raw strings were treated as bytes by Python 3 and these then required raw_string.decode(charset) to avoid them going into the database string (eg. varchar) fields as pseudo-bytes, ie. strings that held 'élément' as '\xc3\xa9l\xc3\xa9ment' rather than bytes, ie. b'\xc3\xa9l\xc3\xa9ment'
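To illustrate several of the points above in one place, a minimal model module that runs unchanged on Python 2 and 3 might look like the following sketch (the model and file handling are made up for illustration):

from __future__ import unicode_literals

import io

from django.db import models
from django.utils.encoding import python_2_unicode_compatible


@python_2_unicode_compatible       # adds __unicode__ on Python 2, derived from __str__
class PeskyItem(models.Model):     # hypothetical model
    name = models.CharField(max_length=100)

    def __str__(self):             # no u'' markers needed with unicode_literals
        return self.name


def load_names(path):
    # Open text files explicitly as UTF-8 on both Pythons via io.open, so any
    # decoding problems surface here rather than as pseudo-bytes in the database.
    with io.open(path, encoding='utf-8') as names:
        return [line.strip() for line in names]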

Ideally you will want to maintain one version but keep it compatible with Python 2 and 3,
since this is both less work and gets you into the habit of writing transitional Python :-)

Test the same code against Python 2 and 3


So to do that you want to be running your tests with builds in both Pythons.
So repeat the above but with virtualenv --python=python2 myenv2
and just symlink the src/django-pesky to the Python 2 src folder.

Now you can run the tests for both versions against the same egg code -
and make sure when you fix for 3 you don't break for 2.

For current Django 1.7 you would just need to support the latest Python 2.7
and so the above changes are all compatible except for use of unicode() and how you call open().

Version specific code


However in some cases you may need to write code that is specific to Python 2 or 3.
If that occurs you can either use the try-the-latest-and-fall-back approach (cross fingers) ...

try:
    # latest version compatible code (e.g. Python 3 - Django 1.7)
    from django.utils.encoding import force_text
except ImportError:
    # older version compatible code (e.g. Python 2 - older Django)
    from django.utils.encoding import force_unicode as force_text

Or you can use specific version targeting ...

import sys, django
django_version = django.get_version().split('.')

if sys.version_info.major == 3 and int(django_version[1]) == 7:
    pass    # latest version
elif sys.version_info.major == 2 and int(django_version[1]) == 6:
    pass    # older django version
else:
    pass    # older version


where ...

django.get_version() -> '1.6' or '1.7.1'
sys.version_info -> sys.version_info(major=3, minor=4, micro=0, releaselevel='final', serial=0)

Summary

So how did I get on with my first egg, django-csvimport ? ... it actually proved quite time consuming since the csv.reader library was far more sensitive to bad character encoding in Python 3 and so a more thorough manual alternative had to be implemented for those important edge cases - which the tests are aimed to cover. After all if a CSV file is really well encoded and you already have a model for it - it hardly needs a pesky third party egg for CSV imports - just a few django shell lines using the csv library will do the job.


Thursday 3 July 2014

Spring MVC setup on Ubuntu

Recently setting up Spring MVC on Ubuntu 14 with Netbeans wasn't entirely obvious for a newbie, so I thought I would document it in case it saved somebody 10 minutes!


First install Apache and Tomcat, if you haven't got them already...

sudo apt-get install apache2


sudo apt-get install tomcat7 tomcat7-docs tomcat7-admin tomcat7-examples

You should also have the default openjdk for tomcat and ant build tool and git

sudo apt-get install default-jdk ant git

Edit tomcat-users.xml - Netbeans requires a user with the manager-script role
(NOTE: you shouldn't give the same user all these roles in a production Tomcat!
Also note that these manager roles have changed from Tomcat 6)

sudo emacs /etc/tomcat7/tomcat-users.xml


<tomcat-users>
  <role rolename="manager-gui"/>
  <role rolename="manager-script"/>
  <role rolename="manager-jmx"/>
  <role rolename="manager-status"/>
  <role rolename="admin-gui"/>
  <role rolename="admin-script"/>
  <user username="admin" password="admin" roles="manager-gui,manager-script,manager-jmx,manager-status,admin-gui,admin-script"/>
</tomcat-users>


You should restart Tomcat after editing this ...

sudo service tomcat7 restart

Now you should be able to go to http://localhost:8080 and see

It works !

If you're seeing this page via a web browser, it means you've setup Tomcat successfully. Congratulations! ...

Click on the link to the manager and get the management screen

If the login fails - reinstall apache and tomcat - it worked for me!

For Netbeans to find Apache OK you have to put the config directory where it expects it ...

sudo ln -s /etc/tomcat7/ /usr/share/tomcat7/conf

Note that the Tomcat location, ie. the deploy directory, is in

/var/lib/tomcat7

Now install Netbeans, latest version is 8, either by download and install or

sudo apt-get install netbeans

Start up netbeans and go to  Tools > Plugins

Pick the Available plugins tab

Search for web and tick Spring MVC - plus any others you fancy!

Restart Netbeans

Add a new project

  1. Choose New Project (Ctrl-Shift-N; ⌘-Shift-N on Mac) from the IDE's File menu. Select the Java Web category, then under Projects select Web Application. Click Next.
  2. In Project Name, type in HelloSpring. Click Next.
  3. Click the Add... button next to the server drop down
  4. Select the Apache Tomcat or TomEE server in the Server  list, click Next
    Enter  Server Location: /usr/share/tomcat7
    Enter the username and password from your tomcat-users.xml above and untick the create user box, if everything is working then it will accept this and add Tomcat to your server drop down list 
    (it shouldn't need to try to add the user unless that user isn't already properly set up with the manager-script role in Tomcat)
  5. In Step 4, the Frameworks panel, select Spring Web MVC.
  6. Select Spring Framework 3.x in the Spring Library drop-down list. 
    Spring Web MVC displayed in the Frameworks panel

 Click Finish and you should have a skeleton Spring MVC project, pressing the Play button should build it and run it up, then launch your chosen browser with the home page of that project via the Apache Tomcat you have setup.
Any changes should get auto-deployed and popped up in the browser again by pressing play.


Friday 2 May 2014

Lessons learned from setting up a website on Amazon EC2

I recently got involved with helping someone sort out their website on an Amazon EC2 instance. It had been a few years since I had needed to do anything with EC2, and I realised that I was a novice in this world - it raised a number of issues related to deploying to EC2 and performance.

So I thought it may be useful to run through them for any other EC2 novices who are asked to do something similar, and want to learn from my rather blundering progress through this :-)

Apologies to those of you who are already well familiar with EC2 for covering some of the basics.

The system moodpin.co.uk was based on a commercial PHP application, Pintastic.
This allows you to set up a site like pinterest.com or wanelo.com.
These sorts of sites are for creating subject specific photo sharing social media systems, like Instagram, Picasa etc. but focussed around communities of shared (usually commercial) interest - for example buying shoes, interior decor etc.
The common UI that they tend to present is big scrolling pages of submitted images related to topics for sharing, comment and discussion.

So this system sends out a lot of notification emails, involves displaying hundreds of images per page - the visual pin board - and to help with performance has custom caching built in - triggered by cron jobs.

Hence we have a number of cron jobs with the caching ones running every couple of minutes. To me this appeared a pretty crude caching mechanism - but my job was not to rewrite the application, but just tweak the code and get it all running OK.
The code mainly uses a standard MVC approach like everything else these days!

Demonstrating how outdated my knowledge of EC2 and this application was, I thought, OK - first of all, what platform is it? It was Amazon's own Linux - this uses yum rather than apt for package installs, so as distros go it's perhaps more Redhat-like than Debian.

For those unfamiliar with the basics - go to Amazon web services and sign up!
You can then choose to add some of the 40-odd different services that are available under the AWS umbrella.

Once you have signed up to a few of these, you get a management console that links to a control dashboard for each service. The first step is usually the one with the compute instances on it, EC2. From there you can pick an AMI (ie. an operating system image) and a zone - eg. US West (Oregon) - and use them to create a new instance. Add an SSH key pair for shell access, then fire it up and download the pem file so you can ssh into your new Amazon box.

So the client wanted the usual little tweaks to PHP code and CSS - easy stuff, it's just web development ... done in a jiffy (well, after digging through the MVC layers, templating language, cache issues and CSS inheritance etc. of a fairly complex PHP app you have never come across before, when PHP is not exactly your favourite language ... jiffy-ish maybe)
Then we got to the more SysAdmin related requests ... lets just say I probably shouldn't rush out and buy a DevOps tee-shirt just yet ...

'Get email working'

  1. Try to send an email from the web application - write a plain PHP script that just sends a test email - just run mail from the linux command line ... Got it there is no MTA installed! 
  2. Install an MTA - sendmail. Go back up that stack of actions and they are all working ... hurray that was easy.
  3. A week or so later ... 'emails stopped working'
  4. Go back to step 1. and yep - emails stopped working
  5. Look at the mail logs and see what the problem is.
  6. Realise that there are masses of emails being sent out ... but all of it is bouncing back as unverified.
  7. Think ... wow that pintastic site's notifier is busy - must be getting lots of traffic *
  8. So why has Amazon started bouncing all the email?
  9. Search Amazon's docs. Amazon has a very minimal test quota allowed for email. Once that quota is filled, unverified email will be blocked.
  10. Amazon has historically been one of the main sources of SPAM machines; that history means that it has to operate a much more elaborate mechanism for validating email than most hosting companies, and it no longer allows direct emailing from EC2 boxes (apart from minimal test quotas)
  11. So what we need to do is set up our mail to be sent via the Amazon SES service - add SES service and enable it
  12. So now we need to send authorised emails to the Amazon SES gateway that will then forward them on to the outside world
  13. Try to get sendmail to send authenticated emails; follow the guide, but it continues to bounce with authentication failures. Give up, install postfix instead, follow the 20 steps of setting up the SASL password etc., and eventually it doesn't bounce with authentication errors - hurray! (A quick way to sanity check the SES SMTP credentials is sketched just after this list.)
  14. But the email still bounces. So we need to verify all our sending email addresses - managed by the SES console - or use DKIM to get the whole domain verified and signed from which we are sending.
  15. Modify the emails used by the sending software to ones which we can receive and validate - send and validate them. Our emails are working again.
  16. Leave it a few days, we are not sending email anymore, boooo!
  17. Check all the SES documentation, surprise, surprise SES also has quota limits for test level only, and you have to formally apply to get those limits lifted.
  18. Contact the client and get him to make a formal request for quota lifting on his account.
  19. *As part of the investigation check that email log a little more closely, it seems rather large, and we seem to be using up our quotas really quickly ... ah the default setup for unix cron sends an email for every job that returns text. The pintastic cache job returns text, so we are sending a pointless email every two minutes ... or trying to ... whoops. Make sure no cron or other unix system command is acting as a SPAM bot.
  20. A few days later - Amazon say our quota has been lifted
  21. Our emails have started sending again ... and they are still sending today !!!
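As referenced in step 13, a few lines of Python are a handy way to confirm the SES SMTP relay and credentials work independently of sendmail / postfix. This is just a sketch - the endpoint, addresses and credentials are placeholders you would substitute from the SES console:

import smtplib
from email.mime.text import MIMEText

SES_HOST = 'email-smtp.us-east-1.amazonaws.com'   # assumed region endpoint
SES_USER = 'YOUR_SES_SMTP_USERNAME'
SES_PASS = 'YOUR_SES_SMTP_PASSWORD'

msg = MIMEText('SES relay test')
msg['Subject'] = 'SES relay test'
msg['From'] = 'verified-sender@moodpin.co.uk'     # must be an SES verified address
msg['To'] = 'you@example.org'

server = smtplib.SMTP(SES_HOST, 587)
server.starttls()                                 # SES requires TLS on port 587
server.login(SES_USER, SES_PASS)
server.sendmail(msg['From'], [msg['To']], msg.as_string())
server.quit()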
Client's response: OK thanks - by the way, since we added all the start up data, ie. uploaded images, the site takes at least two minutes to render the home page - or times out altogether.
Hmmm, I did kinda notice that ... but hey, he hadn't asked me to make the site actually usable speed wise ... until now!

'Why is the site, really, really slow?'


Hmmm, wow, it really is slow; lots of the time it just dies. That PHP cache thingy can't be doing much, so what's the problem?

  1. Let's look at the web site. Wow, it takes 5 minutes for the page to come back ... so this isn't exactly Apache bench territory ... run up a few tabs looking at the home page ... and it starts just returning server timeouts.
  2. So what's happening on the server ... what's killing the box ... top tells us that it's Apache killing us here - with 50-odd processes spawning and sucking up all the memory and CPU.
  3. So we check out our Apache config and it's the usual PHP orientated config of MPM prefork. But what are the values set to ... they are for a great big multiprocessor Cadillac of a machine, whilst ours is more of a smart car in its scale. 
  4. The lesson is that Amazon AMIs are certainly not smart enough to have different image configs for the different hardware specs of the instances they provide. It appears they default their configs to suit the top of the range instances (since I guess those cost the most). If you have a minimal hardware spec box ... you should reconfigure hardware related parameters for the software you run on it ... or potentially it will fail.
  5. Slash all those servers, clients etc. values to the number of servers and processes the box can actually deliver. Slightly trial and error here ... but eventually we got MaxClients 30 instead of 500 etc. and gave it a huge timeout.

    <IfModule prefork.c>
    StartServers       4
    MinSpareServers    2
    MaxSpareServers  10
    ServerLimit      30
    MaxClients       30
    MaxRequestsPerChild  4000
    </IfModule>
  6. Now let's hammer our site again ... hurray, it doesn't completely fall over ... one day it may return a page, but it's still horribly, horribly slow, ie. 3 minutes absolute top speed - and the more home page requests, the slower they get.
  7. So let's get some stats: access the page with the browser's web dev network tools. What's taking the time here? Hmmm, web page a second - not great but acceptable; JS and CSS 0.25 sec - OK. Images, hmmm, images ... for the home page particularly ... 3-6 minutes ... so basically unusable.
  8. So time to bite the bullet. We know Apache can be slower at serving static pages if it's not optimised for it - especially if resources are limited (its processes have a bigger memory overhead); that's why the Apache foundation has another web server, Apache Traffic Server, for that job.
  9. But what's the standard static server (the one that's grabbed half of Apache's share of the web in the last few years)? Yep, nginx.
  10. So let's set up the front end of our site as nginx acting as a reverse proxy to Apache, which just does the PHP work, with nginx serving all images. So modify Apache to serve only on port 8080 on localhost and flip the site over to an nginx front end, with the following nginx conf ...

    server {
            listen       80;
            server_name  moodpin.co.uk;

            location ^~ /(cache|cms|uploads) {
                     root   /var/www/html/;
                     expires 7d;
                     access_log  /var/log/nginx/d-a.direct.log;
            }

            location ~* \.(css|rdf|xml|ico|txt|gif|jpg|png|jpeg)$ {
                     expires 365d;
                     root  /var/www/html/;
                     access_log  /var/log/nginx/d-a.direct.log;
            }

            location / {
                    proxy_pass         http://127.0.0.1:8080/;
            }
    }

    Wow, wow, so take that 3-6 minutes and replace it with 1-2 seconds.
  11. So how many images on the home page - about 150 plus more with scrolling ... so that means we have a site that is on average under 0.5% dynamic code driven content and 99.5% static content/requests per page.
    That is a very very static site - hence the 100 x faster speed!
  12. So there you go, client - take that souped up smart car and go.
  13. Client replies ... ummm, site's down - server proxy timeout error.
  14. Go to Google and check: we have to make sure that nginx has timeout settings greater than Apache's - and the nginx default timeout is 60 seconds.
  15. Make the nginx *_timeout settings 10 minutes ... sounds bad; try the site, and it consistently delivers pages in 3 seconds or so. I assume that the scrolling, request-update nature of the app makes the required timeout much longer than the apparent time Apache is delivering the PHP within?
  16. Show the client again, he's happy.
  17. A few days later ... this bit of the site's not working now.
  18. Check the code, and discover that there is a handful of javascript files used by the system that are not really static - they are PHP templates generating javascript that appear static. Remove js file types from the list of files above in the nginx config. Hurray - generated javascript is served from Apache PHP now. That bit of the site works again.
  19. OK we are done ... don't run Apache bench against the site ... if the client actually gets any users and it can't cope - tell him to upgrade his instance.

    I hope my tales of devops debuggery are useful to you. Bye!
       

    Monday 13 January 2014

    Postgres character set conversion woes

    I had to struggle with sorting out some badly encoded data in Postgresql over the last day or so.
    This proved considerably more hassle than I expected, partly due to my ignorance of the correct syntax to use to convert textual data.

    So on that basis I thought I would share my pain!

    There are a number of issues with character sets in relational databases.

    For a Postgres database the common answers often relate to fixing the encoding of the whole database. So if this is the problem the fixes are often just a matter of setting your client encoding to match that of the database. Or to dump the database then create a new one with the correct encoding set, and reload the dump.

    However there are cases where the encoding is only problematic for certain fields in the database, or where you are creating views via database links between two live databases of different encodings - and so need to fix the encoding on the fly via these views.

    Ideally you have two databases that are both correctly encoded, but just use different encodings.
    If this is the case you can just use convert(data, 'encoding1', 'encoding2') for the relevant fields in the view.

    Then you come to the sort of case I was dealing with. Where the encoding is too mashed for this to work. So where strings have been pushed in as raw byte formats that either don't relate to any proper encoding, or use different encodings in the same field.

    In these cases any attempt to run a convert encoding function will fail, because there is no consistent 'encoding1'

    The symptom of such data is that it fails to display, so it is sometimes difficult to notice until
    the system / programming language that is accessing the data throws encoding errors.
    In my case the pgAdmin client failed to display the whole field, so although the field appeared blank, matches with like '%ok characs%' or length(field) still worked OK. Whilst the normal psql command displayed all the characters except for the problem ones, which were just missing from the string.

    This problem has two solutions:

    1. Repeat the dump and rebuild approach with the correct encoding, but write a custom script in Perl, Python or the like to fix the mashed encoding - assuming that the mashing is not so entirely random as to be unfixable via an automated script*. If it isn't fixable - then you either have to detect and chuck away bad data - or fix things manually!

    2. Fix the problem fields via pl/pgsql, pl/python or pl/perl functions that process them to replace known problem characters in the data.

    I chose to use pl/pgsql since I had a limited set of these problem characters, so didn't need the full functionality of Python or Perl. However in order for pl/pgsql to be able to handle the characters for fixing, I did need to convert the problem fields into raw byte format.

    I found that the conversion back and forth to bytea was not well documented, although the built in functions to do so were relatively straight forward...

    Text to Byte conversion => text_field::bytea

    Byte to Text conversion => encode(text_field::bytea, 'escape')

    So employing these for fixing the freaky characters that were used in place of escaping quotes in my source data ...

    CREATE OR REPLACE FUNCTION encode_utf8(text)
      RETURNS text AS
    $BODY$
    DECLARE
        encoding TEXT;
    BEGIN
        -- single quote as superscript a, underline and Yen characters
        IF position('\xaa'::bytea in $1::TEXT::BYTEA) > 0 THEN
            RETURN encode(overlay($1::TEXT::BYTEA placing E'\x27'::bytea from position('\xaa'::bytea in $1::TEXT::BYTEA) for 1), 'escape');
        END IF;

        -- double quote as capital angstroms character                                                                                                                              
        IF position('\xa5'::bytea in $1::TEXT::BYTEA) > 0 THEN
            RETURN encode(overlay($1::TEXT::BYTEA placing E'\x22'::bytea from position('\xa5'::bytea in $1::TEXT::BYTEA) for 1), 'escape');
        END IF;
        RETURN $1;
    END;
    $BODY$
    LANGUAGE plpgsql;

    Unfortunately the Postgres byte string functions don't include an equivalent to a string replace and the above function assumes just one  problem character per field (my use case), but it could be adapted to loop through each character and fix it via use of overlay.
    So the function above allows for dynamic data fixing of improperly encoded text in views from a legacy database that is still in use - via a database link to a current UTF8 database.

    * For example in Python you could employ chardet to autodetect possible encoding and apply conversions per field (or even per character)
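    A rough sketch of that sort of per-field fixing script might be the following (the table, column and connection settings are made up, and you would want to eyeball the guesses before committing):

    import chardet
    import psycopg2

    conn = psycopg2.connect('dbname=legacy')            # hypothetical connection
    cur = conn.cursor()
    cur.execute("SELECT id, title FROM articles")       # hypothetical table / column

    for row_id, raw in cur.fetchall():
        if isinstance(raw, unicode):
            continue                                    # already decodes cleanly
        guess = chardet.detect(raw)                     # e.g. {'encoding': 'ISO-8859-1', ...}
        fixed = raw.decode(guess['encoding'] or 'latin-1', 'replace')
        cur.execute("UPDATE articles SET title = %s WHERE id = %s", (fixed, row_id))

    conn.commit()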

    Monday 6 January 2014

    WSGI functional benchmark for a Django Survey Application

    I am currently involved in the redevelopment of a survey creation tool, that is used by most of the UK University sector. The application is being redeveloped in Django, creating surveys in Postgresql and writing the completed survey data to Cassandra.
    The core performance bottleneck is likely to be the number of concurrent users who can simultaneously complete surveys. As part of the test tool suite we have created a custom Django command that uses a browser robot to complete any survey with dummy data.
    I realised when commencing this WSGI performance investigation that this functional testing tool could be adapted to act as a load testing tool.
    So rather than just getting general request statistics - I could get much more relevant survey completion load data.

    There are a number of more thorough benchmark posts of raw pages using a wider range of WSGI servers - eg. http://nichol.as/benchmark-of-python-web-servers , however they do not focus so much on the most common ones that  serve Django applications, or address the configuration details of those servers. So though less thorough, I hope this post is also of use.

    The standard configuration for running Django in production is the dual web server set up. In fact Django is pretty much designed to be run that way, with contrib apps such as staticfiles provided to collect images, javascript, etc. for serving separately from the code. This recognizes that in production a web server optimized for serving static files is going to be very different from one optimized for a language runtime environment, even if they are the same web server, eg. Apache. So ideally it would be delivered via two differently configured, separate Apache servers: a fast and light static-configured Apache on high I/O hardware, and a mod_wsgi configured Apache on large memory hardware. In practice Nginx may be easier to configure for static serving - or, for a larger globally used app, perhaps a CDN.
    This is no different from optimising any web application runtime, such as Java Tomcat. Separate static file serving always offers superior performance.

    However these survey completion tests, are not testing static serving, simpler load tests suffice for that purpose. They are testing the WSGI runtime performance for a particular Django application.

    Conclusions

    Well, you can draw your own, for whatever load you require of a given set of hardware resources! You could of course just upgrade your hardware :-)

    However clearly uWSGI is best for consistent performance at high loads, but
    Apache MPM worker outperforms it when the load is not so high. This is likely to be due to the slightly higher memory per thread that Apache uses compared to uWSGI

    Using the default Apache MPM process may be OK, but can make you much more open to DOS attacks, via a nasty performance brick wall. Whilst daemon mode may result in more timeout fails as overloading occurs.

    Gunicorn is all Python so easier to set up for multiple django projects on the same hardware, and performs consistently across different loads, if not quite as fast overall.

    I also tried a couple of other Python web servers, eg. Tornado, but the best I could get was over twice as slow as these three servers. They may well have been configured incorrectly, or be less suited to Django; either way I did not pursue them.

    Oh and what will we use?

    Well probably Apache MPM worker will do the trick for us, with a separate proxy front-end Apache configured for static file serving.
    At least that way it's all the same server that we need to support, and one that we are already well experienced in. Also our static file demands are unlikely to be sufficient to warrant use of Nginx or a CDN.

    I hope that these tests may help you, if not make a decision, maybe at least decide to try out testing a few WSGI servers and configs, for yourself. Let me know if your results differ widely from mine. Especially if there are some vital performance related configuration options I missed!

    Running the functional load test

    To run the survey completion tool with a number of concurrent users, and collect stats on this, I wrapped it up in test scripts for locust.

    So each user completes one each of seven test surveys.
    The locust server can then be handed the number of concurrent users to test with and the test run fired off for 5 minutes, over which time around 3-4000 surveys are completed.
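    The shape of such a locustfile is sketched below - the task body and URLs are hypothetical stand-ins for the real survey completion robot, but it shows how the functional tool gets wrapped for load testing:

    from locust import HttpLocust, TaskSet, task

    class CompleteSurveys(TaskSet):
        """Each simulated user works through the test surveys with dummy data."""

        @task
        def complete_survey(self):
            # Hypothetical stand-in for the browser robot driving a real survey
            self.client.get('/survey/demo/')
            self.client.post('/survey/demo/complete/', {'q1': 'dummy answer'})

    class SurveyUser(HttpLocust):
        task_set = CompleteSurveys
        min_wait = 1000    # milliseconds between tasks
        max_wait = 3000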

    The number of concurrent users tested with was 10, 50 and 100
    With our current traffic, peak loads will probably be around the 20 user mark, with averages of 5 to 10 users. However there are occasional peaks higher than that. Ideally with the new system we will start to see higher traffic, where the 100 user benchmark may be of more relevance.

    Fails

    A number of bad configs for the servers produced a lot of fails, but with a good config these seem to be very low. The 3 x 5 minute test runs for each setup created around 10,000 surveys in total; these are the actual numbers of fails in 10,000 -
    so insignificant perhaps ...

    Apache MPM process = 1
    Apache MPM worker = 0
    Apache Daemon = 4
    uWSGI = 0
    Gunicorn = 1

    (so the fastest two configs both had no fails, because neither ever timed out)

    Configurations

    The test servers were run on the same virtual machine, the spec of which was
    a 4 x Intel 2.4 GHz CPU machine with 4 GB RAM.
    So the optimum number of workers / processes = 2 * CPUs + 1 = 9

    The following configurations were arrived at by tinkering with the settings for each server until optimal speed was achieved for 10 concurrent users.
    Clearly this empirical approach may result in very different settings for your hardware, but at least it gives some idea of the appropriate settings - for a certain CPU / memory spec. server.

    For Apache I found things such as WSGIApplicationGroup being set or not was important, hence its inclusion, with a 20% improvement when on for MPM prefork or daemon mode, or off for MPM worker mode.

    Apache mod_wsgi prefork

    WSGIScriptAlias / /virtualenv/bin/django.wsgi
    WSGIApplicationGroup %{GLOBAL}

    Apache mod_wsgi worker

    WSGIScriptAlias / /virtualenv/bin/django.wsgi

    <IfModule mpm_worker_module>
    #  ThreadLimit    1000
        StartServers         10
        ServerLimit          16
        MaxClients          400
        MinSpareThreads      25
        MaxSpareThreads     375
        ThreadsPerChild      25
        MaxRequestsPerChild   0
    </IfModule>

    Apache mod_wsgi daemon

    WSGIScriptAlias / /virtualenv/bin/django.wsgi
    WSGIApplicationGroup %{GLOBAL}

    WSGIDaemonProcess testwsgi \
        python-path=/virtualenv/lib/python2.7/site-packages \
        user=testwsgi group=testwsgi \
        processes=9 threads=25 umask=0002 \
        home=/usr/local/projects/testwsgi/WWW \
        maximum-requests=0

    WSGIProcessGroup testwsgi

    uWSGI

    uwsgi --http :8000  --wsgi-file wsgi.py --chdir /virtualenv/bin \
                                   --workers=9 --buffer-size=16384 --disable-logging


    Gunicorn

    django-admin.py run_gunicorn -b :8000 --workers=9 --keep-alive=5