Monday, February 13, 2012

When you put too big of a pool in the cloud, it rains

Sometimes it's too easy to forget just how complex an enterprise app server really is. An enormous amount of work has gone into hiding the inner workings of an enterprise container so that developers can get straight to writing their business logic. After using fantastic tools like Spring for a few months it's easy to fall into a rhythm and forget what a servlet even is. Why should we care about the low-level components that make up the frameworks we use in everyday development? Because understanding those parts while you're working with them lets you configure and use them properly; missing that understanding can lead to problems that are hard to track down and a pain to fix. Here's an example from a recent issue where I got to do some experimenting with thread pools:

The Project

Over the past few months I've been developing the backend of a new social networking app. I've been using several of the Spring tools (namely Spring Core 3.0) and I've set up the project to utilize AspectJ. I don't have an enormous amount of experience with aspect-oriented programming, but every time I use it I am amazed at how powerful a tool it is. The main form of communication between the phone and the server is REST calls, some of which compile information from external APIs before responding. One of these services must complete 4 different HTTP requests each time it is called in order to prepare its response. To keep the total request time down I had done my due diligence and set those requests up to run concurrently using the @Async annotation provided by Spring.
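As a rough sketch of that pattern (the class, method, and helper names here are hypothetical stand-ins, not the project's actual code), each external call lives in an @Async method that returns a Future, and the REST handler fans out before joining:

import java.util.concurrent.Future;

import org.springframework.scheduling.annotation.Async;
import org.springframework.scheduling.annotation.AsyncResult;
import org.springframework.stereotype.Service;

@Service
public class ExternalApiClient {

    // Runs on a thread from the Spring-managed executor, not the caller's thread.
    @Async
    public Future<String> fetchFromApi(String url) {
        return new AsyncResult<String>(httpGet(url));
    }

    // Stand-in for a real blocking HTTP client call.
    private String httpGet(String url) {
        return "response from " + url;
    }
}

The handler then kicks off all four requests before blocking on any of them, so the total wait is roughly the slowest call rather than the sum of all four:

// Exception handling omitted; get() throws InterruptedException/ExecutionException.
Future<String> a = client.fetchFromApi(urlA);
Future<String> b = client.fetchFromApi(urlB);
Future<String> c = client.fetchFromApi(urlC);
Future<String> d = client.fetchFromApi(urlD);
String response = buildResponse(a.get(), b.get(), c.get(), d.get());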

The Problem

I originally set up my Spring XML configuration to include pretty much the same executor described in the reference documentation:
<task:executor id="executor"
               pool-size="5-25"
               queue-capacity="100"
               rejection-policy="CALLER_RUNS" />
I thought those numbers sounded pretty reasonable; the pool seemed flexible, the queue seemed like a nice round size, and the caller-runs policy would alert the server health checks that load was high without throwing exceptions and losing tasks. I included the <task:annotation-driven ... /> element so that the executor would be picked up and applied to the methods I annotated with @Async in the code.
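For reference, the wiring element that points the annotation support at that executor looks like this (executor is the standard attribute from the Spring task namespace that references an executor bean by id):

<task:annotation-driven executor="executor" />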

A few weeks later a few of us were using the app at the same time when we noticed it start to act strange. Client requests started either timing out or incorrectly returning no result. I opened the monitoring panel, hit the refresh button in the app 6 times in a row, and watched the VM become unresponsive for over a minute. At that point we hadn't done much load testing, and in the interest of keeping costs down the application had been running on a very small cloud VM with only around 600 MB of RAM and limited processing power. Our next move was to shift to a larger instance, but we didn't see a dramatic increase in performance. The VM stayed responsive while I was SSHed into it, yet requests passing through the server still seemed to choke under the increased load.

I realized at that point that we were sending requests through HTTP connections across the internet to another service that likely involved database access and some computation - at four times the volume of the requests we were receiving. These requests ran concurrently through the executor described above, which I was also using in other parts of the application for tasks that were more computationally intensive and less prone to block for long periods waiting on network traffic. After some research I saw that a number of things were going on:
  1. When you set the queue capacity on a ThreadPoolTaskExecutor, you directly influence what type of queue you get and how the thread count is managed. Tasks are handed to threads until all of the "core" threads are actively executing; after that the executor feeds new tasks into the queue until the queue is full, and only then does it create threads beyond the core count. With 100 spots in the queue, no threads beyond the core 5 were created until all 100 spots were occupied, so the buffer I thought the higher maximum size provided couldn't kick in until 100 tasks were already waiting (see the sketch after this list).
  2. When other kinds of tasks from other parts of the app were already occupying queue spots and active threads, the queue filled much faster. And since the long network tasks held onto threads while blocked, the mixed workload only made the congestion worse under high load.
  3. If the executor was at capacity, with all 25 threads busy and all 100 queue spots full, chances are the little instance I was running was already close to full load. At that point the caller-runs policy let even more threads initiate network traffic, further reducing available bandwidth. That turned it from the early-warning mechanism I wanted into a policy that effectively admitted exactly the traffic I didn't want to increase.
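Point 1 is easy to see with the raw ThreadPoolExecutor underneath Spring's abstraction. Here's a minimal standalone sketch (illustrative sleep-based tasks, not the app's real workload) with the same shape as the configuration above:

import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class QueueBeforeThreadsDemo {
    public static void main(String[] args) throws InterruptedException {
        // Same shape as the Spring config: 5 core threads, 25 max, 100-slot queue.
        ThreadPoolExecutor executor = new ThreadPoolExecutor(
                5, 25, 60, TimeUnit.SECONDS, new LinkedBlockingQueue<Runnable>(100));

        // Submit 50 long-running tasks: well past the core size,
        // but not nearly enough to fill the 100-slot queue.
        for (int i = 0; i < 50; i++) {
            executor.execute(new Runnable() {
                public void run() {
                    try {
                        Thread.sleep(10000);
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            });
        }

        Thread.sleep(500); // give the pool a moment to settle
        // Prints 5 and 45: no threads beyond the core 5 are created
        // until the queue itself is full.
        System.out.println("threads: " + executor.getPoolSize());
        System.out.println("queued:  " + executor.getQueue().size());
        executor.shutdownNow();
    }
}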

The Fix

I realized that I had two categories of tasks being submitted for asynchronous execution: short, processor-intensive tasks and long, blocking network tasks. Since one was holding up the other and the two had different optimal configurations, I defined two separate executors:

burstExecutor
  - Pool sizes: different core and maximum sizes, chosen based on the number of available processors
  - Queue: a SynchronousQueue, to ensure the pool grows when necessary and the caller-runs policy engages under high load
  - Rejected-task policy: caller-runs, as a negative feedback mechanism; these tasks are processor-limited, not bandwidth-limited, and we don't anticipate one starting a network call, so letting the caller execute it won't further reduce available bandwidth

longNetworkTaskExecutor
  - Pool sizes: identical core and maximum sizes, so that the amount of in-progress network traffic is capped
  - Queue: an unbounded queue, to let tasks accumulate under heavy load without being rejected
  - Rejected-task policy: none needed, since the unbounded queue never rejects a task
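In the task namespace, that setup comes out roughly like the following (the pool sizes are illustrative rather than the tuned values we ended up with; setting queue-capacity to 0 is what gives you a SynchronousQueue, and omitting queue-capacity leaves the queue effectively unbounded):

<task:executor id="burstExecutor"
               pool-size="4-16"
               queue-capacity="0"
               rejection-policy="CALLER_RUNS" />

<task:executor id="longNetworkTaskExecutor"
               pool-size="10" />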

I then needed a way to designate tasks so that they would be submitted to the correct executor. I defined two annotations: @AsyncBurstTask and @AsyncLongNetworkTask. Once that was done I created two sub-aspects of Spring's AbstractAsyncExecutionAspect to identify the appropriately annotated methods. AbstractAsyncExecutionAspect handles the submission of the task to the executor, provided that the sub-aspect defines which join points to apply the advice to. Lastly I needed to inject the executors into the aspect instances at runtime (the executor property is part of AbstractAsyncExecutionAspect). To do this I added bean definitions similar to this one to my Spring configuration XML:
<bean class="com.solsticeconsulting.async.AsyncBurstAnnotationAspect"
      factory-method="aspectOf">
    <property name="executor" ref="burstExecutor" />
</bean>
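For a sense of what the aspect side of that wiring looks like, here is a sketch in AspectJ syntax. The pointcut shape mirrors Spring's own AnnotationAsyncExecutionAspect, but the details of the abstract aspect vary between Spring versions, so treat this as an approximation rather than the project's actual source:

// AsyncBurstTask.java - marker annotation for short, processor-intensive tasks
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

@Target(ElementType.METHOD)
@Retention(RetentionPolicy.RUNTIME)
public @interface AsyncBurstTask {
}

// AsyncBurstAnnotationAspect.aj - routes @AsyncBurstTask methods to burstExecutor
import java.util.concurrent.Future;

import org.springframework.scheduling.aspectj.AbstractAsyncExecutionAspect;

public aspect AsyncBurstAnnotationAspect extends AbstractAsyncExecutionAspect {

    // Advise any method marked @AsyncBurstTask that returns void or a Future.
    public pointcut asyncMethod() :
        execution(@AsyncBurstTask (void || Future+) *(..));
}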
And that's it! The executors handle a heavy load much more smoothly than the previous configuration did. These changes could have been made through other mechanisms not involving AspectJ, but with the alternatives I tried, performance took a large hit from the heavier runtime processing required. It took quite a bit of testing to find the optimal sizes for the thread pools, but even under intense load I never saw the bottleneck that appeared with the single-executor setup.

With that problem solved I can go back to playing with my high-level tools, where my regular issues seem so much simpler.