Showing posts with label Concurrent. Show all posts
Showing posts with label Concurrent. Show all posts

Thursday 22 September 2011

Single Writer Principle

When trying to build a highly scalable system the single biggest limitation on scalability is having multiple writers contend for any item of data or resource.  Sure, algorithms can be bad, but let’s assume they have a reasonable Big O notation so we'll focus on the scalability limitations of the systems design. 

I keep seeing people just accept having multiple writers as the norm.  There is a lot of research in computer science for managing this contention that boils down to 2 basic approaches.  One is to provide mutual exclusion to the contended resource while the mutation takes place; the other is to take an optimistic strategy and swap in the changes if the underlying resource has not changed while you created the new copy. 

Mutual Exclusion

Mutual exclusion is the means by which only one writer can have access to a protected resource at a time, and is usually implemented with a locking strategy.  Locking strategies require an arbitrator, usually the operating system kernel, to get involved when the contention occurs to decide who gains access and in what order.  This can be a very expensive process often requiring many more CPU cycles than the actual transaction to be applied to the business logic would use.  Those waiting to enter the critical section, in advance of performing the mutation must queue, and this queuing effect (Little's Law) causes latency to become unpredictable and ultimately restricts throughput.

Optimistic Concurrency Control

Optimistic strategies involve taking a copy of the data, modifying it, then copying back the changes if data has not mutated in the meantime.  If a change has happened in the meantime you repeat the process until successful.  This repeating of the process increases with contention and therefore causes a queuing effect just like with mutual exclusion.  If you work with a source code control system, such as Subversion or CVS, then you are using this algorithm every day.  Optimistic strategies can work with data but do not work so well with resources such as hardware because you cannot take a copy of the hardware!  The ability to perform the changes atomically to data is made possible by CAS instructions offered by the hardware.

Most locking strategies are composed from optimistic strategies for changing the lock state or mutual exclusion primitive.

Managing Contention vs. Doing Real Work

CPUs can typically process one or more instructions per cycle.  For example, modern Intel CPU cores each have 6 execution units that can be doing a combination of arithmetic, branch logic, word manipulation and memory loads/stores in parallel.  If while doing work the CPU core incurs a cache miss, and has to go to main memory, it will stall for hundreds of cycles until the result of that memory request returns.  To try and improve things the CPU will make some speculative guesses as to what a memory request will return to continue processing.  If a second miss occurs the CPU will no longer speculate and simply wait for the memory request to return because it cannot typically keep the state for speculative execution beyond 2 cache misses.  Managing cache misses is the single largest limitation to scaling the performance of our current generation of CPUs.

Now what does this have to do with managing contention?  Well if two or more threads are using locks to provide mutual exclusion, at best they will be going to the L3 cache, or over a socket interconnect, to access share state of the lock using CAS operations.  These lock/CAS instructions cost 10s of cycles in the best case when un-contended, plus they cause out-of-order execution for the CPU to be suspended and load/store buffers to be flushed.  At worst, collisions occur and the kernel will need to get involved and put one or more of the threads to sleep until the lock is released.  This rescheduling of the blocked thread will result in cache pollution.  The situation can be even worse when the thread is re-scheduled on another core with a cold cache resulting in many cache misses. 

For highly contended data it is very easy to get into a situation whereby the system spends significantly more time managing contention than doing real work.  The table below gives an idea of basic costs for managing contention when the program state is very small and easy to reload from the L2/L3 cache, never mind main memory. 

MethodTime (ms)
One Thread300
One Thread with Memory Barrier4,700
One Thread with CAS5,700
Two Threads with CAS18,000
One Thread with Lock10,000
Two Threads with Lock118,000

This table illustrates the costs of incrementing a 64-bit counter 500 million times using a variety of techniques on a 2.4Ghz Westmere processor.   I can hear people coming back with “but this is a trivial example and real-world applications are not that contended”.  This is true but remember real-world applications have way more state, and what do you think happens to all that state which is warm in cache when the context switch occurs???  By measuring the basic cost of contention it is possible to extrapolate the scalability limits of a system which has contention points.  As multi-core becomes ever more significant another approach is required.  My last post illustrates the micro level effects of CAS operations on modern CPUs, whereby Sandybridge can be worse for CAS and locks.

Single Writer Designs

Now, what if you could design a system whereby any item of data, or resource, is only mutated by a single writer/thread?  It is actually easier than you think in my experience.  It is OK if multiple threads, or other execution contexts, read the same data.  CPUs can broadcast read only copies of data to other cores via the cache coherency sub-system.  This has a cost but it scales very well.

If you have a system that can honour this single writer principle then each execution context can spend all its time and resources processing the logic for its purpose, and not be wasting cycles and resource on dealing with the contention problem.  You can also scale up without limitation until the hardware is saturated.  There is also a really nice benefit in that when working on architectures, such as x86/x64, where at a hardware level they have a memory model, whereby load/store memory operations have preserved order, thus memory barriers are not required if you adhere strictly to the single writer principle.  On x86/x64 "loads can be re-ordered with older stores" according to the memory model so memory barriers are required when multiple threads mutate the same data across cores.  The single writer principle avoids this issue because it never has to deal with writing the latest version of a data item that may have been written by another thread and currently in the store buffer of another core.

So how can we drive towards single writer designs?  I’ve found it is a very natural thing.  Consider how humans, or any other autonomous creatures of nature, operate with their model of the world.  We all have our own model of the world contained in our own heads, i.e. We have a copy of the world state for our own use.  We mutate the state in our heads based on inputs (events/messages) we receive via our senses.  As we process these inputs and apply them to our model we may take action that produces outputs, which others can take as their own inputs.  None of us reach directly into each other’s heads and mess with the neurons.  If we did this it would be a serious breach of encapsulation!  Originally, Object Oriented (OO) design was all about message passing, and somehow along the way we bastardised the message passing to be method calls and even allowed direct field manipulation – Yuk!  Who's bright idea was it to allow public access to fields of an object?  You deserve your own special hell. 

At university I studied transputers and interesting languages like Occam.  I thought very elegant designs appeared by having the nodes collaborate via message passing rather than mutating shared state.  I’m sure some of this has inspired the Disruptor.  My experience with the Disruptor has shown that is it possible to build systems with one or more orders of magnitude better throughput than locking or contended state based approaches.  It also gives much more predictable latency that stays constant until the hardware is saturated rather than the traditional J-curve latency profile.

It is interesting to see the emergence of numerous approaches that lend themselves to single writer solutions such as Node.js, Erlang, Actor patterns, and SEDA to name a few.  Unfortunately most use queue based implementations underneath, which breaks the single writer principle, whereas the Disruptor strives to separate the concerns so that the single writer principle can be preserved for the common cases.

Now I’m not saying locks and optimistic strategies are bad and should not be used.  They are excellent for many problems.  For example, bootstrapping a concurrent system or making major state stages in configuration or reference data.  However if the main flow of transactions act on contended data, and locks or optimistic strategies have to be employed, then the scalability is fundamentally limited. 

The Principle at Scale

This principle works at all levels of scale.  Mandelbrot got this so right.  CPU cores are just nodes of execution and the cache system provides message passing for communication.  The same patterns apply if the processing node is a server and the communication system is a local network.  If a service, in SOA architecture parlance, is the only service that can write to its data store it can be made to scale and perform much better.  Let’s say that underlying data is stored in a database and other services can go directly to that data, without sending a message to the service that owns the data, then the data is contended and requires the database to manage the contention and coherence of that data.  This prevents the service from caching copies of the data for faster response to the clients and restricts how the data can be sharded.  Encapsulation has just been broken at a more macro level when multiple different services write to the same data store.

Summary

If a system is decomposed into components that keep their own relevant state model, without a central shared model, and all communication is achieved via message passing then you have a system without contention naturally.  This type of system obeys the single writer principle if the messaging passing sub-system is not implemented as queues.  If you cannot move straight to a model like this, but are finding scalability issues related to contention, then start by asking the question, “How do I change this code to preserve the Single Writer Principle and thus avoid the contention?”

The Single Writer Principle is that for any item of data, or resource, that item of data should be owned by a single execution context for all mutations.

Sunday 11 September 2011

Adventures with AtomicLong

Sequencing events between threads is a common operation for many multi-threaded algorithms.  These sequences could be used for assigning identity to orders, trades, transactions, messages, events, etc.  Within the Disruptor we use a monotonic sequence for all events which is implemented as AtomicLong incrementAndGet for the multi-threaded publishing scenario.

While working on the latest version of the Disruptor I made some changes which I was convinced would improve performance, however the results surprised me.  I had removed some potentially megamorphic method calls and the performance got worse rather than better.  After a lot of investigation, I discovered that the megamorphic method calls were hiding a performance issue with the latest Intel Sandybridge processors.  With the megamorphic calls out of the way, the contention on the atomic sequence generation increased exposing the issue.  I've also observed this performance issue with other Java concurrent structures such as ArrayBlockingQueue.

I’ve been running various benchmarks on Sandybridge and have so far been impressed with performance improvements over Nehalem, especially for memory intensive applications due to the changes in its front-end.  However with this sequencing benchmark, I discovered that Sandybridge has taken a major step backward in performance with regard to atomic instructions.

Atomic instructions enable read-modify-write actions to be combined into an atomic operation.  A good example is incrementing a counter.  To complete the increment operation a thread must read the current value, increment it, and then write back the results.  In a multi-threaded environment these distinct operations could interleave with other threads doing the same with corrupt results as a consequence.  The normal way to avoid this interleaving is to take out a lock for mutual exclusion while performing the steps.  Locks are very expensive and often require kernel arbitration between threads.  Modern CPUs provide a number of atomic instructions which allow operations such as atomically incrementing a counter, or the ability to conditional set a pointer reference if the value is still as expected.  These operations are commonly referred to as CAS (Compare And Swap) instructions.  A good way to think of these CAS instructions is like optimistic locks, similar to what you experience when using a version control system like Subversion or CVS.  You try to make a change and if the version is what you expect then you succeed, otherwise the action aborts.

On x86/x64 these instructions are known as “lock” instructions.  The "lock" name comes from how a processor, after setting its lock signal, would lock the front-side/memory bus (FSB) for serialising memory access while the three steps of the operation took place atomically.  On more recent processors the lock instruction is simply implemented by getting an exclusive lock on the cache-line for modification.

These instructions are the basic building blocks used for implementing higher-level locks and semaphores.  This is, as will be explained shorty, why I've seen performing issues on Sandybridge for ArrayBlockingQueue in some of the Disruptor comparative performance tests.

Back to my benchmark.  The test was spending significantly more time in AtomicLong.incrementAndGet() than I had previously observed.  Initially, I suspected an issue with JDK 1.6.0_27 which I had just installed.  I ran the following test with various JVMs, including 1.7.0, and kept getting the same results.  I then booted different operating systems (Ubuntu, Fedora, Windows 7 - all 64-bit), again the same results.  This lead me to write an isolated test which I ran on Nehalem (2.8 GHz Core i7 860) and Sandybridge (2.2Ghz Core i7-2720QM).

import java.util.concurrent.atomic.AtomicLong;

public final class TestAtomicIncrement
    implements Runnable
{
    public static final long COUNT = 500L * 1000L * 1000L;
    public static final AtomicLong counter = new AtomicLong(0L);

    public static void main(final String[] args) throws Exception
    {
        final int numThreads = Integer.parseInt(args[0]);
        final long start = System.nanoTime();
        runTest(numThreads);
        System.out.println("duration = " + (System.nanoTime() - start));
        System.out.println("counter = " + counter);
    }

    private static void runTest(final int numThreads)
        throws InterruptedException
    {
        Thread[] threads = new Thread[numThreads];

        for (int i = 0; i < threads.length; i++)
        {
            threads[i] = new Thread(new TestAtomicIncrement());
        }

        for (Thread t : threads)
        {
            t.start();
        }

        for (Thread t : threads)
        {
            t.join();
        }
    }

    public void run()
    {
        long i = 0L;
        while (i < COUNT)
        {
            i = counter.incrementAndGet();
        }
    }
}

Figure 1.

After running this test on 4 different Sandybridge processors with a range of clock speeds, I concluded that using LOCK CMPXCHG, under contention with increasing numbers of threads, is much less scalable than the previous Nehalem generation of processors.  Figure 1. above charts the results in nanoseconds duration to complete 500 million increments of a counter with increasing thread count.  Less is better.

I confirmed the JVM was generating the correct instructions for the CAS loop by getting Hotspot to print the assembler it generated.  I also confirmed that Hotspot generated identical assembler instructions for both Nehalem and Sandybridge.

I then decided to investigate further and write the following C++ program to test the relevant lock instructions to compare Nehalem and Sandybridge.  I know from using “objdump -d” on the binary that the GNU Atomic Builtins generate the lock instructions for ADD, XADD, and CMPXCHG, for the respectively named functions below. 
#include <time.h>
#include <pthread.h>
#include <stdlib.h>
#include <iostream>

typedef unsigned long long uint64;
const uint64 COUNT = 500LL * 1000LL * 1000LL;
volatile uint64 counter = 0;

void* run_add(void* numThreads)
{
    register uint64 value = (COUNT / *((int*)numThreads)) + 1;
    while (--value != 0)
    {
        __sync_add_and_fetch(&counter, 1);
    }
}

void* run_xadd(void*)
{
    register uint64 value = counter;
    while (value < COUNT)
    {
        value = __sync_add_and_fetch(&counter, 1);
    }
}

void* run_cas(void*)
{
    register uint64 value = 0;
    while (value < COUNT)
    {
        do
        {
            value = counter;
        }
        while (!__sync_bool_compare_and_swap(&counter, value, value + 1));
    }
}

int main (int argc, char* argv[])
{
    const int NUM_THREADS = atoi(argv[1]);

    pthread_t threads[NUM_THREADS];
    void* status;
    timespec ts_start;
    timespec ts_finish;
    clock_gettime(CLOCK_MONOTONIC, &ts_start);

    for (int i = 0; i < NUM_THREADS; i++)
    {
        pthread_create(&threads[i], NULL, run_add, (void*)&NUM_THREADS);
    }

    for (int i = 0; i < NUM_THREADS; i++)
    {
        pthread_join(threads[i], &status);
    }

    clock_gettime(CLOCK_MONOTONIC, &ts_finish);

    uint64 start = (ts_start.tv_sec * 1000000000LL) + ts_start.tv_nsec;
    uint64 finish = (ts_finish.tv_sec * 1000000000LL) + ts_finish.tv_nsec;
    uint64 duration = finish - start;

    std::cout << "threads = "  << NUM_THREADS << std::endl;
    std::cout << "duration = " << duration << std::endl;
    std::cout << "counter = "  << counter << std::endl;

    return 0;
}
Figure 2.

It is clear from Figure 2. that Nehalem performs nearly an order of magnitude better for atomic operations as contention increases with threads.  I found LOCK ADD and LOCK XADD to be similar so I've only charted XADD for clarity.  The CAS operations for C++ and Java are comparable.

It is also very interesting how XADD greatly outperforms CAS and gives a nice scalable profile.  For 3 threads and above, XADD does not degrade further and simply performs at the rate at which the processor can keep the caches coherent.  Nehalem and Sandybridge level out respectively at ~100m and ~20m XADD operations per second for 3+ concurrent threads, whereas CAS continues to degrade with increasing thread count because of contention.  Naturally, performance degrades when QPI links are involved for a multi-socket scenario.  Oracle have now accepted that not supporting XADD is a bug and will hopefully fix it soon for the JVM. 

As to the performance I’ve observed with Sandybridge, it would be great if others could confirm my findings so we can all feedback to Intel and have this addressed.  I've not been able to get my hands on a server class system with Sandybridge.  I can confirm that for the "tick" to Westmere, the performance is similar to Nehalem and not an issue.  The "tock" to Sandybridge seems to introduce the issue.

Update: After discussions with Intel I wrote the following blog entry.

Saturday 27 August 2011

Disruptor 2.0 Released

Significantly improved performance and a cleaner API are the key takeaways for the Disruptor 2.0 concurrent programming framework for Java.  This release is the result of all the great feedback we have received from the community.  Feedback is very welcome and really improves the end product so please keep it coming.

You can find the Disruptor project here, plus we have a wiki with links to detailed blogs describing how things work.

Naming & API

Over the lifetime of the Disruptor naming has been a challenge.  The funny thing is that with the 2.0 release we have come almost full circle.  Originally we considered the Disruptor as an event processing framework that often got used as a queue replacement.  To make it understandable to queue users we adopted the nomenclature of producers and consumers.  However the consumers are not true consumers.  With this release the consensus is to return to the event processing roots and adopt the following naming changes.

Producer -> Publisher
Events are claimed in strict sequence and published to the RingBuffer.

Entry -> Event
Events represent the currency of data exchange through the dependency graph of EventProcessors.

Consumer -> EventProcessor
Events are processed by EventProcessors.  The processing of an event can be read only, but can also involve mutations on which other EventProcessors depend.

ConsumerBarrier -> DependencyBarrier
Complex graphs of dependent EventProcessors can be constructed for the processing of an Event.  The DependencyBarriers are assembled to represent the dependency graph.  This topic is the real value of the Disruptor and often misunderstood.  A fun example can be seen playing FizzBuzz in our performance tests.

The ProducerBarrier was always a one-to-one relationship with the RingBuffer so for ease of use its behaviour has been merged into the RingBuffer.  This allows direct publishing into the RingBuffer.

DSL Wizard

The most complex part of using the Disruptor is the setting up of the dependency graph of EventProcessors.   To simplify this for the most common cases we have integrated the DisruptorWizard project which provides a DSL as a fluent API for assembling the graph and assigning threads.

Performance

Significant performance tuning effort has gone into this release.  This effort has resulted in a ~2-3X improvement in throughput depending on CPU architecture.  For most use cases it is now an order of magnitude better than queue based approaches. On Sandybridge processors I've seen over 50 million events processed per second.

Sequence tracking has been completely rewritten to reduce the usage of hardware memory barriers, indirection layers, and megamorphic method calls resulting in a much more data and instruction cache friendly design.  New techniques have been employed to prevent false sharing because the previous ones got optimised out by the Oracle Java 7 JVM.

The one area not seeing a significant performance increase is the sequencer pattern.  The Disruptor is still much faster than queue based approaches for this pattern but a limitation of Java hits us hard here.   Java on x86/x64 is using LOCK CMPXCHG for CAS operations to implement the AtomicLong incrementAndGet() method which, based on my measurements, is ~2-10X slower than using LOCK XADD as contention increases.  Hopefully Oracle will see the error of SUNs ways on this and embrace x86/x64 to take advantage of such instructions.  Dave Dice at Oracle has blogged on the subject so I live in hope.

Memory Barriers


Of special note for this release is the elimination of hardware memory barriers on x86/x64 for Sequence tracking.  The beauty in the Disruptor design is that on CPU architectures that have a memory model [1] whereby:

  • loads are not reordered with older loads”, and
  • stores are not reordered with older stores”;

it is then possible to take advantage of the semantics provided by AtomicLong to avoid the use of the Java volatile keyword, and thus hardware fences on x86/x64.  The one sticky rule for concurrent algorithms, such as Dekker [2] and Peterson [3] locks, on x86/x64 is “loads can be re-ordered with older stores”.  This is not an issue given the design of the Disruptor.  The issue relates to the snooping of CPU local store buffers for older writes.  I’m likely to blog in more detail about why this is the case at a later date.  The code should be safe on other CPU architectures if the JVM implementers get the semantics of AtomicLong and Unsafe correct, however your mileage may vary for performance on other architectures compared to x64.

Roadmap

With this latest release it is becoming increasingly obvious how sensitive some CPU architectures are to processor affinity for threads.  When an EventProcessor gets rescheduled on a different core, after its time-slice is exhausted or it yields, the resulting cache pollution really hits performance.  For those who require more extreme and predictable performance I plan to release an Executor service with the Disruptor to allow the pinning of threads to CPU cores.

I'm also thinking of adding a progressive back off strategy for waiting EventProcessors as a WaitStrategy.  This strategy would first busy spin, then yield, then eventually sleep in millisecond periods to conserve CPU resource for those applications that burst for a while then go quiet.

  1. Memory Model: See Section 8.2 of http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-vol-3a-part-1-manual.html
  2. Dekker algorithm: http://en.wikipedia.org/wiki/Dekker%27s_algorithm
  3. Peterson Algorithm: http://en.wikipedia.org/wiki/Peterson%27s_algorithm

Saturday 13 August 2011

False Sharing && Java 7

In my previous post on False Sharing I suggested it can be avoided by padding the cache line with unused long fields.  It seems Java 7 got clever and eliminated or re-ordered the unused fields, thus re-introducing false sharing.  I've experimented with a number of techniques on different platforms and found the following code to be the most reliable.
import java.util.concurrent.atomic.AtomicLong;

public final class FalseSharing
    implements Runnable
{
    public final static int NUM_THREADS = 4; // change
    public final static long ITERATIONS = 500L * 1000L * 1000L;
    private final int arrayIndex;

    private static PaddedAtomicLong[] longs = new PaddedAtomicLong[NUM_THREADS];
    static
    {
        for (int i = 0; i < longs.length; i++)
        {
            longs[i] = new PaddedAtomicLong();
        }
    }

    public FalseSharing(final int arrayIndex)
    {
        this.arrayIndex = arrayIndex;
    }

    public static void main(final String[] args) throws Exception
    {
        final long start = System.nanoTime();
        runTest();
        System.out.println("duration = " + (System.nanoTime() - start));
    }

    private static void runTest() throws InterruptedException
    {
        Thread[] threads = new Thread[NUM_THREADS];

        for (int i = 0; i < threads.length; i++)
        {
            threads[i] = new Thread(new FalseSharing(i));
        }

        for (Thread t : threads)
        {
            t.start();
        }

        for (Thread t : threads)
        {
            t.join();
        }
    }

    public void run()
    {
        long i = ITERATIONS + 1;
        while (0 != --i)
        {
            longs[arrayIndex].set(i);
        }
    }

    public static long sumPaddingToPreventOptimisation(final int index)
    {
        PaddedAtomicLong v = longs[index];
        return v.p1 + v.p2 + v.p3 + v.p4 + v.p5 + v.p6;
    }

    public static class PaddedAtomicLong extends AtomicLong
    {
        public volatile long p1, p2, p3, p4, p5, p6 = 7L;
    }
}

With this code I get similar performance results to those stated in the previous False Sharing article.  The padding in PaddedAtomicLong above can be commented out to see the false sharing effect.

I think we should all lobby the powers that be inside Oracle to have intrinsics added to the language so we can have cache line aligned and padded atomic classes.  This and some other low-level changes would help make Java a real concurrent programming language.  We keep hearing them say multi-core is coming.   I say it is here and Java needs to catch up.

Tuesday 9 August 2011

Inter Thread Latency

Message rates between threads are fundamentally determined by the latency of memory exchange between CPU cores.   The minimum unit of transfer will be a cache line exchanged via shared caches or socket interconnects.  In a previous article I explained Memory Barriers and why they are important to concurrent programming between threads.  These are the instructions that cause a CPU to make memory visible to other cores in an ordered and timely manner.

Lately I’ve been asked a lot about how much faster the Disruptor would be if C++ was used instead of Java.  For sure C++ would give more control for memory alignment and potential access to underlying CPU instructions such as memory barriers and lock instructions.  In this article I’ll directly compare C++ and Java to measure the cost of signalling a change between threads.

For the test we'll use two counters each updated by their own thread.  A simple ping-pong algorithm will be used to signal from one to the other and back again.  The exchange will be repeated millions of times to measure the average latency between cores.  This measurement will give us the latency of exchanging a cache line between cores in a serial manner.

For Java we’ll use volatile counters which the JVM will kindly insert a lock instruction for the update giving us an effective memory barrier.
public final class InterThreadLatency
    implements Runnable
{
    public static final long ITERATIONS = 500L * 1000L * 1000L;

    public static volatile long s1;
    public static volatile long s2;

    public static void main(final String[] args)
    {
        Thread t = new Thread(new InterThreadLatency());
        t.setDaemon(true);
        t.start();

        long start = System.nanoTime();

        long value = s1;
        while (s1 < ITERATIONS)
        {
            while (s2 != value)
            {
                // busy spin
            }
            value = ++s1;
        }

        long duration = System.nanoTime() - start;

        System.out.println("duration = " + duration);
        System.out.println("ns per op = " + duration / (ITERATIONS * 2));
        System.out.println("op/sec = " +  
            (ITERATIONS * 2L * 1000L * 1000L * 1000L) / duration);
        System.out.println("s1 = " + s1 + ", s2 = " + s2);
    }

    public void run()
    {
        long value = s2;
        while (true)
        {
            while (value == s1)
            {
                // busy spin
            }
            value = ++s2;
        }
    }
}

For C++ we’ll use the GNU Atomic Builtins which give us a similar lock instruction insertion to that which the JVM uses.
#include <time.h>
#include <pthread.h>
#include <stdio.h>

typedef unsigned long long uint64;
const uint64 ITERATIONS = 500LL * 1000LL * 1000LL;

volatile uint64 s1 = 0;
volatile uint64 s2 = 0;

void* run(void*)
{
    register uint64 value = s2;
    while (true)
    {
        while (value == s1)
        {
            // busy spin
        }
        value = __sync_add_and_fetch(&s2, 1);
    }
}

int main (int argc, char *argv[])
{
    pthread_t threads[1];
    pthread_create(&threads[0], NULL, run, NULL);

    timespec ts_start;
    timespec ts_finish;
    clock_gettime(CLOCK_MONOTONIC, &ts_start);

    register uint64 value = s1;
    while (s1 < ITERATIONS)
    {
        while (s2 != value)
        {
            // busy spin
        }
        value = __sync_add_and_fetch(&s1, 1);
    }

    clock_gettime(CLOCK_MONOTONIC, &ts_finish);

    uint64 start = (ts_start.tv_sec * 1000000000LL) + ts_start.tv_nsec;
    uint64 finish = (ts_finish.tv_sec * 1000000000LL) + ts_finish.tv_nsec;
    uint64 duration = finish - start;

    printf("duration = %lld\n", duration);
    printf("ns per op = %lld\n", (duration / (ITERATIONS * 2)));
    printf("op/sec = %lld\n",  
        ((ITERATIONS * 2L * 1000L * 1000L * 1000L) / duration));
    printf("s1 = %lld, s2 = %lld\n", s1, s2);

    return 0;
}
Results

$ taskset -c 2,4 /opt/jdk1.7.0/bin/java InterThreadLatency
duration = 50790271150
ns per op = 50
op/sec = 19,688,810
s1 = 500000000, s2 = 500000000

$ g++ -O3 -lpthread -lrt -o itl itl.cpp
$ taskset -c 2,4 ./itl
duration = 45087955393
ns per op = 45
op/sec = 22,178,872
s1 = 500000000, s2 = 500000000

The C++ version is slightly faster on my Intel Sandybridge laptop.  So what does this tell us?  Well, that the latency between 2 cores on a 2.2 GHz machine is ~45ns and that you can exchange 22m messages per second in a serial fashion.  On an Intel CPU this is fundamentally the cost of the lock instruction enforcing total order and forcing the store buffer and write combining buffers to drain, followed by the resulting cache coherency traffic between the cores.   Note that each core has a 96GB/s port onto the L3 cache ring bus, yet 22m * 64-bytes is only 1.4 GB/s.  This is because we have measured latency and not throughput.  We could easily fit some nice fat messages between those memory barriers as part of the exchange if the data has been written before the lock instruction was executed.

So what does this all mean for the Disruptor?  Basically, the latency of the Disruptor is about as low as we can get from Java.  It would be possible to get a ~10% latency improvement by moving to C++.  I’d expect a similar improvement in throughput for C++.  The main win with C++ would be the control, and therefore, the predictability that comes with it if used correctly.  The JVM gives us nice safety features like garbage collection in complex applications but we pay a little for that with the extra instructions it inserts that can be seen if you get Hotspot to dump the assembler instructions it is generating.

How does the Disruptor achieve more than 25m messages per second I hear you say???   Well that is one of the neat parts of its design.  The “waitFor” semantics on the SequenceBarrier enables a very efficient form of batching, which allows the BatchEventProcessor to process a series of events that occurred since it last checked in with the RingBuffer, all without incurring a memory barrier.  For real world applications this batching effect is really significant.  For micro benchmarks it only makes the results more random,  especially when there is little work done other than accepting the message.

Conclusion

So when processing events in series, the measurements tell us that the current generation of processors can do between 20-30 million exchanges per second at a latency less than 50ns.  The Disruptor design allows us to get greater throughput without explicit batching on the publisher side.  In addition the Disruptor has an explicit batching API on the publisher side that can give over 100 million messages per second.