Tuesday 22 November 2011

Biased Locking, OSR, and Benchmarking Fun

After my last post on Java Lock Implementations, I got a lot of good feedback about my results and micro-benchmark design approach.  As a result I now understand JVM warmup, On Stack Replacement (OSR) and Biased Locking somewhat better than before.  Special thanks to Dave Dice from Oracle, and Cliff Click & Gil Tene from Azul, for their very useful feedback.

In the last post I concluded, based on my experiments, that biased locking was no longer necessary on modern CPUs.  While this conclusion is understandable given the data gathered in the experiment, it was not valid because the experiment did not take account of some JVM warm up behaviour that I was unaware of.

In this post I will re-run the experiment taking into account the feedback and present some new results.  I shall also expand on the changes I've made to the test and why it is important to consider the JVM warm-up behaviour when writing micro-benchmarks, or even very lean Java applications with quick start up time.

On Stack Replacement (OSR)

Java virtual machines will compile code to achieve greater performance based on runtime profiling.  Some VMs run an interpreter for the majority of code and replace hot areas with compiled code following the 80/20 rule.  Other VMs compile all code simply at first then replace the simple code with more optimised code based on profiling.  Oracle Hotspot and Azul are examples of the first type and Oracle JRockit is an example of the second.

Oracle Hotspot will count invocations of a method return plus branch backs for loops in that method, and if this exceeds 10K in server mode the method will be compiled.  The compiled code on normal JIT'ing can be used when the method is next called.  However if a loop is still iterating it may make sense to replace the method before the loop completes, especially if it has many iterations to go.  OSR is the means by which a method gets replaced with a compiled version part way through iterating a loop.

I was under the impression that normal JIT'ing and OSR would result in similar code.  Cliff Click pointed out that it is much harder for a runtime to optimise a loop part way through, and especially difficult if nested.  For example, bounds checking within the loop may not be possible to eliminate. Cliff will blog in more detail on this shortly.

What this means is that you are likely to get better optimised code by doing a small number of shorter warm ups than a single large one.  You can see in the code below how I do 10 shorter runs in a loop before the main large run compared to the last article where I did a single large warm-up run.

Biased Locking

Dave Dice pointed out that Hotspot does not enable objects for biased locking in the first few seconds (4s at present) of JVM startup. This is because some benchmarks, and NetBeans, have a lot of thread contention on start up and the revocation cost is significant.

All objects by default are created with biased locking enabled in Oracle Hotspot after the first few seconds of start-up delay, and can be configured with -XX:BiasedLockingStartupDelay=0.

This point, combined with knowing more about OSR, is important for micro-benchmarks.  It is also important to be aware of these points if you have a lean Java application that starts in a few seconds.

The Code
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;
import java.util.concurrent.CyclicBarrier;

import static java.lang.System.out;

public final class TestLocks implements Runnable
{
    public enum LockType {JVM, JUC}
    public static LockType lockType;

    public static final long WARMUP_ITERATIONS = 100L * 1000L;
    public static final long ITERATIONS = 500L * 1000L * 1000L;
    public static long counter = 0L;

    public static final Object jvmLock = new Object();
    public static final Lock jucLock = new ReentrantLock();
    private static int numThreads;

    private final long iterationLimit;
    private final CyclicBarrier barrier;

    public TestLocks(final CyclicBarrier barrier, final long iterationLimit)
    {
        this.barrier = barrier;
        this.iterationLimit = iterationLimit;
    }

    public static void main(final String[] args) throws Exception
    {
        lockType = LockType.valueOf(args[0]);
        numThreads = Integer.parseInt(args[1]);

        for (int i = 0; i < 10; i++)
        {
            runTest(numThreads, WARMUP_ITERATIONS);
            counter = 0L;
        }

        final long start = System.nanoTime();
        runTest(numThreads, ITERATIONS);
        final long duration = System.nanoTime() - start;

        out.printf("%d threads, duration %,d (ns)\n", numThreads, duration);
        out.printf("%,d ns/op\n", duration / ITERATIONS);
        out.printf("%,d ops/s\n", (ITERATIONS * 1000000000L) / duration);
        out.println("counter = " + counter);
    }

    private static void runTest(final int numThreads, final long iterationLimit)
        throws Exception
    {
        CyclicBarrier barrier = new CyclicBarrier(numThreads);
        Thread[] threads = new Thread[numThreads];

        for (int i = 0; i < threads.length; i++)
        {
            threads[i] = new Thread(new TestLocks(barrier, iterationLimit));
        }

        for (Thread t : threads)
        {
            t.start();
        }

        for (Thread t : threads)
        {
            t.join();
        }
    }

    public void run()
    {
        try
        {
            barrier.await();
        }
        catch (Exception e)
        {
            // don't care
        }

        switch (lockType)
        {
            case JVM: jvmLockInc(); break;
            case JUC: jucLockInc(); break;
        }
    }

    private void jvmLockInc()
    {
        long count = iterationLimit / numThreads;
        while (0 != count--)
        {
            synchronized (jvmLock)
            {
                ++counter;
            }
        }
    }

    private void jucLockInc()
    {
        long count = iterationLimit / numThreads;
        while (0 != count--)
        {
            jucLock.lock();
            try
            {
                ++counter;
            }
            finally
            {
                jucLock.unlock();
            }
        }
    }
}

Script to run tests:

set -x
for i in {1..8}
do 
    java -server -XX:-UseBiasedLocking TestLocks JVM $i
done

for i in {1..8}
do 
    java -server -XX:+UseBiasedLocking -XX:BiasedLockingStartupDelay=0 TestLocks JVM $i
done

for i in {1..8}
do 
    java -server TestLocks JUC $i
done

Results

The tests are carried out with 64-bit Linux (Fedora Core 15) and Oracle JDK 1.6.0_29.

Nehalem 2.8GHz - Ops/Sec
Threads-UseBiasedLocking+UseBiasedLockingReentrantLock
153,283,461
450,950,969
62,876,566
218,519,295
18,108,615
10,217,186
313,349,605
13,416,198
14,108,622
48,120,172
8,040,773
14,207,310
54,725,114
4,551,766
14,302,683
65,133,706
5,246,548
14,676,616
75,473,652
5,585,666
18,145,525
85,514,056
5,414,171
19,010,725


Sandy Bridge 2.0GHz - Ops/Sec
Threads-UseBiasedLocking+UseBiasedLockingReentrantLock
1
34,500,407
396,511,324
43,148,808
2
20,899,076
19,742,639
6,038,923
3
9,288,039
11,957,032
24,147,807
4
5,618,862
5,589,289
9,082,961
5
5,609,932
5,592,574
9,389,243
6
5,742,907
5,760,558
12,518,728
7
6,699,201
6,641,886
13,684,475
8
6,957,824
6,925,410
14,819,005

Observations
  1. Biased locking has a huge benefit in the un-contended single threaded case.
  2. Biased locking when un-contended, and not revoked, only adds 4-5 cycles of cost.  This is the cost when having a cache hit for the lock structures, on top of the code protected in the critical section.
  3. -XX:BiasedLockingStartupDelay=0 needs to be set for lean applications and micro-benchmarks.
  4. Avoiding OSR does not make a material difference to this set of test results.  This is likely to be because the loop is so simple or other costs are dominating.
  5. For the current implementations, ReentrantLocks scale better than synchronised locks under contention, except in the case of 2 contending threads.
Conclusion

My tests in the last post are invalid for the testing of an un-contended biased lock, because the lock was not actually biased.  If you are designing code following the single writer principle, and therefore having un-contended locks when using 3rd party libraries, then having biased locking enabled is a significant performance boost.

Saturday 19 November 2011

Java Lock Implementations

We all use 3rd party libraries as a normal part of development.  Generally, we have no control over their internals.  The libraries provided with the JDK are a typical example.   Many of these libraries employ locks to manage contention.

JDK locks come with two implementations.  One uses atomic CAS style instructions to manage the claim process.  CAS instructions tend to be the most expensive type of CPU instructions and on x86 have memory ordering semantics.  Often locks are un-contended which gives rise to a possible optimisation whereby a lock can be biased to the un-contended thread using techniques to avoid the use of atomic instructions.  This biasing allows a lock in theory to be quickly reacquired by the same thread.  If the lock turns out to be contended by multiple threads the algorithm with revert from being biased and fall back to the standard approach using atomic instructions.  Biased locking became the default lock implementation with Java 6.

When respecting the single writer principle, biased locking should be your friend.  Lately, when using the sockets API, I decided to measure the lock costs and was surprised by the results.  I found that my un-contended thread was incurring a bit more cost than I expected from the lock.  I put together the following test to compare the cost of the current lock implementations available in Java 6.

The Test

For the test I shall increment a counter within a lock, and increase the number of contending threads on the lock.  This test will be repeated for the 3 major lock implementations available to Java:
  1. Atomic locking on Java language monitors
  2. Biased locking on Java language monitors
  3. ReentrantLock introduced with the java.util.concurrent package in Java 5.
I'll also run the tests on the 3 most recent generations of the Intel CPU.  For each CPU I'll execute the tests up to the maximum number of concurrent threads the core count will support.

The tests are carried out with 64-bit Linux (Fedora Core 15) and Oracle JDK 1.6.0_29.

The Code
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;
import java.util.concurrent.CyclicBarrier;

import static java.lang.System.out;

public final class TestLocks implements Runnable
{
    public enum LockType { JVM, JUC }
    public static LockType lockType;

    public static final long ITERATIONS = 500L * 1000L *1000L;
    public static long counter = 0L;

    public static final Object jvmLock = new Object();
    public static final Lock jucLock = new ReentrantLock();
    private static int numThreads;
    private static CyclicBarrier barrier;

    public static void main(final String[] args) throws Exception
    {
        lockType = LockType.valueOf(args[0]);
        numThreads = Integer.parseInt(args[1]);
        
        runTest(numThreads); // warm up
        counter = 0L;

        final long start = System.nanoTime();
        runTest(numThreads);
        final long duration = System.nanoTime() - start;

        out.printf("%d threads, duration %,d (ns)\n", numThreads, duration);
        out.printf("%,d ns/op\n", duration / ITERATIONS);
        out.printf("%,d ops/s\n", (ITERATIONS * 1000000000L) / duration);
        out.println("counter = " + counter);
    }

    private static void runTest(final int numThreads) throws Exception
    {
        barrier = new CyclicBarrier(numThreads);
        Thread[] threads = new Thread[numThreads];

        for (int i = 0; i < threads.length; i++)
        {
            threads[i] = new Thread(new TestLocks());
        }

        for (Thread t : threads)
        {
            t.start();
        }

        for (Thread t : threads)
        {
            t.join();
        }
    }

    public void run()
    {
        try
        {
            barrier.await();
        }
        catch (Exception e)
        {
            // don't care
        }

        switch (lockType)
        {
            case JVM: jvmLockInc(); break;
            case JUC: jucLockInc(); break;
        }
    }

    private void jvmLockInc()
    {
        long count = ITERATIONS / numThreads;
        while (0 != count--)
        {
            synchronized (jvmLock)
            {
                ++counter;
            }
        }
    }

    private void jucLockInc()
    {
        long count = ITERATIONS / numThreads;
        while (0 != count--)
        {
            jucLock.lock();
            try
            {
                ++counter;
            }
            finally
            {
                jucLock.unlock();
            }
        }
    }
}
Script the tests:

set -x
for i in {1..8}; do java -XX:-UseBiasedLocking TestLocks JVM $i; done
for i in {1..8}; do java -XX:+UseBiasedLocking TestLocks JVM $i; done
for i in {1..8}; do java TestLocks JUC $i; done

The Results

Figure 1.

Figure 2.

Figure 3.

Observations
  1. Biased locking, in the un-contended case, is ~10% more expensive than the atomic locking.  It seems that for recent CPU generations the cost of atomic instructions is less than the necessary housekeeping for biased locks.  Previous to Nehalem, lock instructions would assert a lock on the memory bus to perform the these atomic operations, each would cost more than 100 cycles.  Since Nehalem, atomic instructions can be handled local to a CPU core, and typically cost only 10-20 cycles if they do not need to wait on the store buffer to empty while enforcing memory ordering semantics.
  2. As contention increases, language monitor locks quickly reach a throughput limit regardless of thread count.
  3. ReentrantLock gives the best un-contended performance and scales significantly better with increasing contention compared to language monitors using synchronized.
  4. ReentrantLock has an odd characteristic of reduced performance when 2 threads are contending.  This deserves further investigation.
  5. Sandybridge suffers from the increased latency of atomic instructions I detailed in a previous article when contended thread count is low.  As contended thread count continues to increase, the cost of the kernel arbitration tends to dominate and Sandybridge shows its strength with increased memory throughput.
Conclusion

Biased locking should no longer be the default lock implementation on modern Intel processors.  I recommend you measure your applications and experiement with the -XX:-UseBiasedLocking JVM option to determine if you can benefit from using atomic lock based algorithm for the un-contended case.

When developing your own concurrent libraries I would recommend ReentrantLock rather than using the synchronized keyword due to the significantly better performance on x86, if a lock-free alternative algorithm is not a viable option.

Update 20-Nov-2011

Dave Dice has pointed out that biased locking is not implemented for the locks created in the first few seconds of the JVM startup.  I'll re-run my tests this week and post the results.  I've had some more quality feedback that suggests my results could be potentially invalid.  Micro benchmarks can be tricky but the advice of measuring your own application in the large still stands.

A re-run of the tests can be seen in this follow-on blog taking account of Dave's feedback.

Saturday 5 November 2011

Locks & Condition Variables - Latency Impact

In a previous article on Inter-Thread Latency I showed how it is possible to signal a state change between 2 threads with less than 50ns of latency.  To many developers, writing concurrent code using locks is a scary experience.  Writing concurrent code using lock-free algorithms, i.e. algorithms that rely on the use of memory barriers and an intimate understanding of the underlying memory models, can be totally terrifying.  To me lock-free / non-blocking algorithms are like playing with explosives or corrosive chemicals, if you do not understand what you are doing, or show the ultimate respect, then very bad things can, and most likely will, happen!

In this article, I'd like to illustrate the impact of using locks and the resulting latency they can impose on your designs.  I want to use a very similar algorithm to that used in my previous inter-thread latency article to illustrate the ping-pong effect of handing control back and forth between 2 threads.  In this case, rather than using a couple of volatile variables, I will employ a pair of condition variables to signal a state change so control can be passed back and forth.

The Code
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

import static java.lang.System.out;

public final class LockedSignallingLatency
{
    private static final int ITERATIONS = 10 * 1000 * 1000;

    private static final Lock lock = new ReentrantLock();
    private static final Condition sendCondition = lock.newCondition();
    private static final Condition echoCondition = lock.newCondition();

    private static long sendValue = -1L;
    private static long echoValue = -1L;

    public static void main(final String[] args)
        throws Exception
    {
        final Thread sendThread = new Thread(new SendRunner());
        final Thread echoThread = new Thread(new EchoRunner());

        final long start = System.nanoTime();

        echoThread.start();
        sendThread.start();

        sendThread.join();
        echoThread.join();

        final long duration = System.nanoTime() - start;

        out.printf("duration %,d (ns)\n", duration);
        out.printf("%,d ns/op\n", duration / (ITERATIONS * 2L));
        out.printf("%,d ops/s\n", (ITERATIONS * 2L * 1000000000L) / duration);
    }

    public static final class SendRunner implements Runnable
    {
        public void run()
        {
            for (long i = 0; i < ITERATIONS; i++)
            {
                lock.lock();
                try
                {
                    sendValue = i;
                    sendCondition.signal();
                }
                finally
                {
                    lock.unlock();
                }

                lock.lock();
                try
                {
                    while (echoValue != i)
                    {
                        echoCondition.await();
                    }
                }
                catch (final InterruptedException ex)
                {
                    break;
                }
                finally
                {
                    lock.unlock();
                }

            }
        }
    }

    public static final class EchoRunner implements Runnable
    {
        public void run()
        {
            for (long i = 0; i < ITERATIONS; i++)
            {
                lock.lock();
                try
                {
                    while (sendValue != i)
                    {
                        sendCondition.await();
                    }
                }
                catch (final InterruptedException ex)
                {
                    break;
                }
                finally
                {
                    lock.unlock();
                }

                lock.lock();
                try
                {
                    echoValue = i;
                    echoCondition.signal();
                }
                finally
                {
                    lock.unlock();
                }
            }
        }
    }
}
Test Results

Windows 7 Professional 64-bit - Oracle JDK 1.6.0 - Nehalem 2.8 GHz

$ start /AFFINITY 0x14 /B /WAIT java LockedSignallingLatency
duration 41,649,616,343 (ns)
2,082 ns/op
480,196 ops/s

$ java LockedSignallingLatency
duration 73,789,456,491 (ns)
3,689 ns/op
271,041 ops/s

Linux Fedora Core 15 64-bit - Oracle JDK 1.6.0 - Nehalem 2.8 GHz

$ taskset -c 2,4 java LockedSignallingLatency
duration 40,469,689,559 (ns)
2,023 ns/op
494,197 ops/s

$ java LockedSignallingLatency
duration 169,795,756,230 (ns)
8,489 ns/op
117,788 ops/s

Linux Fedora Core 15 64-bit - Oracle JDK 1.6.0 - Sandybridge 2.0 GHz

$ taskset -c 2,4 java LockedSignallingLatency
duration 47,209,549,484 (ns)
2,360 ns/op
423,643 ops/s

$ java LockedSignallingLatency
duration 336,168,489,093 (ns)
16,808 ns/op
59,493 ops/s

Observations

The above is a typical set of results I've seen in the middle of the range from multiple runs.  There are a couple of interesting observations I'd like to expand on.

Firstly, this is 3 orders of magnitude greater latency than what I illustrated in the previous article using just memory barriers to signal between threads.  This cost comes about because the kernel needs to get involved to arbitrate between the threads for the lock, and then manage the scheduling for the threads to awaken when the condition is signalled.  The one-way latency to signal a change is pretty much the same as what is considered current state of the art for network hops between nodes via a switch.  It is possible to get ~1µs latency with InfiniBand and less than 5µs with 10GigE and user-space IP stacks.

Secondly, the impact is clear when letting the OS choose what CPUs the threads get scheduled on rather than pinning them manually.  I've observed this same issue across many use cases whereby Linux, in default configuration for its scheduler, will greatly impact the performance of a low-latency system by scheduling threads on different cores resulting in cache pollution.   Windows by default seems to make a better job of this.

I recently had an interesting discussion with Cliff Click about using condition variables and their cost.  He pointed out a problem he was seeing.  If you look at the case where a sleeping thread gets signalled within the lock, it goes to run and then discovers it cannot get the lock because the signalling thread already has the lock, so it gets put back to sleep until the signalling thread releases the lock, thus causing more work than necessary.  Modern schedulers would benefit from being more aware of communication mechanisms between threads to have more efficient location and rescheduling logic.  As we go more concurrent and parallel our schedulers need to become more aware of IPC mechanisms.

Conclusion

When designing a low-latency system it is crucial to avoid the use of locks and condition variables for the main transaction flows.  Non-blocking or lock-free algorithms are key to achieving ultra-low latency but can be very difficult to prove correct.   I would not recommend designing lock-free algorithms for business logic but they can be very effectively employed for low-level infrastructure components.  The business logic is best run on single threads following the Single Writer Principle from my previous article.