Part 8 of 16

Parallel Streams: ForkJoinPool, Spliterators, and When NOT to Parallelize

Parallel streams are one of Java 8’s most misused features. It is tempting to add .parallel() to any slow stream pipeline, but the performance characteristics are counterintuitive: parallel execution can be slower than sequential for small data sets, and blocking I/O inside a parallel stream can tie up the shared worker threads that every other parallel stream in the JVM depends on. This article explains the mechanics, the cases where parallel genuinely helps, and the patterns to avoid.

How Parallel Streams Work

A parallel stream splits its source into sub-sequences, processes them concurrently on multiple threads, and merges the results. The mechanism is the fork/join framework: by default the work runs on ForkJoinPool.commonPool(), a single pool shared by the whole JVM, with the calling thread taking part as well.

// Sequential — processes on the calling thread
List<String> seq = names.stream()
    .map(String::toUpperCase)
    .collect(Collectors.toList());

// Parallel — splits work across ForkJoinPool.commonPool()
List<String> par = names.parallelStream()
    .map(String::toUpperCase)
    .collect(Collectors.toList());

// Convert existing stream to parallel
List<String> par2 = names.stream()
    .parallel()           // switch to parallel
    .map(String::toUpperCase)
    .collect(Collectors.toList());

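A pipeline runs entirely in one mode. .parallel() and .sequential() set a flag on the whole pipeline, not just on the steps that follow them, so if both appear, the last call before the terminal operation wins.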
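The whole-pipeline rule can be checked with Stream.isParallel(). The sketch below is illustrative only: the class name, method names, and sample data are made up for this example.

```java
import java.util.Arrays;
import java.util.List;

class PipelineModeDemo {
    private static final List<String> NAMES = Arrays.asList("ann", "bob", "cy");

    /** parallel() first, sequential() later: the later call decides the mode. */
    static boolean parallelThenSequential() {
        return NAMES.stream()
            .parallel()
            .map(String::toUpperCase)
            .sequential()
            .isParallel();
    }

    /** sequential() first, parallel() later: the whole pipeline is parallel. */
    static boolean sequentialThenParallel() {
        return NAMES.stream()
            .sequential()
            .map(String::toUpperCase)
            .parallel()
            .isParallel();
    }

    public static void main(String[] args) {
        System.out.println(parallelThenSequential()); // false
        System.out.println(sequentialThenParallel()); // true
    }
}
```

In other words, there is no way to run the map step in parallel and a later step sequentially within one pipeline.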


ForkJoinPool.commonPool

The common pool is shared across the entire JVM process. Its parallelism defaults to Runtime.getRuntime().availableProcessors() - 1, because the thread that invokes the terminal operation also takes part in the work.

System.out.println(ForkJoinPool.commonPool().getParallelism());
// e.g., 7 on an 8-core machine

You can change the pool size with a system property, set at JVM startup before the common pool is first used:

-Djava.util.concurrent.ForkJoinPool.common.parallelism=4
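To see the pool sizing in practice, a small sketch like the following records which threads actually execute a parallel pipeline. The class and method names are hypothetical, and the set of threads will differ from run to run and machine to machine.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ForkJoinPool;
import java.util.stream.IntStream;

class CommonPoolDemo {
    /** Runs a small parallel pipeline and returns the names of the threads that executed it. */
    static Set<String> threadsUsed() {
        Set<String> names = ConcurrentHashMap.newKeySet();   // thread-safe set
        IntStream.range(0, 10_000)
            .parallel()
            .forEach(i -> names.add(Thread.currentThread().getName()));
        return names;
    }

    public static void main(String[] args) {
        System.out.println("CPUs:        " + Runtime.getRuntime().availableProcessors());
        System.out.println("Parallelism: " + ForkJoinPool.commonPool().getParallelism());
        System.out.println("Threads:     " + threadsUsed());
    }
}
```

On a typical run the output lists the calling thread (for example main) alongside common-pool worker threads, which is why the pool itself needs one thread fewer than the CPU count.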

The Shared Pool Problem

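Because the common pool is shared, one slow parallel stream can starve every other user of the pool, including other parallel streams and any CompletableFuture async task started without an explicit executor. A blocking I/O call inside a parallel stream parks a ForkJoin worker thread for the full duration of the call, and the effect can cascade: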

// DANGEROUS: blocking inside parallel stream consumes ForkJoin threads
orders.parallelStream()
    .map(order -> httpClient.fetchDetails(order.getId()))  // blocks!
    .collect(Collectors.toList());

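Fix: run the pipeline from inside a dedicated pool. When a parallel stream’s terminal operation is invoked from a ForkJoinPool worker thread, its tasks run in that pool instead of the common pool. This is long-standing implementation behaviour, not a documented guarantee, so treat it as a workaround:

ForkJoinPool customPool = new ForkJoinPool(8);
try {
    List<OrderDetail> results = customPool.submit(() ->
        orders.parallelStream()
            .map(order -> httpClient.fetchDetails(order.getId()))
            .collect(Collectors.toList())
    ).get();  // throws InterruptedException / ExecutionException
} finally {
    customPool.shutdown();  // always release the pool's threads
}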

Spliterators

The engine behind stream splitting is Spliterator<T> — an iterator that knows how to split itself for parallel processing.

// Abridged: the four core methods
public interface Spliterator<T> {
    boolean tryAdvance(Consumer<? super T> action);  // process next element; false when none remain
    Spliterator<T> trySplit();                        // split off a portion (ideally half); null if it cannot split
    long estimateSize();                              // estimated remaining elements
    int characteristics();                            // ORDERED, SIZED, DISTINCT, etc.
}

When a stream goes parallel:

  1. trySplit() is called recursively to create sub-tasks down to a threshold
  2. Each sub-task processes its split and produces partial results
  3. Results are combined using the pipeline’s combiner
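The three steps above can be walked through by hand on a single thread. This is only a sketch of the mechanics, not how the framework schedules its tasks; the class and method names are invented for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Spliterator;

class SplitDemo {
    static List<Integer> zeroToSeven() {
        List<Integer> data = new ArrayList<>();
        for (int i = 0; i < 8; i++) data.add(i);
        return data;
    }

    /** Step 1: split once; returns {size of the split-off prefix, size left behind}. */
    static long[] splitSizes(List<Integer> data) {
        Spliterator<Integer> rest = data.spliterator();
        Spliterator<Integer> prefix = rest.trySplit();   // null when it cannot split
        long prefixSize = (prefix == null) ? 0 : prefix.estimateSize();
        return new long[] {prefixSize, rest.estimateSize()};
    }

    /** Step 2: process one split and produce a partial result. */
    static int partialSum(Spliterator<Integer> part) {
        int[] sum = {0};
        part.forEachRemaining(x -> sum[0] += x);
        return sum[0];
    }

    /** Steps 1 to 3 on the calling thread: split, sum each part, combine. */
    static int splitAndCombine(List<Integer> data) {
        Spliterator<Integer> rest = data.spliterator();
        Spliterator<Integer> prefix = rest.trySplit();
        int left = (prefix == null) ? 0 : partialSum(prefix);
        int right = partialSum(rest);
        return left + right;                             // plays the role of the combiner
    }

    public static void main(String[] args) {
        // [4, 4] with OpenJDK's ArrayList spliterator
        System.out.println(Arrays.toString(splitSizes(zeroToSeven())));
        System.out.println(splitAndCombine(zeroToSeven())); // 28
    }
}
```

A real parallel stream does the same thing recursively and hands each part to a fork/join task instead of summing them one after the other.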

Spliterator Characteristics

Characteristics tell the framework what guarantees the data source provides, enabling optimisations:

| Characteristic | Meaning | Example sources |
| --- | --- | --- |
| ORDERED | Elements have a defined encounter order | List, LinkedList, Arrays.stream |
| SORTED | Elements are sorted | TreeSet, sorted stream |
| SIZED | estimateSize() is exact | ArrayList, HashSet |
| DISTINCT | No duplicate elements | Set |
| NONNULL | No null elements | ConcurrentHashMap |
| IMMUTABLE | Source cannot be structurally modified | Arrays.stream, Arrays.spliterator |
| SUBSIZED | Sub-spliterators are also SIZED | Arrays, ArrayList |

ArrayList is ORDERED + SIZED + SUBSIZED — it splits perfectly. HashSet is SIZED + DISTINCT but not ORDERED — it can’t guarantee encounter order.
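These flags can be inspected directly with Spliterator.hasCharacteristics(). The helper below is a minimal sketch; the class name and the has() helper are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.Spliterator;
import java.util.TreeSet;

class CharacteristicsDemo {
    /** True if the collection's spliterator reports the given characteristic. */
    static boolean has(Collection<?> source, int characteristic) {
        return source.spliterator().hasCharacteristics(characteristic);
    }

    public static void main(String[] args) {
        Collection<String> list = new ArrayList<>(Arrays.asList("b", "a", "c"));
        Collection<String> hashSet = new HashSet<>(list);
        Collection<String> treeSet = new TreeSet<>(list);

        System.out.println("ArrayList ORDERED:  " + has(list, Spliterator.ORDERED));     // true
        System.out.println("ArrayList SUBSIZED: " + has(list, Spliterator.SUBSIZED));    // true
        System.out.println("HashSet   DISTINCT: " + has(hashSet, Spliterator.DISTINCT)); // true
        System.out.println("HashSet   ORDERED:  " + has(hashSet, Spliterator.ORDERED));  // false
        System.out.println("TreeSet   SORTED:   " + has(treeSet, Spliterator.SORTED));   // true
    }
}
```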


When Parallel Is Actually Faster

Parallel streams have overhead: task splitting, thread coordination, result merging. They are only faster when the computational savings outweigh this overhead.

Conditions for parallel wins

  1. Large data set: typically 10,000+ elements when the per-element work is cheap; below that, overhead usually dominates
  2. Computationally expensive per-element operation: CPU-bound work that takes non-trivial time
  3. Splittable source: ArrayList, arrays, and IntStream.range split evenly; LinkedList does not
  4. No ordering requirement, or ordering that is cheap to restore
  5. No shared mutable state: thread-safe operations only

Quick benchmark: CPU-bound work

long N = 10_000_000L;

// Sequential
long start = System.nanoTime();
long sumSeq = LongStream.range(0, N)
    .map(n -> n * n % 1000000007L)
    .sum();
long seqMs = (System.nanoTime() - start) / 1_000_000;

// Parallel
start = System.nanoTime();
long sumPar = LongStream.range(0, N)
    .parallel()
    .map(n -> n * n % 1000000007L)
    .sum();
long parMs = (System.nanoTime() - start) / 1_000_000;

System.out.printf("Sequential: %dms, Parallel: %dms%n", seqMs, parMs);
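// A single un-warmed run is noisy, so treat the numbers as a rough indication.
// Roughly 3–7x speedup is plausible on an 8-core machine; measure on your own hardware.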

Sources with good split behaviour

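| Source | Parallelism quality | Reason |
| --- | --- | --- |
| ArrayList | Excellent | O(1) split, exact size |
| int[] / long[] | Excellent | O(1) split, exact size |
| IntStream.range | Excellent | O(1) arithmetic split |
| TreeSet / TreeMap | Good | Balanced tree splits reasonably |
| HashSet / HashMap | Moderate | Splits by bucket, uneven possible |
| LinkedList | Poor | Must walk the list to split; splits are uneven |
| Files.lines() | Poor on Java 8 | Sequential read only; Java 9+ splits well for UTF-8, US-ASCII, and ISO-8859-1 files |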

When NOT to Use Parallel Streams

Small data sets

// Bad: 10 elements — overhead far exceeds savings
List<Integer> small = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
small.parallelStream().map(n -> n * 2).collect(Collectors.toList());

// Use sequential
small.stream().map(n -> n * 2).collect(Collectors.toList());

Order matters and the source is ordered

// Parallel forEach ignores encounter order: lines print in whatever order
// the worker threads reach them, even though the source list is ordered
List<String> ordered = new ArrayList<>(names);
ordered.parallelStream()
    .forEach(System.out::println); // order not guaranteed

// Fix: use forEachOrdered (but this negates most parallelism benefit)
ordered.parallelStream()
    .forEachOrdered(System.out::println);

Shared mutable state

// BROKEN: ArrayList.add is not thread-safe
List<Integer> results = new ArrayList<>();
numbers.parallelStream()
    .filter(n -> n > 5)
    .forEach(results::add);  // race condition!

// Fix: use collect instead
List<Integer> filtered = numbers.parallelStream()
    .filter(n -> n > 5)
    .collect(Collectors.toList());  // each thread fills its own list; collect merges them

I/O-bound operations (blocking)

Use CompletableFuture with a custom thread pool instead of parallel streams for HTTP calls, database queries, or file I/O.
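One way to follow this advice is sketched below. It assumes a fixed-size pool dedicated to the blocking calls; fetchAll, the pool size, and the slowLookup stand-in are hypothetical names chosen for the example, not part of any library.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;
import java.util.stream.Collectors;

class BlockingCalls {
    /**
     * Applies a blocking function to every input on a dedicated pool and
     * returns the results in input order.
     */
    static <T, R> List<R> fetchAll(List<T> inputs, Function<T, R> blockingCall, int threads) {
        ExecutorService ioPool = Executors.newFixedThreadPool(threads);
        try {
            // Start every call first so that they overlap...
            List<CompletableFuture<R>> futures = inputs.stream()
                .map(in -> CompletableFuture.supplyAsync(() -> blockingCall.apply(in), ioPool))
                .collect(Collectors.toList());
            // ...then wait for each result in input order.
            return futures.stream()
                .map(CompletableFuture::join)
                .collect(Collectors.toList());
        } finally {
            ioPool.shutdown();   // already-submitted tasks have finished by now
        }
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for an HTTP call or database query.
        Function<Integer, String> slowLookup = id -> {
            try {
                Thread.sleep(100);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return "detail-" + id;
        };
        System.out.println(fetchAll(Arrays.asList(1, 2, 3), slowLookup, 3));
    }
}
```

The two-pass shape matters: joining inside the first map would wait for each call before starting the next. Sizing the pool for the expected number of concurrent blocking calls, and not for the CPU count, is the main design difference from the common pool.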

Short pipelines

If the pipeline does only a cheap operation or two per element (a field access, a simple comparison) and the element count is modest, the overhead of splitting and merging will dominate.


Ordering Guarantees

| Stream type | forEach | collect(toList()) | findFirst |
| --- | --- | --- | --- |
| Sequential, ordered | ✓ ordered | ✓ ordered | first element |
| Parallel, ordered | ✗ any order | ✓ ordered (re-ordered) | first element (expensive) |
| Parallel, unordered | ✗ any order | ✗ any order | any element (fast) |

For parallel streams on ordered sources (lists), collect(Collectors.toList()) always preserves encounter order — Java guarantees this even for parallel streams. Only forEach loses order.
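The difference between the two terminal operations can be observed with a short sketch like this one (the class and method names are illustrative). The forEach variant uses a synchronized list only so that the arrival order can be recorded safely.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

class OrderingDemo {
    static List<Integer> source() {
        return IntStream.range(0, 1_000).boxed().collect(Collectors.toList());
    }

    /** collect keeps encounter order even though the pipeline runs in parallel. */
    static List<Integer> viaCollect(List<Integer> src) {
        return src.parallelStream().map(n -> n * 2).collect(Collectors.toList());
    }

    /** forEach delivers elements in arrival order, which may differ on every run. */
    static List<Integer> viaForEach(List<Integer> src) {
        List<Integer> out = Collections.synchronizedList(new ArrayList<>());
        src.parallelStream().map(n -> n * 2).forEach(out::add);
        return out;
    }

    public static void main(String[] args) {
        List<Integer> expected = source().stream().map(n -> n * 2).collect(Collectors.toList());
        System.out.println("collect in order: " + viaCollect(source()).equals(expected)); // true
        System.out.println("forEach in order: " + viaForEach(source()).equals(expected)); // usually false
    }
}
```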

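To tell the stream “I don’t care about order” and enable optimisations (the gain is usually largest for order-sensitive operations such as distinct(), limit(), and skip()):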

names.parallelStream()
    .unordered()  // removes ORDERED characteristic
    .filter(s -> s.length() > 3)
    .collect(Collectors.toList()); // may be in any order — faster

Practical Guide

// Template for deciding stream mode
Stream<T> stream = source.stream();

if (source.size() > 10_000           // large enough
        && operationIsCpuBound        // not I/O
        && !needsOrderedSideEffects   // no System.out::println in forEach
        && operationIsThreadSafe) {   // no shared mutable state
    stream = stream.parallel();
}
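The template above is pseudocode. One possible runnable form is sketched here, assuming the caller answers the three yes/no questions; the class name, method name, and threshold constant are hypothetical, and the threshold is only the article's rule of thumb.

```java
import java.util.Collection;
import java.util.stream.Stream;

class StreamMode {
    /** Rule-of-thumb size threshold; tune it with measurements. */
    static final int PARALLEL_THRESHOLD = 10_000;

    /** Returns a parallel stream only when every condition in the checklist holds. */
    static <T> Stream<T> streamFor(Collection<T> source,
                                   boolean cpuBound,
                                   boolean needsOrderedSideEffects,
                                   boolean threadSafe) {
        boolean goParallel = source.size() > PARALLEL_THRESHOLD
                && cpuBound
                && !needsOrderedSideEffects
                && threadSafe;
        Stream<T> stream = source.stream();
        return goParallel ? stream.parallel() : stream;
    }
}
```

A call such as StreamMode.streamFor(items, true, false, true) then yields a parallel stream only for collections above the threshold. A check like this is a starting point for a benchmark, not a replacement for one.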

Common patterns that are safe to parallelise

// CPU-heavy computation, result in a list
List<Result> results = items.parallelStream()
    .map(item -> expensiveCompute(item))
    .collect(Collectors.toList());

// Sum / reduce over large numeric ranges or arrays
long total = LongStream.range(0, 1_000_000).parallel().sum();

// Filtering large collections
List<Order> bigOrders = orders.parallelStream()
    .filter(o -> o.getTotal() > 10_000)
    .collect(Collectors.toList());

Common Mistakes

Combining parallel streams with sorted() or distinct()

sorted() and distinct() are stateful operations. sorted() is a full barrier: it cannot emit anything until it has seen every element. distinct() must remember everything it has already seen and, on an ordered parallel stream, buffer elements so that the first occurrence of each is the one kept. In a parallel pipeline this buffering and merging can cancel out much of the benefit, especially when the rest of the pipeline is cheap:

// Often no faster than sequential for small and medium lists
names.parallelStream()
    .sorted()
    .collect(Collectors.toList());

// Sequential sorted is often faster for in-memory lists
names.stream()
    .sorted()
    .collect(Collectors.toList());

Assuming parallelStream() is always safe on all sources

Calling parallelStream() is legal on any collection, but not all collections split well. LinkedList.parallelStream() gains little because a linked list cannot be split efficiently:

// LinkedList: splitting has to walk the nodes, and the resulting chunks are uneven
List<String> linked = new LinkedList<>(names);
linked.parallelStream().map(String::toUpperCase).collect(Collectors.toList());
// Usually no faster than sequential for cheap per-element work

// ArrayList: O(1) split
List<String> array = new ArrayList<>(names);
array.parallelStream().map(String::toUpperCase).collect(Collectors.toList());
// Can be faster for large lists with expensive per-element work

Not measuring before and after

Every parallel stream usage should be accompanied by a benchmark. The JMH microbenchmark framework is the standard tool:

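// pom.xml dependencies (jmh-generator-annprocess generates the benchmark harness)
// <dependency>
//   <groupId>org.openjdk.jmh</groupId>
//   <artifactId>jmh-core</artifactId>
//   <version>1.37</version>
// </dependency>
// <dependency>
//   <groupId>org.openjdk.jmh</groupId>
//   <artifactId>jmh-generator-annprocess</artifactId>
//   <version>1.37</version>
//   <scope>provided</scope>
// </dependency>

// import org.openjdk.jmh.annotations.*;

// Input size shared by both benchmarks; the @Param values are illustrative
@State(Scope.Benchmark)
public static class BenchmarkState {
    @Param({"10000", "10000000"})
    public long N;
}

@Benchmark
public long sequentialSum(BenchmarkState state) {
    return LongStream.range(0, state.N).map(n -> n * n % 1_000_000_007L).sum();
}

@Benchmark
public long parallelSum(BenchmarkState state) {
    return LongStream.range(0, state.N).parallel().map(n -> n * n % 1_000_000_007L).sum();
}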

Without measuring, you are guessing.


Summary

| Concept | Key point |
| --- | --- |
| How it works | ForkJoin splits source → process in parallel → merge results |
| Default pool | ForkJoinPool.commonPool(), size = CPU count - 1 |
| Spliterator | Enables splitting; ArrayList / arrays split best |
| Ordering | collect(toList()) preserves order; forEach does not |
| Safe use | Large + CPU-bound + stateless + no ordering side effects |
| Avoid when | Small data, I/O-bound, shared mutable state, order matters |
| Always do | Measure with JMH before and after adding .parallel() |

Next Step

Optional: Eliminating NullPointerException the Right Way →

Part of the DevOps Monk Java tutorial series: Java 8, Java 11, Java 17, Java 21