Coherence supports several cache topologies. This section contains the following information:
Coherence and Cache Topologies
Coherence supports several cache topologies, but the partitioned, near, and replicated options will satisfy the vast majority of use cases. All are fully coherent and support cluster-wide locking and transactions:
- Partitioned (or Distributed) - Each machine contains a unique partition of the dataset. Adding machines to the cluster will increase the capacity of the cache. Both read and write access involve network transfer and serialization/deserialization.
- Near - Each machine contains a small local cache which is synchronized with a larger Partitioned cache, optimizing read performance. There is some overhead involved with synchronizing the caches.
- Replicated - Each machine contains a full copy of the dataset. Read access is instantaneous.
Data Access Patterns
Data access distribution (hot spots)
When caching a large dataset, typically a small portion of that dataset will be responsible for most data accesses. For example, in a 1000 object dataset, 80% of operations may be be against a 100 object subset. The remaining 20% of operations may be against the other 900 objects. Obviously the most effective return on investment will be gained by caching the 100 most active objects; caching the remaining 900 objects will provide 25% more effective caching while requiring a 900% increase in resources.
On the other hand, if every object is accessed equally often (for example in sequential scans of the dataset), then caching will require more resources for the same level of effectiveness. In this case, achieving 80% cache effectiveness would require caching 80% of the dataset versus 10%. (Note that sequential scans of partially cached data sets will generally defeat MRU, LFU and MRU-LFU eviction policies). In practice, almost all non-synthetic (benchmark) data access patterns are uneven, and will respond well to caching subsets of data.
In cases where a subset of data is active, and a smaller subset is particularly active, Near caching can be very beneficial when used with the "all" invalidation strategy (this is effectively a two-tier extension of the above rules).
Cluster-node affinity
Coherence's Near cache technology will transparently take advantage of cluster-node affinity, especially when used with the "present" invalidation strategy. This topology is particularly useful when used in conjunction with a sticky load-balancer. Note that the "present" invalidation strategy results in higher overhead (as opposed to "all") when the front portion of the cache is "thrashed" (very short lifespan of cache entries); this is due to the higher overhead of adding/removing key-level event listeners. In general, a cache should be tuned to avoid thrashing and so this is usually not an issue.
Read-write ratio and data sizes
Generally speaking, the following cache topologies are best for the following use cases:
- Replicated cache - small amounts of read-heavy data (e.g. metadata)
- Partitioned cache - large amounts of read-write data (e.g. large data caches)
- Near cache - similar to Partitioned, but has further benefits from read-heavy tiered access patterns (e.g. large data caches with hotspots) and "sticky" data access (e.g. sticky HTTP session data). Depending on the synchronization method (expiry, asynchronous, synchronous), the worst case performance may range from similar to a Partitioned cache to considerably worse.
Interleaving
Interleaving refers to the number of cache reads between each cache write. The Partitioned cache is not affected by interleaving (as it is designed for 1:1 interleaving). The Replicated and Near caches by contrast are optimized for read-heavy caching, and prefer a read-heavy interleave (e.g. 10 reads between every write). This is because they both locally cache data for subsequent read access. Writes to the cache will force these locally cached items to be refreshed, a comparatively expensive process (relative to the near-zero cost of fetching an object off the local memory heap). Note that with the Near cache technology, worst-case performance is still similar to the Partitioned cache; the loss of performance is relative to best-case scenarios.
Note that interleaving is related to read-write ratios, but only indirectly. For example, a Near cache with a 1:1 read-write ratio may be extremely fast (all writes followed by all reads) or much slower (1:1 interleave, write-read-write-read...).
Heap Size Considerations
This section will help you decide
- How many CPUs you will need for your system
- How much memory you will need for each system
- How many JVMs to run per system
- How much heap to configure with each JVM
Since all applications are different, this section should be read as guidelines. You will need to answer the following questions in order to choose the configuration that is right for you:
- How much data will be stored in Coherence caches?
- What are the application requirements in terms of latency and throughput?
- How CPU or Network intensive is the application?
Sizing is an imprecise science. There is no substitute for frequent performance and stress testing.
General Guidelines
Running with a fixed sized heap will save your JVM from having to grow the heap on demand and will result in improved performance. To specify a fixed size heap use the -Xms and -Xmx JVM options, setting them to the same value. For example:
Note that the JVM process will consume more system memory then the specified heap size, for instance a 1GB JVM will consume 1.3GB of memory. This should be taken into consideration when determining the maximum number of JVMs which you will run on a machine. The actual allocated size can be monitored with tools such as top.
Cache Topologies and Heap Size
For large datasets, Partitioned or Near caches are recommended. As the scalability of the Partitioned cache is linear for both reading and writing, varying the number of Coherence JVMs will not significantly affect cache performance. Using a Replicated cache will put significant pressure on Garbage Collection.
Deciding How Many JVMs to Run Per System
The number of JVMs (nodes) to run per system depends on the system's number of processors/cores and amount of memory. As a starting point, we recommend starting with a plan to run one JVM for every two cores*. This recommendation balances the following factors:
- Multiple JVMs per server allow Coherence to make more efficient use of network resources. Coherence's packet-publisher and packet-receiver have a fixed number of threads per JVM; as you add cores, you'll want to add JVMs to scale across these cores.
- Too many JVMs will increase contention and context switching on processors.
- Too few JVMs may not be able to handle available memory and may not fully utilize the NIC.
- Especially for larger heap sizes, JVMs must have available processing capacity to avoid long GC pauses.
Depending on your application, you can add JVMs up toward one per core.
The recommended number of JVMs and amount of configured heap may also vary based on the number of processors/cores per socket and on the machine architecture.
Sizing Your Heap
When considering heap size, it is important to find the right balance. The lower bound is determined by per-JVM overhead (and also, manageability of a potentially large number of JVMs). For example, if there is a fixed overhead of 100MB for infrastructure software (e.g. JMX agents, connection pools, internal JVM structures), then the use of JVMs with 256MB heap sizes will result in close to 40% overhead for non-cache data.
The upper bound on JVM heap size is governed by memory management overhead, specifically the maximum duration of GC pauses and the percentage of CPU allocated to GC (and other memory management tasks).
Garbage Collection can affect the following:
- The latency of operations against Coherence. Larger heaps will cause longer and less predictable latency than smaller heaps.
- The stability of the cluster. With very large heaps, lengthy long garbage collection pauses can trick TCMP into believing a cluster member is dead since the JVM is unresponsive during GC pauses. Although TCMP takes GC pauses into account when deciding on member health, at some point it may decide the member is dead.
We offer the following guidelines:
- With older JVM's such as Sun 1.4 and 1.5, we recommend not allocating more than 1GB and 2GB heap respectively.
- For newer JVM's (Sun 1.6 or JRockit, for example), we recommend allocating up to a 4GB heap. We also recommend using Sun's Concurrent Mark and Sweep GC or JRockit's Deterministic GC.
The length of a GC pause scales worse than linearly to the size of the heap. That is, if you double the size of the heap, pause times due to GC will, in general, more than double.
GC pauses are also impacted by application usage.
- Pause times increase as the amount of live data in the heap increases. We recommend not exceeding 70% live data in your heap. This includes primary data, backup data, indexes, and application data.
- High object allocation rates will increase pause times. Even "simple" Coherence applications can cause high object allocation rates since every network packet generates many objects.
- CPU-intensive computation will increase contention and may also contribute to higher pause times.
Depending on your latency requirements, you can increase allocated heap space beyond the above recommendations, but be sure to stress test your system.
Moving the cache out of the application heap
Using dedicated Coherence cache server instances for Partitioned cache storage will minimize the heap size of application JVMs as the data is no longer stored locally. As most Partitioned cache access is remote (with only 1/N of data being held locally), using dedicated cache servers does not generally impose much additional overhead. Near cache technology may still be used, and it will generally have a minimal impact on heap size (as it is caching an even smaller subset of the Partitioned cache). Many applications are able to dramatically reduce heap sizes, resulting in better responsiveness.
Local partition storage may be enabled (for cache servers) or disabled (for application server clients) with the tangosol.coherence.distributed.localstorage Java property (e.g. -Dtangosol.coherence.distributed.localstorage=false).
It may also be disabled by modifying the <local-storage> setting in the tangosol-coherence.xml (or tangosol-coherence-override.xml) file as follows:
At least one storage-enabled JVM must be started before any storage-disabled clients access the cache.