Jon Masamitsu's Weblog
Our low-pause collector (UseConcMarkSweepGC) which we are usually careful to call our mostly concurrent collector has several phases, two of which are stop-the-world (STW) phases.
# STW initial mark
# Concurrent marking
# Concurrent precleaning
# STW remark
# Concurrent sweeping
# Concurrent reset
The first STW pause is used to find all the references to objects in the application (i.e., object references on thread stacks and in registers). After this first STW pause is the concurrent marking phase during which the application threads runs while GC is doing additional marking to determine the liveness of objects. After the concurrent marking phase there is a concurrent preclean phase (described more below) and then the second STW pause which is called the remark phase. The remark phase is a catch-up phase in which the GC figures out all the changes that the application threads have made during the previous concurrent phases. The remark phase is the longer of these two pauses. It is also typically the longest of any of the STW pauses (including the minor collection pauses). Because it is typically the longest pause we like to use parallelism where ever we can in the remark phase.
Part of the work in the remark phase involves rescanning objects that have been changed by an application thread (i.e., looking at the object A to see if A has been changed by the application thread so that A now references another object B and B was not previously marked as live). This includes objects in the young generation and here we come to the point of these ramblings. Rescanning the young generation in parallel requires that we divide the young generation into chunks so that we can give chunks out to the parallel GC threads doing the rescanning. A chunk needs to begin on the start of an object and in general we don't have a fast way to find the starts of objects in the young generation.
Given an arbitrary location in the young generation we are likely in the middle of an object, don't know what kind of object it is, and don't know how far we are from the start of the object. We know that the first object in the young generation starts at the beginning of the young generation and so we could start at the beginning and walk from object to object to do the chunking but that would be expensive. Instead we piggy-back the chunking of the young generation on another concurrent phase, the precleaning phase.
During the concurrent marking phase the applications threads are running and changing objects so that we don't have an exact picture of what's alive and what's not. We ultimately fix this up in the remark phase as described above (the object-A-gets-changed-to-point-to-object-B example). But we would like to do as much of the collection as we can concurrently so we have the concurrent precleaning phase. The precleaning phase does work similar to parts of the remark phase but does it concurrently. The details are not needed for this story so let me just say that there is a concurrent precleaning phase. During the latter part of the concurrent precleaning phase the the young generation "top" (the next location to be allocated in the young generation and so at an object start) is sampled at likely intervals and is saved as the start of a chunk. "Likely intervals" just means that we want to create chunks that are not too small and not too large so as to get good load balancing during the parallel remark.
Ok, so here's the punch line for all this. When we're doing the precleaning we do the sampling of the young generation top for a fixed amount of time before starting the remark. That fixed amount of time is CMSMaxAbortablePrecleanTime and its default value is 5 seconds. The best situation is to have a minor collection happen during the sampling. When that happens the sampling is done over the entire region in the young generation from its start to its final top. If a minor collection is not done during that 5 seconds then the region below the first sample is 1 chunk and it might be the majority of the young generation. Such a chunking doesn't spread the work out evenly to the GC threads so reduces the effective parallelism.
If the time between your minor collections is greater than 5 seconds and you're using parallel remark with the low-pause collector (which you are by default), you might not be getting parallel remarking after all. A symptom of this problem is significant variations in your remark pauses. This is not the only cause of variation in remark pauses but take a look at the times between your minor collections and if they are, say, greater than 3-4 seconds, you might need to up CMSMaxAbortablePrecleanTime so that you get a minor collection during the sampling.
And finally, why not just have the remark phase wait for a minor collection so that we get effective chunking? Waiting is often a bad thing to do. While waiting the application is running and changing objects and allocating new objects. The former makes more work for the remark phase when it happens and the latter could cause an out-of-memory before the GC can finish the collection. There is an option CMSScavengeBeforeRemark which is off by default. If turned on, it will cause a minor collection to occur just before the remark. That's good because it will reduce the remark pause. That's bad because there is a minor collection pause followed immediately by the remark pause which looks like 1 big fat pause.l