Symptoms: All Sakai application servers were experiencing extremely slow usability and timeouts.
Cause: There was a memory leak in a service that transformed student images from large jpgs to thumbnails with watermarks. This caused the application servers to be in an almost constant state of full garbage collection (FGC).
How we diagnosed the problem.
Prior to experiencing this memory leak, we started monitoring the JVM Heap statistics via a JSP that simply outputs the numbers, and a perl script that collects the data into an Orca database. Below are the plots of this event.



You can see that Friday, the heap approached its limit of 2Gbs, but the FGC events were not freeing tenured space. As tenured space grew, so did the frequency of the FGC events. This resulted in multiple pauses of the application servers, accounting for the timeouts from users and other machines in the system (apache and oracle). The tenured space never decreased, even during low usage periods in the early am, signifying a memory leak.
Since winfield was the furthest along in its return trip to a full heap, it was chosen as the machine to profile. We ran "jmap -histo" to profile the heap. Below are the first few line of that profile:
Object Histogram: Size Count Class description ------------------------------------------------------- 1285903680 136822 int[] 128920272 47439 byte[] 88167240 1030445 char[] 45586688 317075 * ConstMethodKlass 25399776 1058324 java.lang.String 22834936 317075 * MethodKlass 18387056 29881 * ConstantPoolKlass 12962504 213652 java.lang.Object[] 12447328 29881 * InstanceKlassKlass 10674720 25200 * ConstantPoolCacheKlass 8784808 211113 * SymbolKlass 7179840 82062 java.util.HashMap$Entry[] 3945200 60770 java.util.Hashtable$Entry[] 3739800 155825 java.util.HashMap$Entry 3581776 10503 * MethodDataKlass 3283904 114590 java.lang.String[] 2919600 36495 java.lang.reflect.Method 2833584 118066 java.util.Hashtable$Entry ...
From this you can see that we have almost 1.3 Gbs of integer arrays that account for the high memory footprint. Given that we had just deployed the photo roster recently, we grep'd the profile for "awt.image" and found objects that should never have made it into tenured space, let alone through a FGC event:
288808 2777 sun.awt.image.IntegerInterleavedRaster 266496 2776 sun.awt.image.ImageRepresentation 133296 2777 java.awt.image.SinglePixelPackedSampleModel 111080 2777 java.awt.image.BufferedImage 111080 2777 java.awt.image.DataBufferInt 111040 2776 sun.awt.image.ToolkitImage 77784 1389 sun.awt.image.BufImgSurfaceData 55520 1388 sun.awt.image.ByteArrayImageSource 55520 1388 java.awt.image.ReplicateScaleFilter 33312 1388 java.awt.image.FilteredImageSource 544 4 java.awt.image.DirectColorModel 192 2 sun.awt.image.ImageFetcher ...
This lead to an examination of the code that converted the ID centers large photos into smaller thumbnails with watermarks, where the leak we found.
We made a emergency code release at 1730 monday after reproducing the leak in test and verifying the fix.