Fall 07 Memory Leak Whitepaper

Symptoms: All Sakai application servers were experiencing extremely slow usability and timeouts.

Cause: There was a memory leak in a service that transformed student images from large jpgs to thumbnails with watermarks. This caused the application servers to be in an almost constant state of full garbage collection (FGC).

How we diagnosed the problem.
Prior to experiencing this memory leak, we started monitoring the JVM Heap statistics via a JSP that simply outputs the numbers, and a perl script that collects the data into an Orca database. Below are the plots of this event.

You can see that Friday, the heap approached its limit of 2Gbs, but the FGC events were not freeing tenured space. As tenured space grew, so did the frequency of the FGC events. This resulted in multiple pauses of the application servers, accounting for the timeouts from users and other machines in the system (apache and oracle). The tenured space never decreased, even during low usage periods in the early am, signifying a memory leak.

Since winfield was the furthest along in its return trip to a full heap, it was chosen as the machine to profile. We ran "jmap -histo" to profile the heap. Below are the first few line of that profile:

Object Histogram:

Size    Count   Class description
-------------------------------------------------------
1285903680      136822  int[]
128920272       47439   byte[]
88167240        1030445 char[]
45586688        317075  * ConstMethodKlass
25399776        1058324 java.lang.String
22834936        317075  * MethodKlass
18387056        29881   * ConstantPoolKlass
12962504        213652  java.lang.Object[]
12447328        29881   * InstanceKlassKlass
10674720        25200   * ConstantPoolCacheKlass
8784808 211113  * SymbolKlass
7179840 82062   java.util.HashMap$Entry[]
3945200 60770   java.util.Hashtable$Entry[]
3739800 155825  java.util.HashMap$Entry
3581776 10503   * MethodDataKlass
3283904 114590  java.lang.String[]
2919600 36495   java.lang.reflect.Method
2833584 118066  java.util.Hashtable$Entry
...

From this you can see that we have almost 1.3 Gbs of integer arrays that account for the high memory footprint. Given that we had just deployed the photo roster recently, we grep'd the profile for "awt.image" and found objects that should never have made it into tenured space, let alone through a FGC event:

288808  2777    sun.awt.image.IntegerInterleavedRaster
266496  2776    sun.awt.image.ImageRepresentation
133296  2777    java.awt.image.SinglePixelPackedSampleModel
111080  2777    java.awt.image.BufferedImage
111080  2777    java.awt.image.DataBufferInt
111040  2776    sun.awt.image.ToolkitImage
77784   1389    sun.awt.image.BufImgSurfaceData
55520   1388    sun.awt.image.ByteArrayImageSource
55520   1388    java.awt.image.ReplicateScaleFilter
33312   1388    java.awt.image.FilteredImageSource
544     4       java.awt.image.DirectColorModel
192     2       sun.awt.image.ImageFetcher
...

This lead to an examination of the code that converted the ID centers large photos into smaller thumbnails with watermarks, where the leak we found.

We made a emergency code release at 1730 monday after reproducing the leak in test and verifying the fix.

Labels

 
(None)