Document Cacher

Purpose

A document caching retriever actively retrieves external content such as RSS feeds and other XML documents. It differs from layout caching in that it caches the source document rather than the rendered document.

It has two purposes:

  • To protect the portal from broken content feeds (timeouts, slow response)
  • To improve the responsiveness of the portal

High Level Requirements

  • Self initializing - when the cache is started it will retrieve all documents
    1. Document Cacher reads cache parameters.
    2. Document Cacher creates new active cache and new error cache
    3. Document Cacher reads document caching parameters
    4. For each document in document caching parameters
      1. Document Cacher places a lock on the document to prevent access until load is complete.
      2. Document Cacher schedules immediate document load with multi-threaded Document Loader.
      3. Document Loader loads document with next available thread.
        1. On Success
          1. Document Loader loads document to active cache
          2. Document Cacher releases document lock
          3. Document Cacher schedules document refresh with Document Loader based on refresh schedule.
        2. On Failure
          1. Document Loader loads failure message to error cache.
          2. Document Cacher releases document lock.
          3. Document Cacher schedules document retry with Document Loader based on retry schedule.
  • Self populating - when an item is requested that is not in the cache it is retrieved
    1. Document Cacher receives request for document that does not exist in the active or error cache.
    2. If document is locked,
      1. Document Cacher waits for lock to be released.
      2. When lock is released, Document Cacher checks active cache and returns document if present.
      3. If document is not in active cache, Document Cacher returns document from error cache.
    3. If document is not locked,
      1. Document Cacher places a lock on the document.
      2. Document Cacher schedules immediate load of document with Document Loader.
      3. Document Loader loads document with next available thread.
        1. Success and failure are handled as in initialization, except that the refresh and retry schedules use cache-level settings instead of document-level settings.
  • Self refreshing - based on configured parameters the cache will retrieve documents at a scheduled interval
    1. Periodically the cache parameters should be re-read from their source (file, db, etc.) and compared to the previous version.
      1. New documents that have been added should be initialized using the process described under Self initializing above.
      2. Documents that no longer exist in the parameters file should not be refreshed when they expire from the active cache.
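
The self-initializing and self-populating flows above can be sketched roughly as follows. This is a minimal illustration in Python, not the actual implementation: the class name DocumentCacher, the fetch callable, and all method names are assumptions, and the refresh and retry scheduling steps are left out to keep the sketch short.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class DocumentCacher:
    """Illustrative cacher with an active cache, an error cache, and per-document locks."""

    def __init__(self, fetch, max_workers=4):
        self._fetch = fetch             # hypothetical callable: url -> document text
        self._active = {}               # url -> successfully loaded document
        self._errors = {}               # url -> failure message
        self._locks = {}                # url -> Event that is set when the load completes
        self._guard = threading.Lock()  # protects the three dicts above
        self._pool = ThreadPoolExecutor(max_workers=max_workers)

    def initialize(self, urls):
        """Lock every configured document and schedule an immediate load."""
        for url in urls:
            self._schedule_load(url)

    def _schedule_load(self, url):
        with self._guard:
            if url in self._locks:
                return self._locks[url]  # a load is already in flight
            done = threading.Event()
            self._locks[url] = done
        self._pool.submit(self._load, url, done)
        return done

    def _load(self, url, done):
        try:
            document = self._fetch(url)
            with self._guard:
                self._active[url] = document
                self._errors.pop(url, None)
        except Exception as exc:
            with self._guard:
                self._errors[url] = f"failed to load {url}: {exc}"
        finally:
            with self._guard:
                self._locks.pop(url, None)
            done.set()  # release the lock so waiting requests can proceed

    def get(self, url):
        """Return the cached document, loading it on a miss."""
        with self._guard:
            if url in self._active:
                return self._active[url]
            if url in self._errors:
                return self._errors[url]
            pending = self._locks.get(url)
        if pending is None:
            pending = self._schedule_load(url)
        pending.wait()
        with self._guard:
            # Prefer the active cache; fall back to the error cache.
            if url in self._active:
                return self._active[url]
            return self._errors[url]
```

A request for a locked document blocks on the Event until the loader finishes, then checks the active cache before the error cache, mirroring the order described in the steps above.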

Notes from JA-SIG

Requirements

Additional Requirements from JA-SIG

  1. Cached documents will be retrieved via a URL into the document cacher service.
  2. Actively retrieve a configured URL at a specified interval
    • Ability to vary interval based on absolute timing or timing relative to last successful or failed retrieval
  3. Configure complex retrieval intervals
    • One idea here would be to allow cron expressions
  4. Specify the action to take when a retrieval fails
    • Continue serving old data
    • Serve some per-URL error message
  5. Set an optional max age for cached data
  6. Share the cached data between multiple server nodes (Optional?)
  7. Allow 'easy' configuration via a big-long-url with all of the config parameters
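
As one possible reading of the last requirement, the cache settings could be carried as query parameters on the request URL. The sketch below is only an illustration: the parameter names (url, interval, onFailure, maxAge), their defaults, and the function name parse_cache_request are all assumptions, not part of the JA-SIG notes.

```python
from urllib.parse import urlsplit, parse_qs


def parse_cache_request(request_url):
    """Extract hypothetical per-document cache settings from a request URL."""
    query = parse_qs(urlsplit(request_url).query)

    def first(name, default=None):
        # parse_qs returns a list per parameter; take the first value if present.
        values = query.get(name)
        return values[0] if values else default

    max_age = first("maxAge")
    return {
        "url": first("url"),                               # document to retrieve
        "interval_seconds": int(first("interval", "300")), # assumed default
        "on_failure": first("onFailure", "serve-stale"),   # or "error-message"
        "max_age_seconds": int(max_age) if max_age is not None else None,
    }
```

Leaving maxAge unset yields None here, which matches the requirement that the max age be optional.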

Design Ideas

  1. Defined DocumentRetrievalService (DRS) interface
  2. DRS lookup by document URI
  3. cache service interface
    • allows for per-document cache settings
  4. how to store service configuration?
    • xstream - local xml file
    • embedded database (similar to bookmarks?)
  5. quartz for scheduling
    • need to have a db backed job store?

Notes on Document Cache Redesign

Caching app requirements

  • Config:
    • url
    • key
    • interval for active retry (smaller interval than max age)
    • method (get/post)
    • params (<string,string,...>) for get/post
    • maxage - time to live success
    • retry - time to live error
    • timeout in seconds for http call
  • Document state in memory
    • Key
    • last receive
    • last fail
    • retrieve count from cache inception
    • fail count from cache inception
  • Other Notes
    • Can have multiple document config files
    • When cache is instantiated, load all documents using thread-pool
    • Retry page load before maxage occurs
    • Keep separate cache of failures that refresh on a different schedule
    • Honor last modified and other http headers
    • Retrieve on hard miss
  • Logging
    • Status, stats (cache hits, misses, retrieve failures, successes), config
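
The config and in-memory state fields listed above could be modeled along these lines. This is a sketch under stated assumptions: the class names, the default values, and the choice of seconds as the unit for interval, maxage, and retry are all illustrative (the notes only specify seconds for the timeout).

```python
import time
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass(frozen=True)
class DocumentConfig:
    url: str
    key: str
    interval: int                 # active retry interval; must be smaller than maxage
    method: str = "get"           # "get" or "post"
    params: Dict[str, str] = field(default_factory=dict)
    maxage: int = 3600            # time to live after a success (assumed seconds)
    retry: int = 60               # time to live of an error entry (assumed seconds)
    timeout: int = 10             # HTTP call timeout in seconds

    def __post_init__(self):
        if self.interval >= self.maxage:
            raise ValueError("interval must be smaller than maxage")


@dataclass
class DocumentState:
    key: str
    last_receive: Optional[float] = None  # timestamp of last successful retrieval
    last_fail: Optional[float] = None     # timestamp of last failed retrieval
    retrieve_count: int = 0               # successes since cache inception
    fail_count: int = 0                   # failures since cache inception

    def record_success(self, now: Optional[float] = None) -> None:
        self.last_receive = time.time() if now is None else now
        self.retrieve_count += 1

    def record_failure(self, now: Optional[float] = None) -> None:
        self.last_fail = time.time() if now is None else now
        self.fail_count += 1
```

Keeping the configuration immutable and the counters in a separate mutable object means a periodic re-read of the config files can replace a DocumentConfig without losing the statistics gathered since cache inception.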

Current YaleInfo Document Cacher

TODO: add documentation on this page of DocumentCacher in YaleInfo.
