Our greatest glory is not in never falling, but in rising every time we fall.

Confucius

Name

Design for resilience

Statement

Every resource that a service depends on must be highly available. When a service calls another service, it must assume that the other service may fail.

Rationale

  • Increase the availability of services.
  • Keep the system responsive, even if some minor resources are not available.

Implications

  • Observe the operational status of the service (see Observe).
  • Some platform services, such as the authentication service or the data storage service, must be deployed to be highly available.
  • Provide backup processing for any call to an external service or resource:
    • Set timeouts so that no call can hang indefinitely.
    • Provide reasonable defaults or cached results where possible to mitigate the effects of a call to a failed external service.
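
The timeout and fallback implications can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a prescribed implementation: the helper name `call_with_fallback`, the in-memory `_cache`, and the default timeout value are all hypothetical and do not come from any platform service.

```python
import concurrent.futures

# Hypothetical helper: every name and default value here is an illustrative assumption.
_executor = concurrent.futures.ThreadPoolExecutor(max_workers=4)
_cache = {}  # last known good result per key (in-memory for the sketch)


def call_with_fallback(key, remote_call, timeout_s=2.0, default=None):
    """Run remote_call() with a time limit.

    Returns (value, is_fresh). On timeout or error, the value is the
    last cached result for key, or default if nothing was cached.
    """
    future = _executor.submit(remote_call)
    try:
        result = future.result(timeout=timeout_s)
    except Exception:
        # Covers both the timeout and any error raised by the call itself.
        # The worker thread may keep running after a timeout, so a real
        # client should also set a timeout on the underlying network call.
        return _cache.get(key, default), False
    _cache[key] = result  # remember the result for later failures
    return result, True
```

Returning the `is_fresh` flag alongside the value lets the caller tell the user that the displayed data may be stale instead of silently presenting it as current.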

Examples

Bad

The Patient Transfer service calls the Patient Management service to retrieve patient data. The Patient Management service does not respond for 10 minutes because it is overloaded with traffic. The Patient Transfer service waits the full 10 minutes, blocking the user from continuing their work, and then fails with a technical error. The data the user had already entered is lost.

Good

The Medical Notetaking service calls the remote storage service in the background to persist the data entered. The caregiver entering medical data on their mobile device is in an area of the hospital without network coverage. The Medical Notetaking service detects the lack of connectivity and saves the data locally in secure storage. It also warns the user to return to an area with coverage so that the data can be transmitted automatically, and notifies them once the transmission is complete.
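
The behaviour in this example resembles a local outbox that buffers data while offline and forwards it once the connection returns. The Python sketch below shows one possible shape for it. The class `NoteOutbox`, its methods, and the `send` callable are hypothetical names, and the in-memory queue merely stands in for secure on-device storage.

```python
from collections import deque


class NoteOutbox:
    """Buffers notes locally while the remote store is unreachable (illustrative sketch)."""

    def __init__(self, send):
        self._send = send        # hypothetical callable; assumed to raise ConnectionError when offline
        self._pending = deque()  # stand-in for secure on-device storage

    def save(self, note):
        """Store the note locally first, then try to transmit.

        Returns True if nothing is left pending, False otherwise.
        """
        self._pending.append(note)
        return self.flush()

    def flush(self):
        """Send pending notes in order; stop at the first failure and keep the rest."""
        while self._pending:
            try:
                self._send(self._pending[0])
            except ConnectionError:
                return False         # still offline: the caller can warn the user
            self._pending.popleft()  # remove a note only after it was sent
        return True

    @property
    def pending_count(self):
        return len(self._pending)
```

A caller would invoke `flush()` again when connectivity is detected, and could use the return value to show either the "return to an area with coverage" warning or the confirmation that the data has been transmitted.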