CheckIn capacity scenarios in the real world

Following on from a discussion I had with @nliam, we are scheduling a test for Wednesday.

We have three scenarios which I think could apply to any web-facing service:

  • Normal: Typical day-to-day scenario (100 users connecting, distributed normally over 24 hours).
  • Conference: You are announcing something at a conference and people want to see it (a spike of 300 users over a 5-10 minute period, tailing off to 100 over 2 hours).
  • Malicious: Someone on the internet hates your particular website and wants to take it down. Not on my watch. (thousands of connections per second from multiple locations).

These try to reflect real-world scenarios that every public-facing website should be able to handle, either by providing adequate underlying capacity or through defensive configuration. The aim of the tests is to explore where and how the service fails, in order to validate capacity and continuity plans.
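
To make the Conference scenario concrete, here is a minimal sketch of its load curve. The shape is my assumption, not something we have agreed: a linear ramp to the peak over 10 minutes, then a linear decay to the floor over 2 hours. The function name and its parameters are hypothetical.

```python
def conference_users(minutes, ramp=10.0, peak=300, floor=100, tail=120.0):
    """Target number of users `minutes` after the announcement.

    Assumed shape (not agreed anywhere): linear ramp from 0 to `peak`
    over `ramp` minutes, then linear decay to `floor` over `tail` minutes.
    """
    if minutes < 0:
        return 0
    if minutes <= ramp:
        # Spike phase: climb linearly to the peak.
        return round(peak * minutes / ramp)
    if minutes <= ramp + tail:
        # Tail phase: decay linearly from the peak to the floor.
        decay = (minutes - ramp) / tail
        return round(peak - (peak - floor) * decay)
    # After the tail, traffic settles at the floor.
    return floor


if __name__ == "__main__":
    for m in (0, 5, 10, 70, 130):
        print(m, conference_users(m))
```

A load generator could sample this function once a minute to decide how many simulated users to keep active. A real spike is unlikely to be linear, so treat the curve as a placeholder.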

Infrastructure

Typically, these tests should form part of the normal DevSecOps delivery pipeline and run in a “clean-room” environment: either a “dev” instance that mirrors the one in prod, or a mockup of the instance on resources that resemble prod.

Ops should be able to provide a playbook for running the tests that devs can run on demand, or include in their CI pipeline. This can go into a common repository - e.g. @egi-foundation/website-hard (where the various other hardening profiles for the web can be kept).

To make the results realistic, the tests need to be run in a distributed environment, i.e. traffic needs to come from several places. We could provision a few tests across FedCloud sites e.g…
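
As a rough illustration of spreading a scenario across locations, the sketch below divides a target user count between load-generating sites as evenly as possible. The site names are placeholders, not real FedCloud sites, and an even split is only one possible policy.

```python
def split_users(total_users, sites):
    """Divide `total_users` across `sites` as evenly as possible.

    Any remainder is handed out one user at a time to the first sites.
    """
    if not sites:
        raise ValueError("at least one site is required")
    base, extra = divmod(total_users, len(sites))
    return {
        site: base + (1 if index < extra else 0)
        for index, site in enumerate(sites)
    }


if __name__ == "__main__":
    # Hypothetical site names, for illustration only.
    print(split_users(300, ["site-a", "site-b", "site-c", "site-d"]))
```

Weighting by each site's available capacity would probably be more realistic, but needs information we do not have yet.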

For now, I’d like some comments about this overall strategy. Next steps will include:

  • Create basic checklists:
    • for the service owner to complete prior to launching tests
    • for the test initiator to complete prior to launching tests
    • for the site owner to complete while the test is ongoing.
  • Put the scenarios into something that can be easily kept under change control.
  • Develop something that can be done on-demand as a preconfigured action, so that developers or website owners can be proactive.
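
One way the scenarios could be kept under change control is as plain data serialised deterministically, so that diffs in the repository stay small and readable. This is only a sketch: the `Scenario` class, its fields and `scenarios_to_json` are hypothetical names, and the load descriptions are copied from the list above rather than turned into agreed numbers.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Scenario:
    """A capacity scenario; the field names are illustrative assumptions."""
    name: str
    description: str
    load: str  # free text for now, until concrete figures are agreed


SCENARIOS = [
    Scenario("normal", "Typical day-to-day scenario",
             "100 users connecting, distributed normally over 24 hours"),
    Scenario("conference", "Announcement at a conference",
             "300 users over a 5-10 minute period, tailing off to 100 over 2 hours"),
    Scenario("malicious", "Someone wants to take the site down",
             "thousands of connections per second from multiple locations"),
]


def scenarios_to_json(scenarios):
    """Serialise with sorted keys and fixed indentation for stable diffs."""
    return json.dumps([asdict(s) for s in scenarios], indent=2, sort_keys=True)


if __name__ == "__main__":
    print(scenarios_to_json(SCENARIOS))
```

The output could be committed next to the playbook and reviewed like any other change.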