Large viewership spikes
The Secure Media Delivery at the Edge on AWS solution is designed to support workloads of varying scale, including streaming services that draw very high viewership, by using highly scalable components to issue access tokens and process log entries. One key scaling factor is the ability to generate tokens at the rate at which viewers request playback URLs, which requires the underlying resources that produce and serve access tokens to scale accordingly. In this solution's design, an access token is generated only once per playback session and reused by the same viewer for subsequent requests, which vastly reduces the load on the playback API, provided that new playback sessions are spread over time.

However, during specific periods the load on the playback API from new playback requests, and the resulting number of parallel token-generating processes, can still exceed the available limits or the capacity of the underlying resources. A typical example is a highly popular live streaming event that starts at a precise moment in time, producing an excessive number of requests from new viewers starting the stream almost simultaneously. If your workload experiences this type of event pattern, or simply serves a very high number of new viewers in steady-state operation, make sure the underlying compute resources can scale to serve the expected number of new viewers.

In the reference architecture, the API module can run multiple concurrent processes that generate the token and return the playback URL with the token in the response. This is made possible by the standard Lambda scaling model. Monitor the concurrency metric in the Region where the solution was deployed, and request a quota increase if you are approaching this limit or anticipate an upcoming event that can drive exceedingly high viewership. For sudden spikes of playback API requests, you must also account for the rate at which Lambda concurrency can increase, which is 500 per minute. If this proves insufficient for the type of events you serve, consider using Lambda provisioned concurrency to improve the ability to absorb a rapid increase in new viewers.
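As a starting point, the following minimal sketch (using the AWS SDK for JavaScript v3) shows how provisioned concurrency could be configured on the token-issuing function ahead of a scheduled event. The function name, alias, Region, and concurrency value are hypothetical assumptions to adapt to your deployment, not values defined by the solution.

```typescript
import {
  LambdaClient,
  PutProvisionedConcurrencyConfigCommand,
} from "@aws-sdk/client-lambda";

// Hypothetical values -- replace with the token-issuing function and alias
// deployed in your account, sized for the expected burst of new viewers.
const FUNCTION_NAME = "secure-media-delivery-generate-token";
const ALIAS = "live";
const PROVISIONED_EXECUTIONS = 1000;

const lambda = new LambdaClient({ region: "us-east-1" });

async function preWarmForEvent(): Promise<void> {
  // Provisioned concurrency keeps execution environments initialized,
  // so a simultaneous wave of new playback requests does not have to
  // wait for the standard scaling rate to catch up.
  await lambda.send(
    new PutProvisionedConcurrencyConfigCommand({
      FunctionName: FUNCTION_NAME,
      Qualifier: ALIAS, // applies to a published version or alias
      ProvisionedConcurrentExecutions: PROVISIONED_EXECUTIONS,
    })
  );
}

preWarmForEvent().catch(console.error);
```

After the event, the configuration can be removed with the corresponding DeleteProvisionedConcurrencyConfig call so you are not billed for idle pre-warmed capacity.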
The other aspect of scalability in the token generation process is the API call limit for AWS Secrets Manager. Note that in the default implementation, the solution's library method reaches out to Secrets Manager to obtain the signing keys required to issue a token. In the Region where secrets are stored, an API rate quota limits the number of GetSecretValue calls made by the solution's library method to 5,000 transactions per second. This does not mean that the maximum token generation rate is capped at the same level. The solution's library implementation uses a local memoization technique to cache the retrieved secrets in the context of the running process. You specify how long a secret should be cached in memory so that it can be reused by threads initiated within the same process. For this reason, to maximize reuse of a key once it has been retrieved, we recommend producing the tokens from a long-running process that serves requests with asynchronous threads sharing the same object holding the keys. This effectively reduces the number of API calls to Secrets Manager.

For large-scale events that require a large number of parallel processes, to the point where the Secrets Manager API call limit can become a concern, you can define a custom key retrieval function that introduces another layer of caching. For example, you can create a dedicated function that retrieves the original key from AWS Secrets Manager but also stores it in a space shared between processes, such as an in-memory data store like Amazon ElastiCache.
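The sketch below illustrates what such a custom key retrieval function might look like, layering a process-local memoization cache and a shared ElastiCache (Redis) cache in front of Secrets Manager. The helper name, cache endpoint, and TTL are hypothetical assumptions; the solution's actual library interface may differ.

```typescript
import {
  SecretsManagerClient,
  GetSecretValueCommand,
} from "@aws-sdk/client-secrets-manager";
import Redis from "ioredis";

// Hypothetical configuration -- adjust to your deployment.
const CACHE_TTL_SECONDS = 300; // how long a key may be reused before refresh
const redis = new Redis({ host: "my-elasticache-endpoint", port: 6379 });
const secretsManager = new SecretsManagerClient({});

// Process-local memoization layer: a key retrieved once is reused by all
// asynchronous threads of the same long-running process.
const localCache = new Map<string, { value: string; expiresAt: number }>();

async function getSigningKey(secretId: string): Promise<string> {
  const now = Date.now();

  // 1. Process-local cache: no network call at all.
  const local = localCache.get(secretId);
  if (local && local.expiresAt > now) {
    return local.value;
  }

  // 2. Shared cache in ElastiCache, visible to all parallel processes,
  //    so only a miss here reaches the Secrets Manager API.
  const shared = await redis.get(secretId);
  if (shared !== null) {
    localCache.set(secretId, {
      value: shared,
      expiresAt: now + CACHE_TTL_SECONDS * 1000,
    });
    return shared;
  }

  // 3. Fall back to Secrets Manager; only this step counts against
  //    the GetSecretValue rate quota.
  const response = await secretsManager.send(
    new GetSecretValueCommand({ SecretId: secretId })
  );
  const value = response.SecretString ?? "";

  // Populate both cache layers for subsequent callers.
  await redis.set(secretId, value, "EX", CACHE_TTL_SECONDS);
  localCache.set(secretId, {
    value,
    expiresAt: now + CACHE_TTL_SECONDS * 1000,
  });
  return value;
}
```

With this layering, the steady-state call rate to Secrets Manager is bounded by the cache TTL and the number of distinct keys rather than by the number of token-generating processes.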