Perceived Quality of Service (PQoS)
The intent of DevOps is to strengthen the Perceived Quality of Service (PQoS) by closing any gaps between the user experience and user expectations. PQoS depends on the individual's perceptions, defined by Quality of Experience (QoE), and on the Quality of Service (QoS) delivered by the product / service offered. To achieve a high PQoS, both QoS and QoE have to be improved.
A classic example is YouTube and the PQoS it has built around video streaming. YouTube uses a pseudo-streaming methodology where the video is buffered before playback starts. Until the buffer is filled to a certain percentage, playback does not start. If the network slows down during playback, the subsequent buffering is delayed and can interrupt playback. This is called stalling, and it can drive dissatisfied users away from the YouTube page. To reduce stalling, an optimal buffer length needs to be chosen for the initial buffering. To do that, QoS attributes like the frequency and duration of stalls are monitored through the application. To improve QoE, user dynamics like geography, gender, time of access, and the end device and its attributes (CPU, memory, screen resolution, etc.) are studied. The initial buffer is then tuned per these user dynamics. This customises the service dynamically and gives it a personal touch.
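To make the idea concrete, here is a minimal sketch of tuning an initial buffer length from QoS and QoE signals. The function name, parameters, and the tuning formula are all illustrative assumptions, not YouTube's actual algorithm:

```python
def initial_buffer_seconds(bandwidth_kbps, bitrate_kbps, stall_rate, base=2.0):
    """Pick an initial buffer length (seconds) from QoS/QoE signals.

    bandwidth_kbps: measured network throughput for this user
    bitrate_kbps:   bitrate of the selected video rendition
    stall_rate:     observed stalls per minute for similar sessions
    """
    # If the network is slower than the bitrate, buffer proportionally more.
    ratio = bitrate_kbps / max(bandwidth_kbps, 1)
    buffer_s = base * max(ratio, 1.0)
    # Penalise sessions that historically stall: add 1s per stall/min observed.
    buffer_s += stall_rate * 1.0
    return round(buffer_s, 1)
```

A fast connection keeps the default buffer, while a slow or stall-prone session gets a longer one, trading startup delay for fewer interruptions.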
User expectations around QoE are built through various factors like competition and marketing messages, and PQoS can be measured in terms of performance, usability, features, etc.
Intelligent effort does not necessarily mean more automation; it is about choosing the right level of automation, with insights into user behaviour and tuned to give a personal touch.
SERVQUAL – GAP Model
There can be different levels of gaps that bring down the PQoS. As SERVQUAL defines, the gaps can arise from any of these factors:
- Requirements gathering
- Service design
- Service implementation & delivery
- Marketing communications
To reduce the gaps, we need to understand the metrics that matter to us. Mean Time To Detect (MTTD) and Mean Time To Recover (MTTR) are critical metrics to measure.
Given that failures are bound to happen, we need to look at improving MTTD and MTTR.
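As a minimal sketch, MTTD and MTTR can be computed from incident records; the field names (`occurred`, `detected`, `recovered`) are illustrative assumptions about how an incident log might be shaped:

```python
from datetime import datetime

def mean_minutes(deltas):
    """Average a list of timedeltas, in minutes."""
    return sum(d.total_seconds() for d in deltas) / len(deltas) / 60

def mttd_mttr(incidents):
    """incidents: list of dicts with 'occurred', 'detected', 'recovered'
    timestamps. MTTD = detection lag; MTTR = time from detection to recovery."""
    mttd = mean_minutes([i["detected"] - i["occurred"] for i in incidents])
    mttr = mean_minutes([i["recovered"] - i["detected"] for i in incidents])
    return mttd, mttr
```

Tracking these two numbers over releases makes it visible whether monitoring (MTTD) or recovery automation (MTTR) is the weaker link.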
Continuous QA improves defect containment and early detection of errors. Practices like Test-Driven Development help automate tests right from the design phase. The next problem to address is segregating the environments across dev, test, and production. Below is a sample of how the environments can be segregated and how different tests can be run in each environment in a Develop Test Accept Produce (DTAP) cycle.
It is ideal to have a staging environment where some of the scale and acceptance tests can be performed. However, replicating a load similar to YouTube's, with 100 million users viewing videos every day and around 65,000 uploads averaging 10 MB each, would be hugely costly in a staging environment. This brings the need for testing in production (TiP).
To have a cleaner segregation of tests in production, and to manage release updates more easily, a blue-green deployment model can be used, where the blue and green deployments act as active-active production environments; if one goes down, service requests can be redirected to the other.
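The failover behaviour can be sketched as a tiny router; the class and method names are assumptions for illustration:

```python
class BlueGreenRouter:
    """Minimal sketch of blue-green routing: both pools run in production;
    traffic goes to the active pool and fails over to the other on failure."""

    def __init__(self):
        self.pools = {"blue": "healthy", "green": "healthy"}
        self.active = "blue"

    def route(self):
        # Fail over if the active pool is down.
        if self.pools[self.active] != "healthy":
            self.active = "green" if self.active == "blue" else "blue"
        return self.active

    def mark_down(self, pool):
        self.pools[pool] = "down"
```

The same switch is what makes releases easy to manage: upgrade the idle pool, test it, then flip `active` to it, with the old pool kept as an instant rollback target.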
A good example here is how Facebook lets developers push code to production and manages multiple production environments. They maintain multiple levels of production environments, say stages 1..9, where stage 1 could be an internal release environment, stage 2 could be for small releases, and stage 9 could be for more stable, larger releases. Until a code change clears a stage with no bugs found, it is not pushed to the next stage.
Testing in Production (TiP)
TiP gives an outside-in approach to testing. It helps to test real user workflows and data in the production environment. Some of the TiP methodologies include :
Canary Deployment Testing :
In this model, updates are pushed to a subset of the production environment that caters to a subset of users. If any issues are identified in the updates, they are rolled back. However, if the users find the updates satisfactory, the release is pushed to a larger production environment. Google does nightly canary builds of its Chrome browser, targeted at developers and partners.
Nolio’s approach [now part of CA Technologies] to canary testing is as below [http://www.infoq.com/news/2013/03/canary-release-improve-quality]:
- Stage the application release, config scripts, tests etc
- Remove “Canary” servers from load balancing
- Upgrade “Canary” application.
- Run automated tests across the canary deployment
- Add back “Canary” servers to load balancing & test for sanity at production
- Upgrade the rest of the servers if the “Canary” testing passes [if either step 4 or step 5 fails, roll back the canary deployment]
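The steps above can be sketched as a single release function. The hooks (`upgrade`, `run_tests`, `sanity_check`) stand in for the staging, load-balancer, and test tooling and are assumptions for illustration:

```python
def canary_release(servers, canary_count, upgrade, run_tests, sanity_check):
    """Sketch of the canary flow: upgrade a few servers first, gate the
    rest of the fleet on the canary tests passing."""
    canaries, rest = servers[:canary_count], servers[canary_count:]
    # Steps 2-3: remove canaries from load balancing and upgrade them.
    for server in canaries:
        upgrade(server)
    # Step 4: run automated tests across the canary deployment.
    if not run_tests(canaries):
        return "rolled back"
    # Step 5: add canaries back to load balancing and sanity-test in production.
    if not sanity_check(canaries):
        return "rolled back"
    # Step 6: upgrade the remaining servers.
    for server in rest:
        upgrade(server)
    return "released"
```

The key property is that a failure at step 4 or 5 stops the rollout while only the canary slice of the fleet has changed.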
A/B Split Testing :
This is primarily used to identify users’ usability and navigation patterns. Different releases with varied UX are delivered to different groups of users. Based on the navigation and usage patterns, the best UX release is chosen and pushed to a broader group of users.
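Splitting users into groups is commonly done with deterministic hash bucketing, so a given user always sees the same variant. This is a minimal sketch of that technique, not any particular product's implementation:

```python
import hashlib

def ab_variant(user_id, split=0.5):
    """Deterministically assign a user to variant 'A' or 'B'.

    Hashing the user id keeps the assignment stable across visits,
    while the hash's uniformity gives roughly the requested split.
    """
    digest = hashlib.sha256(str(user_id).encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # map hash prefix to [0, 1]
    return "A" if bucket < split else "B"
```

Stability matters here: if a user flip-flopped between UX variants on each visit, the navigation metrics for each group would be meaningless.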
Recovery Testing :
This test focuses on the auto-recovery configurations, ensuring that if a failure happens, the deployment automatically heals. Netflix’s Chaos Monkey is a good example of injecting failures into the production environment to observe whether the deployments automatically scale or recover.
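In the spirit of Chaos Monkey, a recovery test can be sketched as "kill random instances, then assert capacity comes back". The `heal` hook stands in for whatever self-healing mechanism the deployment has (e.g. an autoscaler); all names here are illustrative:

```python
import random

def recovery_test(instances, heal, desired, kills=1, seed=None):
    """Kill random instances, invoke the system's self-healing hook,
    and check that capacity recovered to the desired level."""
    rng = random.Random(seed)
    for victim in rng.sample(sorted(instances), kills):
        instances.discard(victim)  # simulate instance failure
    heal(instances)                # auto-recovery (autoscaler, supervisor, ...)
    return len(instances) >= desired
```

The test passes only if the healing mechanism actually restores capacity, which is precisely the property recovery testing is meant to verify.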
It is essential to look at audit controls while we talk about DevOps. Any data that is to be used in the production environment (PE) for testing should be SOX compliant.
COBIT AI7.4 – Test Environment:
- The TE should be representative of the PE
- The TE should be secure and should not interact with the PE
- The test data database should be similar to production and compliant with regulatory needs
- Protect sensitive test data and results
- Retain or dispose of sensitive data as per regulatory compliance
*TE – Test Environment, PE – Production Environment
An ideal approach to get the audit controls in sync with TiP is to involve the auditors right from the project design phase and agree upon metrics and controls at the beginning.
Test data samples should be carefully picked to eliminate customer-sensitive data; QA should work with the Ops team on the sampling. TiP is ideal for smoke/functionality tests, but if a scale test is performed in production, appropriate measures need to be taken so the tests do not blow away the production database. Test data should be cleaned up appropriately to maintain the sanity of the production database. Strict access controls need to be implemented so that QA engineers get the right level of access, and data needs to be encrypted in transit. Traceability of the tests, such as logging the time of the test, the QA engineer, etc., helps trace back why and when a test was performed.
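The sampling-plus-masking step can be sketched as below; the field names treated as sensitive and the `***` mask are illustrative assumptions, and real tooling would use proper anonymisation rather than a fixed mask:

```python
import random

SENSITIVE_FIELDS = {"name", "email", "ssn"}  # illustrative field names

def sample_and_mask(records, sample_size, seed=0):
    """Take a reproducible sample of production records and mask
    customer-sensitive fields before handing them to QA."""
    rng = random.Random(seed)  # fixed seed => reproducible, traceable sample
    sample = rng.sample(records, min(sample_size, len(records)))
    return [
        {k: ("***" if k in SENSITIVE_FIELDS else v) for k, v in rec.items()}
        for rec in sample
    ]
```

A fixed seed makes the sample reproducible, which also helps the traceability requirement: the same sample can be regenerated when auditing why a test was run.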
- Defect containment, earlier the better
- More testing != quality testing, test in production, use user samples
- Micro services, roll back
- Incremental Updates
- Involve Ops in build & deploy automation
- QA visibility in to user analytics
- BDD/TDD requires a lot of collaboration as it is an iterative process. Tests are automated by the time the product is ready for release.
Two major questions still remain :
- How do we do a performance test using production data: do we take samples, or do we replicate the entire data set?
- How do we automate security tests?
A common approach to the first question would be to replicate the entire dataset from production; there are tools that can detect and remove sensitive data from the dataset. On automating security tests, Twitter experienced a compromise of Obama’s account and then tried to implement an automated approach to security testing. Since security tests require out-of-the-box test scenarios every time, they cannot be fully automated. However, I would be keen to learn if there are alternate approaches.