If I were a data scientist at Uber testing a new chat product that issues automated partial refunds instead of routing customers to the support center, I would follow this process:
1. Clarifying questions:
What is the objective of this new chat product?
What are the expected benefits of this new feature?
What is the current refund process, and how does the new chat product fit into it?
Who is the target audience for this chat product?
What are the potential risks of implementing this new feature?
2. Prerequisites:
Success metrics: The primary success metric would be the rate of successful automated partial refunds, i.e., the conversion rate used throughout the design below.
Counter metrics: The counter metric would be the rate of customers who escalate their refund requests to the support center.
Ecosystem metrics: We would also monitor broader ecosystem metrics such as user satisfaction, retention, and engagement.
Control and treatment variants: The control group would consist of customers who use the existing refund process, while the treatment group would use the new chat product.
Randomization units: The randomization unit would be the individual customer; each customer would be randomly assigned to either the control or treatment group (a deterministic, hash-based assignment is sketched after this list).
Null hypothesis: The null hypothesis would be that there is no difference in the rate of successful refunds between the control and treatment groups.
Alternate hypothesis: The alternative hypothesis would be that the new chat product leads to a higher rate of successful refunds than the existing refund process.
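Customer-level randomization is commonly implemented as a deterministic hash of the customer ID, so a customer always lands in the same variant across sessions. A minimal sketch (the salt value and function name are hypothetical):

```python
import hashlib

def assign_variant(customer_id: str, salt: str = "refund_chat_v1") -> str:
    """Deterministically bucket a customer into control or treatment (50/50)."""
    digest = hashlib.sha256(f"{salt}:{customer_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in [0, 100)
    return "treatment" if bucket < 50 else "control"

print(assign_variant("customer_12345"))
```

Because the assignment is a pure function of the ID and salt, no assignment table is needed, and changing the salt re-randomizes cleanly for a future experiment.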
3. Experiment Design:
Significance level: 0.05
Practical significance level: a 5% relative increase in the refund conversion rate (from 20% to 21% against the assumed baseline)
Power: 0.8
Sample size: We would calculate the required sample size with a power analysis. Assuming a baseline conversion rate of 20% and a 5% relative lift (from 20% to 21%), detecting that effect with 80% power at a 0.05 significance level requires roughly 25,000 customers per group; rounding up for attrition and dilution gives about 30,000 per group (see the sketch after this list).
Duration: We would run the experiment for four weeks, long enough to cover full weekly cycles and accumulate the required sample.
Effect size: The effect size would be the difference in the conversion rate between the control and treatment groups.
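A minimal sketch of the power analysis using statsmodels, assuming the 5% lift is relative (20% to 21%); the baseline and lift values are the assumptions stated above, not measured figures:

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.20            # assumed baseline conversion rate
treated = baseline * 1.05  # 5% relative lift -> 0.21

# Cohen's h for the two proportions, then solve for n per group.
effect_size = proportion_effectsize(treated, baseline)
n_per_group = NormalIndPower().solve_power(
    effect_size=effect_size, alpha=0.05, power=0.8, alternative="two-sided"
)
print(f"Required sample size per group: {n_per_group:,.0f}")  # ~25,000
```

This returns roughly 25,000 customers per group; padding for attrition is what lands the plan near the 30,000 figure above.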
4. Running the experiment:
Ramp up plan: We would gradually roll out the new chat product to the treatment group over the first week to ensure that we do not overload the system.
Bonferroni correction: If we analyze multiple metrics, we would apply a Bonferroni correction to adjust for multiple comparisons (a sketch using statsmodels follows).
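A minimal sketch of the correction; the p-values are placeholders standing in for the success, counter, and ecosystem metric tests:

```python
from statsmodels.stats.multitest import multipletests

# Hypothetical p-values for the success, counter, and ecosystem metrics.
p_values = [0.012, 0.048, 0.003]
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="bonferroni")
print(list(zip(p_adjusted.round(3), reject)))
```

With three metrics, each raw p-value is effectively compared against 0.05 / 3, so a metric that squeaks under 0.05 on its own (like 0.048 here) no longer counts as significant.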
5. Result to decision:
Basic sanity checks: We would perform basic sanity checks, for example a sample-ratio-mismatch test (sketched after this list), to ensure that no technical issues or bugs affect the results.
Statistical test: We would use a two-sample t-test to compare conversion rates between the control and treatment groups (see the sketch after this list).
Recommendation: If the new chat product leads to a significant increase in the number of successful refunds, we would recommend rolling it out to all customers. If not, we would recommend further iterations or abandoning the project.
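A minimal sketch of the sample-ratio-mismatch check: under a 50/50 split, the observed group counts should be close to equal, and a chi-square test flags deviations too large to be chance. The counts here are hypothetical:

```python
from scipy import stats

# Observed assignment counts (hypothetical); expected split is 50/50.
observed = [30_210, 29_790]
chi2, p = stats.chisquare(observed)  # expected counts default to equal
if p < 0.001:
    print("Possible sample ratio mismatch; audit assignment and logging.")
else:
    print(f"Split looks healthy (chi2 = {chi2:.2f}, p = {p:.3f})")
```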
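And a minimal sketch of the statistical test itself, run here on simulated binary outcomes (1 = successful refund) standing in for the experiment's logged data; with samples this large, the two-sample t-test is effectively equivalent to a two-proportion z-test:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Simulated stand-ins for the experiment's logged outcomes.
control = rng.binomial(1, 0.20, size=30_000)
treatment = rng.binomial(1, 0.21, size=30_000)

t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```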
6. Post-launch monitoring:
Novelty/primacy effect: We would track conversion rates over time to confirm that any initial lift is not driven by novelty or primacy effects, as sketched below.
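A minimal sketch of that monitoring, assuming a hypothetical log export with date, variant, and converted columns; a treatment lift that decays week over week points to a novelty effect rather than a durable gain:

```python
import pandas as pd

# Hypothetical export of post-launch logs: date, variant, converted (0/1).
logs = pd.read_csv("experiment_logs.csv", parse_dates=["date"])
logs["week"] = logs["date"].dt.isocalendar().week

# Weekly conversion rate per variant and the week-over-week lift.
weekly = logs.pivot_table(index="week", columns="variant",
                          values="converted", aggfunc="mean")
weekly["lift"] = weekly["treatment"] - weekly["control"]
print(weekly)
```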