
Clarifying Questions:
Before proceeding with the A/B test, I would need to clarify a few things:
What specific aspects of the surge pricing algorithm are being tested? Is it the pricing strategy itself, the algorithm's response time, or another parameter?
What key performance indicators define success for the surge pricing algorithm? Are we focusing on user conversion, revenue, or another metric?
How will the user base be split between version 1 and version 2? Are there any specific user segments we're interested in?
Are there any external factors (seasonality, events, holidays, etc.) that might influence the results of the test?
Prerequisites:
Success Metrics: Metrics that directly indicate the effectiveness of the surge pricing algorithm, such as revenue per ride or user satisfaction score.
Counter Metrics: Guardrail metrics monitored to ensure that gains in the success metrics are not causing harm elsewhere, such as longer wait times or more ride cancellations.
Ecosystem Metrics: Broader business metrics that might be indirectly affected by the algorithm changes, like impact on driver earnings or customer complaints.
Control and Treatment Variants: Control group (version 1) and treatment group (version 2) representing the two variants of the surge pricing algorithm.
Randomization Units: Individual users who will be randomly assigned to either the control or treatment group.
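To make the randomization unit concrete, here is a minimal sketch of deterministic user-level assignment. The hashing scheme, salt string, and 50/50 split are illustrative assumptions, not Uber's actual infrastructure.

```python
import hashlib

def assign_variant(user_id: str, salt: str = "surge_pricing_v2_test") -> str:
    """Deterministically assign a user to control or treatment.

    Hashing user_id with an experiment-specific salt yields a stable,
    roughly uniform bucket, so the same user always sees the same variant.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100  # bucket in [0, 99]
    return "treatment" if bucket < 50 else "control"  # assumed 50/50 split

print(assign_variant("rider_12345"))  # stable across calls for the same user
```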
Experiment Design:
Significance Level: Typically set at 0.05, the maximum acceptable probability of a Type I error (a false positive).
Practical Significance Level: A threshold indicating the smallest effect size that is practically meaningful. For example, a 5% increase in revenue might be considered practically significant.
Power: The probability of correctly rejecting a false null hypothesis, often set at 0.8.
Sample Size: Calculated from the chosen significance level, power, and expected effect size; see the power-analysis sketch after this list. Let's assume 50,000 users per group.
Duration: How long the experiment will run, for example, two weeks.
Effect Size: The expected difference in the success metric between the two versions, for example, a 10% increase in revenue.
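In practice, the sample size above would come out of a power calculation. Below is a minimal sketch using statsmodels; the 20% baseline booking rate is an assumed value, and the resulting number will differ from the 50,000 placeholder depending on the inputs.

```python
from statsmodels.stats.power import tt_ind_solve_power
from statsmodels.stats.proportion import proportion_effectsize

# Assumed values for illustration: a 20% baseline booking rate and the
# 5% relative lift used as the practical significance threshold above.
baseline_rate = 0.20
target_rate = baseline_rate * 1.05

effect_size = proportion_effectsize(target_rate, baseline_rate)  # Cohen's h
n_per_group = tt_ind_solve_power(
    effect_size=effect_size, alpha=0.05, power=0.8, alternative="two-sided"
)
print(f"Required users per group: {n_per_group:,.0f}")
```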
Running the Experiment:
Ramp-up Plan: Gradually roll out the new algorithm to a small percentage of users and monitor its performance before full-scale implementation.
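One way to express the ramp-up is as a staged exposure schedule combined with the same hash-based bucketing used for assignment. The stages and percentages below are illustrative assumptions.

```python
import hashlib

# Illustrative ramp-up schedule: (day the stage starts, % of users on the new algorithm).
RAMP_SCHEDULE = [(0, 1), (3, 5), (7, 25), (10, 50)]

def exposure_percent(days_since_launch: int) -> int:
    """Return the treatment exposure percentage for the current stage."""
    percent = 0
    for start_day, pct in RAMP_SCHEDULE:
        if days_since_launch >= start_day:
            percent = pct
    return percent

def in_treatment(user_id: str, days_since_launch: int, salt: str = "surge_v2_ramp") -> bool:
    """Expose a stable, growing slice of users: buckets below the current
    exposure percentage get the new algorithm, so earlier cohorts stay exposed."""
    bucket = int(hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest(), 16) % 100
    return bucket < exposure_percent(days_since_launch)

print(in_treatment("rider_12345", days_since_launch=4))  # stage 2: 5% exposure
```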
Result to Decision:
Sanity Checks: Before running any statistical tests, confirm that basic metrics such as user counts and conversion rates are consistent across both groups, for example by checking for sample ratio mismatch (see the sketch after this list).
Statistical Test: Perform a hypothesis test (e.g., t-test or chi-squared test) to determine if the differences in success metrics between the two groups are statistically significant.
Recommendation: If the new algorithm's performance is statistically and practically significant, recommend its full implementation. If not, consider further analysis or adjustments before making a decision.
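A minimal sketch of the check-and-test step, using synthetic data in place of logged experiment results; the thresholds mirror the significance and practical significance levels chosen above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic stand-ins for logged results; real inputs would come from the
# experiment's event pipeline (assumption for illustration).
control_revenue = rng.normal(loc=12.0, scale=4.0, size=50_000)    # revenue per ride
treatment_revenue = rng.normal(loc=12.7, scale=4.0, size=50_000)

# Sanity check: sample ratio mismatch. Under a 50/50 split, group sizes
# should not differ by more than chance allows.
srm_stat, srm_p = stats.chisquare([len(control_revenue), len(treatment_revenue)])
assert srm_p > 0.001, "Sample ratio mismatch -- investigate assignment before analysis"

# Statistical test: Welch's t-test on the success metric.
t_stat, p_value = stats.ttest_ind(treatment_revenue, control_revenue, equal_var=False)

# Decision: require both statistical and practical significance.
lift = treatment_revenue.mean() / control_revenue.mean() - 1
PRACTICAL_THRESHOLD = 0.05  # the 5% relative lift chosen above
if p_value < 0.05 and lift >= PRACTICAL_THRESHOLD:
    print(f"Recommend launch: lift={lift:.1%}, p={p_value:.3g}")
else:
    print(f"Hold: lift={lift:.1%}, p={p_value:.3g} -- analyze further before deciding")
```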
Post Launch Monitoring:
Novelty/Primacy Effect: Watch for temporary shifts caused by the change itself: a short-lived lift from user curiosity about the new pricing (novelty) or a short-lived dip while users adjust (primacy), both of which fade over time (a monitoring sketch follows this list).
Network Effect: Because riders and drivers share one marketplace, treatment-group prices can shift the driver supply available to the control group, so account for this interference between users and drivers when interpreting results.
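To spot a novelty or primacy effect, one common approach is to track the daily treatment-vs-control lift and see whether it decays. The sketch below uses synthetic data whose built-in decay is an assumption for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)

# Synthetic post-launch log: one row per ride with day, variant, and revenue.
days = np.repeat(np.arange(14), 2_000)
variant = np.tile(["control", "treatment"], 14_000)
# Assumed pattern: the treatment lift starts high and decays, mimicking a novelty effect.
decay = 1.0 + 0.08 * np.exp(-days / 4)
revenue = rng.normal(12.0, 4.0, size=days.size) * np.where(variant == "treatment", decay, 1.0)
log = pd.DataFrame({"day": days, "variant": variant, "revenue": revenue})

# Daily treatment-vs-control lift: a lift that shrinks toward zero over
# successive days points to a novelty effect rather than a durable gain.
daily = log.pivot_table(index="day", columns="variant", values="revenue", aggfunc="mean")
daily["lift"] = daily["treatment"] / daily["control"] - 1
print(daily["lift"].round(3))
```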
Remember, these are just guidelines, and the actual design may vary based on the specifics of the surge pricing algorithm and the business context at Uber.