The complete guide to running tests that actually tell you something—even on a shoestring budget
The $2.4 Million Button Color
Let me tell you about the most expensive button color decision I've ever witnessed.
A SaaS company spent six months debating whether their main CTA should be blue or green. Six months. They had meetings about meetings about this button. They created mood boards. They consulted color psychology experts. They surveyed customers about their color preferences.
Finally, someone suggested they just test it. The test took two weeks to set up and run. The result? No statistically significant difference. None. Six months of salary, consultant fees, and opportunity cost—roughly $2.4 million in lost time and resources—to discover the color didn't matter.
But here's the twist: When they tested the button TEXT instead (changing "Get Started" to "Start Free Trial"), conversions jumped 34%.
This is why you need A/B testing. Not to test forty shades of blue, but to test what actually matters.
The Testing Reality Check
Before we dive in, let's address the elephant in the room: Most of what you've heard about A/B testing is enterprise-level fantasy that doesn't apply to your reality.
Microsoft can test 40 variations of a headline simultaneously because they have millions of daily visitors. You don't. Amazon can detect a 0.1% conversion improvement because that represents millions in revenue. For you, that's statistical noise.
Here's what the research actually says about testing for normal businesses:
Nielsen Norman Group's research reveals that you need approximately 5,000 visitors PER VARIATION to detect a 15% improvement with statistical confidence. If your site gets 10,000 visitors a month, a simple two-version test needs a full month to run. That's for detecting relatively large improvements.
Baymard Institute found that 71% of e-commerce sites run tests incorrectly—calling winners too early, testing too many variations, or testing meaningless changes. The average "winning" test that's called too early actually has a 26% chance of being a false positive.
But here's the good news: ConversionXL's analysis of 28,000 tests found that big, bold changes are 8x more likely to produce significant results than small tweaks. This means you can get meaningful results even with smaller traffic if you test the right things.
The Only Testing Framework You Need
Forget complex statistical models. Here's the framework that actually works for businesses doing less than $10M annually:
The ICE Score System
Before running any test, score it on three factors:
Impact (1-10): If this test wins, how much will it move the needle?
Confidence (1-10): How sure are you that this will make a difference?
Ease (1-10): How easy is it to implement this test?
Multiply these together. Anything scoring below 125 isn't worth testing yet. You don't have unlimited traffic, so you need to be selective.
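To make the math concrete, here's a minimal Python sketch (the backlog names and scores are made up for illustration; the 125 cutoff is the rule of thumb above):

```python
# Minimal ICE-score sketch: score candidate tests, keep only those worth running.
# The 125 cutoff is roughly 5 x 5 x 5 -- "average" on all three factors.

def ice_score(impact: int, confidence: int, ease: int) -> int:
    """Each factor is a 1-10 judgment call; the product is the priority score."""
    return impact * confidence * ease

# Hypothetical backlog of test ideas (scores are illustrative only)
backlog = {
    "Rewrite homepage headline":        ice_score(9, 7, 8),   # 504
    "Cut checkout form fields in half": ice_score(8, 8, 6),   # 384
    "Change CTA button color":          ice_score(2, 3, 9),   # 54, below the bar
}

worth_testing = {name: s for name, s in backlog.items() if s >= 125}
for name, score in sorted(worth_testing.items(), key=lambda kv: -kv[1]):
    print(f"{score:>4}  {name}")
```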
The Hypothesis Framework
Every test needs a hypothesis. Not "I wonder if..." but a proper hypothesis:
Because we saw [data/observation]
We believe that [change]
For [audience]
Will cause [impact]
We'll know this when we see [metric]
Example: "Because we saw 67% of users abandon at the shipping cost reveal (via Hotjar recordings), we believe that showing shipping costs on product pages for all visitors will cause fewer cart abandonments. We'll know this when we see cart abandonment rate decrease by at least 10%."
If you can't fill out this framework, you're not ready to test.
What to Test (Based on Actual Data)
The High-Impact Hit List
Based on ConversionXL's meta-analysis of winning tests, here's what actually moves the needle:
Value Proposition Tests (Average lift: 12.3%)
Your headline and subheadline on your homepage. This is the single highest-impact test for most businesses. Don't test minor word changes—test completely different value propositions. Instead of "Fast, Reliable Service" vs. "Quick, Dependable Service" (worthless), test "Get Your Project Done in 48 Hours" vs. "Save 50% on Development Costs" (meaningful).
Pricing Structure Tests (Average lift: 17.8%)
Not just the price itself, but how you present it. Annual vs. monthly. Three tiers vs. four. What's included in each tier. The names of your tiers. Baymard found that unclear pricing is cited by 28% of cart abandoners as their reason for leaving.
Form Reduction Tests (Average lift: 26.9%)
Every field you remove increases completion rates. Test your longest form cut in half. Not one field at a time—that's enterprise thinking. Cut it in HALF and see what happens.
Trust Element Placement (Average lift: 8.7%)
Not whether to have testimonials, but WHERE they go. Test testimonials near the price vs. near the CTA vs. on a separate page. Test security badges at checkout vs. throughout the site. Test guarantees above vs. below the fold.
Mobile-Specific Experiences (Average lift: 23.4%)
Not just responsive design, but completely different experiences for mobile. Shorter forms. Different navigation. Simplified checkout. Click-to-call buttons. Mobile users aren't desktop users on smaller screens—they're different humans in different contexts.
What NOT to Test (Waste of Time)
Button colors (unless they're invisible against your background)
Minor copy tweaks ("Get" vs. "Receive")
Font changes (unless completely unreadable)
Image variations (unless testing presence vs. absence)
Micro-animations (nobody cares)
Footer content (2% of visitors see it)
These might matter for Amazon. They don't matter for you.
The Practical Testing Process
Step 1: The Pre-Test Checklist
□ Calculate your sample size
Use this calculator: https://www.optimizely.com/sample-size-calculator/
Typical settings for small businesses:
- Baseline conversion rate: Your current rate
- Minimum detectable effect: 20% (relative)
- Statistical significance: 95%
- Statistical power: 80%
If the calculator says you need 50,000 visitors per variation and your site gets 5,000 visitors a month, a two-version test will take 20 months. Find a different test. (A rough version of this calculation is sketched below.)
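If you want to sanity-check the calculator's answer yourself, here's a rough Python sketch of the classical two-proportion sample-size formula. It's an approximation; tools like Optimizely use sequential statistics, so their numbers will differ somewhat, and the inputs below are just the example settings above.

```python
# Back-of-the-envelope sample size per variation for a two-proportion test.
# Standard normal-approximation formula; assumes a 50/50 traffic split.
from math import sqrt, ceil
from statistics import NormalDist

def sample_size_per_variation(baseline, relative_mde, alpha=0.05, power=0.80):
    p1 = baseline
    p2 = baseline * (1 + relative_mde)             # rate you hope to detect
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 for 95% significance
    z_beta = NormalDist().inv_cdf(power)           # 0.84 for 80% power
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p1 - p2) ** 2)

n = sample_size_per_variation(baseline=0.03, relative_mde=0.20)
print(f"Visitors needed per variation: {n:,}")     # roughly 13,900 for this example
monthly_visitors = 5_000
print(f"Months to finish a 2-way test: {2 * n / monthly_visitors:.1f}")
```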
□ Define success metrics
Primary metric: The ONE thing that determines success (usually conversion rate or revenue per visitor)
Secondary metrics: Things you're monitoring to ensure you're not breaking something else
Guardrail metrics: Things that can't get worse (like customer service tickets)
□ Document your hypothesis
Use the framework above. Write it down. Share it with your team. This prevents "I knew it would win" hindsight bias.
□ QA your variations
Test on multiple devices. Test on multiple browsers. Test the entire funnel, not just the changed element. Have someone else test it. The number of tests that fail due to technical issues is embarrassing.
Step 2: Running Your Test
The Testing Tools Hierarchy
Forget Google Optimize—it shut down in September 2023. Here's what actually works:
For Shopify: Native A/B testing in themes, or apps like Neat A/B Testing (free up to 5,000 sessions)
For WordPress: Nelio A/B Testing ($29/month) or Split Hero ($27/month)
For Everything Else: VWO ($199/month minimum) or Optimizely (enterprise pricing)
The Cheap Option: Run version A for two weeks, then version B for two weeks, and compare the periods (just know that seasonality and traffic shifts can muddy the comparison)
The Four Testing Commandments
- Don't peek. Checking your test daily leads to false positives. Set a calendar reminder for when the test ends and ignore it until then (the simulation sketch after this list shows why peeking burns you).
- Test through full weeks. Monday behavior differs from Friday behavior. Start tests on Tuesday morning, end them on Tuesday morning at least two weeks later.
- Account for seasonality. Don't test during Black Friday unless you only care about Black Friday. Don't test during slow seasons unless that's your normal.
- Document everything. Screenshot your variations. Save your hypothesis. Record start and end dates. You'll thank yourself later.
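If the "don't peek" rule feels paranoid, here's a small simulation sketch that shows the damage. It runs A/A tests, where both versions are identical, so every "significant" result is a false positive, and compares peeking daily against checking only at the end. The traffic numbers are made up for illustration.

```python
# A/A peeking simulation: both variations have the SAME true conversion rate,
# so any "significant" result is a false positive. Stopping at the first
# significant daily reading inflates that false-positive rate well past 5%.
import random
from statistics import NormalDist

def z_test_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a two-proportion z-test."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = (p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) ** 0.5
    if se == 0:
        return 1.0
    z = abs(conv_a / n_a - conv_b / n_b) / se
    return 2 * (1 - NormalDist().cdf(z))

random.seed(1)
TRUE_RATE, DAILY_VISITORS, DAYS, RUNS = 0.03, 200, 28, 1000
peeking_fp = end_only_fp = 0
for _ in range(RUNS):
    conv_a = conv_b = n_a = n_b = 0
    peeked_winner = False
    for _ in range(DAYS):
        n_a += DAILY_VISITORS
        n_b += DAILY_VISITORS
        conv_a += sum(random.random() < TRUE_RATE for _ in range(DAILY_VISITORS))
        conv_b += sum(random.random() < TRUE_RATE for _ in range(DAILY_VISITORS))
        if z_test_p_value(conv_a, n_a, conv_b, n_b) < 0.05:
            peeked_winner = True   # you would have stopped and shipped a "winner" today
    peeking_fp += peeked_winner
    end_only_fp += z_test_p_value(conv_a, n_a, conv_b, n_b) < 0.05

print(f"False positives when peeking daily:       {peeking_fp / RUNS:.0%}")
print(f"False positives checking only at the end: {end_only_fp / RUNS:.0%}")
```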
Step 3: Reading Your Results
Statistical Significance Is Not Enough
Your testing tool says "95% significance! Winner!" Not so fast.
You also need:
- Practical significance: Is the improvement worth the effort to implement?
- Sample size completion: Did you reach your calculated sample size?
- Consistent performance: Did it win every day, or just spike once?
- Segment consistency: Did it win for all traffic sources or just one?
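For the statistical part, here's a minimal sketch of reading a finished test in Python: a two-proportion z-test plus a practical-significance check. The visitor and conversion counts are hypothetical, and the 10% "worth implementing" threshold is a judgment call, not a rule.

```python
# Reading a finished test: statistical significance AND practical significance.
from statistics import NormalDist

# Hypothetical final counts after the test reached its planned sample size
visitors_a, conversions_a = 14_000, 420    # control: 3.0%
visitors_b, conversions_b = 14_000, 500    # variation: ~3.6%

rate_a = conversions_a / visitors_a
rate_b = conversions_b / visitors_b
relative_lift = (rate_b - rate_a) / rate_a

# Two-proportion z-test (two-sided)
p_pool = (conversions_a + conversions_b) / (visitors_a + visitors_b)
se = (p_pool * (1 - p_pool) * (1 / visitors_a + 1 / visitors_b)) ** 0.5
z = (rate_b - rate_a) / se
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"Control {rate_a:.2%} vs Variation {rate_b:.2%} (lift {relative_lift:+.1%})")
print(f"p-value: {p_value:.3f} -> {'significant' if p_value < 0.05 else 'not significant'}")
# Practical significance: is the lift big enough to be worth implementing?
print("Worth implementing" if relative_lift >= 0.10 and p_value < 0.05 else "Not a clear win")
```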
The Segment Deep Dive
A test that loses overall might win big for a specific segment:
- Mobile vs. Desktop
- New vs. Returning visitors
- Traffic source (organic vs. paid vs. direct)
- Geographic location
- Device type
- Browser
I've seen tests that lost 5% overall but won 40% on mobile. Always segment your results.
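Here's a minimal sketch of that segment check, assuming you can export raw rows of (variant, device, converted) from your analytics; the sample rows are placeholders.

```python
# Break test results down by segment before accepting the aggregate verdict.
from collections import defaultdict

# Hypothetical exported rows: (variant, device, converted)
rows = [
    ("A", "mobile", 1), ("B", "mobile", 1), ("A", "desktop", 0), ("B", "desktop", 0),
    # ...in practice, thousands more rows from your analytics export
]

stats = defaultdict(lambda: [0, 0])   # (device, variant) -> [visitors, conversions]
for variant, device, converted in rows:
    stats[(device, variant)][0] += 1
    stats[(device, variant)][1] += converted

for device in sorted({d for d, _ in stats}):
    parts = [device]
    for variant in ("A", "B"):
        visitors, conversions = stats[(device, variant)]
        rate = conversions / visitors if visitors else 0.0
        parts.append(f"{variant}: {rate:.1%} ({visitors} visitors)")
    print("  ".join(parts))
# Caution: small segments need their own sample-size check before you trust them.
```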
When to Call It
If you've reached your sample size and have significance, call it.
If you've run for 4 weeks without significance, call it a draw.
If you're seeing wild daily swings after 2 weeks, something's broken, usually tracking or traffic allocation.
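These rules are worth writing down before the test starts so nobody argues mid-test. A minimal sketch encoding them (the thresholds simply mirror the rules above):

```python
# Encode the "when to call it" rules so the decision is made before emotions kick in.

def call_it(reached_sample_size: bool, significant: bool, weeks_run: float,
            wild_daily_swings: bool) -> str:
    if wild_daily_swings and weeks_run >= 2:
        return "Pause: something is broken (check tracking and traffic allocation)"
    if reached_sample_size and significant:
        return "Call it: you have a result (winner or loser)"
    if weeks_run >= 4 and not significant:
        return "Call it a draw: keep the simpler version, test something bigger"
    return "Keep running"

print(call_it(reached_sample_size=True, significant=True,
              weeks_run=3, wild_daily_swings=False))
```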
Step 4: Post-Test Actions
If You Won:
- Implement the winner immediately
- Document why it won (your best guess)
- Look for ways to apply the learning elsewhere
- Plan a follow-up test to improve further
- Monitor for 2 weeks to ensure the win persists
If You Lost:
- Revert to the original immediately
- Document why you think it lost
- Consider testing the opposite approach
- Check segments for hidden wins
- Move on—losses teach as much as wins
If It's a Draw:
- Keep the simpler version
- Test something bigger next time
- Consider combining elements from both
- Check if you're testing the right metric
Platform-Specific Testing Strategies
Shopify
Shopify's built-in testing is limited but functional:
What you can test easily:
- Product descriptions (duplicate products, compare)
- Collection pages (create alternates)
- Checkout customization (Plus only)
- Email templates (Shopify Email)
Recommended approach:
- Use Neat A/B Testing for homepage tests
- Duplicate products for description tests
- Use Dynamic Checkout buttons vs. standard
- Test free shipping thresholds via discounts
WordPress/WooCommerce
Most flexible platform for testing:
Best plugins:
- Nelio A/B Testing (full-featured)
- Split Hero (simple but effective)
- Thrive Optimize (if using Thrive themes)
What to test first:
- Homepage hero with Gutenberg blocks
- Product page layouts
- Checkout field reduction
- Category page layouts
Squarespace
Most limited testing options:
Workarounds:
- Use URL redirects for page tests
- Duplicate pages with different URLs
- Use announcement bars for offer tests
- Time-based testing (week 1 vs. week 2)
Focus on:
- Homepage headline via duplicates
- Contact form fields
- Product quick views
- Mobile-specific CSS
Wix
Basic but improving:
Native testing:
- Wix's Ascend includes basic A/B testing
- Limited to certain elements
- Traffic allocation is automatic
Best practices:
- Test lightboxes vs. inline forms
- Test single vs. multi-page forms
- Test mobile menu styles
- Test product gallery layouts
The Small Business Testing Calendar
Month 1: Foundation
Week 1-2: Install analytics, set baselines
Week 3-4: Run homepage headline test
Month 2: Trust & Value
Week 1-2: Test value proposition
Week 3-4: Test trust signal placement
Month 3: Conversion Path
Week 1-2: Test form reduction
Week 3-4: Test checkout flow
Month 4: Mobile Optimization
Week 1-2: Test mobile-specific experience
Week 3-4: Test mobile CTAs
Month 5: Pricing & Offers
Week 1-2: Test pricing presentation
Week 3-4: Test urgency/scarcity
Month 6: Review & Plan
Week 1-2: Analyze all test results
Week 3-4: Plan next quarter based on learnings
Rinse and repeat, building on what you learn.
The Testing Mindset
Here's what separates businesses that succeed with testing from those that waste time:
Test to learn, not to win. Every test teaches you something about your customers, even losses. Especially losses. Document everything.
Test behaviors, not preferences. Customers say they want comprehensive information, then bounce when you provide it. Watch what they do, not what they say.
Test big to win big. You don't have traffic for tiny tests. Go bold or go home.
Test consistently. One test per month beats ten tests once a year. Momentum matters.
Test with humility. Your opinion doesn't matter. Your customer's behavior does.
Common Testing Mistakes (And How to Avoid Them)
Mistake 1: Testing Without Enough Traffic
The problem: Running tests that need 50,000 visitors when you get 5,000 monthly.
The fix: Use the sample size calculator BEFORE planning tests. If you need more than 2 months of traffic, find a different test or test more dramatic changes.
Mistake 2: Testing Too Many Things at Once
The problem: Testing headline, button color, images, and layout simultaneously.
The fix: One test at a time for sites under 100k monthly visitors. If everything changes at once, you can't tell which change did the work.
Mistake 3: Calling Tests Too Early
The problem: Seeing 95% significance on day 3 and declaring victory.
The fix: Wait for your full sample size, minimum 14 days, and include at least one full business cycle.
Mistake 4: Ignoring Segments
The problem: Looking only at aggregate results.
The fix: Always check mobile vs. desktop, new vs. returning, and traffic source segments. Hidden gold lives here.
Mistake 5: Not Testing the Whole Funnel
The problem: Improving homepage conversions while breaking checkout.
The fix: Monitor the entire funnel. Set up guardrail metrics. A win that breaks something else isn't a win.
The Bottom Line
A/B testing isn't about finding the perfect website. It's about continuous improvement based on actual user behavior instead of opinions and hunches.
You don't need millions of visitors or enterprise tools. You need clear hypotheses, meaningful changes, and the patience to let tests run to completion.
Start with your homepage headline. It's the highest-impact, easiest-to-test element for most businesses. Write five completely different versions. Test the best two. Learn something about your customers. Apply that learning everywhere.
Then test something else. And something else. Every month, you get a little better. Every test, you understand your customers more.
The businesses that win aren't the ones that guess right. They're the ones that test consistently, learn continuously, and compound those learnings into competitive advantage.
Your turn. What's your first hypothesis?