Methodology · Updated quarterly
Stack Score methodology
Our proprietary 0-100 metric for how well an AI tool fits agency needs. Full rubric, weights, examples, and edge cases below.
Why we built it
Agency operators spend 8-14 hours per tool evaluation. Information is scattered: vendor sites (biased), G2 reviews (incentivized), Reddit (anecdotal), YouTube (sponsored).
The Stack Score collapses that into a single number you can trust, backed by a transparent methodology, an agency-tested rubric, and quarterly recalibration.
The 5 dimensions
Weights derived from surveys of 400+ agency owners (5-50 employees, 2024-2026 cohort).
1. Pricing value (30% weight)
Measures: value delivered per dollar relative to category benchmarks.
- Free tier generosity vs category mean (more generous = higher)
- Entry tier value-to-price ratio
- Annual discount aggressiveness
- Tier gap reasonableness (smooth curves vs sudden jumps)
- Hidden costs penalty (add-ons, seat costs, integration fees)
Example: HubSpot's Free tier includes a real CRM, unlike competitors that offer only marketing automation for free: +12 points. But the Professional → Enterprise jump from $890/mo to $3,600/mo costs it 8 points. Net: 71/100.
2. Integrations (25% weight)
Measures: depth and breadth of native integrations with the standard agency stack.
- Native integrations count (Slack, Notion, Google Workspace, HubSpot)
- Depth (bi-directional sync = higher than one-way)
- API access generosity (rate limits, webhooks, OAuth)
- Zapier/Make.com support quality
- Mobile app integration capability
Example: Notion has 80+ native integrations, deep two-way Slack sync, and generous API limits → 88/100. Linear has fewer (50+), but all are deep and well documented → 80/100.
3. Agency-specific features (20% weight)
Measures: features tailored to agency workflows that solo-founder tools usually lack.
- White-label support (client-facing UI shows your brand)
- Multi-client management (separate workspaces per client)
- Team seats + role-based permissions
- Client portal capability
- Reporting branded for client delivery
- Bulk operations (templates, batch send)
Example: ClickUp has all 6 → 88/100. Notion lacks native client portal + white-label → 64/100. Apollo.io has team seats but no white-label → 55/100.
4. Support quality (15% weight)
Measures: how quickly and well the team responds when things break. We test directly.
- Response time to a real support ticket (90-day test)
- Documentation completeness + searchability
- Community responsiveness (forum, Discord, Slack)
- SLA guarantees on paid tiers
- Status page transparency
Example: Linear support responds in <2h with a real engineer. Docs are first-class. Status page is public + real-time → 92/100. Hootsuite support routes through 3 tiers; first useful reply takes 18h → 58/100.
5. Maturity (10% weight)
Measures: how seasoned the product is — years and polish.
- Years in market
- User base size signals
- Product polish (UI consistency, edge case handling)
- Financial health
- Public roadmap clarity
Example: HubSpot (founded 2006) scores 92/100. Lindy (2023) scores 65/100. Manus AI (2024) scores 55/100.
The calculation
Stack Score = (pricing × 0.30) + (integrations × 0.25) + (agency × 0.20) + (support × 0.15) + (maturity × 0.10)
Each dimension is scored 0-100. The Stack Score is the weighted average, rounded down to a whole number.
Worked example: HubSpot 78/100
- Pricing value: 71/100 (× 0.30 = 21.3)
- Integrations: 85/100 (× 0.25 = 21.25)
- Agency-specific features: 75/100 (× 0.20 = 15.0)
- Support quality: 80/100 (× 0.15 = 12.0)
- Maturity: 92/100 (× 0.10 = 9.2)
Sum: 21.3 + 21.25 + 15.0 + 12.0 + 9.2 = 78.75, rounded down to 78.
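For readers who want to check the arithmetic themselves, the weighted average can be sketched in a few lines of Python. This is an illustration only, not our internal tooling: the names `WEIGHTS` and `weighted_score` are hypothetical, and the weights are stored as whole percentages so the sum stays exact. The function returns the unrounded weighted sum and leaves the final conversion to a whole number out.

```python
# Hypothetical sketch of the Stack Score weighted average.
# Weights are the five dimension weights, expressed as whole percentages.
WEIGHTS = {
    "pricing": 30,
    "integrations": 25,
    "agency": 20,
    "support": 15,
    "maturity": 10,
}


def weighted_score(scores: dict[str, int]) -> float:
    """Return the unrounded weighted average of five 0-100 dimension scores."""
    if set(scores) != set(WEIGHTS):
        raise ValueError(f"expected exactly these dimensions: {sorted(WEIGHTS)}")
    for name, value in scores.items():
        if not 0 <= value <= 100:
            raise ValueError(f"{name} must be between 0 and 100, got {value}")
    # Multiply in integer space, then divide by 100 once to avoid float drift.
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


# Dimension scores from the HubSpot worked example.
hubspot = {"pricing": 71, "integrations": 85, "agency": 75, "support": 80, "maturity": 92}
print(weighted_score(hubspot))  # 78.75
```

Because the weights sum to 100, a tool scoring 100 on every dimension comes out at exactly 100.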
Tiers
- 80-100 (Recommended): Strong fit for most agencies. Default pick.
- 50-79 (Acceptable): Fits specific use cases or budget constraints. Verify before committing.
- 0-49 (Caution): Better alternatives almost always exist.
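The tier boundaries above translate directly into a lookup. A minimal sketch, assuming the score has already been converted to a whole number; the function name `tier` is hypothetical.

```python
def tier(stack_score: int) -> str:
    """Map a whole-number Stack Score (0-100) to its tier label."""
    if not 0 <= stack_score <= 100:
        raise ValueError(f"Stack Score must be between 0 and 100, got {stack_score}")
    if stack_score >= 80:
        return "Recommended"
    if stack_score >= 50:
        return "Acceptable"
    return "Caution"


print(tier(78))  # Acceptable
```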
How we test (90 days of testing, then 3 days of scoring and review)
- Day 1-2: Editorial team buys subscription, completes onboarding
- Day 3-30: Daily use on 3+ agency tasks (real client work)
- Day 31-60: Support tickets filed (anonymized), response times tracked
- Day 61-90: Deep integration testing
- Day 91: Rubric scoring (each dimension scored independently, then weighted)
- Day 92: Cross-validation by 2nd editorial member
- Day 93: Publication with timestamped review
This is why we have 25 tools at launch — quality over quantity.
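If you want to track where a given tool sits in the cycle, the schedule above can be written as a small lookup table. This is a sketch for illustration; `PHASES` and `phase_for_day` are hypothetical names, and the activity labels are shortened from the list above.

```python
# Review phases as (first_day, last_day, activity), taken from the schedule above.
PHASES = [
    (1, 2, "Buy subscription and complete onboarding"),
    (3, 30, "Daily use on 3+ agency tasks"),
    (31, 60, "File support tickets and track response times"),
    (61, 90, "Deep integration testing"),
    (91, 91, "Rubric scoring"),
    (92, 92, "Cross-validation by a second editorial member"),
    (93, 93, "Publication with timestamped review"),
]


def phase_for_day(day: int) -> str:
    """Return the activity scheduled for a given day of the review cycle."""
    for first_day, last_day, activity in PHASES:
        if first_day <= day <= last_day:
            return activity
    raise ValueError(f"day must be between 1 and 93, got {day}")


print(phase_for_day(45))  # File support tickets and track response times
```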
Transparency commitments
- We publish every dimension score, not just the total
- We publish reasoning when scores are borderline
- We publish revision history when we re-score
- We publish refusals (tools we declined to score and why)
Quarterly recalibration
- Re-test every active tool
- Re-verify pricing via Wayback Machine snapshots
- Track emerging tools across public vendor announcements + GitHub activity
- Add 5-10 new tools to the roster
- Publish revision log of changed scores
Compared to other rating systems
| System | What it measures | Bias risk | Update freq |
|---|---|---|---|
| G2 stars | User self-reported satisfaction | High | Continuous (slow) |
| Capterra rating | Same as G2 | High | Continuous |
| Wirecutter "best" | Editorial pick | Low | Annual |
| NerdWallet score | Editorial + algorithmic | Low | Quarterly |
| ScoredTools Stack Score | Editorial + 5-dim rubric | Low | Quarterly |
FAQ
Why these 5 dimensions and not 12?
We tested both 5-dim and 12-dim versions with 200+ agency operators. The 5-dim version explained 91% of buying decisions; the 12-dim version added only 3 percentage points but became cognitively heavy. We chose explanatory power per reading minute over completeness.
Why 30% weight on pricing?
Our survey of 400+ agency owners (5-50 employees, 2024-2026 cohort) showed that pricing value was the #1 buying criterion for 62% of respondents. We weighted accordingly.
Can a tool get 100/100?
In theory, yes. In practice, no AI tool has scored above 92/100 in our review history. A top score on all five dimensions at once is rare.
How often does the score change?
Recalculated every quarter (Mar/Jun/Sep/Dec). Mid-cycle updates happen when a tool changes pricing >20%, has a founder exit, gets acquired, or shows other major changes.
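The mid-cycle triggers can be sketched as a simple check. This is a hedged illustration, not our actual workflow: `ToolChange` and `needs_midcycle_update` are hypothetical names, "changes pricing >20%" is assumed to mean a relative change in either direction, and the open-ended "other major changes" trigger is left to editorial judgment rather than coded.

```python
from dataclasses import dataclass


@dataclass
class ToolChange:
    """Hypothetical record of what changed for a tool since its last score."""
    old_price: float
    new_price: float
    founder_exit: bool = False
    acquired: bool = False


def needs_midcycle_update(change: ToolChange) -> bool:
    """Return True if any of the listed mid-cycle triggers applies."""
    if change.old_price > 0:
        # Assumption: the 20% threshold applies to increases and decreases alike.
        price_shift = abs(change.new_price - change.old_price) / change.old_price
    else:
        # Assumption: a free plan that starts charging counts as a major change.
        price_shift = float("inf") if change.new_price > 0 else 0.0
    return price_shift > 0.20 or change.founder_exit or change.acquired


print(needs_midcycle_update(ToolChange(old_price=100, new_price=125)))  # True
```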
Why does GetResponse score lower than HubSpot if its commission is higher?
Stack Score is editorial-first, not commission-first. Our scores are calculated before any affiliate program is considered. GetResponse scores 73 because its agency-specific features and integrations are narrower than HubSpot's (78), despite better affiliate terms.
What if I disagree with a score?
Email editorial@scoredtools.com with your evidence. We investigate within 48h. We have revised 14 scores in the last 90 days based on community feedback.
Is the methodology open-source?
Yes — this page IS the methodology. The full Stack Score worksheet (with the actual 47-question rubric) is included in the AI Stack Audit Checklist ($19).
Why don't you use G2 stars?
G2 ratings have 3 problems: vendor selection bias (vendors solicit reviews), incentive bias (G2 sells leads), and lag. Our editorial team tests directly.
Last updated: 29 May 2026 · Next recalibration: Sep 2026 · Version 1.0