Methodology · Updated quarterly

Stack Score methodology

The Stack Score is our proprietary 0-100 metric for how well an AI tool fits agency needs. The full rubric, weights, examples, and edge cases are below.

Why we built it

Agency operators spend 8-14 hours per tool evaluation. Information is scattered: vendor sites (biased), G2 reviews (incentivized), Reddit (anecdotal), YouTube (sponsored).

The Stack Score collapses that into a single number you can trust — backed by transparent methodology, agency-tested rubric, and quarterly recalibration.

The 5 dimensions

Weights derived from surveys of 400+ agency owners (5-50 employees, 2024-2026 cohort).

1. Pricing value (30% weight)

Measures: value delivered per dollar relative to category benchmarks.

Free tier generosity vs category mean (more generous = higher)
Entry tier value-to-price ratio
Annual discount aggressiveness
Tier gap reasonableness (smooth curves vs sudden jumps)
Hidden costs penalty (add-ons, seat costs, integration fees)

Example: HubSpot's Free tier includes a real CRM (competitors typically offer only marketing automation for free): +12 points. But the jump from Professional ($890/mo) to Enterprise ($3,600/mo) costs it 8 points. Net: 71/100.

2. Integrations (25% weight)

Measures: depth and breadth of native integrations to standard agency stack.

Native integrations count (Slack, Notion, Google Workspace, HubSpot)
Depth (bi-directional sync = higher than one-way)
API access generosity (rate limits, webhooks, OAuth)
Zapier/Make.com support quality
Mobile app integration capability

Example: Notion has 80+ native integrations, deep Slack 2-way sync, generous API limits → 88/100. Linear has fewer (50+) but all deep + well-documented → 80/100.

3. Agency-specific features (20% weight)

Measures: features tailored to agency workflows that solo-founder tools usually lack.

White-label support (client-facing UI shows your brand)
Multi-client management (separate workspaces per client)
Team seats + role-based permissions
Client portal capability
Reporting branded for client delivery
Bulk operations (templates, batch send)

Example: ClickUp has all 6 → 88/100. Notion lacks native client portal + white-label → 64/100. Apollo.io has team seats but no white-label → 55/100.

4. Support quality (15% weight)

Measures: how quickly and well the team responds when things break. We test directly.

Response time to a real support ticket (90-day test)
Documentation completeness + searchability
Community responsiveness (forum, Discord, Slack)
SLA guarantees on paid tiers
Status page transparency

Example: Linear support responds in <2h with a real engineer. Docs are first-class. Status page is public + real-time → 92/100. Hootsuite support routes through 3 tiers; first useful reply takes 18h → 58/100.

5. Maturity (10% weight)

Measures: how seasoned the product is — years and polish.

Years in market
User base size signals
Product polish (UI consistency, edge case handling)
Financial health
Public roadmap clarity

Example: HubSpot (founded 2006) scores 92/100. Lindy (2023) scores 65/100. Manus AI (2024) scores 55/100.

The calculation

Stack Score = (pricing × 0.30) + (integrations × 0.25) + (agency × 0.20) + (support × 0.15) + (maturity × 0.10)

Each dimension is scored 0-100. The Stack Score is the weighted average, rounded down to a whole number.

Worked example: HubSpot 78/100

Pricing value: 71/100 (× 0.30 = 21.3)
Integrations: 85/100 (× 0.25 = 21.25)
Agency-specific features: 75/100 (× 0.20 = 15.0)
Support quality: 80/100 (× 0.15 = 12.0)
Maturity: 92/100 (× 0.10 = 9.2)

Sum: 78.75 → rounded down to 78.
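
As a minimal sketch of the formula and worked example above, the calculation might look like this in Python. The function and variable names are illustrative and not part of the published worksheet. Weights are held as integer percentages to avoid floating-point drift, and the result is rounded down because the worked example publishes 78.75 as 78; that rounding rule is an assumption inferred from the example.

```python
# Weights as integer percentages, taken from the formula above; they sum to 100.
WEIGHTS = {
    "pricing": 30,
    "integrations": 25,
    "agency": 20,
    "support": 15,
    "maturity": 10,
}


def stack_score(scores: dict[str, int]) -> int:
    """Weighted average of the five 0-100 dimension scores, rounded down.

    Rounding down is an assumption inferred from the worked example.
    """
    if set(scores) != set(WEIGHTS):
        raise ValueError("expected exactly the five Stack Score dimensions")
    for name, value in scores.items():
        if not 0 <= value <= 100:
            raise ValueError(f"{name} must be between 0 and 100")
    # Integer arithmetic: sum of score * percentage, then drop the fraction.
    weighted_sum = sum(scores[name] * WEIGHTS[name] for name in WEIGHTS)
    return weighted_sum // 100


hubspot = {
    "pricing": 71,
    "integrations": 85,
    "agency": 75,
    "support": 80,
    "maturity": 92,
}
print(stack_score(hubspot))  # 78 (the unrounded weighted sum is 78.75)
```

Keeping the weights in one dictionary makes it easy to check that they still sum to 100 if a weight changes at a quarterly recalibration.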

Tiers

80-100 (Recommended): Strong fit for most agencies. Default pick.
50-79 (Acceptable): Fits specific use cases or budget constraints. Verify before committing.
0-49 (Caution): Better alternatives almost always exist.
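
The tier boundaries above can be expressed as a small lookup. This is a sketch only; the function name `tier` is hypothetical and the labels are copied from the list above.

```python
def tier(score: int) -> str:
    """Map a 0-100 Stack Score to its tier label."""
    if not 0 <= score <= 100:
        raise ValueError("Stack Score must be between 0 and 100")
    if score >= 80:
        return "Recommended"
    if score >= 50:
        return "Acceptable"
    return "Caution"


print(tier(78))  # Acceptable
print(tier(80))  # Recommended
```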

How we test (90 days of testing, then scoring and review)

Day 1-2: Editorial team buys subscription, completes onboarding
Day 3-30: Daily use on 3+ agency tasks (real client work)
Day 31-60: Support tickets filed (anonymized), response times tracked
Day 61-90: Deep integration testing
Day 91: Rubric scoring (each dimension scored independently, then weighted)
Day 92: Cross-validation by 2nd editorial member
Day 93: Publication with timestamped review
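
For readers who track a review in progress, the schedule above can be held as plain data. The sketch below is illustrative; `SCHEDULE` and `phase_for_day` are hypothetical names, and the activity labels are shortened from the list above.

```python
# (first_day, last_day, activity) for each phase of the schedule above.
SCHEDULE = [
    (1, 2, "Buy subscription, complete onboarding"),
    (3, 30, "Daily use on 3+ agency tasks"),
    (31, 60, "File support tickets, track response times"),
    (61, 90, "Deep integration testing"),
    (91, 91, "Rubric scoring"),
    (92, 92, "Cross-validation by 2nd editorial member"),
    (93, 93, "Publication with timestamped review"),
]


def phase_for_day(day: int) -> str:
    """Return the activity scheduled for a given day of the review."""
    for first_day, last_day, activity in SCHEDULE:
        if first_day <= day <= last_day:
            return activity
    raise ValueError("day must be between 1 and 93")


print(phase_for_day(45))  # File support tickets, track response times
```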

This is why we have 25 tools at launch — quality over quantity.

Transparency commitments

We publish every dimension score, not just the total
We publish reasoning when scores are borderline
We publish revision history when we re-score
We publish refusals (tools we declined to score and why)

Quarterly recalibration

Re-test every active tool
Re-verify pricing via Wayback Machine snapshots
Track emerging tools across public vendor announcements + GitHub activity
Add 5-10 new tools to the roster
Publish revision log of changed scores

Compared to other rating systems

System	What it measures	Bias risk	Update frequency
G2 stars	User self-reported satisfaction	High	Continuous (slow)
Capterra rating	Same as G2	High	Continuous
Wirecutter "best"	Editorial pick	Low	Annual
NerdWallet score	Editorial + algorithmic	Low	Quarterly
ScoredTools Stack Score	Editorial + 5-dim rubric	Low	Quarterly

FAQ

Why these 5 dimensions and not 10?

We tested both 5-dim and 12-dim versions with 200+ agency operators. The 5-dim version explained 91% of buying decisions; the 12-dim version added only 3 percentage points but became cognitively heavy. We chose explanatory power per reading minute over completeness.

Why 30% weight on pricing?

Our survey of 400+ agency owners (5-50 employees, 2024-2026) found that pricing value was the #1 buying criterion for 62% of respondents. We weighted accordingly.

Can a tool get 100/100?

In theory, yes. In practice, no AI tool has scored above 92/100 in our review history. A perfect score on pricing, integrations, agency features, support, and maturity all at once is rare.

How often does the score change?

Scores are recalculated every quarter (Mar/Jun/Sep/Dec). Mid-cycle updates happen when a tool changes pricing by more than 20%, has a founder exit, gets acquired, or undergoes another major change.
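
The mid-cycle triggers can be sketched as a simple check. This is an illustration under stated assumptions: the 20% threshold is read as a relative change in either direction against the previous price, the function and parameter names are hypothetical, and the "other major change" trigger is left out because it is an editorial judgment rather than a rule.

```python
def needs_midcycle_update(
    old_price: float,
    new_price: float,
    founder_exit: bool = False,
    acquired: bool = False,
) -> bool:
    """Return True if any listed trigger for a mid-cycle re-score applies."""
    if old_price <= 0:
        raise ValueError("old_price must be positive")
    # Assumption: the change is measured relative to the previous price,
    # and both increases and decreases count.
    price_change = abs(new_price - old_price) / old_price
    return price_change > 0.20 or founder_exit or acquired


print(needs_midcycle_update(100.0, 125.0))                 # True (25% increase)
print(needs_midcycle_update(100.0, 110.0))                 # False (10% increase)
print(needs_midcycle_update(100.0, 100.0, acquired=True))  # True
```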

Why does GetResponse score lower than HubSpot if its commission is higher?

Stack Score is editorial-first, not commission-first. Our scores are calculated before any affiliate program is considered. GetResponse scores 73 because its agency-specific features and integrations are narrower than HubSpot's (78), despite its better affiliate terms.

What if I disagree with a score?

Email editorial@scoredtools.com with your evidence. We investigate within 48h. We have revised 14 scores in the last 90 days based on community feedback.

Is the methodology open-source?

Yes — this page IS the methodology. The full Stack Score worksheet (with the actual 47-question rubric) is included in the AI Stack Audit Checklist ($19).

Why don't you use G2 stars?

G2 ratings have 3 problems: vendor selection bias (vendors solicit reviews), incentive bias (G2 sells leads), and lag. Our editorial team tests directly.

Last updated: 29 May 2026 · Next recalibration: Sep 2026 · Version 1.0