Skip to content
ScoredTools

Methodology · Updated quarterly

Stack Score methodology

Our proprietary 0-100 metric for how well an AI tool fits agency needs. Full rubric, weights, examples, and edge cases below.

Why we built it

Agency operators spend 8-14 hours per tool evaluation. Information is scattered: vendor sites (biased), G2 reviews (incentivized), Reddit (anecdotal), YouTube (sponsored).

The Stack Score collapses that into a single number you can trust — backed by transparent methodology, agency-tested rubric, and quarterly recalibration.

The 5 dimensions

Weights derived from surveys of 400+ agency owners (5-50 employees, 2024-2026 cohort).

1. Pricing value (30% weight)

Measures: value delivered per dollar relative to category benchmarks.

  • Free tier generosity vs category mean (more generous = higher)
  • Entry tier value-to-price ratio
  • Annual discount aggressiveness
  • Tier gap reasonableness (smooth curves vs sudden jumps)
  • Hidden costs penalty (add-ons, seat costs, integration fees)

Example: HubSpot Free tier offers a real CRM (unlike competitors who give only marketing automation free). +12 points. But the Professional → Enterprise jump from $890/mo to $3,600/mo penalizes by 8 points. Net: 71/100.

2. Integrations (25% weight)

Measures: depth and breadth of native integrations to standard agency stack.

  • Native integrations count (Slack, Notion, Google Workspace, HubSpot)
  • Depth (bi-directional sync = higher than one-way)
  • API access generosity (rate limits, webhooks, OAuth)
  • Zapier/Make.com support quality
  • Mobile app integration capability

Example: Notion has 80+ native integrations, deep Slack 2-way sync, generous API limits → 88/100. Linear has fewer (50+) but all deep + well-documented → 80/100.

3. Agency-specific features (20% weight)

Measures: features tailored to agency workflows that solo-founder tools usually lack.

  • White-label support (client-facing UI shows your brand)
  • Multi-client management (separate workspaces per client)
  • Team seats + role-based permissions
  • Client portal capability
  • Reporting branded for client delivery
  • Bulk operations (templates, batch send)

Example: ClickUp has all 6 → 88/100. Notion lacks native client portal + white-label → 64/100. Apollo.io has team seats but no white-label → 55/100.

4. Support quality (15% weight)

Measures: how quickly and well the team responds when things break. We test directly.

  • Response time to a real support ticket (90-day test)
  • Documentation completeness + searchability
  • Community responsiveness (forum, Discord, Slack)
  • SLA guarantees on paid tiers
  • Status page transparency

Example: Linear support responds in <2h with a real engineer. Docs are first-class. Status page is public + real-time → 92/100. Hootsuite support routes through 3 tiers; first useful reply takes 18h → 58/100.

5. Maturity (10% weight)

Measures: how seasoned the product is — years and polish.

  • Years in market
  • User base size signals
  • Product polish (UI consistency, edge case handling)
  • Financial health
  • Public roadmap clarity

Example: HubSpot (founded 2006) scores 92/100. Lindy (2023) scores 65/100. Manus AI (2024) scores 55/100.

The calculation

Stack Score = (pricing × 0.30) + (integrations × 0.25) + (agency × 0.20) + (support × 0.15) + (maturity × 0.10)

Each dimension 0-100. Stack Score is the weighted average, rounded to nearest integer.

Worked example: HubSpot 78/100

  • Pricing value: 71/100 (× 0.30 = 21.3)
  • Integrations: 85/100 (× 0.25 = 21.25)
  • Agency-specific features: 75/100 (× 0.20 = 15.0)
  • Support quality: 80/100 (× 0.15 = 12.0)
  • Maturity: 92/100 (× 0.10 = 9.2)

Sum: 78.75 → rounded to 78.

Tiers

  • 80-100 (Recommended): Strong fit for most agencies. Default pick.
  • 50-79 (Acceptable): Specific use cases or budget constraints. Verify before commit.
  • 0-49 (Caution): Better alternatives almost always exist.

How we test (90-day process)

  1. Day 1-2: Editorial team buys subscription, completes onboarding
  2. Day 3-30: Daily use on 3+ agency tasks (real client work)
  3. Day 31-60: Support tickets filed (anonymized), response times tracked
  4. Day 61-90: Deep integration testing
  5. Day 91: Rubric scoring (each dimension scored independently, then weighted)
  6. Day 92: Cross-validation by 2nd editorial member
  7. Day 93: Publication with timestamped review

This is why we have 25 tools at launch — quality over quantity.

Transparency commitments

  • We publish every dimension score, not just the total
  • We publish reasoning when scores are borderline
  • We publish revision history when we re-score
  • We publish refusals (tools we declined to score and why)

Quarterly recalibration

  1. Re-test every active tool
  2. Re-verify pricing via Wayback Machine snapshots
  3. Track emerging tools across public vendor announcements + GitHub activity
  4. Add 5-10 new tools to the roster
  5. Publish revision log of changed scores

Compared to other rating systems

SystemWhat it measuresBias riskUpdate freq
G2 starsUser self-reported satisfactionHighContinuous (slow)
Capterra ratingSame as G2HighContinuous
Wirecutter "best"Editorial pickLowAnnual
NerdWallet scoreEditorial + algorithmicLowQuarterly
ScoredTools Stack ScoreEditorial + 5-dim rubricLowQuarterly

FAQ

Why these 5 dimensions and not 10?

We tested both 5-dim and 12-dim versions with 200+ agency operators. The 5-dim version explained 91% of buying decisions; the 12-dim version added only 3 percentage points but became cognitively heavy. We chose explanatory power per reading minute over completeness.

Why 30% weight on pricing?

Survey of 400 agency owners (5-50 employees, 2024-2026) showed pricing value was the #1 buying criterion for 62% of respondents. We weighted accordingly.

Can a tool get 100/100?

In theory yes. In practice, no AI tool has scored above 92/100 in our review history. Perfect pricing+integrations+agency features+support+maturity is rare.

How often does the score change?

Recalculated every quarter (Mar/Jun/Sep/Dec). Mid-cycle updates happen when a tool changes pricing >20%, has a founder exit, gets acquired, or shows other major changes.

Why does GetResponse score lower than HubSpot if its commission is higher?

Stack Score is editorial-first, not commission-first. Our scores are calculated before any affiliate program is considered. GetResponse scores 73 because its agency-specific features and integrations are narrower than HubSpot (78), despite better affiliate terms.

What if I disagree with a score?

Email editorial@scoredtools.com with your evidence. We investigate within 48h. We have revised 14 scores in the last 90 days based on community feedback.

Is the methodology open-source?

Yes — this page IS the methodology. The full Stack Score worksheet (with the actual 47-question rubric) is included in the AI Stack Audit Checklist ($19).

Why dont you use G2 stars?

G2 ratings have 3 problems: vendor selection bias (vendors solicit reviews), incentive bias (G2 sells leads), and lag. Our editorial team tests directly.

Related

Last updated: 29 May 2026 · Next recalibration: Sep 2026 · Version 1.0