Scoring Methodology — How Goler Scores Work

This document explains how Goler scores work, what they mean, and what they do not. It is intended for users, venue operators, and anyone who wants to understand the basis of a Goler verdict before relying on it.

The exact scoring formulas, weights, thresholds, model prompts, and abuse-prevention logic are not published in this document. We publish principles, evidence standards, signal categories, limitations, and the dispute process. Specifics are treated as proprietary to prevent manipulation of the system.

What Goler Measures and Does Not Measure

Goler analyzes reviews from several sources to produce two separate per-venue assessments: a Quality Score and a Risk Score. The two are computed independently and shown to the user side by side.

Review Sources

Public reviews — written on third-party platforms (for example, search engines and map services) and aggregated by Goler.
Internal Goler reviews — submitted directly through Goler by visitors. These reviews are not published. They are visible only to the venue (through the venue dashboard) and to the reviewer who wrote them. Internal reviews contribute to scoring only when they pass the same integrity, trust, and audit requirements as public reviews. Venue operators can use the dispute process to request a review of the evidence basis for any internal review affecting their score, subject to user privacy protections.
Structured user contributions — explicit signals such as dish likes, flags, and other product-level interactions that produce structured data without free-form text.

Goler Measures

The aggregate sentiment of visitors who left ratings or written reviews about a venue.
The presence and concentration of explicit, content-level claims about safety, hygiene, service, billing, and other operationally observable issues.
The recurrence of such claims across multiple independent reviewers.
The confidence we have in our own conclusion, given the size and consistency of available evidence.

Goler Does Not Measure

The intentions, ethics, or character of the venue's owners or staff.
Compliance with any specific regulatory framework (FDA, EU food law, local health codes). Our scores are not regulatory findings and do not certify or de-certify a venue under any legal regime.
Medical diagnoses. A suspected_food_poisoning_report signal reflects what reviewers wrote, not a clinical diagnosis.
Legal liability of any party.
The value of food, service, or experience as judged by professional critics.

A Goler score is an indicative consumer signal, not a regulatory, medical, or legal finding. It is meant to inform an individual decision before visiting a venue. It is not designed to be used as primary evidence in legal, regulatory, or medical contexts.

Quality vs Risk: Two Separate Questions

Most review platforms reduce a venue to a single number, a star average. This conflates two questions that call for different kinds of evidence.

Quality asks “Did most visitors enjoy this place?” It is answered through aggregate sentiment, ratings, and text.
Risk asks “Are there documented incidents that should change my decision?” It is answered only through specific claims made in review text.

Quality is a question about the population of visitors. It can be answered with any expression of sentiment, including a star rating without text. The strength and direction of feeling is what matters; the specific reasons are not required.

Risk is a question about specific events. It can only be answered by reviewers who explicitly described what happened. A 1-star rating without text tells us a visitor was unhappy but not what they were unhappy about. We cannot attribute that rating to food poisoning, harassment, slow service, or anything else, because the visitor did not say.

This distinction is intentional and is the central methodological choice in Goler. It means the two scores are computed from different evidence bases, and a venue can legitimately have a high Quality score and a high Risk score at the same time. They are not opposing measures of the same thing.

Why Risk Is Computed Only From Textual Claims

A Risk score is a statement about specific incidents. To make such a statement, we require evidence that names the incident type. A reviewer who wrote “I got food poisoning here” provides that evidence. A reviewer who left only a 1-star rating does not — they expressed unhappiness without specifying what kind.

If we counted every low rating as evidence of food poisoning, harassment, or any other risk category, we would be fabricating attributions the reviewer did not make. The reviewer's silence on the specifics is not consent for us to fill those specifics in.

The symmetric position holds for high ratings. A satisfied reviewer who did not write text did not testify “no safety incidents occurred during my visit.” They testified “I was satisfied,” which is a different statement.

Therefore Risk uses only reviews that produced extractable risk-relevant claims. Reviews without text are reflected in the Quality score, where their expression of sentiment is a sufficient and appropriate signal. We treat the two evidence bases as serving two different questions, and we do not mix them.

A consequence of this rule is that a popular venue with thousands of silent positive reviews and a small number of severe incident reports will not see those incident reports diluted by volume. We consider this consequence correct: documented incident reports are part of a venue's review record, and a large volume of unrelated positive sentiment does not erase them.
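The split between the two evidence bases can be sketched in a few lines. This is an illustration only, not Goler's implementation: the `Review` class, its field names, and the `risk_claims` list (standing in for the output of claim extraction) are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Review:
    stars: int
    text: str = ""
    # Hypothetical field: risk-relevant claims extracted from the text, if any.
    risk_claims: list[str] = field(default_factory=list)


def quality_evidence(reviews):
    # Every rating counts toward Quality, with or without text.
    return [r.stars for r in reviews]


def risk_evidence(reviews):
    # Only reviews that produced extractable risk-relevant claims count toward Risk.
    return [r for r in reviews if r.risk_claims]


# 1,000 silent 5-star ratings, one silent 1-star rating, one explicit incident report.
reviews = [Review(stars=5) for _ in range(1000)]
reviews.append(Review(stars=1))
reviews.append(
    Review(
        stars=1,
        text="I got food poisoning here",
        risk_claims=["suspected_food_poisoning_report"],
    )
)

print(len(quality_evidence(reviews)))  # 1002
print(len(risk_evidence(reviews)))     # 1
```

The silent 1-star rating appears only in the Quality evidence, and the thousand silent 5-star ratings never enter the Risk evidence, so they cannot dilute the single incident report.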

Risk Is Not an Incident-Rate Estimate

Risk is not an estimate of the probability that a visitor will experience harm. Goler does not have visit-level exposure data, medical verification, or follow-up data from reviewers who left no text.

Risk measures the severity, recurrence, and concentration of documented risk-related claims found in text-evaluable review evidence. Silent ratings are therefore not used as negative observations in Risk. Treating silence as evidence of no incident would require an assumption Goler cannot verify.

Goler does not infer safety from silence, and does not infer danger from silence.

Signals

A signal is a structured label assigned to a portion of a review by our engine. Signals are the unit of evidence in Goler's scoring.

A review goes through a pipeline that:

Extracts factual claims from the text.
Classifies each claim against a taxonomy of recognized signal types.
Records the assignment, the source claim, the source review, the model used, and the confidence of the assignment.
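The three steps above can be sketched as a toy pipeline. Everything here is an assumption made for illustration: the sentence-splitting extractor, the keyword table (a stand-in for the language-model classifier), the placeholder confidence, and the `Assignment` field names. It shows only the shape of the flow, not how Goler performs extraction or classification.

```python
from dataclasses import dataclass

# Hypothetical keyword table standing in for the real classifier.
KEYWORDS = {
    "food poisoning": "suspected_food_poisoning_report",
    "restroom was dirty": "dirty_restroom",
    "friendly": "friendly_staff",
}


@dataclass
class Assignment:
    signal_key: str
    claim: str
    review_id: str
    model: str
    confidence: float


def extract_claims(text):
    # Step 1 (toy): treat each sentence as one claim.
    return [s.strip() for s in text.split(".") if s.strip()]


def classify(claim):
    # Step 2 (toy): first keyword match wins; returns (key, confidence) or None.
    lowered = claim.lower()
    for phrase, key in KEYWORDS.items():
        if phrase in lowered:
            return key, 0.9  # placeholder confidence, not a real model output
    return None


def process(review_id, text, model="toy-keyword-v0"):
    # Step 3: record each assignment with its source claim, review, model, and confidence.
    records = []
    for claim in extract_claims(text):
        result = classify(claim)
        if result is not None:
            key, confidence = result
            records.append(Assignment(key, claim, review_id, model, confidence))
    return records


records = process("r1", "The staff were friendly. The restroom was dirty.")
print([r.signal_key for r in records])  # ['friendly_staff', 'dirty_restroom']
```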

Each signal carries:

A signal key identifying what the claim is about (for example, dirty_restroom, friendly_staff, suspected_food_poisoning_report).
A sentiment: positive, negative, or neutral.
A severity between 0 and 5, used only for negative signals to indicate how serious the underlying issue is.
A confidence between 0 and 1, indicating how certain the engine is about the assignment.
A reference back to the source review and the specific claim within it, retained for audit.
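One possible in-memory shape for a signal, with the range rules from the list above enforced at construction time, is sketched here. The class and field names are hypothetical; only the fields themselves and their ranges come from this document.

```python
from dataclasses import dataclass
from typing import Optional

SENTIMENTS = {"positive", "negative", "neutral"}


@dataclass(frozen=True)
class Signal:
    key: str            # e.g. "dirty_restroom"
    sentiment: str      # "positive", "negative", or "neutral"
    confidence: float   # 0 to 1
    review_id: str      # reference back to the source review
    claim_id: str       # reference back to the specific claim
    severity: Optional[float] = None  # 0 to 5, negative signals only

    def __post_init__(self):
        if self.sentiment not in SENTIMENTS:
            raise ValueError(f"unknown sentiment: {self.sentiment!r}")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")
        if self.sentiment == "negative":
            if self.severity is None or not 0 <= self.severity <= 5:
                raise ValueError("negative signals need a severity between 0 and 5")
        elif self.severity is not None:
            raise ValueError("severity is used only for negative signals")


ok = Signal(
    key="dirty_restroom",
    sentiment="negative",
    confidence=0.8,
    review_id="r1",
    claim_id="r1-c2",
    severity=2,
)
```

Rejecting a severity on a positive or neutral signal keeps the "used only for negative signals" rule from being silently ignored further down the line.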

Signals are the atomic level at which Goler reasons. Score formulas operate over signals, not over raw review text.

The current taxonomy contains approximately 90 signal types, covering food and drink, service and staff, cleanliness and safety, value and billing, and experience and operations. The taxonomy is grown from observed patterns in real review data rather than authored top-down, and it evolves over time as new patterns emerge.

Severity Tiers

Negative signals are grouped into four severity tiers. Tiers determine how strongly a signal contributes to Risk and to category breakdowns.

CRITICAL — Direct exposure to acute harm: ingestion, physical safety. Examples: suspected_food_poisoning_report, suspected_spoiled_food, undercooked_high_risk_food, harassment_or_aggressive_behaviour, foreign_object_in_food.
HIGH — Significant risk indicator with strong correlation to harm. Examples: cross_contamination_suspected, expired_or_unsafe_ingredient, physical_intimidation.
MEDIUM — Meaningful concern but indirect: facility hygiene, information failures. Examples: dirty_restroom, allergen_info_missing_or_unclear, unpleasant_odour_unsanitary.
LOW — Mild concern that contributes to overall picture but does not on its own indicate hazard. Examples: dirty_dining_area, loud_environment_uncomfortable.
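The tier assignments listed above can be held in a simple lookup table. In this sketch the integer values only express ordering (LOW below CRITICAL); they are not scoring weights, which this document does not publish. The `Tier` enum and the `worst_tier` helper are hypothetical names.

```python
from enum import IntEnum


class Tier(IntEnum):
    # Values give ordering only; they are not scoring weights.
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


# The example signals named in the tier list above.
TIER_BY_SIGNAL = {
    "suspected_food_poisoning_report": Tier.CRITICAL,
    "suspected_spoiled_food": Tier.CRITICAL,
    "undercooked_high_risk_food": Tier.CRITICAL,
    "harassment_or_aggressive_behaviour": Tier.CRITICAL,
    "foreign_object_in_food": Tier.CRITICAL,
    "cross_contamination_suspected": Tier.HIGH,
    "expired_or_unsafe_ingredient": Tier.HIGH,
    "physical_intimidation": Tier.HIGH,
    "dirty_restroom": Tier.MEDIUM,
    "allergen_info_missing_or_unclear": Tier.MEDIUM,
    "unpleasant_odour_unsanitary": Tier.MEDIUM,
    "dirty_dining_area": Tier.LOW,
    "loud_environment_uncomfortable": Tier.LOW,
}


def worst_tier(signal_keys):
    # Highest tier among the given signals, or None if none of them is tiered.
    tiers = [TIER_BY_SIGNAL[k] for k in signal_keys if k in TIER_BY_SIGNAL]
    return max(tiers) if tiers else None


print(worst_tier(["dirty_restroom", "physical_intimidation"]).name)  # HIGH
```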

Severity assignment is a methodological judgment, not an objective measurement. It reflects how seriously we believe a given signal type should weigh on a consumer decision. Severity assignments reflect Goler's current methodology and may be revised as the methodology evolves. We acknowledge other reasonable methodologies could assign different tiers. We intend to publish additional per-signal rationale over time.

If you believe a specific severity assignment is incorrect, see the Dispute Policy.

How Scores Are Computed (Conceptually)

The exact formulas, weights, and thresholds are not published. The high-level behavior of each score is described below.

Quality Score

Quality combines:

The aggregate distribution of star ratings across all reviews, including reviews without text.
The balance of positive and negative quality signals extracted from review text, weighted by confidence and reviewer trust.
A sample-size correction that reduces score volatility for venues with very few reviews.
A floor mechanism ensuring that when a large fraction of reviewers express dissatisfaction, the score reflects that even when individual complaints are mild.

The score is normalized to a 0–100 scale, where higher is better.
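To make the idea of a sample-size correction concrete, here is one common technique, a Bayesian-style average that pulls a small sample toward a prior mean. It is a generic illustration and is not claimed to be the correction Goler uses; the prior mean of 3.5, the prior weight of 10, and the linear mapping from stars to a 0 to 100 scale are all arbitrary assumptions.

```python
def shrunk_mean(ratings, prior_mean=3.5, prior_weight=10):
    # Behaves as if `prior_weight` extra reviews at `prior_mean` were always present.
    return (prior_mean * prior_weight + sum(ratings)) / (prior_weight + len(ratings))


def to_0_100(mean_stars):
    # Assumed linear mapping: 1 star -> 0, 5 stars -> 100.
    return (mean_stars - 1) / 4 * 100


few = shrunk_mean([5, 5])        # two 5-star reviews
many = shrunk_mean([5] * 200)    # two hundred 5-star reviews

print(few)             # 3.75
print(to_0_100(few))   # 68.75
print(round(many, 2))  # 4.93
```

With two reviews the estimate stays close to the prior; with two hundred it approaches the observed average, which is the volatility-reducing behavior described above.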

Risk Score

Risk combines:

The number and severity of risk-relevant signals extracted from review text.
A confidence weight per signal, from the classifier.
A density measure: how concentrated risk signals are within the population of reviews that produced any extractable claim.
A confirmation multiplier that gives weight to recurrence — a single report carries less weight than several independent reports of the same kind.
A reduction applied when the venue's owner has publicly responded to the review that produced the signal.

The score is normalized to a 0–100 scale, where higher means more risk indicators present in the reviewed evidence.
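Two of the inputs named above, density and recurrence, can be computed without any knowledge of the unpublished formula. The sketch below stops at those inputs and deliberately does not combine them into a score. Function names and the input shapes are assumptions for the example.

```python
from collections import defaultdict


def risk_density(claim_bearing_review_ids, risk_review_ids):
    # Share of claim-bearing reviews that carry at least one risk signal.
    population = set(claim_bearing_review_ids)
    if not population:
        return 0.0
    return len(population & set(risk_review_ids)) / len(population)


def independent_reports(signals):
    # signals: iterable of (signal_key, reviewer_id) pairs.
    # Counts distinct reviewers per signal key, so a repeat from the same
    # reviewer does not count as independent confirmation.
    reviewers = defaultdict(set)
    for key, reviewer_id in signals:
        reviewers[key].add(reviewer_id)
    return {key: len(ids) for key, ids in reviewers.items()}


claim_bearing = [f"r{i}" for i in range(1, 21)]  # 20 reviews with extractable claims
with_risk = ["r3", "r7", "r9"]

print(risk_density(claim_bearing, with_risk))  # 0.15

print(independent_reports([
    ("suspected_food_poisoning_report", "u1"),
    ("suspected_food_poisoning_report", "u2"),
    ("suspected_food_poisoning_report", "u2"),  # same reviewer again
    ("dirty_restroom", "u3"),
]))
# {'suspected_food_poisoning_report': 2, 'dirty_restroom': 1}
```

The density denominator is the set of reviews that produced any extractable claim, not all reviews, which mirrors the rule that silent ratings are not used as negative observations in Risk.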

Category Breakdown

The Quality and Risk scores answer high-level questions about a venue overall. They do not, on their own, tell a user where in the visitor experience the venue is succeeding or struggling. The category breakdown exists to answer that.

Each venue is broken down into five product-meaningful categories: food and drink, service and staff, cleanliness and safety, value and billing, and experience and operations.

Categories are not a parallel scoring system. They are a deeper view of the same evidence, routed by topic. Every signal that contributes to Quality or Risk also contributes to exactly one category, based on what the underlying claim is about.

Each category receives its own score, a confidence indicator (reflecting how much category-relevant evidence was available), the top positive and negative signals contributing to it, and a risk flag when category-relevant risk signals are present.

The purpose of this breakdown is consumer-actionable insight. A user looking at a venue with Quality 70 and Risk 40 still does not know whether the issues are in food, in service, or in cleanliness. The category breakdown answers that. For venue operators, the breakdown serves the same purpose in reverse: it identifies which area to address first.
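The routing rule described in this section, each signal contributing to exactly one category, amounts to a single-valued lookup. In the sketch below the three signal-to-category assignments are guesses made for illustration; the document does not publish the actual mapping.

```python
from collections import defaultdict

# Assumed assignments for illustration; a dict guarantees one category per signal key.
CATEGORY_BY_SIGNAL = {
    "friendly_staff": "service_and_staff",
    "dirty_restroom": "cleanliness_and_safety",
    "loud_environment_uncomfortable": "experience_and_operations",
}


def route(signal_keys):
    # Group signals by category. An unmapped key raises KeyError rather than
    # being dropped, so no signal can silently fall outside the breakdown.
    buckets = defaultdict(list)
    for key in signal_keys:
        buckets[CATEGORY_BY_SIGNAL[key]].append(key)
    return dict(buckets)


print(route(["friendly_staff", "dirty_restroom", "dirty_restroom"]))
# {'service_and_staff': ['friendly_staff'], 'cleanliness_and_safety': ['dirty_restroom', 'dirty_restroom']}
```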

Confidence and Traceability

Every score includes a confidence indicator reflecting sample size and signal consistency. A score derived from 5 reviews is not presented as having the same authority as a score derived from 5,000.

Every score is traceable: the engine retains the chain from the final score to the contributing signals, and from each signal to its source review.

Integrity Check

Alongside the scoring engines, Goler runs an Integrity Check over the same review data. Its purpose is to surface possible problems in the evidence itself — patterns that look atypical, coordinated, solicited, automated, or otherwise less representative of normal visitor behavior — so the user can read the score with appropriate context.

The Integrity Check does not participate in scoring. It does not raise or lower Quality, Risk, or category scores. It produces a separate advisory shown to the user alongside the scores. This separation is intentional: a score is our reading of the evidence; an integrity flag is a comment on the evidence itself. We keep them in different layers so that a flagged advisory does not silently distort a venue's score, and so that operators can challenge integrity flags independently of scoring decisions.

Integrity flags are not findings of fraud, misconduct, or bad faith. They mark patterns that warrant additional context for the reader; they do not assert any party's intent.

What It Measures

Three detectors run independently and feed into a combined advisory:

Temporal anomalies (review bursts). Concentrations of reviews in unusually short time windows compared to the venue's baseline review rate. A spike of 23 reviews in 8 days at a venue that normally receives one review every two days is flagged. The detector reports the date range, review count, baseline rate, burst multiplier, and the sentiment skew within the burst.
Reviewer profile anomalies. Patterns in reviewer behavior that deviate from typical engagement — for example, contributors who only ever post 5-star ratings, contributors whose review activity diverges from rating activity, or unusually concentrated authorship within a small window.
Signal contradictions. Internal inconsistencies within a single review, for example a review that claims a place is unsafe and also calls it the safest place the reviewer has ever visited. Such reviews are flagged because their evidentiary weight is ambiguous.
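The burst example above (23 reviews in 8 days against a baseline of one review every two days) works out as follows. The function name is hypothetical, and no flagging threshold is shown because the document does not publish one.

```python
def burst_multiplier(review_count, window_days, baseline_per_day):
    # Observed reviews divided by the number expected at the baseline rate.
    expected = window_days * baseline_per_day
    return review_count / expected


# One review every two days is a baseline of 0.5 reviews per day,
# so 8 days would normally bring about 4 reviews.
print(burst_multiplier(23, 8, 0.5))  # 5.75
```

In this example the venue received 5.75 times its usual volume during the window.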

How the Advisory Is Built

Each detector produces zero or more concerns, each with a severity (typically clean, suspicious, or high) and a short structured description.

The concerns are aggregated into a single integrity severity for the venue. The advisory output also includes:

A list of structured concerns with evidence (date ranges, counts, ratios).
A balanced list of possible explanations — both benign (a venue reopened, a viral social media moment, a seasonal tourist surge) and adversarial (coordinated campaign, solicited reviews). We do not assert which explanation is correct; we present the full plausible space and let the user decide.
A short user-facing caution text that summarizes the situation in plain language.
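The document does not say how individual concerns are aggregated into a single integrity severity. One simple rule, shown here purely as an assumption, is to take the most severe concern and to report clean when there are none.

```python
# Severity labels from this section, ordered from least to most severe.
ORDER = ["clean", "suspicious", "high"]


def overall_severity(concern_severities):
    # Assumed rule: the venue-level severity is the worst individual concern.
    if not concern_severities:
        return "clean"
    return max(concern_severities, key=ORDER.index)


print(overall_severity([]))                                    # clean
print(overall_severity(["suspicious", "high", "suspicious"]))  # high
```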

How It Is Surfaced

When the integrity severity is non-clean, the user sees a caution note alongside the venue's scores. Typical wording: “Unusual review patterns detected on multiple occasions. We don't know whether they're benign (an event, a feature, a reopening) or solicited. Read accordingly.”

The caution does not change the numerical scores. It changes how the user is invited to interpret them.

Why Not Bake Integrity Into the Score

Folding integrity flags into the score would make a single number do two different jobs at once: rate the venue and rate our confidence in our own evidence. Those are different statements. We keep them separate so that a venue with a high score and a flagged integrity advisory remains visibly both — a high-scoring venue and a venue whose evidence has unusual patterns the user should know about. The user reads both. Neither hides the other.

Audit and Reproducibility

For any score Goler displays, the engine retains:

The version of the engine that produced it.
The list of reviews considered.
The claims extracted from each review.
The signal classifications applied to each claim, with the model used and confidence assigned.
Any integrity flags (suspected review bursts, suspected reviewer profile anomalies, internal contradictions).
The direction and relative importance of the signals that materially influenced the final scores.

This is sufficient to reconstruct any verdict and to identify which reviews and which classifications drove a specific outcome. The chain is preserved per scoring run and is versioned.

Goler retains the underlying audit data needed to reconstruct a venue score. During disputes, we may use this data to produce a venue-specific evidence report where operationally and legally appropriate.
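A retained audit record of this kind could be modeled roughly as below. The class names, field names, and version string are hypothetical; the sketch only shows how keeping classifications alongside their source review lets a signal be traced back to the reviews behind it.

```python
from dataclasses import dataclass, field


@dataclass
class Classification:
    review_id: str
    claim: str
    signal_key: str
    model: str
    confidence: float


@dataclass
class ScoringRun:
    engine_version: str
    review_ids: list[str]
    classifications: list[Classification] = field(default_factory=list)

    def reviews_behind(self, signal_key):
        # Trace a signal type back to the distinct source reviews that produced it.
        return sorted(
            {c.review_id for c in self.classifications if c.signal_key == signal_key}
        )


run = ScoringRun(
    engine_version="risk-v3",  # placeholder label, not a real build identifier
    review_ids=["r1", "r2", "r3"],
    classifications=[
        Classification("r1", "restroom was filthy", "dirty_restroom", "model-a", 0.91),
        Classification("r3", "toilets not cleaned", "dirty_restroom", "model-a", 0.84),
        Classification("r2", "lovely waiter", "friendly_staff", "model-a", 0.95),
    ],
)

print(run.reviews_behind("dirty_restroom"))  # ['r1', 'r3']
```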

Engine Versioning

Goler scores are produced by versioned engines. Each scoring run is associated with engine, strategy, timestamp, and build metadata retained in the audit trail.

Current scoring engines:

Risk Engine v3 — density-based scoring with severity tiers and confirmation multipliers.
Quality Engine v4 — sentiment aggregation with sample-size correction and a negative-share floor.

Engine version changes are deliberate and infrequent. When the engine is updated, prior scores remain attributable to the version that produced them; we do not silently rewrite history. Material methodology changes are reflected in the version of this document and in a public changelog (planned).

Limitations

We list the limitations of our approach openly because we want users, operators, and regulators to understand the boundaries of what Goler scores can and cannot tell them.

Sample Bias

Reviews are a self-selected sample of visitors. Visitors with strong feelings — positive or negative — write more often than visitors with neutral experiences. Aggregate scores reflect the opinions of those who chose to leave a record, not a representative sample of all customers.

Language Coverage

Our extraction pipeline is most accurate on English text. Reviews in other languages are translated before extraction; translation quality affects classification accuracy. Reviews without any usable text are excluded from Risk computation entirely.

Time Window

Only reviews written within the last three years are considered for scoring. Older reviews are excluded entirely, on the basis that venue operations, ownership, and management can change substantially over time and very old evidence becomes unreliable. Within the three-year window, however, all reviews are currently treated equally — a review from 35 months ago is weighted the same as one from last month. Recency-weighted scoring within the window is a known area for improvement and is on our roadmap.

This limitation has a practical consequence: a venue that recently changed chef, ownership, or management may be unfairly weighed down by older evidence reflecting a prior state of the business. Operators in this situation can use the Dispute Policy to file a free, self-service report documenting that the underlying issue has been resolved. Confirmed resolution reports are recorded in the venue's audit trail and are factored into how older evidence is interpreted, providing a path for venues to demonstrate change without waiting for the time window to advance on its own.
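The current window behavior, a hard three-year cutoff with equal weighting inside it, can be sketched as follows. Approximating three years as 3 × 365 days is an assumption of this example; how Goler computes the boundary is not stated.

```python
from datetime import date, timedelta

WINDOW = timedelta(days=3 * 365)  # approximation of "three years"


def in_window(review_dates, today):
    # Keep only reviews written on or after the cutoff date.
    cutoff = today - WINDOW
    return [d for d in review_dates if d >= cutoff]


today = date(2025, 6, 1)
dates = [
    date(2025, 5, 1),  # last month: kept
    date(2022, 7, 1),  # about 35 months old: kept
    date(2021, 1, 1),  # older than the window: excluded
]

kept = in_window(dates, today)
weights = [1.0 for _ in kept]  # every retained review carries the same weight

print(kept)     # [datetime.date(2025, 5, 1), datetime.date(2022, 7, 1)]
print(weights)  # [1.0, 1.0]
```

A recency-weighted variant would replace the constant weights with a function of each review's age; that change is described above as a roadmap item, so it is not sketched here.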

LLM Extraction Errors

Signal classification is performed by language models, which are imperfect. False positives (a claim classified as a risk signal when the original text did not warrant it) and false negatives (a real risk signal missed) both occur. We monitor these error rates internally; we do not claim them to be zero.

Selection of Source Reviews

Each review source has its own gaps and biases. Public review platforms over-represent visitors with strong opinions and under-represent demographics that do not engage with those platforms. Internal Goler reviews over-represent visitors who chose to engage with the Goler product. We use multiple sources in part to offset these biases, but no combination of sources eliminates them.

Severity Is Judgment, Not Measurement

As stated in Severity Tiers, the assignment of a signal to CRITICAL, HIGH, MEDIUM, or LOW is a methodological choice. Reasonable people may disagree with specific assignments. We provide a dispute mechanism for substantive challenges.

No Causal Inference

Even when many reviewers report similar issues, we do not claim to have established that the venue caused the issues. A signal is a record of what reviewers said, not a confirmed factual finding about venue operations.

Indicative, Not Certifying

A Goler score is a consumer-facing signal. It is not a substitute for a regulatory inspection, a medical investigation, or a legal proceeding. Decisions with significant safety, financial, or legal consequences should not be made on a Goler score alone.

Disputes

If you operate a venue and believe a specific Goler score, signal, or signal severity assignment misrepresents reality, the formal dispute process is available.

See the Dispute Policy for the full process, criteria, and timelines.

Document Versioning

This document will change as the methodology evolves. Each material change is reflected in a public changelog (planned).

Anticipated near-term updates:

Per-signal rationale references (food safety, hygiene, accessibility standards).
Recency-weighted scoring (when introduced).
Public changelog format.

We make no promises about a fixed update cadence; we will publish methodology changes when they happen and identify them clearly.

Contact

For dispute-related inquiries, see the Dispute Policy or email legal@goler.co.