Apples to Apples — by Problem, by Track, or by Abstracted Rubric

Comparing a doctor-appointment app to a satellite orbit modeler is malpractice. The systemic answer takes one of three forms: single-problem competitions, explicit tracks, or a single uniform rubric pitched one level above the problem domain.

Growing · Principle 2 · Apples to apples · Last updated 2026-05-03

Judging should compare apples to apples. Comparing a doctor-appointment app to a satellite orbit modeler is malpractice; comparing a small student team's software project to a funded startup's pre-existing hardware rover is malpractice of a different kind. The systemic answer takes one of three forms (single-problem competitions, explicit tracks, or a single uniform rubric pitched one level above the problem domain), and the strongest hackathons in the world have already converged on these three rather than improvising new rubrics for each event.

The case that cross-domain judging is unreliable rests on more than intuition. MLH's own 2014 organizer essay, written from inside the science-fair format the league standardized for student events, names the bias plainly: hardware projects attract more attention from judges than software projects of equivalent quality, simply because they are more visually interesting. The Eventflare judging guide cites a University of Southern California study finding that judges tend to be biased toward teams that are more extroverted, more confident, and more polished in their presentations, a confound that worsens sharply with cross-domain variance because participants in different domains learn different presentation conventions. WildHacks at Northwestern publishes its full statistical scoring methodology openly, including a Bayesian shrinkage model that adjusts raw scores for inter-judge variance and a median-after-dropping-outliers tiebreaker that the organizers note is taken directly from the National Speech and Debate Association national circuit. Carl Domingo, writing in MLH's organizer email series, concedes that submitted projects span a range broad enough to make finding a single best hack hard. When organizers as sophisticated as WildHacks build Bayesian shrinkage models to recover fairness from raw cross-domain scoring, the underlying claim, that raw cross-domain scoring is unreliable, is no longer in dispute.
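The two mechanisms named above, shrinkage of judge effects and a median taken after dropping outliers, can be sketched in a few lines of Python. This is a simplified stand-in, not WildHacks' published model: the function names, the `prior_strength` parameter, the n / (n + prior_strength) shrinkage weight, and the choice to drop exactly one high and one low score are all assumptions made for illustration. The judge offset is also only meaningful if judges see roughly comparable mixes of teams.

```python
from statistics import mean, median


def shrunken_judge_offsets(scores, prior_strength=2.0):
    """Estimate how harsh or lenient each judge is.

    scores: list of (judge, team, raw_score) tuples.
    A judge's offset is their mean deviation from the grand mean, shrunk
    toward zero by n / (n + prior_strength). Judges with few scores are
    corrected less, because their apparent harshness is less certain.
    (Hypothetical simplification of a shrinkage model.)
    """
    grand_mean = mean(s for _, _, s in scores)
    deviations = {}
    for judge, _, s in scores:
        deviations.setdefault(judge, []).append(s - grand_mean)
    return {
        judge: (len(d) / (len(d) + prior_strength)) * mean(d)
        for judge, d in deviations.items()
    }


def adjusted_team_scores(scores, prior_strength=2.0):
    """Average each team's scores after removing the shrunken judge offsets."""
    offsets = shrunken_judge_offsets(scores, prior_strength)
    per_team = {}
    for judge, team, s in scores:
        per_team.setdefault(team, []).append(s - offsets[judge])
    return {team: mean(vals) for team, vals in per_team.items()}


def trimmed_median(values):
    """Tiebreaker: drop the single highest and lowest score, then take the median.

    With fewer than three scores nothing is dropped.
    """
    ordered = sorted(values)
    if len(ordered) >= 3:
        ordered = ordered[1:-1]
    return median(ordered)
```

With two judges who each score the same two teams, one consistently two points harsher than the other, setting `prior_strength=0` removes the gap entirely, while the default of 2.0 removes only half of it. That is the trade the shrinkage weight expresses: how much evidence a judge must supply before the model trusts that the difference is the judge and not the projects.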

The first valid architecture is the single-problem competition. Every team solves for the same thing; the rubric is the problem; comparison is direct because the variance the rubric has to absorb is small. Kaggle competitions and DREAM Challenges are the canonical examples, and the architecture's appeal is that the apples-to-apples question never arises because there are no oranges. The cost is that creativity narrows to fit the metric the problem chose, and participants whose strengths sit outside that metric find the format unkind.

The second is explicit tracks, with each track judged against its own rubric and prize pool. NASA Space Apps publishes twenty to thirty distinct Challenge Statements per year and awards in ten named categories, with judges assigned to challenges they have domain context for. ETHGlobal uses partner bounties as parallel evaluation tracks, each sponsor judging their bounty against their own narrow criteria. HackDuke's Code for Good runs four named tracks (Education, Energy and Environment, Inequality, Health) with separate winners per track and a majority-novice sub-prize within each. Smart India Hackathon publishes more than a thousand problem statements per year, each owned by a specific Ministry or PSU, each with a defined theme bucket. The cost of this architecture is track imbalance: if one track attracts twenty teams and another attracts three, the prize pools feel mis-calibrated. The mitigation is publishing track-specific rubrics in advance and sizing prizes to expected participation rather than to organizer preference.
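Sizing prizes to expected participation is simple arithmetic, and a rough sketch makes the trade-off visible. The function name, the per-track `floor`, and the team counts below are hypothetical; no event cited here is known to use this exact formula.

```python
def size_prize_pools(total_pool, expected_teams, floor=0.0):
    """Split total_pool across tracks in proportion to expected team counts.

    expected_teams: dict mapping track name -> expected number of teams.
    floor: minimum guaranteed to every track before the proportional split
           (an assumption, so that a small track still has a meaningful prize).
    """
    if floor * len(expected_teams) > total_pool:
        raise ValueError("per-track floors exceed the total pool")
    total_teams = sum(expected_teams.values())
    if total_teams <= 0:
        raise ValueError("expected participation must be positive")
    remainder = total_pool - floor * len(expected_teams)
    return {
        track: floor + remainder * n / total_teams
        for track, n in expected_teams.items()
    }


# Illustrative numbers only: twenty teams expected in one track, three in another.
pools = size_prize_pools(
    10_000,
    {"Health": 20, "Education": 7, "Energy and Environment": 3},
    floor=1_000,
)
```

The floor is the design choice worth arguing about: set it to zero and the three-team track gets a token prize, set it too high and the split drifts back toward organizer preference.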

The third is the architecture added in v0.3 of this principle: a single uniform rubric pitched one level above the problem domain so that responses to non-commensurate problems can be evaluated against shared dimensions. The Google Solution Challenge runs the canonical version. Teams across more than 110 countries pick whichever of the United Nations' seventeen Sustainable Development Goals they wish to address, but every entry is judged against the same fifty-point rubric: twenty-five points for Impact (problem statement clarity, SDG target specificity, user feedback and iteration, success metrics, next-steps plan) and twenty-five points for Technology (architecture, scalability, demo quality, technical decisions). The rubric makes apples-to-apples comparison possible by abstracting one level above the problem domain, asking not "did the team solve this specific problem well" but "did the team identify a real problem, ground it in evidence, and ship a technically sound response." The architecture only works when that abstraction holds. If the rubric leaks problem-specific dimensions (if the technical criterion implicitly favors backend scaling over interface design, or if the impact criterion implicitly favors quantitative metrics over qualitative ones), the abstraction collapses and the format reduces to the original cross-domain bias problem the architecture exists to address.
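The structural point of the abstracted rubric, that the problem domain is recorded but never scored, can be shown with a small sketch. Only the 25/25 split between Impact and Technology comes from the description above; the `Entry` class, its field names, and the sample teams are hypothetical, and the per-criterion point breakdown is deliberately left out because it is not given here.

```python
from dataclasses import dataclass

IMPACT_MAX = 25      # points available for Impact
TECHNOLOGY_MAX = 25  # points available for Technology


@dataclass(frozen=True)
class Entry:
    team: str
    sdg: int            # which of the 17 goals the team chose; never used in scoring
    impact: float       # judge-assigned, 0..IMPACT_MAX
    technology: float   # judge-assigned, 0..TECHNOLOGY_MAX


def total_score(entry):
    """Return the fifty-point total. The chosen SDG is validated but not scored."""
    if not 1 <= entry.sdg <= 17:
        raise ValueError(f"{entry.team}: SDG must be between 1 and 17")
    if not 0 <= entry.impact <= IMPACT_MAX:
        raise ValueError(f"{entry.team}: impact out of range")
    if not 0 <= entry.technology <= TECHNOLOGY_MAX:
        raise ValueError(f"{entry.team}: technology out of range")
    return entry.impact + entry.technology


def rank(entries):
    """Rank entries from different domains on the shared total alone."""
    return sorted(entries, key=total_score, reverse=True)


# Hypothetical entries addressing different goals, compared on one scale.
entries = [
    Entry("clinic-scheduler", sdg=3, impact=21, technology=18),
    Entry("soil-sensor", sdg=2, impact=17, technology=23),
    Entry("literacy-app", sdg=4, impact=19, technology=19),
]
ranking = [e.team for e in rank(entries)]
```

A leak in the sense described above would show up as `total_score` needing to read `sdg`, or as a criterion whose scores correlate with domain rather than with quality; the sketch cannot detect the second kind, which is why the abstraction has to be checked in the rubric wording itself.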

The anti-architecture, which the rest of this site treats as a named failure mode, is themeless open innovation judged against a single equal-weight rubric across non-commensurate work. The format is common, especially among first-time student events that inherit a rubric template without thinking through the domain variance their event will actually attract. The result is the science-fair-bias literature in compressed form: the hardware projects win the room, the extroverted presenters win the panel, and the participants who came to build something thoughtful and software-shaped go home remembering that judging felt random because, structurally, it was. The fix is not better judges or more careful calibration sessions, though both help at the margin. The fix is to choose one of the three architectures above and design the event around it.

The single-problem, explicit-track, and abstracted-rubric formats coexist; they are not in tension with one another, and well-organized hackathons sometimes blend them, for example a single-problem core with explicit sub-tracks for implementation approach, or an abstracted rubric supplemented by track-specific bonuses. The discipline that holds across all three is the same: every team's work must be evaluable against the same rubric as every other team's work it competes with. The format taxonomy catalogues which of the ten working hackathon archetypes lean toward which architecture, and the case studies on NASA Space Apps, the Google Solution Challenge, and Smart India Hackathon work each architecture in detail. The closing principle in this set, integrity through convergence, covers what fair judging looks like when the rubric problem is harder still: when there is no ground truth available on the day of judging at all, and the integrity of the result has to be argued from convergence across independent teams rather than from rubric alignment.