SeedFrontier.ai

Banking77 breakthrough

Intent classification result

Banking77 is harder than it looks.

Noisy labels. 77 intents. Real overlap. SeedFrontier just hit 94.42% on the official Banking77 test set, up 0.59 percentage points over baseline, under a strict full-train protocol with no leakage.

~225 ms. ~68 MiB. Small enough to run on-robot. At roughly four inferences per second, fast enough for task-level decisions made on board. This is the shape humanoid robotics actually needs.

94.42%

official Banking77 test accuracy

Measured on the official test set

+0.59pp

over baseline

Meaningful lift under the same evaluation frame

~225 ms

inference

Runtime kept visible instead of hand-waved away

~68 MiB

model footprint

Compact enough to matter in deployment discussions

Result snapshot

What matters

Official test set

Accuracy

94.42%

official Banking77 test score

Delta

+0.59pp

improvement over baseline

Runtime

~225 ms

inference

Footprint

~68 MiB

model size

Difficulty profile: 77 intents

Label noise: high

Intent overlap: high

Production viability: strong

Why Banking77 is hard

Noisy labels

Intent data is messy. Banking77 does not give you a clean synthetic path to good-looking results.

77 intents

This is a large intent inventory with plenty of room for confusion across semantically adjacent classes.

Real overlap

The hard part is not just class count. It is the genuine semantic overlap between user intents.

The headline

A benchmark result that reads like engineering, not marketing.

The point of this page is not to shout “state of the art” in the abstract. It is to show a result that is specific, measurable, and anchored to a known benchmark with visible operating characteristics.

94.42% on the official Banking77 test set, with a strict full-train protocol, no leakage, roughly 225 milliseconds inference, and a footprint around 68 MiB.
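The headline numbers can be unpacked with a little arithmetic. The sketch below uses only the two published figures (94.42% and +0.59pp) to recover the implied baseline and the share of baseline errors removed. The function names are illustrative, and the derived values inherit whatever rounding went into the published figures.

```python
def implied_baseline(accuracy_pct: float, delta_pp: float) -> float:
    """Baseline accuracy implied by a reported score and its lift."""
    return accuracy_pct - delta_pp


def relative_error_reduction(accuracy_pct: float, delta_pp: float) -> float:
    """Fraction of the baseline's errors that the lift removes."""
    baseline_error_pct = 100.0 - implied_baseline(accuracy_pct, delta_pp)
    return delta_pp / baseline_error_pct


ACCURACY_PCT = 94.42  # published test accuracy
DELTA_PP = 0.59       # published lift in percentage points

baseline = implied_baseline(ACCURACY_PCT, DELTA_PP)
reduction = relative_error_reduction(ACCURACY_PCT, DELTA_PP)

print(f"implied baseline: {baseline:.2f}%")              # about 93.83%
print(f"relative error reduction: {reduction:.1%}")      # about 9.6%
```

Read this way, a lift of 0.59 percentage points removes roughly one in ten of the errors the baseline was still making.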

Protocol

Credibility comes from the rules around the number.

The landing page needs to make the evaluation discipline obvious. That is what separates a real benchmark statement from a dressed-up experiment.

Official test set

The headline number is the real external score, not a cherry-picked split or internal holdout.

Strict full-train protocol

The result was produced with a disciplined training setup rather than ad hoc experimentation.

No leakage

The page makes credibility explicit. No contamination, no hidden shortcut, no soft benchmark framing.
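One simple way to audit a no-leakage claim is to look for test queries that also appear in the training split after light normalization. The sketch below is an assumption about how such a check could look, not a description of the check SeedFrontier ran. The example queries are invented for illustration, and the check only catches near-verbatim duplicates, not paraphrases.

```python
import re


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def find_overlap(train_texts, test_texts):
    """Return test queries whose normalized form also occurs in train."""
    train_seen = {normalize(t) for t in train_texts}
    return [t for t in test_texts if normalize(t) in train_seen]


# Invented toy queries, not taken from Banking77.
train = ["How do I activate my card?", "Why was my transfer declined?"]
test = ["how do i activate my card", "Where is my refund?"]

leaked = find_overlap(train, test)
print(leaked)  # ['how do i activate my card']
```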

Measured runtime + size

Accuracy is paired with inference and memory footprint so the result reads like a deployable system.
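Runtime and footprint figures are only comparable when they are measured the same way each time. The sketch below shows one repeatable approach: warm up, time several calls, report the median, and convert bytes to MiB explicitly. The `fake_predict` function is a stand-in for a real classifier, and the warmup and run counts are arbitrary assumptions.

```python
import statistics
import time

MIB = 1024 * 1024  # one mebibyte in bytes


def measure_latency_ms(predict, sample, warmup=3, runs=20):
    """Median wall-clock latency of predict(sample) in milliseconds."""
    for _ in range(warmup):  # discard cold-start effects
        predict(sample)
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        predict(sample)
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)


def bytes_to_mib(n_bytes: int) -> float:
    """Convert a byte count to MiB (base 1024, not base 1000)."""
    return n_bytes / MIB


def fake_predict(text: str) -> int:
    """Hypothetical stand-in that returns one of 77 intent ids."""
    return len(text) % 77


latency_ms = measure_latency_ms(fake_predict, "I lost my card")
print(f"median latency: {latency_ms:.4f} ms")
print(f"footprint: {bytes_to_mib(68 * MIB):.0f} MiB")
```

The unit matters when comparing against budgets quoted in MB: 68 MiB is about 71.3 MB.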

Why this matters

The result is strong because the story is disciplined.

This is a stronger public story than a vague AutoML claim because the benchmark, delta, and constraints are concrete.
The result says SeedFrontier can push intent classification quality without disappearing into giant-model handwaving.
The combination of score, protocol discipline, runtime, and footprint gives the page a credible engineering tone.

Messaging line

Banking77 is not easy. That is exactly why this result is worth publishing.

Robotics

This is the shape humanoid robotics actually needs.

Tesla Optimus. Boston Dynamics. Figure. 1X. Agility Robotics. Every humanoid and legged platform being built right now is compute-starved. A giant cloud-scale model does not fit in a robot torso powered by a battery. A ~68 MiB model running in ~225 ms does.

That is not a coincidence. It is the frontier these platforms actually live on — and it is where SeedFrontier is built to operate.

Compute is trapped on-robot

Humanoid platforms run on embedded accelerators with thermal and power budgets measured in watts. Cloud offload is not an option when control loops run at kilohertz rates.

Latency is standing vs. falling

Balance, locomotion, and manipulation policies run anywhere from 30 Hz to 1 kHz, which leaves between roughly 33 ms and 1 ms per control tick. Every millisecond of inference is a direct physical constraint on what the robot can actually do.

Battery budgets punish bloat

Every joule spent on inference is a joule not spent on actuators. Smaller, faster models extend runtime and make concurrent on-robot skills possible in the first place.

The on-robot control stack

Every layer is a deployment constraint.

A humanoid runs many models at once, at wildly different frequencies and sizes. Large general models fit only the slowest layer. Everything below it needs small, specialized, production-shaped models.

Layer | Frequency | Size budget
Balance and whole-body control | 500–1000 Hz | under 10 MB
Joint-level control | 1000+ Hz | under 1 MB
Locomotion policy | 50–200 Hz | 1–50 MB
Manipulation policy | 30–100 Hz | 50–500 MB
Visual perception | 30–60 Hz | 20–200 MB
Task planning (VLM) | 1–5 Hz | 2–7 B params
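The frequencies in the table translate directly into time budgets. The sketch below converts each layer's lower frequency bound into a per-tick budget and asks which layers a given inference latency could keep up with. It assumes one blocking inference per control tick, which the page does not state, and the function names are illustrative.

```python
# Rows from the table above: (layer, low Hz, high Hz).
# high Hz is None where the table gives an open-ended "1000+ Hz".
LAYERS = [
    ("Balance and whole-body control", 500, 1000),
    ("Joint-level control", 1000, None),
    ("Locomotion policy", 50, 200),
    ("Manipulation policy", 30, 100),
    ("Visual perception", 30, 60),
    ("Task planning (VLM)", 1, 5),
]


def tick_budget_ms(hz: float) -> float:
    """Time available per control tick at a given loop rate."""
    return 1000.0 / hz


def sustainable_rate_hz(latency_ms: float) -> float:
    """Highest loop rate that one blocking call per tick can sustain."""
    return 1000.0 / latency_ms


def layers_supported(latency_ms: float):
    """Layers whose lowest listed rate the given latency can reach."""
    rate = sustainable_rate_hz(latency_ms)
    return [name for name, low_hz, _high_hz in LAYERS if rate >= low_hz]


for name, low_hz, _high_hz in LAYERS:
    print(f"{name}: at most {tick_budget_ms(low_hz):.1f} ms per tick")

print(round(sustainable_rate_hz(225.0), 1))  # 4.4
print(layers_supported(225.0))               # ['Task planning (VLM)']
```

Under these assumptions a ~225 ms call sustains about 4.4 Hz, which sits in the task-planning band of the table. Batching, pipelining, or a faster accelerator would change that picture.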

Who this is for

Humanoid and legged robotics teams shipping on real hardware.

If your platform has a thermal budget, a battery, and a control loop, the shape of your models is the shape of your product. Small, fast, accurate models are not a nice-to-have for humanoid robotics — they are the only thing that fits.

Tesla Optimus, Boston Dynamics, Figure, 1X, Agility Robotics, Apptronik, Sanctuary AI, Physical Intelligence

Messaging line

Humanoids do not need bigger models. They need the right model — small, fast, and accurate enough to run on-robot without apology. SeedFrontier delivers that shape.

Coming next

Full results soon on seedfrontier.ai

Banking77 is a proof point. The challenge is real, the benchmark is official, the gain is measurable, and the runtime profile stays visible — the kind of shape a real deployment actually needs.

The real target is humanoid robotics, where small, fast, specialized models are not optional. The deeper write-up will connect the Banking77 result to on-robot deployment and what this means for the teams building the next generation of humanoid platforms.