
Boby's Garage: From a Colab Notebook to a 94% On-Device Car Classifier

April 2026

A long-form record of how a personal product went from "EfficientNet-B3 on a Colab T4" to "EfficientNet-V2-S, 94.24% Top-1 on a 655-class generation-level taxonomy, bundled and shipping on Android." This piece is about the infrastructure and product side of the journey — where the model was trained, why that changed, the hardware decisions, and the surrounding pieces (license-plate blur, on-device gating, hierarchical inference) that grew alongside the classifier.

1. The original idea

A mobile app where users walk or drive around, point their phone at a car, and "catch" it — the same loop as Pokémon Go, but for real-world car spotting. Each catch awards XP scaled by a rarity tier (Common → Legendary), feeds a personal collection grid, and unlocks community surfaces (Social Grid feed, leaderboard, garage groups). The product hypothesis was that car enthusiasts would happily turn an existing habit (noticing interesting cars in the wild) into a gamified collection loop, and that good fine-grained car recognition was the load-bearing capability.

That shaped two non-negotiables the whole pipeline has had to honour ever since:

  1. On-device inference. A classifier round-trip to the cloud per frame would have been too slow for a tap-to-catch viewfinder, would have burned battery, and would have created a non-trivial server bill. Everything else is downstream of "the model has to fit on a phone and run in real time."
  2. Generation-level granularity. Catching "an Audi A4" is a mediocre experience; catching specifically the B9 generation is the point. The taxonomy splits each model family by generation — a deliberate choice that makes the classification problem strictly harder than typical car-recognition benchmarks (which collapse year/generation into a single class label).

Corollary: every classifier version needs to be thin enough to bundle in the app, fast enough to run on a mid-range Android in real time, and accurate enough across hundreds of generations to make the catch feel earned rather than coincidental.

2. Pre-v3: where it actually started — Google Colab

The very first training runs of what became the v2.0.0 model lived in Google Colab. The setup:

  • Project directory mounted on Google Drive at /content/drive/MyDrive/CarSpotter_ML.
  • T4 GPU on the free / Colab Pro tier.
  • EfficientNet-B3 backbone, 300×300 input, 1536→512→665 head, 665 classes / 59 makes.
  • Export pipeline: PyTorch checkpoint → ONNX → TF SavedModel → TFLite (INT8 quantized, 12.3 MB).
  • Temperature calibration: T=1.5074, ECE 0.017.
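
Temperature calibration itself is a small piece of code. Below is a minimal PyTorch sketch of fitting a single T on held-out validation logits; it shows the structure, not the actual calibration script, and the function names are illustrative.

```python
import torch
import torch.nn as nn

def fit_temperature(val_logits: torch.Tensor, val_labels: torch.Tensor) -> float:
    """Fit a single temperature T on held-out validation logits by minimizing NLL."""
    log_t = torch.zeros(1, requires_grad=True)   # optimize log(T) so T stays positive
    nll = nn.CrossEntropyLoss()
    optimizer = torch.optim.LBFGS([log_t], lr=0.1, max_iter=50)

    def closure():
        optimizer.zero_grad()
        loss = nll(val_logits / log_t.exp(), val_labels)
        loss.backward()
        return loss

    optimizer.step(closure)
    return log_t.exp().item()

# At inference, calibrated probabilities are just softmax(logits / T), e.g.:
# probs = torch.softmax(logits / 1.5074, dim=-1)
```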

This was enough to ship a first version of the app and prove the loop worked end-to-end. It was not enough to be reliable training infrastructure. The pain points that pushed me off Colab were predictable but real:

  • Disconnects. Long runs (12-16h on a T4 for v4's full 4-phase recipe) routinely got severed by Colab's idle-disconnect. Workarounds with --resume flags and aggressive checkpointing to Drive helped (a minimal save/resume sketch follows this list), but every disconnect cost real time.
  • Ephemeral /tmp. The notebook would write artifacts to /tmp/ and forget about them on restart. Every export needed to land on Drive, which slowed the pipeline.
  • Unpredictable hardware. Sometimes Colab handed out a T4, sometimes a weaker card, sometimes nothing at all if the queue was full.
  • No persistent dev environment. Every notebook open re-installed dependencies from scratch.
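
The resume workaround amounts to the standard save-everything / reload-everything pattern. A minimal sketch under that assumption follows; the checkpoint path and dictionary keys are illustrative, not the actual notebook code.

```python
import os
import torch

# Illustrative path: checkpoints land on Drive so they survive a disconnect.
CKPT = "/content/drive/MyDrive/CarSpotter_ML/checkpoints/last.pth"

def save_checkpoint(model, optimizer, scheduler, epoch):
    """Persist everything needed to restart mid-run."""
    torch.save({
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "scheduler": scheduler.state_dict(),
        "epoch": epoch,
    }, CKPT)

def maybe_resume(model, optimizer, scheduler) -> int:
    """Return the epoch to start from; 0 if there is nothing to resume."""
    if not os.path.exists(CKPT):
        return 0
    state = torch.load(CKPT, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    scheduler.load_state_dict(state["scheduler"])
    return state["epoch"] + 1
```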

v3 was planned to start on Colab too — the v3 plan explicitly references "Phase 4: Colab Training" with T4 budgets and a pack_for_colab.sh script — but I never finished a v3 training run on Colab. v3 was effectively replaced before it shipped, because by the time I was ready to train it the infrastructure had moved.

The v4 development cycle started on Colab (colab/BobysGarage_Training_v4.ipynb, 30 cells) but its successful export pipeline ran locally on WSL2 with onnx2tf after onnx-tf produced TFLite Flex ops I couldn't ship. v4 hitting 79.4% Top-1 at 384×384 NHWC on EfficientNet-V2-S was the moment the architecture stopped changing — every subsequent release line (v5.0 → v5.13) keeps the same backbone family.

3. The migration to RunPod

The Colab → RunPod move happened around v5.x. The triggers were straightforward: training got longer, the model got bigger, and my local WSL2 box doesn't have GPU headroom for EfficientNet-V2-S at 384×384 with my batch size and sub-center ArcFace + DINOv2 distillation. The needed properties were:

  • On-demand GPU, with the option to pick the card.
  • Persistent /workspace that survives stop/start cycles so I don't re-upload datasets each run.
  • SSH access with a real shell — not a notebook abstraction.
  • Cheap when stopped. RunPod charges only when the pod is running; the persistent volume is a small standing cost.

What I picked:

| GPU | Where used | Why |
| --- | --- | --- |
| NVIDIA A100 80GB | v5.4 production training | 660-class dataset at scale. Picked for VRAM headroom on 384×384 at batch 64. |
| A100 SXM4-80GB | v5.6 / v5.7 training | ~10 min/epoch with batch 64 + grad accum 2. The SXM4 form factor gave better throughput than the PCIe A100 for the same VRAM. |
| NVIDIA H100 80GB | v5.13 (planned), wider runs | Used when I want to iterate quickly on data changes; the H100 is overkill for the size of the model but the wall-clock saving on 50-epoch runs is real. |

The pod is normally stopped between sessions. Provisioning is a cold start — there are a few minutes of boot + dependency install before any Python script runs. Standard install on every cold boot: pip install timm albumentations tensorboard. The persistent workspace already has the dataset, checkpoints, and training scripts.

What lives on the pod:

  • Datasets (the one I actually use is far too big to hold in the repo: v5.13 has 175,937 images at 384px after curation).
  • Checkpoints (best.pth, best_fp32.pth, swa_final.pth, EMA shadow weights, optimizer state for resume-on-disconnect).
  • Per-version training scripts and run logs.
  • Teacher logits for distillation runs (e.g. v5.8 had ~307 MB of pre-computed DINOv2 logits in fp16).
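
The teacher-logit files get used in the usual precomputed-teacher way: load the fp16 outputs once, index them by sample, and add a distillation term to the loss. The sketch below shows an embedding-cosine variant of that pattern; the file name, projection head, and weighting are illustrative rather than the exact v5.x recipe.

```python
import torch
import torch.nn.functional as F

# Illustrative: pre-computed DINOv2 teacher outputs, saved once, loaded read-only per run.
teacher_feats = torch.load("dinov2_vitl14_feats_fp16.pt")   # shape [N, teacher_dim], fp16

def distillation_term(student_emb, sample_idx, proj):
    """Cosine distance between a projected student embedding and the frozen teacher feature."""
    target = teacher_feats[sample_idx].float().to(student_emb.device)
    pred = proj(student_emb)        # small projection head, e.g. nn.Linear(student_dim, teacher_dim)
    return 1.0 - F.cosine_similarity(pred, target, dim=-1).mean()

# total_loss = classification_loss + distill_weight * distillation_term(emb, idx, proj)
```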

What lives in the repo:

  • Training script source (ml/v5/training/train.py etc.) — committed via explicit .gitignore negation rules so the pipeline is reproducible.
  • Class lists, manifests, confusion-pair JSONs — small, deterministic, version-tagged.
  • Final TFLite / CoreML artifacts that ship in the app bundle, plus their SHA256 hashes for verification.

The contract is: training is reproducible from the repo + a fresh pod. Any run-specific artifact bigger than a few MB lives on the pod, and is allowed to be ephemeral.

4. The classifier evolution at a glance

| Version | Backbone | Classes | Top-1 (val) | Where trained | Bundled? | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| v2.0.0 | EfficientNet-B3 | 665 | — | Colab T4 | Yes (early) | Original, INT8 quantized, 12.3 MB |
| v3.x | EfficientNet-B3 | ~900 (planned) | 3.2% (broken) | Colab (planned) | No | Never shipped — superseded by v4 |
| v4 | EfficientNet-V2-S | 666 (665 + bg) | 79.4% | Colab + local export | Yes | First V2-S, 4-phase training, fp32 TFLite (INT8 broke accuracy: 91% → 54%) |
| v5.1 | EfficientNet-V2-S + GeM + 768d | 667 | 80.67% | Colab + RunPod | No (superseded) | Multi-task make+model heads stabilized |
| v5.4 | same | 660 | 82.38% | RunPod A100 80GB | Yes | First A100 production run, ~$13. Sub-center ArcFace + DINOv2 distillation. |
| v5.7-swa | same | 660 | 79.18% / 90.58% (clean overlap) | RunPod A100 SXM4 | Yes (v0.4.0) | Confusion-pair detail collection added; SWA for stability |
| v5.8 | same | 677 | 78.81% | RunPod | No (internal) | Confusion-margin loss ablation: every config worsened trust metrics → margin removed from production recipe |
| v5.12 | same | 655 | 94.24% | RunPod | Yes — current production | 87.4 MB float32 TFLite. Full data hygiene pass (Cleanlab-style auto-relabel + targeted resourcing). The headline number. |
| v5.13 | same | 895 | TBD | RunPod H100 (planned) | No (training pending) | 240 new classes from the v5.13 catalog expansion. Adds 3,000 COCO background images for false-positive rejection. |

The single biggest lesson from the v2 → v5.12 ladder: nothing about the architecture changed after v4. The 79.4% → 94.24% gain came almost entirely from data quality work — deduplication, mislabel detection, viewpoint balancing, confusion-pair-targeted sourcing, and the cumulative effect of running the iterative cleanup loop (train → detect mislabels → fix → retrain) repeatedly. The recipe additions (ArcFace, DINOv2 distillation, EMA, SWA) gave point deltas on the order of 1-2 pp each; the data work delivered the rest.
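
The "detect mislabels" step doesn't need much machinery even without the Cleanlab library: rank every training image by the model's confidence in its own label (from out-of-fold predictions) and review the bottom of the list. A minimal sketch, with the cut-off purely illustrative:

```python
import numpy as np

def rank_label_issues(pred_probs: np.ndarray, labels: np.ndarray, top_k: int = 500):
    """Flag the training samples whose given label the model believes least.

    pred_probs: [N, C] out-of-fold predicted probabilities
    labels:     [N] integer class ids as currently labeled
    """
    self_confidence = pred_probs[np.arange(len(labels)), labels]
    suspects = np.argsort(self_confidence)[:top_k]        # lowest self-confidence first
    suggested = pred_probs[suspects].argmax(axis=1)        # what the model thinks instead
    return suspects, suggested, self_confidence[suspects]
```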

5. Why these specific architecture choices stuck

The choices that made it into the recipe, and why they stuck:

  • EfficientNet-V2-S at 384×384 NHWC. Picked for the size/accuracy trade-off and because NHWC matches Android's ImagePreprocessor.kt output without an extra transpose. NCHW path was tried and dropped — onnx-tf was generating a DepthwiseConv2dNative Flex op, which broke the TFLite build because tensorflow-lite-select-tf-ops:2.17.0 doesn't exist on Maven.
  • GeM pooling (learned p≈3.0) instead of plain global average pool. Empirically worth a few percentage points on fine-grained accuracy (a minimal implementation is sketched after this list).
  • Sub-center ArcFace has replaced focal + triplet loss since v5.3. The K=3 sub-centers absorb intra-class variance (different paint colours / angles of the same model) better than triplet's hard-mining.
  • DINOv2 ViT-L/14 frozen-teacher distillation. Zero inference cost (teacher dropped at deploy), but improves the embedding's robustness to viewpoint and lighting. Used during v5.3+ training.
  • TrivialAugmentWide at the PIL layer before Albumentations. Avoids over-tuning augmentation; the full Albumentations pipeline still runs after.
  • EMA (decay 0.999, later 0.9995). Stabilizes the final epochs.
  • SWA (Stochastic Weight Averaging) for v5.4 / v5.7. Used to be the final phase; later iterations rejected it because it eroded my trust metrics (ConfViol, HCE) even when it lifted Top-1.
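
For reference, GeM pooling is a very small module: average pooling of x^p followed by the p-th root, with p learned. A minimal PyTorch version matching the p≈3.0 note above (illustrative, not the training code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeM(nn.Module):
    """Generalized-mean pooling: avg-pool of x^p, then the p-th root. p=1 is GAP; large p approaches max-pool."""
    def __init__(self, p: float = 3.0, eps: float = 1e-6):
        super().__init__()
        self.p = nn.Parameter(torch.tensor(p))   # learned exponent
        self.eps = eps

    def forward(self, x):                        # x: [B, C, H, W] backbone feature map
        x = x.clamp(min=self.eps).pow(self.p)
        x = F.adaptive_avg_pool2d(x, 1).pow(1.0 / self.p)
        return x.flatten(1)                      # [B, C] embedding
```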

Things I deliberately do not do:

  • No INT8 quantization. v4 testing showed dynamic-range INT8 dropped Top-1 from 91% to 54%, with phantom classes locking onto constant ~8.0 logits. I ship float32 TFLite and pay the size cost.
  • No ConvNeXt backbone. GELU + LayerNorm hits an unsupported tf.Erf op during TFLite conversion, so the model wouldn't ship. I tested and stopped.
  • No multi-crop test-time augmentation. 5-6× inference is too slow on the device for an interactive viewfinder. TTA flip at the confirmation step (not the scan loop) is on the table for a future release.
  • No Co-Teaching for label noise. Sub-center ArcFace + Cleanlab-style audit handles it more cleanly.

6. The stack around the model

Three other capabilities grew alongside the classifier and are easy to forget about because they're invisible when they work.

6.1 License plate blur

A car-spotting app where the photos are shared (Social Grid) has an obvious privacy obligation: don't publish other people's plates. I built my own plate-blur pipeline rather than rely on cloud OCR.

The current shape:

  • Detector: a YOLO plate model trained separately, exported to TFLite, bundled at assets/models/plate_detector.tflite.
  • Native pipeline: Android-only as of writing. The expo-car-classifier native module exposes loadPlateModel, detectPlates, and blurLicensePlates via Kotlin (PlateBlurProcessor.blurPlates()); the detect-then-blur idea is sketched after this list.
  • JS/TS binding: src/hooks/usePlateBlur.ts — a fail-open hook that returns the original URI if the platform is unsupported (iOS) or the blur fails (no plate found, model not loaded, anything).
  • Lifecycle: blur runs synchronously after the catch photo is captured, before it gets uploaded to S3 or shown in the catch confirmation. The processed JPEG overwrites the original at the same path, which is fast (no extra files on disk) but makes debugging plate-blur quality harder when something looks wrong — there's no "before" image to compare against.
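
The production path is Kotlin, but the logic is small: run the detector, Gaussian-blur each returned box, overwrite the file in place, and fail open on anything unexpected. A Python/OpenCV sketch of that idea, with the box format and kernel size purely illustrative:

```python
import cv2

def blur_plates(image_path: str, boxes) -> str:
    """Blur detected plate boxes in place; on any failure return the original path unchanged."""
    try:
        img = cv2.imread(image_path)
        if img is None or not boxes:
            return image_path                          # fail open: nothing to blur
        for (x1, y1, x2, y2) in boxes:                 # boxes from the plate detector, pixel coords
            roi = img[y1:y2, x1:x2]
            img[y1:y2, x1:x2] = cv2.GaussianBlur(roi, (51, 51), 0)
        cv2.imwrite(image_path, img)                   # overwrite at the same path, as in the app
        return image_path
    except Exception:
        return image_path                              # fail open on any error
```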

Failure modes the detector still struggles with: distant plates (pixel coverage too small for the detector), heavily angled plates, night/glare/motion blur, and non-US plate aspect ratios. Recall is the metric I care about most (a missed plate is a privacy fail; a false positive that blurs a wheel is annoying but not harmful).

iOS plate blur is not implemented — the JS hook guards against the platform and returns the original URI. iOS catches uploaded today have unblurred plates, which is one reason iOS shipping has stayed parked.

6.2 On-device gating

The classifier doesn't decide a catch on its own. Every prediction goes through a multi-stage gate before the user sees a "Catch!" prompt:

  • Gate 0 — background class. The classifier has a zzz_background class (last logit). If that wins, no catch.
  • Gate 1 — out-of-distribution detection. Three sub-checks, any of which rejects: max-softmax below threshold, max-logit below threshold, logit-entropy above threshold. Caught the case in v4 testing where black frames hit zzz_background at 98.5% confidence — exactly what I want.
  • Gate 2 — per-rarity confidence threshold. Stricter for rarer cars (COMMON ≥0.40, LEGENDARY ≥0.70), so a flickery low-confidence read can't fluke a Legendary catch.
  • Gate 3 — voting across frames. Multi-frame voting with margin requirements; one-frame fluke predictions don't count.
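
In code, the single-frame part of the gate stack is roughly the following. The per-rarity thresholds are the ones quoted above; the OOD thresholds are placeholders, and Gate 3 wraps repeated calls to this check:

```python
import numpy as np

RARITY_THRESHOLDS = {"COMMON": 0.40, "LEGENDARY": 0.70}   # intermediate tiers omitted here

def passes_gates(probs: np.ndarray, logits: np.ndarray, background_idx: int, rarity: str,
                 min_softmax: float = 0.35, min_logit: float = 5.0, max_entropy: float = 3.0) -> bool:
    """Single-frame gate; Gate 3 (multi-frame voting with margin) wraps calls to this."""
    top = int(probs.argmax())
    # Gate 0: background class wins -> no catch
    if top == background_idx:
        return False
    # Gate 1: out-of-distribution rejection; any sub-check failing rejects (thresholds are placeholders)
    entropy = float(-(probs * np.log(probs + 1e-12)).sum())
    if probs[top] < min_softmax or logits[top] < min_logit or entropy > max_entropy:
        return False
    # Gate 2: per-rarity confidence threshold (stricter for rarer cars)
    return bool(probs[top] >= RARITY_THRESHOLDS.get(rarity, 0.5))
```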

These thresholds were calibrated against real captures during v4 → v5 and tuned every time the underlying confidence distribution shifted (e.g. when bundled model size or temperature changed).

6.3 Hierarchical inference + make-aware reranking

The CompCars paper discussed in section 7, the same paper that gave me the auxiliary make head, also pushed me to do per-make reranking at inference time:

  • For each candidate, sum probabilities across all classes belonging to the same make.
  • If the collective make signal points to a different make than the top single-class prediction, reject — the model recognizes the badge but has no matching model in the taxonomy.
  • Otherwise rerank within-make using the family aggregation, then apply the gates above.

This isn't strictly necessary for the model to work, but it cuts a class of errors that the bare softmax can't catch.
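
A minimal sketch of the make-aggregation and reject/rerank step, assuming a class_to_make lookup table and collapsing the family-aggregation detail into plain per-class probabilities:

```python
import numpy as np
from collections import defaultdict

def rerank_by_make(probs: np.ndarray, class_to_make: list):
    """Aggregate class probabilities per make; reject if the make signal disagrees with the top class."""
    make_mass = defaultdict(float)
    for cls, p in enumerate(probs):
        make_mass[class_to_make[cls]] += float(p)

    top_class = int(probs.argmax())
    top_make = max(make_mass, key=make_mass.get)
    if class_to_make[top_class] != top_make:
        return None        # badge recognized, but no matching model in the taxonomy -> reject
    # rerank within the winning make, then the confidence gates still apply downstream
    in_make = [c for c in range(len(probs)) if class_to_make[c] == top_make]
    return sorted(in_make, key=lambda c: probs[c], reverse=True)
```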

7. The research the recipe leans on: CompCars

The training recipe and taxonomy choices lean heavily on Yang, Luo, Loy & Tang, A Large-Scale Car Dataset for Fine-Grained Categorization and Verification (CVPR 2015, CUHK MMLab) — the canonical reference for fine-grained car classification.

What I adopted directly:

  • Auxiliary make head. The paper's Table 3 reports model-level top-1 of 76.7% and make-level top-1 of 82.9% on the same backbone — a +6.2 pp gap that motivated coarse-to-fine supervision. I wired make as a parallel head sharing the GeM-pooled embedding (loss weight 0.3, fixed since v5.1); a minimal sketch follows this list. On v5.4 this shows up as 94.7% make vs 82.38% model accuracy — the regularization signal the paper described.
  • All-viewpoint training is non-negotiable. The paper's strongest finding: training on all five viewpoints reaches 76.7%, while the best single-viewpoint training reaches only 59.8% — a +16.9 pp advantage. I treat this as a hard rule on the data side. Crawler intake doesn't filter by viewpoint, and coverage scripts target classes with viewpoint imbalance, not classes with low totals.
  • Within-make confusion is the load-bearing error. The paper shows most fine-grained errors stay inside the right make. The entire v5.x error-reduction loop is organized around this: 31 mined confusion pairs, v5.5 allocated 11,663 detail-focused images across 103 classes targeting those pairs, and v5.13 reserves 1,200 images per class budget specifically for confusion pairs before any other data goes in.
  • Train distribution must match deployment distribution. The paper notes web-trained CNNs transfer only partially to surveillance-modality images. My deployment distribution is uncontrolled phone-camera frames, so user-submitted photos and BaT/auction listings (closest to camera distribution) are the backbone — press/marketing shots are supplemental, not primary.
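
Concretely, the make head is a second linear classifier on the shared embedding with its cross-entropy weighted at 0.3. In the minimal sketch below, a plain linear layer stands in for the ArcFace model head, and the dimensions are the ones quoted in this post rather than values pulled from the actual script:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHead(nn.Module):
    """Shared GeM-pooled embedding -> fine-grained model-class head + coarse make head."""
    def __init__(self, emb_dim: int = 768, num_classes: int = 655, num_makes: int = 59):
        super().__init__()
        self.model_head = nn.Linear(emb_dim, num_classes)   # stand-in for the sub-center ArcFace head
        self.make_head = nn.Linear(emb_dim, num_makes)

    def forward(self, emb):
        return self.model_head(emb), self.make_head(emb)

def multitask_loss(model_logits, make_logits, model_labels, make_labels, make_weight: float = 0.3):
    """Fine-grained loss plus the auxiliary make loss at weight 0.3."""
    return (F.cross_entropy(model_logits, model_labels)
            + make_weight * F.cross_entropy(make_logits, make_labels))
```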

Where I deliberately diverged:

  • Generation-level taxonomy. CompCars is model-level (a Civic is a Civic across all generations). My v5.0 → v5.12 line trained at generation-level (8th-gen FK Civic ≠ 10th-gen FK7 Civic) because the app rewards collecting specific generations. v5.13's 240 new classes deliberately revert to model-level for under-represented markets where per-generation labels would cost too much to source. This is the inverse of the paper's design.
  • Modern backbone + recipe. The paper's OverFeat baseline reaches 76.7% top-1 on 431 model-level classes. I hit 94.24% top-1 on a strictly harder 655-class generation-level problem with EfficientNet-V2-S + Sub-Center ArcFace + DINOv2 distillation + a decade of augmentation/data-hygiene research. The numbers aren't apples-to-apples, but the gap is large enough to attribute mostly to the modern recipe.
  • Trust metrics on top of Top-1. Top-1 alone is not enough for an on-device classifier in a gamified app — a confidently wrong catch is worse than abstaining. Every checkpoint has to clear ConfViol ≤ 25% and HCE ≤ ~620 gates in addition to Top-1 ≥ 94% / Top-3 ≥ 99%. Multiple v5.6 SWA runs hit higher Top-1 than the shipped v5.7 but were rejected because their ConfViol regressed.
  • A background class. CompCars never has to decide "is this even a car?" — my model does, on every frame. v5.13 adds 3,000 COCO images (500 hard negatives like motorcycles/buses + 2,500 clean scenes) as the zzz_background class to reduce false catches when the user pans across non-car scenes.

One paper finding I have not exploited: the 8-part voting ensemble (headlight, taillight, etc.) hits 80.8% top-1 in the paper — higher than the entire-car classifier. Running 8 models per frame on-device is a non-starter, and phone-camera catches rarely yield clean isolated part crops. Worth revisiting if Top-1 stalls or if a "verify by detail" UX surface ever becomes useful.

8. Where I am now

  • Bundled model: v5.12.0, 655 classes, 94.24% Top-1, 87.4 MB float32 TFLite (Android) + CoreML variant for iOS (87.3 MB, GAP pooling instead of GeM since the pooling op didn't translate cleanly through CoreML conversion).
  • App version: v0.5.5, Android in Open Testing on Play.
  • Training infrastructure: RunPod, pod stopped between runs.
  • Next training target: v5.13 (895 classes, dataset built and uploaded, training pending an H100 spin-up).
  • Open-loop items: iOS plate blur not implemented, model-level vs generation-level granularity is split (655 generation-level + 240 model-level for v5.13's added market coverage), production verification head (the paper's third task) not yet built.

The journey isn't over. But the line from "Colab notebook with EfficientNet-B3 in February 2026" to "RunPod-trained EfficientNet-V2-S, 94.24% on a 655-class generation-level taxonomy, bundled and shipping" is the durable progression worth recording.

Credits

UI/UX design was done in collaboration with web360pointer on Fiverr — the catch flow, collection grid, and Social Grid surfaces all came out of that collaboration.
