
How to Run Multiple ML Models Without Increasing App Size

A quiet look at how apps can carry multiple on-device ML models by loading intelligence only when it’s needed, keeping the experience light without losing capability.

By Samantha Blake · Published about a month ago · 6 min read

Rain pressed softly against the laundromat windows that afternoon, blurring the street outside into smudges of gray. Inside, the dryers turned in slow circles, pushing warm air across the room. I had my laptop balanced awkwardly on my knees, reviewing a build that had grown far larger than anyone expected. The app carried four separate machine learning models—each important, each heavy—and the product manager sitting beside me kept tapping her coffee lid as if the rhythm could reveal an answer we both already sensed. The models were valuable. The app size was not. Something needed to change, and it had nothing to do with features and everything to do with how the app treated weight.

When Intelligence Becomes Heavy

Early versions of the apps that reached us from mobile app development teams in Portland tended to treat machine learning like a single decision: place the model inside the bundle and let it run locally. That logic works as long as there is only one model, or maybe two. But once a team begins layering recommendation engines, text embeddings, classification networks, and personalization modules, the binary starts to feel swollen. The idea of intelligence becomes uncomfortable because users sense the size of it before they sense the value of it.

I watched this pattern enough times to understand something simple—people rarely delete apps because they dislike features. They delete them because they need space for moments unrelated to software. A few photos. A video. A message backup. When an app crosses an invisible boundary of weight, it becomes the first candidate for removal.

The product manager beside me that afternoon understood that instinctively. She didn’t want the app to grow heavy. She wanted the intelligence to feel present without becoming a burden.

When Models Don’t Need to Travel Together

While the dryers hummed behind us, I opened a folder on my screen and showed her each model independently. The classification one stood at a few megabytes, the recommendation engine slightly larger, and the embedding network dwarfed both. They were stitched into the APK like books crammed into a single backpack. The trick wasn’t compression; it was separation.

Models don’t need to travel everywhere together.

Some models belong to first-time onboarding. Some belong to offline predictions. Some only activate for users who reach certain screens. Keeping every model inside the binary assumes that every user needs everything, which is rarely true.
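
A minimal sketch of that separation, in Kotlin, might look like a registry that maps features to the models they actually require. The feature names, URLs, and sizes here are purely illustrative, not details from the build in this story.

```kotlin
// Hypothetical registry mapping app features to the models they actually need.
// Feature names, URLs, and sizes are illustrative placeholders.
enum class Feature { ONBOARDING, OFFLINE_PREDICTIONS, PERSONALIZATION }

data class ModelSpec(
    val name: String,          // identifier used to fetch the asset
    val remoteUrl: String,     // where the model lives outside the APK
    val approxSizeMb: Int      // rough size, useful for download policy decisions
)

object ModelRegistry {
    private val byFeature: Map<Feature, List<ModelSpec>> = mapOf(
        Feature.ONBOARDING to listOf(
            ModelSpec("classifier-v3", "https://assets.example.com/models/classifier-v3.tflite", 4)
        ),
        Feature.OFFLINE_PREDICTIONS to listOf(
            ModelSpec("recommender-v2", "https://assets.example.com/models/recommender-v2.tflite", 9)
        ),
        Feature.PERSONALIZATION to listOf(
            ModelSpec("embeddings-v1", "https://assets.example.com/models/embeddings-v1.tflite", 28)
        )
    )

    // Only the models tied to the features a user actually reaches are requested.
    fun modelsFor(feature: Feature): List<ModelSpec> = byFeature[feature].orEmpty()
}
```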

The moment she understood that idea, her tapping slowed.

The Shift That Happens When Models Load Dynamically

One of the earliest things I learned about on-device ML is that models behave like guests. Some stay permanently. Some visit only when asked. Some appear for a task and then disappear quietly. The laundromat felt like an oddly appropriate place to think about this—clothes arriving and leaving, cycles running briefly before the next one begins.

I showed her how dynamic loading worked. Instead of bundling the models inside the binary, the app could ship a minimal on-device architecture that knew how to fetch external assets when needed. The core of the app remained small. The intelligence expanded when summoned, stored safely, and evicted gracefully when no longer used. It felt like watching an app breathe.
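
A loader along those lines can be sketched in a few dozen lines of Kotlin. This version assumes a plain HTTPS host for the model files, a simple least-recently-used eviction policy, and the hypothetical ModelSpec type from the registry sketch above; it illustrates the pattern rather than the exact system we built.

```kotlin
import java.io.File
import java.net.URL

// Minimal sketch of a dynamic model loader: the APK ships no model weights,
// only the logic to fetch, cache, and evict them. The cache directory, the
// size budget, and ModelSpec (from the registry sketch) are assumptions.
class ModelLoader(
    private val cacheDir: File,
    private val maxCacheBytes: Long = 64L * 1024 * 1024
) {

    // Returns a local file for the model, downloading it on first use.
    fun load(spec: ModelSpec): File {
        val local = File(cacheDir, "${spec.name}.tflite")
        if (!local.exists()) {
            cacheDir.mkdirs()
            URL(spec.remoteUrl).openStream().use { input ->
                local.outputStream().use { output -> input.copyTo(output) }
            }
        }
        local.setLastModified(System.currentTimeMillis())  // track recency for eviction
        evictIfNeeded()
        return local
    }

    // Least-recently-used eviction keeps the on-device footprint bounded.
    private fun evictIfNeeded() {
        val files = cacheDir.listFiles()?.sortedBy { it.lastModified() } ?: return
        var total = files.sumOf { it.length() }
        for (f in files) {
            if (total <= maxCacheBytes) break
            total -= f.length()
            f.delete()
        }
    }
}
```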

The product manager leaned closer to my screen, eyebrows lifted not in excitement but in relief.

When Space Becomes a Promise

The more we talked, the more I realized that storage isn’t a technical metric—it is an emotional one. A user does not calculate filesystem numbers. A user simply senses that an app respects the limits of their device. Intelligence should feel light, not crowded.

I thought back to earlier projects where teams packed everything into one binary. They feared network delays, so they kept the models local. They feared connection loss, so they bundled additional versions. They feared refactoring, so they left unused weights inside the structure. The APKs swelled until they felt like they carried history instead of functionality.

Separation doesn’t just organize models. It relieves the app from carrying things it doesn’t need yet.

Creating a System Instead of a Bundle

The next step in that laundromat conversation was showing her how a modular ML system behaved. Not one overloaded binary, but an app that possessed the ability to host models flexibly. The app would determine which model belonged to a device tier. Older phones didn’t need the high-precision model. Newer phones could handle more depth. Some users needed just-in-time downloads. Others needed persistence.
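
The tier decision itself can stay tiny. The thresholds and variant suffixes below are assumptions for illustration; the point is only that the framework, not the bundle, decides which weights a device ever sees.

```kotlin
// Illustrative tier selection: older devices get a compact model variant,
// newer ones a higher-precision variant. Thresholds and variant names are
// assumptions, not figures from the build described above.
enum class DeviceTier { COMPACT, STANDARD, FULL }

fun tierFor(totalRamMb: Long, apiLevel: Int): DeviceTier = when {
    totalRamMb < 2048 || apiLevel < 26 -> DeviceTier.COMPACT
    totalRamMb < 4096                  -> DeviceTier.STANDARD
    else                               -> DeviceTier.FULL
}

fun variantName(base: String, tier: DeviceTier): String = when (tier) {
    DeviceTier.COMPACT  -> "$base-int8"   // heavily quantized, smallest download
    DeviceTier.STANDARD -> "$base-fp16"   // half-precision middle ground
    DeviceTier.FULL     -> "$base-fp32"   // full precision, newest hardware only
}
```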

We didn’t build a large model architecture. We built a thin framework that understood how to pull intelligence into place.

It wasn’t elegant in a visual sense. It simply felt natural—like choosing the right tool at the moment it was needed.

When Pruning Reveals What Actually Matters

Before creating any dynamic system, we pruned. And pruning tells a story of its own. You examine weights, layer by layer, removing redundancy, reducing shape, compressing format. It feels like sculpting something that was hidden beneath the bulk. Models shrink not by losing intelligence but by losing noise.

Some features don’t need the full classification probability distribution. Some don’t need richly encoded embeddings. Some don’t need high-resolution dataset remnants left inside the shipped model. When you prune with intention, you discover how much of a model’s size comes from caution rather than necessity.
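
The mechanics are simple enough to sketch. Real pruning happens in the training and export pipeline, long before anything reaches the app, but the principle, zeroing weights whose magnitude contributes almost nothing, fits in a few lines. This is a conceptual illustration, not part of any shipped pipeline.

```kotlin
// Conceptual sketch of magnitude pruning on a flat weight array: weights whose
// absolute value falls below a threshold are zeroed, leaving a sparse array
// that compresses far better. Illustration only; real pruning runs offline.
fun pruneByMagnitude(weights: FloatArray, threshold: Float): Pair<FloatArray, Float> {
    var zeroed = 0
    val pruned = FloatArray(weights.size) { i ->
        if (kotlin.math.abs(weights[i]) < threshold) { zeroed++; 0f } else weights[i]
    }
    val sparsity = zeroed.toFloat() / weights.size   // fraction of weights removed
    return pruned to sparsity
}
```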

The product manager listened quietly, her attention shifting from fear of loss to understanding of control.

When Responsibility Moves Closer to the Device

Running multiple models without increasing app size means moving responsibility—not away from the user, but closer to the moment when the user initiates a task. A device should not host intelligence that does not participate in its experience. It should pull, compute, and store with purpose.

One of the models in that build dealt with personalization. Only a fraction of the user base ever interacted with the feature. Why should everyone carry its weight? When the device requests the model only after the user becomes part of that feature path, the app feels respectful.
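
In practice that respect amounts to one guard at the feature’s entry point, shown here with the hypothetical registry and loader sketched earlier.

```kotlin
// Hypothetical entry point: the personalization model is requested only when
// the user actually opens that feature, never at install time.
fun onPersonalizationOpened(loader: ModelLoader) {
    val spec = ModelRegistry.modelsFor(Feature.PERSONALIZATION).first()
    val modelFile = loader.load(spec)   // fetched on first visit, served from cache afterwards
    // hand modelFile to the on-device inference runtime here
}
```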

Respect is a kind of performance.

When CI Becomes a Confirmation Mirror

After that laundromat meeting, I returned to the office and wired the dynamic loading system into CI. The pipeline ran tests that confirmed when models were fetched, where they lived temporarily, and when they were evicted. It felt like watching the app breathe through cycle after cycle, never holding more than it needed. The APK size stopped growing. The intelligence still expanded. It became modular, predictable, calm.
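
The tests themselves do not need to be elaborate. A sketch of one, written against the hypothetical loader above and using a local file in place of the real model host, might look like this.

```kotlin
import org.junit.Assert.assertTrue
import org.junit.Test
import java.io.File
import java.nio.file.Files

// Sketch of a CI-level check for the hypothetical ModelLoader: it confirms a
// model is fetched into the cache and that the cache stays within its budget.
class ModelLoaderTest {

    @Test
    fun fetchesAndEvictsWithinBudget() {
        val cacheDir = Files.createTempDirectory("model-cache").toFile()
        val fakeRemote = File.createTempFile("embeddings", ".tflite").apply {
            writeBytes(ByteArray(1024) { 1 })              // 1 KB stand-in for a real model
        }
        val spec = ModelSpec("embeddings-v1", fakeRemote.toURI().toURL().toString(), 1)

        val loader = ModelLoader(cacheDir, maxCacheBytes = 4096L)  // tiny budget for the test
        val local = loader.load(spec)

        assertTrue(local.exists())                          // model was fetched and cached
        val cachedBytes = cacheDir.listFiles().orEmpty().sumOf { it.length() }
        assertTrue(cachedBytes <= 4096L)                    // eviction keeps the cache bounded
    }
}
```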

That calmness spread inside the team. Designers added features without apology. Developers experimented without fear. Product leads no longer wondered whether one more model would push the app past acceptable limits.

The system held steady because the intelligence no longer depended on occupancy—it depended on timing.

When a Lightweight App Feels More Capable

A few weeks later, I downloaded the release build onto my own phone. The app installed instantly, far quicker than earlier versions. It opened without hesitation. When I accessed areas that required ML tasks, the models arrived without ceremony. Their presence felt invisible. Their absence felt natural. Nothing dragged. Nothing carried more than it should.

That moment reminded me that users never praise size reduction. They praise smoothness. They praise immediacy. They praise apps that behave like they belong in their device, not like they dominate it.

Running multiple ML models without increasing app size is not innovation. It is restraint. It is deciding what belongs in the moment and what should wait until asked.

The Quiet End of the Laundromat Afternoon

When my clothes finished drying that day, I packed my laptop and stood up slowly. The warmth from the machines lingered in the room. The product manager gathered her things, her expression softer than when she arrived. She understood now that intelligence didn’t have to live inside the app forever. It could visit when needed, contribute, and leave.

The rain outside had eased into a mist. The street looked washed clean, the kind of calm that matches the feeling of an app right-sized again.

Running multiple models isn’t a technical challenge—it’s an awareness challenge. It is knowing that intelligence must serve experience, not overshadow it. Once the app learns how to load what it needs and release what it doesn’t, it stops feeling like a collection of heavy features and starts feeling like something closer to intuition—light, present, and never more than it needs to be.


