
Training Machines on Human Behaviour Is Big Business—And Getting Bigger

20 April 2026


In April 2026, startup Humyn Labs announced it would deploy $20 million to scale what it describes as a “human data layer” for robotics and physical artificial intelligence (AI). The San Francisco-based company, founded by Manish Agarwal and Ishank Gupta, is targeting a growing constraint in the development of real-world AI systems: the lack of high-quality human behavioural data needed to train machines outside controlled environments.

Its approach—capturing and structuring human activity at scale—has drawn scrutiny, however, raising questions about consent, data ownership and the commercial use of human behaviour as a training resource.


What Humyn Labs does—and why it matters

Humyn Labs was founded in 2026 by Agarwal, a former gaming executive, and Gupta; both were previously involved in KGeN, a platform focused on verified digital identity and user data. That background is reflected in the company’s current positioning.

Rather than building robots or AI models, Humyn Labs is developing data infrastructure. Its goal is to connect real-world human activity to companies building physical AI systems, supplying datasets that allow machines to learn how people move, act and interact with their environments.

The company says it works with a network of more than one million contributors across more than 60 countries, generating multimodal datasets that include video, audio and motion data. While these figures are company-reported and not independently verified, they point to the scale at which the company intends to operate.
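To make the idea of a multimodal behavioural dataset concrete, the sketch below shows what a single contributor sample might look like as a structured record. The schema, field names and storage paths are illustrative assumptions, not Humyn Labs’ actual data model.

```python
from dataclasses import dataclass, field

# Illustrative sketch of one multimodal training sample combining
# video, audio and motion data, as described in the article.
@dataclass
class BehaviourSample:
    contributor_id: str          # pseudonymous contributor reference
    country: str                 # contributor's country, e.g. "IN"
    video_uri: str               # first-person video clip (hypothetical path)
    audio_uri: str               # synchronised audio track (hypothetical path)
    motion: list                 # e.g. per-frame IMU readings: [(ax, ay, az), ...]
    labels: list = field(default_factory=list)  # task annotations, e.g. ["pick_object"]

sample = BehaviourSample(
    contributor_id="c-001",
    country="IN",
    video_uri="s3://bucket/clips/c-001/0001.mp4",
    audio_uri="s3://bucket/audio/c-001/0001.wav",
    motion=[(0.0, 0.1, 9.8), (0.0, 0.2, 9.7)],
    labels=["pick_object"],
)
print(sample.country, len(sample.motion))
```

A record like this bundles the three modalities the company cites—video, audio and motion—under shared metadata, which is what makes the data usable for training rather than a loose collection of files.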

At its core, Humyn Labs is attempting to industrialise a process that has historically been fragmented: turning human behaviour into structured training data.


Why physical AI needs human behavioural data

The rise of physical AI—robots, autonomous machines and embodied systems—has exposed a fundamental limitation in current AI development.

Large language models have been trained on vast amounts of text data available online. But robots do not operate in text environments. They must function in the physical world, where conditions are dynamic and unpredictable.

Tasks such as picking up objects, navigating cluttered spaces or interacting safely with humans require an understanding of real-world behaviour. These are not problems that can be solved with synthetic data alone.

“There is no internet data for real-world actions,” Agarwal said in an interview with Moneycontrol.

This gap is becoming more apparent as robotics moves beyond controlled environments into commercial deployment. Systems that perform well in testing often struggle when exposed to real-world variability. Training data is a key reason why.


The rise of the “human data layer” in AI

Humyn Labs is part of a broader shift in AI development: the emergence of a data-centric layer focused on real-world human behaviour.

The company captures first-person activity—how people move, interact with objects and navigate environments—and converts it into structured datasets for machine learning systems. Its $20 million investment is being used to expand data collection operations, develop testing environments and scale multilingual datasets.

This approach positions Humyn Labs as a middleware provider within the AI stack. It does not build the models or the hardware, but supplies the data that enables both.

As AI systems move into the physical world, this layer is becoming increasingly important.


Other companies training AI with human data

Humyn Labs is not the only company working on this problem. A growing ecosystem is emerging around AI training data for robotics and physical systems, though approaches vary.

Established players such as Scale AI have built large-scale data annotation businesses supporting autonomous vehicles and robotics. Platforms like Encord specialise in managing complex datasets, including video and sensor data used in machine learning workflows. Meanwhile, Surge AI focuses on human-labelled data used to refine AI systems through feedback.

These companies primarily operate in the digital domain. The shift towards physical AI is extending their relevance, but also exposing new challenges.

Newer approaches are emerging that focus more directly on capturing real-world human behaviour. Workforce platform Instawork has begun equipping workers with wearable cameras to record how tasks are performed, turning operational workflows into training data.

Robotics companies are also building their own datasets. Covariant, for example, collects data from robots operating in warehouse environments, creating continuous feedback loops between deployment and learning.

Other startups are experimenting with teleoperation, where humans remotely control robots to generate training data through real-world interaction.

Taken together, these approaches point to a fragmented but fast-developing market, where different players are trying to solve the same underlying problem: how to generate reliable data for machines operating in the physical world.


Why collecting real-world AI data is expensive

Unlike text and image data scraped from the web, human behavioural data cannot be collected passively at scale.

It must be captured deliberately, often using sensors or cameras, then labelled, structured and validated before it becomes usable for machine learning. Each step adds complexity and cost.
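The capture, label, structure and validate steps above can be sketched as a minimal pipeline. The function names, stub data and validation rule are assumptions for illustration only, not a real Humyn Labs API.

```python
# Minimal sketch of a capture -> label -> structure -> validate pipeline,
# with each stage stubbed out.

def capture() -> list:
    # In practice: sensor/camera ingestion; here, stubbed raw frames.
    return [{"frame": i, "motion": (0.0, 0.1, 9.8)} for i in range(3)]

def label(raw: list) -> list:
    # Human or model-assisted annotation step.
    return [dict(r, action="reach") for r in raw]

def structure(labelled: list) -> dict:
    # Package annotated frames into a single training record with metadata.
    return {"contributor": "c-001", "frames": labelled}

def validate(record: dict) -> bool:
    # Quality control: every frame must carry motion data and a label.
    return all("motion" in f and "action" in f for f in record["frames"])

record = structure(label(capture()))
print(validate(record), len(record["frames"]))
```

Even in this toy form, the point holds: each stage adds work that raw web scraping never required, which is where the cost and infrastructure burden comes from.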

Humyn Labs’ $20 million deployment reflects the infrastructure required to support this process. It covers data collection, processing and quality control—elements that are essential for building datasets that can be used in production environments.

As demand for robotics grows, access to high-quality training data is becoming a competitive factor. Companies that control such datasets may be better positioned to develop reliable systems.


Why the human data layer is controversial

The emergence of a “human data layer” raises questions that go beyond technology.

Privacy is one of the most immediate concerns. Human behavioural data—particularly motion and interaction data—can be sensitive. Even when anonymised, it may still reveal identifiable patterns.

Ownership is less clearly defined. If human behaviour becomes a key input for training AI systems, it is not yet clear how the value generated from that data should be distributed. Contributors may be compensated for participation, but the long-term commercial value of the datasets may be significantly higher.

There is also a shift in labour dynamics. Platforms that capture human activity for training purposes blur the line between work and data generation. While companies such as Humyn Labs frame this as a technology infrastructure problem, it introduces new questions about how human contribution is recognised and rewarded.


The risk of bias in AI training data

As with other forms of AI, the composition of datasets is critical.

Systems trained on limited or unrepresentative data may perform unevenly across different environments or populations. This is particularly relevant for physical AI, where conditions vary widely across regions and use cases.

Humyn Labs’ global data collection strategy may help address this, but it also introduces challenges in maintaining consistency and quality across datasets.


Why data infrastructure will shape the future of AI

As AI systems move into the physical world, the importance of data infrastructure is becoming more visible.

Control over data pipelines can influence how systems are trained, what behaviours they learn and where they can be deployed. This shifts part of the industry’s influence from model developers to data providers.

Companies like Humyn Labs are positioning themselves within this layer, aiming to become suppliers of the datasets that underpin physical AI systems.


A hidden layer with outsized impact

The development of AI is often framed in terms of models and hardware. The rise of the “human data layer” points to a less visible but equally important constraint: the availability of high-quality, real-world data.

As robotics moves closer to large-scale deployment, this layer is likely to become more prominent. It will shape not only how systems are trained, but also how they behave in real-world environments.

For now, it remains largely out of view. But it is increasingly clear that without it, the next phase of AI will struggle to move beyond controlled settings into everyday use.


Liked this article? You can support our independent journalism via our page on Buy Me a Coffee. It helps keep MoveTheNeedle.news focused on depth, not clicks.

👉 https://buymeacoffee.com/movetheneedle.news