Building agents for the physical world
How AI scales humans, and how we make the AI trustworthy
Earlier this year, the conflict in the Middle East forced a number of our relocations to change overnight. People who were mid-move had to be pulled out of the region.
Flights were rebooked under pressure, hotel bookings that had been confirmed had to be modified, and the local movers we usually rely on were either overwhelmed or had stopped servicing the region.
Our 24/7 operations team replanned moves in real time, with the agent staging options and surfacing constraints. Some of what was needed didn’t exist in our system yet.
That period clarified what “failure” could look like in a relocation. It is not a bad recommendation. It is a person and their family stuck in a city during a geopolitical crisis, with their belongings in a shipping container, a lease signed, and a school enrollment in motion. The cost of a bug is measured in someone’s life, not in a retry.
Gullie is building a personal relocation agent and the infrastructure to move talent across 160+ countries. We are solving many of the same technical problems as other AI infra and devtools companies. The difference is that the system has to hold under conditions where the world stops behaving the way the model assumed.
Here are four of the hardest problems on our roadmap.
1. Confidence Thresholds: When Does the Agent Step Back?
Every agent needs a handoff-to-human policy. How confident are we in having an agent make a decision and execute on it? Our first instinct was to route on complexity, but we’ve found that’s the wrong axis.
Generating a shortlist of three apartments in Berlin is complex but recoverable. It pulls from multiple sources, normalizes listings, applies weighted scoring, and visualizes the result in our platform. If we get it wrong, the relocatee gives feedback and the agent does another run. We are also fairly confident the result will be good.
Submitting a tax residency declaration is objectively straightforward as a task, but if you get it wrong, the person could end up spending months fighting a tax authority, with downstream errors in payroll, social security, and benefits enrollment.
So we route on a two-dimensional score: error cost × reversibility.
High cost and low reversibility means a human reviews before anything fires off, regardless of how easy the task looks.
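A minimal sketch of what such a routing rule could look like. The `Action` type, the 0 to 1 scales, and the 0.5 thresholds are illustrative assumptions, not the production policy. The text only specifies the high-cost, low-reversibility quadrant; sending everything else to the agent is a simplification made here.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    AGENT_EXECUTES = "agent_executes"
    HUMAN_REVIEWS = "human_reviews"


@dataclass(frozen=True)
class Action:
    name: str
    error_cost: float     # 0.0 (trivial) to 1.0 (severe); hypothetical scale
    reversibility: float  # 0.0 (cannot undo) to 1.0 (free to undo); hypothetical scale


# Hypothetical thresholds, chosen only for illustration.
COST_THRESHOLD = 0.5
REVERSIBILITY_THRESHOLD = 0.5


def route(action: Action) -> Route:
    """Route on error cost and reversibility; task complexity is not an input."""
    high_cost = action.error_cost >= COST_THRESHOLD
    hard_to_undo = action.reversibility < REVERSIBILITY_THRESHOLD
    if high_cost and hard_to_undo:
        return Route.HUMAN_REVIEWS
    # Assumption: every other quadrant is left to the agent in this sketch.
    return Route.AGENT_EXECUTES


# Scores below are made up to mirror the two examples in the text.
shortlist = Action("apartment_shortlist", error_cost=0.2, reversibility=0.9)
tax_filing = Action("tax_residency_declaration", error_cost=0.9, reversibility=0.1)
```

Note that nothing in `route` measures how hard the task is to perform, which is the point: the complex shortlist goes to the agent, the simple filing goes to a human.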
2. Reversibility: There isn’t always an “undo”
When the agent approves a moving quote and the shipper picks up someone’s household, that is a physical action. The belongings are on a container ship. The old apartment has a new tenant. You can’t roll back a shipper.
We tier every action in the pipeline:
Tier one is fully reversible. Drafting, estimating, generating. The agent operates freely.
Tier two is semi-reversible at some cost. Rescheduling a visa appointment. Amending a lease application. Changing a flight inside the change window. The agent executes with a confirmation step and a time-boxed review.
Tier three is hard to reverse or costly to amend. Submitting a government filing, triggering a household goods shipment, or executing a lease.
The agent prepares everything, validates against the jurisdiction, and stages it for our human-in-the-loop team to review and execute. It does not fire on its own.
This is enforced in the execution layer, not the prompt, and it isn’t a placeholder for future automation. It is the design. The model gets faster every quarter, but the judgment call about whether to ship someone’s life across an ocean stays with a human.
Maintaining this across 160+ countries is tough. The same action lives in different tiers depending on where and when you are, and we’ve learnt that a crisis exposes the tiering instantly.
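To make the idea of enforcing tiers in the execution layer concrete, here is a hedged sketch. The action names, the `XX` country code, the override table, and the `execute` signature are all hypothetical; the real tier table is not shown in this post.

```python
from enum import IntEnum
from typing import Callable, Optional


class Tier(IntEnum):
    REVERSIBLE = 1       # drafting, estimating, generating
    SEMI_REVERSIBLE = 2  # rescheduling, amending, changing inside a window
    HUMAN_EXECUTES = 3   # filings, shipments, lease execution


# Hypothetical defaults plus per-country overrides ("XX" is a placeholder code).
DEFAULT_TIERS = {
    "draft_cost_estimate": Tier.REVERSIBLE,
    "reschedule_visa_appointment": Tier.SEMI_REVERSIBLE,
    "submit_government_filing": Tier.HUMAN_EXECUTES,
}
COUNTRY_OVERRIDES = {
    ("reschedule_visa_appointment", "XX"): Tier.HUMAN_EXECUTES,
}


class HumanStepRequired(Exception):
    """Raised when an action is attempted without the human step its tier needs."""


def tier_for(action: str, country: str) -> Tier:
    # Assumption: unknown actions fail closed into the strictest tier.
    default = DEFAULT_TIERS.get(action, Tier.HUMAN_EXECUTES)
    return COUNTRY_OVERRIDES.get((action, country), default)


def execute(action: str, country: str, run: Callable[[], str],
            actor: str = "agent", confirmed: bool = False) -> str:
    """Gate that runs regardless of what the model was prompted to do."""
    tier = tier_for(action, country)
    if tier == Tier.HUMAN_EXECUTES and actor != "human":
        raise HumanStepRequired(f"{action} in {country}: a human must execute")
    if tier == Tier.SEMI_REVERSIBLE and not confirmed:
        raise HumanStepRequired(f"{action} in {country}: confirmation required")
    return run()


def try_as_agent(action: str, country: str) -> Optional[str]:
    """Small helper for the usage example: returns None when the gate blocks."""
    try:
        return execute(action, country, lambda: "done")
    except HumanStepRequired:
        return None
```

Because the check sits in `execute` rather than in the prompt, a model that decides to fire a tier-three action still hits the gate. The override table is where the same action can land in a different tier for a different country.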
3. Long Feedback Loops: You Can’t A/B Test Someone’s Move
A relocation takes 3 to 9 months, and a human and their AI agent make hundreds of micro-decisions across that window. Each one compounds. You don’t know if the experience was optimal until the person has landed, enrolled their kids in school, and reported that things feel stable.
By the time you have that signal, the memory blocks and context from the start of the case have been updated and compressed multiple times, and immigration rules may have changed.
The feedback is about a system that may not be the same anymore.
And you can’t A/B test this. The combination of origin, destination, family composition, visa type, employment structure, pets, and timeline is unique enough that no two relocations are comparable. There’s no control group. Every case is an edge case.
So we’re building toward evaluation at the decision level, not the outcome level. Instead of waiting until the end of the relocation to ask “did this go well?” (one signal per case, six months late), the goal is to break the relocation into individual decisions the agent made along the way and grade each decision separately, as you go.
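One way decision-level grading could be represented is a log with one record per decision, each graded on its own. This is a sketch under assumptions: the field names, the 0 to 1 grade scale, and the `rules_version` snapshot (added here so a grade can be read against the rules in force at the time) are all hypothetical, and the dates are made up.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class Decision:
    case_id: str
    step: str                      # e.g. "shortlist_apartments"
    made_on: date
    rules_version: str             # snapshot of the rules when the decision was made
    grade: Optional[float] = None  # 0.0 to 1.0, set later by a reviewer or rubric


@dataclass
class DecisionLog:
    decisions: list[Decision] = field(default_factory=list)

    def record(self, decision: Decision) -> None:
        self.decisions.append(decision)

    def grade(self, case_id: str, step: str, score: float) -> None:
        # Grades every matching decision; steps are assumed unique per case here.
        for d in self.decisions:
            if d.case_id == case_id and d.step == step:
                d.grade = score

    def mean_grade_by_step(self) -> dict[str, float]:
        """Average grade per step type across cases; ungraded decisions are skipped."""
        grades: dict[str, list[float]] = {}
        for d in self.decisions:
            if d.grade is not None:
                grades.setdefault(d.step, []).append(d.grade)
        return {step: sum(g) / len(g) for step, g in grades.items()}


log = DecisionLog()
log.record(Decision("case-1", "shortlist_apartments", date(2025, 1, 10), "rules-v1"))
log.record(Decision("case-2", "shortlist_apartments", date(2025, 2, 3), "rules-v2"))
log.record(Decision("case-1", "book_flight", date(2025, 2, 20), "rules-v1"))
log.grade("case-1", "shortlist_apartments", 1.0)
log.grade("case-2", "shortlist_apartments", 0.5)
```

Two relocations are rarely comparable as a whole, but the same step type recurs across cases, so grouping grades by step gives many signals per case instead of one.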
4. Agent Auth on Legacy Systems
A lot of relocation work runs through government portals with no APIs. Most were built in the early 2000s. Some only work in the local language. Every country is different. Different forms, different rules, different documents.
Browser automation breaks constantly. A government redesigns its portal over a weekend and your scripts are dead by Monday. You can’t version-control a website you don’t own.
So instead of automating the portals, we automate the work around them. The agent collects the data, checks it against the country’s rules, prepares the documents, and fills in everything it can. Then it hands a ready-to-submit package to our ops team.
Where the volume justifies it, we build dedicated point solutions. One example: a DMV slot watcher that monitors appointment availability and pings you to book a slot the moment something opens up. Driver’s license appointments that used to take a month to book now take three days.
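The core of a slot watcher like this can be sketched as a poll-and-diff loop. Everything here is an assumption for illustration: `fetch_slots` and `notify` are injected callables standing in for whatever the real integration does, and slots are plain timestamp strings.

```python
from typing import Callable, Iterable


class SlotWatcher:
    """Polls for appointment slots and notifies only about newly opened ones."""

    def __init__(self, fetch_slots: Callable[[], Iterable[str]],
                 notify: Callable[[str], None]) -> None:
        self._fetch_slots = fetch_slots  # hypothetical: returns currently open slots
        self._notify = notify            # hypothetical: pings the relocatee
        self._seen: set[str] = set()

    def poll_once(self) -> set[str]:
        current = set(self._fetch_slots())
        opened = current - self._seen
        for slot in sorted(opened):
            self._notify(slot)
        # Keep only what is open now, so a slot that closes and reopens notifies again.
        self._seen = current
        return opened


# Usage with canned responses instead of a real portal.
responses = iter([
    ["2030-01-05T09:00"],
    ["2030-01-05T09:00", "2030-01-03T14:30"],
])
sent: list[str] = []
watcher = SlotWatcher(lambda: next(responses), sent.append)
watcher.poll_once()
watcher.poll_once()
```

In practice `poll_once` would be called on a schedule; the interval and any rate limiting are left out of this sketch.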
Behind all of this is a knowledge graph that tracks what each country and city actually requires: forms, documents, timelines, quirks. Keeping it accurate is constant work. Governments change rules all the time and nobody sends a changelog.
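As a rough illustration of how such requirements could be queried, here is a small sketch with city-level rules overriding country-level ones and a staleness check. The schema, the `XX` country, the `Example City` entry, the document names, and the 90-day window are all placeholders, not real requirements or the actual graph.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass(frozen=True)
class Requirement:
    documents: frozenset[str]
    last_verified: date  # when someone last confirmed this rule


# Keys are (process, country, city); city=None means a country-wide rule.
# All entries are placeholders for illustration.
REQUIREMENTS = {
    ("address_registration", "XX", None): Requirement(
        frozenset({"passport", "lease_contract"}), date(2025, 1, 15)),
    ("address_registration", "XX", "Example City"): Requirement(
        frozenset({"passport", "lease_contract", "landlord_letter"}), date(2025, 3, 1)),
}


def lookup(process: str, country: str, city: Optional[str] = None) -> Requirement:
    """Most specific rule wins: city-level first, then the country-wide rule."""
    specific = REQUIREMENTS.get((process, country, city))
    if specific is not None:
        return specific
    return REQUIREMENTS[(process, country, None)]


def missing_documents(provided: set[str], req: Requirement) -> set[str]:
    return set(req.documents) - provided


def is_stale(req: Requirement, today: date, max_age_days: int = 90) -> bool:
    """Flag rules nobody has re-verified recently (the window is an assumption)."""
    return today - req.last_verified > timedelta(days=max_age_days)


package = {"passport", "lease_contract"}
city_rule = lookup("address_registration", "XX", "Example City")
gaps = missing_documents(package, city_rule)
```

The `last_verified` field reflects the maintenance problem described above: since nobody sends a changelog, each rule needs a record of when it was last checked.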
The value isn’t the click. It’s the 90% of work before the click. The human approval is what makes the last step trustworthy.
Why we do this
Every relocation is a person starting a new chapter. People will be moving across borders for the rest of human history.
The infrastructure for how this actually happens has been broken for decades, run on spreadsheets and personal favours. The cost of that brokenness has always been paid by the person trying to move.
Building the system that finally works, that holds in a crisis, that can be trusted with the most consequential moments of someone’s life, is a problem worth solving.
The pattern across all four problems above is the same. The AI is what makes our human team scale. The human team is what makes the AI trustworthy.
We are not interested in removing humans from the loop. We are interested in giving them leverage that didn’t exist before, so the humans can spend their attention on the moments that matter.
We’re hiring! If any of this resonates, write to us at careers@gullie.io.


