Teaching Satellites to Remember: Patagonia as a Testbed for Predictive Earth Intelligence

Community Article · Published May 8, 2026

NuTonic combines temporal satellite memory with vision-language reasoning so AI can not only see Earth from space but also explain what is changing and help people act sooner.

Every day, satellites watch the planet change. Forests dry out. Rivers swell. Glaciers retreat. Cities push into coasts and wetlands. The raw images are already there, streaming from orbit in a volume no human team can read fast enough.

The missing piece is not more pixels. It is understanding.

NuTonic's Patagonia experiment explores a powerful idea: combine a satellite vision-language model, which can describe what it sees, with a temporal intelligence model, TiM, which is built to reason across time. The result is a prototype for Earth observation that does more than caption a picture. It can begin to answer the question that matters most in climate response, conservation, and disaster preparedness:

What is changing, and where should we look next?


The Big Idea

Most satellite AI systems are trained to recognize a single image: water, forest, snow, city, road, burn scar. That is useful, but the Earth is not a still photograph. The Earth is a movie.

TiM gives the system a form of temporal memory. Instead of asking the model to interpret one isolated image, the workflow gives it evidence from a sequence of Sentinel-2 observations. The vision-language model then turns that evidence into human-readable analysis: a caption, a likely land-cover story, and where relevant, boxes that point to regions of interest.

In plain language:

TiM watches the change. The VLM explains the change. Together, they point attention toward what may happen next.

This is the core promise of the Patagonia work. It is not just "AI looks at a satellite image." It is a step toward an operational system where AI can help people monitor remote terrain, detect stress, identify emerging risks, and prioritize scarce attention.

Making the problem explicit:

Sentinel-2 could in theory provide near-real-time intelligence, if only the data could be delivered and processed in time. In practice it cannot: even near-real-time often arrives too late. We need to be able to anticipate the future in order to react in the present.


Why Patagonia?

Patagonia is one of the best places on Earth to test this kind of system because it contains so many environmental stories in one region.

There are glaciers and ice fields, forests and lakes, dry steppe, coastal cities, fjords, marine reserves, wetlands, and wildfire-prone landscapes. A useful Earth intelligence model must be able to handle all of that variety. It must distinguish a marine reserve from a coastal channel, a forest lake from bare steppe, a flood-prone wetland from ordinary water, and a wildfire context from a naturally dry landscape.

That makes Patagonia more than a scenic benchmark. It is a miniature stress test for planetary monitoring.

The evaluation used curated Patagonia targets including:

  • Andean forest and lake regions,
  • Marine reserves and nearshore channels,
  • Glacier and ice landscapes,
  • Coastal urban controls,
  • Fjord and mountain controls,
  • Dry steppe wildfire contexts,
  • Seasonal wetland and flood-pulse contexts.

For a public competition, this matters because the story is easy to understand: if AI can help read Patagonia's changing terrain, the same idea can scale to many of the world's fragile frontiers.


From Seeing to Predicting

Prediction in Earth observation does not always mean forecasting a precise event at a precise hour. Often, the most valuable prediction is earlier awareness: a system that recognizes the ingredients of risk before they become a headline.

A river corridor that is repeatedly wetter than expected can become a flood-priority area. A steppe region showing burn-relevant signals can become a wildfire-watch zone. A coastal wetland showing unusual transitions can be flagged for conservation review. A glacier margin can be tracked as its surrounding terrain changes.

The Patagonia prototype points toward this workflow:

```mermaid
flowchart LR
  A[Satellite observations over time] --> B[TiM temporal memory]
  B --> C[Vision-language model explanation]
  C --> D[Human-readable change story]
  D --> E[Priority map for people]
  E --> F[Conservation, disaster response, climate planning]
```

This is the leap: from passive imagery to active decision support.

Instead of asking analysts to inspect every pixel everywhere, the system can help generate a ranked set of places worth attention. It becomes a machine assistant for triage: not replacing scientists, emergency teams, or conservation experts, but giving them a faster first read of where change may be concentrating.
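
To make the triage idea concrete, here is a toy sketch of ranked attention. Every name and score in it is hypothetical; this is not the repository's API:

```python
# Toy triage sketch: rank monitored areas by a change score so human experts
# review the most active regions first. Names and scores are hypothetical.
from dataclasses import dataclass

@dataclass
class AoiSignal:
    name: str
    change_score: float  # e.g. fraction of pixels flagged as changed
    context: str         # e.g. "wetland", "steppe", "glacier"

def triage(signals: list[AoiSignal], top_k: int = 5) -> list[AoiSignal]:
    """Return the top-k areas of interest, most-changed first."""
    return sorted(signals, key=lambda s: s.change_score, reverse=True)[:top_k]

watchlist = triage([
    AoiSignal("rio-corridor-12", 0.31, "wetland"),
    AoiSignal("steppe-block-4", 0.07, "steppe"),
    AoiSignal("glacier-margin-2", 0.18, "glacier"),
])
for s in watchlist:
    print(f"{s.name}: {s.change_score:.2f} ({s.context})")
```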


How the Datasets Were Produced

The satellite fine-tune is backed by a data-production pipeline that turns ordinary geolocated points into supervised VLM examples. The important idea is that each row is not just an image. It is an image paired with satellite-derived labels, captions, and sometimes grounding boxes, so the model learns to describe Earth imagery in a way that downstream tools can use.

The training dataset path is:

  1. Start from geolocated candidate points. The orchestrator selects latitude/longitude rows from a Hugging Face source dataset, using spacing and sampling rules so the points are geographically diverse rather than clustered.
  2. Materialize satellite evidence. For each selected point, the pipeline downloads Sentinel-2 L2A imagery through a public STAC catalog. Optional Mapbox Satellite stills provide an overhead context image for broader visual grounding.
  3. Add land-cover supervision, using Google Earth Engine / Dynamic World labels when available. Those labels are aligned to the same 10 m reference grid as the Sentinel-2 RGB stack.
  4. Cut the world into training chips. The builder slides native-resolution windows over the Sentinel scene, then downsamples RGB tiles to model-sized images such as 224×224. The land-cover mask is downsampled with nearest-neighbor alignment so labels still match the image.
  5. Create image-text rows. Each tile can emit a global caption row and per-class rows for visible land-cover classes above a threshold. Per-class rows teach the model both what is present and where it appears.
  6. Write LEAP/LFM-VL SFT JSONL. Outputs are stored under images/, metadata/, and data/train.jsonl, data/validation.jsonl, data/test.jsonl, using the message format expected by the satellite VLM training stack.

```mermaid
flowchart LR
  P[Geolocated POIs] --> S[Sentinel-2 STAC imagery]
  P --> M[Optional Mapbox stills]
  S --> D[Dynamic World / EE labels]
  S --> T[RGB training chips]
  D --> T
  T --> J[Caption + grounding JSONL rows]
  M --> J
  J --> H[Hugging Face dataset]
  H --> F[NuTonic/lspace fine-tune]
```
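
As a rough illustration of steps 4 and 5 above, the following sketch shows the chip-cutting and row-emission pattern. The tile sizes, array shapes, and row schema are assumptions; the real builder may differ:

```python
# Rough sketch of steps 4-5: slide windows over a Sentinel-2 scene, downsample
# RGB with interpolation and the label mask with nearest-neighbor, then emit a
# chat-style caption row. Shapes, sizes, and the row schema are illustrative.
import json
import numpy as np
from PIL import Image

TILE = 1024      # native-resolution window (10 m pixels)
OUT_SIZE = 224   # model-sized image

def iter_chips(rgb: np.ndarray, mask: np.ndarray):
    """Yield (image, label) pairs for each full window in the scene."""
    h, w, _ = rgb.shape
    for y in range(0, h - TILE + 1, TILE):
        for x in range(0, w - TILE + 1, TILE):
            img = Image.fromarray(rgb[y:y+TILE, x:x+TILE])
            lab = Image.fromarray(mask[y:y+TILE, x:x+TILE])
            # Nearest-neighbor keeps class IDs aligned with the resized image.
            yield (img.resize((OUT_SIZE, OUT_SIZE), Image.BILINEAR),
                   np.asarray(lab.resize((OUT_SIZE, OUT_SIZE), Image.NEAREST)))

def caption_row(image_path: str, caption: str) -> str:
    """One JSONL line in a generic message format; the real schema may differ."""
    return json.dumps({
        "images": [image_path],
        "messages": [
            {"role": "user", "content": "<image> Describe this satellite scene."},
            {"role": "assistant", "content": caption},
        ],
    })
```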

The production scripts are designed for scale. run_lfm_vl_sft_orchestrator.py processes points in ephemeral batches, streams JSONL rows, prunes large Sentinel COG trees after successful processing, and supports geo-jitter so nearby variations of the same point can become additional training examples. That let us build a satellite instruction dataset of approximately 1.1 million entries from a seed dataset of roughly 11 thousand points.
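
The geo-jitter idea is simple enough to sketch; the offsets and counts below are illustrative rather than the orchestrator's actual parameters:

```python
# Illustrative geo-jitter: derive nearby variants of a seed point so one
# location can become several training examples. Offsets are assumptions.
import random

def geo_jitter(lat: float, lon: float, n: int = 4, max_offset_deg: float = 0.02):
    """Yield n randomly offset copies of a point (~2 km at these latitudes)."""
    for _ in range(n):
        yield (lat + random.uniform(-max_offset_deg, max_offset_deg),
               lon + random.uniform(-max_offset_deg, max_offset_deg))

variants = list(geo_jitter(-49.33, -72.88))  # near the Southern Patagonian Ice Field
```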

For the model, this matters because the training examples teach three habits at once:

  • Describe the scene in satellite-native language.
  • Respect structured output so answers can feed app overlays and review tools.
  • Connect words to regions through class-specific grounding examples.

The Patagonia evaluation then tests those habits in a separate setting: can the model use the same satellite vocabulary, boxes, and structure when it is given temporal context and asked to reason about change? The major difference between the Patagonia evaluation and the test/validation splits is that the training dataset was generated from "lived areas," whereas we should not expect to see any human activity in the natural reserves selected in Patagonia.


What the Vision-Language Model Adds

Satellite analytics are often trapped inside numbers: vegetation index, cloud mask, land-cover fraction, water probability. Those numbers are important, but they are not how most people make decisions.

A mayor, field team, environmental journalist, or regional planner needs a sentence:

"This wetland appears to show increased water extent relative to the previous observation window."

Or:

"This dry steppe area is being evaluated in a wildfire-change context; bare-ground and vegetation signals should be reviewed."

That is where a vision-language model becomes powerful. It translates machine observations into language. It can name what is visible, summarize the likely environmental context, and produce structured outputs that downstream tools can use.

The NuTonic fine-tune, NuTonic/lspace, is a satellite-specialized version of LiquidAI/LFM2.5-VL-450M. It was trained for satellite captioning, visual question answering, and grounding-style tasks using the NuTonic satellite SFT stack. In everyday terms, it is being taught to speak the language of Earth observation rather than the language of ordinary internet photos.
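
For readers who want to try it, a minimal usage sketch follows, assuming the checkpoint exposes the standard transformers image-text-to-text interface; the model card remains the authoritative reference:

```python
# Hedged usage sketch, assuming the checkpoint exposes the standard
# transformers image-text-to-text interface. See the model card for the
# authoritative loading code; the image path here is a placeholder.
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "NuTonic/lspace"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(model_id, trust_remote_code=True)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": Image.open("patagonia_tile.png")},
        {"type": "text", "text": "Describe the land cover in this Sentinel-2 tile."},
    ],
}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
)
output = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(output, skip_special_tokens=True)[0])
```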


What TiM Adds

A normal vision model sees a single frame. TiM is designed to reason over the sequence.

That matters because many of Earth's most important events are defined by before-and-after patterns:

  • Flooding is water appearing or expanding.
  • Drought stress is vegetation weakening over time.
  • Fire impact is a before-and-after change in land surface.
  • Urban expansion is gradual replacement of bare or vegetated land.
  • Glacier retreat is motion and exposure across seasons.
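
The first pattern on that list is easy to make concrete. Here is a minimal bi-temporal water check using NDWI; the threshold and band handling are assumptions for illustration, not TiM internals:

```python
# Bi-temporal water check using NDWI (green vs. near-infrared), a concrete
# version of the flood pattern above. The 0.2 threshold is an assumption,
# not a TiM internal; inputs are per-band reflectance arrays.
import numpy as np

def ndwi(green: np.ndarray, nir: np.ndarray) -> np.ndarray:
    return (green - nir) / np.clip(green + nir, 1e-6, None)

def water_expansion(before: dict, after: dict, threshold: float = 0.2) -> float:
    """Fraction of pixels that flipped from land to water between scenes."""
    was_water = ndwi(before["green"], before["nir"]) > threshold
    is_water = ndwi(after["green"], after["nir"]) > threshold
    return float(np.mean(is_water & ~was_water))
```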

In the Patagonia setup, TiM-style temporal information is injected into the model's prompt so the VLM is not merely describing a pretty satellite tile. It is being asked to read that tile with a memory of change.
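
A sketch of that injection pattern, with an invented payload and template (the actual prompt used in the runs is not reproduced here):

```python
# Illustration of the "memory in the prompt" pattern: temporal hints are
# serialized into the instruction the VLM receives alongside the image.
# The template and payload keys are invented, not the runs' actual prompt.
def build_prompt(tim_summary: dict) -> str:
    hints = ", ".join(f"{k}: {v:+.2f}" for k, v in tim_summary.items())
    return (
        "You are analyzing a Sentinel-2 tile with temporal context.\n"
        f"Observed changes since the earlier window: {hints}.\n"
        "Describe the scene, explain the likely change story, and return "
        "regions of interest as JSON boxes."
    )

prompt = build_prompt({"water_fraction": +0.12, "vegetation_fraction": -0.08})
```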

The best version of this system is a collaboration:

  • TiM helps detect temporal signals.
  • The VLM explains those signals.
  • The product layer turns explanations into maps, alerts, and bundles for people.

That is the larger vision behind the PRO tab and inference services in this repository: a map-first interface where satellite materialization, temporal signals, and VLM explanation come together in a practical workflow.


How the Temporal Evaluation Works

The Patagonia benchmark is not only a still-image caption test. It is designed to ask whether a model can use time-aware evidence when interpreting a satellite scene.

The evaluation has three temporal layers:

  1. Temporal satellite inputs for TiM. TiM receives Sentinel-2 observations from a date window, using s2_mode: stac, rgb_mode: s2_rgb, and both RGB and S2L2A modalities in 20260507T230936Z/tim/tim_config.json. Most targets use the broad window 2025-01-01/2026-04-30; the wildfire and wetland targets use narrower seasonal windows.
  2. Temporal scene selection for selected AOIs. Some targets include explicit temporal_scenes: the steppe wildfire row compares 2024-11-01/2025-01-31 with 2025-11-01/2026-01-31, while the wetland/flood-pulse row compares 2025-04-01/2025-06-30 with 2025-10-01/2025-12-31. This gives the benchmark a concrete before/after structure instead of a generic date range.
  3. Delta gold for grounding. The main run used gold_mode: delta with gold_min_temporal_separation_days: 31. In practice, the scoring harness builds gold regions from bi-temporal Sentinel-2 scene-classification disagreement, so the target boxes represent where the optical surface changed, not just what was visible in one frame.
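
To make the delta-gold idea tangible, here is a minimal sketch assuming two co-registered SCL rasters; the real harness may extract regions differently:

```python
# Minimal sketch of delta gold: mark where two co-registered Sentinel-2
# scene-classification (SCL) rasters disagree, then box the connected
# changed regions. Region extraction in the real harness may differ.
import numpy as np
from scipy import ndimage

def delta_gold_boxes(scl_early: np.ndarray, scl_late: np.ndarray,
                     min_area: int = 64) -> list[tuple[int, int, int, int]]:
    """Return (x0, y0, x1, y1) boxes around connected changed regions."""
    changed = scl_early != scl_late
    labeled, _ = ndimage.label(changed)
    boxes = []
    for ys, xs in ndimage.find_objects(labeled):
        # Keep regions whose bounding box covers at least min_area pixels.
        if (ys.stop - ys.start) * (xs.stop - xs.start) >= min_area:
            boxes.append((xs.start, ys.start, xs.stop, ys.stop))
    return boxes
```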

That means the model is evaluated in a more realistic Earth-observation setting: it sees a late/current scene, receives temporal context, and is judged partly against regions where the satellite record says meaningful change occurred.

```mermaid
flowchart LR
  A[Earlier Sentinel-2 scene] --> D[Delta gold: changed regions]
  B[Later Sentinel-2 scene] --> D
  B --> C[Current RGB still]
  A --> T[TiM temporal context]
  B --> T
  C --> V[VLM response]
  T --> V
  D --> G[Grounding score]
  V --> G
  V --> M[Lexical, contract, and faithfulness composite]
```

In simpler language: the benchmark asks the model to describe the current image with memory, then checks whether its words and boxes align with the temporal evidence.


Evaluation Metrics

The run reports several axes, each normalized to a score between 0 and 1:

| Metric | What it measures | Why it matters for temporal EO |
| --- | --- | --- |
| Lexical | Whether the model uses expected Patagonia/environment vocabulary for the AOI. | A change explanation must name the right kind of place: wetland, glacier, marine reserve, steppe, forest, coast. |
| Grounding | Whether predicted boxes overlap SCL-derived gold boxes. In delta mode, those boxes are change-oriented. | The model should not only speak about change; it should point to where change is happening. |
| Output contract / structured | Whether the response follows the production JSON/box schema. | Structured outputs can become app overlays, review queues, or downstream alerts. |
| Faithfulness / TiM alignment | Whether the caption agrees with injected analytics such as land-cover fractions or change hints. | A temporal model is useful only if the explanation respects the evidence it was given. |
| Composite | Weighted blend of the above axes. | Gives one overall score while preserving diagnostic sub-scores. |

For the main run, composite_weight_preset: auto resolved to a scoring setup with strong emphasis on grounding and structured/faithful output. The report records the effective default weights as:

| Axis | Weight |
| --- | --- |
| Lexical | 0.18 |
| Grounding | 0.42 |
| Output contract / structured | 0.22 |
| Faithfulness / TiM alignment | 0.18 |

When procedural analytics are active, the equivalent preset uses slightly more weight on faithfulness: lexical 0.16, grounding 0.40, contract 0.22, faithfulness 0.22.

The grounding policy also discourages box spam: the main run allowed a maximum of 3 predicted boxes, applied a penalty for extra boxes, and used an oversize penalty strength of 0.9.
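
Putting those pieces together, a hedged sketch of the composite blend follows. Only the weights come from the report above; the penalty formula and its strength are assumptions:

```python
# Sketch of the composite blend using the reported default weights, plus a
# simple extra-box penalty. The penalty formula and its 0.05 strength are
# assumptions; only the weights come from the report above.
WEIGHTS = {"lexical": 0.18, "grounding": 0.42, "contract": 0.22, "faithfulness": 0.18}
MAX_BOXES = 3

def composite(scores: dict, n_boxes: int, extra_box_penalty: float = 0.05) -> float:
    base = sum(WEIGHTS[axis] * scores[axis] for axis in WEIGHTS)
    penalty = extra_box_penalty * max(0, n_boxes - MAX_BOXES)
    return max(0.0, base - penalty)

print(composite({"lexical": 0.61, "grounding": 0.22,
                 "contract": 0.73, "faithfulness": 0.69}, n_boxes=4))
```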


A Simple Example

Imagine a remote wetland in southern Patagonia.

A single satellite image might show water, vegetation, and exposed ground. A person could inspect it and form an opinion. But one image does not say whether the wetland is normal, expanding, drying, or recovering.

Now add temporal memory. The system can compare recent observations with earlier ones. It can look for changing water extent, vegetation shifts, or exposed bare ground. Then the VLM can summarize the situation in ordinary language and point to the part of the image that deserves review.

That turns a satellite tile into a decision cue: "This area may be changing. Put it on the list."

For flood response, habitat monitoring, wildfire preparedness, and climate adaptation, that kind of prioritization is often the difference between being overwhelmed by data and acting early.


What We Learned in Patagonia

The Patagonia runs compared the NuTonic satellite specialist against the base Liquid model across twelve curated targets. The evaluation looked at whether the model used the right environmental vocabulary, whether it produced structured outputs, whether it pointed to relevant areas, whether its caption respected the injected analytics, and whether TiM-style temporal context improved the answer.

The headline technical result is that temporal context improved both models in the main run:

| Model | Mean composite with temporal context | Mean composite without temporal context | Lift from temporal context |
| --- | --- | --- | --- |
| NuTonic/lspace satellite fine-tune | 0.4825 | 0.3857 | +0.1357 |
| Base LFM2.5-VL-450M | 0.2885 | 0.1528 | +0.0968 |

That is the core signal for the competition story: a satellite model becomes more useful when it is allowed to reason with time.

The full metric breakdown is more nuanced:

| Metric | Base LFM2.5-VL-450M | NuTonic/lspace fine-tune | Readout |
| --- | --- | --- | --- |
| Mean composite | 0.2885 | 0.4825 | Fine-tune wins overall in this run. |
| Output contract / structured | 0.0000 | 0.7333 | Fine-tune followed the strict output format much better. |
| Lexical | 0.4854 | 0.6125 | Fine-tune used expected AOI vocabulary more consistently. |
| Grounding | 0.1590 | 0.2240 | Fine-tune pointed to relevant regions better on average. |
| Faithfulness / TiM alignment | 0.6182 | 0.6882 | Both used injected analytics; fine-tune was slightly higher. |
| Pass rate at 0.5 | 0.1667 | 0.4167 | Fine-tune passed more examples at the selected threshold. |

For a technical reader, the interesting point is not that the fine-tune wins every metric. The interesting point is that the temporal prompt condition moves both models upward, and the satellite fine-tune shows the larger gain. The model family is responding to temporal evidence; the next improvement target is stricter output-format training plus healthier TiM signal quality.

The benchmark also includes counterfactual probes such as tim_payload_flip, where analytics are perturbed to test whether a model changes its answer when the temporal/analytics evidence changes. These probes are important because a useful predictive Earth system should not merely recite a plausible caption; it should be sensitive to the evidence.
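
A sketch of that flip-probe pattern, with placeholder perturbation and comparison logic (the actual probe's implementation is not reproduced here):

```python
# Sketch of a tim_payload_flip-style probe: invert the injected analytics and
# check whether the answer moves. The real probe's perturbation and scoring
# are not reproduced here.
def flip_payload(payload: dict) -> dict:
    """Invert the sign of every change hint in the analytics payload."""
    return {k: -v for k, v in payload.items()}

def is_evidence_sensitive(answer: str, flipped_answer: str) -> bool:
    # A model that recites the same caption under contradictory evidence is
    # not actually reading the temporal signal.
    return answer.strip() != flipped_answer.strip()
```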


Why This Matters Beyond Scores

Benchmarks are useful, but the real prize is a new kind of Earth interface.

Today, satellite platforms often ask users to know what to search for. A person chooses a layer, chooses a date, chooses a region, toggles filters, and manually interprets the result. That is powerful for experts, but it does not scale to every watershed, coastline, protected area, and fire-prone region.

The TiM + VLM approach changes the relationship:

  • The system can watch many regions.
  • It can summarize likely changes.
  • It can flag places where the story is unusual.
  • It can turn technical observations into language.
  • It can support a priority queue for human experts.

That makes the model valuable not merely as a captioning engine, but as a first responder for attention.


The Product Vision

The repository already points toward a practical product workflow:

  1. A user selects or monitors an area on a map.
  2. The system materializes satellite imagery and temporal context.
  3. TiM extracts change-oriented signals.
  4. The satellite VLM turns those signals into readable explanations.
  5. A PRO-style interface presents the result as a bundle: imagery, summary, likely change story, and regions to inspect.

In a conservation context, that could mean faster review of protected marine and wetland zones.

In a wildfire context, it could mean watching vulnerable steppe regions with more temporal awareness.

In a flood context, it could mean earlier triage of landscapes where water patterns are shifting.

In a climate context, it could mean turning years of difficult satellite data into plain-language narratives people can act on.


What Makes This Different

The important shift is not simply "AI for satellite images." The important shift is memory plus explanation.

Traditional computer vision can classify a scene. TiM-style temporal reasoning can ask how the scene is changing. A vision-language model can explain that change in words and connect it to human goals.

Together, they form a more natural intelligence loop:

```mermaid
flowchart TB
  See[See the Earth] --> Remember[Remember recent change]
  Remember --> Explain[Explain what changed]
  Explain --> Prioritize[Prioritize where people should look]
  Prioritize --> Act[Support action]
  Act --> See
```

This is how satellite AI becomes useful to more than specialists. It becomes a system that can speak across disciplines: science, conservation, emergency response, policy, journalism, and local planning.


The Honest Technical Footnote

The published Patagonia runs are an early prototype, not a finished emergency-response product.

That distinction matters. The measured temporal lift is still real in the harness: the TiM-in-prompt condition improved composite scores over the image-only/no-TiM condition. But the published numbers should be read as evidence for the evaluation architecture and temporal-context prompt value, not as final proof that the TiM semantic outputs were healthy in those exact runs.

Technically, the temporal evidence came from several mechanisms at once:

  • STAC Sentinel-2 windows used to construct the current/late image and TiM batch rows.
  • Explicit temporal_scenes for wildfire and flood-pulse targets.
  • Delta-mode optical gold built from bi-temporal SCL disagreement with a 31-day minimum separation.
  • Prompt variants comparing TiM-context rows against no-TiM rows and counterfactual TiM payload flips.

That caveat does not weaken the vision. It clarifies the next milestone:

Once the temporal signal is fully healthy, the same architecture becomes even more compelling.

The system already demonstrates the central product pattern: feed satellite history into a language-capable visual model, ask it to explain environmental change, and use the result to guide human attention.


NuTonic's Patagonia Release

NuTonic's Patagonia work is a prototype for predictive Earth intelligence.

It takes the constant flow of satellite imagery and gives it three missing abilities:

  • Memory: looking across time, not just at one image.
  • Language: explaining the change in words people can use.
  • Priority: helping decide where attention should go first.

This matters because climate and disaster systems fail when signals arrive too late or are buried in data. The planet already gives us warning signs. The challenge is reading them quickly enough.

By combining TiM with a satellite-specialized VLM, NuTonic is building toward an AI system that can help society move from watching the Earth to anticipating its changes.

That is the promise of Patagonia: a remote, beautiful, fragile landscape used as a proving ground for an idea the whole planet needs.

