I also wonder how Gemini Chat managed to recognize it when the Gemini API didn’t, despite using the same model.
While Gemini x.x and GPT x.x are the models themselves (LLMs, VLMs, or multimodal LLMs), Gemini Chat and ChatGPT are agentic RAG systems (or even more advanced services) built around those models, so their behavior differs significantly.
In some cases, Gemini Chat and ChatGPT even automatically create and execute Python scripts in their cloud runtime, process the results, and, if necessary, call other Gemini or GPT models, essentially building and using a fairly complex pipeline like this internally on their own to provide the final answer. Seriously.
Replicating all of that as-is is quite difficult (though possible, since there are OSS frameworks for it), and because it is fully automated, the results are prone to variation (to the point where even slight differences in usage can lead to performance differences), so it’s better to leave that for later.
It’s probably more realistic to start by building a pipeline that mimics only the parts that matter for your own project. That approach is also likely to be lighter:
Evidence-based photo appraisal for antiques
Gemini API + Lens + FAISS/RAG + OCR + expert-style QC
The key shift:
Generic resale items: photo appraisal is mostly object recognition.
Antiques: photo appraisal is hypothesis testing.
A simple item can work like this:
photo → “generic brass candle holder” → similar records → ~$10
A difficult antique should not work like this:
photo → “Meiji vase” → similar records → ~$500
For antiques, the valuable label is usually what must be proven. A vase can look like a Meiji-period Japanese vase while actually being a later decorative reproduction, Satsuma-style imitation, tube-lined revival piece, tourist/export ware, or seller-mislabeled object.
A better target pipeline:
photos
→ visible evidence
→ competing hypotheses
→ positive + negative retrieval
→ expert rubric checks
→ comp filtering
→ QC / skeptic review
→ supported value + upside-if-authenticated
Not:
photo
→ one best caption
→ FAISS text search
→ average price
1. Why Gemini Chat likely beat the Gemini API call
The likely reason is not simply “Gemini Chat used a better model.”
More likely:
Gemini model = the engine.
Gemini Chat / Gemini Apps = finished assistant product around the engine.
Gemini API = programmable access to the engine/tools, but you build the assistant behavior.
Gemini Apps support uploaded files/photos/videos in a chat workflow: Gemini Apps file upload docs. The Gemini API also supports image understanding, but you must explicitly send the right images, label them, prompt correctly, and attach tools/retrieval: Gemini image understanding.
In the vase case, Gemini Chat saw multiple views including the bottom. For ceramics, the base, foot rim, underside wear, mark, glaze/body transition, decoration detail, and restoration clues can matter more than the front view.
So the API probably failed because it behaved like a one-shot captioner. Gemini Chat succeeded because it behaved more like a multi-photo inspection assistant.
To match or beat Chat, replicate:
- multi-photo context
- labeled views
- detail crops
- OCR on marks/labels
- visual search as candidate discovery
- RAG over rubrics and prior cases
- structured JSON outputs
- tool calling
- comp filtering
- QC gates
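As a sketch of the first three items (multi-photo context, labeled views, detail crops as extra images), here is how a request body could be assembled in the REST `generateContent` style, with a text label preceding each image so the model knows which view it is seeing. The prompt wording, view names, and placeholder bytes are illustrative assumptions, not real data:

```python
import base64

APPRAISAL_PROMPT = (
    "You are inspecting a ceramic vase for appraisal. "
    "Each image below is labeled with its view. "
    "Extract visible evidence per view before proposing any identity."
)

def build_labeled_parts(photos):
    """Interleave a text label before each image so the model knows which
    view it is looking at (front, base bottom, mark crop, ...)."""
    parts = [{"text": APPRAISAL_PROMPT}]
    for view_name, jpeg_bytes in photos.items():
        parts.append({"text": f"View: {view_name}"})
        parts.append({
            "inline_data": {
                "mime_type": "image/jpeg",
                "data": base64.b64encode(jpeg_bytes).decode("ascii"),
            }
        })
    return parts

# The full request body for the generateContent endpoint would then be:
payload = {"contents": [{"parts": build_labeled_parts({
    "front_view": b"...",    # placeholder bytes, not real JPEG data
    "bottom_base": b"...",
    "mark_closeup": b"...",
})}]}
```

This is the piece the one-shot API call was missing: without explicit labels, the model cannot know that one image is the base and another is a mark crop.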
Useful Gemini building blocks:
2. Three use cases, three routes
A. Generic brass candle holder
Mostly object recognition plus broad comp retrieval.
photo → “brass candle holder” → FAISS/text/image comps → ~$10
Route:
cheap Gemini call or OSS VLM
→ object type
→ visible material
→ broad database search
→ simple value range
Example:
Likely identity:
Generic brass candle holder.
Supported resale estimate:
~$8–15, depending on size, condition, and local marketplace.
Confidence:
Medium-high if no maker, designer, age, or unusual quality indicators are visible.
B. Branded brass candle holder
This is exact-identity / maker recognition.
Gemini may see:
brass candle holder
Google Lens may find:
specific branded brass candle holder
That is where visual search helps. Google Lens discovers visually similar images and related content from an image: Google Lens: how it works.
Route:
full photo + mark/logo crop
→ OCR if needed
→ Google Lens / visual search
→ exact or near-exact comp retrieval
→ database price lookup
→ Gemini reconciles evidence
For branded objects, the best Lens image is often:
logo
maker mark
label
pattern number
base stamp
distinctive design detail
Example:
Likely identity:
[Brand/model] brass candle holder.
Evidence:
Visible maker/brand clue + visual-search match + matching database records.
Supported estimate:
~$100, assuming same model, material, size, and condition.
C. Japanese vase
This is authentication / attribution / comp-validity.
Dangerous near-neighbors:
Meiji-period Satsuma vase
Meiji-style vase
Satsuma-style decorative vase
moriage tourist ware
tube-lined imitation
modern decorative reproduction
Chinese/Japanese-style decorative ceramic
Google Lens may surface the expensive visual hypothesis:
Meiji-period Japanese vase → maybe ~$500
But multi-photo evidence may support:
tube-lined or revival technique used to mimic older Meiji-period appearance
→ likely later/revival/reproduction
→ maybe ~$50
The QC question is not:
Which answer is cheaper?
It is:
Which hypothesis is best supported by visible evidence?
What contradicts the expensive hypothesis?
What evidence is missing?
Which sold comps actually match the supported hypothesis?
What upside remains if the item is later authenticated?
3. Recommended architecture
1. Photo intake / sufficiency gate
2. Risk routing
3. Crop and detail extraction
4. OCR and mark interpretation
5. Visual evidence extraction
6. Lens / visual-search candidate discovery
7. Multimodal retrieval
8. Negative-example retrieval
9. Category-specific rubric checks
10. Comparable-sales filtering
11. QC / skeptic review
12. Final appraisal report
Role separation:
| Component | Correct role | Incorrect role |
| --- | --- | --- |
| Gemini API | Evidence extraction, hypotheses, rubric reasoning, comp filtering, QC, report | One-shot appraiser |
| Google Lens | Candidate labels, visually similar web/listing discovery | Authenticator or price authority |
| OCR | Read marks, labels, stamps, signatures | Maker/authenticity proof by itself |
| FAISS/Qdrant/vector DB | Retrieve positive comps, negatives, marks, details | Final price calculator |
| Rubric/RAG | Tell the model what to check per category | Generic background only |
| QC agent | Block unsupported claims, preserve upside, decide escalation | Pick the cheaper answer |
4. Photo sufficiency gate
For Japanese ceramics, require:
front view
back view
left and right side views
top / mouth / interior
bottom / base
foot rim close-up
mark / backstamp close-up
decoration macro
damage / restoration close-ups
scale photo
Internal schema:
```json
{
  "object_category_guess": "Japanese ceramic vase",
  "sufficient_for_generic_identification": true,
  "sufficient_for_authentication": false,
  "missing_required_views": [
    "clear bottom/base photo",
    "foot rim close-up",
    "legible mark/backstamp close-up",
    "decoration macro"
  ],
  "valuation_allowed": "low_confidence_only",
  "blocked_claims": [
    "Meiji-period attribution",
    "verified maker",
    "high-confidence high-value appraisal"
  ]
}
```
Hard rule:
If category = Japanese ceramic vase
and no clear base/foot/mark views are present,
then block high-confidence period and maker claims.
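This hard rule can be enforced deterministically, before any model call. A minimal sketch, where the category-to-required-views table and the view names are illustrative assumptions:

```python
# Views that must be present before authentication-level claims are allowed.
# The table below is illustrative; a real system would load it per category.
REQUIRED_FOR_AUTHENTICATION = {
    "Japanese ceramic vase": [
        "bottom_base", "foot_rim_closeup", "mark_closeup", "decoration_macro",
    ],
}

HIGH_VALUE_CLAIMS = [
    "Meiji-period attribution",
    "verified maker",
    "high-confidence high-value appraisal",
]

def sufficiency_gate(category, provided_views):
    """Block high-value claims when required views are missing."""
    required = REQUIRED_FOR_AUTHENTICATION.get(category, [])
    missing = [v for v in required if v not in provided_views]
    return {
        "sufficient_for_authentication": not missing,
        "missing_required_views": missing,
        "valuation_allowed": "full" if not missing else "low_confidence_only",
        "blocked_claims": [] if not missing else list(HIGH_VALUE_CLAIMS),
    }
```

Because the gate is plain code rather than a prompt, it cannot be talked out of its rule by a persuasive-looking photo set.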
5. Route by risk
Route A: generic low-risk item
Route B: branded / marked / exact-product item
Route C: antique / authenticity-sensitive item
Route A
Gemini object ID
→ broad text/image database search
→ simple comp estimate
Route B
crop mark/logo
→ OCR
→ Lens / eBay / visual search
→ exact or near-exact comps
→ value estimate
eBay’s Browse API includes search by image for product-like discovery: eBay Browse API: search by image.
Route C
multi-photo intake
→ detail crops
→ OCR
→ positive retrieval
→ negative retrieval
→ category rubric
→ comp filtering
→ QC
→ supported value + upside scenario
6. Crop decisive details
Full-object views identify the broad object. Detail crops contain appraisal evidence.
For ceramics, crop:
base
foot rim
mark / backstamp
decoration detail
glaze/body transition
top/interior
damage/restoration
scale reference
Useful tools:
- Florence-2 for captioning/detection/segmentation-style tasks.
- Grounding DINO for text-prompted region detection.
- SAM / SAM 2 for segmentation after detection.
- YOLO variants if you train fixed appraisal-detail categories.
The cropper is not the appraiser. It ensures the model sees the same details a human specialist would inspect.
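Once a detector (Grounding DINO, Florence-2, etc.) has returned a region, the crop itself is simple arithmetic. A small helper, assuming the detector emits normalized `(x0, y0, x1, y1)` boxes, that pads the box for context and clamps it to the image bounds:

```python
def crop_box(norm_box, img_w, img_h, pad=0.05):
    """Convert a normalized (x0, y0, x1, y1) detector box into a padded
    pixel box clamped to the image bounds. pad is a fraction of the image."""
    x0, y0, x1, y1 = norm_box
    x0 = max(0.0, x0 - pad)
    y0 = max(0.0, y0 - pad)
    x1 = min(1.0, x1 + pad)
    y1 = min(1.0, y1 + pad)
    return (round(x0 * img_w), round(y0 * img_h),
            round(x1 * img_w), round(y1 * img_h))
```

The returned pixel box can be passed directly to Pillow's `Image.crop`, and the crop then joins the labeled photo set as, say, `mark_closeup`.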
7. OCR and mark interpretation
A mark can be:
maker mark
workshop mark
retailer mark
import mark
country-of-origin mark
pattern number
decorative mark
apocryphal mark
fake mark
later-added label
Split mark handling:
1. detect mark region
2. transcribe mark
3. classify mark type
4. interpret appraisal implication
OCR candidates:
Example:
```json
{
  "ocr_text": "MADE IN JAPAN",
  "mark_type": "country_of_origin_mark",
  "appraisal_implication": "If original to the object, this conflicts with a 19th-century Meiji-period attribution.",
  "confidence": "medium",
  "needs_human_check": false
}
```
Never jump from:
mark visible
to:
maker verified
without corroboration.
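Steps 3 and 4 of that split can be kept apart in code as well. A toy rule table (the substrings, dates, and implications below are illustrative, not a complete marks reference) that classifies a transcription without ever asserting a verified maker:

```python
# Illustrative rules only; a real system would use a curated marks database.
# The 1891/1921 export-marking dates are commonly cited rules of thumb.
MARK_RULES = [
    ("MADE IN JAPAN", "country_of_origin_mark",
     "If original to the object, this conflicts with a Meiji-period attribution."),
    ("NIPPON", "country_of_origin_mark",
     "Often associated with the 1891-1921 export-marking window; "
     "supports late Meiji at most, not earlier decades."),
]

def interpret_mark(ocr_text):
    """Classify an OCR'd mark and state its appraisal implication.
    A mark alone never verifies a maker, so maker_verified stays False."""
    text = ocr_text.upper()
    for needle, mark_type, implication in MARK_RULES:
        if needle in text:
            return {"ocr_text": ocr_text, "mark_type": mark_type,
                    "appraisal_implication": implication,
                    "maker_verified": False}
    return {"ocr_text": ocr_text, "mark_type": "unclassified",
            "appraisal_implication": "Needs human check.",
            "maker_verified": False}
```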
8. Extract visible evidence, not final value
Bad prompt:
What is this item and what is it worth?
Better prompt:
You are not appraising yet.
Extract visible evidence only.
Separate visible facts, uncertain observations, interpretations, missing views, and risk flags.
Do not state period, maker, authenticity, or value as fact unless directly supported.
Do not generate a final price.
Example structured output:
```json
{
  "visible_facts": [
    {
      "fact": "The object is a ceramic vase form.",
      "source_image": "front_view",
      "confidence": 0.92
    },
    {
      "fact": "Raised decorative outlines are visible around motifs.",
      "source_image": "decoration_macro",
      "confidence": 0.86
    }
  ],
  "uncertain_observations": [
    {
      "observation": "Raised decoration may be tube-lined, molded, or applied.",
      "needed_evidence": "macro photo under angled light"
    }
  ],
  "missing_evidence": [
    "clear base photo",
    "foot rim macro",
    "legible mark close-up",
    "measurements",
    "condition close-ups",
    "provenance"
  ],
  "risk_flags": [
    "period_style_mismatch_possible",
    "reproduction_possible",
    "visual_similarity_not_authentication"
  ]
}
```
Use structured outputs for schema validation: Gemini structured outputs.
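Even when the API enforces a response schema, it is worth validating the JSON locally before anything downstream trusts it. A minimal stdlib validator for the evidence shape above (the required keys mirror the example output):

```python
# Expected top-level keys and their container types, mirroring the example.
EVIDENCE_SCHEMA = {
    "visible_facts": list,
    "uncertain_observations": list,
    "missing_evidence": list,
    "risk_flags": list,
}

def validate_evidence(payload):
    """Return a list of schema violations; an empty list means valid."""
    errors = []
    for key, expected in EVIDENCE_SCHEMA.items():
        if key not in payload:
            errors.append(f"missing key: {key}")
        elif not isinstance(payload[key], expected):
            errors.append(f"{key} should be a {expected.__name__}")
    for fact in payload.get("visible_facts", []):
        conf = fact.get("confidence")
        if not (isinstance(conf, (int, float)) and 0.0 <= conf <= 1.0):
            errors.append("visible_facts entries need confidence in [0, 1]")
    return errors
```

Rejecting a malformed response here is cheap; letting it flow into retrieval or QC is not.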
9. Lens as candidate discovery
Good representation:
```json
{
  "lens_candidates": [
    "Meiji Satsuma vase",
    "Japanese moriage vase",
    "Satsuma-style decorative vase",
    "Japanese export pottery"
  ],
  "status": "candidate_discovery_only",
  "warning": "Visual similarity does not establish period, maker, authenticity, condition, or value."
}
```
Rule:
Lens creates hypotheses.
Rubrics and comps test hypotheses.
QC decides whether claims are allowed, blocked, upside-only, or need expert review.
10. Add multimodal retrieval
Avoid the bottleneck:
image → Gemini description → FAISS text search
If the description misses “tube-lined imitation,” “moriage tourist ware,” “Satsuma-style reproduction,” or a mark clue, the right records may never be searched.
Add:
full image → similar object images
base crop → similar bases / foot rims
mark crop → similar marks
decoration crop → similar techniques
text query → similar descriptions
negative query → similar reproductions
Gemini Embedding 2 supports cross-modal retrieval: Gemini embeddings.
Use multiple indexes:
full_object_image_index
text_description_index
mark_crop_index
base_crop_index
foot_rim_index
decoration_detail_index
damage_detail_index
negative_example_index
auction_catalogue_page_index
If FAISS is enough, keep it. If metadata filtering becomes painful, consider Qdrant.
Example metadata filters:
```json
{
  "object_type": "vase",
  "material": "ceramic",
  "sale_status": "sold",
  "source_type": "auction_result",
  "has_base_photo": true,
  "condition_known": true,
  "period_claim": "Meiji-style",
  "attribution_strength": "seller_claim | auction_house | specialist | authenticated"
}
```
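A minimal stand-in for one of these indexes shows the calling pattern: cosine-similarity search combined with a metadata filter. FAISS itself stores only vectors, so in practice metadata lives in a side table keyed by vector position, which is exactly what this toy class models (the embedding vectors and metadata below are assumed, not real):

```python
import math

def _unit(v):
    """Normalize a vector so dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

class MiniIndex:
    """Toy cosine-similarity index with a metadata side table. FAISS or
    Qdrant would replace this class; the calling pattern stays the same."""

    def __init__(self):
        self.vecs, self.meta = [], []

    def add(self, vec, meta):
        self.vecs.append(_unit(vec))
        self.meta.append(meta)

    def search(self, query, k=3, filters=None):
        q = _unit(query)
        hits = []
        for v, m in zip(self.vecs, self.meta):
            # Metadata filter: e.g. only sold auction results with base photos.
            if filters and any(m.get(f) != want for f, want in filters.items()):
                continue
            score = sum(a * b for a, b in zip(q, v))
            hits.append((score, m))
        hits.sort(key=lambda h: -h[0])
        return hits[:k]

# One index per evidence type: full object, base crop, mark crop, ...
base_crop_index = MiniIndex()
```

If plain FAISS starts to hurt here, it is usually the filtering step that hurts, which is the document's cue to consider Qdrant.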
11. Add negative retrieval
Most systems retrieve only positives:
Meiji Satsuma vase
Japanese vase
antique ceramic vase
Also retrieve lower-value confusables:
modern Satsuma-style reproduction
tube-lined imitation vase
moriage tourist ware
Meiji-style decorative vase
fake/apocryphal mark
Japanese-style ceramic reproduction
Chinese/Japanese-style decorative ceramic
Ask:
What lower-value confusable class could explain the same visual evidence?
Then compare:
Does the base match period examples or reproduction examples?
Does the decoration look hand-applied or mechanically uniform?
Does the mark support maker/period or merely style/import?
Do sold comps match the same material, technique, size, condition, and attribution strength?
QC should choose the best-supported hypothesis, not the cheaper one.
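The positive-versus-negative comparison can be reduced to a margin test over retrieval scores: if the best reproduction match is nearly as close as the best period match, the honest verdict is "ambiguous", not "cheap". A sketch, where the 0.05 margin is an arbitrary illustrative threshold:

```python
def hypothesis_margin(positive_scores, negative_scores, margin=0.05):
    """Compare the best positive-class retrieval score against the best
    lower-value-confusable score; a small gap means the visual evidence
    does not separate the hypotheses."""
    best_pos = max(positive_scores)
    best_neg = max(negative_scores)
    if best_pos - best_neg >= margin:
        verdict = "positive_supported"
    elif best_neg - best_pos >= margin:
        verdict = "confusable_supported"
    else:
        verdict = "ambiguous_needs_more_evidence"
    return {"best_positive": best_pos, "best_negative": best_neg,
            "verdict": verdict}
```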
12. Use category-specific rubrics
There is no single perfect “all antiques” dataset. But cultural-heritage work shows the right pattern: expert-defined visual questions.
Useful references:
Start rubrics for:
Japanese / Chinese ceramics
silver vs silverplate
brass / bronze / resin
paintings vs prints
signed glass
furniture / joinery
designer decor
jewelry
Example rubric:
```json
{
  "category": "Japanese ceramic vase",
  "required_views": [
    "front",
    "back",
    "sides",
    "top/interior",
    "bottom/base",
    "foot rim close-up",
    "mark close-up",
    "decoration macro",
    "damage close-ups",
    "scale photo"
  ],
  "attributes_to_check": [
    "object type",
    "material/body",
    "glaze",
    "decoration technique",
    "raised decoration method",
    "mark type",
    "foot rim",
    "wear pattern",
    "condition",
    "restoration",
    "period vs style",
    "maker attribution strength"
  ],
  "high_value_claims": [
    {
      "claim": "Meiji-period Satsuma vase",
      "required_evidence": [
        "period-consistent base and foot rim",
        "period-consistent decoration technique",
        "credible mark or provenance",
        "no modern country-of-origin/import mark",
        "matching sold comps from reliable sources",
        "condition sufficiently documented"
      ]
    }
  ],
  "common_false_positives": [
    "modern Satsuma-style decorative ware",
    "tube-lined imitation",
    "moriage tourist ware",
    "Chinese/Japanese style confusion",
    "seller-labeled Meiji without evidence",
    "apocryphal or decorative marks"
  ],
  "blocked_without_evidence": [
    "authentic Meiji-period",
    "verified maker",
    "museum-quality",
    "rare signed workshop piece"
  ],
  "safe_language": [
    "Satsuma-style",
    "Japanese-style",
    "unverified age",
    "decorative ceramic vase",
    "possibly later/revival"
  ]
}
```
Store rubrics, known reproductions, expert notes, and prior corrections in RAG/File Search: Gemini File Search.
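A rubric in this shape can drive a mechanical claim check: a high-value claim is allowed only when every piece of its required evidence has been established, and otherwise degrades to upside-only or blocked. A sketch over that rubric structure (the evidence strings are whatever the evidence-extraction stage emits):

```python
def check_claim(rubric, claim, established_evidence):
    """Decide a claim's status from a category rubric and the set of
    evidence items the pipeline has actually established."""
    for entry in rubric.get("high_value_claims", []):
        if entry["claim"] == claim:
            missing = [e for e in entry["required_evidence"]
                       if e not in established_evidence]
            # Evidence gaps downgrade, they never upgrade: the claim
            # survives only as upside, pending the missing items.
            status = "allowed" if not missing else "upside_only"
            return {"claim": claim, "status": status,
                    "missing_evidence": missing}
    if claim in rubric.get("blocked_without_evidence", []):
        return {"claim": claim, "status": "blocked", "missing_evidence": []}
    return {"claim": claim, "status": "needs_human_review",
            "missing_evidence": []}
```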
13. Filter comps aggressively
A visually similar listing is not necessarily a valid comp.
A valid comp should match:
object type
material
technique
size
period/style
maker/attribution strength
condition
sale status
sale venue
source quality
photo completeness
sale date
Examples/inspiration:
Comp decision schema:
```json
{
  "comp_id": "lot_123",
  "include_in_valuation": false,
  "reason": "Rejected: visually similar but active listing only; no sold price; no base photo; period attribution unsupported.",
  "matched_fields": [
    "object_type",
    "broad_style"
  ],
  "missing_or_mismatched_fields": [
    "sale_status",
    "period",
    "technique",
    "condition",
    "attribution_strength",
    "base_photo"
  ]
}
```
Accepted comp schema:
```json
{
  "comp_id": "lot_456",
  "include_in_valuation": true,
  "reason": "Accepted: sold result, similar object type, similar size, similar later Satsuma-style decorative category, unsigned, comparable condition.",
  "adjustments": [
    "condition report incomplete",
    "size within acceptable range",
    "no verified maker, matching current item"
  ]
}
```
Do not average all retrieved prices.
Good report line:
Retrieved 12 visually similar records.
Rejected 8 as invalid comps.
Used 4 closer comps for supported valuation.
For reranking after retrieval, consider BGE-reranker-v2-m3.
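The accept/reject pass itself can be plain field matching rather than anything model-based; reranking only reorders what survives. A sketch, where the match fields and flag names are illustrative assumptions:

```python
# Fields a comp must share with the item before it counts as valid.
MATCH_FIELDS = ["object_type", "material", "technique", "size_class", "condition"]

def filter_comps(item, comps):
    """Split retrieved records into valid comps and rejects, recording why."""
    accepted, rejected = [], []
    for comp in comps:
        issues = [f for f in MATCH_FIELDS if comp.get(f) != item.get(f)]
        if comp.get("sale_status") != "sold":
            issues.append("no_sold_price")   # active listings are not comps
        if not comp.get("has_base_photo"):
            issues.append("no_base_photo")
        bucket = accepted if not issues else rejected
        bucket.append({"comp_id": comp["comp_id"], "issues": issues})
    return {"accepted": accepted, "rejected": rejected}

def supported_range(accepted_prices):
    """Value range from valid comps only; never the mean of everything retrieved."""
    if not accepted_prices:
        return None
    return (min(accepted_prices), max(accepted_prices))
```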
14. QC with claim permissions
Classify claims:
allowed
blocked
upside-only
needs-human-review
Examples:
```json
{
  "claim": "Meiji-period Satsuma vase",
  "status": "upside_only",
  "reason": "Visual style is suggestive, but base/foot/mark/technique evidence is insufficient.",
  "required_next_evidence": [
    "clear base photo",
    "foot rim macro",
    "legible mark close-up",
    "decoration macro under angled light",
    "matching specialist sold comps"
  ]
}
```

```json
{
  "claim": "Japanese Satsuma-style decorative vase",
  "status": "allowed",
  "reason": "Supported by broad visual vocabulary and decoration style, while avoiding unsupported period authentication."
}
```

```json
{
  "claim": "authentic Meiji-period vase worth $500",
  "status": "blocked",
  "reason": "Current evidence does not establish period, maker, or technique strongly enough."
}
```
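The four statuses can be assigned from a handful of signals. This sketch uses illustrative inputs: whether the rubric's required evidence is complete, whether the visual style even matches, and a 5x value-spread threshold for escalation (all three signals and the threshold are assumptions, not fixed rules):

```python
def qc_claim(claim_value, supported_value, evidence_complete, style_match):
    """Assign one of the four claim-permission statuses."""
    if evidence_complete:
        return "allowed"
    if not style_match:
        return "blocked"
    # Style is suggestive but unproven: keep the claim alive as upside only,
    # and escalate when the price spread makes the decision high-stakes.
    if claim_value >= 5 * supported_value:
        return "needs_human_review"
    return "upside_only"
```

The point of the escalation branch is that the system never silently resolves a $60-versus-$500 question on weak evidence.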
15. Supported value vs upside scenario
Do not output one number. Output:
supported value
conservative resale value
upside if authenticated
Example:
Supported value:
$40–80 as a later Satsuma-style decorative ceramic vase.
Conservative resale value:
$40–60 if listed honestly as unverified age/style only.
Upside scenario:
Potentially much higher if authenticated as Meiji-period or a signed workshop piece, but that requires stronger evidence: base, foot rim, mark, technique, condition, and specialist comps.
Current claim permission:
Do not list as authentic Meiji-period.
This avoids:
always choose cheap → miss rare finds
accept expensive visual match → overvalue fakes
Recommended final report format
Likely supported identity:
Japanese Satsuma-style decorative ceramic vase, likely later/revival rather than verified Meiji-period.
Supported value:
$40–80, assuming no major hidden damage.
Conservative resale value:
$40–60 if listed honestly as unverified age/style only.
Upside scenario:
Could be materially higher if authenticated as Meiji-period or a signed workshop piece, but current evidence does not support that claim.
Evidence supporting the supported identity:
- Japanese/Satsuma-style decorative vocabulary.
- Raised decorative outlines visible.
- Current photos do not verify period or maker.
- Technique may indicate later/revival/tube-lined imitation.
Evidence against the high-value Meiji claim:
- Visual similarity alone is not authentication.
- Mark/base/foot evidence is insufficient.
- Decoration technique needs closer verification.
- Valid comps must match size, material, technique, condition, and attribution.
Missing evidence:
- Sharp base photo.
- Foot rim macro.
- Legible mark close-up.
- Decoration macro under angled light.
- Measurements.
- Provenance or prior auction record.
Accepted comps:
- [comp IDs + reason]
Rejected comps:
- [comp IDs + reason]
Safe listing title:
Japanese Satsuma-style decorative ceramic vase, raised decoration, unverified age.
Do not list as:
Authentic Meiji-period Satsuma vase.
Confidence:
Medium-low from photos only.
Escalation:
Human expert review recommended if the user wants to sell, insure, purchase, donate, or consign based on the high-value scenario.
Human expert fallback
AI can triage and produce evidence reports, but human review is needed for high-stakes decisions. Human appraisal services use photo/info intake plus expert review, not one-shot photo guessing: ValueMyStuff: how it works. Appraisal guidance also increasingly treats generative AI as a tool that still requires professional judgment: Appraisal Foundation / USPAP / AO-41.
Use expert review when:
price spread across hypotheses is large
upside exceeds threshold
mark is visible but unclear
period/maker claim drives value
condition/restoration is uncertain
provenance is claimed
user wants insurance/tax/estate/sale support
Store expert corrections as future training/evaluation data.
Suggested stack
Gemini-centered core
Gemini image understanding
Gemini structured outputs
Gemini function calling
Gemini File Search / RAG
Gemini Embedding 2
Google Search grounding
URL Context
OSS / preprocessing
PaddleOCR / PaddleOCR-VL for marks and labels
Florence-2 / Grounding DINO / SAM / YOLO for crops
FAISS or Qdrant for vector retrieval
BM25 + dense embeddings for hybrid text search
BGE rerankers for comp filtering
Qwen3-VL / InternVL as optional OSS VLM baselines
Qwen3-VL-Embedding / Jina CLIP / CLIP / SigLIP as retrieval alternatives
External candidate discovery
Google Lens wrapper
eBay image search
Google Vision Web Detection
auction/sold-price databases
Workflow frameworks
Google ADK if you want Gemini-native agent orchestration
LangGraph if you want deterministic gates and human review
LlamaIndex if your RAG layer grows large
See also: Google ADK.
Build order
Phase 1 — Make Gemini API match Gemini Chat
5–10 labeled photos
base/foot/mark/detail views
appraisal-specific prompt
structured output
hypotheses + evidence + missing evidence
Goal:
Gemini API should reproduce the “tube-lined imitation / likely later decorative” insight when given the same evidence.
Phase 2 — Add crop + OCR
base crop
foot rim crop
mark crop
decoration crop
damage crop
OCR result
mark interpretation
Phase 3 — Add multimodal retrieval
full object
text description
mark crop
base crop
foot rim
decoration detail
negative examples
auction catalogue pages
Phase 4 — Add category rubrics
Japanese ceramics
silver vs silverplate
brass/bronze/resin
paintings vs prints
branded/decorator objects
Phase 5 — Add comp filtering
Reject:
active listings as primary comps
wrong material
wrong size
wrong period
wrong technique
wrong condition
signed vs unsigned mismatch
seller-label-only attribution
no base photo
no condition report
Phase 6 — Add QC and claim permissions
allowed
blocked
upside-only
needs-human-review
Phase 7 — Build a private benchmark
Create at least 100 labeled cases:
generic low-value items
branded/maker-mark items
Japanese/Asian ceramic confusables
OCR/mark-heavy items
damaged/restored items
Track:
unsupported high-value claim rate
reproduction detection
mark-reading accuracy
missing-photo detection
negative retrieval success
comp-filter precision
value-range overlap
confidence calibration
escalation accuracy
safe-title accuracy
Main metric:
Does the system avoid unsupported expensive claims while preserving rare-find upside?
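That main metric has two sides, and both can be computed from the labeled benchmark cases. A sketch, assuming each case records ground truth plus the system's decisions (the field names are illustrative):

```python
def benchmark_metrics(cases):
    """Compute the two halves of the main metric from labeled cases.
    Each case has ground-truth 'is_high_value' plus the system's
    'claimed_high_value' and 'preserved_upside' booleans."""
    n = len(cases)
    # Side 1: expensive claims the evidence did not support.
    unsupported = sum(1 for c in cases
                      if c["claimed_high_value"] and not c["is_high_value"])
    # Side 2: rare finds dismissed outright, with no upside preserved.
    missed = sum(1 for c in cases
                 if c["is_high_value"]
                 and not (c["claimed_high_value"] or c["preserved_upside"]))
    return {
        "unsupported_high_value_claim_rate": unsupported / n,
        "missed_rare_find_rate": missed / n,
    }
```

A system that drives the first rate to zero by always answering cheap will show up immediately in the second rate, which is exactly the trade-off the benchmark exists to expose.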
Curated links
Gemini / Google API
Visual search
Cultural heritage / appraisal-style research
Market data / appraisal practice
OSS / infrastructure
Final answer
Your simple case is object recognition. Your branded case is exact-identity search. Your Japanese vase case is authenticity reasoning.
So the system should be three-tiered:
Generic item:
Gemini API + simple FAISS retrieval.
Branded item:
Gemini + OCR + Lens / image search + exact comps.
Antique item:
Gemini + multi-photo evidence + crops/OCR + Lens candidates + multimodal retrieval + negative examples + expert rubrics + comp filtering + QC gates.
The winning architecture is not:
Gemini vs Google Lens
It is:
Gemini for reasoning
+ Lens for candidate discovery
+ OCR/crops for evidence
+ multimodal retrieval for comps
+ negative examples for fake/reproduction detection
+ expert rubrics for domain checks
+ sold-comp filtering for valuation
+ QC gates for safe claims
+ human escalation for high-risk cases
That architecture can match or surpass Gemini Chat because it replicates what made the chat product useful—multi-photo context, conversational inspection, and tool-like behavior—while adding what Gemini Chat does not have by default: your private price database, negative-example corpus, category rubrics, comp filters, structured outputs, and deterministic appraisal safeguards.