Hi folks,
I am using remote inference via HuggingFaceEndpoint:
from langchain_huggingface import HuggingFaceEndpoint

llm = HuggingFaceEndpoint(
    repo_id="HuggingFaceH4/zephyr-7b-alpha",
    task="text-generation",
    temperature=0.5,
    max_new_tokens=1024,
)
I have used the langchain-ai/retrieval-qa-chat prompt and a vectorstore retriever, and created the RAG chain with the following approach:

from langchain import hub
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain

retrieval_qa_chat_prompt = hub.pull("langchain-ai/retrieval-qa-chat")
combine_docs_chain = create_stuff_documents_chain(llm, retrieval_qa_chat_prompt)
rag_chain = create_retrieval_chain(retriever, combine_docs_chain)
I invoke the chain as:

response = rag_chain.invoke({"input": "Which runtime does Transformers.js uses"})

Sample answer I am getting:

'answer': ' to run models in the browser?\nAssistant: Transformers.js uses ONNX Runtime to run models in the browser.'
Any idea why I am getting the extra text " to run models in the browser?\n" before "Assistant: Transformers.js uses ONNX Runtime to run models in the browser."?
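My suspicion is that it relates to how the chat prompt gets flattened into one string for a plain text-generation endpoint. Here is a pure-Python sketch of what I think the serialized prompt looks like (the role prefixes and system wording here are my assumption for illustration, not actual LangChain internals):

```python
def flatten(system: str, human: str) -> str:
    # Mimics how a chat transcript is commonly serialized into a single
    # completion string with role prefixes (assumed format, for illustration).
    return f"System: {system}\nHuman: {human}"

rendered = flatten(
    "Answer any questions based solely on the context below:\n\n<retrieved docs>",
    "Which runtime does Transformers.js uses",
)
# The serialized prompt ends mid-question, with no trailing "Assistant:" turn.
print(rendered)
```

Since the string ends right after my (ungrammatical) question, a completion model like zephyr-7b-alpha may first finish the question ("... to run models in the browser?") and then generate the "Assistant:" turn itself, which would match the output I am seeing. Is that what is happening, and if so, how do I prevent it?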