Inquiry regarding the usage and unexpected generation results of UserLM-8B
We are experiencing issues when trying to run UserLM-8B. Whether we use the transformers library with the official example code or vLLM with the prompt template from the paper, the model fails to generate the expected human-like user queries.
1. Issues with the Hugging Face Example Code
We first executed the example code provided in the Hugging Face repository:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
# Load the model and tokenizer
model_path = "microsoft/UserLM-8b"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True, torch_dtype=torch.bfloat16).to("cuda")
# Create a conversation
messages = [{"role": "system", "content": "You are a user who wants to implement a special type of sequence. The sequence sums up the two previous numbers in the sequence and adds 1 to the result. The first two numbers in the sequence are 1 and 1."}]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to("cuda")
end_token = "<|eot_id|>"
end_token_id = tokenizer.encode(end_token, add_special_tokens=False)
end_conv_token = "<|endconversation|>"
end_conv_token_id = tokenizer.encode(end_conv_token, add_special_tokens=False)
outputs = model.generate(
    input_ids=inputs,
    do_sample=True,
    top_p=0.8,
    temperature=1.0,
    max_new_tokens=512,
    eos_token_id=end_token_id,           # stop at <|eot_id|> (end of the user turn)
    pad_token_id=tokenizer.eos_token_id,
    bad_words_ids=[[token_id] for token_id in end_conv_token_id],  # block <|endconversation|>
)
response = tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True)
print(response)
Observation:
Instead of acting as a "user" and asking a question, the model directly produced a block of code (apparently Java/C-style logic) rather than a natural-language prompt:
private static long[] buildSequence(long k) {
    long[] sequence = new long[(int) ((k * (k + 1)) / 2)];
    long previousNumber = 1;
    ...
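For reference, a minimal way to double-check what string the model actually receives is to render the chat template without tokenizing; the add_generation_prompt flag below is an assumption on our part, since we are not sure whether UserLM-8B expects it:

# Reuses the tokenizer and messages from the example above.
prompt_text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,              # return the raw prompt string instead of token ids
    add_generation_prompt=True,  # assumption: unclear whether UserLM-8B needs this
)
print(prompt_text)
print(tokenizer.chat_template)   # the Jinja template shipped with the model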
2. Issues with vLLM and Paper Prompts
We then attempted to deploy the model using vLLM and tested it with the prompt template described in the UserLM paper.
vLLM Command:
CUDA_VISIBLE_DEVICES="0" vllm serve microsoft/UserLM-8b \
--dtype auto \
--served-model-name UserLM-8b \
--tensor-parallel-size 1 \
--port 8021
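Before the test below, a quick way to confirm the server is reachable and the served model name matches (a minimal check, assuming the same port as in the command above):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8021/v1", api_key="token")
print([m.id for m in client.models.list().data])  # should include "UserLM-8b"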
Test Script:
from openai import OpenAI
server = OpenAI(base_url="http://localhost:8021/v1", api_key="token")
response = server.chat.completions.create(
    model="UserLM-8b",
    messages=[
        {"role": "user", "content": """You are a human user interacting with an AI system to
Write a Python function: given an array of integers, sort ones between 1 and 9 inclusive, reverse the array, and replace digits by their name from "One", "Two", "Three", etc.
Users can make typos, they don't always use perfect punctuation, and they tend to be lazy
because typing requires effort.
You also have to split information across turns and not give everything at the start.
However, you should not overdo these things in your outputs; you must realistically act
like a human.
Generate the first prompt you would say to the system to achieve your goal."""},
    ],
)
print(response.choices[0].message.content)
Observation:
The model immediately returns the <|endconversation|> token, or empty content, without generating any user-side dialogue:
# Response output
'\n<|endconversation|>'
Furthermore, we tried changing the message role to system, but the response only echoes the intent back instead of producing a user turn:
ChatCompletion(id='chatcmpl-8fc4f9307c8b29cb', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='\nWrite a Python function: given an array of integers, sort ones between 1 and 9 inclusive, reverse the array, and replace digits by their name from “One”, “Two”, “Three”, etc.', refusal=None, role='assistant', annotations=None, audio=None, function_call=None, tool_calls=[], reasoning=None, reasoning_content=None), stop_reason=None, token_ids=None)], created=1766407524, model='UserLM-8b', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=45, prompt_tokens=143, total_tokens=188, completion_tokens_details=None, prompt_tokens_details=None), prompt_logprobs=None, prompt_token_ids=None, kv_transfer_params=None)
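One workaround we have been considering (not yet verified) is to suppress <|endconversation|> on the server side via logit_bias, analogous to bad_words_ids in the transformers example above. A minimal sketch, where the abbreviated intent prompt and the bias value of -100 are our own assumptions:

from openai import OpenAI
from transformers import AutoTokenizer

# Look up the token id(s) of <|endconversation|> with the model's own tokenizer.
tokenizer = AutoTokenizer.from_pretrained("microsoft/UserLM-8b")
end_conv_ids = tokenizer.encode("<|endconversation|>", add_special_tokens=False)

# Abbreviated stand-in for the full intent prompt used in the test script above.
user_intent = (
    "You are a human user interacting with an AI system to write a Python function. "
    "Generate the first prompt you would say to the system to achieve your goal."
)

server = OpenAI(base_url="http://localhost:8021/v1", api_key="token")
response = server.chat.completions.create(
    model="UserLM-8b",
    messages=[{"role": "system", "content": user_intent}],
    logit_bias={str(tid): -100 for tid in end_conv_ids},  # forbid ending the conversation
    temperature=1.0,
    top_p=0.8,
    max_tokens=512,
)
print(response.choices[0].message.content)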
Questions
- Is there a specific Chat Template or format we should be using that differs from the standard Llama-3/Hugging Face implementation?
- Does the model require a specific sequence of roles (e.g., must start with a specific system message or user prefix)?
We would appreciate any guidance on the correct way to prompt the model to ensure it generates the first turn of a user-AI interaction as intended.
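For completeness, one more thing we are considering in order to rule out a server-side chat-template mismatch: render the prompt locally with the model's own template and send it to the plain completions endpoint. A minimal sketch, where the add_generation_prompt flag and the stop string are assumptions on our part:

from openai import OpenAI
from transformers import AutoTokenizer

# Render the prompt with the tokenizer's chat template so vLLM applies no template of its own.
tokenizer = AutoTokenizer.from_pretrained("microsoft/UserLM-8b")
messages = [{"role": "system", "content": "You are a user who wants to implement a special type of sequence."}]  # abbreviated intent from the first example
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,  # assumption: unclear whether UserLM-8B needs this
)

server = OpenAI(base_url="http://localhost:8021/v1", api_key="token")
response = server.completions.create(
    model="UserLM-8b",
    prompt=prompt,
    max_tokens=512,
    temperature=1.0,
    top_p=0.8,
    stop=["<|eot_id|>"],  # assumption: stop at the end of the generated user turn
)
print(response.choices[0].text)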