Struggling with reproducing paper results

#57
by Lukaas - opened

Hi,

I have been struggling with this for some time now, so I wanted to reach out.

I am trying to generate a library of proteins with properties similar to those discussed in the paper, to make sure I am calling the model correctly.
However, I keep getting much longer proteins than the ones generated in the paper.
My 1000 generations average around 300 AA once I discard those that finish without an EOS token.
The 10,000 ProtGPT2 generations from the paper, by contrast, average around 145 AA, which is close to the 135 AA average of the pretraining data.
Other properties don't match either, but length is the easiest to measure.
This happens with my fine-tuned models but also with the unmodified model downloaded from Hugging Face, which makes me question my method of generation.

Could you provide more information on how you generated the library of proteins that the paper was based on?

Code snippet to reproduce my issue:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("nferruz/ProtGPT2")
tokenizer = AutoTokenizer.from_pretrained("nferruz/ProtGPT2")

input_ids = torch.tensor([tokenizer("<|endoftext|>")["input_ids"]])
outputs = model.generate(
    input_ids,
    max_new_tokens=250,
    temperature=1,
    top_k=950,
    top_p=1,
    repetition_penalty=1.2,
    do_sample=True,
    num_return_sequences=100,
)
sequences = tokenizer.batch_decode(outputs, skip_special_tokens=False)

Is it possible that max_new_tokens was set to 100? That does give me the same rate of dropped truncated sequences and the same average resulting length.
However, I cannot reconcile this with the statement in the publication that a context window of 250 tokens was used.

Hi Lukaas,

Thanks for reaching out! Let me see if we can fix this. Could you compute the perplexity of the generated sequences? Do you observe that the ones with lower perplexity are shorter? If so, I'd take the top 25-35% and proceed with those. It's been a while, so I don't remember every detail of the manuscript, but if that doesn't get you something closer to what you'd expect, we'll need to dig a lot deeper to see what is going on. Let me know how this goes anytime!
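For reference, perplexity can be computed from the model's mean per-token negative log-likelihood. A minimal sketch (the helper name `perplexity_from_nll` is my own, not from the ProtGPT2 repo; the Hugging Face scoring step is shown in comments because it requires downloading the model):

import math


def perplexity_from_nll(mean_nll: float) -> float:
    """Perplexity is exp of the mean per-token negative log-likelihood."""
    return math.exp(mean_nll)


# With a Hugging Face causal LM, the mean NLL of a sequence is the loss
# returned when labels == input_ids, e.g. (not run here):
#
#   input_ids = torch.tensor([tokenizer(seq)["input_ids"]])
#   with torch.no_grad():
#       mean_nll = model(input_ids, labels=input_ids).loss.item()
#   ppl = perplexity_from_nll(mean_nll)

# Sanity check: a uniform model over a 20-letter amino-acid alphabet has
# mean_nll = ln(20), i.e. perplexity exactly 20.
print(round(perplexity_from_nll(math.log(20.0)), 3))  # → 20.0

You could then rank the generated sequences by this value and keep the lowest-perplexity fraction, as suggested above.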

Hi nferruz. Thanks for responding!

When setting max_new_tokens to 100 instead of 250, I do get the length distribution and the rate of dropped sequences (those without an EOS token after 100 tokens) described in the paper. But I was confused by the statement in the paper that the proteins were generated with a context window of 250 tokens. Can you shed some light on this?

Three sequence datasets were produced to compare their properties. The ProtGPT2 dataset was generated by sampling 1000 batches of 100 sequences, each with the selected inference parameters and a window context of 250 tokens. This step produced 100,000 sequences. We filtered from this set those sequences whose length had been cut due to the window context, giving a total of 29,876 sequences. From this set, we randomly selected 10,000 sequences. Their average length is 149.2 ± 50.9 amino acids.
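The filter-and-subsample step described above can be sketched as follows (toy strings stand in for real generations; the helper names `is_complete` and `filter_and_subsample` are my own illustration, not the authors' code):

import random

EOS = "<|endoftext|>"


def is_complete(text: str) -> bool:
    """Keep a generation only if it produced an EOS token, i.e. it was
    not cut short by the context window."""
    return EOS in text


def filter_and_subsample(generations, k, seed=0):
    """Drop truncated sequences, then randomly pick k of the survivors."""
    complete = [g for g in generations if is_complete(g)]
    rng = random.Random(seed)
    return rng.sample(complete, min(k, len(complete)))


# Toy stand-ins for generated sequences (3 complete, 1 truncated):
gens = ["MKT" + EOS, "MAAAA", "MGSS" + EOS, "MVVV" + EOS]
picked = filter_and_subsample(gens, k=2)
print(len(picked))  # → 2
print(all(is_complete(g) for g in picked))  # → True

In the paper's numbers, this corresponds to going from 100,000 raw generations to 29,876 complete ones, then subsampling 10,000.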
