---
license: apache-2.0
language:
- code
library_name: peft
tags:
- code-search
- text-embeddings
- decoder-only
- supervised-contrastive-learning
- codegemma
- llm2vec
---
## Are Decoder-Only Large Language Models the Silver Bullet for Code Search?

This model is an official artifact from our research paper: **"[Are Decoder-Only Large Language Models the Silver Bullet for Code Search?](https://arxiv.org/abs/2410.22240)"**.

In this work, we conduct a large-scale systematic evaluation of decoder-only Large Language Models for the task of code search and present a set of effective fine-tuning and optimization strategies.

For complete details on all our experiments, to reproduce the full training/evaluation pipeline, or to use other models from the paper, please visit our official GitHub repository:

**[GitHub: Georgepitt/DecoderLLMs-CodeSearch](https://github.com/Georgepitt/DecoderLLMs-CodeSearch)**

---
# Model Card: DCS-CodeGemma-7B-It-SupCon-CSN-ruby

## Model Description

This is a PEFT adapter for the **`google/codegemma-7b-it`** model, fine-tuned for **Code Search** as part of the research described above.

The adapter was trained with the **Supervised Contrastive Learning** method from the [llm2vec](https://github.com/McGill-NLP/llm2vec) framework, so that the model produces vector embeddings of code snippets suitable for retrieval.
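
To give an intuition for what contrastive training optimizes, here is a minimal sketch of a generic in-batch contrastive (InfoNCE-style) objective over query/code pairs. It is an illustration only: the function name, the `temperature` value, and the toy similarity matrices are assumptions, and the exact loss used by `llm2vec` and in the paper (for example its handling of hard negatives) may differ.

```python
import math


def in_batch_contrastive_loss(sim, temperature=0.05):
    """Average cross-entropy over a square similarity matrix.

    sim[i][j] is the similarity between query i and code snippet j.
    The matching (positive) pair for query i is assumed to sit on the
    diagonal; every other snippet in the batch acts as a negative.
    """
    total = 0.0
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        # Numerically stable log-sum-exp over all candidates for query i.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Negative log-probability assigned to the positive pair.
        total += log_denominator - logits[i]
    return total / len(sim)


# Toy batches: positives score highest in `aligned`, lowest in `misaligned`.
aligned = [[1.0, 0.0], [0.0, 1.0]]
misaligned = [[0.0, 1.0], [1.0, 0.0]]
print(in_batch_contrastive_loss(aligned) < in_batch_contrastive_loss(misaligned))
```

The loss falls as each query becomes more similar to its own snippet than to the other snippets in the batch, which is the property a retrieval model needs.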
## Model Performance & Reproducibility

The table below summarizes this model's configuration and links to the scripts used to reproduce the evaluation reported in our paper.

| Attribute | Details |
| :--- | :--- |
| **Base Model** | `google/codegemma-7b-it` |
| **Fine-tuning Method** | Supervised Contrastive Learning via `llm2vec` |
| **Evaluation Script** | [CSN_Test_Finetuning_Decoder_Model.py](https://github.com/Georgepitt/DecoderLLMs-CodeSearch/blob/main/Fine-tuning/CSN_Test_Finetuning_Decoder_Model.py),<br>[CoSQA_Plus_Test_Finetuning_Decoder_Model.py](https://github.com/ChenyxEugene/DecoderLLMs-CodeSearch/blob/main/Fine-tuning/CoSQA_Plus_Test_Finetuning_Decoder_Model.py) |
| **Prerequisite Model** | This model must be loaded on top of an MNTP pre-trained model. |

---
## How to Use (with `llm2vec`)

For best results, we strongly recommend using the official `llm2vec` wrapper to load and use this model.

**1. Install Dependencies**

```bash
pip install llm2vec transformers torch peft accelerate
```

**2. Example Usage**

> **Important**: The `llm2vec` supervised contrastive (SupCon) models are fine-tuned on top of **MNTP (Masked Next Token Prediction)** models. The MNTP adapter must therefore be merged into the base model before the SupCon adapter is loaded.

```python
import torch
from transformers import AutoTokenizer, AutoModel, AutoConfig
from peft import PeftModel
from llm2vec import LLM2Vec

# --- 1. Define Model IDs ---
base_model_id = "google/codegemma-7b-it"
mntp_model_id = "SYSUSELab/DCS-CodeGemma-7B-It-MNTP"
supcon_model_id = "SYSUSELab/DCS-CodeGemma-7B-It-SupCon-CSN-ruby"

# --- 2. Load Base Model and MNTP Adapter ---
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
config = AutoConfig.from_pretrained(base_model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    base_model_id,
    trust_remote_code=True,
    config=config,
    torch_dtype=torch.bfloat16,
    device_map="cuda" if torch.cuda.is_available() else "cpu",
)
# Merge the MNTP adapter weights into the base model.
model = PeftModel.from_pretrained(model, mntp_model_id)
model = model.merge_and_unload()

# --- 3. Load the Supervised (this model) Adapter on top of the MNTP-merged model ---
model = PeftModel.from_pretrained(model, supcon_model_id)

# --- 4. Use the LLM2Vec Wrapper for Encoding ---
l2v = LLM2Vec(model, tokenizer, pooling_mode="mean", max_length=512)

queries = ["how to read a file in Python?"]
code_snippets = ["with open('file.txt', 'r') as f:\n    content = f.read()"]

query_embeddings = l2v.encode(queries)
code_embeddings = l2v.encode(code_snippets)
print("Query Embedding Shape:", query_embeddings.shape)

# This usage example is adapted from the official llm2vec repository. Credits to the original authors.
```
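
Once query and code embeddings are available, code search reduces to ranking snippets by their similarity to the query. The following is a minimal sketch of that step, assuming cosine similarity as the scoring function and embeddings converted to plain Python lists of floats (for example with `.tolist()`). The helper names and the toy vectors are hypothetical and not part of `llm2vec`.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # Treat an all-zero vector as matching nothing.
    return dot / (norm_a * norm_b)


def rank_snippets(query_vec, code_vecs):
    """Return (snippet_index, score) pairs sorted from best to worst match."""
    scores = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(code_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy 3-dimensional vectors standing in for real model embeddings.
toy_query = [1.0, 0.0, 1.0]
toy_code = [
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 0.9],
    [0.5, 0.5, 0.5],
]

ranking = rank_snippets(toy_query, toy_code)
for index, score in ranking:
    print(f"snippet {index}: {score:.4f}")
```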

---

## Citation

If you use our model or work in your research, please cite our paper. As our method is built upon `llm2vec`, please also cite their foundational work.

**Our Paper:**

* **Paper Link:** [Are Decoder-Only Large Language Models the Silver Bullet for Code Search?](https://arxiv.org/abs/2410.22240)
* **GitHub:** [https://github.com/Georgepitt/DecoderLLMs-CodeSearch](https://github.com/Georgepitt/DecoderLLMs-CodeSearch)
* **BibTeX:**

```bibtex
@article{chen2024decoder,
  title={Are Decoder-Only Large Language Models the Silver Bullet for Code Search?},
  author={Chen, Yuxuan and Liu, Mingwei and Ou, Guangsheng and Li, Anji and Dai, Dekun and Wang, Yanlin and Zheng, Zibin},
  journal={arXiv preprint arXiv:2410.22240},
  year={2024}
}
```
**llm2vec (Foundational Work):**

* **Paper Link:** [LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders](https://arxiv.org/abs/2404.05961)
* **GitHub:** [https://github.com/McGill-NLP/llm2vec](https://github.com/McGill-NLP/llm2vec)
* **BibTeX:**

```bibtex
@article{behnamghader2024llm2vec,
  title={LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders},
  author={BehnamGhader, Parishad and Adlakha, Vaibhav and Mosbach, Marius and Bahdanau, Dzmitry and Chapados, Nicolas and Reddy, Siva},
  journal={arXiv preprint arXiv:2404.05961},
  year={2024}
}
```