| --- |
| tags: |
| - language-model |
| - next-token-prediction |
| - neural-network |
| - English |
| - Amharic |
| license: apache-2.0 |
| datasets: |
| - my-dataset |
| metrics: |
| - accuracy |
| - perplexity |
| model_type: nn |
| language: |
| - en |
| - am |
| --- |
| |
| # Neural Network-Based Language Model for Next Token Prediction |
|
|
| ## Overview |
| This project is a midterm assignment that develops a neural network-based language model for next token prediction. The model was trained on a custom dataset covering two languages, English and Amharic. It uses a recurrent (LSTM) network rather than a transformer, demonstrating a non-transformer approach to language modeling. |
|
|
| ## Project Objectives |
| The main objectives of this project were to: |
| - Develop a neural network-based model for next token prediction without using transformers or encoder-decoder architectures. |
| - Experiment with multiple languages to observe model performance. |
| - Implement checkpointing to save model progress and generate text during different training stages. |
| - Present a video demo showcasing the model's performance in generating text in both English and Amharic. |
|
|
| ## Project Details |
|
|
| ### 1. Training Languages |
| The model was trained on English and Amharic datasets. The text was cleaned, tokenized, and mapped to embeddings before training. |
|
|
| ### 2. Tokenizer |
| A custom tokenizer was built using Byte Pair Encoding (BPE). It was trained on five languages (English, Amharic, Sanskrit, Nepali, and Hindi), but the model itself used only English and Amharic for this task. |
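To make the BPE idea concrete, here is a minimal pure-Python sketch of how merge rules are learned: the most frequent adjacent symbol pair is merged repeatedly. This is not the project's tokenizer. The function names (`most_frequent_pair`, `merge_pair`, `learn_bpe_merges`) and the toy corpus are illustrative assumptions.

```python
from collections import Counter


def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe_merges(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a whitespace-split corpus."""
    # Each word starts as a tuple of single characters.
    words = dict(Counter(tuple(word) for word in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges


# Toy corpus (hypothetical). The same loop works on any Unicode text,
# including Amharic script, because it operates on characters.
merges = learn_bpe_merges("low low lower lowest", num_merges=2)
```

After two merges on this toy corpus, the characters `l`, `o`, `w` have been fused into the single symbol `low`. A production tokenizer would also store the vocabulary and apply the learned merges, in order, to new text.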
|
|
| ### 3. Embedding Model |
| A custom embedding model converts each token into a dense vector, which is the input the neural network uses to learn the structure and meaning of the text. |
|
|
| ### 4. Model Architecture |
| The project uses an LSTM (Long Short-Term Memory) neural network to predict the next token in a sequence. LSTMs are well-suited for sequential data and are a popular choice for language modeling due to their ability to capture long-term dependencies. |
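The embedding and LSTM stages described above can be sketched in PyTorch as follows. This is a minimal illustration, not the project's actual model. The class name `NextTokenLSTM`, the layer sizes, and the vocabulary size are assumptions chosen only so the example runs.

```python
import torch
import torch.nn as nn


class NextTokenLSTM(nn.Module):
    """Embedding -> LSTM -> linear projection over the vocabulary."""

    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256, num_layers=1):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=num_layers, batch_first=True)
        self.head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token_ids, state=None):
        # token_ids: (batch, seq_len) integer tensor of token IDs
        embedded = self.embedding(token_ids)         # (batch, seq_len, embed_dim)
        outputs, state = self.lstm(embedded, state)  # (batch, seq_len, hidden_dim)
        logits = self.head(outputs)                  # (batch, seq_len, vocab_size)
        return logits, state


# Hypothetical sizes, for illustration only.
vocab_size = 1000
model = NextTokenLSTM(vocab_size)
batch = torch.randint(0, vocab_size, (4, 16))        # 4 sequences of 16 token IDs
logits, _ = model(batch)

# Next-token objective: the output at position t is scored against
# the token at position t + 1.
criterion = nn.CrossEntropyLoss()
loss = criterion(logits[:, :-1].reshape(-1, vocab_size), batch[:, 1:].reshape(-1))
```

Returning the LSTM state from `forward` makes it possible to generate text one token at a time by feeding the state back in, instead of re-running the whole prefix at every step.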
|
|
| ## Results and Evaluation |
|
|
| ### Training Curve and Loss |
| The model’s training and validation loss over time are documented and included in the repository (`loss_values.csv`). The training curve demonstrates the model's learning progress, with explanations provided for key observations in the loss trends. |
|
|
| ### Checkpoint Implementation |
| Checkpointing was implemented to save model states at different training stages, allowing for partial model evaluations and text generation demos. Checkpoints are included in the repository for reference. |
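A common way to implement this kind of checkpointing in PyTorch is to save the model and optimizer state dictionaries together with the training step. The sketch below assumes that pattern. The helper names, the dictionary keys, and the file name are hypothetical and may not match the checkpoints in the repository.

```python
import os
import tempfile

import torch
import torch.nn as nn


def save_checkpoint(path, model, optimizer, epoch, loss):
    """Save everything needed to resume training or generate text later."""
    torch.save(
        {
            "epoch": epoch,
            "loss": loss,
            "model_state": model.state_dict(),
            "optimizer_state": optimizer.state_dict(),
        },
        path,
    )


def load_checkpoint(path, model, optimizer=None):
    """Restore model (and optionally optimizer) state; return the saved epoch."""
    checkpoint = torch.load(path, map_location="cpu")
    model.load_state_dict(checkpoint["model_state"])
    if optimizer is not None:
        optimizer.load_state_dict(checkpoint["optimizer_state"])
    return checkpoint["epoch"]


# Stand-in network so the example runs on its own.
model = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

path = os.path.join(tempfile.mkdtemp(), "checkpoint_epoch_3.pt")
save_checkpoint(path, model, optimizer, epoch=3, loss=2.5)
resumed_epoch = load_checkpoint(path, model, optimizer)
```

Saving the optimizer state alongside the weights matters when training is resumed. For text generation demos at a given stage, loading only the model state is enough.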
|
|
| ### Perplexity Score |
| The model's perplexity, calculated during training, is recorded in the `perplexity.csv` file. Perplexity is the exponential of the average per-token cross-entropy loss, so lower values indicate that the model assigns higher probability to the correct next token. |
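As a small worked example of how perplexity relates to loss, the sketch below computes it from a list of per-token cross-entropy values measured in nats. The function name and the numbers are illustrative, and nothing is assumed about the column layout of `perplexity.csv`.

```python
import math


def perplexity(token_losses):
    """Perplexity = exp(mean per-token cross-entropy), with losses in nats."""
    if not token_losses:
        raise ValueError("need at least one loss value")
    return math.exp(sum(token_losses) / len(token_losses))


# Sanity check: a model that guesses uniformly over V tokens has
# cross-entropy ln(V) per token, so its perplexity equals V.
vocab_size = 1000  # hypothetical
uniform_ppl = perplexity([math.log(vocab_size)] * 5)
```

A trained model should score well below the vocabulary size, since the uniform-guess value is roughly what an untrained model achieves.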
|
|
| ## Demonstration |
| A video demo, linked below, demonstrates: |
| - Text generation in English from a randomly initialized (untrained) model. |
| - Text generation using the trained model in both English and Amharic, with English translations provided using Google Translate. |
|
|
| **Video Demo Link:** [YouTube Demo](https://youtu.be/1m21NYmLSC4) |
|
|
| ## Instructions for Reproducing the Results |
| 1. Install dependencies (Python, PyTorch, and other required libraries). |
| 2. Open the `.ipynb` notebook and run the cells in order to replicate training and evaluation. |
| 3. Refer to the Hugging Face documentation for downloading the model and tokenizer files. |
|
|
| Note: The data for this project was taken from [saillab/taco-datasets](https://huggingface.co/datasets/saillab/taco-datasets). |