---
title: Semantic Deduplication
emoji: 🧹
colorFrom: green
colorTo: green
sdk: gradio
sdk_version: 5.32.1
app_file: app.py
pinned: false
license: mit
short_description: Deduplicate HuggingFace datasets in seconds
hf_oauth: true
hf_oauth_scopes:
  - write-repos
  - manage-repos
---

# Semantic Text Deduplication Using SemHash

This Gradio application performs **semantic deduplication** on Hugging Face datasets using [SemHash](https://github.com/MinishLab/semhash) with [Model2Vec](https://github.com/MinishLab/model2vec) embeddings.

## Features

- **Two deduplication modes**:
  - **Single dataset**: Find and remove duplicates within one dataset
  - **Cross-dataset**: Remove entries from Dataset 2 that are similar to entries in Dataset 1

- **Customizable similarity threshold**: Control how strict the deduplication should be (0.0 = very loose, 1.0 = exact matches only)

- **Detailed results**: View statistics and examples of found duplicates with word-level differences highlighted

- **Hub Integration**: 🆕 **Push deduplicated datasets directly to the Hugging Face Hub** after logging in

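As a rough illustration of what word-level difference highlighting can look like, the sketch below compares two texts word by word with Python's standard `difflib` module. The function name `highlight_word_diff` and the `[-word-]` / `{+word+}` markers are assumptions made for this example; the app's own rendering may differ.

```python
import difflib


def highlight_word_diff(original: str, duplicate: str) -> str:
    """Mark the words that differ between two texts.

    Words only in `original` are wrapped as [-word-], words only in
    `duplicate` as {+word+}; shared words are left unchanged.
    """
    a, b = original.split(), duplicate.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out.extend(a[i1:i2])
        else:
            # "replace", "delete" and "insert" all reduce to:
            # removed words from a, then added words from b.
            out.extend(f"[-{w}-]" for w in a[i1:i2])
            out.extend(f"{{+{w}+}}" for w in b[j1:j2])
    return " ".join(out)


print(highlight_word_diff("set an alarm for 7 am", "set the alarm for 7 am"))
```

Comparing word lists rather than raw characters keeps the output readable for near-duplicate sentences that differ by a word or two.
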
## How to Use

### 1. Choose Deduplication Type
- **Cross-dataset**: Useful for removing training data contamination from test sets
- **Single dataset**: Clean up duplicate entries within a single dataset

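The difference between the two modes can be sketched in plain Python. This is a minimal illustration, not the app's implementation: the app relies on SemHash with Model2Vec embeddings, whereas this sketch swaps in a simple token-overlap (Jaccard) similarity so that it runs without extra dependencies. The function names `jaccard`, `self_deduplicate`, and `cross_deduplicate` are hypothetical.

```python
def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity in [0, 1]; a stand-in for embedding similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def self_deduplicate(texts, threshold=0.9):
    """Single dataset: keep a text only if it is not too similar to one already kept."""
    kept = []
    for text in texts:
        if all(jaccard(text, k) < threshold for k in kept):
            kept.append(text)
    return kept


def cross_deduplicate(reference, candidates, threshold=0.9):
    """Cross-dataset: drop candidates (Dataset 2) too similar to any reference text (Dataset 1)."""
    return [c for c in candidates if all(jaccard(c, r) < threshold for r in reference)]


train = ["wake me up at 7 am", "play some jazz music", "wake me up at 7 am"]
test = ["play some jazz music", "what is the weather today"]

print(self_deduplicate(train))         # duplicates inside the training split are dropped
print(cross_deduplicate(train, test))  # test entries that also occur in train are dropped
```

In the cross-dataset case the first dataset is never modified; it only serves as the reference against which the second dataset is filtered.
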
### 2. Configure Datasets
- Enter the Hugging Face dataset names (e.g., `SetFit/amazon_massive_scenario_en-US`)
- Specify the dataset splits (e.g., `train`, `test`, `validation`)
- Set the text column name (usually `text`, `sentence`, or `content`)

### 3. Set Similarity Threshold
- **0.9** (default): Good balance between precision and recall
- **Higher values** (0.95-0.99): More conservative, only removes very similar texts
- **Lower values** (0.7-0.85): More aggressive, may remove semantically similar but different texts

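To get a feel for how the threshold changes the outcome, the sketch below counts how many texts are removed at several thresholds. It is a toy example under stated assumptions: cosine similarity over word counts stands in for the embedding similarity the app computes, the helper names `cosine` and `count_removed` are hypothetical, and the scores will not match those produced by SemHash.

```python
import math
from collections import Counter


def cosine(a: str, b: str) -> float:
    """Cosine similarity between word-count vectors (stand-in for embeddings)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def count_removed(texts, threshold):
    """Number of texts dropped when anything at or above `threshold` is a duplicate."""
    kept = []
    for text in texts:
        if all(cosine(text, k) < threshold for k in kept):
            kept.append(text)
    return len(texts) - len(kept)


texts = [
    "turn on the kitchen lights",
    "turn on the kitchen lights",        # exact duplicate
    "turn on the kitchen light please",  # near duplicate
    "turn off the bedroom lights",       # related wording, different intent
    "order a large pizza",
]

for threshold in (0.95, 0.7, 0.5):
    print(threshold, count_removed(texts, threshold))
```

On this toy data, a high threshold removes only the exact duplicate, while lower thresholds start removing the near duplicate and eventually the text with a different intent, which is the trade-off described in the list above.
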
### 4. Run Deduplication
Click **"Deduplicate"** to start the process. You'll see:
- Loading progress for datasets
- Deduplication progress
- Results with statistics and example duplicates

### 5. Push to Hub (New!)
After deduplication completes:
1. **Log in** with your Hugging Face account using the login button
2. Enter a **dataset name** for your cleaned dataset
3. Click **"Push to Hub"** to upload the deduplicated dataset

The dataset will be saved as `your-username/dataset-name` and will be publicly available.

## Notes

- The app preserves all original columns from the datasets
- Only the similarity of the text column is used for deduplication decisions
- Deduplicated datasets maintain the same structure as the original
- OAuth login is required only for pushing to the Hub, not for deduplication

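The first three notes can be illustrated with a small sketch in which rows are plain dictionaries. It is not the app's code: the helper `deduplicate_rows` is hypothetical, and an exact match on whitespace- and case-normalized text stands in for the semantic similarity check. The point is only that the text column drives the decision while every kept row is returned whole.

```python
def deduplicate_rows(rows, text_column="text"):
    """Keep the first row for each distinct normalized text; rows are returned unchanged."""
    seen = set()
    kept = []
    for row in rows:
        # Only the text column is inspected; other columns never affect the decision.
        key = " ".join(row[text_column].lower().split())
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


rows = [
    {"id": 1, "text": "Set an alarm", "label": "alarm"},
    {"id": 2, "text": "set an  alarm", "label": "alarm"},
    {"id": 3, "text": "Play jazz", "label": "music"},
]

kept = deduplicate_rows(rows)
print([row["id"] for row in kept])
print(sorted(kept[0].keys()))
```

Because whole rows are kept or dropped, the result has the same columns as the input and can be written back out in the same structure.
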