conglt-mylong committed
Commit 96aa63a · verified · 1 Parent(s): 9865dcf

Upload 8 files

README.md ADDED
@@ -0,0 +1,288 @@
+ ---
+ base_model: distilbert/distilbert-base-uncased
+ language: en
+ license: apache-2.0
+ pipeline_tag: text-classification
+ tags:
+ - text-classification
+ - sentiment-analysis
+ - sentiment
+ - synthetic data
+ - multi-class
+ - social-media-analysis
+ - customer-feedback
+ - product-reviews
+ - brand-monitoring
+ widget:
+ - text: I absolutely loved this movie! The acting was superb and the plot was engaging.
+   example_title: Very Positive Review
+ - text: The service at this restaurant was terrible. I'll never go back.
+   example_title: Very Negative Review
+ - text: The product works as expected. Nothing special, but it gets the job done.
+   example_title: Neutral Review
+ - text: I'm somewhat disappointed with my purchase. It's not as good as I hoped.
+   example_title: Negative Review
+ - text: This book changed my life! I couldn't put it down and learned so much.
+   example_title: Very Positive Review
+ inference:
+   parameters:
+     temperature: 1
+ ---
+
+ # 🚀 (distil)BERT-based Sentiment Classification Model: Unleashing the Power of Synthetic Data
+
+ <!-- TRY IT HERE: https://huggingface.co/spaces/vdmbrsv/sentiment-analysis-english-five-classes
+
+ [![Join Our Discord](https://img.shields.io/badge/Discord-Join%20Now-7289DA?style=for-the-badge&logo=discord&logoColor=white)](https://discord.gg/sznxwdqBXj)
+
+ ---- -->
+
+ # NEWS!
+
+ - 2024/12: We uploaded an even better and more robust sentiment model! The error rate is reduced by 10%, and overall accuracy is improved!
+
+ ----
+
+ ## Model Details
+ - **Model Name:** tabularisai/robust-sentiment-analysis
+ - **Base Model:** distilbert/distilbert-base-uncased
+ - **Task:** Text Classification (Sentiment Analysis)
+ - **Language:** English
+ - **Number of Classes:** 5 (*Very Negative, Negative, Neutral, Positive, Very Positive*)
+ - **Usage:**
+   - Social media analysis
+   - Customer feedback analysis
+   - Product reviews classification
+   - Brand monitoring
+   - Market research
+   - Customer service optimization
+   - Competitive intelligence
+
+ ## Model Description
+
+ This model is a fine-tuned version of `distilbert/distilbert-base-uncased` for sentiment analysis. **It was trained only on synthetic data.**
+
+ ### Training Data
+
+ The model was fine-tuned on synthetic data, which allows for targeted training on a diverse range of sentiment expressions without the limitations often found in real-world datasets.
+
+ ### Training Procedure
+
+ - The model was fine-tuned for 5 epochs.
+ - Achieved a train_acc_off_by_one (accuracy allowing for predictions off by one class) of approximately *0.95* on the validation dataset.
+
+ ## Intended Use
+
+ This model is designed for sentiment analysis tasks, particularly useful for:
+ - Social media monitoring
+ - Customer feedback analysis
+ - Product review sentiment classification
+ - Brand sentiment tracking
+
+ ## How to Use
+
+ Here's a quick example of how to use the model:
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+ import torch
+
+ # Load model and tokenizer
+ model_name = "tabularisai/robust-sentiment-analysis"
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModelForSequenceClassification.from_pretrained(model_name)
+
+ # Function to predict sentiment
+ def predict_sentiment(text):
+     inputs = tokenizer(text.lower(), return_tensors="pt", truncation=True, padding=True, max_length=512)
+     with torch.no_grad():
+         outputs = model(**inputs)
+
+     probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)
+     predicted_class = torch.argmax(probabilities, dim=-1).item()
+
+     sentiment_map = {0: "Very Negative", 1: "Negative", 2: "Neutral", 3: "Positive", 4: "Very Positive"}
+     return sentiment_map[predicted_class]
+
+ # Example usage
+ texts = [
+     "I absolutely loved this movie! The acting was superb and the plot was engaging.",
+     "The service at this restaurant was terrible. I'll never go back.",
+     "The product works as expected. Nothing special, but it gets the job done.",
+     "I'm somewhat disappointed with my purchase. It's not as good as I hoped.",
+     "This book changed my life! I couldn't put it down and learned so much."
+ ]
+
+ for text in texts:
+     sentiment = predict_sentiment(text)
+     print(f"Text: {text}")
+     print(f"Sentiment: {sentiment}\n")
+ ```
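+
+ If you only need labels and scores, the same checkpoint should also work through the high-level `pipeline` API; the label names it returns come from the `id2label` mapping in this repo's `config.json`:
+
+ ```python
+ from transformers import pipeline
+
+ # Loads tokenizer and model in one call; labels come from config.json's id2label.
+ classifier = pipeline("text-classification", model="tabularisai/robust-sentiment-analysis")
+
+ result = classifier("I absolutely loved this movie! The acting was superb and the plot was engaging.")
+ print(result)  # expected shape: [{'label': 'Very Positive', 'score': ...}]
+ ```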
+
+ ## Model Performance
+
+ The model demonstrates strong performance across various sentiment categories. Here are some example predictions:
+
+ ```
+ 1. "I absolutely loved this movie! The acting was superb and the plot was engaging."
+    Predicted Sentiment: Very Positive
+
+ 2. "The service at this restaurant was terrible. I'll never go back."
+    Predicted Sentiment: Very Negative
+
+ 3. "The product works as expected. Nothing special, but it gets the job done."
+    Predicted Sentiment: Neutral
+
+ 4. "I'm somewhat disappointed with my purchase. It's not as good as I hoped."
+    Predicted Sentiment: Negative
+
+ 5. "This book changed my life! I couldn't put it down and learned so much."
+    Predicted Sentiment: Very Positive
+ ```
+
+ ## JS example
+
+ ```html
+ <!DOCTYPE html>
+ <html lang="en">
+ <head>
+     <meta charset="UTF-8">
+     <title>Tabularis Sentiment Analysis</title>
+ </head>
+ <body>
+     <div id="output"></div>
+
+     <script type="module">
+         import { AutoTokenizer, AutoModelForSequenceClassification, env } from 'https://cdn.jsdelivr.net/npm/@xenova/[email protected]';
+
+         // Always fetch the model from the Hub rather than looking for local files.
+         env.allowLocalModels = false;
+         env.useCDN = true;
+
+         const MODEL_NAME = 'tabularisai/robust-sentiment-analysis';
+
+         // Numerically stable softmax over an array of logits.
+         function softmax(arr) {
+             const max = Math.max(...arr);
+             const exp = arr.map(x => Math.exp(x - max));
+             const sum = exp.reduce((acc, val) => acc + val, 0);
+             return exp.map(x => x / sum);
+         }
+
+         async function analyzeSentiment() {
+             try {
+                 const tokenizer = await AutoTokenizer.from_pretrained(MODEL_NAME);
+                 const model = await AutoModelForSequenceClassification.from_pretrained(MODEL_NAME);
+
+                 const texts = [
+                     "I absolutely loved this movie! The acting was superb and the plot was engaging.",
+                     "The service at this restaurant was terrible. I'll never go back.",
+                     "The product works as expected. Nothing special, but it gets the job done.",
+                     "I'm somewhat disappointed with my purchase. It's not as good as I hoped.",
+                     "This book changed my life! I couldn't put it down and learned so much."
+                 ];
+
+                 const output = document.getElementById('output');
+
+                 for (const text of texts) {
+                     const inputs = await tokenizer(text);
+                     const result = await model(inputs);
+
+                     // Sequence-classification models return their scores in `logits`.
+                     const logitsArray = Array.from(result.logits.data);
+                     const probabilities = softmax(logitsArray);
+                     const predicted_class = probabilities.indexOf(Math.max(...probabilities));
+
+                     const sentimentMap = {
+                         0: "Very Negative",
+                         1: "Negative",
+                         2: "Neutral",
+                         3: "Positive",
+                         4: "Very Positive"
+                     };
+
+                     const sentiment = sentimentMap[predicted_class];
+                     const score = probabilities[predicted_class];
+
+                     output.innerHTML += `Text: "${text}"<br>`;
+                     output.innerHTML += `Sentiment: ${sentiment}, Score: ${score.toFixed(4)}<br><br>`;
+                 }
+             } catch (error) {
+                 console.error('Error:', error);
+                 document.getElementById('output').innerHTML = 'An error occurred. Please check the console for details.';
+             }
+         }
+
+         analyzeSentiment();
+     </script>
+ </body>
+ </html>
+ ```
+
+ ## Training Procedure
+
+ The model was fine-tuned on synthetic data using the `distilbert/distilbert-base-uncased` architecture. The training process involved:
+
+ - Dataset: synthetic data designed to cover a wide range of sentiment expressions
+ - Training framework: PyTorch Lightning
+ - Number of epochs: 5
+ - Performance metric: achieved train_acc_off_by_one of approximately 0.95 on the validation dataset (a sketch of this metric appears below)
+
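+ The repository does not ship an implementation of train_acc_off_by_one, so the following is a minimal sketch of how such a metric can be computed: on the ordinal 0–4 scale, a prediction counts as correct if it lands within one class of the true label. The function name here is our own.
+
+ ```python
+ import torch
+
+ def acc_off_by_one(logits: torch.Tensor, labels: torch.Tensor) -> float:
+     """Fraction of predictions within one ordinal class of the true label."""
+     preds = logits.argmax(dim=-1)
+     return (preds - labels).abs().le(1).float().mean().item()
+
+ # Predicted class is 3 ("Positive") but the label is 4 ("Very Positive"):
+ # off by exactly one class, so the metric still counts it as correct.
+ logits = torch.tensor([[0.1, 0.2, 0.3, 0.9, 0.4]])
+ labels = torch.tensor([4])
+ print(acc_off_by_one(logits, labels))  # 1.0
+ ```
+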
+ ## Ethical Considerations
+
+ While efforts have been made to create a balanced and fair model through the use of synthetic data, users should be aware that the model may still exhibit biases. It's crucial to thoroughly test the model in your specific use case and monitor its performance over time.
+
+ ## Citation
+
+ ```
+ Will be included
+ ```
+
+ ## Contact
+
+ For questions, or for a private and reliable API built on our model, please contact `[email protected]`.
+
+ tabularis.ai
+
+ <table align="center">
+   <tr>
+     <td align="center">
+       <a href="https://www.linkedin.com/company/tabularis-ai/">
+         <img src="https://cdn.jsdelivr.net/gh/simple-icons/simple-icons/icons/linkedin.svg" alt="LinkedIn" width="30" height="30">
+       </a>
+     </td>
+     <td align="center">
+       <a href="https://x.com/tabularis_ai">
+         <img src="https://cdn.jsdelivr.net/gh/simple-icons/simple-icons/icons/x.svg" alt="X" width="30" height="30">
+       </a>
+     </td>
+     <td align="center">
+       <a href="https://github.com/tabularis-ai">
+         <img src="https://cdn.jsdelivr.net/gh/simple-icons/simple-icons/icons/github.svg" alt="GitHub" width="30" height="30">
+       </a>
+     </td>
+     <td align="center">
+       <a href="https://tabularis.ai">
+         <img src="https://cdn.jsdelivr.net/gh/simple-icons/simple-icons/icons/internetarchive.svg" alt="Website" width="30" height="30">
+       </a>
+     </td>
+   </tr>
+ </table>
config.json ADDED
@@ -0,0 +1,38 @@
+ {
+     "_name_or_path": "results/checkpoint-3000",
+     "activation": "gelu",
+     "architectures": [
+         "DistilBertForSequenceClassification"
+     ],
+     "attention_dropout": 0.1,
+     "dim": 768,
+     "dropout": 0.1,
+     "hidden_dim": 3072,
+     "id2label": {
+         "0": "Very Negative",
+         "1": "Negative",
+         "2": "Neutral",
+         "3": "Positive",
+         "4": "Very Positive"
+     },
+     "initializer_range": 0.02,
+     "label2id": {
+         "Very Negative": 0,
+         "Negative": 1,
+         "Neutral": 2,
+         "Positive": 3,
+         "Very Positive": 4
+     },
+     "max_position_embeddings": 512,
+     "model_type": "distilbert",
+     "n_heads": 12,
+     "n_layers": 6,
+     "pad_token_id": 0,
+     "qa_dropout": 0.1,
+     "seq_classif_dropout": 0.2,
+     "sinusoidal_pos_embds": false,
+     "tie_weights_": true,
+     "torch_dtype": "float32",
+     "transformers_version": "4.46.3",
+     "vocab_size": 30522
+ }
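
Since `id2label` ships in the config above, downstream code does not have to hard-code the sentiment map used in the README examples. Note that `label2id` must mirror `id2label` exactly, which is why "Very Positive" maps to 4, matching the `"4": "Very Positive"` entry. A minimal sketch:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

model_name = "tabularisai/robust-sentiment-analysis"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

inputs = tokenizer("The product works as expected.", return_tensors="pt")
with torch.no_grad():
    predicted_class = model(**inputs).logits.argmax(dim=-1).item()

# id2label is read straight from config.json, so no hand-written mapping is needed.
print(model.config.id2label[predicted_class])
```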
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:4dd7985622e09ea95da3334e4f07fa09a04fc65b7734fb8dd140d178fb0fc2b7
+ size 267841796
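
As a sanity check on the LFS pointer above: 267,841,796 bytes ÷ 4 bytes per float32 parameter ≈ 67.0M parameters, consistent with DistilBERT-base (~66M parameters) plus the pre-classifier layer and the 5-way classification head.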
quantize_config.json ADDED
@@ -0,0 +1,32 @@
+ {
+     "per_channel": true,
+     "reduce_range": true,
+     "per_model_config": {
+         "model": {
+             "op_types": [
+                 "Reshape",
+                 "Sqrt",
+                 "Tanh",
+                 "Mul",
+                 "Concat",
+                 "Add",
+                 "ReduceMean",
+                 "Cast",
+                 "Sub",
+                 "Erf",
+                 "MatMul",
+                 "Slice",
+                 "Unsqueeze",
+                 "Transpose",
+                 "Constant",
+                 "Gemm",
+                 "Shape",
+                 "Softmax",
+                 "Gather",
+                 "Div",
+                 "Pow"
+             ],
+             "weight_type": "QInt8"
+         }
+     }
+ }
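
These settings describe ONNX dynamic quantization with per-channel, reduced-range INT8 weights. The commit does not include an ONNX export or the script that produced this file, but as a hedged illustration, options like these map onto onnxruntime's quantization API roughly as follows (file paths are hypothetical):

```python
from onnxruntime.quantization import QuantType, quantize_dynamic

# Hypothetical paths: assumes the model has already been exported to ONNX.
quantize_dynamic(
    model_input="model.onnx",
    model_output="model_quantized.onnx",
    per_channel=True,             # matches "per_channel": true
    reduce_range=True,            # matches "reduce_range": true
    weight_type=QuantType.QInt8,  # matches "weight_type": "QInt8"
)
```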
special_tokens_map.json ADDED
@@ -0,0 +1,7 @@
+ {
+     "cls_token": "[CLS]",
+     "mask_token": "[MASK]",
+     "pad_token": "[PAD]",
+     "sep_token": "[SEP]",
+     "unk_token": "[UNK]"
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,55 @@
+ {
+     "added_tokens_decoder": {
+         "0": {
+             "content": "[PAD]",
+             "lstrip": false,
+             "normalized": false,
+             "rstrip": false,
+             "single_word": false,
+             "special": true
+         },
+         "100": {
+             "content": "[UNK]",
+             "lstrip": false,
+             "normalized": false,
+             "rstrip": false,
+             "single_word": false,
+             "special": true
+         },
+         "101": {
+             "content": "[CLS]",
+             "lstrip": false,
+             "normalized": false,
+             "rstrip": false,
+             "single_word": false,
+             "special": true
+         },
+         "102": {
+             "content": "[SEP]",
+             "lstrip": false,
+             "normalized": false,
+             "rstrip": false,
+             "single_word": false,
+             "special": true
+         },
+         "103": {
+             "content": "[MASK]",
+             "lstrip": false,
+             "normalized": false,
+             "rstrip": false,
+             "single_word": false,
+             "special": true
+         }
+     },
+     "clean_up_tokenization_spaces": false,
+     "cls_token": "[CLS]",
+     "do_lower_case": true,
+     "mask_token": "[MASK]",
+     "model_max_length": 512,
+     "pad_token": "[PAD]",
+     "sep_token": "[SEP]",
+     "strip_accents": null,
+     "tokenize_chinese_chars": true,
+     "tokenizer_class": "DistilBertTokenizer",
+     "unk_token": "[UNK]"
+ }
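
Because `do_lower_case` is true, the tokenizer lowercases text itself, which is why the `.lower()` call in the README's Python example is harmless but redundant; likewise, `model_max_length: 512` matches the truncation limit used there. A quick illustrative check (not part of the repo):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("tabularisai/robust-sentiment-analysis")

# do_lower_case=true: mixed-case and lowercase inputs yield identical token ids.
assert tokenizer("GREAT product!")["input_ids"] == tokenizer("great product!")["input_ids"]

# model_max_length matches the 512-token limit used in the README example.
print(tokenizer.model_max_length)  # 512
```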
vocab.txt ADDED
The diff for this file is too large to render. See raw diff