How to use Lauther/d4-embeddings-v3.0 with sentence-transformers:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("Lauther/d4-embeddings-v3.0")
sentences = [
"flow computer tags",
"What is an Uncertainty Curve Point?\nAn Uncertainty Curve Point represents a data point used to construct the uncertainty curve of a measurement system. These curves help analyze how measurement uncertainty behaves under different flow rate conditions, ensuring accuracy and reliability in uncertainty assessments.\n\nKey Aspects of an Uncertainty Curve Point:\n- Uncertainty File ID: Links the point to the specific uncertainty dataset, ensuring traceability.\nEquipment Tag ID: Identifies the equipment associated with the uncertainty measurement, crucial for system validation.\n- Uncertainty Points: Represent a list uncertainty values recorded at specific conditions, forming part of the overall uncertainty curve. Do not confuse this uncertainty points with the calculated uncertainty. \n- Flow Rate Points: Corresponding flow rate values at which the uncertainty was measured, essential for evaluating performance under varying operational conditions.\nThese points are fundamental for generating uncertainty curves, which are used in calibration, validation, and compliance assessments to ensure measurement reliability in industrial processes.\"\n\n**IMPORTANT**: Do not confuse the two types of **Points**:\n - **Uncertainty Curve Point**: Specific to a measurement system uncertainty or uncertainty simulation or uncertainty curve.\n - **Calibration Point**: Specific to the calibration.\n - **Uncertainty values**: Do not confuse these uncertainty points with the single calculated uncertainty.",
"What is a flow computer?\nA flow computer is a device used in measurement engineering. It collects analog and digital data from flow meters and other sensors.\n\nKey features of a flow computer:\n- It has a unique name, firmware version, and manufacturer information.\n- It is designed to record and process data such as temperature, pressure, and fluid volume (for gases or oils).",
"What is a Measured Magnitude Value?\nA Measured Magnitude Value represents a **DAILY** recorded physical measurement of a variable within a monitored fluid. These values are essential for tracking system performance, analyzing trends, and ensuring accurate monitoring of fluid properties.\n\nKey Aspects of a Measured Magnitude Value:\n- Measurement Date: The timestamp indicating when the measurement was recorded.\n- Measured Value: The daily numeric result of the recorded physical magnitude.\n- Measurement System Association: Links the measured value to a specific measurement system responsible for capturing the data.\n- Variable Association: Identifies the specific variable (e.g., temperature, pressure, flow rate) corresponding to the recorded value.\nMeasured magnitude values are crucial for real-time monitoring, historical analysis, and calibration processes within measurement systems.\n\nDatabase advices:\nThis values also are in **historics of a flow computer report**. Although, to go directly instead querying the flow computer report you can do it by going to the table of variables data in the database."
]
embeddings = model.encode(sentences)
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [4, 4]

This is a sentence-transformers model finetuned from intfloat/multilingual-e5-large-instruct on the d4-embeddings-triplet_loss dataset. It maps sentences & paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
SentenceTransformer(
(0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: XLMRobertaModel
(1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
(2): Normalize()
)
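The architecture printout above shows three stages: a Transformer that produces one vector per token, mean pooling over tokens (`pooling_mode_mean_tokens: True`), and L2 normalization. The following is a minimal pure-Python sketch of the last two stages, assuming the usual convention that padding positions are excluded through an attention mask. The helper names and the toy 4-dimensional vectors are illustrative only and are not part of the library; the real model works with 1024 dimensions.

```python
import math


def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, skipping padding positions (mask == 0)."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                summed[i] += value
    count = max(count, 1)  # guard against an all-padding input
    return [value / count for value in summed]


def l2_normalize(vector):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm > 0 else list(vector)


# Toy input: three token positions, four dimensions; the last token is padding.
tokens = [
    [1.0, 2.0, 0.0, 0.0],
    [3.0, 0.0, 0.0, 4.0],
    [9.0, 9.0, 9.0, 9.0],
]
mask = [1, 1, 0]

pooled = mean_pool(tokens, mask)            # [2.0, 1.0, 0.0, 2.0]
sentence_embedding = l2_normalize(pooled)   # unit-length vector
print(sentence_embedding)
```

Because the final stage normalizes every embedding to unit length, the dot product of two embeddings equals their cosine similarity.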
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("Lauther/d4-embeddings-v3.0")
# Run inference
sentences = [
'PTE BRAGANÇA PAULISTA C',
'What is an Uncertainty Composition?\nAn Uncertainty Composition represents a specific factor that contributes to the overall uncertainty of a measurement system. These components are essential for evaluating the accuracy and reliability of measurements by identifying and quantifying the sources of uncertainty.\n\nKey Aspects of an Uncertainty Component:\n- Component Name: Defines the uncertainty factor (e.g., diameter, density, variance, covariance) influencing the measurement system.\n- Value of Composition: Quantifies the component’s contribution to the total uncertainty, helping to analyze which factors have the greatest impact.\n- Uncertainty File ID: Links the component to a specific uncertainty dataset for traceability and validation.\nUnderstanding these components is critical for uncertainty analysis, ensuring compliance with industry standards and improving measurement precision.',
'What is a Measurement Unit?\nA Measurement Unit defines the standard for quantifying a physical magnitude (e.g., temperature, pressure, volume). It establishes a consistent reference for interpreting values recorded in a measurement system.\n\nEach measurement unit is associated with a specific magnitude, ensuring that values are correctly interpreted within their context. For example:\n\n- °C (Celsius) → Used for temperature\n- psi (pounds per square inch) → Used for pressure\n- m³ (cubic meters) → Used for volume\nMeasurement units are essential for maintaining consistency across recorded data, ensuring comparability, and enabling accurate calculations within measurement systems.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 1024]
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
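Once embeddings are available, semantic search amounts to ranking candidate texts by their cosine similarity to a query. The sketch below illustrates that ranking step in pure Python. The three-dimensional vectors are hypothetical stand-ins for the output of `model.encode(...)`, and `rank_documents` is an illustrative helper, not a library function.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine_similarity(query_vec, d)) for i, d in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical stand-ins for real 1024-dimensional embeddings.
query_vec = [1.0, 0.0, 0.0]
doc_vecs = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [0.9, 0.1, 0.0],  # almost parallel to the query
    [0.5, 0.5, 0.0],  # in between
]

ranking = rank_documents(query_vec, doc_vecs)
for index, score in ranking:
    print(index, round(score, 4))
```

With real embeddings, `model.similarity(query_embedding, document_embeddings)` from the snippet above returns the same kind of scores in a single call.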
Columns: anchor, positive, and negative

| | anchor | positive | negative |
|---|---|---|---|
| type | string | string | string |

| anchor | positive | negative |
|---|---|---|
| Orifice Diameter (mm) | What is uncertainty? | What is an Equipment Class? |
| prueba_gonzalo | What is a measurement system? | What is a Measured Magnitude Value? |
| Vazao Instantanea | What is uncertainty? | What is a report index or historic index? |
TripletLoss with these parameters:
{
"distance_metric": "TripletDistanceMetric.COSINE",
"triplet_margin": 0.3
}
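To make these two parameters concrete, here is a minimal sketch of a triplet loss for a single (anchor, positive, negative) example. It assumes the common definition of cosine distance as one minus cosine similarity and the hinge form `max(d(a, p) - d(a, n) + margin, 0)`. The two-dimensional vectors are toy values, and the function names are illustrative rather than the library's API.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity, so 0 means same direction and 1 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def triplet_loss(anchor, positive, negative, margin=0.3):
    """Zero once the negative is at least `margin` farther away than the positive."""
    d_pos = cosine_distance(anchor, positive)
    d_neg = cosine_distance(anchor, negative)
    return max(d_pos - d_neg + margin, 0.0)


anchor = [1.0, 0.0]

# Well-separated triplet: positive matches the anchor, negative is orthogonal.
easy = triplet_loss(anchor, positive=[1.0, 0.0], negative=[0.0, 1.0])

# Violating triplet: the roles are swapped, so the loss is large.
hard = triplet_loss(anchor, positive=[0.0, 1.0], negative=[1.0, 0.0])

print(easy, hard)
```

A triplet contributes no gradient once the negative is at least 0.3 farther from the anchor (in cosine distance) than the positive, so training effort concentrates on triplets that still violate the margin.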
Columns: anchor, positive, and negative

| | anchor | positive | negative |
|---|---|---|---|
| type | string | string | string |

| anchor | positive | negative |
|---|---|---|
| FQI-4301.4522B | What is a measurement system? | What is uncertainty? |
| PTE GUARATINGUETA B | What are historical report values? | What is Equipment? |
| PTE BRAGANÇA PAULISTA B | What is an Uncertainty Composition? | What is uncertainty? |
TripletLoss with these parameters:
{
"distance_metric": "TripletDistanceMetric.COSINE",
"triplet_margin": 0.3
}
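The training hyperparameters listed in this card include `learning_rate: 5e-05`, `lr_scheduler_type: cosine`, and `warmup_ratio: 0.1`. The sketch below shows what such a schedule typically looks like: a linear ramp over the first 10% of steps followed by a half-cosine decay to zero. It is an approximation of the usual warmup-plus-cosine schedule, not the trainer's own code, and `total_steps = 1000` is a hypothetical value because the card does not state the number of optimizer steps.

```python
import math


def cosine_lr_with_warmup(step, total_steps, base_lr=5e-5, warmup_ratio=0.1):
    """Learning rate at `step`: linear warmup, then half-cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total_steps = 1000  # hypothetical; not stated in the card
for step in (0, 50, 100, 550, 1000):
    print(step, cosine_lr_with_warmup(step, total_steps))
```

Under these assumptions the rate peaks at 5e-05 at step 100, falls to half that value midway through the decay phase, and reaches zero at the final step.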
Non-default hyperparameters:
- eval_strategy: steps
- per_device_train_batch_size: 80
- per_device_eval_batch_size: 80
- weight_decay: 0.01
- max_grad_norm: 0.5
- num_train_epochs: 15
- lr_scheduler_type: cosine
- warmup_ratio: 0.1
- fp16: True
- dataloader_num_workers: 4

All hyperparameters:
- overwrite_output_dir: False
- do_predict: False
- eval_strategy: steps
- prediction_loss_only: True
- per_device_train_batch_size: 80
- per_device_eval_batch_size: 80
- per_gpu_train_batch_size: None
- per_gpu_eval_batch_size: None
- gradient_accumulation_steps: 1
- eval_accumulation_steps: None
- torch_empty_cache_steps: None
- learning_rate: 5e-05
- weight_decay: 0.01
- adam_beta1: 0.9
- adam_beta2: 0.999
- adam_epsilon: 1e-08
- max_grad_norm: 0.5
- num_train_epochs: 15
- max_steps: -1
- lr_scheduler_type: cosine
- lr_scheduler_kwargs: {}
- warmup_ratio: 0.1
- warmup_steps: 0
- log_level: passive
- log_level_replica: warning
- log_on_each_node: True
- logging_nan_inf_filter: True
- save_safetensors: True
- save_on_each_node: False
- save_only_model: False
- restore_callback_states_from_checkpoint: False
- no_cuda: False
- use_cpu: False
- use_mps_device: False
- seed: 42
- data_seed: None
- jit_mode_eval: False
- use_ipex: False
- bf16: False
- fp16: True
- fp16_opt_level: O1
- half_precision_backend: auto
- bf16_full_eval: False
- fp16_full_eval: False
- tf32: None
- local_rank: 0
- ddp_backend: None
- tpu_num_cores: None
- tpu_metrics_debug: False
- debug: []
- dataloader_drop_last: False
- dataloader_num_workers: 4
- dataloader_prefetch_factor: None
- past_index: -1
- disable_tqdm: False
- remove_unused_columns: True
- label_names: None
- load_best_model_at_end: False
- ignore_data_skip: False
- fsdp: []
- fsdp_min_num_params: 0
- fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- fsdp_transformer_layer_cls_to_wrap: None
- accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- deepspeed: None
- label_smoothing_factor: 0.0
- optim: adamw_torch
- optim_args: None
- adafactor: False
- group_by_length: False
- length_column_name: length
- ddp_find_unused_parameters: None
- ddp_bucket_cap_mb: None
- ddp_broadcast_buffers: False
- dataloader_pin_memory: True
- dataloader_persistent_workers: False
- skip_memory_metrics: True
- use_legacy_prediction_loop: False
- push_to_hub: False
- resume_from_checkpoint: None
- hub_model_id: None
- hub_strategy: every_save
- hub_private_repo: None
- hub_always_push: False
- gradient_checkpointing: False
- gradient_checkpointing_kwargs: None
- include_inputs_for_metrics: False
- include_for_metrics: []
- eval_do_concat_batches: True
- fp16_backend: auto
- push_to_hub_model_id: None
- push_to_hub_organization: None
- mp_parameters:
- auto_find_batch_size: False
- full_determinism: False
- torchdynamo: None
- ray_scope: last
- ddp_timeout: 1800
- torch_compile: False
- torch_compile_backend: None
- torch_compile_mode: None
- dispatch_batches: None
- split_batches: None
- include_tokens_per_second: False
- include_num_input_tokens_seen: False
- neftune_noise_alpha: None
- optim_target_modules: None
- batch_eval_metrics: False
- eval_on_start: False
- use_liger_kernel: False
- eval_use_gather_object: False
- average_tokens_across_devices: False
- prompts: None
- batch_sampler: batch_sampler
- multi_dataset_batch_sampler: proportional

@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
@misc{hermans2017defense,
title={In Defense of the Triplet Loss for Person Re-Identification},
author={Alexander Hermans and Lucas Beyer and Bastian Leibe},
year={2017},
eprint={1703.07737},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
Base model
intfloat/multilingual-e5-large-instruct