macpaw-research/mnemos_entity_extractor_v1_small

tags:
  - span-marker
  - token-classification
  - ner
  - named-entity-recognition
  - generated_from_span_marker_trainer
widget:
  - text: >-
      On 07 Nov, send my brother the summary from section 2 of the document and
      enable airplane mode on my phone
  - text: >-
      Could you please share the' Budget Reports' folder with me and update the
      notification settings in Slack before the Quarterly Review Meeting? Also,
      send the details to my email at emily . chen @ workmail . com
  - text: >-
      Find all images from March 3rd that are less than 1MB, and read out the
      caption under figure 5 . Set the device to silent mode
  - text: >-
      Please send the document named annual_report_2023 . xlsx from the Finance
      folder, specifically the summary on page 5, to my manager at manager @
      acme . com
  - text: >-
      Text my mother at + 44 7911 123456 the summary from paragraph 4, and then
      enable bluetooth
pipeline_tag: token-classification
library_name: span-marker
metrics:
  - precision
  - recall
  - f1
model-index:
  - name: SpanMarker
    results:
      - task:
          type: token-classification
          name: Named Entity Recognition
        dataset:
          name: Unknown
          type: unknown
          split: eval
        metrics:
          - type: f1
            value: 0.8683998712169995
            name: F1
          - type: precision
            value: 0.8558622877994606
            name: Precision
          - type: recall
            value: 0.8813102434242771
            name: Recall

SpanMarker

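This is a SpanMarker model for Named Entity Recognition. It tags 18 entity types in short user commands, including actions, app names, file names, dates, and contact details (see Model Labels).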

Model Details

Model Description

Model Type: SpanMarker
Maximum Sequence Length: 512 tokens
Maximum Entity Length: 12 words
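The maximum entity length matters because SpanMarker classifies candidate spans rather than individual tokens, so an entity longer than 12 words cannot be predicted at all. The sketch below counts how many contiguous word spans fit under that limit for a sentence of a given length. It assumes every contiguous span up to the limit is a candidate, which is a simplification, and `count_candidate_spans` is a hypothetical helper, not part of the SpanMarker API.

```python
def count_candidate_spans(num_words, max_entity_length=12):
    """Count contiguous word spans of length 1..max_entity_length in a sentence."""
    longest = min(num_words, max_entity_length)
    # A sentence of n words contains (n - k + 1) spans of exactly k words.
    return sum(num_words - length + 1 for length in range(1, longest + 1))


# Sentence lengths taken from the training set statistics (min 3, about 19 typical, max 53).
spans_short = count_candidate_spans(3)
spans_typical = count_candidate_spans(19)
spans_long = count_candidate_spans(53)
print(spans_short, spans_typical, spans_long)
```

The count grows roughly linearly with sentence length once the sentence is longer than the limit, which is why a modest cap keeps inference cost manageable.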

Model Sources

Repository: SpanMarker on GitHub (https://github.com/tomaarsen/SpanMarkerNER)
Thesis: SpanMarker For Named Entity Recognition

Model Labels

Label	Examples
action	"Remind", "scheduled", "review"
app_data_type	"items", "images", "videos"
app_name	"Camera", "phone", "Slack"
contact_info	"sarah . lee @ company . org", "123 Maple Street , Springfield", "home address"
date	"20 . 10 . 1999", "before", "January 18 - June 15"
event_title	"team sync", "Marketing Strategy Meeting", "Budget Planning"
file_name	"notes", "budget_overview . xlsx", "project_plan . docx"
file_size	"under 500 kb", "smaller than 50 kb", "exceeding 100 mb"
file_type	"documents", "document", "image"
folder_name	"Projects", "Work", "Photos"
in_file_data	"appendix section", "page 10", "section 5"
limits	"top 8", "all", "every"
location	"Room 204", "server room", "library"
person_name	"Jonathan Kim", "Mr . Osei", "Lucas Müller"
relationship	"manager", "brother", "cousin"
setting	"brightness", "airplane mode", "notifications"
system_command	"disable", "move", "switch on"
time	"9 : 00 AM", "10 : 45", "10 : 00 AM"

Evaluation

Metrics

Label	Precision	Recall	F1
all	0.8559	0.8813	0.8684
action	0.8173	0.9245	0.8676
app_data_type	0.7960	0.6828	0.7351
app_name	0.9432	0.9432	0.9432
contact_info	0.8722	0.9091	0.8903
date	0.9160	0.8993	0.9076
event_title	0.8659	0.9107	0.8877
file_name	0.9371	0.9280	0.9326
file_size	0.7810	0.7810	0.7810
file_type	0.7731	0.8786	0.8225
folder_name	0.9618	0.8968	0.9282
in_file_data	0.7486	0.7867	0.7672
limits	0.9048	0.6786	0.7755
location	0.8917	0.8571	0.8741
person_name	0.9885	0.9885	0.9885
relationship	0.9505	0.9541	0.9523
setting	0.8974	0.9255	0.9112
system_command	0.7889	0.7441	0.7659
time	0.9076	0.8587	0.8825
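The F1 column is the harmonic mean of precision and recall, so any row can be checked by hand. A minimal sketch, using the overall scores from the model metadata and the rounded `limits` row from the table; `f1_score` is a local helper defined here, not a library call.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Overall scores as reported in the model metadata.
overall_f1 = f1_score(0.8558622877994606, 0.8813102434242771)

# The "limits" row: high precision but low recall pulls F1 toward the lower value.
limits_f1 = f1_score(0.9048, 0.6786)

print(round(overall_f1, 4), round(limits_f1, 4))
```

Rows with a wide gap between precision and recall, such as `limits` and `app_data_type`, are the ones where the model misses entities more often than it mislabels them.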

Uses

Direct Use for Inference

from span_marker import SpanMarkerModel

# Download from the 🤗 Hub
model = SpanMarkerModel.from_pretrained("macpaw-research/mnemos_entity_extractor_v1_small")
# Run inference
entities = model.predict("Text my mother at + 44 7911 123456 the summary from paragraph 4, and then enable bluetooth")
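For a single input string, `predict` returns a list of dictionaries, one per entity, with keys such as `span`, `label`, and `score`. The sketch below shows one way to post-process that list by grouping spans under their labels and dropping low-confidence predictions. The `sample_entities` values are hypothetical and only illustrate the shape of the output; they are not real predictions from this model, and `group_by_label` and its `min_score` threshold are assumptions for illustration.

```python
from collections import defaultdict


def group_by_label(entities, min_score=0.5):
    """Group entity spans by label, skipping predictions below min_score."""
    grouped = defaultdict(list)
    for entity in entities:
        if entity["score"] >= min_score:
            grouped[entity["label"]].append(entity["span"])
    return dict(grouped)


# Hypothetical entities in the shape predict() returns; NOT real model output.
sample_entities = [
    {"span": "Text", "label": "action", "score": 0.97},
    {"span": "mother", "label": "relationship", "score": 0.99},
    {"span": "+ 44 7911 123456", "label": "contact_info", "score": 0.95},
    {"span": "paragraph 4", "label": "in_file_data", "score": 0.41},
    {"span": "enable", "label": "system_command", "score": 0.88},
    {"span": "bluetooth", "label": "setting", "score": 0.93},
]

grouped = group_by_label(sample_entities)
for label, spans in grouped.items():
    print(f"{label}: {spans}")
```

With the real model, `entities` from the snippet above can be passed in place of `sample_entities`.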

Downstream Use

You can finetune this model on your own dataset.

from datasets import load_dataset
from span_marker import SpanMarkerModel, Trainer

# Download from the 🤗 Hub
model = SpanMarkerModel.from_pretrained("macpaw-research/mnemos_entity_extractor_v1_small")

# Specify a Dataset with "tokens" and "ner_tags" columns
dataset = load_dataset("conll2003") # For example CoNLL2003; substitute your own dataset

# Initialize a Trainer using the pretrained model & dataset
trainer = Trainer(
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
)
trainer.train()
trainer.save_model("mnemos_entity_extractor_v1_small-finetuned")
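The trainer expects each example to pair a list of tokens with a per-token tag list. If your annotations are stored as word-level spans, the following sketch converts them to IOB2 tags, assuming IOB2 is the scheme you want to use. The helper `spans_to_iob2` and the example annotation are hypothetical, and the string tags would still need to be mapped to integer IDs through a label list before training.

```python
def spans_to_iob2(tokens, spans):
    """Convert (start, end, label) word spans, end exclusive, to IOB2 tags."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first word of the entity
        for index in range(start + 1, end):
            tags[index] = f"I-{label}"      # remaining words of the entity
    return tags


# Hypothetical annotation using this model's label names.
tokens = ["Text", "my", "mother", "the", "summary", "from", "paragraph", "4"]
spans = [(0, 1, "action"), (2, 3, "relationship"), (6, 8, "in_file_data")]

example = {"tokens": tokens, "ner_tags": spans_to_iob2(tokens, spans)}
print(example)
```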

Training Details

Training Set Metrics

Training set	Min	Median	Max
Sentence length	3	19.0206	53
Entities per sentence	1	5.7015	13

Training Hyperparameters

learning_rate: 5e-05
train_batch_size: 32
eval_batch_size: 32
seed: 42
optimizer: AdamW (OptimizerNames.ADAMW_TORCH) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
lr_scheduler_type: linear
lr_scheduler_warmup_ratio: 0.1
num_epochs: 5
mixed_precision_training: Native AMP
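To make the scheduler settings concrete, the sketch below reproduces the shape of a linear schedule with a 0.1 warmup ratio: the learning rate rises linearly from 0 to 5e-05 during warmup, then decays linearly to 0. The total step count is an assumption inferred from the Training Results table (epoch 1.8553 at step 1000 suggests about 539 steps per epoch, so roughly 2695 steps over 5 epochs); the actual run may differ. `linear_lr` is a hypothetical helper, not the library scheduler itself.

```python
import math

BASE_LR = 5e-05
WARMUP_RATIO = 0.1
TOTAL_STEPS = 2695  # assumption: about 539 steps per epoch x 5 epochs


def linear_lr(step, total_steps=TOTAL_STEPS, base_lr=BASE_LR, warmup_ratio=WARMUP_RATIO):
    """Learning rate at a given step for linear warmup followed by linear decay."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Warmup: scale up linearly from 0 to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay: scale down linearly from base_lr to 0 at the final step.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 135, 270, 1000, 2000, 2695):
    print(step, linear_lr(step))
```

Under these assumptions, both evaluation checkpoints in the results table (steps 1000 and 2000) fall in the decay phase.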

Training Results

Epoch	Step	Validation Loss	Validation Precision	Validation Recall	Validation F1	Validation Accuracy
1.8553	1000	0.0344	0.8301	0.8650	0.8472	0.9204
3.7106	2000	0.0271	0.8524	0.8804	0.8662	0.9316

Framework Versions

Python: 3.12.12
SpanMarker: 1.7.0
Transformers: 4.51.3
PyTorch: 2.8.0+cu126
Datasets: 3.6.0
Tokenizers: 0.21.4

Citation

BibTeX

@software{Aarsen_SpanMarker,
    author = {Aarsen, Tom},
    license = {Apache-2.0},
    title = {{SpanMarker for Named Entity Recognition}},
    url = {https://github.com/tomaarsen/SpanMarkerNER}
}