---
tags:
- span-marker
- token-classification
- ner
- named-entity-recognition
- generated_from_span_marker_trainer
widget:
- text: On 07 Nov, send my brother the summary from section 2 of the document and enable airplane mode on my phone
- text: Could you please share the' Budget Reports' folder with me and update the notification settings in Slack before the Quarterly Review Meeting? Also, send the details to my email at emily . chen @ workmail . com
- text: Find all images from March 3rd that are less than 1MB, and read out the caption under figure 5 . Set the device to silent mode
- text: Please send the document named annual_report_2023 . xlsx from the Finance folder, specifically the summary on page 5, to my manager at manager @ acme . com
- text: Text my mother at + 44 7911 123456 the summary from paragraph 4, and then enable bluetooth
pipeline_tag: token-classification
library_name: span-marker
metrics:
- precision
- recall
- f1
model-index:
- name: SpanMarker
  results:
  - task:
      type: token-classification
      name: Named Entity Recognition
    dataset:
      name: Unknown
      type: unknown
      split: eval
    metrics:
    - type: f1
      value: 0.8683998712169995
      name: F1
    - type: precision
      value: 0.8558622877994606
      name: Precision
    - type: recall
      value: 0.8813102434242771
      name: Recall
---

# SpanMarker

This is a [SpanMarker](https://github.com/tomaarsen/SpanMarkerNER) model that can be used for Named Entity Recognition.
## Model Details

### Model Description

- **Model Type:** SpanMarker
- **Maximum Sequence Length:** 512 tokens
- **Maximum Entity Length:** 12 words

### Model Sources

- **Repository:** [SpanMarker on GitHub](https://github.com/tomaarsen/SpanMarkerNER)
- **Thesis:** [SpanMarker For Named Entity Recognition](https://raw.githubusercontent.com/tomaarsen/SpanMarkerNER/main/thesis.pdf)

### Model Labels

| Label          | Examples                                                                        |
|:---------------|:--------------------------------------------------------------------------------|
| action         | "Remind", "scheduled", "review"                                                 |
| app_data_type  | "items", "images", "videos"                                                     |
| app_name       | "Camera", "phone", "Slack"                                                      |
| contact_info   | "sarah . lee @ company . org", "123 Maple Street , Springfield", "home address" |
| date           | "20 . 10 . 1999", "before", "January 18 - June 15"                              |
| event_title    | "team sync", "Marketing Strategy Meeting", "Budget Planning"                    |
| file_name      | "notes", "budget_overview . xlsx", "project_plan . docx"                        |
| file_size      | "under 500 kb", "smaller than 50 kb", "exceeding 100 mb"                        |
| file_type      | "documents", "document", "image"                                                |
| folder_name    | "Projects", "Work", "Photos"                                                    |
| in_file_data   | "appendix section", "page 10", "section 5"                                      |
| limits         | "top 8", "all", "every"                                                         |
| location       | "Room 204", "server room", "library"                                            |
| person_name    | "Jonathan Kim", "Mr . Osei", "Lucas Müller"                                     |
| relationship   | "manager", "brother", "cousin"                                                  |
| setting        | "brightness", "airplane mode", "notifications"                                  |
| system_command | "disable", "move", "switch on"                                                  |
| time           | "9 : 00 AM", "10 : 45", "10 : 00 AM"                                            |

## Evaluation

### Metrics

| Label          | Precision | Recall | F1     |
|:---------------|:----------|:-------|:-------|
| **all**        | 0.8559    | 0.8813 | 0.8684 |
| action         | 0.8173    | 0.9245 | 0.8676 |
| app_data_type  | 0.7960    | 0.6828 | 0.7351 |
| app_name       | 0.9432    | 0.9432 | 0.9432 |
| contact_info   | 0.8722    | 0.9091 | 0.8903 |
| date           | 0.9160    | 0.8993 | 0.9076 |
| event_title    | 0.8659    | 0.9107 | 0.8877 |
| file_name      | 0.9371    | 0.9280 | 0.9326 |
| file_size      | 0.7810    | 0.7810 | 0.7810 |
| file_type      | 0.7731    | 0.8786 | 0.8225 |
| folder_name    | 0.9618    | 0.8968 | 0.9282 |
| in_file_data   | 0.7486    | 0.7867 | 0.7672 |
| limits         | 0.9048    | 0.6786 | 0.7755 |
| location       | 0.8917    | 0.8571 | 0.8741 |
| person_name    | 0.9885    | 0.9885 | 0.9885 |
| relationship   | 0.9505    | 0.9541 | 0.9523 |
| setting        | 0.8974    | 0.9255 | 0.9112 |
| system_command | 0.7889    | 0.7441 | 0.7659 |
| time           | 0.9076    | 0.8587 | 0.8825 |

## Uses

### Direct Use for Inference

```python
from span_marker import SpanMarkerModel

# Download from the 🤗 Hub
model = SpanMarkerModel.from_pretrained("span_marker_model_id")
# Run inference
entities = model.predict("Text my mother at + 44 7911 123456 the summary from paragraph 4, and then enable bluetooth")
```

### Downstream Use

You can finetune this model on your own dataset.

```python
from datasets import load_dataset
from span_marker import SpanMarkerModel, Trainer

# Download from the 🤗 Hub
model = SpanMarkerModel.from_pretrained("span_marker_model_id")

# Specify a Dataset with "tokens" and "ner_tags" columns
dataset = load_dataset("conll2003")  # For example CoNLL2003

# Initialize a Trainer using the pretrained model & dataset
trainer = Trainer(
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
)
trainer.train()
trainer.save_model("span_marker_model_id-finetuned")
```
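
For downstream consumption of predictions, the list returned by `model.predict` can be post-processed with plain Python. The sketch below assumes each prediction is a dict with `span`, `label`, and `score` keys, which is the shape SpanMarker returns for a single input string. The helper `group_entities`, the `min_score` threshold, and the sample predictions are hypothetical: they are hand-written for illustration and are not real output from this model.

```python
from collections import defaultdict
from typing import Dict, List


def group_entities(entities: List[dict], min_score: float = 0.5) -> Dict[str, List[str]]:
    """Group predicted spans by label, dropping predictions below min_score."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for entity in entities:
        if entity["score"] >= min_score:
            grouped[entity["label"]].append(entity["span"])
    return dict(grouped)


# Hand-written stand-in for the result of model.predict(...).
# The scores are made up for illustration; the character-offset keys that
# real predictions also carry are omitted for brevity.
example_entities = [
    {"span": "Text", "label": "action", "score": 0.97},
    {"span": "mother", "label": "relationship", "score": 0.99},
    {"span": "enable", "label": "system_command", "score": 0.42},
    {"span": "bluetooth", "label": "setting", "score": 0.95},
]

print(group_entities(example_entities))
```

With the default threshold the low-scoring `system_command` span is dropped; passing `min_score=0.0` keeps every prediction. A suitable threshold depends on the application and is best chosen on held-out data.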

## Training Details

### Training Set Metrics

| Training set          | Min | Median  | Max |
|:----------------------|:----|:--------|:----|
| Sentence length       | 3   | 19.0206 | 53  |
| Entities per sentence | 1   | 5.7015  | 13  |

### Training Hyperparameters

- learning_rate: 5e-05
- train_batch_size: 32
- eval_batch_size: 32
- seed: 42
- optimizer: AdamW (OptimizerNames.ADAMW_TORCH) with betas=(0.9, 0.999), epsilon=1e-08, and no additional optimizer arguments
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 5
- mixed_precision_training: Native AMP

### Training Results

| Epoch  | Step | Validation Loss | Validation Precision | Validation Recall | Validation F1 | Validation Accuracy |
|:------:|:----:|:---------------:|:--------------------:|:-----------------:|:-------------:|:-------------------:|
| 1.8553 | 1000 | 0.0344          | 0.8301               | 0.8650            | 0.8472        | 0.9204              |
| 3.7106 | 2000 | 0.0271          | 0.8524               | 0.8804            | 0.8662        | 0.9316              |

### Framework Versions

- Python: 3.12.12
- SpanMarker: 1.7.0
- Transformers: 4.51.3
- PyTorch: 2.8.0+cu126
- Datasets: 3.6.0
- Tokenizers: 0.21.4

## Citation

### BibTeX

```
@software{Aarsen_SpanMarker,
    author = {Aarsen, Tom},
    license = {Apache-2.0},
    title = {{SpanMarker for Named Entity Recognition}},
    url = {https://github.com/tomaarsen/SpanMarkerNER}
}
```