zbller committed on
Commit 34c8a90 · verified · 1 Parent(s): 4150c2c

Upload folder using huggingface_hub

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+abst.png filter=lfs diff=lfs merge=lfs -text
.gitignore ADDED
@@ -0,0 +1,17 @@
+# Python-generated files
+__pycache__/
+*.py[oc]
+build/
+dist/
+wheels/
+*.egg-info
+
+.venv
+
+annotations*/
+experiments/
+lightning_logs/
+cache/
+results/
+
+KWDLC/
README.md CHANGED
@@ -1,14 +1,153 @@
 ---
-title: Mecari
-emoji: 🦀
-colorFrom: yellow
-colorTo: purple
+title: Mecari Morpheme Analyzer
+emoji: 🧩
+colorFrom: indigo
+colorTo: blue
 sdk: gradio
-sdk_version: 5.44.1
+sdk_version: 4.37.2
 app_file: app.py
 pinned: false
-license: cc-by-nc-4.0
-short_description: 'Demo of Mecari: GNN-based Morphological Analyzer'
 ---
 
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+# Mecari (Japanese Morphological Analysis with Graph Neural Networks)
+
+## Training
+
+### Overview
+
+Mecari [1] is a GNN‑based Japanese morphological analyzer. It supports training from partially annotated graphs (`+`/`-` labels are used where available; `?` is ignored) and aims for fast training and inference.
+
+<p align="center">
+  <img src="abst.png" alt="Overview" width="70%" />
+  <!-- Adjust width (e.g., 60%, 50%, or px) as desired -->
+</p>
+
+### Graph
+The graph is built from MeCab morpheme candidates.
+
+### Annotation
+Annotations are created by matching morpheme candidates against the gold labels; they serve as the training targets (supervision) during learning:
+- `+`: a candidate that exactly matches the gold.
+- `-`: any other candidate that overlaps a `+` candidate by at least one character.
+- `?`: all other candidates (ignored during training).
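The `+`/`-`/`?` rule above can be sketched in a few lines. This is a rough illustration over character spans; the function and variable names are hypothetical, not the repository's actual API.

```python
# Sketch of the +/-/? annotation rule. Candidates and gold morphemes are
# (start, end) character spans; names here are illustrative only.

def label_candidates(candidates, gold_spans):
    gold = set(gold_spans)
    labels = {}
    # First pass: candidates that exactly match a gold span get '+',
    # everything else starts as '?'.
    for c in candidates:
        labels[c] = "+" if c in gold else "?"
    plus = [c for c, lab in labels.items() if lab == "+"]
    # Second pass: a '?' candidate overlapping a '+' span by >= 1 char becomes '-'.
    for c in candidates:
        if labels[c] == "?" and any(c[0] < p[1] and p[0] < c[1] for p in plus):
            labels[c] = "-"
    return labels
```

Only the `+` and `-` candidates would contribute to the loss; `?` candidates are masked out.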
+
+### Training
+Nodes are featurized with JUMAN++‑style unigram features, edges are modeled as undirected (bidirectional), and a GATv2 [2] is trained on the resulting graphs.
+
+### Inference
+At inference time, a Viterbi search over the model’s node scores selects the optimal path of non‑overlapping morpheme candidates.
+
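A minimal sketch of this decoding step, assuming each candidate is a scored character span and that at least one full segmentation of the sentence exists (names are illustrative, not the repository's API):

```python
# Dynamic-programming (Viterbi-style) search over scored candidate spans.
# Each candidate is (start, end, score); we pick non-overlapping spans that
# cover the sentence [0, length) with maximal total score.
# Assumes at least one full segmentation exists, otherwise best[length] is missing.

def viterbi_decode(candidates, length):
    # best[i] = (best total score of a segmentation of text[:i], last chosen candidate)
    best = {0: (0.0, None)}
    for i in range(1, length + 1):
        for c in candidates:
            start, end, score = c
            if end == i and start in best:
                total = best[start][0] + score
                if i not in best or total > best[i][0]:
                    best[i] = (total, c)
    # Backtrack from the end of the sentence
    path, i = [], length
    while i > 0:
        c = best[i][1]
        path.append(c)
        i = c[0]
    return list(reversed(path))
```

A higher-scoring long candidate beats a chain of short ones, which is how the node scores resolve segmentation ambiguity.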
+## Results (KWDLC test)
+
+- Trained model (sample_model): Seg F1 0.9725, POS F1 0.9562
+- MeCab (JUMANDIC) baseline: Seg F1 0.9677, POS F1 0.9465
+
+The GATv2 model trained with this repository (current code and `configs/gatv2.yaml`) using the official KWDLC split outperforms MeCab on both segmentation and POS accuracy.
+
+## Tested Environment
+
+- OS: Ubuntu 24.04.3 LTS (Noble Numbat)
+- Python: 3.11.3
+- PyTorch: 2.2.2+cu121
+- CUDA (runtime): 12.1 (cu121)
+- MeCab (binary): 0.996
+- JUMANDIC: `/var/lib/mecab/dic/juman-utf8`
+
+## MeCab Setup (Ubuntu 24.04)
+1) Install packages (includes the JUMANDIC dictionary)
+
+```bash
+sudo apt update
+sudo apt install -y mecab mecab-utils libmecab-dev mecab-jumandic-utf8
+```
+
+2) Verify installation
+
+```bash
+mecab -v  # e.g., mecab of 0.996
+test -d /var/lib/mecab/dic/juman-utf8 && echo "JUMANDIC OK"
+```
+
+## Project Setup
+
+```bash
+# Install uv if needed
+curl -LsSf https://astral.sh/uv/install.sh | sh
+
+# Create venv and install dependencies
+uv venv
+source .venv/bin/activate
+uv sync
+```
+
+## Quickstart (Morphological analysis)
+
+```bash
+# Analyze a single sentence with the bundled sample model
+python infer.py --text "東京都の外国人参政権"
+
+# Interactive mode
+python infer.py
+
+# After training, specify an experiment to use a custom model
+python infer.py --experiment gatv2_YYYYMMDD_HHMMSS --text "..."
+```
+
+Note:
+- When no experiment is specified, the model at `sample_model/` is loaded by default.
+
+## Train it yourself
+### KWDLC Setup (Required)
+
+```bash
+cd /path/to/Mecari
+git clone --depth 1 https://github.com/ku-nlp/KWDLC
+```
+
+- Training requires KWDLC (non‑KWDLC training is not supported at the moment).
+- Splits strictly follow the official `dev.id` / `test.id` files.
+
+### Preprocessing
+
+```bash
+python preprocess.py --config configs/gatv2.yaml
+```
+
+### Training
+
+```bash
+python train.py --config configs/gatv2.yaml
+```
+
+- Outputs are saved under `experiments/<name>/`.
+- The bundled model was trained with the current codebase and configuration (`configs/gatv2.yaml`).
+
+### Evaluation
+
+```bash
+python evaluate.py --max-samples 50 \
+  --experiment gatv2_YYYYMMDD_HHMMSS
+```
+
+## License
+
+CC BY‑NC 4.0 (non‑commercial use only)
+
+## Acknowledgments
+- [1] Technical inspiration: Mecari, a morphological analysis system developed by Google, as described in “Data processing for Japanese text‑to‑pronunciation models” by G. Mazovetskiy and T. Kudo (NLP2024 Workshop on Japanese Language Resources), pp. 19–23. URL: https://jedworkshop.github.io/JLR2024/materials/b-2.pdf
+- [2] Graph architecture: Brody, Shaked, Uri Alon, and Eran Yahav. “How Attentive Are Graph Attention Networks?” 10th International Conference on Learning Representations (ICLR 2022), 2022.
+
+## Disclaimer
+- Independent academic implementation for educational and research purposes.
+- Core concepts (graph‑based morpheme boundary annotation) follow the published work; implementation details and code structure are our interpretation.
+- Not affiliated with, endorsed by, or connected to Google or its subsidiaries.
+
+## Purpose
+- Academic research
+- Education
+- Technical skill development
+- Understanding of NLP techniques
abst.png ADDED

Git LFS Details

  • SHA256: 0a8f3de11ee14f75fe879878912d5c49fb761b5ee773ad97418427922a521742
  • Pointer size: 131 Bytes
  • Size of remote file: 344 kB
app.py ADDED
@@ -0,0 +1,174 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+
+import os
+import subprocess
+import gradio as gr
+
+# Ensure wandb never starts in Spaces
+os.environ["WANDB_MODE"] = "disabled"
+
+# Resolve MeCab binary for this process
+_default_mecab = "/usr/bin/mecab" if os.path.exists("/usr/bin/mecab") else "mecab"
+MECAB_BIN = os.getenv("MECAB_BIN", _default_mecab)
+os.environ["MECAB_BIN"] = MECAB_BIN
+
+# Lazy-loaded model
+_model = None
+_exp_info = None
+
+
+def _ensure_model():
+    global _model, _exp_info
+    if _model is None:
+        from infer import load_model
+
+        result = load_model()
+        if result is None:
+            raise RuntimeError(
+                "Model could not be loaded. Ensure sample_model/ exists with config.yaml and model.pt."
+            )
+        _model, _exp_info = result
+
+
+def _to_mecab_lines(results, optimal_morphemes=None) -> str:
+    # Build MeCab-like output lines
+    def mecab_features(m):
+        pos = m.get("pos", "*")
+        pos1 = m.get("pos_detail1", "*")
+        pos2 = m.get("pos_detail2", "*")
+        ctype = m.get("inflection_type", "*")
+        cform = m.get("inflection_form", "*")
+        base = m.get("base_form", m.get("lemma", "*")) or "*"
+        # Mecari output includes reading as 7th field
+        reading = m.get("reading", "*") or "*"
+        return f"{pos},{pos1},{pos2},{ctype},{cform},{base},{reading}"
+
+    items = (
+        optimal_morphemes
+        if optimal_morphemes
+        else [
+            {
+                "surface": r.get("surface", ""),
+                "pos": r.get("pos", "*"),
+                "pos_detail1": "*",
+                "pos_detail2": "*",
+                "inflection_type": "*",
+                "inflection_form": "*",
+                "base_form": r.get("surface", ""),
+                "reading": r.get("reading", "*"),
+            }
+            for r in results
+        ]
+    )
+
+    lines = [f"{m.get('surface','')}\t{mecab_features(m)}" for m in items]
+    lines.append("EOS")
+    return "\n".join(lines)
+
+
+def mecab_plain(text: str) -> str:
+    """Run system MeCab and return its raw parsing (surface\tCSV ...\nEOS)."""
+    try:
+        from mecari.analyzers.mecab import MeCabAnalyzer
+
+        analyzer = MeCabAnalyzer()
+        mecab_bin = os.getenv("MECAB_BIN", analyzer.mecab_bin)
+        args = [mecab_bin]
+        if isinstance(analyzer.jumandic_path, str) and os.path.isdir(analyzer.jumandic_path):
+            args += ["-d", analyzer.jumandic_path]
+        p = subprocess.run(args, input=text, text=True, capture_output=True)
+        out = (p.stdout or "") + ("\n" + p.stderr if p.stderr else "")
+        if p.returncode != 0:
+            return out.strip() or f"mecab error rc={p.returncode}"
+        # Trim extra tail fields (e.g., カテゴリ:*, ドメイン:*) and keep first 6 features
+        lines = []
+        for line in out.splitlines():
+            if not line or line.strip() == "EOS":
+                lines.append("EOS")
+                continue
+            if "\t" in line:
+                surface, feats = line.split("\t", 1)
+                parts = [s.strip() for s in feats.split(",")]
+                trimmed = parts[:6]
+                while len(trimmed) < 6:
+                    trimmed.append("*")
+                lines.append(f"{surface}\t{','.join(trimmed)}")
+            else:
+                lines.append(line)
+        # Ensure trailing EOS only once
+        if not lines or lines[-1] != "EOS":
+            lines.append("EOS")
+        return "\n".join(lines)
+    except FileNotFoundError:
+        return "MeCabバイナリが見つかりません(MECAB_BINやpackages.txtを確認)。"
+    except Exception as e:
+        return f"mecab実行時エラー: {e}"
+
+
+def analyze(text: str):
+    if not text or not text.strip():
+        return "", ""
+
+    try:
+        _ensure_model()
+        from infer import predict_morphemes_from_text
+
+        text = text.strip()
+        result = predict_morphemes_from_text(text, _model, _exp_info, silent=True)
+        if not result:
+            return "推論に失敗しました。", mecab_plain(text)
+        results, optimal_morphemes = result
+        mecari_out = _to_mecab_lines(results, optimal_morphemes)
+        mecab_out = mecab_plain(text)
+        return mecari_out, mecab_out
+    except FileNotFoundError:
+        return (
+            "MeCabが見つかりません。Spaceのpackages.txtに 'mecab' と 'mecab-jumandic-utf8' を含めてビルドし直すか、\n"
+            "変数 MECAB_BIN=/usr/bin/mecab を設定してください。"
+        ), ""
+    except Exception as e:
+        import traceback
+
+        tb = traceback.format_exc()
+        return f"エラー: {e}\n\n{tb}", ""
+
+
+FONT_CSS = """
+/* Prefer common system fonts for Latin text */
+body, .gradio-container, .prose, textarea, input, button,
+.gr-text-input input, .gr-text-input textarea, .gr-textbox textarea {
+  font-family: system-ui, -apple-system, 'Segoe UI', Roboto, 'Noto Sans',
+               'Helvetica Neue', Arial, 'Apple Color Emoji', 'Segoe UI Emoji',
+               sans-serif !important;
+}
+"""
+
+with gr.Blocks(theme=gr.themes.Soft(), css=FONT_CSS) as demo:
+    gr.Markdown(
+        """
+        # Mecari Morpheme Analyzer
+
+        GNNベースの形態素解析器"Mecari"のデモです。github: https://github.com/zbller/Mecari
+        """
+    )
+
+    with gr.Row():
+        inp = gr.Textbox(label="テキスト入力", value="とうきょうに行った", placeholder="とうきょうに行った", lines=3)
+    btn = gr.Button("解析する")
+    with gr.Row():
+        out_mecari = gr.Textbox(label="Mecari", lines=10)
+        out_mecab = gr.Textbox(label="MeCab(Jumandic)", lines=10)
+    btn.click(fn=analyze, inputs=inp, outputs=[out_mecari, out_mecab])
+
+    # Optional warm-up
+    def _warmup():
+        try:
+            _ensure_model()
+        except Exception:
+            pass
+
+    _warmup()
+
+if __name__ == "__main__":
+    demo.launch(server_name="0.0.0.0", server_port=int(os.getenv("PORT", "7863")))
configs/base.yaml ADDED
@@ -0,0 +1,11 @@
+features:
+  lexical_feature_dim: 100000
+
+training:
+  deterministic: false
+  annotations_dir: "annotations"
+  project_name: "mecari"
+
+inference:
+  checkpoint_dir: "experiments"
+  experiment_name: null
configs/gatv2.yaml ADDED
@@ -0,0 +1,36 @@
+extends: "base.yaml"
+
+model:
+  type: "gatv2"
+  hidden_dim: 64
+  num_layers: 4
+  num_heads: 4
+  dropout: 0.1
+  num_classes: 1
+  share_weights: false
+
+edge_features:
+  use_bidirectional_edges: true
+
+training:
+  learning_rate: 0.001
+  batch_size: 128
+  max_steps: 10000
+  patience: 10
+  gradient_clip_val: 0.5
+  gradient_clip_algorithm: "norm"
+  num_workers: 4
+  accumulate_grad_batches: 1
+  seed: 42
+  warmup_steps: 500
+  warmup_start_lr: 0.0
+  optimizer:
+    type: "adamw"
+    weight_decay: 0.001
+  use_wandb: true
+  log_every_n_steps: 50
+  val_check_interval: 1.0
+
+loss:
+  use_pos_weight: true
+  label_smoothing: 0.0
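The `extends: "base.yaml"` key implies a config-inheritance step, with child values overriding base values. As a hedged sketch (the repository's actual loader may differ), a recursive deep merge could look like:

```python
from pathlib import Path


def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge `override` into `base`; override values win on conflicts."""
    merged = dict(base)
    for key, value in override.items():
        if key in merged and isinstance(merged[key], dict) and isinstance(value, dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def load_config(path: str) -> dict:
    """Load a YAML config, resolving an optional `extends` key recursively."""
    import yaml  # PyYAML, as used elsewhere in the repo

    cfg = yaml.safe_load(Path(path).read_text(encoding="utf-8")) or {}
    parent = cfg.pop("extends", None)
    if parent:
        # Resolve the parent path relative to the child config file
        base = load_config(str(Path(path).parent / parent))
        cfg = deep_merge(base, cfg)
    return cfg
```

Under this scheme, `gatv2.yaml`'s `training` block would be merged into `base.yaml`'s `training` block rather than replacing it wholesale.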
evaluate.py ADDED
@@ -0,0 +1,288 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+
+"""Unified evaluation for MeCab (JUMANDIC) and the trained model.
+
+Evaluates both systems on the same KWDLC test data and compares results.
+"""
+
+import argparse
+import subprocess
+from pathlib import Path
+from typing import Dict, List
+
+import torch
+from tqdm import tqdm
+
+
+def parse_knp_file(knp_file: Path) -> List[Dict]:
+    """Extract gold morphemes from a KNP file."""
+    sentences = []
+    current_sentence = []
+    current_text = ""
+
+    with open(knp_file, "r", encoding="utf-8") as f:
+        for line in f:
+            line = line.rstrip("\n")
+
+            if line.startswith("#"):
+                if line.startswith("# S-ID:"):
+                    if current_sentence:
+                        sentences.append({"morphemes": current_sentence, "text": current_text})
+                        current_sentence = []
+                        current_text = ""
+                continue
+            elif line == "EOS":
+                if current_sentence:
+                    sentences.append({"morphemes": current_sentence, "text": current_text})
+                    current_sentence = []
+                    current_text = ""
+            elif line.startswith("+") or line.startswith("*"):
+                continue
+            elif line:
+                parts = line.split(" ")
+                if len(parts) >= 4:
+                    surface = parts[0]
+                    reading = parts[1]
+                    pos = parts[3]
+
+                    current_sentence.append({"surface": surface, "reading": reading, "pos": pos})
+                    current_text += surface
+
+    return sentences
+
+
+def analyze_with_mecab(text: str) -> List[Dict]:
+    """Analyze text with MeCab (JUMANDIC) using a simple best-path parse."""
+    try:
+        result = subprocess.run(
+            ["mecab", "-d", "/var/lib/mecab/dic/juman-utf8"],
+            input=text,
+            capture_output=True,
+            text=True,
+            encoding="utf-8",
+        )
+
+        if result.returncode != 0:
+            return []
+
+        morphemes = []
+        for line in result.stdout.strip().split("\n"):
+            if line == "EOS":
+                break
+            parts = line.split("\t")
+            if len(parts) >= 2:
+                surface = parts[0]
+                features = parts[1].split(",")
+                if len(features) >= 7:
+                    pos = features[0]
+                    # Do not fall back to the surface form when the reading is missing ('*')
+                    reading = features[7] if len(features) > 7 and features[7] != "*" else ""
+
+                    morphemes.append({"surface": surface, "reading": reading, "pos": pos})
+
+        return morphemes
+    except Exception as e:
+        print(f"MeCab error: {e}")
+        return []
+
+
+def analyze_with_jumanpp(text: str) -> List[Dict]:
+    """Analyze text with JUMAN++ (optional baseline)."""
+    try:
+        result = subprocess.run(["jumanpp"], input=text, capture_output=True, text=True, encoding="utf-8")
+
+        if result.returncode != 0:
+            return []
+
+        morphemes = []
+        for line in result.stdout.strip().split("\n"):
+            if line.startswith("@") or line == "EOS":
+                continue
+            parts = line.split(" ")
+            if len(parts) >= 12:
+                surface = parts[0]
+                reading = parts[1]
+                pos = parts[3]
+
+                morphemes.append({"surface": surface, "reading": reading, "pos": pos})
+
+        return morphemes
+    except Exception as e:
+        print(f"JUMAN++ error: {e}")
+        return []
+
+
+def analyze_with_model(text: str, model, experiment_info) -> List[Dict]:
+    """Analyze text with the trained model."""
+    try:
+        import infer
+
+        results, optimal_morphemes = infer.predict_morphemes_from_text(
+            text, model=model, experiment_info=experiment_info, silent=True
+        )
+
+        morphemes = []
+        for morph in optimal_morphemes:
+            morphemes.append(
+                {"surface": morph["surface"], "reading": morph.get("reading", ""), "pos": morph.get("pos", "*")}
+            )
+
+        return morphemes
+    except Exception as e:
+        print(f"Model inference error: {e}")
+        return []
+
+
+def evaluate_morphemes(gold_morphemes: List[Dict], pred_morphemes: List[Dict]) -> Dict:
+    """Compute segmentation and POS F1 between gold and predictions."""
+    gold_spans = []
+    pred_spans = []
+
+    # Gold spans (from gold morphemes)
+    pos = 0
+    for m in gold_morphemes:
+        surface = m["surface"]
+        end = pos + len(surface)
+        gold_spans.append((pos, end, m["pos"]))
+        pos = end
+
+    # Predicted spans (from predictions)
+    pos = 0
+    for m in pred_morphemes:
+        surface = m["surface"]
+        end = pos + len(surface)
+        pred_spans.append((pos, end, m["pos"]))
+        pos = end
+
+    # Segmentation accuracy (without POS)
+    gold_seg = {(s, e) for s, e, _ in gold_spans}
+    pred_seg = {(s, e) for s, e, _ in pred_spans}
+
+    seg_correct = len(gold_seg & pred_seg)
+    seg_precision = seg_correct / len(pred_seg) if pred_seg else 0
+    seg_recall = seg_correct / len(gold_seg) if gold_seg else 0
+    seg_f1 = 2 * seg_precision * seg_recall / (seg_precision + seg_recall) if (seg_precision + seg_recall) > 0 else 0
+
+    # Accuracy with POS
+    gold_pos = set(gold_spans)
+    pred_pos = set(pred_spans)
+
+    pos_correct = len(gold_pos & pred_pos)
+    pos_precision = pos_correct / len(pred_pos) if pred_pos else 0
+    pos_recall = pos_correct / len(gold_pos) if gold_pos else 0
+    pos_f1 = 2 * pos_precision * pos_recall / (pos_precision + pos_recall) if (pos_precision + pos_recall) > 0 else 0
+
+    return {
+        "seg_precision": seg_precision,
+        "seg_recall": seg_recall,
+        "seg_f1": seg_f1,
+        "pos_precision": pos_precision,
+        "pos_recall": pos_recall,
+        "pos_f1": pos_f1,
+    }
+
+
+def main():
+    parser = argparse.ArgumentParser(description="Unified evaluation script")
+    parser.add_argument("--kwdlc-dir", type=str, default="KWDLC", help="Path to KWDLC root directory")
+    parser.add_argument(
+        "--test-ids", type=str, default="KWDLC/id/split_for_pas/test.id", help="File containing test IDs (one per line)"
+    )
+    parser.add_argument(
+        "--max-samples", type=int, default=None, help="Max number of samples to evaluate (default: all)"
+    )
+    parser.add_argument("--experiment", "-e", type=str, required=True, help="Experiment name to evaluate")
+
+    args = parser.parse_args()
+
+    # Load test document IDs
+    test_ids = []
+    with open(args.test_ids, "r") as f:
+        for line in f:
+            test_ids.append(line.strip())
+
+    if args.max_samples is not None:
+        test_ids = test_ids[: args.max_samples]
+
+    print(f"Evaluating: {len(test_ids)} files")
+
+    import infer
+
+    model_info = infer.load_model(experiment_name=args.experiment)
+    if model_info:
+        model, experiment_info = model_info
+        # Force CPU execution for evaluation
+        device = torch.device("cpu")
+        model = model.to(device)
+        experiment_info["device"] = device
+        print(f"Model: {experiment_info['name']}")
+    else:
+        print("Failed to load model")
+        model = None
+        experiment_info = None
+
+    mecab_results = []
+    model_results = []
+
+    print("\nStart evaluation...")
+    for test_id in tqdm(test_ids, desc="evaluating"):
+        # Find KNP file
+        found = False
+        knp_base = Path(args.kwdlc_dir) / "knp"
+
+        for subdir in knp_base.glob("w*"):
+            candidate = subdir / f"{test_id}.knp"
+            if candidate.exists():
+                knp_path = candidate
+                found = True
+                break
+
+        if not found:
+            continue
+
+        # Read gold data
+        gold_sentences = parse_knp_file(knp_path)
+
+        for sent_data in gold_sentences:
+            text = sent_data["text"]
+            gold_morphemes = sent_data["morphemes"]
+
+            # MeCab (JUMANDIC)
+            pred_mecab = analyze_with_mecab(text)
+            if pred_mecab:
+                result = evaluate_morphemes(gold_morphemes, pred_mecab)
+                mecab_results.append(result)
+
+            # Trained model
+            if model is not None:
+                pred_model = analyze_with_model(text, model, experiment_info)
+                if pred_model:
+                    model_eval = evaluate_morphemes(gold_morphemes, pred_model)
+                    model_results.append(model_eval)
+
+    # Aggregate and display results
+    print("\n" + "=" * 70)
+    print("Evaluation Results (KWDLC test data)")
+    print("=" * 70)
+    print(f"Num evaluated: MeCab={len(mecab_results)}, Model={len(model_results)}")
+
+    # MeCab (JUMANDIC)
+    if mecab_results:
+        avg_seg_f1 = sum(r["seg_f1"] for r in mecab_results) / len(mecab_results)
+        avg_pos_f1 = sum(r["pos_f1"] for r in mecab_results) / len(mecab_results)
+        print("\n[1] MeCab (JUMANDIC):")
+        print(f"  Seg F1: {avg_seg_f1:.4f}")
+        print(f"  POS F1: {avg_pos_f1:.4f}")
+
+    # Trained model
+    if model_results:
+        avg_seg_f1 = sum(r["seg_f1"] for r in model_results) / len(model_results)
+        avg_pos_f1 = sum(r["pos_f1"] for r in model_results) / len(model_results)
+        print(f"\n[2] Trained model ({experiment_info['name']}):")
+        print(f"  Seg F1: {avg_seg_f1:.4f}")
+        print(f"  POS F1: {avg_pos_f1:.4f}")
+
+
+if __name__ == "__main__":
+    main()
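The span-level F1 computed by `evaluate_morphemes` can be checked on toy data. This standalone snippet restates the segmentation part of that logic (simplified to surfaces only; the helper name is illustrative):

```python
# Minimal restatement of the segmentation-F1 logic from evaluate_morphemes,
# for a quick sanity check on toy segmentations.

def seg_f1(gold_surfaces, pred_surfaces):
    def spans(surfaces):
        # Convert a list of surface strings into (start, end) character spans
        out, pos = set(), 0
        for s in surfaces:
            out.add((pos, pos + len(s)))
            pos += len(s)
        return out

    gold, pred = spans(gold_surfaces), spans(pred_surfaces)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Two segmentations of the same text share a span only when both its boundaries agree, which is why an off-by-one split costs both precision and recall.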
infer.py ADDED
@@ -0,0 +1,797 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ #!/usr/bin/env python3
2
+ # -*- coding: utf-8 -*-
3
+
4
+ # Show immediate feedback from the moment the command starts
5
+ print("Loading model...", flush=True)
6
+
7
+ import os
8
+ import random
9
+ from typing import Any, Dict, Optional, Tuple
10
+
11
+ # Disable WandB during inference to avoid hanging processes
12
+ os.environ["WANDB_MODE"] = "disabled"
13
+
14
+ from importlib import import_module
15
+
16
+ import numpy as np
17
+ import torch
18
+ import yaml
19
+
20
+ from mecari.analyzers.mecab import MeCabAnalyzer
21
+ from mecari.data.data_module import DataModule
22
+ from mecari.utils.morph_utils import build_adjacent_edges, dedup_morphemes, normalize_mecab_candidates
23
+
24
+
25
+ def set_seed(seed: int = 42) -> None:
26
+ """Set random seeds for reproducibility during inference.
27
+
28
+ Args:
29
+ seed: Random seed value.
30
+ """
31
+ random.seed(seed)
32
+ np.random.seed(seed)
33
+ torch.manual_seed(seed)
34
+ torch.cuda.manual_seed(seed)
35
+ torch.cuda.manual_seed_all(seed)
36
+ torch.backends.cudnn.deterministic = True
37
+ torch.backends.cudnn.benchmark = False
38
+
39
+
40
+ set_seed(42)
41
+
42
+
43
+ def _find_best_checkpoint(checkpoints_dir: str, prefer_metric: str = "val_error") -> Tuple[Optional[str], float]:
44
+ """Find the best checkpoint file in a directory.
45
+
46
+ Args:
47
+ checkpoints_dir: Path to the checkpoints directory.
48
+ prefer_metric: Preferred metric ("val_error" or "val_loss").
49
+
50
+ Returns:
51
+ Tuple of (best checkpoint filename, score).
52
+ """
53
+ checkpoint_files = [f for f in os.listdir(checkpoints_dir) if f.endswith(".ckpt")]
54
+ if not checkpoint_files:
55
+ return None, float("inf")
56
+
57
+ best_checkpoint = None
58
+ best_score = float("inf")
59
+
60
+ # Prefer filenames that include the metric keyword (e.g., val_error=..., val_error_epoch=...)
61
+ for ckpt_file in checkpoint_files:
62
+ if prefer_metric == "val_loss" and ("val_loss=" in ckpt_file or "val_loss_epoch=" in ckpt_file):
63
+ try:
64
+ if "val_loss_epoch=" in ckpt_file:
65
+ score_str = ckpt_file.split("val_loss_epoch=")[-1].split(".ckpt")[0]
66
+ else:
67
+ score_str = ckpt_file.split("val_loss=")[-1].split(".ckpt")[0]
68
+ score = float(score_str)
69
+ if score < best_score:
70
+ best_score = score
71
+ best_checkpoint = ckpt_file
72
+ except (ValueError, IndexError):
73
+ pass
74
+ elif prefer_metric == "val_error" and ("val_error=" in ckpt_file or "val_error_epoch=" in ckpt_file):
75
+ try:
76
+ if "val_error_epoch=" in ckpt_file:
77
+ score_str = ckpt_file.split("val_error_epoch=")[-1].split(".ckpt")[0]
78
+ else:
79
+ score_str = ckpt_file.split("val_error=")[-1].split(".ckpt")[0]
80
+ score = float(score_str)
81
+ if score < best_score:
82
+ best_score = score
83
+ best_checkpoint = ckpt_file
84
+ except (ValueError, IndexError):
85
+ pass
86
+
87
+ # If not found, try the alternative metric
88
+ if not best_checkpoint:
89
+ other_metric = "val_loss" if prefer_metric == "val_error" else "val_error"
90
+ for ckpt_file in checkpoint_files:
91
+ if other_metric == "val_loss" and "val_loss=" in ckpt_file:
92
+ try:
93
+ score_str = ckpt_file.split("val_loss=")[1].split("-loss.ckpt")[0]
94
+ score = float(score_str)
95
+ if score < best_score:
96
+ best_score = score
97
+ best_checkpoint = ckpt_file
98
+ except (ValueError, IndexError):
99
+ pass
100
+ elif other_metric == "val_error" and "val_error=" in ckpt_file:
101
+ try:
102
+ score_str = ckpt_file.split("val_error=")[1].split(".ckpt")[0]
103
+ score = float(score_str)
104
+ if score < best_score:
105
+ best_score = score
106
+ best_checkpoint = ckpt_file
107
+ except (ValueError, IndexError):
108
+ pass
109
+
110
+ # Additional fallback: parse score from filename pattern (model-epoch-score.ckpt)
111
+ if not best_checkpoint:
112
+ for ckpt_file in sorted(checkpoint_files):
113
+ if ckpt_file == "last.ckpt":
114
+ continue
115
+ try:
116
+ stem = ckpt_file[:-5] if ckpt_file.endswith(".ckpt") else ckpt_file
117
+ # Fallback: treat the last hyphen-separated token as a score
118
+ last_tok = stem.split("-")[-1]
119
+ score = float(last_tok)
120
+ if score < best_score:
121
+ best_score = score
122
+ best_checkpoint = ckpt_file
123
+ except Exception:
124
+ continue
125
+ # Final fallback: use last.ckpt or the first file
126
+ if not best_checkpoint:
127
+ if "last.ckpt" in checkpoint_files:
128
+ best_checkpoint = "last.ckpt"
129
+ else:
130
+ best_checkpoint = sorted(checkpoint_files)[0]
131
+
132
+ return best_checkpoint, best_score
133
+
134
+
135
+ def _load_model_by_type(model_type: str, checkpoint_path: str) -> Any:
136
+ """Load the appropriate model class based on type.
137
+
138
+ Args:
139
+ model_type: Model type ("gat" or "gatv2").
140
+ checkpoint_path: Path to the checkpoint file.
141
+
142
+ Returns:
143
+ Loaded model instance.
144
+ """
145
+ if model_type == "gatv2":
146
+ cls = getattr(import_module("mecari.models.gatv2"), "MecariGATv2")
147
+ model = cls.load_from_checkpoint(checkpoint_path, strict=False, map_location="cpu")
148
+
149
+ model.eval()
150
+ model.cpu()
151
+ return model
152
+
153
+
154
+ def _instantiate_model_from_config(config: Dict[str, Any]):
155
+ """Instantiate a model using config fields (no checkpoint loading)."""
156
+ model_cfg = config.get("model", {})
157
+ training_cfg = config.get("training", {})
158
+ features_cfg = config.get("features", {})
159
+
160
+ if model_cfg.get("type") != "gatv2":
161
+ raise ValueError(f"Unsupported model type: {model_cfg.get('type')}")
162
+
163
+ MecariGATv2 = getattr(import_module("mecari.models.gatv2"), "MecariGATv2")
164
+ model = MecariGATv2(
165
+ hidden_dim=model_cfg.get("hidden_dim", 64),
166
+ num_classes=model_cfg.get("num_classes", 1),
167
+ learning_rate=training_cfg.get("learning_rate", 1e-3),
168
+ lexical_feature_dim=features_cfg.get("lexical_feature_dim", 100000),
169
+ num_heads=model_cfg.get("num_heads", 4),
170
+ share_weights=model_cfg.get("share_weights", False),
171
+ dropout=model_cfg.get("dropout", 0.1),
172
+ attn_dropout=model_cfg.get("attn_dropout", model_cfg.get("attention_dropout", 0.1)),
173
+ add_self_loops_flag=model_cfg.get("add_self_loops", True),
174
+ edge_dropout=model_cfg.get("edge_dropout", 0.0),
175
+ norm=model_cfg.get("norm", "layer"),
176
+ )
177
+ return model
178
+
179
+
180
+ def _load_model_from_state(config_path: str, state_path: str):
+     """Load model from a plain state_dict plus config.yaml."""
+     with open(config_path, "r", encoding="utf-8") as f:
+         config = yaml.safe_load(f)
+
+     model = _instantiate_model_from_config(config)
+     state = torch.load(state_path, map_location="cpu")
+     # Lightning checkpoints saved via export may store weights under 'state_dict'
+     if (
+         isinstance(state, dict)
+         and "state_dict" in state
+         and all(k.startswith("model.") for k in state["state_dict"].keys())
+     ):
+         state = state["state_dict"]
+     # Strip a potential 'model.' prefix (depends on how the weights were saved)
+     new_state = {}
+     for k, v in state.items():
+         nk = k[len("model.") :] if k.startswith("model.") else k
+         new_state[nk] = v
+     model.load_state_dict(new_state, strict=False)
+     model.eval()
+     model.cpu()
+     return model
+
+
+ def load_model(
+     experiment_name: Optional[str] = None, model_type: Optional[str] = None, prefer_metric: str = "val_error"
+ ) -> Optional[Tuple[Any, Dict[str, Any]]]:
+     """Load a trained model and its experiment info.
+
+     Default behavior: load the single model under sample_model/.
+     If --experiment is provided (or sample_model is unavailable), use experiments/.
+     """
+     # Default: load from sample_model/
+     if not experiment_name:
+         root = "sample_model"
+         if os.path.exists(root):
+             fixed_config = os.path.join(root, "config.yaml")
+             state_path = os.path.join(root, "model.pt")
+             if os.path.exists(fixed_config) and os.path.exists(state_path):
+                 try:
+                     with open(fixed_config, "r", encoding="utf-8") as f:
+                         config = yaml.safe_load(f)
+                     model = _load_model_from_state(fixed_config, state_path)
+                     experiment_info = {
+                         "name": os.path.basename(root),
+                         "path": root,
+                         "best_metric": None,
+                         "best_score": None,
+                         "model_type": config.get("model", {}).get("type", "unknown"),
+                         "best_model_path": state_path,
+                         "config": config,
+                     }
+                     return model, experiment_info
+                 except Exception as e:
+                     print(f"Failed to load sample model: {e}")
+                     return None
+             print("sample_model/model.pt or config.yaml not found")
+             return None
+         else:
+             print("sample_model directory not found")
+             return None
+
+     # Specific experiment provided
+     if experiment_name:
+         exp_path = os.path.join("experiments", experiment_name)
+         config_path = os.path.join(exp_path, "config.yaml")
+         checkpoints_dir = os.path.join(exp_path, "checkpoints")
+
+         if not os.path.exists(config_path) or not os.path.exists(checkpoints_dir):
+             print(f"Experiment not found: {experiment_name}")
+             return None
+
+         try:
+             with open(config_path, "r", encoding="utf-8") as f:
+                 config = yaml.safe_load(f)
+
+             model_type_from_config = config.get("model", {}).get("type", "unknown")
+             best_checkpoint, best_score = _find_best_checkpoint(checkpoints_dir, prefer_metric)
+
+             if not best_checkpoint:
+                 print("No checkpoint found")
+                 return None
+
+             metric_name = "val_loss" if prefer_metric == "val_loss" else "val_error"
+
+             experiment_info = {
+                 "name": experiment_name,
+                 "path": exp_path,
+                 "val_error": best_score if prefer_metric == "val_error" else None,
+                 "val_loss": best_score if prefer_metric == "val_loss" else None,
+                 "best_metric": metric_name,
+                 "best_score": best_score,
+                 "model_type": model_type_from_config,
+                 "best_model_path": os.path.join(checkpoints_dir, best_checkpoint),
+                 "config": config,
+             }
+         except Exception as e:
+             print(f"Failed to read experiment info: {e}")
+             return None
+
+     # Auto-select the best experiment
+     else:
+         experiments_dir = "experiments"
+         if not os.path.exists(experiments_dir):
+             print("Experiments directory does not exist")
+             return None
+
+         experiments = []
+         for exp_dir in os.listdir(experiments_dir):
+             exp_path = os.path.join(experiments_dir, exp_dir)
+             config_path = os.path.join(exp_path, "config.yaml")
+             checkpoints_dir = os.path.join(exp_path, "checkpoints")
+
+             if not os.path.exists(config_path) or not os.path.exists(checkpoints_dir):
+                 continue
+
+             try:
+                 with open(config_path, "r", encoding="utf-8") as f:
+                     config = yaml.safe_load(f)
+
+                 exp_model_type = config.get("model", {}).get("type", "unknown")
+
+                 if model_type and exp_model_type.lower() != model_type.lower():
+                     continue
+
+                 best_checkpoint, best_score = _find_best_checkpoint(checkpoints_dir, prefer_metric)
+                 if best_checkpoint:
+                     metric_name = "val_loss" if prefer_metric == "val_loss" else "val_error"
+                     experiments.append(
+                         {
+                             "name": exp_dir,
+                             "path": exp_path,
+                             "val_error": best_score if prefer_metric == "val_error" else None,
+                             "val_loss": best_score if prefer_metric == "val_loss" else None,
+                             "best_metric": metric_name,
+                             "best_score": best_score,
+                             "model_type": exp_model_type,
+                             "best_model_path": os.path.join(checkpoints_dir, best_checkpoint),
+                             "config": config,
+                         }
+                     )
+             except Exception:
+                 continue
+
+         if not experiments:
+             print("No available experiments found")
+             return None
+
+         experiment_info = min(experiments, key=lambda x: x["best_score"])
+
+     # Load model
+     print(f"Loading model: {experiment_info['best_model_path']}")
+     print(f"Experiment: {experiment_info['name']}")
+
+     try:
+         model = _load_model_by_type(experiment_info["model_type"], experiment_info["best_model_path"])
+         # No BERT features in this pipeline
+         return model, experiment_info
+     except Exception as e:
+         print(f"Model loading error: {e}")
+         return None
+
+
+ def viterbi_decode_from_morphemes(logits: torch.Tensor, morphemes: list, edges: list, silent: bool = False) -> list:
+     """Edge-based Viterbi decoding.
+
+     Args:
+         logits: Logits per morpheme.
+         morphemes: List of morpheme records.
+         edges: Edge list among morpheme indices.
+         silent: If True, suppress debug prints.
+
+     Returns:
+         Indices of morphemes on the optimal path.
+     """
+     if len(logits) != len(morphemes):
+         if not silent:
+             print(f"Warning: #logits ({len(logits)}) != #morphemes ({len(morphemes)})")
+         return list(range(min(len(logits), len(morphemes))))
+
+     if not silent:
+         print("\n=== Viterbi Decode ===")
+         print(f"#Morphemes: {len(morphemes)}")
+         print(f"Using edge info: {len(edges)} edges")
+
+         print("\nNode logits:")
+         for idx, (morph, logit) in enumerate(zip(morphemes, logits)):
+             print(
+                 f"  [{idx:3d}] {morph['surface']:10s} ({morph['start_pos']:2d}-{morph['end_pos']:2d}) {morph['pos']:10s} logit={logit:.3f}"
+             )
+
+     # Build adjacency from edges (forward edges only)
+     n = len(morphemes)
+     adj_list = [[] for _ in range(n)]
+     for edge in edges:
+         source_idx = edge["source_idx"]
+         target_idx = edge["target_idx"]
+         if 0 <= source_idx < n and 0 <= target_idx < n:
+             # Keep forward edges only (source.end_pos <= target.start_pos)
+             source_end = morphemes[source_idx].get("end_pos", 0)
+             target_start = morphemes[target_idx].get("start_pos", 0)
+             if source_end <= target_start:
+                 adj_list[source_idx].append(target_idx)
+
+     # POS to UD mapping (for display)
+     pos_to_ud = {
+         "名詞": "NOUN",
+         "動詞": "VERB",
+         "形容詞": "ADJ",
+         "副詞": "ADV",
+         "助詞": "ADP",  # approximate
+         "助動詞": "AUX",
+         "接続詞": "CCONJ",
+         "連体詞": "DET",
+         "感動詞": "INTJ",
+         "代名詞": "PRON",
+         "形状詞": "ADJ",
+         "補助記号": "PUNCT",
+         "接頭辞": "PREFIX",
+         "接尾辞": "SUFFIX",
+     }
+
+     if not silent:
+         print("\nMorpheme details:")
+         for i, morpheme in enumerate(morphemes):
+             start_pos = morpheme.get("start_pos", 0)
+             end_pos = morpheme.get("end_pos", 0)
+             surface = morpheme.get("surface", "")
+             logit = morpheme.get("logit", 0.0)
+             pos = morpheme.get("pos", "")
+             pos_main = pos.split(",")[0] if "," in pos else pos
+             ud_pos = pos_to_ud.get(pos_main, "X")
+             print(f"  {i}: {surface} ({start_pos}-{end_pos}) {pos_main}({ud_pos}) logit={logit:.3f}")
+
+     # Dynamic programming
+     dp = [-float("inf")] * n  # max score to each node
+     parent = [-1] * n  # best predecessor per node
+
+     # Find start nodes (earliest start position)
+     start_nodes = []
+     min_start_pos = min(m.get("start_pos", 0) for m in morphemes)
+     for i, m in enumerate(morphemes):
+         if m.get("start_pos", 0) == min_start_pos:
+             start_nodes.append(i)
+
+     # Initialize start nodes
+     for i in start_nodes:
+         dp[i] = morphemes[i].get("logit", 0.0)
+
+     # Process nodes in position order (topological-like)
+     node_positions = [(i, morphemes[i].get("start_pos", 0), morphemes[i].get("end_pos", 0)) for i in range(n)]
+     node_positions.sort(key=lambda x: (x[1], x[2]))  # sort by start_pos, then end_pos
+
+     # Relax edges for each node in order
+     for node_idx, _, _ in node_positions:
+         if dp[node_idx] == -float("inf"):
+             continue  # unreachable node
+
+         # Relax transitions to reachable next nodes
+         for next_idx in adj_list[node_idx]:
+             new_score = dp[node_idx] + morphemes[next_idx].get("logit", 0.0)
+             if new_score > dp[next_idx]:
+                 dp[next_idx] = new_score
+                 parent[next_idx] = node_idx
+
+     # Select the best end node at the final position
+     end_nodes = []
+     max_end_pos = max(m.get("end_pos", 0) for m in morphemes)
+     for i, m in enumerate(morphemes):
+         if m.get("end_pos", 0) == max_end_pos:
+             end_nodes.append(i)
+
+     best_end_idx = -1
+     best_score = -float("inf")
+     for i in end_nodes:
+         if dp[i] > best_score:
+             best_score = dp[i]
+             best_end_idx = i
+
+     # Backtracking with a safety cap to avoid infinite loops
+     path = []
+     current = best_end_idx
+     max_iterations = n * 2  # safety cap
+     iteration_count = 0
+     visited = set()
+
+     while current != -1 and iteration_count < max_iterations:
+         if current in visited:
+             print(f"Warning: Detected cycle during backtracking (node {current})")
+             break
+         visited.add(current)
+         path.append(current)
+         current = parent[current]
+         iteration_count += 1
+
+     if iteration_count >= max_iterations:
+         print(f"Warning: Backtracking reached max iterations ({max_iterations})")
+
+     path.reverse()
+
+     # Display the selected path
+     if path and not silent:
+         total_score = sum(morphemes[idx].get("logit", 0.0) for idx in path)
+         print(f"\nOptimal path (total score: {total_score:.3f}):")
+         for idx in path:
+             morpheme = morphemes[idx]
+             logit = morpheme.get("logit", 0.0)
+             print(f"  {morpheme['surface']} (logit: {logit:.3f})")
+
+     return path
+
+
+ # Global singletons (lazy initialization)
+ _analyzer = None
+ _data_module_cache = {}
+
+
+ def predict_morphemes_from_text(text, model=None, experiment_info=None, silent=False):
+     """Predict morpheme boundaries from text.
+
+     Steps:
+         1. Analyze with MeCab to get candidates.
+         2. Build nodes/edges from morphemes and connections.
+         3. Run the model to get per-node scores.
+         4. Run Viterbi decoding over nodes and edges.
+
+     Args:
+         text: Input text.
+         model: Model to use.
+         experiment_info: Experiment metadata.
+         silent: If True, suppress prints.
+     """
+     global _analyzer
+
+     if model is None:
+         result = load_model()
+         if result is None:
+             return [], []
+         model, experiment_info = result
+
+     if not silent:
+         print(f"Input text: {text}")
+
+     # 1) Get morpheme candidates (initialize analyzer on first use)
+     if _analyzer is None:
+         _analyzer = MeCabAnalyzer()
+
+     # Fetch candidates directly via the analyzer and deduplicate
+     candidates = _analyzer.get_morpheme_candidates(text)
+     candidates = normalize_mecab_candidates(candidates)
+     candidates = dedup_morphemes(candidates)
+
+     if not candidates:
+         print("Error: Failed to obtain morpheme candidates")
+         return [], []
+
+     if not silent:
+         print(f"#Candidates: {len(candidates)}")
+
+     # 2) Use candidates as morphemes
+     morphemes = candidates
+
+     # Validate type
+     if not isinstance(morphemes, list):
+         print(f"Warning: morphemes is not a list: {type(morphemes)}")
+         morphemes = []
+
+     # Add lexical features using the shared DataModule implementation
+     dm_tmp = DataModule(annotations_dir="dummy", batch_size=1, num_workers=0, lexical_feature_dim=100000, silent=True)
+     morphemes = dm_tmp.compute_lexical_features(morphemes, text)
+
+     # Build edges (adjacent only)
+     edges = build_adjacent_edges(morphemes)
+
+     # Mark annotation as '?' (unknown) for inference
+     for morpheme in morphemes:
+         if "annotation" not in morpheme:
+             morpheme["annotation"] = "?"
+
+     if not silent:
+         print(f"Unified graph: {len(morphemes)} nodes, {len(edges)} edges")
+
+     # 3) Initialize DataModule per experiment settings
+     features_config = experiment_info["config"].get("features", {})
+     training_config = experiment_info["config"].get("training", {})
+     edge_config = experiment_info["config"].get("edge_features", {})
+
+     # Cache DataModule by annotations_dir
+     global _data_module_cache
+     cache_key = str(training_config.get("annotations_dir", "annotations_kwdlc"))
+
+     if cache_key not in _data_module_cache:
+         # Always use lexical features
+         _data_module_cache[cache_key] = DataModule(
+             annotations_dir=training_config.get("annotations_dir", "annotations_kwdlc"),
+             batch_size=1,
+             num_workers=0,
+             silent=silent,
+             lexical_feature_dim=features_config.get("lexical_feature_dim", 100000),
+             use_bidirectional_edges=edge_config.get("use_bidirectional_edges", True),
+         )
+
+     data_module = _data_module_cache[cache_key]
+
+     # Build graph using the same public API as preprocessing
+     graph = data_module.create_graph_from_morphemes_data(
+         morphemes=morphemes,
+         edges=edges,
+         text=text,
+         for_training=False,
+     )
+
+     if graph is None:
+         print("Error: Failed to create PyTorch graph")
+         return [], []
+
+     # Inference device (CPU by default; respect an explicit device from experiment_info)
+     device = torch.device("cpu")
+     if experiment_info and "device" in experiment_info:
+         device = experiment_info["device"]
+
+     with torch.no_grad():
+         # Ensure lexical feature tensors exist
+         if not hasattr(graph, "lexical_indices") or graph.lexical_indices is None:
+             print("Error: lexical_indices not found")
+             return [], []
+
+         logits = model(
+             graph.lexical_indices.to(device),
+             graph.lexical_values.to(device),
+             graph.edge_index.to(device),
+             None,
+             graph.edge_attr.to(device) if graph.edge_attr is not None else None,
+         ).squeeze()
+
+         if logits.dim() == 0:
+             logits = logits.unsqueeze(0)
+         probabilities = torch.sigmoid(logits)
+         predictions = (probabilities >= 0.5).float()
+
+         # Move back to CPU for post-processing
+         logits = logits.cpu()
+         probabilities = probabilities.cpu()
+         predictions = predictions.cpu()
+
+     # Attach predictions to morphemes
+     for i, morpheme in enumerate(morphemes):
+         if i < len(predictions):
+             morpheme["predicted_annotation"] = "+" if predictions[i] == 1 else "-"
+             morpheme["logit"] = logits[i].item()
+             morpheme["probability"] = probabilities[i].item()
+
+     # 4) Viterbi decode over nodes/edges (no CRF)
+     optimal_path = viterbi_decode_from_morphemes(logits, morphemes, edges, silent=silent)
+
+     # Format results
+     results = []
+     for i, morpheme in enumerate(morphemes):
+         is_in_optimal_path = bool(optimal_path) and i in optimal_path
+
+         result = {
+             "surface": morpheme["surface"],
+             "pos": morpheme["pos"],
+             "reading": morpheme["reading"],
+             "predicted_annotation": morpheme.get("predicted_annotation", "?"),
+             "logit": morpheme.get("logit", 0.0),
+             "probability": morpheme.get("probability", 0.5),
+             "in_optimal_path": is_in_optimal_path,
+         }
+         results.append(result)
+
+     # Collect morphemes on the optimal path
+     optimal_morphemes = []
+     if optimal_path:
+         # Group candidate indices by span
+         position_candidates = {}
+         for i, m in enumerate(morphemes):
+             pos_key = (m.get("start_pos", 0), m.get("end_pos", 0))
+             position_candidates.setdefault(pos_key, []).append(i)
+
+         for idx in optimal_path:
+             if idx < len(morphemes):
+                 morph = morphemes[idx].copy()
+                 # Add candidate count and selected rank for this span
+                 pos_key = (morph.get("start_pos", 0), morph.get("end_pos", 0))
+                 if pos_key in position_candidates:
+                     candidates_at_pos = position_candidates[pos_key]
+                     morph["num_candidates"] = len(candidates_at_pos)
+                     morph["selected_rank"] = candidates_at_pos.index(idx) + 1 if idx in candidates_at_pos else 0
+                 optimal_morphemes.append(morph)
+
+     return results, optimal_morphemes
+
+
+ def print_results(results, optimal_morphemes=None, verbose: bool = False):
+     """Print morphemes in MeCab-like format (surface\tCSV features)."""
+     if not results:
+         return
+
+     def mecab_features(m):
+         pos = m.get("pos", "*")
+         pos1 = m.get("pos_detail1", "*")
+         pos2 = m.get("pos_detail2", "*")
+         ctype = m.get("inflection_type", "*")
+         cform = m.get("inflection_form", "*")
+         base = m.get("base_form", m.get("lemma", "*")) or "*"
+         reading = m.get("reading", "*") or "*"
+         return f"{pos},{pos1},{pos2},{ctype},{cform},{base},{reading}"
+
+     items = (
+         optimal_morphemes
+         if optimal_morphemes
+         else [
+             {
+                 "surface": r.get("surface", ""),
+                 "pos": r.get("pos", "*"),
+                 "pos_detail1": "*",
+                 "pos_detail2": "*",
+                 "inflection_type": "*",
+                 "inflection_form": "*",
+                 "base_form": r.get("surface", ""),
+                 "reading": r.get("reading", "*"),
+             }
+             for r in results
+         ]
+     )
+
+     for m in items:
+         print(f"{m.get('surface', '')}\t{mecab_features(m)}")
+     print("EOS")
+
+
+ def main():
+     """Main inference entrypoint."""
+     import argparse
+
+     parser = argparse.ArgumentParser(description="Mecari morphological analysis inference")
+     parser.add_argument("--text", "-t", help="Input text directly")
+     parser.add_argument("--experiment", "-e", help="Experiment name to load (e.g., gat_20250730_145624)")
+     parser.add_argument("--verbose", "-v", action="store_true", help="Verbose output (include UD POS)")
+     args = parser.parse_args()
+
+     if args.experiment:
+         result = load_model(experiment_name=args.experiment)
+     else:
+         result = load_model()
+
+     if result is None:
+         return
+
+     model, experiment_info = result
+
+     if args.text:
+         result = predict_morphemes_from_text(args.text, model, experiment_info, silent=not args.verbose)
+         if result:
+             results, optimal_morphemes = result
+             print_results(results, optimal_morphemes, verbose=args.verbose)
+         else:
+             print("Inference failed.")
+     else:
+         print("\nMecari morphological inference")
+         print("Enter text (e.g., Tokyo is nice)")
+         print("Type 'quit' or 'exit' to finish.\n")
+
+         while True:
+             try:
+                 user_input = input("Input: ").strip()
+
+                 if user_input.lower() in ["quit", "exit", "q"]:
+                     print("Exiting.")
+                     break
+
+                 if not user_input:
+                     continue
+
+                 print(f"Text: {user_input}")
+
+                 result = predict_morphemes_from_text(user_input, model, experiment_info, silent=not args.verbose)
+                 if result:
+                     results, optimal_morphemes = result
+                     print_results(results, optimal_morphemes, verbose=args.verbose)
+                 else:
+                     print("Inference failed.")
+
+                 print()
+
+             except EOFError:
+                 print("\nExiting.")
+                 break
+             except KeyboardInterrupt:
+                 print("\nExiting.")
+                 break
+             except Exception as e:
+                 import traceback
+
+                 print(f"\nAn error occurred: {e}")
+                 traceback.print_exc()
+                 continue
+
+
+ if __name__ == "__main__":
+     main()
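The edge-based DP in `viterbi_decode_from_morphemes` above can be sketched on a toy lattice. The following is a minimal, self-contained reimplementation (the `viterbi_toy` helper and its candidate data are invented for illustration, not part of the module) showing how forward edges plus per-node logits select one segmentation out of overlapping span candidates:

```python
# Toy sketch of the edge-based Viterbi used above (hypothetical data, no model).
# Each node is a candidate span with a score; forward edges connect a span's end
# to the next span's start; DP picks the highest-scoring cover of the sentence.

def viterbi_toy(morphemes, edges):
    n = len(morphemes)
    adj = [[] for _ in range(n)]
    for s, t in edges:
        # Keep forward edges only, mirroring the source.end <= target.start check
        if morphemes[s]["end"] <= morphemes[t]["start"]:
            adj[s].append(t)
    dp = [float("-inf")] * n
    parent = [-1] * n
    # Initialize all nodes starting at the earliest position
    min_start = min(m["start"] for m in morphemes)
    for i, m in enumerate(morphemes):
        if m["start"] == min_start:
            dp[i] = m["logit"]
    # Relax edges in (start, end) order, like the topological-style pass above
    order = sorted(range(n), key=lambda i: (morphemes[i]["start"], morphemes[i]["end"]))
    for i in order:
        if dp[i] == float("-inf"):
            continue
        for j in adj[i]:
            score = dp[i] + morphemes[j]["logit"]
            if score > dp[j]:
                dp[j] = score
                parent[j] = i
    # Backtrack from the best node that reaches the final position
    max_end = max(m["end"] for m in morphemes)
    best = max((i for i, m in enumerate(morphemes) if m["end"] == max_end), key=lambda i: dp[i])
    path = []
    while best != -1:
        path.append(best)
        best = parent[best]
    return path[::-1]

# Candidate lattice over a 3-character sentence (indices are character offsets)
morphemes = [
    {"surface": "AB", "start": 0, "end": 2, "logit": 1.0},  # node 0
    {"surface": "A", "start": 0, "end": 1, "logit": 0.2},   # node 1
    {"surface": "B", "start": 1, "end": 2, "logit": 0.1},   # node 2
    {"surface": "C", "start": 2, "end": 3, "logit": 0.5},   # node 3
]
edges = [(0, 3), (1, 2), (2, 3)]
print(viterbi_toy(morphemes, edges))  # [0, 3]: "AB"+"C" (1.5) beats "A"+"B"+"C" (0.8)
```

The real function adds safety caps and cycle detection during backtracking; those are omitted here because the toy lattice is acyclic by construction.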
mecari/__init__.py ADDED
@@ -0,0 +1,9 @@
 
 
 
 
 
 
 
 
 
 
1
+ """Mecari - Japanese Morphological Analysis with Graph Neural Networks"""
2
+
3
+ __version__ = "0.1.0"
4
+
5
+ # Export minimal API (avoid heavy imports at package import time)
6
+ from mecari.config.config import get_model_config, load_config, override_config, save_config # noqa: F401
7
+ from mecari.data.data_module import DataModule # noqa: F401
8
+
9
+ __all__ = ["DataModule", "get_model_config", "override_config", "save_config", "load_config"]
mecari/analyzers/mecab.py ADDED
@@ -0,0 +1,151 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ #!/usr/bin/env python3
2
+ # -*- coding: utf-8 -*-
3
+
4
+ import os
5
+ import subprocess
6
+ import tempfile
7
+ from typing import Dict, List
8
+
9
+ from mecari.utils.signature import signature_key
10
+
11
+
12
+ def _byte_to_char_map(text: str) -> dict[int, int]:
13
+ mapping: dict[int, int] = {}
14
+ cpos = 0
15
+ bpos = 0
16
+ for ch in text:
17
+ mapping[bpos] = cpos
18
+ bpos += len(ch.encode("utf-8"))
19
+ cpos += 1
20
+ mapping[bpos] = cpos
21
+ return mapping
22
+
23
+
24
+ class MeCabAnalyzer:
25
+ """Obtain morpheme candidates for building graph.
26
+
27
+ Args:
28
+ jumandic_path: Filesystem path to the JUMANDIC dictionary used by MeCab.
29
+ mecab_bin: Optional MeCab binary name or full path. If None, resolves
30
+ from the MECAB_BIN environment variable or defaults to "mecab".
31
+
32
+ Methods:
33
+ version(): Return the MeCab version string, or an empty string on error.
34
+ get_morpheme_candidates(text): Analyze text and return a list of
35
+ morpheme dicts with fields such as:
36
+ - surface, base_form, reading
37
+ - pos, pos_detail1/2/3
38
+ - inflection_type, inflection_form
39
+ - start_pos, end_pos (character offsets)
40
+ Unknown or unavailable values are filled with "*" or empty strings.
41
+ """
42
+
43
+ def __init__(
44
+ self,
45
+ jumandic_path: str | None = None,
46
+ mecab_bin: str | None = None,
47
+ ) -> None:
48
+ # Prefer JUMANDIC if present; otherwise fall back to IPADIC
49
+ if jumandic_path is None:
50
+ candidates = [
51
+ "/var/lib/mecab/dic/juman-utf8",
52
+ "/usr/lib/x86_64-linux-gnu/mecab/dic/juman-utf8",
53
+ ]
54
+ ipadic_candidates = [
55
+ "/var/lib/mecab/dic/ipadic",
56
+ "/usr/lib/x86_64-linux-gnu/mecab/dic/ipadic",
57
+ ]
58
+ chosen = next((p for p in candidates if os.path.isdir(p)), None)
59
+ if chosen is None:
60
+ chosen = next((p for p in ipadic_candidates if os.path.isdir(p)), None)
61
+ self.jumandic_path = chosen # may be None; handled below
62
+ else:
63
+ self.jumandic_path = jumandic_path
64
+
65
+ # Allow selecting a specific mecab binary via arg or env var; default to common path
66
+ self.mecab_bin = mecab_bin or os.getenv("MECAB_BIN") or (
67
+ "/usr/bin/mecab" if os.path.exists("/usr/bin/mecab") else "mecab"
68
+ )
69
+
70
+ def version(self) -> str:
71
+ try:
72
+ out = subprocess.run([self.mecab_bin, "-v"], capture_output=True, text=True)
73
+ return (out.stdout or out.stderr).strip()
74
+ except Exception:
75
+ return ""
76
+
77
+ def get_morpheme_candidates(self, text: str) -> List[Dict]:
78
+ """Return a flat list of JUMANDIC candidates (robust %H format)."""
79
+ with tempfile.NamedTemporaryFile(mode="w", suffix=".txt", delete=False, encoding="utf-8") as f:
80
+ f.write(text)
81
+ temp_file = f.name
82
+ try:
83
+ fmt = "%pi\t%m\t%H\t%ps\t%pe\n"
84
+ cmd = [self.mecab_bin]
85
+ # Pass dictionary only if we have a resolvable path
86
+ if isinstance(self.jumandic_path, str) and os.path.isdir(self.jumandic_path):
87
+ cmd += ["-d", self.jumandic_path]
88
+ cmd += ["-F", fmt, "-E", "", "-a", temp_file]
89
+ result = subprocess.run(cmd, capture_output=True, text=True, encoding="utf-8", errors="ignore")
90
+ stdout = result.stdout
91
+ finally:
92
+ try:
93
+ import os
94
+
95
+ os.unlink(temp_file)
96
+ except Exception:
97
+ pass
98
+ if result.returncode != 0:
99
+ return []
100
+ byte_to_char = _byte_to_char_map(text)
101
+ out: list[dict] = []
102
+ seen = set()
103
+ for line in stdout.strip().split("\n"):
104
+ if not line:
105
+ continue
106
+ parts = line.split("\t")
107
+ if len(parts) < 5:
108
+ continue
109
+ node_id, surface, features, sb, eb = parts[0], parts[1], parts[2], parts[3], parts[4]
110
+ if surface in ("BOS", "EOS"):
111
+ continue
112
+ if not surface.strip():
113
+ continue
114
+ try:
115
+ start_byte = int(sb)
116
+ end_byte = int(eb)
117
+ except ValueError:
118
+ continue
119
+ start_pos = byte_to_char.get(start_byte, 0)
120
+ end_pos = byte_to_char.get(end_byte, len(text))
121
+ fs = features.split(",")
122
+ pos = fs[0] if len(fs) > 0 else "*"
123
+ pos1 = fs[1] if len(fs) > 1 else "*"
124
+ is_conj = pos in ("動詞", "形容詞", "助動詞")
125
+ ctype = fs[2] if len(fs) > 2 and fs[2] != "*" and is_conj else "*"
126
+ cform = fs[3] if len(fs) > 3 and fs[3] != "*" and is_conj else "*"
127
+ pos2 = (fs[2] if len(fs) > 2 else "*") if not is_conj else "*"
128
+ pos3 = (fs[3] if len(fs) > 3 else "*") if not is_conj else "*"
129
+ base = fs[4] if len(fs) > 4 and fs[4] != "*" else ""
130
+ reading = fs[5] if len(fs) > 5 and fs[5] != "*" else ""
131
+ m = {
132
+ "surface": surface,
133
+ "pos": pos,
134
+ "pos_detail1": pos1,
135
+ "pos_detail2": pos2,
136
+ "pos_detail3": pos3,
137
+ "base_form": base,
138
+ "reading": reading,
139
+ "inflection_type": ctype,
140
+ "inflection_form": cform,
141
+ "start_pos": start_pos,
142
+ "end_pos": end_pos,
143
+ "annotation": "?",
144
+ "node_id": node_id,
145
+ }
146
+ key = signature_key(m)
147
+ if key in seen:
148
+ continue
149
+ seen.add(key)
150
+ out.append(m)
151
+ return out
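The byte-to-character mapping above matters because MeCab's `%ps`/`%pe` format specifiers report UTF-8 byte offsets, while the graph builder works with character offsets. A self-contained sketch mirroring `_byte_to_char_map` (reimplemented here so it runs standalone) shows the mapping for Japanese text, where each kana/kanji occupies three UTF-8 bytes:

```python
# Map UTF-8 byte offsets (as reported by MeCab's %ps/%pe) to character offsets.
def byte_to_char_map(text: str) -> dict:
    mapping, cpos, bpos = {}, 0, 0
    for ch in text:
        mapping[bpos] = cpos                 # byte position where this char starts
        bpos += len(ch.encode("utf-8"))      # advance by the char's UTF-8 width
        cpos += 1
    mapping[bpos] = cpos                     # sentinel for end-of-string
    return mapping

m = byte_to_char_map("東京は")  # 3 characters, 9 UTF-8 bytes
print(m)  # {0: 0, 3: 1, 6: 2, 9: 3}
```

Mixed-width text works the same way: for `"a東"` the map is `{0: 0, 1: 1, 4: 2}`, since ASCII `a` is one byte.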
mecari/config/config.py ADDED
@@ -0,0 +1,84 @@
+ #!/usr/bin/env python3
+ # -*- coding: utf-8 -*-
+
+ import copy
+ import os
+ from typing import Any, Dict
+
+ import yaml
+
+
+ def load_config(config_path: str) -> Dict[str, Any]:
+     """Load a YAML config with inheritance (defaults/extends)."""
+     if not os.path.exists(config_path):
+         raise FileNotFoundError(f"Config file not found: {config_path}")
+
+     with open(config_path, "r", encoding="utf-8") as f:
+         config = yaml.safe_load(f)
+
+     # Handle inheritance (Hydra-style defaults or legacy extends)
+     if "defaults" in config:
+         # Hydra-style defaults (list format)
+         defaults = config["defaults"]
+         if isinstance(defaults, list):
+             base_config = {}
+             for default_item in defaults:
+                 if isinstance(default_item, str):
+                     base_config_path = default_item
+                 else:
+                     continue
+
+                 if not os.path.isabs(base_config_path):
+                     config_dir = os.path.dirname(config_path)
+                     base_config_path = os.path.join(config_dir, base_config_path + ".yaml")
+
+                 if os.path.exists(base_config_path):
+                     loaded = load_config(base_config_path)
+                     base_config = override_config(base_config, loaded)
+
+             child_config = {k: v for k, v in config.items() if k != "defaults" and v is not None}
+             config = override_config(base_config, child_config)
+     elif "extends" in config:
+         # Legacy extends format
+         base_config_path = config["extends"]
+         if not os.path.isabs(base_config_path):
+             config_dir = os.path.dirname(config_path)
+             base_config_path = os.path.join(config_dir, base_config_path)
+
+         base_config = load_config(base_config_path)
+
+         child_config = {k: v for k, v in config.items() if k != "extends" and v is not None}
+         config = override_config(base_config, child_config)
+
+     return config
+
+
+ def get_model_config(model_type: str) -> Dict[str, Any]:
+     """Return the config for a given model type."""
+     config_path = f"configs/{model_type}.yaml"
+     return load_config(config_path)
+
+
+ def override_config(config: Dict[str, Any], overrides: Dict[str, Any]) -> Dict[str, Any]:
+     """Deep-override a config with values from overrides."""
+
+     def deep_update(base_dict, update_dict):
+         for key, value in update_dict.items():
+             if isinstance(value, dict) and key in base_dict and isinstance(base_dict[key], dict):
+                 deep_update(base_dict[key], value)
+             else:
+                 base_dict[key] = value
+
+     result = copy.deepcopy(config)
+     deep_update(result, overrides)
+     return result
+
+
+ def save_config(config: Dict[str, Any], output_path: str):
+     """Save a config as YAML."""
+     os.makedirs(os.path.dirname(output_path), exist_ok=True)
+
+     with open(output_path, "w", encoding="utf-8") as f:
+         yaml.dump(config, f, default_flow_style=False, allow_unicode=True, indent=2)
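The deep-merge semantics of `override_config` above are worth seeing on data: nested dicts merge key by key, while scalars (and lists) in the child simply replace the base value. A self-contained sketch with the same merge logic (inlined here so it runs without the package; the config values are invented examples):

```python
import copy

# Deep-merge semantics matching override_config above: recurse into dicts,
# replace everything else; the base config is never mutated.
def deep_override(config: dict, overrides: dict) -> dict:
    def deep_update(base, update):
        for key, value in update.items():
            if isinstance(value, dict) and isinstance(base.get(key), dict):
                deep_update(base[key], value)
            else:
                base[key] = value

    result = copy.deepcopy(config)
    deep_update(result, overrides)
    return result

base = {"model": {"type": "gatv2", "hidden_dim": 64}, "training": {"lr": 1e-3}}
child = {"model": {"hidden_dim": 128}}  # e.g. a Hydra-style defaults child
merged = deep_override(base, child)
print(merged["model"])  # {'type': 'gatv2', 'hidden_dim': 128}
```

This is why a child config only needs to list the keys it changes; sibling keys like `model.type` and whole sections like `training` survive the merge untouched.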
mecari/data/data_module.py ADDED
@@ -0,0 +1,361 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ #!/usr/bin/env python3
2
+ # -*- coding: utf-8 -*-
3
+
4
+ import os
5
+ from typing import Dict, List, Optional
6
+
7
+ import pytorch_lightning as pl
8
+ import torch
9
+ from torch.utils.data import Dataset
10
+ from torch_geometric.data import Data, DataLoader
11
+
12
+ # Required import for lexical feature computation
13
+ from mecari.featurizers.lexical import (
14
+ LexicalNGramFeaturizer as LexFeaturizer,
15
+ Morpheme as LexMorpheme,
16
+ )
17
+
18
+
19
+ """Data module for lexical-graph training using prebuilt .pt graphs only."""
20
+
21
+
22
+ # Prebuilt .pt graph dataset
23
+ class _PtGraphDataset(Dataset):
24
+ """Prebuilt PyG graph tensors saved as .pt per sentence.
25
+
26
+ Each file is expected to be a dict with keys:
27
+ - 'graph': torch_geometric.data.Data
28
+ - 'source_id': str (used for split)
29
+ - optional: 'text'
30
+ """
31
+
32
+ def __init__(self, files: List[str]) -> None:
33
+ self.files = files
34
+
35
+ def __len__(self) -> int:
36
+ return len(self.files)
37
+
38
+ def __getitem__(self, idx: int) -> Data:
39
+ path = self.files[idx]
40
+ obj = torch.load(path, map_location="cpu")
41
+ if isinstance(obj, dict) and "graph" in obj:
42
+ data = obj["graph"]
43
+ else:
44
+ data = obj
45
+ if not isinstance(data, Data):
46
+ raise RuntimeError(f"Invalid graph object in: {path}")
47
+ data.data_index = idx
48
+ return data
49
+
50
+
51
+ # Safe globals registration for PyTorch 2.6+
52
+ try:
53
+ import torch.serialization
54
+ from torch_geometric.data.data import DataEdgeAttr
55
+
56
+ torch.serialization.add_safe_globals([DataEdgeAttr, Data])
57
+ except (ImportError, AttributeError):
58
+ pass
59
+
60
+
61
+ class DataModule(pl.LightningDataModule):
62
+ """Loads .pt graphs and builds lexical graph features for training."""
63
+
64
+ def __init__(
65
+ self,
66
+ annotations_dir: str = "annotations",
67
+ batch_size: int = 32,
68
+ num_workers: int = 0,
69
+ max_files: Optional[int] = None,
70
+ use_bidirectional_edges: bool = True,
71
+ annotations_override_dir: Optional[str] = None,
72
+ silent: bool = False,
73
+ lexical_feature_dim: int = 100000,
74
+ lexical_max_features: int = 20,
75
+ ) -> None:
76
+ super().__init__()
77
+ self.annotations_dir = annotations_dir
78
+ self.annotations_override_dir = annotations_override_dir
79
+ self.batch_size = batch_size
80
+ self.num_workers = num_workers
81
+ self.max_files = max_files
83
+ self.silent = silent
84
+ self.lexical_feature_dim = lexical_feature_dim
85
+ self.lexical_max_features = int(lexical_max_features)
86
+ self.use_bidirectional_edges = bool(use_bidirectional_edges)
87
+
88
+ # Initialized in setup()
89
+ self.train_dataset = []
90
+ self.val_dataset = []
91
+ self.test_dataset = []
92
+ # Eagerly initialize lexical featurizer (small and picklable)
93
+ self._lex_featurizer = LexFeaturizer(dim=int(self.lexical_feature_dim), add_bias=True)
94
+ # POS mapping for evaluation breakdown
95
+ self.pos_to_id = {
96
+ "名詞": 1,
97
+ "動詞": 2,
98
+ "形容詞": 3,
99
+ "副詞": 4,
100
+ "助詞": 5,
101
+ "助動詞": 6,
102
+ "接続詞": 7,
103
+ "連体詞": 8,
104
+ "感動詞": 9,
105
+ "形状詞": 10,
106
+ "補助記号": 11,
107
+ "接頭辞": 12,
108
+ "接尾辞": 13,
109
+ "特殊": 14,
110
+ }
111
+ self.id_to_pos = {v: k for k, v in self.pos_to_id.items()}
112
+
113
+ def create_graph_from_morphemes_data(self, *args, **kwargs) -> Optional[Data]:
114
+ """Create a lexical graph from morpheme data (or candidates)."""
115
+ if "candidates" in kwargs:
116
+ candidates = kwargs.pop("candidates")
117
+ text = kwargs.get("text", "")
118
+ morphemes_edges = self._build_graph_from_candidates(candidates, text)
119
+ if not morphemes_edges:
120
+ return None
121
+ kwargs["morphemes"] = morphemes_edges["morphemes"]
122
+ kwargs["edges"] = morphemes_edges["edges"]
123
+ return self._create_lexical_graph(*args, **kwargs)
124
+
125
+ # --- Lexical features helper (for preprocessing) ---
126
+ def compute_lexical_features(self, morphemes: List[Dict], text: str) -> List[Dict]:
127
+ """Add lexical_features to each morpheme using Mecari's lexical featurizer.
128
+
129
+ Requires mecari.featurizers.lexical to be importable. Raises a clear error
130
+ if the featurizer is unavailable (training/inference depend on it).
131
+ """
132
+ if not morphemes:
133
+ return morphemes
134
+
135
+ for m in morphemes:
136
+ try:
137
+ morph_obj = LexMorpheme(
138
+ surf=m.get("surface", ""),
139
+ lemma=m.get("base_form", ""),
140
+ pos=m.get("pos", "*"),
141
+ pos1=m.get("pos_detail1", "*"),
142
+ ctype=m.get("inflection_type", "*"),
143
+ cform=m.get("inflection_form", "*"),
144
+ reading=m.get("reading", "*"),
145
+ )
146
+ st = m.get("start_pos", 0)
147
+ ed = m.get("end_pos", st + len(m.get("surface", "")))
148
+ prev_char = text[st - 1] if st > 0 else None
149
+ next_char = text[ed] if ed < len(text) else None
150
+ feats = self._lex_featurizer.unigram_feats(morph_obj, prev_char, next_char)
151
+ m["lexical_features"] = feats
152
+ except Exception:
153
+ # on any failure, leave unchanged
154
+ pass
155
+ return morphemes
156
+
157
+ def _create_lexical_graph(
158
+ self, morphemes: List[Dict], edges: List[Dict], text: str, for_training: bool = True
159
+ ) -> Optional[Data]:
160
+ """Build a graph using lexical features."""
161
+ if not morphemes:
162
+ return None
163
+
164
+ # Sparse lexical features per node
165
+ all_indices = []
166
+ all_values = []
167
+ all_lengths = []
168
+ annotations = []
169
+ valid_mask = []
170
+
171
+ max_features = 0
172
+ for morpheme in morphemes:
173
+ lexical_feats = morpheme.get("lexical_features", [])
174
+ indices = []
175
+ values = []
176
+ for idx, val in lexical_feats:
177
+ if 0 <= idx < self.lexical_feature_dim:
178
+ indices.append(idx)
179
+ values.append(val)
180
+ all_lengths.append(len(indices))
181
+ max_features = max(max_features, len(indices))
182
+
183
+ all_indices.append(indices)
184
+ all_values.append(values)
185
+
186
+ if for_training:
187
+ annotation = morpheme.get("annotation", "?")
188
+ if annotation == "+":
189
+ annotations.append(1)
190
+ valid_mask.append(True)
191
+ elif annotation == "-":
192
+ annotations.append(0)
193
+ valid_mask.append(True)
194
+ else:
195
+ annotations.append(0)
196
+ valid_mask.append(False)
197
+
198
+ # Fixed-size padding/truncation for batching
199
+ FIXED_MAX_FEATURES = int(getattr(self, "lexical_max_features", 20))
200
+
201
+ padded_indices = []
202
+ padded_values = []
203
+ for indices, values in zip(all_indices, all_values):
204
+ if len(indices) > FIXED_MAX_FEATURES:
205
+ padded_indices.append(indices[:FIXED_MAX_FEATURES])
206
+ padded_values.append(values[:FIXED_MAX_FEATURES])
207
+ else:
208
+ pad_length = FIXED_MAX_FEATURES - len(indices)
209
+ padded_indices.append(indices + [0] * pad_length)
210
+ padded_values.append(values + [0.0] * pad_length)
211
+
212
+ edge_index = self._build_edge_index(edges, len(morphemes))
213
+
214
+ # POS ids per node (for evaluation breakdown)
215
+ pos_ids = []
216
+ for m in morphemes:
217
+ pos = m.get("pos", "*")
218
+ pos_ids.append(self.pos_to_id.get(pos, 0))
219
+
220
+ graph_data = Data(
221
+ lexical_indices=torch.tensor(padded_indices, dtype=torch.long),
222
+ lexical_values=torch.tensor(padded_values, dtype=torch.float32),
223
+ lexical_lengths=torch.tensor(all_lengths, dtype=torch.long),
224
+ edge_index=edge_index,
225
+ num_nodes=len(morphemes),
226
+ )
227
+ graph_data.pos_ids = torch.tensor(pos_ids, dtype=torch.long)
228
+ if for_training:
229
+ graph_data.y = torch.tensor(annotations, dtype=torch.float32)
230
+ graph_data.valid_mask = torch.tensor(valid_mask, dtype=torch.bool)
231
+
232
+ return graph_data
233
+
234
+ def _build_edge_index(self, edges: List[Dict], num_nodes: int) -> torch.Tensor:
235
+ """Build a PyG edge_index tensor from edge dicts."""
236
+ if not edges:
237
+ return torch.tensor([[], []], dtype=torch.long)
238
+
239
+ source_indices = []
240
+ target_indices = []
241
+
242
+ for edge in edges:
243
+ source = edge.get("source_idx", 0)
244
+ target = edge.get("target_idx", 0)
245
+
246
+ if 0 <= source < num_nodes and 0 <= target < num_nodes:
247
+ source_indices.append(source)
248
+ target_indices.append(target)
249
+ if self.use_bidirectional_edges:
250
+ source_indices.append(target)
251
+ target_indices.append(source)
252
+
253
+ if not source_indices:
254
+ return torch.tensor([[], []], dtype=torch.long)
255
+
256
+ return torch.tensor([source_indices, target_indices], dtype=torch.long)
257
+
258
+ def _load_kwdlc_ids(self, ids_file: str) -> set:
259
+ """Load KWDLC ID list (one ID per line)."""
260
+ ids = set()
261
+ if ids_file and os.path.exists(ids_file):
262
+ with open(ids_file, "r") as f:
263
+ for line in f:
264
+ ids.add(line.strip())
265
+ return ids
266
+
267
+ def load_annotation_data(self, max_files: Optional[int] = None) -> List[Dict]:
268
+ """Detect and list available .pt annotation graph files."""
269
+ if os.path.isdir(self.annotations_dir):
270
+ pt_files = [
271
+ os.path.join(self.annotations_dir, fn)
272
+ for fn in sorted(os.listdir(self.annotations_dir))
273
+ if fn.endswith(".pt")
274
+ ]
275
+ if pt_files:
276
+ if max_files is not None:
277
+ pt_files = pt_files[:max_files]
278
+ return [{"_mode": "pt", "_pt_files": pt_files}]
279
+ raise FileNotFoundError(f"No annotation graphs found under: {self.annotations_dir}")
280
+
281
+ def setup(self, stage: Optional[str] = None) -> None:
282
+ """Build train/val/test datasets from discovered .pt files."""
283
+ annotation_data = self.load_annotation_data(max_files=self.max_files)
284
+
285
+ if not annotation_data:
286
+ self.train_dataset = []
287
+ self.val_dataset = []
288
+ self.test_dataset = []
289
+ return
290
+
294
+ mode = annotation_data[0].get("_mode")
295
+ if mode == "pt":
296
+ files: List[str] = annotation_data[0]["_pt_files"]
297
+ train_files: List[str] = []
298
+ val_files: List[str] = []
299
+ test_files: List[str] = []
300
+
301
+ # Use KWDLC split ids (mandatory)
302
+ dev_ids = self._load_kwdlc_ids(os.path.join("KWDLC", "id", "split_for_pas", "dev.id"))
303
+ test_ids = self._load_kwdlc_ids(os.path.join("KWDLC", "id", "split_for_pas", "test.id"))
304
+
305
+ for fp in files:
306
+ sid = None
307
+ try:
308
+ obj = torch.load(fp, map_location="cpu")
309
+ if isinstance(obj, dict):
310
+ sid = obj.get("source_id")
311
+ except Exception:
312
+ pass
313
+ if sid and (dev_ids or test_ids):
314
+ if sid in test_ids:
315
+ test_files.append(fp)
316
+ elif sid in dev_ids:
317
+ val_files.append(fp)
318
+ else:
319
+ train_files.append(fp)
320
+ else:
321
+ train_files.append(fp)
322
+
323
+ # Build datasets strictly based on KWDLC dev/test ids
324
+ self.train_dataset = _PtGraphDataset(train_files)
325
+ self.val_dataset = _PtGraphDataset(val_files)
326
+ self.test_dataset = _PtGraphDataset(test_files)
327
+
328
+ if len(self.val_dataset) == 0 or len(self.test_dataset) == 0:
329
+ raise RuntimeError(
330
+ "KWDLC dev/test split produced empty val/test datasets. Ensure KWDLC id files exist and source_id is set in .pt files."
331
+ )
332
+ else:
333
+ raise RuntimeError("Unsupported annotation mode; expected pt")
334
+
335
+ print(
336
+ f"Data split: train={len(self.train_dataset)}, val={len(self.val_dataset)}, test={len(self.test_dataset)}"
337
+ )
338
+
339
+ def _create_dataloader(self, dataset: List[Data], batch_size: int, shuffle: bool = False) -> DataLoader:
340
+ """Create a DataLoader with optional workers/prefetching."""
341
+ return DataLoader(
342
+ dataset,
343
+ batch_size=batch_size,
344
+ shuffle=shuffle,
345
+ num_workers=self.num_workers,
346
+ pin_memory=False,
347
+ persistent_workers=True if self.num_workers > 0 else False,
348
+ prefetch_factor=2 if self.num_workers > 0 else None,
349
+ )
350
+
351
+ def train_dataloader(self) -> DataLoader:
352
+ """Return train DataLoader."""
353
+ return self._create_dataloader(self.train_dataset, self.batch_size, shuffle=True)
354
+
355
+ def val_dataloader(self) -> DataLoader:
356
+ """Return val DataLoader."""
357
+ return self._create_dataloader(self.val_dataset, self.batch_size, shuffle=False)
358
+
359
+ def test_dataloader(self) -> DataLoader:
360
+ """Return test DataLoader."""
361
+ return self._create_dataloader(self.test_dataset, self.batch_size, shuffle=False)
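The partial-annotation scheme above ('+'/'-' supervised, '?' ignored) reduces to a label plus a validity mask per node; masked-out nodes never reach the loss. A minimal standalone sketch of that mapping (plain Python, hypothetical helper name):

```python
def build_labels(annotations):
    """Map per-node '+'/'-'/'?' marks to (labels, valid_mask).

    '+' -> label 1.0 and '-' -> label 0.0, both with valid_mask=True;
    '?' (unannotated) gets a placeholder label and valid_mask=False,
    so the node is excluded from the BCE loss.
    """
    labels, valid_mask = [], []
    for mark in annotations:
        if mark == "+":
            labels.append(1.0)
            valid_mask.append(True)
        elif mark == "-":
            labels.append(0.0)
            valid_mask.append(True)
        else:
            labels.append(0.0)
            valid_mask.append(False)
    return labels, valid_mask
```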
mecari/featurizers/lexical.py ADDED
@@ -0,0 +1,116 @@
1
+ import hashlib
2
+ from dataclasses import dataclass
3
+ from typing import Dict, List, Optional, Tuple
4
+
5
+
6
+ # -------- Basic data structures --------
7
+ @dataclass
8
+ class Morpheme:
9
+ surf: str # surface
10
+ lemma: str # lemma (base form)
11
+ pos: str # POS (coarse)
12
+ pos1: str = "*" # POS (fine)
13
+ ctype: str = "*" # conjugation type
14
+ cform: str = "*" # conjugation form
15
+ reading: str = "*" # reading (if any)
16
+
17
+
18
+ # -------- Utilities --------
19
+ def _stable_hash(s: str, dim: int) -> int:
20
+ # md5 stable hash -> lower 8 bytes -> modulo by dim
21
+ d = hashlib.md5(s.encode("utf-8")).digest()
22
+ return int.from_bytes(d[:8], "little") % dim
23
+
24
+
25
+ def _charclass(ch: str) -> str:
26
+ # Simple character classes (for boundary features)
27
+ if not ch:
28
+ return "O"
29
+ try:
30
+ o = ord(ch)
31
+ except Exception:
32
+ return "O"
33
+ if 0x3040 <= o <= 0x309F:
34
+ return "H" # hiragana
35
+ if 0x30A0 <= o <= 0x30FF:
36
+ return "K" # katakana
37
+ if 0x4E00 <= o <= 0x9FFF or 0x3400 <= o <= 0x4DBF:
38
+ return "C" # kanji
39
+ if 0x0030 <= o <= 0x0039 or 0xFF10 <= o <= 0xFF19:
40
+ return "D" # digits
41
+ if 0x0041 <= o <= 0x007A or 0xFF21 <= o <= 0xFF5A:
42
+ return "A" # letters
43
+ if ch.isspace():
44
+ return "S"
45
+ return "O" # other
46
+
47
+
48
+ def _affix(s: str, n: int) -> str:
49
+ return s[:n] if len(s) >= n else s
50
+
51
+
52
+ def _suffix(s: str, n: int) -> str:
53
+ return s[-n:] if len(s) >= n else s
54
+
55
+
56
+ # -------- Lexical n-gram featurizer --------
57
+ class LexicalNGramFeaturizer:
58
+ """Build unigram + boundary features as (index, value) pairs."""
59
+
60
+ def __init__(self, dim: int = 1_000_000, add_bias: bool = True):
61
+ self.dim = dim
62
+ self.add_bias = add_bias
63
+
64
+ def _push(self, feats: List[Tuple[int, float]], key: str, val: float = 1.0):
65
+ feats.append((_stable_hash(key, self.dim), val))
66
+
67
+ def unigram_feats(self, m: Morpheme, prev_char: Optional[str], next_char: Optional[str]) -> List[Tuple[int, float]]:
68
+ f: List[Tuple[int, float]] = []
69
+ # POS
70
+ self._push(f, f"U:POS={m.pos}")
71
+ self._push(f, f"U:POS1={m.pos}:{m.pos1}")
72
+ # Lexicalized (surface/lemma) + POS
73
+ self._push(f, f"U:LEM={m.lemma}")
74
+ self._push(f, f"U:SURF={m.surf}")
75
+ self._push(f, f"U:LEM+POS={m.lemma}|{m.pos}")
76
+ self._push(f, f"U:SURF+POS1={m.surf}|{m.pos}:{m.pos1}")
77
+ # Conjugation
78
+ self._push(f, f"U:CFORM={m.ctype}:{m.cform}")
79
+ # Reading (coarse)
80
+ if m.reading and m.reading != "*":
81
+ self._push(f, f"U:READ={m.reading}")
82
+ # Prefix/Suffix (string n-grams)
83
+ self._push(f, f"U:PREF2={_affix(m.surf, 2)}")
84
+ self._push(f, f"U:SUF2={_suffix(m.surf, 2)}")
85
+ # Boundary char types (1 char left/right)
86
+ if prev_char:
87
+ self._push(f, f"U:BTYPE_L={_charclass(prev_char)}->{_charclass(m.surf[:1])}")
88
+ if next_char:
89
+ self._push(f, f"U:BTYPE_R={_charclass(m.surf[-1:])}->{_charclass(next_char)}")
90
+ if self.add_bias:
91
+ self._push(f, "U:BIAS")
92
+ return f
93
+
94
+     def featurize_sequence(
+         self, morphs: List[Morpheme], raw_sentence: Optional[str] = None
+     ) -> List[List[Tuple[int, float]]]:
+         """Return the (index, value) feature list for every morpheme in morphs."""
+         if raw_sentence is None:
+             raw_sentence = "".join(m.surf for m in morphs)
+         spans = []
+         cur = 0
+         for m in morphs:
+             st, ed = cur, cur + len(m.surf)
+             spans.append((st, ed))
+             cur = ed
+
+         feats: List[List[Tuple[int, float]]] = []
+         for i, m in enumerate(morphs):
+             st, ed = spans[i]
+             prev_char = raw_sentence[st - 1] if st > 0 else None
+             next_char = raw_sentence[ed] if ed < len(raw_sentence) else None
+             feats.append(self.unigram_feats(m, prev_char, next_char))
+
+         return feats
113
+
114
+
115
+ if __name__ == "__main__":
116
+ pass
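The featurizer maps every string feature key into a fixed index space with the hashing trick, so no vocabulary file is needed and indices are stable across processes. A standalone sketch of the same md5-based scheme used by `_stable_hash`:

```python
import hashlib

def stable_hash(key: str, dim: int) -> int:
    """Map a feature key to a stable index in [0, dim) via md5.

    md5 of the UTF-8 key -> lower 8 bytes -> modulo dim, matching
    the _stable_hash utility above.
    """
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "little") % dim
```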
mecari/models/__init__.py ADDED
@@ -0,0 +1,6 @@
1
+ #!/usr/bin/env python3
2
+ # -*- coding: utf-8 -*-
3
+
4
+ from .base import BaseMecariGNN # noqa: F401
5
+ from .gatv2 import MecariGATv2 # noqa: F401
6
+ __all__ = ["BaseMecariGNN", "MecariGATv2"]
mecari/models/base.py ADDED
@@ -0,0 +1,214 @@
1
+ #!/usr/bin/env python3
2
+ # -*- coding: utf-8 -*-
3
+ """Base model with lexical features only."""
4
+
5
+ from typing import Optional
6
+
7
+ import pytorch_lightning as pl
8
+ import torch
9
+ import torch.nn as nn
10
+ import torch.nn.functional as F
11
+
12
+
13
+ class BaseMecariGNN(pl.LightningModule):
14
+ """Base class for Mecari morpheme GNNs."""
15
+
16
+ def __init__(
17
+ self,
18
+ hidden_dim: int = 512,
19
+ num_classes: int = 1,
20
+ learning_rate: float = 1e-3,
21
+ lexical_feature_dim: int = 100000,
22
+ ) -> None:
23
+ super().__init__()
24
+ self.save_hyperparameters()
25
+
26
+ self.hidden_dim = hidden_dim
27
+ self.num_classes = num_classes
28
+ self.learning_rate = learning_rate
29
+ self.lexical_feature_dim = lexical_feature_dim
30
+
31
+ self.lexical_embedding = nn.Embedding(
32
+ num_embeddings=lexical_feature_dim, embedding_dim=hidden_dim, padding_idx=0, sparse=False
33
+ )
34
+ nn.init.xavier_uniform_(self.lexical_embedding.weight[1:])
35
+ self.lexical_embedding.weight.data[0].fill_(0)
36
+
37
+ self.lexical_norm = nn.LayerNorm(hidden_dim)
38
+ self.lexical_dropout = nn.Dropout(0.2)
39
+
40
+ self.residual_proj = nn.Linear(hidden_dim, hidden_dim)
41
+
42
+ self.node_classifier = nn.Sequential(
43
+ nn.Linear(hidden_dim, hidden_dim), nn.ReLU(), nn.Dropout(0.2), nn.Linear(hidden_dim, 1)
44
+ )
45
+
46
+ def _process_features(
47
+ self, lexical_indices: torch.Tensor, lexical_values: torch.Tensor, bert_features: Optional[torch.Tensor] = None
48
+ ) -> torch.Tensor:
49
+ """Process lexical features."""
50
+ embedded = self.lexical_embedding(lexical_indices)
51
+ weighted = embedded * lexical_values.unsqueeze(-1)
52
+ aggregated = weighted.sum(dim=1)
53
+ processed = self.lexical_dropout(self.lexical_norm(aggregated))
54
+ return processed
55
+
56
+ def forward(self, lexical_indices, lexical_values, edge_index, bert_features=None, edge_attr=None):
57
+ """Forward pass (implemented in subclasses)."""
58
+ raise NotImplementedError("Subclasses must implement forward method")
59
+
60
+ def training_step(self, batch, batch_idx):
61
+ node_predictions = self(
62
+ batch.lexical_indices,
63
+ batch.lexical_values,
64
+ batch.edge_index,
65
+ None,
66
+ batch.edge_attr if hasattr(batch, "edge_attr") else None,
67
+ ).squeeze()
68
+
69
+ valid_mask = batch.valid_mask
70
+ valid_predictions = node_predictions[valid_mask]
71
+ valid_targets = batch.y[valid_mask]
72
+
73
+ loss = self._compute_bce_loss(valid_predictions, valid_targets, stage="train")
74
+
75
+ with torch.no_grad():
76
+ pred_probs = torch.sigmoid(valid_predictions)
77
+ pred_binary = (pred_probs > 0.5).float()
78
+ correct = (pred_binary == valid_targets).sum()
79
+ accuracy = correct / valid_targets.numel()
80
+ error_rate = 1.0 - accuracy
81
+
82
+ self.log("train_loss", loss, prog_bar=True, on_step=True, on_epoch=True)
83
+ self.log("train_error", error_rate, prog_bar=True, on_step=True, on_epoch=True)
84
+
85
+ if self.trainer and self.trainer.optimizers:
86
+ current_lr = self.trainer.optimizers[0].param_groups[0]["lr"]
87
+ self.log("learning_rate", current_lr, on_step=True, on_epoch=False)
88
+
89
+ return loss
90
+
91
+ def validation_step(self, batch, batch_idx):
92
+ node_predictions = self(
93
+ batch.lexical_indices,
94
+ batch.lexical_values,
95
+ batch.edge_index,
96
+ None,
97
+ batch.edge_attr if hasattr(batch, "edge_attr") else None,
98
+ ).squeeze()
99
+
100
+ valid_mask = batch.valid_mask
101
+ valid_predictions = node_predictions[valid_mask]
102
+ valid_targets = batch.y[valid_mask]
103
+
104
+ loss = self._compute_bce_loss(valid_predictions, valid_targets, stage="val")
105
+
106
+ with torch.no_grad():
107
+ pred_probs = torch.sigmoid(valid_predictions)
108
+ pred_binary = (pred_probs > 0.5).float()
109
+ correct = (pred_binary == valid_targets).sum()
110
+ accuracy = correct / valid_targets.numel()
111
+ error_rate = 1.0 - accuracy
112
+
113
+ self.log("val_loss", loss, prog_bar=True, on_step=False, on_epoch=True)
114
+ self.log("val_error", error_rate, prog_bar=True, on_step=True, on_epoch=True)
115
+
116
+ self.log("val_loss_epoch", loss, on_step=False, on_epoch=True)
117
+ self.log("val_error_epoch", error_rate, on_step=False, on_epoch=True)
118
+
119
+ return loss
120
+
121
+ def configure_optimizers(self):
122
+ """Configure optimizer."""
123
+ optimizer_config = getattr(self, "training_config", {}).get("optimizer", {})
124
+ optimizer_type = optimizer_config.get("type", "adamw")
125
+
126
+ if optimizer_type == "adamw":
127
+ optimizer = torch.optim.AdamW(
128
+ self.parameters(), lr=self.learning_rate, weight_decay=optimizer_config.get("weight_decay", 0.01)
129
+ )
130
+ else:
131
+ optimizer = torch.optim.Adam(self.parameters(), lr=self.learning_rate)
132
+ # Optional warmup scheduler (linear warmup to base LR)
133
+ tc = getattr(self, "training_config", {}) or {}
134
+ warmup_steps = int(tc.get("warmup_steps", 0) or 0)
135
+ warmup_start_lr = float(tc.get("warmup_start_lr", 0.0) or 0.0)
136
+ if warmup_steps > 0 and self.learning_rate > 0.0:
137
+ start_factor = max(0.0, min(1.0, warmup_start_lr / float(self.learning_rate)))
138
+
139
+ def lr_lambda(step: int):
140
+ if step <= 0:
141
+ return start_factor
142
+ if step < warmup_steps:
143
+ return start_factor + (1.0 - start_factor) * (step / float(warmup_steps))
144
+ return 1.0
145
+
146
+ scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_lambda)
147
+ return {
148
+ "optimizer": optimizer,
149
+ "lr_scheduler": {
150
+ "scheduler": scheduler,
151
+ "interval": "step",
152
+ "frequency": 1,
153
+ "name": "linear_warmup",
154
+ },
155
+ }
156
+ return {"optimizer": optimizer}
157
+
158
+ def test_step(self, batch, batch_idx):
159
+ node_predictions = self(
160
+ batch.lexical_indices,
161
+ batch.lexical_values,
162
+ batch.edge_index,
163
+ None,
164
+ batch.edge_attr if hasattr(batch, "edge_attr") else None,
165
+ ).squeeze()
166
+
167
+ valid_mask = batch.valid_mask
168
+ valid_predictions = node_predictions[valid_mask]
169
+ valid_targets = batch.y[valid_mask]
170
+
171
+ with torch.no_grad():
172
+ pred_probs = torch.sigmoid(valid_predictions)
173
+ pred_binary = (pred_probs > 0.5).float()
174
+ correct = (pred_binary == valid_targets).sum()
175
+ accuracy = correct / valid_targets.numel()
176
+ error_rate = 1.0 - accuracy
177
+
178
+ self.log("test_error", error_rate, on_step=False, on_epoch=True)
179
+ self.log("test_accuracy", accuracy, on_step=False, on_epoch=True)
180
+
181
+ return error_rate
182
+
183
+ def _compute_bce_loss(self, logits: torch.Tensor, targets: torch.Tensor, stage: str = "train") -> torch.Tensor:
184
+ """BCEWithLogits loss with optional label smoothing and pos_weight.
185
+
186
+ - label_smoothing: smooth targets toward 0.5 by epsilon.
187
+ - pos_weight: handle class imbalance using ratio (neg/pos) per batch, robustly.
188
+ """
189
+ loss_cfg = getattr(self, "training_config", {}).get("loss", {})
190
+ eps = float(loss_cfg.get("label_smoothing", 0.0) or 0.0)
191
+ use_pos_weight = bool(loss_cfg.get("use_pos_weight", True))
192
+
193
+ # Compute pos_weight from unsmoothed targets
194
+ pos = torch.clamp(targets.sum(), min=0.0)
195
+ total = torch.tensor(targets.numel(), device=targets.device, dtype=targets.dtype)
196
+ neg = total - pos
197
+ pos_weight = None
198
+ if use_pos_weight and pos > 0 and neg > 0:
199
+ # pos_weight = neg/pos; clamp to avoid extreme values
200
+ pw = (neg / pos).detach()
201
+ pw = torch.clamp(pw, 0.5, 50.0) # safety bounds
202
+ pos_weight = pw
203
+
204
+ # Apply label smoothing to targets: y' = (1-eps)*y + 0.5*eps
205
+ if eps > 0.0:
206
+ targets = (1.0 - eps) * targets + 0.5 * eps
207
+
208
+ loss = F.binary_cross_entropy_with_logits(
209
+ logits,
210
+ targets,
211
+ pos_weight=pos_weight,
212
+ )
213
+
214
+ return loss
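The optional warmup in `configure_optimizers` multiplies the base learning rate by a linear factor per step. That multiplier can be sketched in isolation (same logic as the `lr_lambda` closure above):

```python
def warmup_factor(step: int, warmup_steps: int, start_factor: float) -> float:
    """Linear warmup multiplier for the base learning rate.

    start_factor is warmup_start_lr / base_lr clamped to [0, 1];
    the factor rises linearly to 1.0 over warmup_steps, then stays there.
    """
    if step <= 0:
        return start_factor
    if step < warmup_steps:
        return start_factor + (1.0 - start_factor) * (step / float(warmup_steps))
    return 1.0
```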
mecari/models/gatv2.py ADDED
@@ -0,0 +1,139 @@
1
+ #!/usr/bin/env python3
2
+ # -*- coding: utf-8 -*-
3
+ """GATv2 model for morpheme graph classification."""
4
+
5
+ import torch.nn as nn
6
+ import torch.nn.functional as F
7
+ from torch_geometric.nn import GATv2Conv
8
+ from torch_geometric.utils import add_self_loops, dropout_adj
9
+
10
+ from .base import BaseMecariGNN
11
+
12
+
13
+ class MecariGATv2(BaseMecariGNN):
14
+ """Graph Attention Network v2 for morpheme analysis"""
15
+
16
+ def __init__(
17
+ self,
18
+ hidden_dim: int = 512,
19
+ num_heads: int = 8,
20
+ num_layers: int = 4,
21
+ num_classes: int = 1,
22
+ learning_rate: float = 1e-3,
23
+ lexical_feature_dim: int = 100000,
24
+ share_weights: bool = False, # share-weights option of GATv2
25
+ # New knobs
26
+ dropout: float = 0.1,
27
+ attn_dropout: float = 0.1,
28
+ add_self_loops_flag: bool = True,
29
+ edge_dropout: float = 0.0,
30
+ norm: str = "layer",
31
+ **kwargs, # Ignore extra params for config compatibility
32
+ ):
33
+ super().__init__(
34
+ hidden_dim=hidden_dim,
35
+ num_classes=num_classes,
36
+ learning_rate=learning_rate,
37
+ lexical_feature_dim=lexical_feature_dim,
38
+ )
39
+ self.num_heads = num_heads
40
+ self.num_layers = num_layers
41
+ self.share_weights = share_weights
42
+ self.feat_dropout_p = dropout
43
+ self.attn_dropout_p = attn_dropout
44
+ self.add_self_loops_flag = add_self_loops_flag
45
+ self.edge_dropout_p = edge_dropout
46
+ self.norm_type = (norm or "layer").lower()
47
+
48
+ # GATv2 layers
49
+ self.gatv2_layers = nn.ModuleList()
50
+ self.layer_norms = nn.ModuleList()
51
+
52
+ for i in range(num_layers):
53
+ if i == 0:
54
+ # First layer
55
+ self.gatv2_layers.append(
56
+ GATv2Conv(
57
+ hidden_dim,
58
+ hidden_dim,
59
+ heads=num_heads,
60
+ dropout=self.attn_dropout_p,
61
+ share_weights=share_weights,
62
+ add_self_loops=False,
63
+ )
64
+ )
65
+ elif i == num_layers - 1:
66
+ # Last layer - single head
67
+ self.gatv2_layers.append(
68
+ GATv2Conv(
69
+ hidden_dim * num_heads,
70
+ hidden_dim,
71
+ heads=1,
72
+ concat=False,
73
+ dropout=self.attn_dropout_p,
74
+ share_weights=share_weights,
75
+ add_self_loops=False,
76
+ )
77
+ )
78
+ else:
79
+ # Middle layers
80
+ self.gatv2_layers.append(
81
+ GATv2Conv(
82
+ hidden_dim * num_heads,
83
+ hidden_dim,
84
+ heads=num_heads,
85
+ dropout=self.attn_dropout_p,
86
+ share_weights=share_weights,
87
+ add_self_loops=False,
88
+ )
89
+ )
90
+
91
+ # Layer normalization (all layers)
92
+ if i < num_layers - 1:
93
+ self.layer_norms.append(
94
+ nn.LayerNorm(hidden_dim * num_heads)
95
+ if self.norm_type == "layer"
96
+ else nn.BatchNorm1d(hidden_dim * num_heads)
97
+ )
98
+ else:
99
+ self.layer_norms.append(
100
+ nn.LayerNorm(hidden_dim) if self.norm_type == "layer" else nn.BatchNorm1d(hidden_dim)
101
+ )
102
+
103
+ def forward(self, lexical_indices, lexical_values, edge_index, bert_features=None, edge_attr=None):
104
+ """Forward pass of GATv2"""
105
+ x = self._process_features(lexical_indices, lexical_values, bert_features)
106
+
107
+ residual = self.residual_proj(x)
108
+
109
+ ei = edge_index
110
+ if self.add_self_loops_flag:
111
+ ei, _ = add_self_loops(ei, num_nodes=x.size(0))
112
+ if self.edge_dropout_p > 0 and self.training:
113
+ ei, _ = dropout_adj(ei, p=self.edge_dropout_p, force_undirected=False, training=True)
114
+
115
+ # Apply GATv2 layers
116
+ prev = None
117
+ for i in range(self.num_layers):
118
+ prev = x
119
+ x = self.gatv2_layers[i](x, ei)
120
+ x = self.layer_norms[i](x)
121
+
122
+ # Per-layer residual if dimension matches (middle layers)
123
+ if x.shape == prev.shape and i < self.num_layers - 1:
124
+ x = x + prev
125
+
126
+ # Add residual at last layer
127
+ if i == self.num_layers - 1:
128
+ x = x + residual
129
+
130
+ x = F.elu(x)
131
+
132
+ # Dropout except last layer
133
+ if i < self.num_layers - 1:
134
+ x = F.dropout(x, p=self.feat_dropout_p, training=self.training)
135
+
136
+ # Classification
137
+ logits = self.node_classifier(x)
138
+
139
+ return logits
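`add_self_loops` in the forward pass guarantees every node attends to itself even though each `GATv2Conv` is built with `add_self_loops=False`. The effect on a plain edge list can be sketched as (hypothetical helper over (src, dst) pairs instead of a (2, E) tensor):

```python
def add_self_loop_pairs(edges, num_nodes):
    """Append an (i, i) pair for every node.

    Plain-list analogue of torch_geometric.utils.add_self_loops, which
    performs the same augmentation on an edge_index tensor.
    """
    return list(edges) + [(i, i) for i in range(num_nodes)]
```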
mecari/utils/__init__.py ADDED
@@ -0,0 +1,4 @@
1
+ #!/usr/bin/env python3
2
+ # -*- coding: utf-8 -*-
3
+
4
+ __all__ = []
mecari/utils/morph_utils.py ADDED
@@ -0,0 +1,51 @@
1
+ #!/usr/bin/env python3
2
+ # -*- coding: utf-8 -*-
3
+ from __future__ import annotations
4
+
5
+ from typing import Dict, List, Any, Optional
6
+
7
+ from mecari.utils.signature import signature_key
8
+
9
+
10
+ def dedup_morphemes(morphs: List[Dict]) -> List[Dict]:
11
+ seen = set()
12
+ out: List[Dict] = []
13
+ for m in morphs:
14
+ key = signature_key(m)
15
+ if key in seen:
16
+ continue
17
+ seen.add(key)
18
+ out.append(m)
19
+ out.sort(key=lambda m: (
20
+ m.get("start_pos", 0),
21
+ -(m.get("end_pos", 0) - m.get("start_pos", 0)),
22
+ m.get("surface", ""),
23
+ m.get("reading", ""),
24
+ m.get("pos", "*"),
25
+ ))
26
+ return out
27
+
28
+
29
+ def build_adjacent_edges(morphs: List[Dict]) -> List[Dict]:
30
+ edges: List[Dict] = []
31
+ for i, s in enumerate(morphs):
32
+ for j, t in enumerate(morphs):
33
+ if i >= j:
34
+ continue
35
+ if s.get("end_pos", 0) == t.get("start_pos", 0):
36
+ edges.append({"source_idx": i, "target_idx": j, "edge_type": "forward"})
37
+ return edges
38
+
39
+
40
+ def normalize_mecab_candidates(candidates: List[Dict]) -> List[Dict]:
41
+ """Normalize MeCab candidates consistently for preprocessing/inference.
42
+
43
+ - If surface is digit-only and base_form is empty/missing, set base_form = surface.
44
+ Modifies candidates in place and returns the list for convenience.
45
+ """
46
+ for c in candidates:
47
+ surf = c.get("surface", "")
48
+ bf = c.get("base_form")
49
+ if (bf is None or bf == "") and surf and surf.isdigit():
50
+ c["base_form"] = surf
51
+ return candidates
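`build_adjacent_edges` links candidate i to candidate j exactly when i ends where j starts, which is what turns the candidate lattice into a graph of compatible segmentations. A minimal sketch over plain (start, end) spans (hypothetical helper):

```python
def adjacent_edges(spans):
    """Connect candidate i -> j when i ends exactly where j starts.

    spans: (start, end) pairs sorted by start position; mirrors the
    adjacency rule of build_adjacent_edges.
    """
    edges = []
    for i, (_, i_end) in enumerate(spans):
        for j, (j_start, _) in enumerate(spans):
            if i < j and i_end == j_start:
                edges.append((i, j))
    return edges
```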
mecari/utils/signature.py ADDED
@@ -0,0 +1,39 @@
1
+ #!/usr/bin/env python3
2
+ # -*- coding: utf-8 -*-
3
+ from __future__ import annotations
4
+
5
+ from typing import Dict, Tuple
6
+
7
+
8
+ def to_katakana(s) -> str:
9
+ """Robust hiragana->katakana conversion for str or sequence.
10
+
11
+ Accepts str, list, tuple; concatenates string elements when a sequence is given.
12
+ Non-string inputs are stringified; None becomes empty string.
13
+ """
14
+ if isinstance(s, (list, tuple)):
15
+ s = "".join(x for x in s if isinstance(x, str))
16
+ elif not isinstance(s, str):
17
+ s = str(s) if s is not None else ""
18
+ out = []
19
+ for ch in s:
20
+ if not ch:
21
+ continue
22
+ o = ord(ch)
23
+ if 0x3041 <= o <= 0x3096:
24
+ out.append(chr(o + 0x60))
25
+ else:
26
+ out.append(ch)
27
+ return "".join(out)
28
+
29
+
30
+ def signature_key(m: Dict) -> Tuple:
31
+ """Stable deduplication key for a morpheme dict (POS up to pos1)."""
32
+ surface = m.get("surface", "")
33
+ pos = m.get("pos", "*")
34
+ pos1 = m.get("pos_detail1", "*")
35
+ base = m.get("base_form") or m.get("lemma") or ""
36
+ read = to_katakana(m.get("reading") or "")
37
+ st = m.get("start_pos", 0)
38
+ ed = m.get("end_pos", st + len(surface))
39
+ return (st, ed, surface, pos, pos1, base, read)
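`to_katakana` relies on the fixed 0x60 code-point offset between the hiragana block (U+3041–U+3096) and the corresponding katakana. A minimal sketch of the same shift on a plain string:

```python
def hira_to_kata(s: str) -> str:
    """Shift hiragana code points (U+3041..U+3096) up by 0x60 to katakana.

    Characters outside the hiragana block pass through unchanged.
    """
    return "".join(
        chr(ord(ch) + 0x60) if 0x3041 <= ord(ch) <= 0x3096 else ch
        for ch in s
    )
```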
packages.txt ADDED
@@ -0,0 +1,5 @@
1
+ mecab
2
+ mecab-utils
3
+ libmecab-dev
4
+ mecab-jumandic-utf8
5
+ mecab-ipadic-utf8
preprocess.py ADDED
@@ -0,0 +1,366 @@
+ #!/usr/bin/env python3
+ # -*- coding: utf-8 -*-
+ 
+ """
+ Build training graphs from KWDLC with JUMANDIC.
+ 
+ Pipeline:
+ 1) Read gold morphemes from KNP files
+ 2) Parse text with MeCab (JUMANDIC) to get candidate morphemes
+ 3) Match candidates to gold and assign annotations ('+', '-', '?')
+ 4) Save graph data as .pt
+ """
+ 
+ import argparse
+ from collections import defaultdict
+ from pathlib import Path
+ from typing import Dict, List
+ 
+ import torch
+ import yaml
+ from tqdm import tqdm
+ 
+ from mecari.analyzers.mecab import MeCabAnalyzer
+ from mecari.data.data_module import DataModule
+ from mecari.featurizers.lexical import LexicalNGramFeaturizer as LexicalFeaturizer
+ from mecari.featurizers.lexical import Morpheme
+ from mecari.utils.morph_utils import build_adjacent_edges, dedup_morphemes, normalize_mecab_candidates
+ 
+ 
+ def add_lexical_features(morphemes: List[Dict], text: str, feature_dim: int = 100000) -> List[Dict]:
+     """Add lexical (index, value) pairs to morphemes. Not used when saving JSON.
+ 
+     Kept for backward compatibility and test equivalence.
+     """
+     featurizer = LexicalFeaturizer(dim=feature_dim, add_bias=True)
+     for m in morphemes:
+         surf = m.get("surface", "")
+         morph_obj = Morpheme(
+             surf=surf,
+             lemma=m.get("base_form", surf),
+             pos=m.get("pos", "*"),
+             pos1=m.get("pos_detail1", "*"),
+             ctype="*",
+             cform="*",
+             reading=m.get("reading", "*"),
+         )
+         st = m.get("start_pos", 0)
+         ed = m.get("end_pos", st + len(surf))
+         prev_char = text[st - 1] if st > 0 and st <= len(text) else None
+         next_char = text[ed] if ed < len(text) else None
+         feats = featurizer.unigram_feats(morph_obj, prev_char, next_char)
+         m["lexical_features"] = feats
+     return morphemes
+ 
+ 
+ def hiragana_to_katakana(text: str) -> str:
+     """Convert hiragana (U+3041-U+3096) to katakana."""
+     return "".join([chr(ord(c) + 0x60) if "ぁ" <= c <= "ゖ" else c for c in text])
+ 
+ 
+ def _load_gold_with_kyoto(knp_path: Path) -> List[Dict]:
+     """Load sentences and morphemes from a KNP file using kyoto-reader (required)."""
+     try:
+         from kyoto_reader import KyotoReader  # type: ignore
+     except Exception as e:  # pragma: no cover
+         raise RuntimeError("kyoto-reader is required for gold loading. Install it (pip install kyoto-reader).") from e
+ 
+     try:
+         try:
+             reader = KyotoReader(str(knp_path), n_jobs=0)
+         except TypeError:
+             reader = KyotoReader(str(knp_path))
+         sents: List[Dict] = []
+         for doc in reader.process_all_documents(n_jobs=0):
+             if doc is None:
+                 continue
+             for sent in doc.sentences:
+                 text = sent.surf
+                 morphemes: List[Dict] = []
+                 pos = 0
+                 for mrph in sent.mrph_list():
+                     surf = getattr(mrph, "midasi", "") or ""
+                     read = getattr(mrph, "yomi", surf) or surf
+                     lemma = getattr(mrph, "genkei", surf) or surf
+                     pos_main = getattr(mrph, "hinsi", "*") or "*"
+                     pos1 = getattr(mrph, "bunrui", "*") or "*"
+                     st = pos
+                     ed = st + len(surf)
+                     pos = ed
+                     morphemes.append(
+                         {
+                             "surface": surf,
+                             "reading": read,
+                             "base_form": lemma,
+                             "pos": pos_main,
+                             "pos_detail1": pos1,
+                             "pos_detail2": "*",
+                             "pos_detail3": "*",
+                             "start_pos": st,
+                             "end_pos": ed,
+                         }
+                     )
+                 sents.append({"text": text, "morphemes": morphemes})
+         return sents
+     except Exception as e:
+         raise RuntimeError(f"Failed to parse KNP with kyoto-reader: {knp_path}") from e
+ 
+ 
+ def match_morphemes_with_gold(candidates: List[Dict], gold_morphemes: List[Dict], text: str) -> List[Dict]:
+     """Match candidate morphemes to gold and assign annotations ('?', '+', '-').
+ 
+     Policy:
+     - Initialize every candidate as '?'
+     - Mark '+' for candidates that strictly match gold (surface, POS, base, reading)
+     - Mark '-' for candidates that overlap any '+' span
+     """
+     # Reconstruct gold spans in character offsets
+     gold_details = []
+     cur = 0
+     for g in gold_morphemes:
+         surf = g.get("surface", "")
+         st, ed = cur, cur + len(surf)
+         cur = ed
+         gold_details.append(
+             {
+                 "start_pos": st,
+                 "end_pos": ed,
+                 "surface": surf,
+                 "pos": g.get("pos", "*"),
+                 "pos_detail1": g.get("pos_detail1", "*"),
+                 "pos_detail2": g.get("pos_detail2", "*"),
+                 "base_form": g.get("base_form", ""),
+                 "reading": hiragana_to_katakana(g.get("reading", "")),
+             }
+         )
+ 
+     # Initialize all candidates with '?'
+     annotated: List[Dict] = []
+     for cand in candidates:
+         a = {**cand}
+         a["annotation"] = "?"
+         if "inflection_type" not in a:
+             a["inflection_type"] = "*"
+         if "inflection_form" not in a:
+             a["inflection_form"] = "*"
+         annotated.append(a)
+ 
+     # Match by strict equality first; allow reading mismatch as fallback
+     span_to_cands: dict[tuple[int, int], list[Dict]] = {}
+     for a in annotated:
+         cs = a.get("start_pos", 0)
+         ce = a.get("end_pos", cs + len(a.get("surface", "")))
+         span_to_cands.setdefault((cs, ce), []).append(a)
+ 
+     matched_spans: List[tuple[int, int]] = []
+     for g in gold_details:
+         span = (g["start_pos"], g["end_pos"])
+         cands = span_to_cands.get(span, [])
+         if not cands:
+             continue
+         strict = []
+         fallback = []
+         for a in cands:
+             if a.get("surface", "") != g["surface"]:
+                 continue
+             if a.get("pos", "*") != g["pos"]:
+                 continue
+             if a.get("pos_detail1", "*") != g.get("pos_detail1", "*"):
+                 continue
+             if a.get("base_form", "") != g["base_form"]:
+                 continue
+             if hiragana_to_katakana(a.get("reading", "")) == g["reading"]:
+                 strict.append(a)
+             else:
+                 fallback.append(a)
+         chosen_list = strict if strict else fallback
+         if chosen_list:
+             for a in chosen_list:
+                 a["annotation"] = "+"
+             matched_spans.append(span)
+             for a in cands:
+                 if (a not in chosen_list) and a.get("annotation") != "+":
+                     a["annotation"] = "-"
+ 
+     # Demote any morpheme that overlaps (by at least 1 char) with any '+' span.
+     plus_spans = []
+     for a in annotated:
+         if a.get("annotation") == "+":
+             cs = a.get("start_pos", 0)
+             ce = a.get("end_pos", cs + len(a.get("surface", "")))
+             plus_spans.append((cs, ce))
+ 
+     def _strict_overlap(st1: int, ed1: int, st2: int, ed2: int) -> bool:
+         # overlap only if intersection length > 0 (touching is not overlap)
+         return max(st1, st2) < min(ed1, ed2)
+ 
+     for a in annotated:
+         if a.get("annotation") == "+":
+             continue
+         cs = a.get("start_pos", 0)
+         ce = a.get("end_pos", cs + len(a.get("surface", "")))
+         for ms, me in plus_spans:
+             if _strict_overlap(cs, ce, ms, me):
+                 a["annotation"] = "-"
+                 break
+     return annotated
+ 
+ 
+ def main():
+     parser = argparse.ArgumentParser(description="Create training data from KWDLC (JUMANDIC)")
+     parser.add_argument("--input-dir", type=str, default="KWDLC/knp", help="Directory containing KNP files")
+     parser.add_argument("--config", type=str, default="configs/gat.yaml", help="Path to config file")
+     parser.add_argument("--limit", type=int, help="Max number of files to process")
+     parser.add_argument("--test-only", action="store_true", help="Process only test split IDs")
+     parser.add_argument("--jumandic-path", type=str, default="/var/lib/mecab/dic/juman-utf8", help="Path to JUMANDIC")
+     args = parser.parse_args()
+ 
+     config = {}
+     if args.config and Path(args.config).exists():
+         with open(args.config, "r") as f:
+             config = yaml.safe_load(f)
+ 
+     if "extends" in config:
+         parent_config_path = Path(args.config).parent / config["extends"]
+         if parent_config_path.exists():
+             with open(parent_config_path, "r") as f:
+                 parent_config = yaml.safe_load(f)
+ 
+             def deep_merge(base, override):
+                 for key, value in override.items():
+                     if key in base and isinstance(base[key], dict) and isinstance(value, dict):
+                         deep_merge(base[key], value)
+                     else:
+                         base[key] = value
+                 return base
+ 
+             config = deep_merge(parent_config, config)
+ 
+     features_config = config.get("features", {})
+     feature_dim = features_config.get("lexical_feature_dim", 100000)
+     training_config = config.get("training", {})
+ 
+     if training_config.get("annotations_dir"):
+         output_dir = Path(training_config.get("annotations_dir"))
+     else:
+         output_dir = Path("annotations_kwdlc_juman")
+     output_dir.mkdir(parents=True, exist_ok=True)
+     print(f"Lexical features: using {feature_dim} dims")
+     print(f"Output directory: {output_dir}")
+ 
+     analyzer = MeCabAnalyzer(
+         jumandic_path=args.jumandic_path,
+     )
+ 
+     knp_files = []
+ 
+     if args.test_only:
+         test_id_file = Path("KWDLC/id/split_for_pas/test.id")
+         if test_id_file.exists():
+             with open(test_id_file, "r") as f:
+                 test_ids = [line.strip() for line in f if line.strip()]
+ 
+             knp_base_dir = Path(args.input_dir)
+             for file_id in test_ids:
+                 dir_name = file_id[:13]
+                 file_name = f"{file_id}.knp"
+                 knp_path = knp_base_dir / dir_name / file_name
+                 if knp_path.exists():
+                     knp_files.append(knp_path)
+     else:
+         knp_dir = Path(args.input_dir)
+         knp_files = sorted(knp_dir.glob("**/*.knp"))
+ 
+     if args.limit:
+         knp_files = knp_files[: args.limit]
+ 
+     print(f"Files to process: {len(knp_files)}")
+     print(f"JUMANDIC: {args.jumandic_path}")
+     print(f"Output to: {output_dir}")
+ 
+     total_stats = defaultdict(int)
+     annotation_idx = 0
+ 
+     dm = DataModule(
+         annotations_dir=str(output_dir),
+         lexical_feature_dim=int(feature_dim),
+         use_bidirectional_edges=bool(config.get("edge_features", {}).get("use_bidirectional_edges", True)),
+     )
+ 
+     # Save .pt files directly under the output_dir
+ 
+     for knp_path in tqdm(knp_files, desc="processing"):
+         try:
+             sentences = _load_gold_with_kyoto(knp_path)
+             if not sentences:
+                 continue
+ 
+             doc_id = knp_path.stem
+             for s in sentences:
+                 s["source_id"] = doc_id
+ 
+             for sent_idx, sentence in enumerate(sentences):
+                 text = sentence["text"]
+                 gold_morphemes = sentence["morphemes"]
+                 source_id = sentence.get("source_id", doc_id)
+ 
+                 candidates = analyzer.get_morpheme_candidates(text)
+                 candidates = normalize_mecab_candidates(candidates)
+                 candidates = dedup_morphemes(candidates)
+                 if not candidates:
+                     continue
+ 
+                 annotated_morphemes = match_morphemes_with_gold(candidates, gold_morphemes, text)
+ 
+                 edges = build_adjacent_edges(annotated_morphemes)
+ 
+                 for m in annotated_morphemes:
+                     if "lexical_features" in m:
+                         m.pop("lexical_features", None)
+ 
+                 morphemes_with_feats = dm.compute_lexical_features(annotated_morphemes, text)
+                 graph = dm.create_graph_from_morphemes_data(
+                     morphemes=morphemes_with_feats,
+                     edges=edges,
+                     text=text,
+                     for_training=True,
+                 )
+                 if graph is None:
+                     continue
+ 
+                 graph_file = output_dir / f"graph_{annotation_idx:04d}.pt"
+                 payload = {
+                     "graph": graph,
+                     "source_id": source_id,
+                     "text": text,
+                 }
+                 torch.save(payload, graph_file)
+ 
+                 total_stats["sentences"] += 1
+                 total_stats["morphemes"] += len(annotated_morphemes)
+                 total_stats["positive"] += sum(1 for m in annotated_morphemes if m.get("annotation") == "+")
+                 total_stats["negative"] += sum(1 for m in annotated_morphemes if m.get("annotation") == "-")
+ 
+                 annotation_idx += 1
+ 
+             total_stats["files"] += 1
+ 
+         except Exception as e:
+             print(f"Error ({knp_path}): {e}")
+             total_stats["errors"] += 1
+ 
+     print("\n" + "=" * 50)
+     print("Processing complete")
+     print("=" * 50)
+     print(f"Files: {total_stats['files']}")
+     print(f"Sentences: {total_stats['sentences']}")
+     print(f"Morphemes: {total_stats['morphemes']}")
+     print(f"Positive (+): {total_stats['positive']}")
+     print(f"Negative (-): {total_stats['negative']}")
+     if total_stats["errors"] > 0:
+         print(f"Errors: {total_stats['errors']}")
+ 
+ 
+ if __name__ == "__main__":
+     main()
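The '+'/'-'/'?' labeling policy implemented by `match_morphemes_with_gold` above can be illustrated on toy data. This is a simplified sketch (spans are `(start, end)` character offsets; the `annotate` helper and sample dicts are invented for illustration and ignore POS/reading matching):

```python
# Toy illustration of the annotation policy: exact-span gold matches
# become '+', anything overlapping a '+' span becomes '-', everything
# else stays '?' (unknown, ignored during training).

def annotate(cands, gold_spans):
    for c in cands:
        c["annotation"] = "+" if (c["start"], c["end"]) in gold_spans else "?"
    plus = [(c["start"], c["end"]) for c in cands if c["annotation"] == "+"]
    for c in cands:
        # strict overlap: intersection length > 0 (touching is not overlap)
        if c["annotation"] != "+" and any(max(c["start"], s) < min(c["end"], e) for s, e in plus):
            c["annotation"] = "-"
    return cands

cands = [
    {"surface": "東京", "start": 0, "end": 2},  # matches a gold span  -> '+'
    {"surface": "東", "start": 0, "end": 1},    # overlaps the '+' span -> '-'
    {"surface": "都", "start": 2, "end": 3},    # touches but no overlap -> '?'
]
annotate(cands, gold_spans={(0, 2)})
print([c["annotation"] for c in cands])  # → ['+', '-', '?']
```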
pyproject.toml ADDED
@@ -0,0 +1,56 @@
+ [project]
+ name = "mecari-morpheme"
+ version = "0.1.0"
+ description = "Japanese morphological analysis using Graph Neural Networks"
+ readme = "README.md"
+ requires-python = ">=3.11,<3.12"
+ dependencies = [
+     "torch>=2.2,<2.3",
+     "pytorch-lightning>=2.0.0",
+     "torch-geometric>=2.4,<2.5",
+     "numpy>=1.24,<2.0",
+     "pyyaml>=6.0",
+     "tqdm>=4.65.0",
+     "kyoto-reader>=2.5.0",
+     # Optional: enabled by default via config
+     "wandb>=0.15.0",
+ ]
+ 
+ [project.optional-dependencies]
+ dev = [
+     "ipython>=8.14.0",
+     "jupyter>=1.0.0",
+     "notebook>=7.0.0",
+     "pytest>=7.4.0",
+     "black>=23.0.0",
+     "ruff>=0.1.0",
+ ]
+ 
+ [build-system]
+ requires = ["setuptools>=61.0"]
+ build-backend = "setuptools.build_meta"
+ 
+ [tool.setuptools]
+ packages = ["mecari"]
+ 
+ [tool.uv]
+ index-url = "https://pypi.org/simple"
+ # Use CUDA 12.1 compatible PyG wheels (matches torch 2.2.x + cu121 environment)
+ find-links = ["https://data.pyg.org/whl/torch-2.2.0+cu121.html"]
+ 
+ # torch is required at build time for extensions such as torch-cluster
+ [tool.uv.extra-build-dependencies]
+ # Ensure torch is available when resolving extension wheels
+ torch-geometric = ["torch"]
+ 
+ [tool.ruff]
+ line-length = 120
+ target-version = "py311"
+ 
+ [tool.ruff.lint]
+ select = ["E", "F", "I"]
+ ignore = ["E501"]  # line too long
+ 
+ [tool.black]
+ line-length = 120
+ target-version = ['py311']
requirements.txt ADDED
@@ -0,0 +1,20 @@
+ --find-links https://data.pyg.org/whl/torch-2.2.0+cpu.html
+ 
+ # Core runtime
+ torch==2.2.2
+ torch-scatter
+ torch-sparse
+ torch-cluster
+ torch-spline-conv
+ torch-geometric==2.4.0
+ pytorch-lightning==2.5.2
+ numpy>=1.24,<2.1
+ pyyaml>=6.0
+ tqdm>=4.65.0
+ kyoto-reader>=2.5.0
+ 
+ # UI
+ gradio>=4.37.0
+ 
+ # Optional logger (disabled at runtime)
+ wandb>=0.15.0
runtime.txt ADDED
@@ -0,0 +1 @@
+ python-3.11
sample_model/config.yaml ADDED
@@ -0,0 +1,39 @@
+ edge_features:
+   use_bidirectional_edges: true
+ features:
+   lexical_feature_dim: 100000
+ inference:
+   checkpoint_dir: experiments
+   experiment_name: null
+ loss:
+   label_smoothing: 0.0
+   use_pos_weight: true
+ model:
+   dropout: 0.1
+   hidden_dim: 64
+   num_classes: 1
+   num_heads: 4
+   num_layers: 4
+   share_weights: false
+   type: gatv2
+ training:
+   accumulate_grad_batches: 1
+   annotations_dir: annotations_new
+   batch_size: 128
+   deterministic: false
+   gradient_clip_algorithm: norm
+   gradient_clip_val: 0.5
+   learning_rate: 0.001
+   log_every_n_steps: 50
+   max_steps: 10000
+   num_workers: 4
+   optimizer:
+     type: adamw
+     weight_decay: 0.001
+   patience: 10
+   project_name: mecari
+   seed: 42
+   use_wandb: true
+   val_check_interval: 1.0
+   warmup_start_lr: 0.0
+   warmup_steps: 500
sample_model/model.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ccfc112d4a0dcdc0b087c9cabf1b45f1aeae2f1cb8a6f86196a115aa594f68d7
+ size 26975745
train.py ADDED
@@ -0,0 +1,388 @@
+ #!/usr/bin/env python3
+ # -*- coding: utf-8 -*-
+ 
+ import os
+ 
+ # Disable tokenizer parallelism warning
+ os.environ["TOKENIZERS_PARALLELISM"] = "false"
+ 
+ import argparse
+ import random
+ from datetime import datetime
+ from importlib import import_module
+ from typing import Optional
+ 
+ import numpy as np
+ import pytorch_lightning as pl
+ import torch
+ from pytorch_lightning.callbacks import EarlyStopping, LearningRateMonitor, ModelCheckpoint
+ 
+ from mecari.config.config import get_model_config, override_config, save_config
+ from mecari.data.data_module import DataModule
+ 
+ 
+ def set_seed(seed: int = 42, deterministic: bool = True) -> None:
+     """Set random seeds for reproducibility.
+ 
+     Args:
+         seed: Random seed value.
+         deterministic: If True, enforce deterministic behavior (slower).
+     """
+     random.seed(seed)
+     np.random.seed(seed)
+     torch.manual_seed(seed)
+     torch.cuda.manual_seed(seed)
+     torch.cuda.manual_seed_all(seed)
+     torch.backends.cudnn.deterministic = deterministic
+     torch.backends.cudnn.benchmark = not deterministic
+     pl.seed_everything(seed)
+ 
+ 
+ def get_config_sections(config: dict) -> dict:
+     """Extract structured sections from a unified config dict."""
+     return {
+         "model": config["model"],
+         "training": config["training"],
+         "features": config.get("features", {}),
+         "edge": config.get("edge_features", {}),
+     }
+ 
+ 
+ def calculate_feature_dim(config: dict) -> int:
+     """Return feature dimension from config (lexical features by default)."""
+     features_cfg = config.get("features", {})
+ 
+     lexical_dim = features_cfg.get("lexical_feature_dim", 100000)
+     return lexical_dim
+ 
+ 
+ def create_data_module(config: dict) -> DataModule:
+     """Create DataModule from config (lexical-only pipeline)."""
+     features_cfg = config.get("features", {})
+     training_cfg = config["training"]
+     edge_cfg = config.get("edge_features", {})
+ 
+     lexical_feature_dim = features_cfg.get("lexical_feature_dim", 100000)
+ 
+     return DataModule(
+         annotations_dir=training_cfg["annotations_dir"],
+         batch_size=training_cfg["batch_size"],
+         num_workers=training_cfg["num_workers"],
+         max_files=training_cfg.get("max_files"),
+         use_bidirectional_edges=edge_cfg.get("use_bidirectional_edges", True),
+         annotations_override_dir=training_cfg.get("annotations_override_dir"),
+         lexical_feature_dim=lexical_feature_dim,
+     )
+ 
+ 
+ def setup_loggers(config: dict, experiment_name: str):
+     """Configure optional loggers (e.g., Weights & Biases)."""
+     import subprocess
+ 
+     from pytorch_lightning.loggers import WandbLogger
+ 
+     loggers = []
+ 
+     if config["training"]["use_wandb"]:
+         try:
+             tags = []
+             try:
+                 branch = subprocess.check_output(["git", "rev-parse", "--abbrev-ref", "HEAD"], text=True).strip()
+                 tags.append(f"branch:{branch}")
+                 commit = subprocess.check_output(["git", "rev-parse", "--short", "HEAD"], text=True).strip()
+                 tags.append(f"commit:{commit}")
+             except Exception:
+                 pass
+ 
+             wandb_logger = WandbLogger(
+                 project=config["training"]["project_name"],
+                 name=experiment_name,
+                 save_dir=f"experiments/{experiment_name}",
+                 save_code=True,
+                 log_model=False,
+                 tags=tags,
+             )
+             loggers.append(wandb_logger)
+             print("✓ Added WandB logger (metrics only)")
+         except Exception as e:
+             print(f"WandbLogger initialization error: {e}")
+     else:
+         print("WandB logging disabled")
+ 
+     if not loggers:
+         loggers = False
+ 
+     return loggers
+ 
+ 
+ def create_trainer(config: dict, callbacks: list, loggers, deterministic: bool) -> pl.Trainer:
+     """Create a PyTorch Lightning Trainer."""
+     if torch.cuda.is_available():
+         accelerator = "gpu"
+         devices = 1
+     else:
+         accelerator = "cpu"
+         devices = 1
+ 
+     max_steps = config["training"].get("max_steps", 8600)
+     max_epochs = -1  # use max_steps only
+ 
+     trainer_kwargs = {
+         "max_epochs": max_epochs,
+         "max_steps": max_steps,
+         "callbacks": callbacks,
+         "logger": loggers,
+         "accelerator": accelerator,
+         "devices": devices,
+         "log_every_n_steps": config["training"]["log_every_n_steps"],
+         "val_check_interval": config["training"]["val_check_interval"],
+         "gradient_clip_val": config["training"]["gradient_clip_val"],
+         "enable_checkpointing": True,
+         "enable_progress_bar": True,
+         "limit_train_batches": 1.0,
+         "limit_val_batches": 1.0,
+         "limit_test_batches": 1.0,
+         "limit_predict_batches": 1.0,
+         "fast_dev_run": False,
+         "deterministic": deterministic,
+         "benchmark": not deterministic,
+         "precision": "16-mixed",
+     }
+ 
+     if "gradient_clip_algorithm" in config["training"]:
+         trainer_kwargs["gradient_clip_algorithm"] = config["training"]["gradient_clip_algorithm"]
+ 
+     if "accumulate_grad_batches" in config["training"]:
+         trainer_kwargs["accumulate_grad_batches"] = config["training"]["accumulate_grad_batches"]
+ 
+     return pl.Trainer(**trainer_kwargs)
+ 
+ 
+ def create_model_and_datamodule(config: dict, feature_dim: int, data_module: Optional[DataModule] = None):
+     """Create model and ensure DataModule is available (lexical-only)."""
+     cfg = get_config_sections(config)
+     model_cfg = cfg["model"]
+     training_cfg = cfg["training"]
+     features_cfg = cfg["features"]
+ 
+     if data_module is None:
+         data_module = create_data_module(config)
+ 
+     common_params = {
+         "hidden_dim": model_cfg["hidden_dim"],
+         "num_classes": model_cfg["num_classes"],
+         "learning_rate": training_cfg["learning_rate"],
+         "lexical_feature_dim": features_cfg.get("lexical_feature_dim", 100000),
+     }
+ 
+     if model_cfg["type"] == "gatv2":
+         MecariGATv2 = getattr(import_module("mecari.models.gatv2"), "MecariGATv2")
+         model = MecariGATv2(
+             **common_params,
+             num_heads=model_cfg["num_heads"],
+             share_weights=model_cfg.get("share_weights", False),
+             dropout=model_cfg.get("dropout", 0.1),
+             attn_dropout=model_cfg.get("attn_dropout", model_cfg.get("attention_dropout", 0.1)),
+             add_self_loops_flag=model_cfg.get("add_self_loops", True),
+             edge_dropout=model_cfg.get("edge_dropout", 0.0),
+             norm=model_cfg.get("norm", "layer"),
+         )
+     else:
+         raise ValueError(f"Unsupported model type: {model_cfg['type']}")
+ 
+     return model, data_module
+ 
+ 
+ def main():
+     parser = argparse.ArgumentParser(description="Train the morphological analysis model")
+     parser.add_argument(
+         "--model",
+         "-m",
+         choices=["gatv2"],
+         default="gatv2",
+         help="Model type (only gatv2 supported). If a config is provided, config.model.type takes precedence.",
+     )
+     parser.add_argument("--config", "-c", help="Path to config file (overrides model type if present)")
+     parser.add_argument("--batch-size", "-b", type=int, help="Batch size")
+     parser.add_argument("--steps", "-s", type=int, help="Max training steps")
+     parser.add_argument("--lr", type=float, help="Learning rate")
+     parser.add_argument("--hidden-dim", type=int, help="Hidden dimension size")
+     parser.add_argument("--patience", type=int, help="Early stopping patience")
+     parser.add_argument("--weight-decay", type=float, help="Weight decay")
+     parser.add_argument("--no-wandb", action="store_true", help="Disable Weights & Biases logging")
+     parser.add_argument("--seed", type=int, help="Random seed")
+     parser.add_argument("--no-deterministic", action="store_true", help="Disable deterministic mode for speed")
+     parser.add_argument("--resume", type=str, help="Experiment name to resume (e.g., gatv2_20250806_162945)")
+     args = parser.parse_args()
+ 
+     # Load/merge config
+     if args.config:
+         from mecari.config.config import load_config
+ 
+         config = load_config(args.config)
+         if "model" in config and "type" in config["model"]:
+             args.model = config["model"]["type"]
+     else:
+         config = get_model_config(args.model)
+ 
+     overrides = {}
+ 
+     # Training overrides
+     training_overrides = {}
+     if args.batch_size:
+         training_overrides["batch_size"] = args.batch_size
+     if args.steps:
+         training_overrides["max_steps"] = args.steps
+     if args.lr:
+         training_overrides["learning_rate"] = args.lr
+     if args.no_wandb:
+         training_overrides["use_wandb"] = False
+     if args.patience:
+         training_overrides["patience"] = args.patience
+     if args.seed:
+         training_overrides["seed"] = args.seed
+     if args.no_deterministic:
+         training_overrides["deterministic"] = False
+ 
+     if training_overrides:
+         overrides["training"] = training_overrides
+ 
+     # Model overrides
+     if args.hidden_dim:
+         overrides["model"] = {"hidden_dim": args.hidden_dim}
+ 
+     # Optimizer overrides
+     if args.weight_decay:
+         overrides.setdefault("training", {})
+         overrides["training"]["optimizer"] = {"weight_decay": args.weight_decay}
+ 
+     if overrides:
+         config = override_config(config, overrides)
+ 
+     deterministic = config["training"].get("deterministic", True)
+     set_seed(config["training"]["seed"], deterministic=deterministic)
+ 
+     if not deterministic:
+         print("⚡ Performance mode: deterministic=False (reproducibility not guaranteed)")
+ 
+     resume_from_checkpoint = None
+     experiment_name = None
+     if args.resume:
+         experiment_path = os.path.join("experiments", args.resume)
+         if os.path.exists(experiment_path):
+             checkpoint_dir = os.path.join(experiment_path, "checkpoints")
+             if os.path.exists(checkpoint_dir):
+                 checkpoints = [f for f in os.listdir(checkpoint_dir) if f.endswith(".ckpt")]
+                 if checkpoints:
+                     checkpoints.sort()
+                     resume_from_checkpoint = os.path.join(checkpoint_dir, checkpoints[-1])
+                     print(f"Resuming training from: {resume_from_checkpoint}")
+                     experiment_name = args.resume
+ 
+                     config_path = os.path.join(experiment_path, "config.yaml")
+                     if os.path.exists(config_path):
+                         from mecari.config.config import load_config
+ 
+                         config = load_config(config_path)
+                         print(f"Restored config from: {config_path}")
+                 else:
+                     print(f"Warning: No checkpoints found in: {checkpoint_dir}")
+             else:
+                 print(f"Warning: Checkpoint directory not found: {checkpoint_dir}")
+         else:
+             print(f"Warning: Experiment directory not found: {experiment_path}")
+     else:
+         timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
+         experiment_name = f"{config['model']['type']}_{timestamp}"
+ 
+     print(f"Experiment: {experiment_name}")
+     print(f"Model: {config['model']['type'].upper()}")
+     print("Lexical features: enabled (default)")
+ 
+     if torch.cuda.is_available():
+         print(f"🚀 Using GPU: {torch.cuda.get_device_name(0)}")
+         print(f"   GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")
+     else:
+         print("💻 Using CPU")
+ 
+     data_module = create_data_module(config)
+ 
+     feature_dim = calculate_feature_dim(config)
+ 
+     model, _ = create_model_and_datamodule(config, feature_dim, data_module)
+ 
+     # Attach training config for schedulers, etc.
+     model.training_config = config["training"]
+ 
+     experiment_dir = f"experiments/{experiment_name}"
+     if not args.resume:
+         os.makedirs(experiment_dir, exist_ok=True)
+         save_config(config, f"{experiment_dir}/config.yaml")
+ 
+     checkpoint_callback_error = ModelCheckpoint(
+         dirpath=f"experiments/{experiment_name}/checkpoints",
+         filename=f"{config['model']['type']}-{{epoch:02d}}-{{val_error_epoch:.3f}}",
+         monitor="val_error_epoch",
+         mode="min",
+         save_top_k=1,
+         save_last=True,
+     )
+ 
+     early_stopping = EarlyStopping(
+         monitor="val_error_epoch", mode="min", patience=config["training"]["patience"], verbose=True, strict=False
+     )
+ 
+     loggers = setup_loggers(config, experiment_name)
+ 
+     callbacks = [checkpoint_callback_error, early_stopping]
+     try:
+         if loggers:
+             lr_monitor = LearningRateMonitor(logging_interval="step")
+             callbacks.append(lr_monitor)
+     except Exception:
+         pass
+     trainer = create_trainer(config, callbacks, loggers, deterministic)
+ 
+     print("Starting training...")
+ 
+     try:
+         if resume_from_checkpoint:
+             trainer.fit(model, data_module, ckpt_path=resume_from_checkpoint)
+         else:
+             trainer.fit(model, data_module)
+         training_status = "completed"
+ 
+         if data_module.test_dataset:
+             print("Evaluating on test data...")
+             trainer.test(model, data_module)
+         print("Training complete!")
+     except KeyboardInterrupt:
+         print("\nTraining interrupted...")
+         training_status = "interrupted"
+     except Exception as e:
+         print(f"\nError during training: {e}")
+         import traceback
+ 
+         traceback.print_exc()
+         training_status = "error"
+ 
+     print(f"Experiment: {experiment_name}")
+     print(f"Experiment dir: experiments/{experiment_name}")
+ 
+     print("\n=== Saved models ===")
+ 
+     if checkpoint_callback_error.best_model_path:
+         best_error = (
+             float(checkpoint_callback_error.best_model_score)
+             if checkpoint_callback_error.best_model_score is not None
+             else 1.0
+         )
+         print(f"  Best val_error: {best_error:.6f}")
+         print(f"  → {os.path.basename(checkpoint_callback_error.best_model_path)}")
+ 
+     print(f"\nFinal epoch: {trainer.current_epoch}")
+     print(f"Training status: {training_status}")
+ 
+ 
+ if __name__ == "__main__":
+     main()
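The `--resume` path above picks the lexicographically last `.ckpt` file in the experiment's checkpoint directory. A small sketch of that selection (filenames are invented; string sort matches training order only when epoch numbers are zero-padded, as in the `{epoch:02d}` checkpoint filename pattern):

```python
# Sketch: choose the checkpoint to resume from by sorting filenames.

def latest_checkpoint(filenames):
    ckpts = sorted(f for f in filenames if f.endswith(".ckpt"))
    return ckpts[-1] if ckpts else None

files = ["gatv2-epoch=01-val_error_epoch=0.123.ckpt",
         "gatv2-epoch=10-val_error_epoch=0.098.ckpt",
         "config.yaml"]
print(latest_checkpoint(files))  # → gatv2-epoch=10-val_error_epoch=0.098.ckpt
```

Note that `save_last=True` also writes a `last.ckpt`, which sorts after the `gatv2-*` names and would be preferred by this scheme.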
up_hf.py ADDED
@@ -0,0 +1,18 @@
+ import os
+ from huggingface_hub import HfApi
+ api = HfApi()
+ repo_id = "zbller/Mecari"
+ api.create_repo(repo_id=repo_id, repo_type="space", space_sdk="gradio", private=False, exist_ok=True, token=os.environ["HF_TOKEN"])
+ api.upload_folder(
+     repo_id=repo_id,
+     repo_type="space",
+     folder_path=".",
+     path_in_repo=".",
+     ignore_patterns=[
+         ".git", ".git/**", ".venv", ".venv/**", "__pycache__", "**/__pycache__",
+         "KWDLC", "KWDLC/**", "annotations", "annotations/**", "experiments", "experiments/**",
+         "mecari_morpheme.egg-info", "mecari_morpheme.egg-info/**",
+     ],
+     token=os.environ["HF_TOKEN"],
+ )
+ print(f"Uploaded to https://huggingface.co/spaces/{repo_id}")
uv.lock ADDED
The diff for this file is too large to render. See raw diff