leaderboard-analytics-service

Feat Regional Analysis

by SmileXing - opened 3 days ago

base: refs/heads/main ← from: refs/pr/1

+1364

-155

This PR is in draft mode

Files changed (20)

.env.example +4 -0
.gitignore +8 -0
CHANGELOG.md +20 -0
README.md +69 -12
app.py +4 -3
pyproject.toml +10 -0
requirements.txt +1 -0
src/leaderboard_analytics/__init__.py +0 -1
src/leaderboard_analytics/config.py +6 -1
src/leaderboard_analytics/db.py +3 -2
src/leaderboard_analytics/geoip_database.py +36 -0
src/leaderboard_analytics/main.py +32 -5
src/leaderboard_analytics/repositories.py +294 -46
src/leaderboard_analytics/schemas.py +11 -4
src/leaderboard_analytics/services.py +220 -18
src/leaderboard_analytics/ui.py +396 -63
tests/test_geoip_database.py +29 -0
tests/test_repositories.py +95 -0
tests/test_schemas.py +16 -0
tests/test_services.py +110 -0

.env.example CHANGED

@@ -4,3 +4,7 @@ MONGO_COLLECTION=events
 HOST=0.0.0.0
 PORT=7860
 GRADIO_SHARE=false

 HOST=0.0.0.0
 PORT=7860
 GRADIO_SHARE=false
+GRADIO_SSR_MODE=false
+GEOIP_DATABASE_PATH=GeoLite2-Country.mmdb
+GEOIP_DATABASE_URL=https://cdn.jsdelivr.net/npm/geolite2-country/GeoLite2-Country.mmdb.gz
+GEOIP_AUTO_DOWNLOAD=true

.gitignore CHANGED

@@ -49,6 +49,7 @@ coverage.xml
 *.py.cover
 .hypothesis/
 .pytest_cache/
 cover/
 # Translations
@@ -144,6 +145,13 @@ ENV/
 env.bak/
 venv.bak/
 # Spyder project settings
 .spyderproject
 .spyproject

 *.py.cover
 .hypothesis/
 .pytest_cache/
+.pytest_tmp/
 cover/
 # Translations
 env.bak/
 venv.bak/
+# Local GeoIP databases
+*.mmdb
+*.mmdb.gz
+# Local analytics exports
+visitor_ips*.csv
 # Spyder project settings
 .spyderproject
 .spyproject

CHANGELOG.md ADDED

@@ -0,0 +1,20 @@

+# Changelog
+All notable changes to this project will be documented in this file.
+## Unreleased
+### Added
+- Added full-range overview totals so UV and Sessions are distinct counts across the selected range.
+- Added ordered funnel logic that counts each step only when it occurs after the previous required step.
+- Added benchmark choices, raw data tables, and CSV export support to the dashboard.
+- Added query validation, MongoDB ping checks, and dashboard-friendly error messages.
+- Added pytest coverage for metric totals, query validation, and MongoDB aggregation pipeline shape.
+- Added CI for formatting, linting, and tests.
+### Changed
+- Updated new vs returning visitor logic to compute first-seen dates from the full available page-view history before applying the selected reporting range.
+- Updated MongoDB aggregation pipelines to prefer an indexed `ts` Date field while retaining fallback support for legacy `timestamp` values.
+- Documented recommended MongoDB indexes for production deployments.
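
Note on the first "Added" entry: a distinct count over the whole range is not the same as the sum of per-period distinct counts, because a visitor who appears in two periods would be counted twice by the sum. The following is a minimal Python sketch of that difference, not the PR's MongoDB pipeline; the sample rows and the helper names `per_period_uv` and `total_uv` are made up for illustration.

```python
from collections import defaultdict

# Hypothetical sample rows: (period, visitor_id). Empty ids are ignored,
# mirroring the "remove null/empty visitor_id" rule in the README.
events = [
    ("2024-01-01", "v1"),
    ("2024-01-01", "v2"),
    ("2024-01-02", "v1"),
    ("2024-01-02", "v3"),
    ("2024-01-02", ""),
]


def per_period_uv(rows):
    """Distinct visitors within each period (the trend view)."""
    seen = defaultdict(set)
    for period, visitor_id in rows:
        if visitor_id:
            seen[period].add(visitor_id)
    return {period: len(ids) for period, ids in seen.items()}


def total_uv(rows):
    """Distinct visitors across the whole selected range."""
    return len({visitor_id for _, visitor_id in rows if visitor_id})


trend = per_period_uv(events)
summed = sum(trend.values())  # v1 is counted once per day it appears
distinct = total_uv(events)   # v1 is counted once overall
print(trend, summed, distinct)
```

Here the per-period sum over-counts `v1`, which is why the totals need their own query instead of reusing the trend rows.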

README.md CHANGED

@@ -23,8 +23,10 @@ The primary purpose of this document is to define **what is measured**, **where
 All analytics are based on the `events` collection and the following stable fields:
 - Core dimensions: `event_name`, `timestamp`, `session_id`
 - Behavior context: `benchmark`, `filters`
 - Visitor identity (approximate): `properties.visitor_id`
 - Change context: `properties.old_value`, `properties.new_value`, `properties.filter_name`
 Important event names:
@@ -51,7 +53,7 @@ Important event names:
 - **Definition**: Number of unique interaction sessions.
 - **Source fields**: `session_id`
 - **Calculation**:
-  - Sessions = count of distinct `session_id` in the selected time range
 ### 3) UV (Unique Visitors, Approximate)
@@ -59,7 +61,7 @@ Important event names:
 - **Source fields**: `properties.visitor_id`
 - **Calculation**:
   - Remove null/empty `properties.visitor_id`
-  - UV = count of distinct `properties.visitor_id`
 ### 4) Sessions Per Visitor
@@ -106,7 +108,7 @@ Important event names:
 - **Source fields**: `event_name`, `session_id`
 - **Calculation**:
   - For each `filter_change_`* event type:
-    - collect distinct `session_id`
     - coverage = distinct session count
 ---
@@ -122,12 +124,12 @@ Recommended session-level funnel:
 ### 9) Step Session Count
-- **Definition**: Number of sessions that reached each funnel step.
-- **Source fields**: `session_id`, `event_name`
 - **Calculation**:
   - Group events by `session_id`
-  - For each session, mark whether each step exists
-  - Count sessions satisfying each cumulative step condition
 ### 10) Step Conversion Rate
@@ -144,10 +146,10 @@ Recommended session-level funnel:
 ### 11) New Visitors
 - **Definition**: Visitors whose current period contains their first observed visit date.
-- **Source fields**: `event_name`, `timestamp`, `properties.visitor_id`
 - **Calculation**:
   - Use `page_view` events only
-  - For each `visitor_id`, find earliest timestamp (`first_seen`)
   - If event date equals `first_seen` date, classify as `new`
   - Count distinct `visitor_id` by period
@@ -160,6 +162,19 @@ Recommended session-level funnel:
   - If event date is later than first-seen date, classify as `returning`
   - Count distinct `visitor_id` by period
 ---
 ## Time Aggregation Rules
@@ -170,10 +185,11 @@ All trend metrics support these granularities:
 - `week` -> `%G-W%V` (ISO week)
 - `month` -> `%Y-%m`
-Time filtering is applied on converted event time:
-- Convert `timestamp` to datetime (`ts`)
-- Keep records where `start_time <= ts <= end_time`
 Optional benchmark filtering:
@@ -186,6 +202,27 @@ Optional benchmark filtering:
 1. `visitor_id` is an approximate identifier, not a strict user identity.
 2. For `filter_change_`*, `properties.new_value` may not always represent the actual final filter value; prefer `filters` snapshot for behavioral context.
 3. If `table_download` is not instrumented, funnel step 4 will under-report by design.
 ---
@@ -196,6 +233,19 @@ Only required runtime inputs:
 - MongoDB connection URI (`MONGO_URI`)
 - Mongo database/collection names (defaults supported)
 Local commands:
 ```bash
@@ -203,3 +253,10 @@ uv sync
 uv run leaderboard-analytics
 ```

 All analytics are based on the `events` collection and the following stable fields:
 - Core dimensions: `event_name`, `timestamp`, `session_id`
+- Preferred event time: `ts` as a MongoDB Date
 - Behavior context: `benchmark`, `filters`
 - Visitor identity (approximate): `properties.visitor_id`
+- Visitor IP for country analysis: `properties.ip`
 - Change context: `properties.old_value`, `properties.new_value`, `properties.filter_name`
 Important event names:
 - **Definition**: Number of unique interaction sessions.
 - **Source fields**: `session_id`
 - **Calculation**:
+  - Sessions = count of distinct non-empty `session_id` values in the selected time range
 ### 3) UV (Unique Visitors, Approximate)
 - **Source fields**: `properties.visitor_id`
 - **Calculation**:
   - Remove null/empty `properties.visitor_id`
+  - UV = count of distinct `properties.visitor_id` values in the selected time range
 ### 4) Sessions Per Visitor
 - **Source fields**: `event_name`, `session_id`
 - **Calculation**:
   - For each `filter_change_`* event type:
+    - collect distinct non-empty `session_id`
     - coverage = distinct session count
 ---
 ### 9) Step Session Count
+- **Definition**: Number of sessions that reached each ordered funnel step.
+- **Source fields**: `session_id`, `event_name`, `ts` or `timestamp`
 - **Calculation**:
   - Group events by `session_id`
+  - Sort events by event time
+  - Count each cumulative step only when it occurs after the previous required step
 ### 10) Step Conversion Rate
 ### 11) New Visitors
 - **Definition**: Visitors whose current period contains their first observed visit date.
+- **Source fields**: `event_name`, `ts` or `timestamp`, `properties.visitor_id`
 - **Calculation**:
   - Use `page_view` events only
+  - For each `visitor_id`, find earliest timestamp (`first_seen`) from the full available dataset
   - If event date equals `first_seen` date, classify as `new`
   - Count distinct `visitor_id` by period
   - If event date is later than first-seen date, classify as `returning`
   - Count distinct `visitor_id` by period
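
The new/returning rules above depend on where `first_seen` is computed. A minimal Python sketch, assuming day-level periods and made-up sample data (the helper name `classify_visitors` is hypothetical and is not part of the PR): `first_seen` comes from the full history, and only afterwards is the reporting range applied.

```python
from datetime import date


def classify_visitors(page_views, start, end):
    """Count new vs returning visitors per day.

    page_views: list of (visitor_id, day) covering the FULL history,
    not just the reporting range.
    """
    # Pass 1: earliest page view per visitor, over the full history.
    first_seen = {}
    for visitor_id, day in page_views:
        if visitor_id and (visitor_id not in first_seen or day < first_seen[visitor_id]):
            first_seen[visitor_id] = day

    # Pass 2: apply the reporting range, then classify each page view.
    groups = {}
    for visitor_id, day in page_views:
        if not visitor_id or not (start <= day <= end):
            continue
        label = "new" if day == first_seen[visitor_id] else "returning"
        groups.setdefault(day, {"new": set(), "returning": set()})[label].add(visitor_id)

    return {day: {k: len(v) for k, v in g.items()} for day, g in groups.items()}


history = [
    ("v1", date(2024, 1, 1)),   # before the reporting range
    ("v1", date(2024, 1, 10)),
    ("v2", date(2024, 1, 10)),
]
result = classify_visitors(history, date(2024, 1, 8), date(2024, 1, 12))
print(result)
```

If `first_seen` were computed only from events inside the range, `v1` would be misclassified as new on 2024-01-10.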
+### 13) Visitor Locations by Country
+- **Definition**: Page view volume by visitor IP country/region.
+- **Source fields**: `event_name`, `properties.ip`
+- **Calculation**:
+  - Filter `event_name == "page_view"`
+  - Remove null/empty `properties.ip`
+  - Group page views by IP in MongoDB
+  - Resolve each IP to a country using the local MaxMind GeoLite2 Country database
+  - Group by `country_code` and `country_name`
+  - Map color = page view count (`pv`)
+  - Private, invalid, unresolved, or unconfigured IPs are grouped as `Unknown`
 ---
 ## Time Aggregation Rules
 - `week` -> `%G-W%V` (ISO week)
 - `month` -> `%Y-%m`
+Time filtering rules:
+- Prefer the indexed MongoDB Date field `ts`
+- Fall back to converting legacy `timestamp` values when `ts` is not present
+- Keep records where `start_time <= event time <= end_time`
 Optional benchmark filtering:
 1. `visitor_id` is an approximate identifier, not a strict user identity.
 2. For `filter_change_`*, `properties.new_value` may not always represent the actual final filter value; prefer `filters` snapshot for behavioral context.
 3. If `table_download` is not instrumented, funnel step 4 will under-report by design.
+4. Total UV and Sessions are distinct counts across the full selected time range. They are not calculated by summing per-period trend values.
+5. Funnel steps are ordered by event time. A session only reaches a later step when that step happens after the previous required step.
+---
+## MongoDB Performance Notes
+For production deployments, store event time as a MongoDB Date field named `ts`. Keeping only string timestamps forces aggregation pipelines to convert time values at query time and can reduce index usage.
+Recommended indexes:
+```javascript
+db.events.createIndex({ ts: 1 })
+db.events.createIndex({ ts: 1, benchmark: 1 })
+db.events.createIndex({ event_name: 1, ts: 1 })
+db.events.createIndex({ session_id: 1, ts: 1 })
+db.events.createIndex({ "properties.visitor_id": 1, ts: 1 })
+db.events.createIndex({ event_name: 1, ts: 1, "properties.ip": 1 })
+```
+Legacy events with only `timestamp` remain supported, but backfilling `ts` is recommended before running this dashboard against large collections.
 ---
 - MongoDB connection URI (`MONGO_URI`)
 - Mongo database/collection names (defaults supported)
+Optional visitor location input:
+- `GEOIP_DATABASE_PATH`: path to a local MaxMind `GeoLite2-Country.mmdb` file
+- `GEOIP_DATABASE_URL`: URL for a gzipped GeoLite2 Country MMDB download
+- `GEOIP_AUTO_DOWNLOAD`: whether to download and decompress the MMDB when missing
+The dashboard does not call an external IP lookup API for visitor lookups. By default,
+startup downloads `https://cdn.jsdelivr.net/npm/geolite2-country/GeoLite2-Country.mmdb.gz`
+when `GEOIP_DATABASE_PATH` is missing, decompresses it, and uses the resulting MMDB file
+locally. Set `GEOIP_AUTO_DOWNLOAD=false` if the runtime cannot access the network or if
+you prefer to mount the MMDB yourself. If the database is unavailable, visitor location
+rows are grouped as `Unknown`.
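
The `Unknown` grouping described above can be sketched in Python as follows. This is an illustration under assumptions, not the code in `services.py`: `resolve_country` and `pv_by_country` are hypothetical names, `lookup` stands in for a GeoLite2 reader call, the country table is made up, and using `ipaddress`'s `is_global` as the "private" test is one possible choice.

```python
import ipaddress
from collections import Counter

UNKNOWN = ("Unknown", "Unknown")


def resolve_country(ip, lookup):
    """Return (country_code, country_name), or UNKNOWN when unresolvable.

    `lookup` is a hypothetical callable standing in for a local
    GeoLite2 Country lookup; it returns a (code, name) tuple or None.
    """
    if not ip:
        return UNKNOWN
    try:
        address = ipaddress.ip_address(ip)
    except ValueError:
        return UNKNOWN          # not a valid IPv4/IPv6 literal
    if not address.is_global:
        return UNKNOWN          # private, loopback, reserved, ...
    return lookup(str(address)) or UNKNOWN


def pv_by_country(pv_by_ip, lookup):
    """Roll per-IP page view counts up to per-country counts."""
    totals = Counter()
    for ip, pv in pv_by_ip.items():
        totals[resolve_country(ip, lookup)] += pv
    return totals


# Made-up lookup table used only for this example.
fake_table = {"8.8.8.8": ("US", "United States"), "1.1.1.1": ("AU", "Australia")}
sample = {"8.8.8.8": 3, "1.1.1.1": 2, "10.0.0.7": 4, "not-an-ip": 1}
totals = pv_by_country(sample, fake_table.get)
print(totals)
```

Grouping page views by IP first (as the README describes for MongoDB) keeps the number of lookups proportional to distinct IPs, not to events.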
 Local commands:
 ```bash
 uv run leaderboard-analytics
 ```
+Run quality checks:
+```bash
+uv run ruff format --check .
+uv run ruff check .
+uv run pytest
+```
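
The ordered funnel rule documented in this README (a step counts only when it occurs after the previous required step) can be sketched in plain Python. This is a simplified illustration, not the aggregation pipeline in `repositories.py`: `funnel_depth` is a hypothetical helper, the `filter_change_*` suffixes are made up, and events with identical timestamps keep their input order here.

```python
from datetime import datetime

# Step matchers in funnel order, following the README's event names.
STEP_MATCHERS = [
    lambda name: name == "page_view",
    lambda name: name == "benchmark_change",
    lambda name: name.startswith("filter_change_"),
    lambda name: name == "table_download",
]


def funnel_depth(events):
    """Return how many ordered funnel steps one session reached.

    events: list of (event_name, event_time) for a single session.
    """
    depth = 0
    for name, _ in sorted(events, key=lambda event: event[1]):
        if depth < len(STEP_MATCHERS) and STEP_MATCHERS[depth](name):
            depth += 1
    return depth


def t(second):
    return datetime(2024, 1, 1, 12, 0, second)


sessions = {
    "a": [("page_view", t(0)), ("benchmark_change", t(1)), ("filter_change_language", t(2))],
    "b": [("benchmark_change", t(0)), ("page_view", t(1))],  # change happened too early
    "c": [("page_view", t(0)), ("filter_change_task", t(1)), ("table_download", t(2))],
}
depths = {sid: funnel_depth(evts) for sid, evts in sessions.items()}
step_counts = [sum(1 for d in depths.values() if d >= k) for k in range(1, 5)]
print(depths, step_counts)
```

Session `b` stops at step 1 because its `benchmark_change` precedes the page view, and session `c` stops at step 1 because it never changes the benchmark, so its later events do not count.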

app.py CHANGED

@@ -1,5 +1,5 @@
-from pathlib import Path
 import sys
 # Ensure src-layout package is importable in Hugging Face Spaces runtime.
 ROOT_DIR = Path(__file__).resolve().parent
@@ -7,8 +7,9 @@ SRC_DIR = ROOT_DIR / "src"
 if str(SRC_DIR) not in sys.path:
     sys.path.insert(0, str(SRC_DIR))
-from leaderboard_analytics.main import run
 if __name__ == "__main__":
-    run()

 import sys
+from pathlib import Path
 # Ensure src-layout package is importable in Hugging Face Spaces runtime.
 ROOT_DIR = Path(__file__).resolve().parent
 if str(SRC_DIR) not in sys.path:
     sys.path.insert(0, str(SRC_DIR))
+from leaderboard_analytics.main import create_demo, launch_demo  # noqa: E402
+demo = create_demo()
 if __name__ == "__main__":
+    launch_demo(demo)

pyproject.toml CHANGED

@@ -12,6 +12,13 @@ dependencies = [
   "python-dotenv>=1.0.1",
   "pandas>=2.2.3",
   "plotly>=5.24.1",
 ]
 [tool.ruff]
@@ -34,3 +41,6 @@ build-backend = "hatchling.build"
 [tool.hatch.build.targets.wheel]
 packages = ["src/leaderboard_analytics"]

   "python-dotenv>=1.0.1",
   "pandas>=2.2.3",
   "plotly>=5.24.1",
+  "geoip2>=4.8.0",
+]
+[project.optional-dependencies]
+dev = [
+  "pytest>=8.3.0",
+  "ruff>=0.8.0",
 ]
 [tool.ruff]
 [tool.hatch.build.targets.wheel]
 packages = ["src/leaderboard_analytics"]
+[tool.pytest.ini_options]
+pythonpath = ["src"]

requirements.txt CHANGED

@@ -5,3 +5,4 @@ pydantic-settings>=2.6.0
 python-dotenv>=1.0.1
 pandas>=2.2.3
 plotly>=5.24.1

 python-dotenv>=1.0.1
 pandas>=2.2.3
 plotly>=5.24.1
+geoip2>=4.8.0

src/leaderboard_analytics/__init__.py CHANGED

@@ -1,2 +1 @@
 """Leaderboard analytics package."""
-

 """Leaderboard analytics package."""

src/leaderboard_analytics/config.py CHANGED

@@ -12,9 +12,14 @@ class Settings(BaseSettings):
     host: str = "0.0.0.0"
     port: int = 7860
     gradio_share: bool = False
 @lru_cache(maxsize=1)
 def get_settings() -> Settings:
     return Settings()

     host: str = "0.0.0.0"
     port: int = 7860
     gradio_share: bool = False
+    gradio_ssr_mode: bool = False
+    geoip_database_path: str = "GeoLite2-Country.mmdb"
+    geoip_database_url: str = (
+        "https://cdn.jsdelivr.net/npm/geolite2-country/GeoLite2-Country.mmdb.gz"
+    )
+    geoip_auto_download: bool = True
 @lru_cache(maxsize=1)
 def get_settings() -> Settings:
     return Settings()

src/leaderboard_analytics/db.py CHANGED

@@ -9,7 +9,9 @@ def get_mongo_client() -> MongoClient:
     settings = get_settings()
     if not settings.mongo_uri:
         raise ValueError("MONGO_URI is not configured. Please set MONGO_URI in .env file.")
-    return MongoClient(settings.mongo_uri)
 def get_database(client: MongoClient) -> Database:
@@ -20,4 +22,3 @@ def get_database(client: MongoClient) -> Database:
 def get_events_collection(db: Database) -> Collection:
     settings = get_settings()
     return db[settings.mongo_collection]

     settings = get_settings()
     if not settings.mongo_uri:
         raise ValueError("MONGO_URI is not configured. Please set MONGO_URI in .env file.")
+    client = MongoClient(settings.mongo_uri, serverSelectionTimeoutMS=5000)
+    client.admin.command("ping")
+    return client
 def get_database(client: MongoClient) -> Database:
 def get_events_collection(db: Database) -> Collection:
     settings = get_settings()
     return db[settings.mongo_collection]

src/leaderboard_analytics/geoip_database.py ADDED

@@ -0,0 +1,36 @@

+import gzip
+import shutil
+import tempfile
+from pathlib import Path
+from urllib.request import urlopen
+DEFAULT_GEOIP_DATABASE_URL = (
+    "https://cdn.jsdelivr.net/npm/geolite2-country/GeoLite2-Country.mmdb.gz"
+)
+def ensure_geoip_database(
+    database_path: str | Path,
+    source_url: str = DEFAULT_GEOIP_DATABASE_URL,
+    *,
+    auto_download: bool = True,
+    timeout: float = 30.0,
+) -> Path:
+    target_path = Path(database_path)
+    if target_path.exists() or not auto_download:
+        return target_path
+    target_path.parent.mkdir(parents=True, exist_ok=True)
+    with tempfile.NamedTemporaryFile(
+        prefix=f"{target_path.name}.",
+        suffix=".tmp",
+        dir=target_path.parent,
+        delete=False,
+    ) as temp_file:
+        temp_path = Path(temp_file.name)
+        with urlopen(source_url, timeout=timeout) as response:
+            with gzip.GzipFile(fileobj=response) as gzip_file:
+                shutil.copyfileobj(gzip_file, temp_file)
+    temp_path.replace(target_path)
+    return target_path
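
Review note: as far as the diff shows, the temporary file is created with `delete=False`, so an exception during download or decompression would leave a `*.tmp` file next to the target. Below is a hedged sketch of the same "decompress to a temp file, then rename" pattern with cleanup on failure. The helper name `decompress_gzip_atomically` is hypothetical, and an in-memory `io.BytesIO` stands in for the `urlopen` response so the example runs without network access.

```python
import gzip
import io
import shutil
import tempfile
from pathlib import Path


def decompress_gzip_atomically(source, target_path: Path) -> Path:
    """Stream-decompress a gzip file object into target_path.

    The data is written to a temp file in the same directory and renamed
    into place only on success; the temp file is removed on failure.
    """
    target_path.parent.mkdir(parents=True, exist_ok=True)
    temp_file = tempfile.NamedTemporaryFile(
        prefix=f"{target_path.name}.",
        suffix=".tmp",
        dir=target_path.parent,
        delete=False,
    )
    temp_path = Path(temp_file.name)
    try:
        with temp_file, gzip.GzipFile(fileobj=source) as gzip_file:
            shutil.copyfileobj(gzip_file, temp_file)
        temp_path.replace(target_path)  # same directory, so a plain rename
    except BaseException:
        temp_path.unlink(missing_ok=True)
        raise
    return target_path


with tempfile.TemporaryDirectory() as workdir:
    payload = b"placeholder bytes, not a real MMDB"
    target = Path(workdir) / "geoip" / "sample.mmdb"
    decompress_gzip_atomically(io.BytesIO(gzip.compress(payload)), target)
    print(target.read_bytes() == payload)
```

Writing the temp file in the target's own directory keeps the final `replace` on one filesystem, which is what makes the rename step safe against partially written targets.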

src/leaderboard_analytics/main.py CHANGED

@@ -1,22 +1,49 @@
 from leaderboard_analytics.config import get_settings
 from leaderboard_analytics.db import get_database, get_events_collection, get_mongo_client
 from leaderboard_analytics.repositories import AnalyticsRepository
 from leaderboard_analytics.services import AnalyticsService
 from leaderboard_analytics.ui import build_dashboard
-def run() -> None:
     settings = get_settings()
     client = get_mongo_client()
     db = get_database(client)
     events_collection = get_events_collection(db)
     repository = AnalyticsRepository(events_collection=events_collection)
-    service = AnalyticsService(repository=repository)
-    demo = build_dashboard(service=service)
-    demo.launch(server_name=settings.host, server_port=settings.port, share=settings.gradio_share)
 if __name__ == "__main__":
     run()

 from leaderboard_analytics.config import get_settings
 from leaderboard_analytics.db import get_database, get_events_collection, get_mongo_client
+from leaderboard_analytics.geoip_database import ensure_geoip_database
 from leaderboard_analytics.repositories import AnalyticsRepository
 from leaderboard_analytics.services import AnalyticsService
 from leaderboard_analytics.ui import build_dashboard
+def create_demo():
     settings = get_settings()
     client = get_mongo_client()
     db = get_database(client)
     events_collection = get_events_collection(db)
+    geoip_database_path = settings.geoip_database_path
+    try:
+        geoip_database_path = str(
+            ensure_geoip_database(
+                settings.geoip_database_path,
+                settings.geoip_database_url,
+                auto_download=settings.geoip_auto_download,
+            )
+        )
+    except Exception as exc:
+        print(f"GeoIP database download failed: {exc}")
     repository = AnalyticsRepository(events_collection=events_collection)
+    service = AnalyticsService(
+        repository=repository,
+        geoip_database_path=geoip_database_path,
+    )
+    return build_dashboard(service=service)
+def launch_demo(demo) -> None:
+    settings = get_settings()
+    demo.launch(
+        server_name=settings.host,
+        server_port=settings.port,
+        share=settings.gradio_share,
+        ssr_mode=settings.gradio_ssr_mode,
+    )
+def run() -> None:
+    launch_demo(create_demo())
 if __name__ == "__main__":
     run()

src/leaderboard_analytics/repositories.py CHANGED

@@ -11,12 +11,34 @@ def _period_expression(granularity: Granularity) -> dict:
         Granularity.WEEK: "%G-W%V",
         Granularity.MONTH: "%Y-%m",
     }
-    return {"$dateToString": {"format": format_map[granularity], "date": "$ts"}}
 def _with_time_and_optional_benchmark(filters: QueryFilters) -> dict:
     matcher: dict = {
-        "ts": {
             "$gte": filters.start_time,
             "$lte": filters.end_time,
         }
@@ -26,6 +48,23 @@ def _with_time_and_optional_benchmark(filters: QueryFilters) -> dict:
     return matcher
 class AnalyticsRepository:
     def __init__(self, events_collection: Collection) -> None:
         self.events_collection = events_collection
@@ -33,7 +72,8 @@ class AnalyticsRepository:
     def overview_timeseries(self, filters: QueryFilters) -> list[dict]:
         period_expr = _period_expression(filters.granularity)
         pipeline: list[dict] = [
-            {"$addFields": {"ts": {"$toDate": "$timestamp"}, "visitor_id": "$properties.visitor_id"}},
             {"$match": _with_time_and_optional_benchmark(filters)},
             {
                 "$group": {
@@ -50,27 +90,52 @@ class AnalyticsRepository:
                     "period": "$_id.period",
                     "pv": 1,
                     "event_count": 1,
-                    "session_count": {"$size": "$sessions"},
-                    "uv": {
-                        "$size": {
-                            "$filter": {
-                                "input": "$visitors",
-                                "as": "v",
-                                "cond": {"$and": [{"$ne": ["$$v", None]}, {"$ne": ["$$v", ""]}]},
-                            }
-                        }
-                    },
                 }
             },
             {"$sort": {"period": 1}},
         ]
         return list(self.events_collection.aggregate(pipeline))
     def benchmark_top(self, filters: QueryFilters, limit: int = 20) -> list[dict]:
         pipeline: list[dict] = [
-            {"$addFields": {"ts": {"$toDate": "$timestamp"}}},
-            {"$match": {**_with_time_and_optional_benchmark(filters), "event_name": "benchmark_change"}},
             {"$group": {"_id": "$properties.new_value", "count": {"$sum": 1}}},
             {"$project": {"_id": 0, "benchmark": "$_id", "count": 1}},
             {"$sort": {"count": -1}},
             {"$limit": limit},
@@ -79,20 +144,27 @@ class AnalyticsRepository:
     def filter_distribution(self, filters: QueryFilters) -> list[dict]:
         pipeline: list[dict] = [
-            {"$addFields": {"ts": {"$toDate": "$timestamp"}}},
             {
                 "$match": {
                     **_with_time_and_optional_benchmark(filters),
                     "event_name": {"$regex": "^filter_change_"},
                 }
             },
-            {"$group": {"_id": "$event_name", "count": {"$sum": 1}, "sessions": {"$addToSet": "$session_id"}}},
             {
                 "$project": {
                     "_id": 0,
                     "event_name": "$_id",
                     "count": 1,
-                    "session_coverage": {"$size": "$sessions"},
                 }
             },
             {"$sort": {"count": -1}},
@@ -101,41 +173,169 @@ class AnalyticsRepository:
     def funnel(self, filters: QueryFilters) -> list[dict]:
         pipeline: list[dict] = [
-            {"$addFields": {"ts": {"$toDate": "$timestamp"}}},
             {"$match": _with_time_and_optional_benchmark(filters)},
-            {"$group": {"_id": "$session_id", "events": {"$addToSet": "$event_name"}}},
             {
                 "$project": {
-                    "has_page_view": {"$in": ["page_view", "$events"]},
-                    "has_benchmark_change": {"$in": ["benchmark_change", "$events"]},
-                    "has_filter_change": {
-                        "$gt": [
                             {
-                                "$size": {
-                                    "$filter": {
-                                        "input": "$events",
-                                        "as": "e",
-                                        "cond": {"$regexMatch": {"input": "$$e", "regex": "^filter_change_"}},
-                                    }
                                 }
                             },
                             0,
                         ]
                     },
-                    "has_table_download": {"$in": ["table_download", "$events"]},
                 }
             },
             {
                 "$group": {
                     "_id": None,
-                    "step1_page_view": {"$sum": {"$cond": ["$has_page_view", 1, 0]}},
                     "step2_benchmark_change": {
-                        "$sum": {"$cond": [{"$and": ["$has_page_view", "$has_benchmark_change"]}, 1, 0]}
                     },
                     "step3_filter_change": {
                         "$sum": {
                             "$cond": [
-                                {"$and": ["$has_page_view", "$has_benchmark_change", "$has_filter_change"]},
                                 1,
                                 0,
                             ]
@@ -146,10 +346,10 @@ class AnalyticsRepository:
                             "$cond": [
                                 {
                                     "$and": [
-                                        "$has_page_view",
-                                        "$has_benchmark_change",
-                                        "$has_filter_change",
-                                        "$has_table_download",
                                     ]
                                 },
                                 1,
@@ -174,10 +374,9 @@ class AnalyticsRepository:
     def visitors_new_vs_returning(self, filters: QueryFilters) -> list[dict]:
         period_expr = _period_expression(filters.granularity)
         pipeline: list[dict] = [
-            {"$addFields": {"ts": {"$toDate": "$timestamp"}, "visitor_id": "$properties.visitor_id"}},
             {
                 "$match": {
-                    **_with_time_and_optional_benchmark(filters),
                     "event_name": "page_view",
                     "visitor_id": {"$nin": [None, ""]},
                 }
@@ -185,31 +384,80 @@ class AnalyticsRepository:
             {
                 "$setWindowFields": {
                     "partitionBy": "$visitor_id",
-                    "sortBy": {"ts": 1},
-                    "output": {"first_seen": {"$first": "$ts"}},
                 }
             },
             {
                 "$project": {
                     "period": period_expr,
-                    "is_new": {"$eq": [{"$dateToString": {"format": "%Y-%m-%d", "date": "$ts"}}, {"$dateToString": {"format": "%Y-%m-%d", "date": "$first_seen"}}]},
                     "visitor_id": 1,
                 }
             },
-            {"$group": {"_id": {"period": "$period", "is_new": "$is_new"}, "visitors": {"$addToSet": "$visitor_id"}}},
             {
                 "$project": {
                     "_id": 0,
                     "period": "$_id.period",
                     "is_new": "$_id.is_new",
-                    "visitor_count": {"$size": "$visitors"},
                 }
             },
             {"$sort": {"period": 1, "is_new": -1}},
         ]
         return list(self.events_collection.aggregate(pipeline))
     @staticmethod
     def safe_first(items: Iterable[dict]) -> dict:
         return next(iter(items), {})

         Granularity.WEEK: "%G-W%V",
         Granularity.MONTH: "%Y-%m",
     }
+    return {"$dateToString": {"format": format_map[granularity], "date": "$event_ts"}}
+def _with_normalized_time() -> dict:
+    return {
+        "$addFields": {
+            "event_ts": {"$ifNull": ["$ts", {"$toDate": "$timestamp"}]},
+            "visitor_id": "$properties.visitor_id",
+        }
+    }
+def _indexed_time_prefilter(filters: QueryFilters) -> dict:
+    matcher: dict = {
+        "$or": [
+            {"ts": {"$gte": filters.start_time, "$lte": filters.end_time}},
+            {"ts": None},
+            {"ts": {"$exists": False}},
+        ]
+    }
+    if filters.benchmark:
+        matcher["benchmark"] = filters.benchmark
+    return matcher
 def _with_time_and_optional_benchmark(filters: QueryFilters) -> dict:
     matcher: dict = {
+        "event_ts": {
             "$gte": filters.start_time,
             "$lte": filters.end_time,
         }
     return matcher
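
Review note: the three helpers above form a two-stage time filter. The prefilter keeps documents whose `ts` is in range plus every document without a usable `ts`; after `event_ts` is derived, the second `$match` applies the exact range. Below is a pure-Python emulation of that flow on made-up documents, intended only to show which documents survive each stage. The function names are hypothetical, and `datetime.fromisoformat` stands in for MongoDB's `$toDate`.

```python
from datetime import datetime


def passes_prefilter(doc, start, end):
    """Stage 1: in-range `ts`, or `ts` missing/None (legacy documents)."""
    ts = doc.get("ts")
    return ts is None or start <= ts <= end


def normalized_time(doc):
    """Stage 2 input: `ts` if present, else the parsed legacy `timestamp`."""
    ts = doc.get("ts")
    return ts if ts is not None else datetime.fromisoformat(doc["timestamp"])


def filter_events(docs, start, end):
    prefiltered = [d for d in docs if passes_prefilter(d, start, end)]
    kept = [d for d in prefiltered if start <= normalized_time(d) <= end]
    return prefiltered, kept


start, end = datetime(2024, 1, 1), datetime(2024, 1, 31)
docs = [
    {"_id": "a", "ts": datetime(2024, 1, 10)},                   # indexed, in range
    {"_id": "b", "ts": datetime(2024, 3, 1)},                    # indexed, out of range
    {"_id": "c", "timestamp": "2024-01-15T10:00:00"},            # legacy, in range
    {"_id": "d", "ts": None, "timestamp": "2023-12-01T08:00:00"},  # legacy, out of range
]
prefiltered, kept = filter_events(docs, start, end)
print([d["_id"] for d in prefiltered], [d["_id"] for d in kept])
```

Only documents with an out-of-range `ts` are dropped early; every legacy document still reaches the conversion stage, which is consistent with the README's advice to backfill `ts` on large collections.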
+def _non_empty_set_size(field_name: str, variable_name: str) -> dict:
+    return {
+        "$size": {
+            "$filter": {
+                "input": f"${field_name}",
+                "as": variable_name,
+                "cond": {
+                    "$and": [
+                        {"$ne": [f"$${variable_name}", None]},
+                        {"$ne": [f"$${variable_name}", ""]},
+                    ]
+                },
+            }
+        }
+    }
 class AnalyticsRepository:
     def __init__(self, events_collection: Collection) -> None:
         self.events_collection = events_collection
     def overview_timeseries(self, filters: QueryFilters) -> list[dict]:
         period_expr = _period_expression(filters.granularity)
         pipeline: list[dict] = [
+            {"$match": _indexed_time_prefilter(filters)},
+            _with_normalized_time(),
             {"$match": _with_time_and_optional_benchmark(filters)},
             {
                 "$group": {
                     "period": "$_id.period",
                     "pv": 1,
                     "event_count": 1,
+                    "session_count": _non_empty_set_size("sessions", "s"),
+                    "uv": _non_empty_set_size("visitors", "v"),
                 }
             },
             {"$sort": {"period": 1}},
         ]
         return list(self.events_collection.aggregate(pipeline))
+    def overview_totals(self, filters: QueryFilters) -> dict:
+        pipeline: list[dict] = [
+            {"$match": _indexed_time_prefilter(filters)},
+            _with_normalized_time(),
+            {"$match": _with_time_and_optional_benchmark(filters)},
+            {
+                "$group": {
+                    "_id": None,
+                    "pv": {"$sum": {"$cond": [{"$eq": ["$event_name", "page_view"]}, 1, 0]}},
+                    "events": {"$sum": 1},
+                    "sessions": {"$addToSet": "$session_id"},
+                    "visitors": {"$addToSet": "$visitor_id"},
+                }
+            },
+            {
+                "$project": {
+                    "_id": 0,
+                    "pv": 1,
+                    "events": 1,
+                    "sessions": _non_empty_set_size("sessions", "s"),
+                    "uv": _non_empty_set_size("visitors", "v"),
+                }
+            },
+        ]
+        return self.safe_first(self.events_collection.aggregate(pipeline))
     def benchmark_top(self, filters: QueryFilters, limit: int = 20) -> list[dict]:
         pipeline: list[dict] = [
+            {"$match": _indexed_time_prefilter(filters)},
+            _with_normalized_time(),
+            {
+                "$match": {
+                    **_with_time_and_optional_benchmark(filters),
+                    "event_name": "benchmark_change",
+                }
+            },
             {"$group": {"_id": "$properties.new_value", "count": {"$sum": 1}}},
+            {"$match": {"_id": {"$nin": [None, ""]}}},
             {"$project": {"_id": 0, "benchmark": "$_id", "count": 1}},
             {"$sort": {"count": -1}},
             {"$limit": limit},
     def filter_distribution(self, filters: QueryFilters) -> list[dict]:
         pipeline: list[dict] = [
+            {"$match": _indexed_time_prefilter(filters)},
+            _with_normalized_time(),
             {
                 "$match": {
                     **_with_time_and_optional_benchmark(filters),
                     "event_name": {"$regex": "^filter_change_"},
                 }
             },
+            {
+                "$group": {
+                    "_id": "$event_name",
+                    "count": {"$sum": 1},
+                    "sessions": {"$addToSet": "$session_id"},
+                }
+            },
             {
                 "$project": {
                     "_id": 0,
                     "event_name": "$_id",
                     "count": 1,
+                    "session_coverage": _non_empty_set_size("sessions", "s"),
                 }
             },
             {"$sort": {"count": -1}},
     def funnel(self, filters: QueryFilters) -> list[dict]:
         pipeline: list[dict] = [
+            {"$match": _indexed_time_prefilter(filters)},
+            _with_normalized_time(),
             {"$match": _with_time_and_optional_benchmark(filters)},
+            {"$sort": {"session_id": 1, "event_ts": 1}},
+            {
+                "$group": {
+                    "_id": "$session_id",
+                    "events": {"$push": {"name": "$event_name", "ts": "$event_ts"}},
+                }
+            },
+            {"$match": {"_id": {"$nin": [None, ""]}}},
+            {
+                "$project": {
+                    "events": 1,
+                    "page_view_at": {
+                        "$arrayElemAt": [
+                            {
+                                "$map": {
+                                    "input": {
+                                        "$filter": {
+                                            "input": "$events",
+                                            "as": "event",
+                                            "cond": {"$eq": ["$$event.name", "page_view"]},
+                                        }
+                                    },
+                                    "as": "event",
+                                    "in": "$$event.ts",
+                                }
+                            },
+                            0,
+                        ]
+                    },
+                }
+            },
             {
                 "$project": {
+                    "events": 1,
+                    "page_view_at": 1,
+                    "benchmark_change_at": {
+                        "$arrayElemAt": [
                             {
+                                "$map": {
+                                    "input": {
+                                        "$filter": {
+                                            "input": "$events",
+                                            "as": "event",
+                                            "cond": {
+                                                "$and": [
+                                                    {"$eq": ["$$event.name", "benchmark_change"]},
+                                                    {"$gte": ["$$event.ts", "$page_view_at"]},
+                                                ]
+                                            },
+                                        }
+                                    },
+                                    "as": "event",
+                                    "in": "$$event.ts",
+                                }
+                            },
+                            0,
+                        ]
+                    },
+                }
+            },
+            {
+                "$project": {
+                    "events": 1,
+                    "page_view_at": 1,
+                    "benchmark_change_at": 1,
+                    "filter_change_at": {
+                        "$arrayElemAt": [
+                            {
+                                "$map": {
+                                    "input": {
+                                        "$filter": {
+                                            "input": "$events",
+                                            "as": "event",
+                                            "cond": {
+                                                "$and": [
+                                                    {
+                                                        "$regexMatch": {
+                                                            "input": "$$event.name",
+                                                            "regex": "^filter_change_",
+                                                        }
+                                                    },
+                                                    {
+                                                        "$gte": [
+                                                            "$$event.ts",
+                                                            "$benchmark_change_at",
+                                                        ]
+                                                    },
+                                                ]
+                                            },
+                                        }
+                                    },
+                                    "as": "event",
+                                    "in": "$$event.ts",
+                                }
+                            },
+                            0,
+                        ]
+                    },
+                }
+            },
+            {
+                "$project": {
+                    "page_view_at": 1,
+                    "benchmark_change_at": 1,
+                    "filter_change_at": 1,
+                    "table_download_at": {
+                        "$arrayElemAt": [
+                            {
+                                "$map": {
+                                    "input": {
+                                        "$filter": {
+                                            "input": "$events",
+                                            "as": "event",
+                                            "cond": {
+                                                "$and": [
+                                                    {"$eq": ["$$event.name", "table_download"]},
+                                                    {"$gte": ["$$event.ts", "$filter_change_at"]},
+                                                ]
+                                            },
+                                        }
+                                    },
+                                    "as": "event",
+                                    "in": "$$event.ts",
                                 }
                             },
                             0,
                         ]
                     },
                 }
             },
             {
                 "$group": {
                     "_id": None,
+                    "step1_page_view": {
+                        "$sum": {"$cond": [{"$ne": ["$page_view_at", None]}, 1, 0]}
+                    },
                     "step2_benchmark_change": {
+                        "$sum": {
+                            "$cond": [
+                                {
+                                    "$and": [
+                                        {"$ne": ["$page_view_at", None]},
+                                        {"$gte": ["$benchmark_change_at", "$page_view_at"]},
+                                    ]
+                                },
+                                1,
+                                0,
+                            ]
+                        }
                     },
                     "step3_filter_change": {
                         "$sum": {
                             "$cond": [
+                                {
+                                    "$and": [
+                                        {"$ne": ["$page_view_at", None]},
+                                        {"$gte": ["$benchmark_change_at", "$page_view_at"]},
+                                        {"$gte": ["$filter_change_at", "$benchmark_change_at"]},
+                                    ]
+                                },
                                 1,
                                 0,
                             ]
                             "$cond": [
                                 {
                                     "$and": [
+                                        {"$ne": ["$page_view_at", None]},
+                                        {"$gte": ["$benchmark_change_at", "$page_view_at"]},
+                                        {"$gte": ["$filter_change_at", "$benchmark_change_at"]},
+                                        {"$gte": ["$table_download_at", "$filter_change_at"]},
                                     ]
                                 },
                                 1,
     def visitors_new_vs_returning(self, filters: QueryFilters) -> list[dict]:
         period_expr = _period_expression(filters.granularity)
         pipeline: list[dict] = [
+            _with_normalized_time(),
             {
                 "$match": {
                     "event_name": "page_view",
                     "visitor_id": {"$nin": [None, ""]},
                 }
             {
                 "$setWindowFields": {
                     "partitionBy": "$visitor_id",
+                    "sortBy": {"event_ts": 1},
+                    "output": {"first_seen": {"$first": "$event_ts"}},
                 }
             },
+            {"$match": _with_time_and_optional_benchmark(filters)},
             {
                 "$project": {
                     "period": period_expr,
+                    "is_new": {
+                        "$eq": [
+                            {"$dateToString": {"format": "%Y-%m-%d", "date": "$event_ts"}},
+                            {"$dateToString": {"format": "%Y-%m-%d", "date": "$first_seen"}},
+                        ]
+                    },
                     "visitor_id": 1,
                 }
             },
+            {
+                "$group": {
+                    "_id": {"period": "$period", "is_new": "$is_new"},
+                    "visitors": {"$addToSet": "$visitor_id"},
+                }
+            },
             {
                 "$project": {
                     "_id": 0,
                     "period": "$_id.period",
                     "is_new": "$_id.is_new",
+                    "visitor_count": _non_empty_set_size("visitors", "v"),
                 }
             },
             {"$sort": {"period": 1, "is_new": -1}},
         ]
         return list(self.events_collection.aggregate(pipeline))
+    def visitor_ip_counts(self, filters: QueryFilters) -> list[dict]:
+        pipeline: list[dict] = [
+            {"$match": _indexed_time_prefilter(filters)},
+            _with_normalized_time(),
+            {
+                "$match": {
+                    **_with_time_and_optional_benchmark(filters),
+                    "event_name": "page_view",
+                    "properties.ip": {"$nin": [None, ""]},
+                }
+            },
+            {"$group": {"_id": "$properties.ip", "pv": {"$sum": 1}}},
+            {"$project": {"_id": 0, "ip": "$_id", "pv": 1}},
+            {"$sort": {"pv": -1}},
+        ]
+        return list(self.events_collection.aggregate(pipeline))
+    def available_benchmarks(
+        self, filters: QueryFilters | None = None, limit: int = 100
+    ) -> list[str]:
+        pipeline: list[dict] = []
+        if filters is not None:
+            pipeline.extend(
+                [
+                    {"$match": _indexed_time_prefilter(filters)},
+                    _with_normalized_time(),
+                    {"$match": _with_time_and_optional_benchmark(filters)},
+                ]
+            )
+        pipeline.extend(
+            [
+                {"$match": {"benchmark": {"$nin": [None, ""]}}},
+                {"$group": {"_id": "$benchmark"}},
+                {"$sort": {"_id": 1}},
+                {"$limit": limit},
+            ]
+        )
+        return [row["_id"] for row in self.events_collection.aggregate(pipeline)]
     @staticmethod
     def safe_first(items: Iterable[dict]) -> dict:
         return next(iter(items), {})
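
Review note on `funnel`: the ordered, per-session logic in the pipeline above is easier to check against a small in-memory reference. The following is a minimal sketch, not code from this PR. `funnel_counts` and `_first_at_or_after` are hypothetical helpers, and the `step4_table_download` key is an assumed name because the diff elides the fourth `$group` key. It treats a step as reached only when its event occurs at or after the previous step's timestamp, which is what the chained `$filter`/`$arrayElemAt` stages appear to intend.

```python
from datetime import datetime, timedelta, timezone

# Step matchers in funnel order; the third step is a prefix match,
# mirroring the "^filter_change_" regex used in the pipeline.
STEPS = ("page_view", "benchmark_change", "filter_change_", "table_download")
# Assumption: the fourth key name is not visible in the diff.
KEYS = (
    "step1_page_view",
    "step2_benchmark_change",
    "step3_filter_change",
    "step4_table_download",
)


def _first_at_or_after(events, step, not_before):
    """Return the timestamp of the first matching event at or after `not_before`."""
    for event in events:
        name = event["event_name"]
        matches = name.startswith(step) if step.endswith("_") else name == step
        if matches and (not_before is None or event["event_ts"] >= not_before):
            return event["event_ts"]
    return None


def funnel_counts(events):
    """Count sessions reaching each funnel step in order."""
    by_session = {}
    for event in events:
        session_id = event.get("session_id")
        if session_id in (None, ""):  # same exclusion as the $nin match
            continue
        by_session.setdefault(session_id, []).append(event)

    counts = dict.fromkeys(KEYS, 0)
    for session_events in by_session.values():
        session_events.sort(key=lambda e: e["event_ts"])
        cursor = None
        for key, step in zip(KEYS, STEPS):
            cursor = _first_at_or_after(session_events, step, cursor)
            if cursor is None:
                break  # later steps cannot count without this one
            counts[key] += 1
    return counts


t0 = datetime(2024, 1, 1, tzinfo=timezone.utc)


def _event(session_id, name, minutes):
    return {"session_id": session_id, "event_name": name, "event_ts": t0 + timedelta(minutes=minutes)}


sample_events = [
    _event("s1", "page_view", 0),
    _event("s1", "benchmark_change", 1),
    _event("s1", "filter_change_language", 2),
    _event("s1", "table_download", 3),
    _event("s2", "benchmark_change", 0),  # happens before the page view
    _event("s2", "page_view", 1),
    _event("s3", "page_view", 0),
    _event("s3", "filter_change_task", 1),  # skips benchmark_change
    _event("", "page_view", 0),  # empty session id is ignored
]
sample_counts = funnel_counts(sample_events)
```

A reference like this could back a test in `tests/test_repositories.py` by feeding the same fixture events to both the pipeline (against a test database) and the sketch, then comparing the step counts.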

src/leaderboard_analytics/schemas.py CHANGED

@@ -1,7 +1,7 @@
-from datetime import datetime, timezone
 from enum import StrEnum
-from pydantic import BaseModel, Field
 class Granularity(StrEnum):
@@ -12,9 +12,16 @@ class Granularity(StrEnum):
 class QueryFilters(BaseModel):
     start_time: datetime = Field(
-        default_factory=lambda: datetime.now(tz=timezone.utc).replace(hour=0, minute=0, second=0, microsecond=0)
     )
-    end_time: datetime = Field(default_factory=lambda: datetime.now(tz=timezone.utc))
     benchmark: str | None = None
     granularity: Granularity = Granularity.DAY

+from datetime import UTC, datetime
 from enum import StrEnum
+from pydantic import BaseModel, Field, model_validator
 class Granularity(StrEnum):
 class QueryFilters(BaseModel):
     start_time: datetime = Field(
+        default_factory=lambda: datetime.now(tz=UTC).replace(
+            hour=0, minute=0, second=0, microsecond=0
+        )
     )
+    end_time: datetime = Field(default_factory=lambda: datetime.now(tz=UTC))
     benchmark: str | None = None
     granularity: Granularity = Granularity.DAY
+    @model_validator(mode="after")
+    def validate_time_range(self) -> "QueryFilters":
+        if self.start_time > self.end_time:
+            raise ValueError("start_time must be earlier than or equal to end_time")
+        return self
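
Review note on `QueryFilters`: the default range and the new `validate_time_range` check can be illustrated without pydantic. This is a minimal standard-library sketch under stated assumptions. `default_range` and `check_time_range` are hypothetical names, not part of the PR. The sketch uses `timezone.utc` so it also runs on interpreters older than Python 3.11; `datetime.UTC`, which the diff switches to, is an alias for the same object on 3.11 and later.

```python
from datetime import datetime, timezone


def default_range(now=None):
    """Return (start of the current UTC day, now), like the field defaults."""
    now = now or datetime.now(tz=timezone.utc)
    start = now.replace(hour=0, minute=0, second=0, microsecond=0)
    return start, now


def check_time_range(start_time, end_time):
    """Mirror of the validator: reject ranges whose start is after the end."""
    if start_time > end_time:
        raise ValueError("start_time must be earlier than or equal to end_time")
    return start_time, end_time


fixed_now = datetime(2024, 5, 1, 13, 30, tzinfo=timezone.utc)
range_start, range_end = default_range(fixed_now)
check_time_range(range_start, range_end)  # passes: midnight <= 13:30

try:
    check_time_range(range_end, range_start)
    reversed_range_rejected = False
except ValueError:
    reversed_range_rejected = True
```

An equal start and end passes the check, which matches the "earlier than or equal to" wording of the error message.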

src/leaderboard_analytics/services.py CHANGED

@@ -1,35 +1,179 @@
 import pandas as pd
 from leaderboard_analytics.repositories import AnalyticsRepository
 from leaderboard_analytics.schemas import QueryFilters
 class AnalyticsService:
-    def __init__(self, repository: AnalyticsRepository) -> None:
         self.repository = repository
     def get_overview(self, filters: QueryFilters) -> tuple[pd.DataFrame, dict]:
         rows = self.repository.overview_timeseries(filters)
         frame = pd.DataFrame(rows)
-        if frame.empty:
-            empty = {
-                "pv": 0,
-                "uv": 0,
-                "sessions": 0,
-                "events": 0,
-                "events_per_session": 0.0,
-                "sessions_per_visitor": 0.0,
-            }
-            return frame, empty
         totals = {
-            "pv": int(frame["pv"].sum()),
-            "uv": int(frame["uv"].sum()),
-            "sessions": int(frame["session_count"].sum()),
-            "events": int(frame["event_count"].sum()),
         }
-        totals["events_per_session"] = round(totals["events"] / totals["sessions"], 2) if totals["sessions"] else 0.0
-        totals["sessions_per_visitor"] = round(totals["sessions"] / totals["uv"], 2) if totals["uv"] else 0.0
         return frame, totals
     def get_benchmark_top(self, filters: QueryFilters) -> pd.DataFrame:
@@ -60,3 +204,61 @@ class AnalyticsService:
         frame["visitor_type"] = frame["is_new"].map({True: "new", False: "returning"})
         return frame

+import ipaddress
+from pathlib import Path
+from typing import Any, Protocol
 import pandas as pd
 from leaderboard_analytics.repositories import AnalyticsRepository
 from leaderboard_analytics.schemas import QueryFilters
+UNKNOWN_COUNTRY_CODE = "Unknown"
+UNKNOWN_COUNTRY_NAME = "Unknown"
+def _empty_ip_debug() -> dict[str, object]:
+    return {
+        "total_unique_ips": 0,
+        "total_ip_pv": 0,
+        "global_ips": 0,
+        "global_ip_pv": 0,
+        "private_ips": 0,
+        "private_ip_pv": 0,
+        "loopback_ips": 0,
+        "loopback_ip_pv": 0,
+        "reserved_ips": 0,
+        "reserved_ip_pv": 0,
+        "link_local_ips": 0,
+        "link_local_ip_pv": 0,
+        "multicast_ips": 0,
+        "multicast_ip_pv": 0,
+        "unspecified_ips": 0,
+        "unspecified_ip_pv": 0,
+        "invalid_ips": 0,
+        "invalid_ip_pv": 0,
+        "top_ip_pv_buckets": {
+            "1": 0,
+            "2-10": 0,
+            "11-100": 0,
+            "101-1000": 0,
+            ">1000": 0,
+        },
+    }
+def _ip_debug_category(ip_address: str) -> str:
+    try:
+        parsed_ip = ipaddress.ip_address(ip_address.strip())
+    except ValueError:
+        return "invalid"
+    if parsed_ip.is_global:
+        return "global"
+    if parsed_ip.is_loopback:
+        return "loopback"
+    if parsed_ip.is_private:
+        return "private"
+    if parsed_ip.is_reserved:
+        return "reserved"
+    if parsed_ip.is_link_local:
+        return "link_local"
+    if parsed_ip.is_multicast:
+        return "multicast"
+    if parsed_ip.is_unspecified:
+        return "unspecified"
+    return "reserved"
+def _ip_pv_bucket(pv: int) -> str:
+    if pv <= 1:
+        return "1"
+    if pv <= 10:
+        return "2-10"
+    if pv <= 100:
+        return "11-100"
+    if pv <= 1000:
+        return "101-1000"
+    return ">1000"
+class GeoIpCountryReader(Protocol):
+    def country(self, ip_address: str) -> Any: ...
+class GeoIpResolver:
+    def __init__(
+        self,
+        database_path: str | Path | None = None,
+        reader: GeoIpCountryReader | None = None,
+    ) -> None:
+        self.database_path = Path(database_path) if database_path else None
+        self._reader = reader
+        self._load_attempted = reader is not None
+    def resolve_country(self, ip_address: str) -> tuple[str, str]:
+        try:
+            parsed_ip = ipaddress.ip_address(ip_address.strip())
+        except ValueError:
+            return UNKNOWN_COUNTRY_CODE, UNKNOWN_COUNTRY_NAME
+        if not parsed_ip.is_global:
+            return UNKNOWN_COUNTRY_CODE, UNKNOWN_COUNTRY_NAME
+        reader = self._get_reader()
+        if reader is None:
+            return UNKNOWN_COUNTRY_CODE, UNKNOWN_COUNTRY_NAME
+        try:
+            response = reader.country(str(parsed_ip))
+        except Exception:
+            return UNKNOWN_COUNTRY_CODE, UNKNOWN_COUNTRY_NAME
+        country = response.country
+        if not getattr(country, "iso_code", None):
+            country = response.registered_country
+        code = getattr(country, "iso_code", None)
+        if not code:
+            return UNKNOWN_COUNTRY_CODE, UNKNOWN_COUNTRY_NAME
+        return code, getattr(country, "name", None) or code
+    def debug_status(self) -> dict[str, object]:
+        return {
+            "database_path": str(self.database_path) if self.database_path else "",
+            "database_configured": self.database_path is not None,
+            "database_exists": self.database_path.exists() if self.database_path else False,
+            "load_attempted": self._load_attempted,
+            "reader_loaded": self._reader is not None,
+        }
+    def _get_reader(self) -> GeoIpCountryReader | None:
+        if self._reader is not None:
+            return self._reader
+        if self._load_attempted:
+            return None
+        self._load_attempted = True
+        if self.database_path is None or not self.database_path.exists():
+            return None
+        try:
+            import geoip2.database
+            self._reader = geoip2.database.Reader(str(self.database_path))
+        except Exception:
+            return None
+        return self._reader
 class AnalyticsService:
+    def __init__(
+        self,
+        repository: AnalyticsRepository,
+        geoip_database_path: str | Path | None = None,
+        geoip_resolver: GeoIpResolver | None = None,
+    ) -> None:
         self.repository = repository
+        self.geoip_resolver = geoip_resolver or GeoIpResolver(geoip_database_path)
     def get_overview(self, filters: QueryFilters) -> tuple[pd.DataFrame, dict]:
         rows = self.repository.overview_timeseries(filters)
         frame = pd.DataFrame(rows)
+        raw_totals = self.repository.overview_totals(filters)
         totals = {
+            "pv": int(raw_totals.get("pv", 0)),
+            "uv": int(raw_totals.get("uv", 0)),
+            "sessions": int(raw_totals.get("sessions", 0)),
+            "events": int(raw_totals.get("events", 0)),
         }
+        totals["events_per_session"] = (
+            round(totals["events"] / totals["sessions"], 2) if totals["sessions"] else 0.0
+        )
+        totals["sessions_per_visitor"] = (
+            round(totals["sessions"] / totals["uv"], 2) if totals["uv"] else 0.0
+        )
         return frame, totals
     def get_benchmark_top(self, filters: QueryFilters) -> pd.DataFrame:
         frame["visitor_type"] = frame["is_new"].map({True: "new", False: "returning"})
         return frame
+    def get_visitor_locations(self, filters: QueryFilters) -> pd.DataFrame:
+        frame, _debug = self.get_visitor_location_details(filters)
+        return frame
+    def get_visitor_location_details(self, filters: QueryFilters) -> tuple[pd.DataFrame, dict]:
+        locations: dict[tuple[str, str], dict[str, int | str]] = {}
+        ip_debug = _empty_ip_debug()
+        for row in self.repository.visitor_ip_counts(filters):
+            ip = str(row.get("ip", "")).strip()
+            if not ip:
+                continue
+            pv = int(row.get("pv", 0))
+            category = _ip_debug_category(ip)
+            ip_debug["total_unique_ips"] = int(ip_debug["total_unique_ips"]) + 1
+            ip_debug["total_ip_pv"] = int(ip_debug["total_ip_pv"]) + pv
+            ip_debug[f"{category}_ips"] = int(ip_debug[f"{category}_ips"]) + 1
+            ip_debug[f"{category}_ip_pv"] = int(ip_debug[f"{category}_ip_pv"]) + pv
+            ip_debug["top_ip_pv_buckets"][_ip_pv_bucket(pv)] += 1  # type: ignore[index]
+            code, name = self.geoip_resolver.resolve_country(ip)
+            key = (code, name)
+            if key not in locations:
+                locations[key] = {
+                    "country_code": code,
+                    "country_name": name,
+                    "pv": 0,
+                    "ip_count": 0,
+                }
+            locations[key]["pv"] = int(locations[key]["pv"]) + pv
+            locations[key]["ip_count"] = int(locations[key]["ip_count"]) + 1
+        frame = pd.DataFrame(
+            locations.values(),
+            columns=["country_code", "country_name", "pv", "ip_count"],
+        )
+        if frame.empty:
+            return frame, ip_debug
+        frame = frame.sort_values(["pv", "ip_count"], ascending=[False, False]).reset_index(
+            drop=True
+        )
+        return frame, ip_debug
+    def get_geoip_debug_info(self) -> dict[str, object]:
+        debug_status = getattr(self.geoip_resolver, "debug_status", None)
+        if debug_status is None:
+            return {
+                "database_path": "",
+                "database_configured": False,
+                "database_exists": False,
+                "load_attempted": False,
+                "reader_loaded": False,
+            }
+        return debug_status()
+    def get_available_benchmarks(self, filters: QueryFilters | None = None) -> list[str]:
+        return self.repository.available_benchmarks(filters)
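
Review note on `_ip_debug_category`: it may be worth double-checking the order of the checks. In Python's `ipaddress` module, several of the narrower properties overlap with `is_private`: link-local ranges such as `169.254.0.0/16` and the unspecified address `0.0.0.0` also report `is_private`, and multicast addresses may report `is_global`. With the order in the diff, those addresses would be counted in the `private` or `global` buckets, and the `link_local`, `multicast`, and `unspecified` counters would likely stay at zero. A minimal sketch of one possible reordering, testing the narrow categories first (`ip_debug_category_sketch` is a hypothetical name, not the PR's function):

```python
import ipaddress


def ip_debug_category_sketch(ip_address: str) -> str:
    """Classify an IP string, checking the most specific categories first."""
    try:
        parsed_ip = ipaddress.ip_address(ip_address.strip())
    except ValueError:
        return "invalid"
    # Narrow categories first, because they overlap with is_private / is_global.
    if parsed_ip.is_unspecified:
        return "unspecified"
    if parsed_ip.is_loopback:
        return "loopback"
    if parsed_ip.is_link_local:
        return "link_local"
    if parsed_ip.is_multicast:
        return "multicast"
    # Broad categories last.
    if parsed_ip.is_private:
        return "private"
    if parsed_ip.is_global:
        return "global"
    # Anything left (for example shared address space) falls back to "reserved",
    # as in the original function.
    return "reserved"


sample_categories = {
    ip: ip_debug_category_sketch(ip)
    for ip in ["8.8.8.8", "10.0.0.1", "127.0.0.1", "169.254.1.1", "224.0.0.1", "0.0.0.0", "not-an-ip"]
}
```

This only changes the debug bucket an address lands in. `GeoIpResolver.resolve_country` still gates lookups on `is_global`, so country resolution itself is unaffected by the ordering.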

src/leaderboard_analytics/ui.py CHANGED

@@ -1,9 +1,14 @@
-from datetime import datetime, timedelta, timezone
 import math
 from typing import Any
 import gradio as gr
 import plotly.express as px
 from leaderboard_analytics.schemas import Granularity, QueryFilters
 from leaderboard_analytics.services import AnalyticsService
@@ -19,7 +24,7 @@ def _to_utc_datetime(value: Any, fallback: datetime) -> datetime:
         if isinstance(value, float) and math.isnan(value):
             return fallback
         # Gradio DateTime may return Unix timestamps as numbers.
-        dt = datetime.fromtimestamp(value, tz=timezone.utc)
     elif isinstance(value, str):
         dt = datetime.fromisoformat(value)
     else:
@@ -27,60 +32,353 @@ def _to_utc_datetime(value: Any, fallback: datetime) -> datetime:
     # Gradio DateTime may return naive datetime values in local time.
     if dt.tzinfo is None:
-        dt = dt.replace(tzinfo=timezone.utc)
-    return dt.astimezone(timezone.utc)
 def build_dashboard(service: AnalyticsService) -> gr.Blocks:
-    default_end = datetime.now(tz=timezone.utc)
     default_start = (default_end - timedelta(days=7)).replace(microsecond=0)
     def query(
         start_time: datetime | str | None,
         end_time: datetime | str | None,
         benchmark: str,
         granularity: str,
-    ) -> tuple[str, object, object, object, object, object]:
-        filters = QueryFilters(
-            start_time=_to_utc_datetime(start_time, default_start),
-            end_time=_to_utc_datetime(end_time, default_end),
-            benchmark=benchmark or None,
-            granularity=Granularity(granularity),
-        )
-        overview_df, totals = service.get_overview(filters)
-        benchmark_df = service.get_benchmark_top(filters)
-        filter_df = service.get_filter_distribution(filters)
-        funnel_df = service.get_funnel(filters)
-        visitors_df = service.get_new_vs_returning(filters)
-        metrics = (
-            f"PV: {totals['pv']} | UV: {totals['uv']} | Sessions: {totals['sessions']} | "
-            f"Events/Session: {totals['events_per_session']} | Sessions/Visitor: {totals['sessions_per_visitor']}"
-        )
-        overview_plot = (
-            px.line(overview_df, x="period", y=["pv", "uv", "session_count"], title="Traffic overview")
-            if not overview_df.empty
-            else px.line(title="Traffic overview (no data)")
-        )
-        benchmark_plot = (
-            px.bar(benchmark_df, x="benchmark", y="count", title="Benchmark Top")
-            if not benchmark_df.empty
-            else px.bar(title="Benchmark Top (no data)")
-        )
-        filter_plot = (
-            px.bar(filter_df, x="event_name", y="count", title="Filter usage")
-            if not filter_df.empty
-            else px.bar(title="Filter usage (no data)")
-        )
-        funnel_plot = px.funnel(funnel_df, x="sessions", y="step", title="Session funnel")
-        visitor_plot = (
-            px.bar(visitors_df, x="period", y="visitor_count", color="visitor_type", barmode="group", title="New vs returning visitors")
-            if not visitors_df.empty
-            else px.bar(title="New vs returning visitors (no data)")
-        )
-        return metrics, overview_plot, benchmark_plot, filter_plot, funnel_plot, visitor_plot
     with gr.Blocks() as demo:
         gr.Markdown("# Leaderboard Analytics Dashboard")
@@ -100,7 +398,12 @@ def build_dashboard(service: AnalyticsService) -> gr.Blocks:
                 value=default_end,
                 timezone="UTC",
             )
-            benchmark = gr.Textbox(label="Benchmark (optional)", placeholder="MTEB(eng)")
             granularity = gr.Dropdown(
                 label="Granularity",
                 choices=[Granularity.DAY.value, Granularity.WEEK.value, Granularity.MONTH.value],
@@ -108,7 +411,9 @@ def build_dashboard(service: AnalyticsService) -> gr.Blocks:
             )
             refresh = gr.Button("Refresh", variant="primary")
-        metrics_text = gr.Markdown("PV: 0 | UV: 0 | Sessions: 0 | Events/Session: 0 | Sessions/Visitor: 0")
         with gr.Row():
             overview_plot = gr.Plot(label="Traffic Overview")
@@ -117,32 +422,60 @@ def build_dashboard(service: AnalyticsService) -> gr.Blocks:
             filter_plot = gr.Plot(label="Filter Behavior")
             funnel_plot = gr.Plot(label="Funnel")
         visitor_plot = gr.Plot(label="Visitor Segmentation")
         refresh.click(
             fn=query,
             inputs=[start_time, end_time, benchmark, granularity],
-            outputs=[
-                metrics_text,
-                overview_plot,
-                benchmark_plot,
-                filter_plot,
-                funnel_plot,
-                visitor_plot,
-            ],
         )
         demo.load(
             fn=query,
             inputs=[start_time, end_time, benchmark, granularity],
-            outputs=[
-                metrics_text,
-                overview_plot,
-                benchmark_plot,
-                filter_plot,
-                funnel_plot,
-                visitor_plot,
-            ],
         )
     return demo

 import math
+import tempfile
+import zipfile
+from datetime import UTC, datetime, timedelta
+from pathlib import Path
 from typing import Any
 import gradio as gr
+import pandas as pd
 import plotly.express as px
+import plotly.graph_objects as go
 from leaderboard_analytics.schemas import Granularity, QueryFilters
 from leaderboard_analytics.services import AnalyticsService
         if isinstance(value, float) and math.isnan(value):
             return fallback
         # Gradio DateTime may return Unix timestamps as numbers.
+        dt = datetime.fromtimestamp(value, tz=UTC)
     elif isinstance(value, str):
         dt = datetime.fromisoformat(value)
     else:
     # Gradio DateTime may return naive datetime values in local time.
     if dt.tzinfo is None:
+        dt = dt.replace(tzinfo=UTC)
+    return dt.astimezone(UTC)
+def _empty_plot(title: str):
+    return px.line(title=title)
+def _empty_map(title: str):
+    figure = go.Figure()
+    _style_visitor_location_map(figure, title)
+    return figure
+def _query_range_text(filters: QueryFilters) -> str:
+    return f"{filters.start_time.isoformat()} to {filters.end_time.isoformat()}"
+def _write_csv_archive(tables: dict[str, pd.DataFrame]) -> str | None:
+    if all(table.empty for table in tables.values()):
+        return None
+    archive = tempfile.NamedTemporaryFile(
+        prefix="leaderboard-analytics-", suffix=".zip", delete=False
+    )
+    archive.close()
+    with zipfile.ZipFile(archive.name, "w", compression=zipfile.ZIP_DEFLATED) as zip_file:
+        for name, table in tables.items():
+            zip_file.writestr(f"{name}.csv", table.to_csv(index=False))
+    return archive.name
+def _visitor_location_top_table(visitor_locations: pd.DataFrame) -> pd.DataFrame:
+    if visitor_locations.empty:
+        return pd.DataFrame(columns=["Region", "Users"])
+    return (
+        visitor_locations.sort_values(["ip_count", "pv"], ascending=[False, False])
+        .head(10)
+        .rename(columns={"country_name": "Region", "ip_count": "Users"})[["Region", "Users"]]
+        .reset_index(drop=True)
+    )
+def _visitor_location_debug_text(
+    visitor_locations: pd.DataFrame,
+    geoip_debug: dict[str, object],
+    ip_debug: dict[str, object] | None = None,
+) -> str:
+    if visitor_locations.empty:
+        total_pv = 0
+        total_users = 0
+        mapped_regions = 0
+        unknown_pv = 0
+        unknown_users = 0
+    else:
+        unknown_rows = visitor_locations[visitor_locations["country_code"] == "Unknown"]
+        mapped_rows = visitor_locations[visitor_locations["country_code"] != "Unknown"]
+        total_pv = int(visitor_locations["pv"].sum())
+        total_users = int(visitor_locations["ip_count"].sum())
+        mapped_regions = len(mapped_rows)
+        unknown_pv = int(unknown_rows["pv"].sum()) if not unknown_rows.empty else 0
+        unknown_users = int(unknown_rows["ip_count"].sum()) if not unknown_rows.empty else 0
+    configured = "yes" if geoip_debug.get("database_configured") else "no"
+    exists = "yes" if geoip_debug.get("database_exists") else "no"
+    loaded = "yes" if geoip_debug.get("reader_loaded") else "no"
+    attempted = "yes" if geoip_debug.get("load_attempted") else "no"
+    path = geoip_debug.get("database_path") or "(not configured)"
+    ip_debug = ip_debug or {}
+    global_ips = int(ip_debug.get("global_ips", 0))
+    global_pv = int(ip_debug.get("global_ip_pv", 0))
+    private_ips = int(ip_debug.get("private_ips", 0))
+    private_pv = int(ip_debug.get("private_ip_pv", 0))
+    loopback_ips = int(ip_debug.get("loopback_ips", 0))
+    loopback_pv = int(ip_debug.get("loopback_ip_pv", 0))
+    invalid_ips = int(ip_debug.get("invalid_ips", 0))
+    invalid_pv = int(ip_debug.get("invalid_ip_pv", 0))
+    buckets = ip_debug.get("top_ip_pv_buckets", {})
+    return (
+        f"GeoIP DB: configured={configured}, exists={exists}, loaded={loaded}, "
+        f"load_attempted={attempted}  \n"
+        f"GeoIP path: `{path}`  \n"
+        f"Total location PV: {total_pv} | Users/IPs: {total_users} | "
+        f"Mapped regions: {mapped_regions}  \n"
+        f"Unknown PV: {unknown_pv} | Unknown users/IPs: {unknown_users}  \n"
+        f"Public IPs: {global_ips} ({global_pv} PV) | Private IPs: {private_ips} "
+        f"({private_pv} PV)  \n"
+        f"Loopback IPs: {loopback_ips} ({loopback_pv} PV) | Invalid IPs: {invalid_ips} "
+        f"({invalid_pv} PV)  \n"
+        f"PV/IP buckets: {buckets}"
+    )
+def _style_visitor_location_map(figure: go.Figure, title: str) -> None:
+    figure.update_geos(
+        projection_type="mercator",
+        showframe=False,
+        showcoastlines=True,
+        coastlinecolor="#cfd6df",
+        coastlinewidth=0.6,
+        showcountries=True,
+        countrycolor="#cfd6df",
+        countrywidth=0.7,
+        showland=True,
+        landcolor="#eef2f7",
+        showocean=True,
+        oceancolor="#f8fafc",
+        showlakes=True,
+        lakecolor="#f8fafc",
+        bgcolor="#ffffff",
+        lataxis_range=[-55, 75],
+        lonaxis_range=[-180, 180],
+    )
+    figure.update_layout(
+        title={"text": title, "x": 0.02, "xanchor": "left"},
+        height=560,
+        paper_bgcolor="#ffffff",
+        plot_bgcolor="#ffffff",
+        font={"color": "#1f2937"},
+        margin={"l": 0, "r": 0, "t": 52, "b": 0},
+        showlegend=False,
+        hoverlabel={
+            "bgcolor": "#ffffff",
+            "bordercolor": "#3b82f6",
+            "font_color": "#111827",
+        },
+    )
+def _visitor_location_map(visitor_locations: pd.DataFrame, range_text: str) -> go.Figure:
+    map_df = (
+        visitor_locations[visitor_locations["country_code"] != "Unknown"].copy()
+        if not visitor_locations.empty
+        else visitor_locations.copy()
+    )
+    if map_df.empty:
+        return _empty_map(f"Visitor locations by country (no mapped data for {range_text})")
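+    max_pv = max(int(map_df["pv"].max()), 1)
+    # Plotly's bubble-sizing formula for sizemode="area": the largest marker is
+    # drawn about 52 px across and the others scale by area.
+    size_ref = 2.0 * max_pv / (52**2)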
+    figure = go.Figure(
+        go.Scattergeo(
+            locationmode="country names",
+            locations=map_df["country_name"],
+            mode="markers",
+            text=map_df["country_name"],
+            customdata=map_df[["country_code", "pv", "ip_count"]],
+            hovertemplate=(
+                "<b>%{text}</b><br>"
+                "Country code: %{customdata[0]}<br>"
+                "PV: %{customdata[1]:,}<br>"
+                "Users/IPs: %{customdata[2]:,}<extra></extra>"
+            ),
+            marker={
+                "size": map_df["pv"],
+                "sizemode": "area",
+                "sizeref": size_ref,
+                "sizemin": 8,
+                "color": "rgba(59, 130, 246, 0.55)",
+                "line": {"color": "rgba(37, 99, 235, 0.92)", "width": 1.2},
+            },
+        )
+    )
+    _style_visitor_location_map(figure, "Visitor locations by country")
+    figure.add_annotation(
+        x=0.02,
+        y=0.08,
+        xref="paper",
+        yref="paper",
+        text=(
+            f"Mapped regions: {len(map_df)}<br>"
+            f"Mapped PV: {int(map_df['pv'].sum()):,}<br>"
+            f"Users/IPs: {int(map_df['ip_count'].sum()):,}"
+        ),
+        showarrow=False,
+        align="left",
+        bgcolor="rgba(255, 255, 255, 0.88)",
+        bordercolor="rgba(148, 163, 184, 0.55)",
+        borderwidth=1,
+        font={"color": "#1f2937", "size": 12},
+    )
+    return figure
 def build_dashboard(service: AnalyticsService) -> gr.Blocks:
+    default_end = datetime.now(tz=UTC)
     default_start = (default_end - timedelta(days=7)).replace(microsecond=0)
+    def load_benchmarks() -> object:
+        try:
+            benchmarks = service.get_available_benchmarks()
+        except Exception:
+            # Fall back to an empty choice list if the benchmark lookup fails.
+            benchmarks = []
+        return gr.update(choices=[""] + benchmarks, value="")
     def query(
         start_time: datetime | str | None,
         end_time: datetime | str | None,
         benchmark: str,
         granularity: str,
+    ) -> tuple[
+        object,
+        object,
+        object,
+        object,
+        object,
+        object,
+        object,
+        object,
+        object,
+        object,
+        object,
+        object,
+        object,
+        object,
+        object,
+        object,
+    ]:
+        try:
+            filters = QueryFilters(
+                start_time=_to_utc_datetime(start_time, default_start),
+                end_time=_to_utc_datetime(end_time, default_end),
+                benchmark=benchmark or None,
+                granularity=Granularity(granularity),
+            )
+            overview_df, totals = service.get_overview(filters)
+            benchmark_df = service.get_benchmark_top(filters)
+            filter_df = service.get_filter_distribution(filters)
+            funnel_df = service.get_funnel(filters)
+            visitors_df = service.get_new_vs_returning(filters)
+            visitor_locations_df, ip_debug = service.get_visitor_location_details(filters)
+            visitor_locations_top_df = _visitor_location_top_table(visitor_locations_df)
+            visitor_locations_debug = _visitor_location_debug_text(
+                visitor_locations_df,
+                service.get_geoip_debug_info(),
+                ip_debug,
+            )
+            range_text = _query_range_text(filters)
+            if (
+                overview_df.empty
+                and benchmark_df.empty
+                and filter_df.empty
+                and visitors_df.empty
+                and visitor_locations_df.empty
+            ):
+                metrics = f"No data for {range_text}."
+            else:
+                metrics = (
+                    f"Range: {range_text}  \n"
+                    f"PV: {totals['pv']} | UV: {totals['uv']} | Sessions: {totals['sessions']} | "
+                    f"Events/Session: {totals['events_per_session']} | "
+                    f"Sessions/Visitor: {totals['sessions_per_visitor']}"
+                )
+            overview_plot = (
+                px.line(
+                    overview_df,
+                    x="period",
+                    y=["pv", "uv", "session_count"],
+                    title="Traffic overview",
+                )
+                if not overview_df.empty
+                else _empty_plot(f"Traffic overview (no data for {range_text})")
+            )
+            benchmark_plot = (
+                px.bar(benchmark_df, x="benchmark", y="count", title="Benchmark Top")
+                if not benchmark_df.empty
+                else px.bar(title=f"Benchmark Top (no data for {range_text})")
+            )
+            filter_plot = (
+                px.bar(filter_df, x="event_name", y="count", title="Filter usage")
+                if not filter_df.empty
+                else px.bar(title=f"Filter usage (no data for {range_text})")
+            )
+            funnel_plot = px.funnel(funnel_df, x="sessions", y="step", title="Session funnel")
+            visitor_plot = (
+                px.bar(
+                    visitors_df,
+                    x="period",
+                    y="visitor_count",
+                    color="visitor_type",
+                    barmode="group",
+                    title="New vs returning visitors",
+                )
+                if not visitors_df.empty
+                else px.bar(title=f"New vs returning visitors (no data for {range_text})")
+            )
+            visitor_locations_plot = _visitor_location_map(visitor_locations_df, range_text)
+            csv_archive = _write_csv_archive(
+                {
+                    "overview": overview_df,
+                    "benchmarks": benchmark_df,
+                    "filters": filter_df,
+                    "funnel": funnel_df,
+                    "visitors": visitors_df,
+                    "visitor_locations": visitor_locations_df,
+                }
+            )
+            return (
+                metrics,
+                overview_plot,
+                benchmark_plot,
+                filter_plot,
+                funnel_plot,
+                visitor_plot,
+                visitor_locations_plot,
+                visitor_locations_debug,
+                visitor_locations_top_df,
+                overview_df,
+                benchmark_df,
+                filter_df,
+                funnel_df,
+                visitors_df,
+                visitor_locations_df,
+                csv_archive,
+            )
+        except Exception as exc:
+            message = f"Query failed: {exc}"
+            empty = pd.DataFrame()
+            empty_top = pd.DataFrame(columns=["Region", "Users"])
+            return (
+                message,
+                _empty_plot(message),
+                px.bar(title=message),
+                px.bar(title=message),
+                px.funnel(
+                    pd.DataFrame({"step": [], "sessions": []}),
+                    x="sessions",
+                    y="step",
+                    title=message,
+                ),
+                px.bar(title=message),
+                _empty_map(message),
+                message,
+                empty_top,
+                empty,
+                empty,
+                empty,
+                empty,
+                empty,
+                empty,
+                None,
+            )
     with gr.Blocks() as demo:
         gr.Markdown("# Leaderboard Analytics Dashboard")
                 value=default_end,
                 timezone="UTC",
             )
+            benchmark = gr.Dropdown(
+                label="Benchmark",
+                choices=[""],
+                value="",
+                allow_custom_value=True,
+            )
             granularity = gr.Dropdown(
                 label="Granularity",
                 choices=[Granularity.DAY.value, Granularity.WEEK.value, Granularity.MONTH.value],
             )
             refresh = gr.Button("Refresh", variant="primary")
+        metrics_text = gr.Markdown(
+            "PV: 0 | UV: 0 | Sessions: 0 | Events/Session: 0 | Sessions/Visitor: 0"
+        )
         with gr.Row():
             overview_plot = gr.Plot(label="Traffic Overview")
             filter_plot = gr.Plot(label="Filter Behavior")
             funnel_plot = gr.Plot(label="Funnel")
         visitor_plot = gr.Plot(label="Visitor Segmentation")
+        with gr.Row():
+            with gr.Column(scale=2):
+                visitor_locations_plot = gr.Plot(label="Visitor Locations")
+            with gr.Column(scale=1):
+                visitor_locations_debug = gr.Markdown(
+                    "GeoIP DB: not checked  \n"
+                    "Total location PV: 0 | Users/IPs: 0 | Mapped regions: 0"
+                )
+                visitor_locations_top_table = gr.DataFrame(
+                    label="Top 10 Regions",
+                    interactive=False,
+                    wrap=True,
+                )
+        with gr.Accordion("Raw data", open=False):
+            csv_file = gr.File(label="CSV export")
+            overview_table = gr.DataFrame(label="Traffic Overview")
+            benchmark_table = gr.DataFrame(label="Benchmark Analysis")
+            filter_table = gr.DataFrame(label="Filter Behavior")
+            funnel_table = gr.DataFrame(label="Funnel")
+            visitor_table = gr.DataFrame(label="Visitor Segmentation")
+            visitor_locations_table = gr.DataFrame(label="Visitor Locations")
+        outputs = [
+            metrics_text,
+            overview_plot,
+            benchmark_plot,
+            filter_plot,
+            funnel_plot,
+            visitor_plot,
+            visitor_locations_plot,
+            visitor_locations_debug,
+            visitor_locations_top_table,
+            overview_table,
+            benchmark_table,
+            filter_table,
+            funnel_table,
+            visitor_table,
+            visitor_locations_table,
+            csv_file,
+        ]
         refresh.click(
             fn=query,
             inputs=[start_time, end_time, benchmark, granularity],
+            outputs=outputs,
         )
+        demo.load(fn=load_benchmarks, outputs=benchmark)
         demo.load(
             fn=query,
             inputs=[start_time, end_time, benchmark, granularity],
+            outputs=outputs,
         )
+    Path(tempfile.gettempdir()).mkdir(parents=True, exist_ok=True)
     return demo
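
The `size_ref` expression in `_visitor_location_map` has the same shape as the bubble-sizing formula in Plotly's documentation for `sizemode="area"`. The sketch below illustrates what that implies for relative marker sizes, assuming the rendered diameter grows with the square root of the value. The helper names `area_sizeref` and `approx_marker_diameter` are hypothetical and not part of this PR, and the `sizemin` floor is ignored.

```python
import math

MAX_DIAMETER_PX = 52  # same target diameter as the 52 used for size_ref


def area_sizeref(max_value: float, max_diameter_px: float = MAX_DIAMETER_PX) -> float:
    """Sizeref per Plotly's documented formula for sizemode='area'."""
    return 2.0 * max(max_value, 1) / (max_diameter_px**2)


def approx_marker_diameter(value: float, sizeref: float) -> float:
    """Approximate diameter in px, assuming marker area scales with value / sizeref."""
    return 2.0 * math.sqrt(value / (2.0 * sizeref))


sizeref = area_sizeref(400)
largest = approx_marker_diameter(400, sizeref)  # country with the highest PV
quarter = approx_marker_diameter(100, sizeref)  # country with a quarter of that PV
```

Under this assumption a country with a quarter of the maximum PV gets a marker half as wide, so very small countries would shrink quickly, which is presumably why the trace also sets `sizemin`.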

tests/test_geoip_database.py ADDED

	@@ -0,0 +1,29 @@

+import gzip
+from leaderboard_analytics.geoip_database import ensure_geoip_database
+def test_ensure_geoip_database_downloads_and_decompresses_gzip(tmp_path) -> None:
+    source = tmp_path / "GeoLite2-Country.mmdb.gz"
+    target = tmp_path / "GeoLite2-Country.mmdb"
+    expected_bytes = b"fake-mmdb-bytes"
+    with gzip.open(source, "wb") as gzip_file:
+        gzip_file.write(expected_bytes)
+    result = ensure_geoip_database(target, source.as_uri())
+    assert result == target
+    assert target.read_bytes() == expected_bytes
+def test_ensure_geoip_database_keeps_existing_file(tmp_path) -> None:
+    source = tmp_path / "missing.mmdb.gz"
+    target = tmp_path / "GeoLite2-Country.mmdb"
+    expected_bytes = b"existing-mmdb-bytes"
+    target.write_bytes(expected_bytes)
+    result = ensure_geoip_database(target, source.as_uri())
+    assert result == target
+    assert target.read_bytes() == expected_bytes
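
The two tests above pin down the contract of `ensure_geoip_database`: return the target path, skip the download when the file already exists, and otherwise fetch and gunzip the source. The body of `geoip_database.py` is not shown in this part of the diff, so the following is only a sketch of one way to satisfy that contract. The function name `ensure_geoip_database_sketch` and the `.part` temporary suffix are assumptions, not the PR's implementation.

```python
import gzip
import tempfile
from pathlib import Path
from urllib.request import urlopen


def ensure_geoip_database_sketch(target: Path, url: str) -> Path:
    """Hypothetical stand-in: download `url` to `target` unless it already exists."""
    if target.exists():
        return target  # keep an existing database untouched
    target.parent.mkdir(parents=True, exist_ok=True)
    with urlopen(url) as response:  # also handles file:// URIs, as the tests use
        payload = response.read()
    if url.endswith(".gz"):
        payload = gzip.decompress(payload)
    # Write to a sibling file first so a failed download never leaves a partial DB.
    partial = target.with_name(target.name + ".part")
    partial.write_bytes(payload)
    partial.replace(target)
    return target


# Usage mirroring the first test: a gzip source served through a file:// URI.
with tempfile.TemporaryDirectory() as tmp_dir:
    source = Path(tmp_dir) / "GeoLite2-Country.mmdb.gz"
    target = Path(tmp_dir) / "GeoLite2-Country.mmdb"
    with gzip.open(source, "wb") as handle:
        handle.write(b"fake-mmdb-bytes")
    resolved = ensure_geoip_database_sketch(target, source.as_uri())
    restored = resolved.read_bytes()
    resolved_matches_target = resolved == target
```

Reading the whole payload into memory is acceptable for a country-level database of a few megabytes; a streaming copy would be the safer choice for larger files.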

tests/test_repositories.py ADDED

	@@ -0,0 +1,95 @@

+from datetime import UTC, datetime
+from leaderboard_analytics.repositories import AnalyticsRepository
+from leaderboard_analytics.schemas import QueryFilters
+class CapturingCollection:
+    def __init__(self, rows: list[dict] | None = None) -> None:
+        self.rows = rows or []
+        self.pipeline: list[dict] | None = None
+    def aggregate(self, pipeline: list[dict]):
+        self.pipeline = pipeline
+        return iter(self.rows)
+def _filters() -> QueryFilters:
+    return QueryFilters(
+        start_time=datetime(2026, 1, 1, tzinfo=UTC),
+        end_time=datetime(2026, 1, 31, tzinfo=UTC),
+    )
+def test_funnel_pipeline_preserves_ordered_step_logic() -> None:
+    collection = CapturingCollection()
+    repository = AnalyticsRepository(collection)  # type: ignore[arg-type]
+    repository.funnel(_filters())
+    assert collection.pipeline is not None
+    assert {"$sort": {"session_id": 1, "event_ts": 1}} in collection.pipeline
+    assert any(
+        "$push" in stage.get("$group", {}).get("events", {}) for stage in collection.pipeline
+    )
+    assert not any(
+        "$addToSet" in str(stage) and "events" in str(stage) for stage in collection.pipeline
+    )
+    assert any(
+        "table_download_at" in str(stage) and "$filter_change_at" in str(stage)
+        for stage in collection.pipeline
+    )
+def test_new_vs_returning_pipeline_computes_first_seen_before_range_match() -> None:
+    collection = CapturingCollection()
+    repository = AnalyticsRepository(collection)  # type: ignore[arg-type]
+    repository.visitors_new_vs_returning(_filters())
+    assert collection.pipeline is not None
+    window_index = next(
+        i for i, stage in enumerate(collection.pipeline) if "$setWindowFields" in stage
+    )
+    range_match_index = next(
+        i
+        for i, stage in enumerate(collection.pipeline)
+        if stage.get("$match", {}).get("event_ts") is not None
+    )
+    assert window_index < range_match_index
+def test_overview_totals_filters_empty_identifiers() -> None:
+    collection = CapturingCollection([{"pv": 1, "uv": 1, "sessions": 1, "events": 2}])
+    repository = AnalyticsRepository(collection)  # type: ignore[arg-type]
+    totals = repository.overview_totals(_filters())
+    assert totals == {"pv": 1, "uv": 1, "sessions": 1, "events": 2}
+    assert collection.pipeline is not None
+    pipeline_text = str(collection.pipeline)
+    assert '"$sessions"' in pipeline_text or "'$sessions'" in pipeline_text
+    assert '"$visitors"' in pipeline_text or "'$visitors'" in pipeline_text
+    assert "$$s" in pipeline_text
+    assert "$$v" in pipeline_text
+def test_visitor_ip_counts_groups_page_view_ips_with_existing_filters() -> None:
+    collection = CapturingCollection([{"ip": "8.8.8.8", "pv": 3}])
+    repository = AnalyticsRepository(collection)  # type: ignore[arg-type]
+    filters = QueryFilters(
+        start_time=datetime(2026, 1, 1, tzinfo=UTC),
+        end_time=datetime(2026, 1, 31, tzinfo=UTC),
+        benchmark="MTEB",
+    )
+    rows = repository.visitor_ip_counts(filters)
+    assert rows == [{"ip": "8.8.8.8", "pv": 3}]
+    assert collection.pipeline is not None
+    pipeline_text = str(collection.pipeline)
+    assert "properties.ip" in pipeline_text
+    assert "page_view" in pipeline_text
+    assert "MTEB" in pipeline_text
+    assert "$nin" in pipeline_text
+    assert "$properties.ip" in pipeline_text

tests/test_schemas.py ADDED

	@@ -0,0 +1,16 @@

+from datetime import UTC, datetime
+import pytest
+from pydantic import ValidationError
+from leaderboard_analytics.schemas import QueryFilters
+def test_query_filters_rejects_invalid_time_range() -> None:
+    with pytest.raises(
+        ValidationError, match="start_time must be earlier than or equal to end_time"
+    ):
+        QueryFilters(
+            start_time=datetime(2026, 1, 2, tzinfo=UTC),
+            end_time=datetime(2026, 1, 1, tzinfo=UTC),
+        )

tests/test_services.py ADDED

	@@ -0,0 +1,110 @@

+from datetime import UTC, datetime
+from pathlib import Path
+from leaderboard_analytics.schemas import QueryFilters
+from leaderboard_analytics.services import AnalyticsService
+class FakeRepository:
+    def overview_timeseries(self, filters: QueryFilters) -> list[dict]:
+        return [
+            {"period": "2026-01-01", "pv": 2, "uv": 1, "session_count": 1, "event_count": 3},
+            {"period": "2026-01-02", "pv": 1, "uv": 1, "session_count": 1, "event_count": 2},
+        ]
+    def overview_totals(self, filters: QueryFilters) -> dict:
+        return {"pv": 3, "uv": 1, "sessions": 1, "events": 5}
+class LocationRepository:
+    def __init__(self, rows: list[dict]) -> None:
+        self.rows = rows
+    def visitor_ip_counts(self, filters: QueryFilters) -> list[dict]:
+        return self.rows
+class FakeGeoIpResolver:
+    def __init__(self, countries: dict[str, tuple[str, str]]) -> None:
+        self.countries = countries
+    def resolve_country(self, ip_address: str) -> tuple[str, str]:
+        return self.countries[ip_address]
+def test_overview_uses_full_range_distinct_totals() -> None:
+    service = AnalyticsService(FakeRepository())  # type: ignore[arg-type]
+    filters = QueryFilters(
+        start_time=datetime(2026, 1, 1, tzinfo=UTC),
+        end_time=datetime(2026, 1, 2, tzinfo=UTC),
+    )
+    frame, totals = service.get_overview(filters)
+    assert list(frame["period"]) == ["2026-01-01", "2026-01-02"]
+    assert totals == {
+        "pv": 3,
+        "uv": 1,
+        "sessions": 1,
+        "events": 5,
+        "events_per_session": 5.0,
+        "sessions_per_visitor": 1.0,
+    }
+def test_visitor_locations_groups_pv_and_ip_count_by_country() -> None:
+    repository = LocationRepository(
+        [
+            {"ip": "8.8.8.8", "pv": 3},
+            {"ip": "8.8.4.4", "pv": 2},
+            {"ip": "1.1.1.1", "pv": 4},
+        ]
+    )
+    resolver = FakeGeoIpResolver(
+        {
+            "8.8.8.8": ("US", "United States"),
+            "8.8.4.4": ("US", "United States"),
+            "1.1.1.1": ("AU", "Australia"),
+        }
+    )
+    service = AnalyticsService(
+        repository,  # type: ignore[arg-type]
+        geoip_resolver=resolver,  # type: ignore[arg-type]
+    )
+    frame = service.get_visitor_locations(
+        QueryFilters(
+            start_time=datetime(2026, 1, 1, tzinfo=UTC),
+            end_time=datetime(2026, 1, 2, tzinfo=UTC),
+        )
+    )
+    assert frame.to_dict("records") == [
+        {"country_code": "US", "country_name": "United States", "pv": 5, "ip_count": 2},
+        {"country_code": "AU", "country_name": "Australia", "pv": 4, "ip_count": 1},
+    ]
+def test_visitor_locations_groups_unresolved_ips_as_unknown() -> None:
+    repository = LocationRepository(
+        [
+            {"ip": "10.0.0.1", "pv": 2},
+            {"ip": "not-an-ip", "pv": 1},
+            {"ip": "8.8.8.8", "pv": 3},
+        ]
+    )
+    service = AnalyticsService(
+        repository,  # type: ignore[arg-type]
+        geoip_database_path=Path("missing-geolite2-country.mmdb"),
+    )
+    frame = service.get_visitor_locations(
+        QueryFilters(
+            start_time=datetime(2026, 1, 1, tzinfo=UTC),
+            end_time=datetime(2026, 1, 2, tzinfo=UTC),
+        )
+    )
+    assert frame.to_dict("records") == [
+        {"country_code": "Unknown", "country_name": "Unknown", "pv": 6, "ip_count": 3}
+    ]
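
The debug panel in `ui.py` reports separate counts for public, private, loopback, and invalid IPs, and the last test above expects private and malformed addresses to land in the `Unknown` bucket. The service code that produces those counts is not shown in this part of the diff. As a rough illustration only, the same four categories can be derived with the standard-library `ipaddress` module; the names `classify_ip` and `tally_ip_categories` are hypothetical.

```python
import ipaddress
from collections import Counter


def classify_ip(raw: str) -> str:
    """Return 'invalid', 'loopback', 'private', 'global', or 'other' for an IP string."""
    try:
        address = ipaddress.ip_address(raw)
    except ValueError:
        return "invalid"
    # Loopback addresses are also flagged as private, so check them first.
    if address.is_loopback:
        return "loopback"
    if address.is_private:
        return "private"
    if address.is_global:
        return "global"
    return "other"


def tally_ip_categories(rows: list[dict]) -> Counter:
    """Sum page views per category for rows shaped like {'ip': ..., 'pv': ...}."""
    totals: Counter = Counter()
    for row in rows:
        totals[classify_ip(str(row["ip"]))] += int(row["pv"])
    return totals


rows = [
    {"ip": "10.0.0.1", "pv": 2},
    {"ip": "not-an-ip", "pv": 1},
    {"ip": "8.8.8.8", "pv": 3},
    {"ip": "127.0.0.1", "pv": 4},
]
pv_by_category = tally_ip_categories(rows)
```

Only addresses in the `global` category can be resolved by a GeoIP country database, so a breakdown like this helps explain a large `Unknown` share before suspecting the database itself.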