GAAP & SEC Compliance Dataset
A comprehensive dataset for financial AI applications
Dataset Overview
This dataset contains 470,151 documents covering US GAAP (Generally Accepted Accounting Principles) standards and SEC (Securities and Exchange Commission) filing requirements. It's designed for training and evaluating AI systems for financial compliance, accounting Q&A, and regulatory analysis.
Key Statistics
- Total Documents: 470,151
- Average Length: 363 characters
- Unique Companies: 6,573
- Date Range: 2007-01-31 to 2025-12-01
- Dataset Size: ~296MB
Content Distribution
By Source
- XBRL: 445,211 (94.7%)
- SEC_FILING: 24,935 (5.3%)
- GAAP_STANDARD: 5 (<0.1%)
By Document Type
- tag: 445,211 (94.7%)
- financial_data: 24,935 (5.3%)
- standard: 5 (<0.1%)
By Category
- Other: 294,775 (62.7%)
- Expenses: 55,303 (11.8%)
- Assets: 35,592 (7.6%)
- Liabilities: 32,958 (7.0%)
- Income: 24,658 (5.2%)
- Equity: 19,732 (4.2%)
- Revenue: 7,133 (1.5%)
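The breakdowns above can be recomputed from the `metadata` fields. The following is a minimal sketch, assuming each record is a dict shaped like the Document Schema in this card. `distribution` is a hypothetical helper, and the sample records are made up for illustration rather than taken from the dataset.

```python
from collections import Counter


def distribution(records, field):
    """Count records per value of metadata[field].

    Returns (value, count, percent) rows, most common first.
    """
    counts = Counter(rec["metadata"].get(field, "Unknown") for rec in records)
    total = sum(counts.values())
    return [(value, n, round(100 * n / total, 1)) for value, n in counts.most_common()]


# Made-up sample records, not taken from the dataset
sample = [
    {"metadata": {"source": "XBRL", "category": "Assets"}},
    {"metadata": {"source": "XBRL", "category": "Other"}},
    {"metadata": {"source": "XBRL", "category": "Other"}},
    {"metadata": {"source": "SEC_FILING", "category": "Revenue"}},
]

for value, n, pct in distribution(sample, "source"):
    print(f"- {value}: {n} ({pct}%)")
```

The helper only needs an iterable of records, so passing the loaded `train` split in place of `sample` should work the same way.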
Use Cases
AI Chatbots
Build intelligent assistants for:
- GAAP compliance questions
- SEC filing analysis
- Accounting standard lookup
- Financial regulation guidance
Information Retrieval
Power search engines for:
- Financial document discovery
- Regulatory text mining
- Compliance research
- Academic studies
Machine Learning
Train models for:
- Financial text classification
- Accounting Q&A systems
- Regulatory NLP tasks
- Domain adaptation
Quick Start
Load Dataset
from datasets import load_dataset
# Load the full dataset into memory
dataset = load_dataset("aanshshah/gaap-sec-compliance-dataset")
# Or stream it for memory efficiency (iterate over streamed["train"] the same way)
streamed = load_dataset("aanshshah/gaap-sec-compliance-dataset", streaming=True)
# Access examples
for example in dataset["train"]:
    print(f"Title: {example['metadata']['title']}")
    print(f"Source: {example['metadata']['source']}")
    print(f"Content: {example['content'][:200]}...")
    break  # stop after the first example
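Because roughly 95% of the records are XBRL tags, many applications will want to work with a single source. A minimal sketch of such a filter follows, assuming records shaped like the Document Schema in this card. `filter_by_source` is a hypothetical helper, and the sample records and their ids are invented for illustration.

```python
def filter_by_source(records, source):
    """Yield only the records whose metadata['source'] equals `source`."""
    for rec in records:
        if rec["metadata"]["source"] == source:
            yield rec


# Invented sample records, not taken from the dataset
sample = [
    {"id": "doc_1", "content": "Assets tag", "metadata": {"source": "XBRL"}},
    {"id": "doc_2", "content": "10-K excerpt", "metadata": {"source": "SEC_FILING"}},
    {"id": "doc_3", "content": "Standard text", "metadata": {"source": "GAAP_STANDARD"}},
]

standards = list(filter_by_source(sample, "GAAP_STANDARD"))
print([doc["id"] for doc in standards])
```

Since the helper is a generator, it should also work lazily over a streamed split without loading everything into memory.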
Build RAG System
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
import faiss
# Load dataset
docs = load_dataset("aanshshah/gaap-sec-compliance-dataset")["train"]
# Create embeddings (a float32 NumPy array, one row per document)
encoder = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = encoder.encode([doc["content"] for doc in docs])
# Build FAISS index; the dimension is 384 for all-MiniLM-L6-v2
index = faiss.IndexFlatL2(embeddings.shape[1])
index.add(embeddings)
def search_docs(query, k=5):
    """Return the k documents closest to the query in embedding space."""
    query_vec = encoder.encode([query])
    _, indices = index.search(query_vec, k)
    # Cast NumPy integers to int before indexing the dataset
    return [docs[int(i)] for i in indices[0]]
# Example usage
results = search_docs("What is ASC 606?")
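`faiss.IndexFlatL2` performs an exact, brute-force search and reports squared Euclidean distances. The pure-Python sketch below illustrates that idea on toy vectors. It is not how FAISS is implemented, and `l2_search` is a hypothetical name used only here.

```python
def l2_search(vectors, query, k=5):
    """Return (indices, squared_distances) of the k vectors closest to `query`."""
    scored = []
    for i, vec in enumerate(vectors):
        # Squared Euclidean distance between this vector and the query
        dist = sum((a - b) ** 2 for a, b in zip(vec, query))
        scored.append((dist, i))
    scored.sort()  # smallest distance first
    top = scored[:k]
    return [i for _, i in top], [d for d, _ in top]


# Toy 2-dimensional "embeddings" for illustration
vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
indices, distances = l2_search(vectors, [0.9, 0.1], k=2)
print(indices, distances)
```

Every query compares against every stored vector, so an exact index like this scales linearly with the number of documents.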
Use with LangChain
# Import paths for recent LangChain releases; older releases expose the same classes under langchain.*
from langchain_community.document_loaders import HuggingFaceDatasetLoader
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings
# Load documents
loader = HuggingFaceDatasetLoader(
    path="aanshshah/gaap-sec-compliance-dataset",
    page_content_column="content"
)
documents = loader.load()
# Create vector store
embeddings = HuggingFaceEmbeddings()
vectorstore = FAISS.from_documents(documents, embeddings)
# Query
results = vectorstore.similarity_search("revenue recognition", k=5)
Dataset Structure
Document Schema
Each document contains:
- id: Unique identifier
- content: Full text content
- metadata: Structured information including:
  - source: Origin (XBRL, SEC_FILING, GAAP_STANDARD)
  - type: Document type (tag, financial_data, standard)
  - category: Financial category (Assets, Revenue, etc.)
  - code: Standard code (e.g., "ASC 606", "us-gaap:Assets")
  - title: Human-readable title
  - date: Date in YYYY-MM-DD format
  - company: Company name (for SEC filings)
Example Document
{
  "id": "gaap_standard_67a64e72e3390f7e",
  "content": "# ASC 606: Revenue from Contracts with Customers...",
  "metadata": {
    "source": "GAAP_STANDARD",
    "type": "standard",
    "category": "Revenue",
    "code": "ASC 606",
    "title": "ASC 606: Revenue from Contracts with Customers",
    "date": "2025-01-01"
  }
}
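A record can be checked against this schema before it is indexed or used for training. The sketch below is one possible check, not the validation the dataset authors ran. It assumes that every metadata field except `company` is present on every record, which the card does not state outright; `validate_document`, `REQUIRED_METADATA`, and `ALLOWED_SOURCES` are hypothetical names.

```python
from datetime import datetime

# Assumption: all metadata fields except `company` are required
REQUIRED_METADATA = ("source", "type", "category", "code", "title", "date")
ALLOWED_SOURCES = {"XBRL", "SEC_FILING", "GAAP_STANDARD"}


def validate_document(doc):
    """Return a list of problems found in `doc`; an empty list means none were found."""
    problems = []
    for key in ("id", "content", "metadata"):
        if key not in doc:
            problems.append(f"missing field: {key}")
    meta = doc.get("metadata") or {}
    for key in REQUIRED_METADATA:
        if key not in meta:
            problems.append(f"missing metadata field: {key}")
    if "source" in meta and meta["source"] not in ALLOWED_SOURCES:
        problems.append(f"unknown source: {meta['source']}")
    if "date" in meta:
        try:
            datetime.strptime(meta["date"], "%Y-%m-%d")
        except (TypeError, ValueError):
            problems.append(f"bad date: {meta['date']!r}")
    return problems


# The example document from this card
example = {
    "id": "gaap_standard_67a64e72e3390f7e",
    "content": "# ASC 606: Revenue from Contracts with Customers...",
    "metadata": {
        "source": "GAAP_STANDARD",
        "type": "standard",
        "category": "Revenue",
        "code": "ASC 606",
        "title": "ASC 606: Revenue from Contracts with Customers",
        "date": "2025-01-01",
    },
}
print(validate_document(example))
```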
Data Creation Process
Sources
XBRL US GAAP Taxonomy (94.7%)
- Complete standardized accounting tags
- Hierarchical relationships preserved
SEC EDGAR Database (5.3%)
- Real company 10-K/10-Q filings
- Quarterly data from 2007-2025
FASB Standards (<0.1%)
- Core GAAP standards (ASC)
- Implementation guidance
Processing Pipeline
- Extraction: Parse XBRL, HTML, PDF sources
- Standardization: Convert to consistent JSON format
- Cleaning: Remove duplicates and invalid entries
- Enrichment: Add metadata and categories
- Validation: Ensure quality and completeness
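The cleaning step above can be sketched as follows. The card does not say how duplicates or invalid entries were detected, so this example makes two assumptions: a record is invalid when its content is empty after whitespace normalization, and two records are duplicates when their normalized content hashes match. `clean` is a hypothetical helper.

```python
import hashlib


def clean(records):
    """Drop records with empty content and exact duplicates of earlier records."""
    seen = set()
    kept = []
    for rec in records:
        # Collapse runs of whitespace so trivially different copies compare equal
        text = " ".join((rec.get("content") or "").split())
        if not text:
            continue  # assumption: empty content counts as an invalid entry
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # same normalized content already kept
        seen.add(digest)
        kept.append({**rec, "content": text})
    return kept


# Invented records for illustration
raw = [
    {"id": "a", "content": "Total  assets"},
    {"id": "b", "content": "Total assets"},
    {"id": "c", "content": "   "},
]
print([rec["id"] for rec in clean(raw)])
```

Hashing keeps memory use proportional to the number of distinct documents rather than to their total length, which matters at this dataset's scale.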
Applications in Production
Financial Institutions
- Compliance monitoring systems
- Risk assessment tools
- Regulatory report generation
- Audit automation
FinTech Companies
- AI-powered accounting assistants
- Automated bookkeeping
- Financial analysis platforms
- Investment research tools
Education & Training
- Interactive learning platforms
- Professional certification prep
- Academic research
- Student Q&A systems
Quality & Coverage
Quality Metrics
- Deduplicated: No duplicate documents
- Validated: All required fields present
- Cleaned: Invalid entries removed
- Structured: Consistent schema
- Current: Up-to-date as of December 2025
Coverage Areas
- Complete US GAAP taxonomy
- Major public company filings
- All accounting categories
- Historical and current standards
- Multiple filing types (10-K, 10-Q, 8-K)
Legal & Ethics
Data Sources
- All data from public sources
- No proprietary information
- SEC EDGAR publicly available filings
- XBRL taxonomy open standard
Use Restrictions
- Not for investment advice
- Educational/research purposes
- Verify critical information with official sources
- Comply with applicable regulations
Privacy
- No personal identifying information
- No material non-public information
- Only public company data
- Anonymized where appropriate
Updates & Maintenance
Version History
- v1.0.0 (December 2025): Initial release with 470K documents
Update Schedule
- Quarterly updates planned
- New SEC filings added
- GAAP standard updates included
- Community feedback incorporated
Support & Community
Getting Help
- Discussions
- Issues
- Contact via HuggingFace profile
Contributing
- Report data quality issues
- Suggest additional sources
- Share use cases and applications
- Submit improvements
Citation
If you use this dataset in your research or applications, please cite:
@dataset{gaap_sec_compliance_2025,
title={GAAP & SEC Compliance Dataset},
author={Shah, Aansh},
year={2025},
month={12},
publisher={HuggingFace},
url={https://huggingface.co/datasets/aanshshah/gaap-sec-compliance-dataset},
note={A comprehensive dataset of 470,151 financial documents for AI applications}
}
Acknowledgments
- XBRL US for taxonomy data
- SEC EDGAR for public filings
- FASB for accounting standards
- HuggingFace for hosting platform
- FlexAI for compute resources
Dataset Statistics
| Metric | Value |
|---|---|
| Documents | 470,151 |
| Characters | 171,055,320 |
| Companies | 6,573 |
| Date Span | 6,879 days (2007-01-31 to 2025-12-01) |
| Storage | ~296MB |
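The derived figures in this card (average document length and date span) can be cross-checked from the raw values it states. This is a small arithmetic sketch using only the totals and dates given in the card; the variable names are illustrative.

```python
from datetime import date

total_chars = 171_055_320   # "Characters" in the table
documents = 470_151         # "Documents" in the table
start, end = date(2007, 1, 31), date(2025, 12, 1)  # "Date Range" under Key Statistics

avg_length = total_chars / documents
span_days = (end - start).days

print(f"Average length: {avg_length:.1f} characters")
print(f"Date span: {span_days} days")
```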
Built for the financial AI community
Ready to build the next generation of financial AI? Start with this dataset!