Model Card Template: Structured Documentation for AI Systems
Structured template for documenting AI model capabilities, limitations, intended use, and evaluation results. Based on the Model Cards framework.
You are deploying or procuring an AI model and need a structured way to document what it does, where it works well, where it fails, and what risks come with using it. A model card is the standard tool for this job.
What is a model card
A model card is a structured document that accompanies an AI model, describing its intended use, capabilities, limitations, evaluation results, and ethical considerations. The concept was introduced by Mitchell et al. (2019) at Google, drawing on precedents in nutrition labels, electronics datasheets, and financial disclosures.
A model card is not marketing material. It is a technical disclosure document designed to help three audiences make informed decisions:
- **Developers** who integrate the model into applications need to understand its capabilities and failure modes
- **Decision-makers** who approve AI deployments need to understand risks, limitations, and compliance implications
- **Affected stakeholders** who are subject to the model's outputs deserve transparency about how the system works
Why teams need model cards
Without structured documentation, knowledge about a model's behavior lives in the heads of the people who built it. When those people change teams, leave the organization, or simply forget, critical information about limitations, failure modes, and design trade-offs is lost.
Regulatory requirements are making model documentation mandatory. The EU AI Act Annex IV requires technical documentation for high-risk AI systems that covers many of the same elements as a model card. The NIST AI Risk Management Framework recommends documented AI system profiles. Organizations that build the documentation habit now will be better prepared for compliance obligations.
Template sections
A model card should include the following eight sections. Each section is described below with guidance on what to include.
Section 1: Model Details
Basic identifying information about the model.
| Field | Description | Example |
|---|---|---|
| Model name | Official name and version | CustomerAssist v2.3 |
| Model type | Architecture or approach | Fine-tuned transformer (based on Llama 3 70B) |
| Developer | Organization or team that built or fine-tuned the model | Acme Corp AI Team |
| Release date | When this version was deployed | 2026-03-15 |
| License | Terms of use | Internal use only |
| Contact | Who to reach for questions | ai-team@acme.example.com |
| Model version history | Previous versions and key changes | v2.2 (2026-01-10): updated training data; v2.1 (2025-11-01): initial deployment |
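For teams that keep model cards machine-readable alongside the prose version, the template maps naturally onto a small data structure. The sketch below is illustrative only: the class and field names are assumptions, not a standard schema, though the fields mirror the table above.

```python
from dataclasses import dataclass, field


@dataclass
class ModelDetails:
    """Section 1 fields, mirroring the Model Details table."""
    name: str                      # e.g. "CustomerAssist v2.3"
    model_type: str                # architecture or approach
    developer: str                 # organization or team that built it
    release_date: str              # ISO date this version was deployed
    license: str                   # terms of use
    contact: str                   # who to reach for questions
    version_history: list[str] = field(default_factory=list)


@dataclass
class ModelCard:
    """Top-level card: one attribute per template section."""
    model_details: ModelDetails
    intended_use: dict[str, str]        # primary use, users, out-of-scope
    factors: dict[str, str]             # relevant vs. evaluated factors
    metrics: dict[str, str]
    evaluation_data: dict[str, str]
    training_data: dict[str, str]
    ethical_considerations: dict[str, str]
    caveats_and_recommendations: dict[str, str]
```

Storing the card as structured data rather than free text makes the completeness checks in the decision checklist below easy to automate.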
Section 2: Intended Use
What the model is for and, equally important, what it is not for.
| Field | Description |
|---|---|
| Primary intended use | The specific task or tasks the model was designed to perform |
| Primary intended users | Who should be using this model (internal teams, end users, other systems) |
| Out-of-scope uses | Tasks or contexts where the model should not be used, even if it appears to work |
Be specific. "General-purpose language understanding" is not a useful intended use statement. "Classifying customer support tickets into 12 predefined categories for routing" is.
Section 3: Factors
Characteristics of the operating environment that affect model performance.
| Field | Description |
|---|---|
| Relevant factors | Groups, instrumentation, or environments that influence performance (demographics, languages, device types, data formats) |
| Evaluation factors | Which of these factors were specifically tested during evaluation |
This section is where you document that the model was tested on English-language inputs only, or that evaluation data came from one geographic region, or that performance varies by user demographic.
Section 4: Metrics
How model performance is measured.
| Field | Description |
|---|---|
| Performance measures | Which metrics are used and why (accuracy, F1, precision, recall, latency, fairness metrics) |
| Decision thresholds | What confidence thresholds are used and how they were chosen |
| Variation approaches | How performance variation across subgroups is measured |
Select metrics that are relevant to the intended use. A classification model needs precision and recall. A generative model needs factual accuracy and harmlessness measures. A model affecting people needs fairness metrics.
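As a concrete illustration of decision thresholds and subgroup variation, the sketch below computes precision and recall per subgroup from scored predictions. It is a minimal plain-Python example; the records and the 0.7 threshold are invented for illustration.

```python
from collections import defaultdict

def subgroup_precision_recall(records, threshold=0.7):
    """Compute precision/recall per subgroup from (score, label, group) records.

    A prediction counts as positive when its score meets the threshold;
    disaggregating by group surfaces performance variation that a single
    aggregate number would hide.
    """
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for score, label, group in records:
        predicted = score >= threshold
        if predicted and label:
            counts[group]["tp"] += 1
        elif predicted and not label:
            counts[group]["fp"] += 1
        elif not predicted and label:
            counts[group]["fn"] += 1
    results = {}
    for group, c in counts.items():
        precision = c["tp"] / (c["tp"] + c["fp"]) if c["tp"] + c["fp"] else 0.0
        recall = c["tp"] / (c["tp"] + c["fn"]) if c["tp"] + c["fn"] else 0.0
        results[group] = {"precision": precision, "recall": recall}
    return results

# Hypothetical evaluation records: (model score, true label, subgroup)
records = [(0.92, True, "en"), (0.55, True, "en"), (0.81, False, "es"),
           (0.75, True, "es"), (0.30, False, "es")]
print(subgroup_precision_recall(records))
```

Reporting these per-group numbers in the Metrics section, rather than a single aggregate, is what makes the disaggregation item in the decision checklist verifiable.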
Section 5: Evaluation Data
Information about the data used to test the model.
| Field | Description |
|---|---|
| Datasets | Names and descriptions of evaluation datasets |
| Motivation | Why these datasets were chosen and what they represent |
| Preprocessing | How evaluation data was cleaned, filtered, or transformed |
Document the gaps. If the evaluation data does not cover a population the model will encounter in production, say so.
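One way to document the gaps systematically is to compare the composition of the evaluation set against a sample of production traffic. The sketch below is a minimal illustration; the category values and the 1% noise floor are invented.

```python
from collections import Counter

def coverage_gaps(eval_categories, production_categories, min_share=0.01):
    """Flag categories seen in production but absent from the evaluation data.

    min_share: production categories below this share are ignored as noise.
    """
    prod = Counter(production_categories)
    eval_counts = Counter(eval_categories)
    total = sum(prod.values())
    gaps = []
    for category, count in prod.items():
        share = count / total
        if share >= min_share and eval_counts[category] == 0:
            gaps.append((category, share))
    return sorted(gaps, key=lambda g: g[1], reverse=True)

# Hypothetical data: the evaluation set never covered "returns" questions
production = ["pricing"] * 50 + ["order_status"] * 30 + ["returns"] * 20
evaluation = ["pricing"] * 40 + ["order_status"] * 40
print(coverage_gaps(evaluation, production))  # [('returns', 0.2)]
```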
Section 6: Training Data
Information about the data used to train the model (to the extent it can be disclosed).
| Field | Description |
|---|---|
| Datasets | Names and descriptions of training datasets, or a summary of data sources |
| Motivation | Why these datasets were chosen |
| Preprocessing | How training data was cleaned, filtered, or transformed |
| Known gaps | Populations, languages, or scenarios underrepresented in training data |
For proprietary or vendor-provided models where you do not have full training data visibility, document what the vendor has disclosed and note the gaps in your knowledge.
Section 7: Ethical Considerations
Risks, harms, and sensitive use cases.
| Field | Description |
|---|---|
| Sensitive use cases | Where the model's errors could cause harm to individuals or groups |
| Known risks | Documented bias, fairness concerns, or failure modes that could cause harm |
| Mitigation strategies | What has been done to address identified risks |
| Unresolved concerns | Risks that have been identified but not yet addressed |
This is the most important section for decision-makers. Be honest about what you know, what you do not know, and what you have chosen to accept.
Section 8: Caveats and Recommendations
Practical guidance for anyone using the model.
| Field | Description |
|---|---|
| Known limitations | Conditions under which the model is expected to perform poorly |
| Deployment recommendations | How to deploy the model safely (monitoring, human oversight, usage limits) |
| Maintenance requirements | How often the model should be re-evaluated, and what triggers a review |
Pre-filled example: Customer service chatbot
Below is a condensed example showing how the template applies to a customer service chatbot. In practice, each section would contain more detail.
| Section | Content |
|---|---|
| **Model Details** | CustomerAssist v2.3. Fine-tuned Llama 3 70B. Released 2026-03-15. Internal deployment only. |
| **Intended Use** | Answer customer questions about product features, pricing, and order status using a verified knowledge base. NOT intended for: medical, legal, or financial advice; handling complaints that require human empathy; any decision with financial consequence to the customer. |
| **Factors** | Evaluated on English-language inputs only. Performance tested across product categories (electronics, clothing, home goods). Not tested on non-English inputs, regional dialects, or accessibility tool interactions. |
| **Metrics** | Factual accuracy: 94.2% on verified knowledge base questions. Hallucination rate: 3.1% on out-of-knowledge-base questions. Response latency: p95 under 2 seconds. Customer satisfaction (post-chat survey): 4.1/5. |
| **Evaluation Data** | 5,000 customer questions sampled from support logs (January to February 2026). Stratified by product category and question type. Does not include adversarial or edge-case inputs. |
| **Training Data** | Fine-tuned on 50,000 verified Q&A pairs from the product knowledge base, 10,000 historical support transcripts (PII redacted), and 2,000 manually written examples for edge cases. Knowledge base covers products sold from 2024 onward; older product questions are out of scope. |
| **Ethical Considerations** | Risk of hallucination on questions outside the knowledge base. Could provide incorrect pricing if the knowledge base is not updated promptly. Customers may not realize they are interacting with an AI system (disclosure is displayed but may be missed). No demographic bias testing has been conducted on response quality. |
| **Caveats** | Accuracy degrades on questions about products not in the knowledge base. Should not be deployed without the retrieval (RAG) pipeline and the escalation-to-human fallback. Re-evaluate after any knowledge base update exceeding 500 entries. |
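Figures like the 94.2% factual accuracy and 3.1% hallucination rate above come from straightforward counting over graded evaluation results. A minimal sketch, with invented record fields and invented grades:

```python
def card_metrics(results):
    """Compute the two headline rates from graded evaluation results.

    Each result is a dict with:
      in_kb:      whether the question is covered by the knowledge base
      correct:    grader judgment for in-KB questions
      fabricated: grader judgment for out-of-KB questions
    """
    in_kb = [r for r in results if r["in_kb"]]
    out_kb = [r for r in results if not r["in_kb"]]
    accuracy = sum(r["correct"] for r in in_kb) / len(in_kb)
    hallucination = sum(r["fabricated"] for r in out_kb) / len(out_kb)
    return {"factual_accuracy": accuracy, "hallucination_rate": hallucination}

# Hypothetical graded results
results = (
    [{"in_kb": True, "correct": True, "fabricated": False}] * 47
    + [{"in_kb": True, "correct": False, "fabricated": False}] * 3
    + [{"in_kb": False, "correct": False, "fabricated": True}] * 2
    + [{"in_kb": False, "correct": False, "fabricated": False}] * 48
)
print(card_metrics(results))
# {'factual_accuracy': 0.94, 'hallucination_rate': 0.04}
```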
When to create and update model cards
| Event | Action |
|---|---|
| New model development | Create the model card during development, before deployment |
| Model procurement from a vendor | Request the vendor's model card; create your own supplementary card documenting your specific deployment context |
| Model version update | Update the model card with new evaluation results, changed capabilities, and any new limitations |
| Change in intended use | Update intended use and out-of-scope sections; re-evaluate and document |
| Incident involving the model | Update ethical considerations and caveats with the incident details and any changes made |
| Annual review | Review and update all sections even if no changes have occurred, to confirm the documentation remains accurate |
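Teams that keep cards in version control sometimes automate the review triggers above. A minimal sketch, using invented field names and the example card's 500-entry knowledge-base threshold:

```python
from datetime import date

def needs_review(card_meta, today=None):
    """Return reasons a model card is due for review, per the events above."""
    today = today or date.today()
    reasons = []
    if card_meta["model_version"] != card_meta["card_version"]:
        reasons.append("model updated since card was last revised")
    if card_meta["kb_entries_changed_since_review"] > 500:
        reasons.append("knowledge base update exceeded 500 entries")
    if (today - card_meta["last_reviewed"]).days > 365:
        reasons.append("annual review overdue")
    return reasons

meta = {"model_version": "2.3", "card_version": "2.3",
        "kb_entries_changed_since_review": 620,
        "last_reviewed": date(2026, 3, 15)}
print(needs_review(meta, today=date(2026, 9, 1)))
# ['knowledge base update exceeded 500 entries']
```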
Decision checklist
Before considering a model card complete, confirm:
- [ ] All eight sections are filled in with specific, verifiable information (not placeholder text)
- [ ] Intended use includes clear out-of-scope uses, not just what the model is for
- [ ] Limitations are described in concrete terms, not vague disclaimers
- [ ] Evaluation results are disaggregated by relevant subgroups where possible
- [ ] Ethical considerations include both known risks and unresolved concerns
- [ ] The card has been reviewed by someone who did not build the model
- [ ] A maintenance schedule is defined (when the card will next be reviewed)
- [ ] The card is stored where all relevant stakeholders can access it
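Several checklist items above lend themselves to a mechanical pre-check before human review. The sketch below assumes the card is stored as a plain dict keyed by section, as in the illustrative schema from Section 1; the placeholder markers are assumptions.

```python
REQUIRED_SECTIONS = [
    "model_details", "intended_use", "factors", "metrics",
    "evaluation_data", "training_data", "ethical_considerations",
    "caveats_and_recommendations",
]
PLACEHOLDER_MARKERS = ("tbd", "todo", "placeholder", "lorem")

def lint_card(card):
    """Return mechanical checklist failures; human review is still required."""
    failures = []
    for section in REQUIRED_SECTIONS:
        text = str(card.get(section, "")).strip()
        if not text:
            failures.append(f"{section}: missing or empty")
        elif any(marker in text.lower() for marker in PLACEHOLDER_MARKERS):
            failures.append(f"{section}: contains placeholder text")
    if "out-of-scope" not in str(card.get("intended_use", "")).lower():
        failures.append("intended_use: no out-of-scope uses documented")
    if not card.get("next_review_date"):
        failures.append("no maintenance schedule (next_review_date) defined")
    return failures

print(lint_card({"model_details": "CustomerAssist v2.3", "metrics": "TBD"}))
# flags the placeholder in metrics plus every missing section
```

A linter like this catches empty and placeholder sections; the judgment items, such as independent review and honest limitation statements, cannot be automated.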
Key takeaways
- A model card is a disclosure document, not a sales pitch. Its value comes from honest documentation of limitations and risks, not from presenting the model in the best light.
- The EU AI Act Annex IV technical documentation requirements overlap significantly with model card content. Building model cards now prepares your organization for regulatory compliance.
- Model cards should be created during development, not after deployment. Retroactive documentation is less complete and less accurate.
- The most useful model cards are specific. "The model may produce inaccurate outputs" tells readers nothing. "The model hallucinates at a rate of 3.1% on out-of-knowledge-base questions, most commonly by fabricating product features" tells them what to watch for.
- A model card is a living document. Update it when the model changes, when the deployment context changes, and when new information about the model's behavior becomes available.
Sources
- [1] Mitchell, M. et al., "Model Cards for Model Reporting" (2019)
- [2] Google, Model Cards
- [3] Hugging Face, Model Card Guidebook
- [4] NIST, AI Risk Management Framework (AI RMF 1.0)
Related
- AI Vendor Due Diligence: 30 questions in five categories for evaluating AI vendors.
- AI Risk Register Template: a structured framework for tracking AI risks across your organization.
- Hallucination Risk: patterns, detection methods, and mitigation strategies.
- Data Leakage Risk: five documented patterns and mitigation strategies for AI data leakage.