Model Merging: Combining Pre-trained Models

When to Use This Skill

Use Model Merging when you need to:

Combine capabilities from multiple fine-tuned models without retraining
Create specialized models by blending domain-specific expertise (math + coding + chat)
Improve performance beyond single models (often +5-10% on benchmarks)
Reduce training costs - no GPUs needed, merges run on CPU
Experiment rapidly - create new model variants in minutes, not days
Preserve multiple skills - merge without catastrophic forgetting

Success stories: Marcoro14-7B-slerp (best on the Open LLM Leaderboard, 02/2024); many top Hugging Face models are merges.

Tools: mergekit (Arcee AI), LazyMergekit, Model Soup

Installation

    # Install mergekit from source
    git clone https://github.com/arcee-ai/mergekit.git
    cd mergekit
    pip install -e .

    # Or via pip
    pip install mergekit

    # Optional: Transformers library
    pip install transformers torch

Quick Start

Simple Linear Merge

    # config.yml - Merge two models with equal weights
    merge_method: linear
    models:
      - model: mistralai/Mistral-7B-v0.1
        parameters:
          weight: 0.5
      - model: teknium/OpenHermes-2.5-Mistral-7B
        parameters:
          weight: 0.5
    dtype: bfloat16

    # Run merge
    mergekit-yaml config.yml ./merged-model --cuda

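    # Use merged model (loads it with the Transformers text-generation pipeline)
    python -c "from transformers import pipeline; print(pipeline('text-generation', model='./merged-model')('Hello')[0]['generated_text'])"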

SLERP Merge (Best for 2 Models)

    # config.yml - Spherical interpolation
    merge_method: slerp
    base_model: mistralai/Mistral-7B-v0.1  # SLERP expects one of the two models as base_model
    slices:
      - sources:
          - model: mistralai/Mistral-7B-v0.1
            layer_range: [0, 32]
          - model: teknium/OpenHermes-2.5-Mistral-7B
            layer_range: [0, 32]
    parameters:
      t: 0.5  # Interpolation factor (0 = base model, 1 = other model)
    dtype: bfloat16

Core Concepts

Merge Methods

Linear (Model Soup)

Simple weighted average of parameters
Fast, works well for similar models
Can merge 2+ models

    merged_weights = w1 * model1_weights + w2 * model2_weights + w3 * model3_weights
    # where w1 + w2 + w3 = 1
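The weighted average above can be sketched in a few lines of plain Python. This is an illustration only: parameters are modeled as dicts of float lists rather than real tensors, and `linear_merge` and its `normalize` flag are hypothetical names, not mergekit's API.

```python
def linear_merge(state_dicts, weights, normalize=True):
    """Weighted average of parameter dicts (name -> list of floats)."""
    if len(state_dicts) != len(weights):
        raise ValueError("need exactly one weight per model")
    if normalize:
        total = sum(weights)
        if total == 0:
            raise ValueError("weights must not sum to zero")
        # Rescale so the weights sum to 1, as the formula requires.
        weights = [w / total for w in weights]
    merged = {}
    for name in state_dicts[0]:
        # Walk the same parameter position across all models at once.
        columns = zip(*(sd[name] for sd in state_dicts))
        merged[name] = [sum(w * x for w, x in zip(weights, col)) for col in columns]
    return merged


model1 = {"layer.weight": [1.0, 2.0]}
model2 = {"layer.weight": [3.0, 4.0]}
model3 = {"layer.weight": [5.0, 9.0]}
merged = linear_merge([model1, model2, model3], [0.5, 0.25, 0.25])
print(merged)
```

With normalization on, weights such as 2 and 2 behave the same as 0.5 and 0.5.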

SLERP (Spherical Linear Interpolation)

Interpolates along an arc on a sphere in weight space, rather than along a straight line
Preserves magnitude of weight vectors
Best for merging 2 models
Smoother than linear

    # SLERP formula
    merged = (sin((1-t)θ) / sin(θ)) * model1 + (sin(tθ) / sin(θ)) * model2
    # where θ = arccos(dot(model1, model2) / (‖model1‖ ‖model2‖))
    # t ∈ [0, 1]
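A minimal sketch of the SLERP formula for two flat weight vectors, using only the standard library. It assumes the angle is measured between the normalized vectors and falls back to plain linear interpolation when the vectors are nearly parallel (where sin(θ) approaches zero); the function name `slerp` and the `eps` threshold are illustrative choices.

```python
import math


def slerp(t, v0, v1, eps=1e-8):
    """Spherical linear interpolation between two equal-length vectors."""
    lerp = [(1 - t) * a + t * b for a, b in zip(v0, v1)]
    norm0 = math.sqrt(sum(x * x for x in v0))
    norm1 = math.sqrt(sum(x * x for x in v1))
    if norm0 < eps or norm1 < eps:
        return lerp  # the angle is undefined for a zero vector
    cos_theta = sum(a * b for a, b in zip(v0, v1)) / (norm0 * norm1)
    cos_theta = max(-1.0, min(1.0, cos_theta))  # guard against rounding error
    theta = math.acos(cos_theta)
    sin_theta = math.sin(theta)
    if abs(sin_theta) < eps:
        return lerp  # nearly parallel vectors: SLERP degenerates to LERP
    c0 = math.sin((1 - t) * theta) / sin_theta
    c1 = math.sin(t * theta) / sin_theta
    return [c0 * a + c1 * b for a, b in zip(v0, v1)]


halfway = slerp(0.5, [1.0, 0.0], [0.0, 1.0])
print(halfway)
```

For two orthogonal unit vectors, the halfway point keeps unit length, whereas a plain average would shrink it to about 0.71.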

Task Arithmetic

Extract "task vectors" (fine-tuned - base)
Combine task vectors, add to base
Good for merging multiple specialized models

    # Task vector
    task_vector = finetuned_model - base_model

    # Merge multiple task vectors
    merged = base_model + α₁ * task_vector₁ + α₂ * task_vector₂
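The two steps above, sketched on toy data. Parameters are plain dicts of float lists, and the helper names `task_vector` and `apply_task_vectors` are hypothetical; a real merge operates on full model tensors.

```python
def task_vector(finetuned, base):
    """Per-parameter difference: finetuned - base."""
    return {k: [f - b for f, b in zip(finetuned[k], base[k])] for k in base}


def apply_task_vectors(base, vectors, alphas):
    """base + sum(alpha_i * task_vector_i), parameter by parameter."""
    merged = {}
    for name, base_vals in base.items():
        out = list(base_vals)
        for alpha, vec in zip(alphas, vectors):
            out = [o + alpha * d for o, d in zip(out, vec[name])]
        merged[name] = out
    return merged


# Toy values: each "fine-tune" changes a different coordinate of the base.
base = {"w": [1.0, 1.0, 1.0]}
math_model = {"w": [2.0, 1.0, 1.0]}
code_model = {"w": [1.0, 1.0, 3.0]}

vectors = [task_vector(math_model, base), task_vector(code_model, base)]
merged = apply_task_vectors(base, vectors, alphas=[0.5, 0.5])
print(merged)
```

Because the two task vectors touch different coordinates here, both changes survive in the merged result; overlapping changes would simply add up.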

TIES-Merging

Task arithmetic + sparsification
Resolves sign conflicts in parameters
Best for merging many task-specific models
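A rough sketch of the three TIES steps on flat task vectors: trim each vector to its largest-magnitude entries, elect a sign per parameter from the summed values, then average only the entries that agree with the elected sign. The function names and the toy numbers are illustrative, and the final scaling and addition back onto the base model are left out.

```python
def trim(vec, density):
    """Keep the largest-magnitude fraction `density` of entries; zero the rest."""
    k = max(1, int(density * len(vec)))
    order = sorted(range(len(vec)), key=lambda i: abs(vec[i]), reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(vec)]


def ties_merge(task_vectors, density):
    """Trim, elect a sign per parameter, then average the agreeing entries."""
    trimmed = [trim(v, density) for v in task_vectors]
    merged = []
    for column in zip(*trimmed):
        sign = 1.0 if sum(column) >= 0 else -1.0  # elected sign
        agreeing = [x for x in column if x * sign > 0]
        merged.append(sum(agreeing) / len(agreeing) if agreeing else 0.0)
    return merged


tv_a = [0.9, -0.1, 0.4, 0.05]
tv_b = [-0.3, 0.8, 0.6, -0.02]
merged_delta = ties_merge([tv_a, tv_b], density=0.75)
print(merged_delta)
```

In the first two positions the vectors disagree in sign, so only the entry matching the elected sign contributes instead of the two partially cancelling.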

DARE (Drop And REscale)

Randomly drops fine-tuned parameters
Rescales remaining parameters
Reduces redundancy, maintains performance
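The drop-and-rescale step can be sketched as follows, assuming `density` is the fraction of delta entries to keep and survivors are divided by `density` so the expected value of each entry is unchanged. The function name `dare` and the fixed seed are illustrative.

```python
import random


def dare(delta, density, seed=0):
    """Randomly zero delta entries, rescaling survivors by 1 / density."""
    if not 0.0 < density <= 1.0:
        raise ValueError("density must be in (0, 1]")
    rng = random.Random(seed)  # seeded so the mask is reproducible
    out = []
    for d in delta:
        keep = rng.random() < density
        out.append(d / density if keep else 0.0)
    return out


delta = [0.2, -0.4, 0.1, 0.3]  # toy fine-tuned minus base differences
sparse = dare(delta, density=0.5)
print(sparse)
```

Each surviving entry is doubled at `density=0.5`, which compensates on average for the entries that were dropped.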

Configuration Structure

    # Basic structure
    merge_method: <method>  # linear, slerp, ties, dare_ties, task_arithmetic
    base_model: <path>      # Optional: base model (needed by task arithmetic, TIES, DARE, and SLERP)

    models:
      - model: <path/to/model1>
        parameters:
          weight: <float>   # Merge weight
          density: <float>  # For TIES/DARE
      - model: <path/to/model2>
        parameters:
          weight: <float>

    parameters:
      # Method-specific parameters

    dtype: <dtype>  # bfloat16, float16, float32

    # Optional
    slices:     # Layer-wise merging
    tokenizer:  # Tokenizer configuration
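Before running a merge, it can help to sanity-check a config that has already been loaded into a Python dict. The sketch below is a simplified reading of the structure above, not mergekit's own validation: the method list, the rule about which methods need `base_model`, and the name `check_config` are all assumptions made for illustration.

```python
# Assumption: a simplified view of the config structure, not mergekit's validator.
KNOWN_METHODS = {"linear", "slerp", "ties", "dare_ties", "task_arithmetic"}
NEEDS_BASE = {"slerp", "ties", "dare_ties", "task_arithmetic"}


def check_config(config):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    method = config.get("merge_method")
    if method not in KNOWN_METHODS:
        problems.append(f"unknown merge_method: {method!r}")
    if method in NEEDS_BASE and not config.get("base_model"):
        problems.append(f"{method} needs a base_model")
    if not config.get("models") and not config.get("slices"):
        problems.append("config needs either models or slices")
    for entry in config.get("models", []):
        density = entry.get("parameters", {}).get("density")
        if density is not None and not 0.0 < density <= 1.0:
            problems.append(f"density out of range for {entry.get('model')}: {density}")
    return problems


good = {
    "merge_method": "ties",
    "base_model": "mistralai/Mistral-7B-v0.1",
    "models": [{"model": "WizardLM/WizardMath-7B-V1.1",
                "parameters": {"density": 0.5, "weight": 1.0}}],
    "dtype": "bfloat16",
}
bad = {"merge_method": "ties",
       "models": [{"model": "a", "parameters": {"density": 1.5}}]}
print(check_config(good), check_config(bad))
```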

Merge Methods Guide

Linear Merge

Best for: Simple model combinations, equal weighting

    merge_method: linear
    models:
      - model: WizardLM/WizardMath-7B-V1.1
        parameters:
          weight: 0.4
      - model: teknium/OpenHermes-2.5-Mistral-7B
        parameters:
          weight: 0.3
      - model: NousResearch/Nous-Hermes-2-Mistral-7B-DPO
        parameters:
          weight: 0.3
    dtype: bfloat16

SLERP Merge

Best for: Two models, smooth interpolation

    merge_method: slerp
    base_model: mistralai/Mistral-7B-v0.1  # SLERP expects one of the two models as base_model
    slices:
      - sources:
          - model: mistralai/Mistral-7B-v0.1
            layer_range: [0, 32]
          - model: teknium/OpenHermes-2.5-Mistral-7B
            layer_range: [0, 32]
    parameters:
      t: 0.5  # 0.0 = base model, 1.0 = other model
    dtype: bfloat16

Layer-specific SLERP:

    merge_method: slerp
    base_model: model_a  # SLERP expects one of the two models as base_model
    slices:
      - sources:
          - model: model_a
            layer_range: [0, 32]
          - model: model_b
            layer_range: [0, 32]
    parameters:
      t:
        - filter: self_attn  # Attention layers
          value: 0.3
        - filter: mlp        # MLP layers
          value: 0.7
        - value: 0.5         # Default for other layers
    dtype: bfloat16
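To make the filter rules concrete, here is one way the per-tensor value of `t` could be resolved. This sketch assumes a filter matches when it appears as a substring of the tensor name and that the first matching rule wins, with the unfiltered rule acting as the default; mergekit's actual matching may differ. The function `resolve_t` and the tensor names are illustrative.

```python
def resolve_t(tensor_name, rules):
    """Return the t value of the first rule whose filter matches the tensor name."""
    for rule in rules:
        pattern = rule.get("filter")
        # A rule without a filter is treated as the catch-all default.
        if pattern is None or pattern in tensor_name:
            return rule["value"]
    raise KeyError(f"no rule matches {tensor_name}")


rules = [
    {"filter": "self_attn", "value": 0.3},
    {"filter": "mlp", "value": 0.7},
    {"value": 0.5},
]

for name in (
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.0.mlp.up_proj.weight",
    "model.layers.0.input_layernorm.weight",
):
    print(name, resolve_t(name, rules))
```

Under these rules, attention weights stay closer to the first model (t = 0.3), MLP weights lean toward the second (t = 0.7), and everything else is blended evenly.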

Task Arithmetic

Best for: Combining specialized skills

    merge_method: task_arithmetic
    base_model: mistralai/Mistral-7B-v0.1
    models:
      - model: WizardLM/WizardMath-7B-V1.1  # Math
        parameters:
          weight: 0.5
      - model: teknium/OpenHermes-2.5-Mistral-7B  # Chat
        parameters:
          weight: 0.3
      - model: ajibawa-2023/Code-Mistral-7B  # Code
        parameters:
          weight: 0.2
    dtype: bfloat16

TIES-Merging

Best for: Many models, resolving conflicts

    merge_method: ties
    base_model: mistralai/Mistral-7B-v0.1
    models:
      - model: WizardLM/WizardMath-7B-V1.1
        parameters:
          density: 0.5  # Keep the top 50% of task-vector entries by magnitude
          weight: 1.0
      - model: teknium/OpenHermes-2.5-Mistral-7B
        parameters:
          density: 0.5
          weight: 1.0
      - model: NousResearch/Nous-Hermes-2-Mistral-7B-DPO
        parameters:
          density: 0.5
          weight: 1.0
    parameters:
      normalize: true
    dtype: bfloat16

DARE Merge

Best for: Reducing redundancy

    merge_method: dare_ties
    base_model: mistralai/Mistral-7B-v0.1
    models:
      - model: WizardLM/WizardMath-7B-V1.1
        parameters:
          density: 0.5  # Drop 50% of deltas
          weight: 0.6
      - model: teknium/OpenHermes-2.5-Mistral-7B
        parameters:
          density: 0.5
          weight: 0.4
    parameters:
      int8_mask: true  # Use int8 for masks (saves memory)
    dtype: bfloat16

Advanced Patterns

Layer-wise Merging

    # Different models for different layers
    merge_method: passthrough
    slices:
      - sources:
          - model: mistralai/Mistral-7B-v0.1
            layer_range: [0, 16]   # First half
      - sources:
          - model: teknium/OpenHermes-2.5-Mistral-7B
            layer_range: [16, 32]  # Second half
    dtype: bfloat16
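It can be useful to expand a passthrough config into an explicit per-layer plan before merging, to confirm which model supplies each layer. The sketch below assumes `layer_range` is half-open, so `[0, 16]` covers layers 0 through 15; `layer_plan` is a hypothetical helper working on a config that has already been parsed into Python dicts.

```python
def layer_plan(slices):
    """Expand slices into an ordered list of (model, source_layer_index)."""
    plan = []
    for sl in slices:
        for src in sl["sources"]:
            start, end = src["layer_range"]
            # Assumption: the range is half-open, [start, end).
            plan.extend((src["model"], i) for i in range(start, end))
    return plan


slices = [
    {"sources": [{"model": "mistralai/Mistral-7B-v0.1", "layer_range": [0, 16]}]},
    {"sources": [{"model": "teknium/OpenHermes-2.5-Mistral-7B", "layer_range": [16, 32]}]},
]

plan = layer_plan(slices)
print(len(plan), plan[15], plan[16])
```

The length of the plan is the depth of the resulting model, which makes accidental gaps or duplicated layers easy to spot.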

MoE from Merged Models

    # Create Mixture of Experts (built with the mergekit-moe command, which takes its own config format)
    base_model: mistralai/Mistral-7B-v0.1
    gate_mode: hidden  # Router initialization: hidden, cheap_embed, or random
    dtype: bfloat16
    experts:
      - source_model: WizardLM/WizardMath-7B-V1.1
        positive_prompts:
          - "math"
          - "calculate"
      - source_model: teknium/OpenHermes-2.5-Mistral-7B
        positive_prompts:
          - "chat"
          - "conversation"
      - source_model: ajibawa-2023/Code-Mistral-7B
        positive_prompts:
          - "code"
          - "python"

Tokenizer Merging

    merge_method: linear
    models:
      - model: mistralai/Mistral-7B-v0.1
      - model: custom/specialized-model

    tokenizer:
      source: "union"  # Combine vocabularies from both models
      tokens:
        <|special_token|>:
          source: "custom/specialized-model"

Best Practices

Model Compatibility

    # ✅ Good: Same architecture
    models = [
        "mistralai/Mistral-7B-v0.1",
        "teknium/OpenHermes-2.5-Mistral-7B",  # Both Mistral 7B
    ]

    # ❌ Bad: Different architectures
    models = [
        "meta-llama/Llama-2-7b-hf",   # Llama
        "mistralai/Mistral-7B-v0.1",  # Mistral (incompatible!)
    ]
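One way to catch an architecture mismatch early is to compare a few shape-defining fields from each model's `config.json`. The sketch below works on plain dicts; the chosen keys, the helper name `incompatibilities`, and the inline values are illustrative, so read the real values from the models' own config files.

```python
# Fields that typically have to match for tensors to line up one-to-one.
KEYS = ("model_type", "hidden_size", "intermediate_size", "num_hidden_layers")


def incompatibilities(cfg_a, cfg_b, keys=KEYS):
    """Return {field: (value_a, value_b)} for every field that differs."""
    return {
        k: (cfg_a.get(k), cfg_b.get(k))
        for k in keys
        if cfg_a.get(k) != cfg_b.get(k)
    }


# Illustrative values only; load the real config.json files in practice.
mistral = {"model_type": "mistral", "hidden_size": 4096,
           "intermediate_size": 14336, "num_hidden_layers": 32}
mistral_finetune = dict(mistral)
llama = {"model_type": "llama", "hidden_size": 4096,
         "intermediate_size": 11008, "num_hidden_layers": 32}

print(incompatibilities(mistral, mistral_finetune))
print(incompatibilities(mistral, llama))
```

An empty result does not guarantee a good merge, but a non-empty one is a strong signal that the models cannot be averaged parameter by parameter.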

Weight Selection

    # ✅ Good: Weights sum to 1.0
    models:
      - model: model_a
        parameters:
          weight: 0.6
      - model: model_b
        parameters:
          weight: 0.4  # 0.6 + 0.4 = 1.0

    # ⚠️ Acceptable: Weights don't sum to 1 (for task arithmetic)
    models:
      - model: model_a
        parameters:
          weight: 0.8
      - model: model_b
        parameters:
          weight: 0.8  # May boost performance

Method Selection

    # Choose merge method based on use case:

    # 2 models, smooth blend → SLERP
    merge_method = "slerp"

    # 3+ models, simple average → Linear
    merge_method = "linear"

    # Multiple task-specific models → Task Arithmetic or TIES
    merge_method = "ties"

    # Want to reduce redundancy → DARE
    merge_method = "dare_ties"

Density Tuning (TIES/DARE)

    # Start conservative (keep more parameters)
    parameters:
      density: 0.8  # Keep 80%

    # If performance is good, increase sparsity
    parameters:
      density: 0.5  # Keep 50%

    # If performance degrades, reduce sparsity
    parameters:
      density: 0.9  # Keep 90%

Layer-specific Merging

    # Preserve base model's beginning and end
    merge_method: passthrough
    slices:
      - sources:
          - model: base_model
            layer_range: [0, 2]    # Keep first layers
      - sources:
          - model: merged_middle   # Merged middle layers
            layer_range: [2, 30]
      - sources:
          - model: base_model
            layer_range: [30, 32]  # Keep last layers

Evaluation & Testing

Benchmark Merged Models

    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Load merged model
    model = AutoModelForCausalLM.from_pretrained("./merged-model")
    tokenizer = AutoTokenizer.from_pretrained("./merged-model")

    # Test on various tasks
    test_prompts = {
        "math": "Calculate: 25 * 17 =",
        "code": "Write a Python function to reverse a string:",
        "chat": "What is the capital of France?",
    }

    for task, prompt in test_prompts.items():
        inputs = tokenizer(prompt, return_tensors="pt")
        # max_new_tokens limits only the generated text; max_length would also count the prompt
        outputs = model.generate(**inputs, max_new_tokens=100)
        print(f"{task}: {tokenizer.decode(outputs[0], skip_special_tokens=True)}")
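Spot checks like the loop above are easier to compare across merges when they produce a number. The sketch below scores a generation function against expected answers by simple substring containment; `containment_score`, `fake_generate`, and the cases are all hypothetical stand-ins, and a real run would pass a function that wraps `model.generate`.

```python
def containment_score(generate, cases):
    """Fraction of cases whose expected answer appears in the generated text."""
    hits = sum(1 for prompt, expected in cases if expected in generate(prompt))
    return hits / len(cases)


def fake_generate(prompt):
    """Stand-in for a real model call, so the sketch runs without a model."""
    canned = {
        "Calculate: 25 * 17 =": "25 * 17 = 425",
        "What is the capital of France?": "The capital of France is Paris.",
    }
    return canned.get(prompt, "")


cases = [
    ("Calculate: 25 * 17 =", "425"),
    ("What is the capital of France?", "Paris"),
    ("Calculate: 12 * 12 =", "144"),
]

score = containment_score(fake_generate, cases)
print(f"score: {score:.2f}")
```

Substring matching is crude, so this is best treated as a smoke test before running the standard benchmarks.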

Common Benchmarks

Open LLM Leaderboard: General capabilities
MT-Bench: Multi-turn conversation
MMLU: Multitask accuracy
HumanEval: Code generation
GSM8K: Math reasoning

Production Deployment

Save and Upload

    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Load merged model
    model = AutoModelForCausalLM.from_pretrained("./merged-model")
    tokenizer = AutoTokenizer.from_pretrained("./merged-model")

    # Upload to HuggingFace Hub
    model.push_to_hub("username/my-merged-model")
    tokenizer.push_to_hub("username/my-merged-model")

Quantize Merged Model

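    # Convert to GGUF with llama.cpp's conversion script (convert.py in older checkouts,
    # convert_hf_to_gguf.py in newer ones); this writes an f16 file that can then be quantized
    python convert.py ./merged-model --outtype f16 --outfile merged-model.gguf

    # Quantize with GPTQ (quantize_gptq.py stands for your own GPTQ script)
    python quantize_gptq.py ./merged-model --bits 4 --group_size 128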

Common Pitfalls

❌ Pitfall 1: Merging Incompatible Models

    # Wrong: Different architectures
    models:
      - model: meta-llama/Llama-2-7b  # Llama architecture
      - model: mistralai/Mistral-7B   # Mistral architecture

Fix: Only merge models with same architecture

❌ Pitfall 2: Over-weighting One Model

    # Suboptimal: One model dominates
    models:
      - model: model_a
        parameters:
          weight: 0.95  # Too high
      - model: model_b
        parameters:
          weight: 0.05  # Too low

Fix: Use more balanced weights (0.3-0.7 range)

❌ Pitfall 3: Not Evaluating

    # Wrong: Merge and deploy without testing
    mergekit-yaml config.yml ./merged-model
    # Deploy immediately (risky!)

Fix: Always benchmark before deploying
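A simple gate can make this fix routine: compare the merged model's benchmark scores with the best score among its parent models and block deployment on any regression. The scores below are made up for illustration, and `find_regressions` and its `tolerance` parameter are hypothetical names.

```python
def find_regressions(parent_scores, merged_scores, tolerance=0.01):
    """Return tasks where the merge scores below the best parent by more than tolerance."""
    regressions = {}
    for task, best_parent in parent_scores.items():
        merged = merged_scores.get(task)
        # A missing score counts as a regression: the task was never evaluated.
        if merged is None or merged < best_parent - tolerance:
            regressions[task] = (best_parent, merged)
    return regressions


# Made-up numbers for illustration only.
parent_scores = {"gsm8k": 0.52, "humaneval": 0.31, "mmlu": 0.61}
merged_scores = {"gsm8k": 0.55, "humaneval": 0.24, "mmlu": 0.605}

regressions = find_regressions(parent_scores, merged_scores)
if regressions:
    print("Do not deploy yet:", regressions)
else:
    print("No regressions beyond tolerance")
```

Here the merge improves on math but loses ground on code, which is the kind of trade-off that only shows up when every target task is measured.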

Resources

mergekit GitHub: https://github.com/arcee-ai/mergekit
HuggingFace Tutorial: https://huggingface.co/blog/mlabonne/merge-models
LazyMergekit: Automated merging notebook
TIES Paper: https://arxiv.org/abs/2306.01708
DARE Paper: https://arxiv.org/abs/2311.03099

