# Tensor Blob Format

Ollama stores model tensors as individual blobs in the safetensors format. Each blob contains either a single logical tensor, a combined quantized tensor with its scale/bias components, or a group of logical tensors (e.g. the shared experts for a given layer, along with their scale/bias components).

## Safetensors File Format

Every blob follows the safetensors layout:

```
[8 bytes: header_size (uint64 LE)] [header_size bytes: JSON header] [tensor data region]
```

The JSON header maps tensor names to their dtype, shape, and byte offsets within the data region. A special `__metadata__` key holds string-to-string metadata.
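For illustration, the header can be parsed without touching the tensor data region. A minimal Go sketch, assuming only the layout above (the type and function names here are illustrative, not Ollama's actual API):

```go
package main

import (
	"encoding/binary"
	"encoding/json"
	"fmt"
	"io"
	"log"
	"os"
)

// TensorEntry mirrors one entry in the safetensors JSON header.
// Field names follow the safetensors spec; the struct itself is illustrative.
type TensorEntry struct {
	Dtype       string   `json:"dtype"`
	Shape       []int64  `json:"shape"`
	DataOffsets [2]int64 `json:"data_offsets"`
}

// readHeader reads the 8-byte little-endian header size, then decodes the
// JSON header into tensor entries plus the optional __metadata__ map.
func readHeader(path string) (map[string]TensorEntry, map[string]string, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, nil, err
	}
	defer f.Close()

	var sizeBuf [8]byte
	if _, err := io.ReadFull(f, sizeBuf[:]); err != nil {
		return nil, nil, err
	}
	headerSize := binary.LittleEndian.Uint64(sizeBuf[:])

	headerBytes := make([]byte, headerSize)
	if _, err := io.ReadFull(f, headerBytes); err != nil {
		return nil, nil, err
	}

	var raw map[string]json.RawMessage
	if err := json.Unmarshal(headerBytes, &raw); err != nil {
		return nil, nil, err
	}

	tensors := make(map[string]TensorEntry)
	var meta map[string]string
	for name, msg := range raw {
		if name == "__metadata__" {
			// Metadata values are plain strings, not tensor entries.
			if err := json.Unmarshal(msg, &meta); err != nil {
				return nil, nil, err
			}
			continue
		}
		var e TensorEntry
		if err := json.Unmarshal(msg, &e); err != nil {
			return nil, nil, err
		}
		tensors[name] = e
	}
	return tensors, meta, nil
}

func main() {
	tensors, meta, err := readHeader(os.Args[1])
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("metadata:", meta)
	for name, e := range tensors {
		fmt.Printf("%s dtype=%s shape=%v offsets=%v\n", name, e.Dtype, e.Shape, e.DataOffsets)
	}
}
```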

## Unquantized Blobs

An unquantized blob stores a single tensor keyed by its name:

```json
{
  "model.layers.0.self_attn.q_proj.weight": {
    "dtype": "BF16",
    "shape": [2560, 2560],
    "data_offsets": [0, 13107200]
  }
}
```

The tensor key is the full tensor name. The dtype is typically `BF16` or `F32`. In the example above, 2560 × 2560 elements × 2 bytes each (BF16) = 13,107,200 bytes, matching the `data_offsets` span.

## Quantized Blobs (Combined Format)

A quantized blob stores the packed weight, scaling factors, and optional zero-point biases in a single file. Tensor keys use the tensor name, with `.scale` and `.bias` suffixes for the auxiliary tensors:

```json
{
  "__metadata__": {
    "quant_type": "int4",
    "group_size": "32"
  },
  "model.layers.0.mlp.up_proj.weight": {
    "dtype": "U32",
    "shape": [2560, 320],
    "data_offsets": [0, 3276800]
  },
  "model.layers.0.mlp.up_proj.weight.scale": {
    "dtype": "BF16",
    "shape": [2560, 80],
    "data_offsets": [3276800, 3686400]
  },
  "model.layers.0.mlp.up_proj.weight.bias": {
    "dtype": "BF16",
    "shape": [2560, 80],
    "data_offsets": [3686400, 4096000]
  }
}
```

### Metadata Fields

| Field | Description |
|-------|-------------|
| `quant_type` | Quantization type: `int4`, `int8`, `nvfp4`, or `mxfp8` |
| `group_size` | Number of elements per quantization group (e.g., 32, 64) |

### Tensor Keys

| Key | Description |
|-----|-------------|
| `{name}` | Packed quantized weights (dtype `U32`) |
| `{name}.scale` | Per-group scaling factors |
| `{name}.bias` | Per-group zero-point offsets (affine modes only) |

## Quantization Types

| Type | Bits | Group Size | Mode | Has Bias |
|------|------|------------|------|----------|
| `int4` | 4 | 32 | affine | yes |
| `int8` | 8 | 64 | affine | yes |
| `nvfp4` | 4 | 16 | nvfp4 | no |
| `mxfp8` | 8 | 32 | mxfp8 | no |

Affine modes (`int4`, `int8`) use scale + bias for dequantization. The bias tensor provides the zero-point offset.

Non-affine modes (`nvfp4`, `mxfp8`) use only scale, with specialized E4M3 scale formats.
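For the affine modes, each group of `group_size` consecutive elements shares one scale and one bias, so an element dequantizes roughly as `w ≈ scale * q + bias`. A minimal sketch for int4, assuming unsigned 4-bit codes packed low-nibble-first within each `uint32` (both packing details are assumptions, not stated in this document):

```go
package main

import "fmt"

// dequantRowInt4 reconstructs one row of an int4-quantized weight.
//   packed: packed_cols uint32 words, 8 nibbles per word
//   scale:  cols/groupSize per-group scales
//   bias:   cols/groupSize per-group zero-point offsets
// Unsigned 4-bit codes and low-nibble-first order are assumptions here.
func dequantRowInt4(packed []uint32, scale, bias []float32, groupSize int) []float32 {
	out := make([]float32, len(packed)*8)
	for i, word := range packed {
		for n := 0; n < 8; n++ {
			q := float32((word >> (4 * n)) & 0xF) // 4-bit code, 0..15
			col := i*8 + n
			g := col / groupSize
			out[col] = scale[g]*q + bias[g] // affine: w ≈ scale*q + bias
		}
	}
	return out
}

func main() {
	// One 32-element group: each uint32 holds 8 weights, so with
	// groupSize 32, four words share a single scale/bias pair.
	packed := []uint32{0x76543210, 0xFEDCBA98, 0x00000000, 0xFFFFFFFF}
	row := dequantRowInt4(packed, []float32{0.5}, []float32{-4}, 32)
	fmt.Println(row[:8]) // [-4 -3.5 -3 -2.5 -2 -1.5 -1 -0.5]
}
```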

## Packed Weight Shape

Quantized weights are packed into `uint32` values:

- 4-bit (`int4`, `nvfp4`): 8 values per `uint32`, so `packed_cols = original_cols / 8`
- 8-bit (`int8`, `mxfp8`): 4 values per `uint32`, so `packed_cols = original_cols / 4`

Scale shape: `[rows, original_cols / group_size]`
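To make the arithmetic concrete, here is a small helper derived from the type table above (the map and function are illustrative, not Ollama's internals):

```go
package main

import "fmt"

// quantInfo captures bits and group size per quantization type, taken from
// the Quantization Types table above.
var quantInfo = map[string]struct{ bits, groupSize int }{
	"int4":  {4, 32},
	"int8":  {8, 64},
	"nvfp4": {4, 16},
	"mxfp8": {8, 32},
}

// packedShapes returns the packed weight shape and the scale shape for a
// [rows, cols] tensor quantized with the given type.
func packedShapes(quantType string, rows, cols int) (weight, scale [2]int) {
	q := quantInfo[quantType]
	valuesPerWord := 32 / q.bits // 8 for 4-bit types, 4 for 8-bit types
	weight = [2]int{rows, cols / valuesPerWord}
	scale = [2]int{rows, cols / q.groupSize}
	return
}

func main() {
	w, s := packedShapes("int4", 2560, 2560)
	fmt.Println(w, s) // [2560 320] [2560 80], as in the quantized blob example
}
```

For the quantized blob example earlier, a `[2560, 2560]` weight packs to `[2560, 320]` `uint32` words (2560 × 320 × 4 = 3,276,800 bytes), and the scale is `[2560, 80]` BF16 values (409,600 bytes), matching that example's `data_offsets`.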

## Manifest References

Blobs are referenced from the model manifest as layers:

```json
{
  "mediaType": "application/vnd.ollama.image.tensor",
  "digest": "sha256:abc123...",
  "size": 4096150,
  "name": "model.layers.0.mlp.up_proj.weight"
}
```

Each tensor (quantized or not) is one layer in the manifest. The layer name matches the tensor key in the blob header.
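For illustration, a layer entry can be decoded with a small struct whose fields mirror the JSON above (the `TensorLayer` name is an assumption, not Ollama's type):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// TensorLayer mirrors one tensor layer entry in the model manifest.
type TensorLayer struct {
	MediaType string `json:"mediaType"`
	Digest    string `json:"digest"`
	Size      int64  `json:"size"`
	Name      string `json:"name"`
}

func main() {
	raw := `{
	  "mediaType": "application/vnd.ollama.image.tensor",
	  "digest": "sha256:abc123...",
	  "size": 4096150,
	  "name": "model.layers.0.mlp.up_proj.weight"
	}`
	var layer TensorLayer
	if err := json.Unmarshal([]byte(raw), &layer); err != nil {
		panic(err)
	}
	fmt.Println(layer.Name, layer.Size)
}
```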

## Packed Blobs (Expert Groups)

For MoE (Mixture of Experts) models, expert tensors from the same layer are packed into a single blob to reduce blob count and improve loading efficiency. A packed blob is a standard safetensors file containing multiple tensor entries:

```json
{
  "model.layers.1.mlp.experts.0.down_proj.weight": {
    "dtype": "U32",
    "shape": [2560, 640],
    "data_offsets": [0, 6553600]
  },
  "model.layers.1.mlp.experts.0.down_proj.weight.scale": {
    "dtype": "BF16",
    "shape": [2560, 40],
    "data_offsets": [6553600, 6963200]
  },
  "model.layers.1.mlp.experts.0.gate_proj.weight": {
    "dtype": "U32",
    "shape": [10240, 320],
    "data_offsets": [6963200, 20070400]
  },
  "model.layers.1.mlp.experts.0.gate_proj.weight.scale": { "..." : "..." }
}
```

### Grouping Rules

- `model.layers.{L}.mlp.experts.*` tensors are packed into one blob per layer
- `model.layers.{L}.mlp.shared_experts.*` tensors are packed into one blob per layer
- All other tensors remain as individual blobs (see the sketch after this list)
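One way to express these rules is a grouping function that maps a tensor name to its blob key; this is a sketch, and the regular expression and function name are assumptions, not Ollama's actual code:

```go
package main

import (
	"fmt"
	"regexp"
)

// expertRE captures the per-layer group prefix for expert and
// shared-expert tensors. Illustrative, not Ollama's implementation.
var expertRE = regexp.MustCompile(`^(model\.layers\.\d+\.mlp\.(?:shared_)?experts)\.`)

// groupKey returns the blob a tensor belongs to: (shared) expert tensors
// map to a per-layer group prefix; everything else stays individual.
func groupKey(name string) string {
	if m := expertRE.FindStringSubmatch(name); m != nil {
		return m[1]
	}
	return name
}

func main() {
	fmt.Println(groupKey("model.layers.1.mlp.experts.0.down_proj.weight")) // model.layers.1.mlp.experts
	fmt.Println(groupKey("model.layers.0.self_attn.q_proj.weight"))        // unchanged
}
```

Note that the returned group prefix is exactly the layer name used in the manifest representation below.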

### Manifest Representation

One manifest layer per packed group, using the group prefix as the layer name:

```json
{
  "mediaType": "application/vnd.ollama.image.tensor",
  "digest": "sha256:...",
  "size": 123456789,
  "name": "model.layers.1.mlp.experts"
}
```

## Loading

At load time, `mlx_load_safetensors` opens each blob via mmap for zero-copy access. For combined quantized blobs, the loader extracts the `{name}`, `{name}.scale`, and `{name}.bias` tensors and caches them as `name`, `name + "_scale"`, and `name + "_qbias"` respectively, maintaining compatibility with the weight loading interface.

For packed blobs, if the manifest layer name (the group prefix) is not found as a tensor key, the loader parses the blob header to discover all tensor names and loads each one individually.
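The suffix remapping might look like the following sketch; the function name and structure are assumptions, and only the `.scale` → `_scale` and `.bias` → `_qbias` mapping comes from the text above:

```go
package main

import (
	"fmt"
	"strings"
)

// cacheKey maps a safetensors tensor key in a combined quantized blob to
// the name used by the weight loading interface.
func cacheKey(blobKey string) string {
	switch {
	case strings.HasSuffix(blobKey, ".scale"):
		return strings.TrimSuffix(blobKey, ".scale") + "_scale"
	case strings.HasSuffix(blobKey, ".bias"):
		return strings.TrimSuffix(blobKey, ".bias") + "_qbias"
	default:
		return blobKey // the packed weight keeps its own name
	}
}

func main() {
	fmt.Println(cacheKey("model.layers.0.mlp.up_proj.weight.scale"))
	// -> model.layers.0.mlp.up_proj.weight_scale
}
```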