Use the original key dimension (qkNopeHeadDim + qkRopeHeadDim = 256) for
the attention scale instead of the MLA absorbed dimension (kvLoraRank +
qkRopeHeadDim = 576).
MLA absorption is a mathematically equivalent reorganization of the
attention computation - it should not change the effective attention
scale. The scale should match training, which uses 1/sqrt(256).
This improves tool calling and reduces model looping issues.
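For illustration, a minimal sketch of the scale computation, assuming qkRopeHeadDim = 64 and qkNopeHeadDim = 192 (consistent with the sums above); this is not the actual model code:

```go
// Illustrative only: derive the attention scale from the original key
// dimension rather than the MLA-absorbed dimension. Dimension values are
// assumptions consistent with the commit message.
package main

import (
	"fmt"
	"math"
)

func main() {
	const (
		qkNopeHeadDim = 192
		qkRopeHeadDim = 64
		kvLoraRank    = 512
	)

	// Correct: matches training, 1/sqrt(256).
	scale := 1.0 / math.Sqrt(float64(qkNopeHeadDim+qkRopeHeadDim))

	// Incorrect: derived from the absorbed head size, 1/sqrt(576), too small.
	absorbedScale := 1.0 / math.Sqrt(float64(kvLoraRank+qkRopeHeadDim))

	fmt.Printf("training scale = %.6f\n", scale)         // 0.062500
	fmt.Printf("absorbed-dim scale = %.6f\n", absorbedScale) // ~0.041667
}
```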
The nvidia_fp32 config for (576, 512) head sizes had nbatch_fa=32,
which caused zero-sized arrays when computing array dimensions:
nbatch_fa / (np * warp_size) = 32 / (2 * 32) = 0
This resulted in CUDA compilation failures on CUDA 12 (Windows and
Linux arm64):
- "static assertion failed with nbatch_fa % (np*warp_size) != 0"
- "the size of an array must be greater than zero"
Fix by changing nbatch_fa from 32 to 64 for all (576, 512) configs
in the nvidia_fp32 function, matching the nvidia_fp16 and AMD configs.
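A standalone sketch of the failing arithmetic, mirroring the values from the error above (not the actual CUDA template code):

```go
// Sketch of the constraint behind the failure: nbatch_fa must divide evenly
// into np*warp_size rows, and the resulting array dimension must be non-zero.
package main

import "fmt"

func main() {
	const (
		warpSize = 32
		np       = 2 // parallelization factor used for the (576, 512) head sizes
	)
	for _, nbatchFA := range []int{32, 64} {
		rows := nbatchFA / (np * warpSize)
		ok := nbatchFA%(np*warpSize) == 0 && rows > 0
		fmt.Printf("nbatch_fa=%d -> nbatch_fa/(np*warp_size)=%d, valid=%v\n",
			nbatchFA, rows, ok)
	}
}
```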
* model: add MLA absorption for glm4moelite
Split the combined KV_B tensor into separate K_B and V_B tensors
during conversion, enabling MLA (Multi-head Latent Attention)
absorption, which compresses the KV cache for improved efficiency.
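A hedged sketch of the split, assuming a flattened row-major layout of [nHeads * (qkNopeHeadDim + vHeadDim)][kvLoraRank]; the real converter operates on GGUF tensors and may differ:

```go
package sketch

// splitKVB splits a combined kv_b projection into per-head k_b and v_b parts.
// Layout and names are assumptions for illustration, not the converter code.
func splitKVB(kvb []float32, nHeads, qkNopeHeadDim, vHeadDim, kvLoraRank int) (kb, vb []float32) {
	rowsPerHead := qkNopeHeadDim + vHeadDim
	for h := 0; h < nHeads; h++ {
		base := h * rowsPerHead * kvLoraRank
		// The first qkNopeHeadDim rows of each head go to k_b ...
		kb = append(kb, kvb[base:base+qkNopeHeadDim*kvLoraRank]...)
		// ... and the remaining vHeadDim rows go to v_b.
		vb = append(vb, kvb[base+qkNopeHeadDim*kvLoraRank:base+rowsPerHead*kvLoraRank]...)
	}
	return kb, vb
}
```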
* ggml: enable MLA flash attention for GLM-4.7-flash
Add support for gqa_ratio 4 in MLA flash attention kernels. GLM-4.7-flash
uses head size 576 with gqa_ratio 4; previously, head size 576 was only
supported with gqa_ratio 16 (DeepSeek). A rough dispatch sketch follows the
change lists below.
Metal changes:
- Enable head size 576 for flash attention
- Increase simdgroups to 8 for large heads (>=512)
- Add case 8 kernel dispatch for 8 simdgroups
CUDA changes:
- Add gqa_ratio 4 support for head 576/512
- Add tile configs for (576, 512, 4) and (576, 512, 8)
- Add MMA config cases for ncols 4
- Add template instances for ncols2=4
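As a rough illustration of the support matrix being extended (hypothetical helper, not the actual ggml dispatch code):

```go
package sketch

// supportsMLAFlashAttention sketches the check being widened: the (576, 512)
// MLA head sizes previously only had kernel configs for gqa_ratio 16
// (DeepSeek); this change also accepts gqa_ratio 4 (GLM-4.7-flash).
func supportsMLAFlashAttention(headDimK, headDimV, nHead, nHeadKV int) bool {
	if headDimK != 576 || headDimV != 512 {
		return false
	}
	gqaRatio := nHead / nHeadKV
	switch gqaRatio {
	case 4, 16: // 4 added for GLM-4.7-flash; 16 already used by DeepSeek
		return true
	default:
		return false
	}
}
```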
* model: add compatibility validation for glm4moelite architecture
This change:
* fixes rope scaling in the mistral converter
* updates ministral to include llama4 scaling (see the sketch after this list)
* adds a new ministral parser for reasoning and tool calls
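A hedged sketch of llama3/llama4-style rope frequency scaling as commonly published; parameter names are assumptions and this is not the converter code:

```go
package sketch

import "math"

// scaleRopeFreqs applies llama3/llama4-style scaling to rope inverse
// frequencies: high-frequency dims are untouched, low-frequency dims are
// divided by the factor, and the band in between is smoothly interpolated.
func scaleRopeFreqs(invFreqs []float64, factor, lowFreqFactor, highFreqFactor float64, origCtxLen int) []float64 {
	lowFreqWavelen := float64(origCtxLen) / lowFreqFactor
	highFreqWavelen := float64(origCtxLen) / highFreqFactor

	out := make([]float64, len(invFreqs))
	for i, f := range invFreqs {
		wavelen := 2 * math.Pi / f
		switch {
		case wavelen < highFreqWavelen:
			out[i] = f // short wavelengths: leave as-is
		case wavelen > lowFreqWavelen:
			out[i] = f / factor // long wavelengths: fully scaled
		default:
			// smooth interpolation between the two regimes
			smooth := (float64(origCtxLen)/wavelen - lowFreqFactor) /
				(highFreqFactor - lowFreqFactor)
			out[i] = (1-smooth)*f/factor + smooth*f
		}
	}
	return out
}
```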
---------
Co-authored-by: jmorganca <jmorganca@gmail.com>
* init deepseek model file
* temp removal of flash attention implementation
* shapes are proper, can make a pass
* query, key, value have good cosine similarity, but the max diff is a bit high
* Attention block is working! (eager for now, have not added the mask line)
* working MoE at around 0.95 cosine sim
* added cosine similarity function
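A minimal version of such a helper (a standalone sketch, not necessarily the exact function added here):

```go
package sketch

import "math"

// cosineSimilarity compares two equal-length vectors, the kind of check used
// above to validate layer outputs against a reference implementation.
func cosineSimilarity(a, b []float64) float64 {
	if len(a) != len(b) || len(a) == 0 {
		return math.NaN()
	}
	var dot, normA, normB float64
	for i := range a {
		dot += a[i] * b[i]
		normA += a[i] * a[i]
		normB += b[i] * b[i]
	}
	return dot / (math.Sqrt(normA) * math.Sqrt(normB))
}
```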
* Starting end to end structure
* Trying (and failing) to get rope to work, going to test full thing on tater
* running on tater36... just not the right outputs
* we have the right values for rope... but it's still not working?
* change Extrapolation Factor to 1
* removed adding residuals twice; removed normalization from the shared expert; refactored the norms (Attention, MLP) to live in the Transformer block instead of inside the Attention and MLP blocks; added cache setLayer
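A hedged sketch of the resulting layer structure, with illustrative function parameters standing in for the real modules:

```go
package sketch

// layerForward applies one transformer layer in pre-norm form: each sublayer's
// normalization lives in the block, and the residual is added exactly once per
// sublayer. All parameters are illustrative stand-ins, not the actual model code.
func layerForward(
	hidden []float32,
	attnNorm, mlpNorm func([]float32) []float32,
	attention, mlpWithSharedExpert func([]float32) []float32,
	add func(a, b []float32) []float32,
) []float32 {
	// Attention sublayer: norm -> attention -> single residual add.
	hidden = add(hidden, attention(attnNorm(hidden)))
	// MoE sublayer: norm -> routed experts + shared expert (no extra norm on
	// the shared expert) -> single residual add.
	hidden = add(hidden, mlpWithSharedExpert(mlpNorm(hidden)))
	return hidden
}
```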
* Temporary modelfiles for cpu
* change kpass intermediate step to kv, two layer outputs [0,1] look fine
* this calls for 16 chicken nuggets
* whoops
* cleaning up code
* delete stuff we don't need
* getting rid of debug statements for llama.cpp
* working with long contexts
* fix long context view error
* reverting some changes I made to files that are not part of this PR
* Added proper tokenizer for deepseek3
* clean up model and go test
* remove Modelfile
* not passing the tests
* whoops
* how to pass the ci tests
* resolving some of the comments
* rename
* linted and renamed deepseek3 -> deepseek2
* remove name go
* addressed changes - main change was adopting qwen3 naming scheme
* I cannot with linters
* clean up logs
* clean up logs
---------
Co-authored-by: Grace Guo <graceguo@Graces-MBP.localdomain>
Co-authored-by: Grace Guo <graceguo@Graces-MacBook-Pro.local>
Co-authored-by: graceguo <graceguo@tater36.localdomain>
* perf: build graph for next batch in parallel to keep GPU busy
This refactors the main run loop of the ollama runner to perform the
GPU-intensive tasks (Compute+Floats) in a goroutine, so the next batch can be
prepared in parallel and the GPU spends less time stalled waiting for its next
batch of work.
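A hedged sketch of the pipelining shape, using hypothetical nextBatch/compute/consume helpers rather than the actual runner code:

```go
package sketch

// batch is a placeholder for the prepared inputs of one decode step.
type batch struct{ /* tokens, positions, ... */ }

// run overlaps batch preparation with GPU work: the GPU-heavy compute for the
// current batch runs in a goroutine while the next batch is prepared, and we
// only wait on it right before submitting the following batch.
func run(nextBatch func() (batch, bool), compute func(batch) []float32, consume func([]float32)) {
	var wait func() // waits for (and consumes) the in-flight batch, if any

	for {
		b, ok := nextBatch() // CPU-side preparation of the next batch
		if !ok {
			break
		}
		if wait != nil {
			wait() // ensure the previous batch has finished on the GPU
		}
		done := make(chan []float32, 1)
		go func(b batch) { done <- compute(b) }(b) // GPU-intensive Compute+Floats
		wait = func() { consume(<-done) }
	}
	if wait != nil {
		wait() // drain the final in-flight batch
	}
}
```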
* tests: tune integration tests for ollama engine
This tunes the integration tests to focus more on models supported
by the new engine.