Commit Graph

55 Commits

Author SHA1 Message Date
Sergey Penkovsky
b1737bbce2 feat(mixtral): initial implementation of Mixtral MoE model, configs, and tests
- Add Mixtral architecture implementation with MoE support (llm/src/llm/models/mixtral/mixtral.py)
- Introduce generic Mixture-of-Experts (MoE) block (llm/src/llm/core/moe.py)
- Create dedicated configuration files for Mixtral training and generation experiments
- Register and test Mixtral support in experiment runner (run_llm_experiment.py)
- Add unit tests for Mixtral API including forward, caching, and generation modes
- Include Jupyter notebook mixstral.ipynb for architectural exploration and research
- Ensure correct handling of torch bool masks in sampling (top-k, top-p) during generation

BREAKING CHANGE: Adds new model code and test coverage, modifying experiment runner logic to register Mixtral.
2025-10-20 08:12:11 +03:00
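As a point of reference for the generic MoE block added in this commit, below is a minimal sketch of top-k expert routing; the class name, layer layout, and shapes are illustrative assumptions, not the actual API of llm/src/llm/core/moe.py.

```python
# Hypothetical sketch of a top-k routed Mixture-of-Experts block (not the repo's actual code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoE(nn.Module):
    def __init__(self, d_model: int, d_ff: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, s, d = x.shape
        flat = x.reshape(-1, d)                             # route per token
        logits = self.router(flat)                          # [tokens, num_experts]
        weights, indices = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)                # normalize over selected experts
        out = torch.zeros_like(flat)
        for e, expert in enumerate(self.experts):
            token_ids, slot = (indices == e).nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue
            out[token_ids] += weights[token_ids, slot].unsqueeze(-1) * expert(flat[token_ids])
        return out.reshape(b, s, d)

y = MoE(d_model=32, d_ff=64)(torch.randn(2, 5, 32))
assert y.shape == (2, 5, 32)
```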
Sergey Penkovsky
1aba02cab9 Merge pull request #3 from pese-git/feature/mistral
Feature/mistral
2025-10-17 20:45:20 +03:00
Sergey Penkovsky
9794db3e18 docs(readme): update project documentation for LLaMA, Mistral, HF integration
- Added explicit support and usage examples for Mistral and LLaMA architectures in both root and llm/ READMEs
- Updated directory structure and naming (datasets, tokenizers, mistral, hf-proxy)
- Clarified quickstart and experiments usage including config location and CLI
- Documented the HuggingFace integration and marked it as experimental
- Highlighted differences and specifics of all supported architectures
- Improved guide for launching training/generation/experiments
- Made project scope and architecture more transparent for new contributors
2025-10-17 20:18:57 +03:00
Sergey Penkovsky
d947b7beb3 update and expand scientific docstrings for optimizer, scheduler, trainer
- Expanded module-level and function/class docstrings in optimizer.py, scheduler.py, and trainer.py
- Described mathematical foundations, theoretical motivations, and provided detailed usage examples for students
- All docstrings in Russian, clear scientific style

test(training): add comprehensive tests for optimizer, scheduler, and trainer modules

- Added new test files for get_optimizer, get_linear_schedule_with_warmup, and Trainer
- Tests cover parameter handling, edge cases, and expected learning dynamics (lr schedules and loss behavior)
- Trainer now logs average epoch losses to self.loss_history for testability and analysis

refactor(training/trainer): log epoch loss to loss_history for downstream analysis and tests

BREAKING CHANGE: Trainer.loss_history is a new attribute consolidating average losses per epoch, enabling robust learning dynamics assertions in tests
2025-10-17 16:25:39 +03:00
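A minimal sketch of the per-epoch loss bookkeeping this BREAKING CHANGE describes; the surrounding Trainer fields and method names are assumptions, only the loss_history attribute comes from the commit.

```python
# Hypothetical Trainer fragment: average each epoch's loss and store it in self.loss_history
# so tests can assert on learning dynamics. Everything except loss_history is assumed.
class Trainer:
    def __init__(self, model, optimizer, loss_fn):
        self.model = model
        self.optimizer = optimizer
        self.loss_fn = loss_fn
        self.loss_history = []                 # one averaged loss per epoch

    def train_epoch(self, dataloader):
        total, batches = 0.0, 0
        for inputs, targets in dataloader:
            self.optimizer.zero_grad()
            loss = self.loss_fn(self.model(inputs), targets)
            loss.backward()
            self.optimizer.step()
            total += loss.item()
            batches += 1
        avg = total / max(batches, 1)
        self.loss_history.append(avg)          # consumed by tests and downstream analysis
        return avg
```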
Sergey Penkovsky
613d784565 doc(datasets): update docstrings and tests 2025-10-17 10:49:45 +03:00
Sergey Penkovsky
38c271ca3c docs(models): update and expand docstrings for Mistral and its methods
- docs: add comprehensive docstrings for the Mistral class (in Russian) and its methods (forward, generate)
- docs: explain model architecture (GQA, Sliding Window Attention, SwiGLU, RMSNorm, RoPE), arguments, constraints, generation modes, usage examples, and references (Mistral, nucleus sampling)
- strictly documentation improvements, no logic/API changes

This commit makes Mistral model documentation clear and user-friendly for LLM engineering and inference.
2025-10-16 17:03:06 +03:00
Sergey Penkovsky
aec3c8adb6 docs(models): update and expand docstrings for LLaMA and generate method
- docs: add full, detailed Russian-language docstring for LLaMA.generate (sampling, top-k/top-p, examples, all parameter constraints and references)
- docs: bring LLaMA class header in line with modern LLM doc practices (motivation, architecture, references)
- no changes to logic, API, or tests

This makes the LLaMA model documentation fully transparent for all generation and inference modes.
2025-10-16 16:55:14 +03:00
Sergey Penkovsky
90eb2f4467 docs(models): expand docstring for generate method in GPT2
- docs: add detailed Russian-language docstring for generate method (args, nuances, sampling modes, error handling, usage examples, references to nucleus sampling and GPT-2 paper)
- strictly doc improvements, no logic or API changes

The updated documentation helps users clearly understand all generation options, constraints, and application modes in GPT2 LLMs.
2025-10-16 16:43:27 +03:00
Sergey Penkovsky
a3415d404a docs(models): update References in GPT docstring for vanilla implementation
- docs: update and focus References in GPT model docstring to only original GPT-1 (Radford et al., 2018) and BPE/Attention Is All You Need, removing GPT-2/HuggingFace links
- no changes to logic, API, or tests

This makes the documentation accurate for the vanilla GPT architecture and research lineage.
2025-10-16 16:33:53 +03:00
Sergey Penkovsky
9837ea3c3d docs(tokenizer): expand docstrings for BpeTokenizer
- docs: update and clarify docstrings for BpeTokenizer class and main methods (encode, decode)
- explain BPE algorithm, motivation, architecture, detailed usage examples, implementation details, references to original papers and major LLMs
- strictly doc improvements, no logic/API changes

This update makes tokenizer code easier to understand and use for language modeling research and engineering.
2025-10-16 15:26:17 +03:00
Sergey Penkovsky
baafca0546 docs(core): update docstrings for TokenEmbeddings
- docs: expand, clarify, and modernize docstrings for TokenEmbeddings class and its methods (__init__, forward, properties)
- explain layer purpose, motivation, math, parameter details, usage examples, and references
- no logic/API changes

This makes the input embedding code more accessible and maintainable for transformer and LLM development.
2025-10-16 15:14:53 +03:00
Sergey Penkovsky
516f9580fb docs(core): add docstrings and unit tests for SwiGLU block
- docs: rewrite and expand docstrings for SwiGLU class and forward method (motivation, math, architecture, usage, references to LLaMA/Mistral/PaLM)
- test: add unit tests for SwiGLU (shape, dtype, gradients, output range, fp16 support, reproducibility)
- strictly doc/tests, no logic or API changes

This improves transparency and reliability for gated FFN blocks in transformer architectures.
2025-10-16 15:09:09 +03:00
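For context, the gated feed-forward form these docstrings document is commonly written as SwiGLU(x) = W_down(SiLU(W_gate x) ⊙ W_up x); the sketch below uses assumed layer names, not necessarily those in the repository.

```python
# Hypothetical SwiGLU sketch (layer names are assumptions).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_up = nn.Linear(d_model, d_ff, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SiLU-gated branch elementwise-multiplied with the linear "up" branch
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

assert SwiGLU(16, 64)(torch.randn(2, 3, 16)).shape == (2, 3, 16)
```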
Sergey Penkovsky
64d33783e0 docs(core): add docstrings and unit tests for SiLU activation
- docs: expand and clarify docstrings for SiLU class and its method (mathematical formula, motivation, properties vs ReLU/GELU, usage, and references to Swish/LLM papers)
- test: add unit tests for SiLU (shape/dtype, behavior on large/small values, PyTorch reference, gradients, broadcast)
- no logic/API changes

This update improves reliability and usability of the SiLU activation module.
2025-10-16 14:48:50 +03:00
Sergey Penkovsky
6efc946027 docs(core): expand docstrings and add unit tests for RMSNorm
- docs: update/increase docstring detail for RMSNorm class and methods (motivation, formula, architecture, usage, references to LLaMA/PaLM/GPT)
- test: add comprehensive unit tests for RMSNorm (shape/type preservation, rms scaling, gradients for input and weights, fp16, large eps stability)

No code/API changes beyond docs and new tests.
2025-10-16 14:37:25 +03:00
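The formula these docstrings and tests cover is y = x / sqrt(mean(x²) + eps) · g; a minimal sketch, with parameter names assumed:

```python
# Hypothetical RMSNorm sketch: scale by the root-mean-square of the features, no mean subtraction.
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))         # learnable gain g

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        inv_rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * inv_rms * self.weight

x = torch.randn(2, 4, 8)
assert RMSNorm(8)(x).shape == x.shape
```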
Sergey Penkovsky
8018efae2a docs(core): expand docstrings for PositionalEmbeddings module
- docs: update and clarify docstrings for PositionalEmbeddings class and methods (__init__, forward)
- explain motivation, mathematical formulas, usage examples, architectural options (learned vs sinusoidal), external references
- no API or code changes

This makes the positional encoding component easier to understand and use for all transformer practitioners.
2025-10-16 14:09:05 +03:00
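Of the two options the docstring contrasts, the fixed sinusoidal variant can be sketched as follows (the function name and layout are assumptions; the learned variant would simply be an nn.Embedding over positions):

```python
# Hypothetical sketch of sinusoidal positional embeddings (Vaswani et al., 2017).
import math
import torch

def sinusoidal_embeddings(max_len: int, d_model: int) -> torch.Tensor:
    position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                         * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe                                    # [max_len, d_model], added to token embeddings

assert sinusoidal_embeddings(128, 64).shape == (128, 64)
```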
Sergey Penkovsky
0832d78acf docs(core): improve docstrings and add unit tests for GELU activation
- docs: rewrite and expand docstrings for GELU class and method (motivation, math formula, smoother ReLU for Transformers, usage, references)
- test: add dedicated tests for GELU (output shape, dtype, comparison with torch GELU, monotonicity, gradients, large/small value behavior)
- fix: align numerical test to allow for minor approximation difference vs PyTorch gelu

This update makes the GELU module more transparent and robust for deep learning practitioners and researchers.
2025-10-16 13:59:38 +03:00
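The "minor approximation difference" the fix refers to is the gap between the tanh approximation of GELU and the exact erf-based version; a small sketch of that comparison, with the tolerance chosen here being an assumption:

```python
# Hypothetical sketch: tanh approximation of GELU vs. PyTorch's exact implementation.
import math
import torch
import torch.nn.functional as F

def gelu_tanh(x: torch.Tensor) -> torch.Tensor:
    return 0.5 * x * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x.pow(3))))

x = torch.randn(1024)
# The two forms differ by a small approximation error, so the test tolerance must be loose enough.
assert torch.allclose(gelu_tanh(x), F.gelu(x), atol=1e-3)
```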
Sergey Penkovsky
c338556cfe docs(core): improve and expand docstrings for FeedForward module
- docs: rewrite and clarify docstrings for FeedForward class and its methods (__init__, forward) with architectural explanation, pseudocode, motivation, parameter details, usage example, and key references (GELU, SwiGLU, Transformer)
- no changes to logic or APIs

This makes the feed-forward block more transparent for users and researchers working with transformer models.
2025-10-16 12:47:47 +03:00
Sergey Penkovsky
3a356f5d79 docs(core): improve and expand docstrings for Decoder module
- docs: rewrite and expand docstrings for Decoder class and its methods (__init__, forward)
- clarify the block’s architecture, pre-LN logic, flow with residual connections, and attention masking
- add mathematical pseudocode, motivation, feature list, usage example, and external references (papers, blog)
- no logic or behavior changes

This improves readability and makes the codebase easier to understand for transformer/LLM practitioners.
2025-10-16 12:40:46 +03:00
Sergey Penkovsky
923aa51e2a docs(core): add docstrings and unit tests for CachedDecoder module
- docs: Add detailed docstrings for CachedDecoder class and its methods (__init__, forward); explain autoregressive caching, architecture, math, usage, and links to GPT-2/LLM references
- test: Add comprehensive unit tests for CachedDecoder (initialization, forward with and without cache, cache chaining, output shape, error on long input, backward pass)
- These changes improve code clarity, reliability, and testing for decoder blocks with KV cache.
2025-10-16 12:30:53 +03:00
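The cache chaining these tests exercise boils down to concatenating new keys/values onto the cached ones along the sequence axis; a minimal sketch with assumed shapes:

```python
# Hypothetical KV-cache chaining sketch; tensor layout [batch, heads, seq, head_dim] is assumed.
import torch

def append_kv_cache(k_new, v_new, cache=None):
    if cache is not None:
        k_cached, v_cached = cache
        k_new = torch.cat([k_cached, k_new], dim=2)    # extend along the sequence axis
        v_new = torch.cat([v_cached, v_new], dim=2)
    return k_new, v_new                                # becomes the cache for the next step

k, v = append_kv_cache(torch.randn(1, 4, 3, 8), torch.randn(1, 4, 3, 8))
k, v = append_kv_cache(torch.randn(1, 4, 1, 8), torch.randn(1, 4, 1, 8), cache=(k, v))
assert k.shape[2] == 4                                 # 3 prompt tokens + 1 generated token
```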
Sergey Penkovsky
ba3b04cec2 docs(core): add docstrings and unit tests for MistralDecoder
- docs: expanded docstrings for MistralDecoder class and methods (__init__, forward); explained architecture, key parameters, usage, and links to relevant papers (Mistral, Llama 2)
- test: add comprehensive unit tests for MistralDecoder (init, forward, cache handling, output shape, shape errors, backward)
- These changes improve explainability, reliability, and test coverage for the decoder module.
2025-10-15 18:07:11 +03:00
Sergey Penkovsky
e6ca8dee6f docs(core): add comprehensive docstrings and unit tests for GroupedQueryAttention (GQA)
- docs: Rewrite and expand docstrings for the GroupedQueryAttention class and all main methods (__init__, forward, _repeat_kv_heads, _create_sliding_window_mask):
    - explained GQA architecture and motivation
    - included mathematical formulas, step-by-step algorithms, usage examples
    - added references to relevant scientific papers (Mistral, Llama 2, etc.)
- test: Add dedicated unit tests for GQA (output shape correctness, mask/window logic, KV head replication, RoPE processing, error and edge-cases)
- docs/test: Documentation and tests now fully reflect modern GQA usage and best practices for LLM architectures

This commit makes the implementation, usage, and theoretical underpinnings of GQA transparent and reproducible for researchers and engineers.
2025-10-15 17:27:55 +03:00
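The KV head replication tested here (_repeat_kv_heads) expands each key/value head so it serves a whole group of query heads; a minimal sketch, with the tensor layout assumed:

```python
# Hypothetical GQA head-replication sketch; layout [batch, n_kv_heads, seq, head_dim] is assumed.
import torch

def repeat_kv_heads(kv: torch.Tensor, n_rep: int) -> torch.Tensor:
    if n_rep == 1:
        return kv
    b, n_kv, s, d = kv.shape
    return kv[:, :, None, :, :].expand(b, n_kv, n_rep, s, d).reshape(b, n_kv * n_rep, s, d)

kv = torch.randn(2, 2, 5, 16)                          # 2 KV heads
assert repeat_kv_heads(kv, 4).shape == (2, 8, 5, 16)   # now serves 8 query heads
```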
Sergey Penkovsky
2e72dbaf07 test(llama): add unit tests for generation, cache, and edge cases
- Covers inference with and without cache and with sampling (top-k, top-p)
- Includes test for max sequence length (should raise ValueError)
- Verifies output shape and absence of dtype errors for the mask logic
- Minimal config and random data ensure tests are fast and robust

Motivation: Regression and integration protection for Llama decoding and sampling logic.
2025-10-15 14:37:35 +03:00
Sergey Penkovsky
dc440a3938 test(gpt2): add unit tests for generation, cache behavior, and error conditions
- Covers forward pass with and without KV-cache
- Verifies correct sequence generation for greedy, top-k, and top-p sampling
- Adds ValueError test for exceeding max sequence length
- Uses small random toy config and minimal setup for fast test feedback

Motivation: Prevent regressions in decoding, sampling, and KV-cache logic in GPT2 implementation.
2025-10-15 14:36:32 +03:00
Sergey Penkovsky
50d7593023 fix(gpt2, llama): proper top-k/top-p mask handling in sampling for PyTorch compatibility (bool/uint8)
- Refactored the token selection logic in the sampling code of the GPT2 and Llama classes.
- Masks are now created with dtype=torch.bool (or torch.uint8 for legacy PyTorch).
- Used True/False for mask/scatter instead of 1/0, ensuring correctness across PyTorch versions.
- Fixed RuntimeError: masked_fill_ only supports boolean masks, previously raised by uint8-masks in new PyTorch.
- Backward compatibility maintained: code works on PyTorch >=1.2 and for old clusters (via the else branch).

Motivation: Fixes sampling errors for all modern PyTorch users while keeping research code usable on old infra.
2025-10-15 14:35:10 +03:00
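The essence of the fix is that the filtering mask is built as torch.bool (with a uint8 fallback for very old PyTorch) before it reaches masked_fill; the top-k sketch below illustrates that idea and is not the repository's actual generate code:

```python
# Hypothetical top-k filtering sketch with a boolean mask (uint8 only as a legacy fallback).
import torch

def top_k_filter(logits: torch.Tensor, k: int) -> torch.Tensor:
    mask_dtype = torch.bool if hasattr(torch, "bool") else torch.uint8
    kth_value = torch.topk(logits, k, dim=-1).values[..., -1, None]
    remove = (logits < kth_value).to(mask_dtype)        # True where the token is filtered out
    return logits.masked_fill(remove, float("-inf"))

filtered = top_k_filter(torch.randn(1, 10), k=3)
assert torch.isfinite(filtered).sum() == 3              # exactly k candidates survive
```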
Sergey Penkovsky
38682e8c9d test(mistral): add unit tests for model generation and cache 2025-10-15 13:20:50 +03:00
Sergey Penkovsky
e791f7cd93 fix(mistral): fix top-k/top-p mask handling for PyTorch >=1.2 2025-10-15 13:20:30 +03:00
Sergey Penkovsky
d10044e4a7 refactor(core): refactor RoPE and MultiHeadAttention, add math-rich docs, expand tests, remove unused head_attention
- refactor: improved and unified the RoPE implementation, which now enforces strict input-dimension checks; MultiHeadAttention was restructured and cleaned up (clearer logic, stricter input/output specification)
- docs: fully rewrote the docstrings for RoPE and MultiHeadAttention, including mathematical formulas, references to the relevant papers, detailed explanations of the algorithm, input formats, constraints, and usage examples
- test: added dedicated unit tests for RoPE (shape correctness, errors on invalid input dimensionality, norm preservation, backward/gradients, handling of the start_pos parameter and batches)
- chore: removed the unused core/head_attention.py module
- fix: RoPE now raises an AssertionError on input of the wrong dimensionality, which allows the error test cases to be covered completely

This commit brings the base attention implementation in line with modern LLM practice, strengthens the documentation for engineers and researchers, and makes the library's automated tests more reliable.
2025-10-15 11:04:07 +03:00
Sergey Penkovsky
ec0d2bd8d0 feat(mistral): add Mistral model implementation and configs
- implement Mistral model in llm/models/mistral/mistral.py with GroupedQueryAttention, SwiGLU, RoPE, sliding window attention
- add __init__.py for module export
- add config files for mistral training and generation
- update universal experiment runner to support Mistral model
- add notebook for Mistral experiments
2025-10-14 14:53:45 +03:00
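The sliding window attention listed above restricts each position to a fixed-size causal window; a minimal mask sketch (the window size and boolean convention are assumptions):

```python
# Hypothetical causal sliding-window mask sketch: position i attends to (i - window, ..., i].
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    i = torch.arange(seq_len).unsqueeze(1)
    j = torch.arange(seq_len).unsqueeze(0)
    return (j <= i) & (j > i - window)                  # [seq_len, seq_len], True = may attend

mask = sliding_window_mask(6, window=3)
assert mask[5].tolist() == [False, False, False, True, True, True]
```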
Sergey Penkovsky
e5706a690d fix(rope, attention): correct RoPE positioning during cached generation
- Fixed the position calculation for RoPE (Rotary Positional Embeddings) during autoregressive generation with the KV cache.
- HeadAttention now passes start_pos to RoPE, computed from the cache length.
- Updated the signature and logic of RoPE.forward.
- Updated the llama.ipynb notebook for the new interfaces and outputs.

BREAKING CHANGE: RoPE.forward has been redefined; code that used RoPE directly must be updated.
2025-10-14 12:03:20 +03:00
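The idea behind passing start_pos is that tokens decoded with a KV cache must be rotated by their absolute position, not by their position within the new chunk; a minimal sketch (the interleaved even/odd pairing and the 3-D layout are assumptions):

```python
# Hypothetical RoPE sketch with a start_pos offset taken from the cache length.
import torch

def apply_rope(x: torch.Tensor, start_pos: int = 0, base: float = 10000.0) -> torch.Tensor:
    b, seq_len, d = x.shape                              # assumed layout [batch, seq, dim], dim even
    pos = torch.arange(start_pos, start_pos + seq_len, dtype=torch.float32)
    inv_freq = 1.0 / (base ** (torch.arange(0, d, 2, dtype=torch.float32) / d))
    angles = pos[:, None] * inv_freq[None, :]            # [seq_len, d/2]
    cos, sin = angles.cos(), angles.sin()
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out

# During cached generation, the offset equals the number of tokens already in the cache.
rotated = apply_rope(torch.randn(1, 1, 64), start_pos=17)
```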
Sergey Penkovsky
3e4815fcc6 refactor(experiments): migrate to universal runner + config structure, remove legacy scripts
- add universal runner run_llm_experiment.py with JSON-config driven LLM training / generation
- add configs for gpt, gpt2, llama (training/generation)
- remove individual train/generate scripts for each model
- update README with simple how-to for experiments block

BREAKING CHANGE: all llm_only experiments now run only through run_llm_experiment.py; legacy scripts removed
2025-10-14 11:57:23 +03:00
Sergey Penkovsky
0cc7850848 fix: format code 2025-10-06 23:03:01 +03:00
Sergey Penkovsky
237b86421e doc: update docstring 2025-10-06 23:02:03 +03:00
Sergey Penkovsky
712278e33c Refactoring: unify code formatting (whitespace, quotes, blank lines) across the whole project, with no logic changes. 2025-10-06 22:57:19 +03:00
Sergey Penkovsky
332cad6159 Merge pull request #2 from pese-git/feature/llama
Feature/llama
2025-10-06 22:05:45 +03:00
Sergey Penkovsky
2434d34188 docs: scientific and practical documentation for all key LLM modules
- Improved and extended the docstrings of the core components (decoder, cached_decoder, multi_head_attention, head_attention, feed_forward, token_embeddings, positional_embeddings, gelu, silu, swi_glu, rope, rms_norm)
- Written in Russian: the architecture algorithms are explained, with formulas and references to the papers
- Added detailed descriptions of classes, forward/generate methods, and input/output formats for all models (GPT, GPT2, LLaMA)
- Usage examples in every key class
- Described the scientific concepts, architectural differences, and the rationale behind design choices
2025-10-06 21:59:55 +03:00
Sergey Penkovsky
73ee3e16ec docs: update and enhance documentation for all core components and models
- Added detailed documentation for GPT, GPT2 and LLaMA models
- Enhanced docstrings in base_model.py, rope.py, rms_norm.py, swi_glu.py
- Updated README with architectural differences and usage examples
- Added scientific references and mathematical foundations
- Improved type hints and parameter descriptions
2025-10-06 20:34:02 +03:00
Sergey Penkovsky
3bc2848cf0 refactor: unify CachedDecoder implementation across models
- Completely removed duplicate CachedDecoder from llama.py
- Modified core CachedDecoder to support dependency injection:
  - Added feed_forward_layer parameter (required)
  - Added norm_layer parameter with LayerNorm default
  - Added rope parameter for RoPE support
  - Removed unused activation parameter
- Updated GPT2 to use new CachedDecoder with FeedForward
- Updated LLaMA to use new CachedDecoder with SwiGLU and RMSNorm
- Fixed parameter order in constructor to follow Python syntax rules

This eliminates all code duplication while maintaining architectural specificities through dependency injection.
2025-10-06 14:57:29 +03:00
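A sketch of the dependency-injection shape this refactor describes: the decoder block receives its feed-forward, normalization, and optional RoPE modules from the calling model instead of constructing them itself. The parameter names follow the commit text; the forward wiring and attention signature are assumptions:

```python
# Hypothetical dependency-injected decoder block; pre-LN wiring and attention signature are assumed.
import copy
from typing import Optional
import torch.nn as nn

class CachedDecoder(nn.Module):
    def __init__(self, attention: nn.Module, feed_forward_layer: nn.Module,
                 norm_layer: Optional[nn.Module] = None, rope: Optional[nn.Module] = None,
                 d_model: int = 512):
        super().__init__()
        norm = norm_layer if norm_layer is not None else nn.LayerNorm(d_model)
        self.attention = attention                # shared attention module
        self.feed_forward = feed_forward_layer    # GPT2: FeedForward; LLaMA: SwiGLU
        self.norm1 = copy.deepcopy(norm)          # pre-attention norm (LayerNorm or RMSNorm)
        self.norm2 = copy.deepcopy(norm)          # pre-FFN norm
        self.rope = rope                          # used inside attention when provided

    def forward(self, x, cache=None):
        attn_out, cache = self.attention(self.norm1(x), cache=cache)
        x = x + attn_out
        x = x + self.feed_forward(self.norm2(x))
        return x, cache
```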
Sergey Penkovsky
d99d605b35 refactor: partial removal of duplicate code by using core modules
- Removed duplicate HeadAttention and MultiHeadAttention implementations from llama.py
- Now importing MultiHeadAttention from core module
- Added RoPE support parameter to core HeadAttention constructor
- Kept LLaMA-specific CachedDecoder implementation (uses SwiGLU and RMSNorm)
- Core CachedDecoder uses different components (FeedForward and LayerNorm)
- Improved code reuse for attention components while maintaining LLaMA-specific decoder

This is a partial refactor - attention components are now shared, but decoder remains LLaMA-specific due to different normalization and activation requirements.
2025-10-06 14:26:32 +03:00
Sergey Penkovsky
211adf574c refactor: extract LLaMA components to separate modules in core directory
- Moved GELU, RMSNorm, RoPE, SiLU, and SwiGLU implementations from llama.py to dedicated files in core/
- Updated feed_forward.py to use new modular components
- Modified llama.py to import components from core modules instead of local definitions
- Improved code organization and reusability of activation functions and normalization layers

This refactor enables better code reuse across different model architectures and follows the single responsibility principle.
2025-10-06 14:09:19 +03:00
Sergey Penkovsky
f30cd530a9 feat: add LLaMA model implementation with RoPE positional encoding
- Added LLaMA model architecture with RMSNorm and SwiGLU activation
- Implemented Rotary Positional Embeddings (RoPE) for better positional encoding
- Created training script for LLaMA with BPE tokenizer
- Fixed matplotlib dependency version in uv.lock
- Added LLaMA module initialization

The implementation includes:
- TokenEmbeddings, HeadAttention, MultiHeadAttention with RoPE support
- RMSNorm normalization layer
- SwiGLU feed-forward activation
- Cached decoder implementation for efficient generation
2025-10-06 13:26:20 +03:00
Sergey Penkovsky
9898e8ee83 feat: add RoPE positional embeddings implementation in llama.ipynb
- Implement Rotary Positional Embeddings (RoPE) with separate cosine/sine components
- Add vectorized computation of inverse frequencies for RoPE
- Include tensor slicing utilities for even/odd column separation
- Update dependencies in pyproject.toml and uv.lock
2025-10-06 12:52:59 +03:00
Sergey Penkovsky
b6f56a2640 fix: typo in activation attribute for SwiGLU (rename _actvation to _activation) and minor index update 2025-10-05 23:01:58 +03:00
Sergey Penkovsky
e5b5a97811 Merge pull request #1 from pese-git/feature/gpt2
Feature/gpt2
2025-10-05 21:30:33 +03:00
Sergey Penkovsky
b9d9bdcc71 docs(readme): add explicit support notice for GPT-2 architecture and usage examples 2025-10-05 21:29:38 +03:00
Sergey Penkovsky
c31eed8551 fix(hf-integration): handle logits as tuple in hf_adapter, convert torch.Tensor to list in hf_tokenizer.decode for decoding compatibility 2025-10-05 20:47:36 +03:00
Sergey Penkovsky
3843e64098 test(core): fix FeedForward and MultiHeadAttention tests for unified interface and tuple outputs 2025-10-05 19:26:18 +03:00
Sergey Penkovsky
c39e68d71a feat(gpt2): add GPT2 architecture with universal FeedForward, CachedDecoder, and refactored components. Core modules now shared; add train and generate scripts for GPT2-BPE. 2025-10-05 19:11:20 +03:00
Sergey Penkovsky
f866ed7ac7 fix: universal logits extraction for tuple/model output in Trainer (GPT/GPT2 compatibility) 2025-10-05 15:52:21 +03:00
Sergey Penkovsky
aa408e941a docs: add GPT-2 analysis notebook
- Add gpt2.ipynb with GPT-2 model experiments and comparisons
2025-10-05 12:48:32 +03:00
Sergey Penkovsky
da1cf3fb55 fix: rename notebook 2025-10-05 12:46:17 +03:00