llm-arch-research

mirror of https://github.com/pese-git/llm-arch-research.git synced 2026-01-23 21:10:54 +00:00

Author	SHA1	Message	Date
Sergey Penkovsky	db0ab511d1	feat(gpt2): add Gpt2Decoder module, refactor model and add tests - Implemented core/gpt2_decoder.py: transformer decoder block with kv cache in GPT2 style - Refactored models/gpt/gpt2.py to use new Gpt2Decoder, improved documentation - Added tests/core/test_gpt2_decoder.py for main features and cache - Temporarily skipped HF proxy integration test for compatibility	2025-10-31 15:35:54 +03:00
Sergey Penkovsky	7744658716	Merge pull request #6 from pese-git/ref/gpt1 Ref/gpt1	2025-10-31 09:15:54 +03:00
Sergey Penkovsky	21cfd79c19	refactor(assets): update and reorganize GPT-1 architecture diagrams - Renamed GPT-1 main scheme files for clarity - Added new diagram files for attention, decoder, embeddings, and forward blocks (both .drawio and .png) - Removed deprecated files (gpt11.drawio, gpt1.svg) - Updated notebooks/gpt.ipynb with relevant changes	2025-10-30 14:40:31 +03:00
Sergey Penkovsky	9e2796e6be	docs(gpt1): add architecture diagrams and notebook updates - Added architecture diagrams for GPT-1: gpt1.drawio, gpt11.drawio (drawio format) - Exported visualization images: gpt1.png, gpt1.svg for documentation and presentations - Updated gpt.ipynb notebook to reference new materials and possibly add explanations of layers/logic - New assets help to clarify model structure and training flow for both contributors and external users	2025-10-24 17:42:11 +03:00
Sergey Penkovsky	25caf69ced	refactor(gpt1): migrate Decoder to GptDecoder, unify API, and update tests - Renamed Decoder (and decoder.py) to GptDecoder (gpt_decoder.py) for clarity in GPT1 - Implemented support for cache and use_cache parameters in GptDecoder.forward (API unification) - Adapted all usages in GPT model to use new decoder structure and handle tuple output - Refactored core tests (test_gpt.py, test_gpt_decoder.py, test_basic.py) to correctly expect tuple or logits and ensure shape/device checks work as before - Improved clarity and future extensibility for autoregressive generation and benchmarking - No changes to architectural details or training loop; pure API and test modernization	2025-10-22 16:27:08 +03:00
Sergey Penkovsky	ddc4924a37	refactor(models): unify generate() signatures across all LLM architectures\n\n- Unified method signature: (x, max_new_tokens, do_sample, temperature, top_k, top_p, use_cache, attention_mask, **kwargs)\n- Added del attention_mask, kwargs in every generate() for compatibility and clean API\n- Prepared for drop-in replacement and ease of future batching/serving\n\nNo changes to core model logic or sampling algorithms.	2025-10-22 11:57:26 +03:00
Sergey Penkovsky	92a34551b8	Merge pull request #5 from pese-git/feature/gemma Feature/gemma	2025-10-21 17:53:55 +03:00
Sergey Penkovsky	ea932a36f3	feat(gemma): document and test GeGLU, MultiQueryAttention, GemmaDecoder, update Gemma model docs - Add new core modules: GeGLU (Gated GELU Linear Unit), GemmaDecoder, MultiQueryAttention; all with highly detailed scientific (RU) docstrings: theory, usage, formulas, references - Major doc improvements in Gemma model: class, __init__, forward, generate now have full educational/engineering docstrings, use-case samples, and literature links - Add comprehensive unit tests: * tests/core/test_geglu.py: GeGLU coverage (shape, grads, edge, repeat, float16/skip) * tests/core/test_gemma_decoder.py: GemmaDecoder coverage (shape, mask, cache, repeatability, errors) * tests/core/test_multi_query_attention.py: MQA coverage (shape, cache, gradients, masking, dropout, raise) - All modules and tests follow strict quality/documentation standards, code is now robust for research & production	2025-10-21 15:12:45 +03:00
Sergey Penkovsky	cfb4b6dfb1	feat(gemma): initial implementation of Gemma model and configs - Add core Gemma model (architecture, attention, GeGLU, RoPE, RMSNorm, etc) - Add configs for training and generation: gemma_train.json, gemma_generate.json - Add Gemma notebook for exploratory analysis and demonstration - Add __init__.py for Gemma submodule - Update run_llm_experiment.py to support Gemma experiment configs test(gemma): add comprehensive unit tests for Gemma - Test forward pass (with/without cache) - Test autoregressive generation (greedy, top-k, top-p) - Test shape correctness and max sequence length errors - Test multi-layer stack and token embeddings docs: add documentation notebook for Gemma usage and analysis Closes: #issue (if applicable)	2025-10-21 01:02:15 +03:00
Sergey Penkovsky	58c4a00b48	Merge pull request #4 from pese-git/feature/mixtral Feature/mixtral	2025-10-20 16:36:39 +03:00
Sergey Penkovsky	c9da4c841b	feat(mixtral): add MixtralDecoder, enhance MoE and Mixtral model docs, add unit tests - Implement new core module: MixtralDecoder (llm/core/mixtral_decoder.py) with full Russian scientific docstrings, formal math, and usage examples - Improve MoE: add Russian docstrings for class, __init__, forward; validate top_k_experts; explain theory and components - Refactor Mixtral model: switch stack to MixtralDecoder, add comprehensive documentation for class, constructor and forward, clarify config usage and architecture - Add thorough unit tests: * tests/core/test_mixtral_decoder.py: checks shapes, errors, mask, dropout, grads etc. * tests/core/test_moe.py: covers normal and edge-case logic, gradients, shape, params check - All code and tests in compliance with recent scientific and engineering standards.	2025-10-20 16:07:51 +03:00
Sergey Penkovsky	b1737bbce2	feat(mixtral): initial implementation of Mixtral MoE model, configs, and tests - Add Mixtral architecture implementation with MoE support (llm/src/llm/models/mixtral/mixtral.py) - Introduce generic Mixture-of-Experts (MoE) block (llm/src/llm/core/moe.py) - Create dedicated configuration files for Mixtral training and generation experiments - Register and test Mixtral support in experiment runner (run_llm_experiment.py) - Add unit tests for Mixtral API including forward, caching, and generation modes - Include Jupyter notebook mixstral.ipynb for architectural exploration and research - Ensure correct handling of torch bool masks in sampling (top-k, top-p) during generation BREAKING CHANGE: Adds new model code and test coverage, modifying experiment runner logic to register Mixtral.	2025-10-20 08:12:11 +03:00
Sergey Penkovsky	1aba02cab9	Merge pull request #3 from pese-git/feature/mistral Feature/mistral	2025-10-17 20:45:20 +03:00
Sergey Penkovsky	9794db3e18	docs(readme): update project documentation for LLaMA, Mistral, HF integration - Added explicit support and usage examples for Mistral and LLaMA architectures in both root and llm/ READMEs - Updated directory structure and naming (datasets, tokenizers, mistral, hf-proxy) - Clarified quickstart and experiments usage including config location and CLI - Documented HuggingFace integration via and marked it as experimental - Highlighted differences and specifics of all supported architectures - Improved guide for launching training/generation/experiments - Made project scope and architecture more transparent for new contributors	2025-10-17 20:18:57 +03:00
Sergey Penkovsky	d947b7beb3	update and expand scientific docstrings for optimizer, scheduler, trainer - Expanded module-level and function/class docstrings in optimizer.py, scheduler.py, and trainer.py - Described mathematical foundations, theoretical motivations, and provided detailed usage examples for students - All docstrings in Russian, clear scientific style test(training): add comprehensive tests for optimizer, scheduler, and trainer modules - Added new test files for get_optimizer, get_linear_schedule_with_warmup, and Trainer - Tests cover parameter handling, edge cases, and expected learning dynamics (lr schedules and loss behavior) - Trainer now logs average epoch losses to self.loss_history for testability and analysis refactor(training/trainer): log epoch loss to loss_history for downstream analysis and tests BREAKING CHANGE: Trainer.loss_history is a new attribute consolidating average losses per epoch, enabling robust learning dynamics assertions in tests	2025-10-17 16:25:39 +03:00
Sergey Penkovsky	613d784565	doc(datasets): update docstrings and tests	2025-10-17 10:49:45 +03:00
Sergey Penkovsky	38c271ca3c	docs(models): update and expand docstrings for Mistral and its methods - docs: add comprehensive docstrings for the Mistral class (in Russian) and its methods (forward, generate) - docs: explain model architecture (GQA, Sliding Window Attention, SwiGLU, RMSNorm, RoPE), arguments, constraints, generation modes, usage examples, and references (Mistral, nucleus sampling) - strictly documentation improvements, no logic/API changes This commit makes Mistral model documentation clear and user-friendly for LLM engineering and inference.	2025-10-16 17:03:06 +03:00
Sergey Penkovsky	aec3c8adb6	docs(models): update and expand docstrings for LLaMA and generate method - docs: add full, detailed Russian-language docstring for LLaMA.generate (sampling, top-k/top-p, examples, all parameter constraints and references) - docs: bring LLaMA class header in line with modern LLM doc practices (motivation, architecture, references) - no changes to logic, API, or tests This makes the LLaMA model documentation fully transparent for all generation and inference modes.	2025-10-16 16:55:14 +03:00
Sergey Penkovsky	90eb2f4467	docs(models): expand docstring for generate method in GPT2 - docs: add detailed Russian-language docstring for generate method (args, nuances, sampling modes, error handling, usage examples, references to nucleus sampling and GPT-2 paper) - strictly doc improvements, no logic or API changes The updated documentation helps users clearly understand all generation options, constraints, and application modes in GPT2 LLMs.	2025-10-16 16:43:27 +03:00
Sergey Penkovsky	a3415d404a	docs(models): update References in GPT docstring for vanilla implementation - docs: update and focus References in GPT model docstring to only original GPT-1 (Radford et al., 2018) and BPE/Attention Is All You Need, removing GPT-2/HuggingFace links - no changes to logic, API, or tests This makes the documentation accurate for the vanilla GPT architecture and research lineage.	2025-10-16 16:33:53 +03:00
Sergey Penkovsky	9837ea3c3d	docs(tokenizer): expand docstrings for BpeTokenizer - docs: update and clarify docstrings for BpeTokenizer class and main methods (encode, decode) - explain BPE algorithm, motivation, architecture, detailed usage examples, implementation details, references to original papers and major LLMs - strictly doc improvements, no logic/API changes This update makes tokenizer code easier to understand and use for language modeling research and engineering.	2025-10-16 15:26:17 +03:00
Sergey Penkovsky	baafca0546	docs(core): update docstrings for TokenEmbeddings - docs: expand, clarify, and modernize docstrings for TokenEmbeddings class and its methods (__init__, forward, properties) - explain layer purpose, motivation, math, parameter details, usage examples, and references - no logic/API changes This makes the input embedding code more accessible and maintainable for transformer and LLM development.	2025-10-16 15:14:53 +03:00
Sergey Penkovsky	516f9580fb	docs(core): add docstrings and unit tests for SwiGLU block - docs: rewrite and expand docstrings for SwiGLU class and forward method (motivation, math, architecture, usage, references to LLaMA/Mistral/PaLM) - test: add unit tests for SwiGLU (shape, dtype, gradients, output range, fp16 support, reproducibility) - strictly doc/tests, no logic or API changes This improves transparency and reliability for gated FFN blocks in transformer architectures.	2025-10-16 15:09:09 +03:00
Sergey Penkovsky	64d33783e0	docs(core): add docstrings and unit tests for SiLU activation - docs: expand and clarify docstrings for SiLU class and its method (mathematical formula, motivation, properties vs ReLU/GELU, usage, and references to Swish/LLM papers) - test: add unit tests for SiLU (shape/dtype, behavior on large/small values, PyTorch reference, gradients, broadcast) - no logic/API changes This update improves reliability and usability of the SiLU activation module.	2025-10-16 14:48:50 +03:00
Sergey Penkovsky	6efc946027	docs(core): expand docstrings and add unit tests for RMSNorm - docs: update/increase docstring detail for RMSNorm class and methods (motivation, formula, architecture, usage, references to LLaMA/PaLM/GPT) - test: add comprehensive unit tests for RMSNorm (shape/type preservation, rms scaling, gradients for input and weights, fp16, large eps stability) No code/API changes beyond docs and new tests.	2025-10-16 14:37:25 +03:00
Sergey Penkovsky	8018efae2a	docs(core): expand docstrings for PositionalEmbeddings module - docs: update and clarify docstrings for PositionalEmbeddings class and methods (__init__, forward) - explain motivation, mathematical formulas, usage examples, architectural options (learned vs sinusoidal), external references - no API or code changes This makes the positional encoding component easier to understand and use for all transformer practitioners.	2025-10-16 14:09:05 +03:00
Sergey Penkovsky	0832d78acf	docs(core): improve docstrings and add unit tests for GELU activation - docs: rewrite and expand docstrings for GELU class and method (motivation, math formula, smoother ReLU for Transformers, usage, references) - test: add dedicated tests for GELU (output shape, dtype, comparison with torch GELU, monotonicity, gradients, large/small value behavior) - fix: align numerical test to allow for minor approximation difference vs PyTorch gelu This update makes the GELU module more transparent and robust for deep learning practitioners and researchers.	2025-10-16 13:59:38 +03:00
Sergey Penkovsky	c338556cfe	docs(core): improve and expand docstrings for FeedForward module - docs: rewrite and clarify docstrings for FeedForward class and its methods (__init__, forward) with architectural explanation, pseudocode, motivation, parameter details, usage example, and key references (GELU, SwiGLU, Transformer) - no changes to logic or APIs This makes the feed-forward block more transparent for users and researchers working with transformer models.	2025-10-16 12:47:47 +03:00
Sergey Penkovsky	3a356f5d79	docs(core): improve and expand docstrings for Decoder module - docs: rewrite and expand docstrings for Decoder class and its methods (__init__, forward) - clarify the block’s architecture, pre-LN logic, flow with residual connections, and attention masking - add mathematical pseudocode, motivation, feature list, usage example, and external references (papers, blog) - no logic or behavior changes This improves readability and makes the codebase easier to understand for transformer/LLM practitioners.	2025-10-16 12:40:46 +03:00
Sergey Penkovsky	923aa51e2a	docs(core): add docstrings and unit tests for CachedDecoder module - docs: Add detailed docstrings for CachedDecoder class and its methods (__init__, forward); explain autoregressive caching, architecture, math, usage, and links to GPT-2/LLM references - test: Add comprehensive unit tests for CachedDecoder (initialization, forward with and without cache, cache chaining, output shape, error on long input, backward pass) - These changes improve code clarity, reliability, and testing for decoder blocks with KV cache.	2025-10-16 12:30:53 +03:00
Sergey Penkovsky	ba3b04cec2	docs(core): add docstrings and unit tests for MistralDecoder - docs: expanded docstrings for MistralDecoder class and methods (__init__, forward); explained architecture, key parameters, usage, and links to relevant papers (Mistral, Llama 2) - test: add comprehensive unit tests for MistralDecoder (init, forward, cache handling, output shape, shape errors, backward) - These changes improve explainability, reliability, and test coverage for the decoder module.	2025-10-15 18:07:11 +03:00
Sergey Penkovsky	e6ca8dee6f	docs(core): add comprehensive docstrings and unit tests for GroupedQueryAttention (GQA) - docs: Rewrite and expand docstrings for the GroupedQueryAttention class and all main methods (__init__, forward, _repeat_kv_heads, _create_sliding_window_mask): - explained GQA architecture and motivation - included mathematical formulas, step-by-step algorithms, usage examples - added references to relevant scientific papers (Mistral, Llama 2, etc.) - test: Add dedicated unit tests for GQA (output shape correctness, mask/window logic, KV head replication, RoPE processing, error and edge-cases) - docs/test: Documentation and tests now fully reflect modern GQA usage and best practices for LLM architectures This commit makes the implementation, usage, and theoretical underpinnings of GQA transparent and reproducible for researchers and engineers.	2025-10-15 17:27:55 +03:00
Sergey Penkovsky	2e72dbaf07	test(llama): add unit tests for generation, cache, and edge cases - Covers inference with and without cache and with sampling (top-k, top-p) - Includes test for max sequence length (should raise ValueError) - Verifies output shape and absence of dtype errors for the mask logic - Minimal config and random data ensure tests are fast and robust Motivation: Regression and integration protection for Llama decoding and sampling logic.	2025-10-15 14:37:35 +03:00
Sergey Penkovsky	dc440a3938	test(gpt2): add unit tests for generation, cache behavior, and error conditions - Covers forward pass with and without KV-cache - Verifies correct sequence generation for greedy, top-k, and top-p sampling - Adds ValueError test for exceeding max sequence length - Uses small random toy config and minimal setup for fast test feedback Motivation: Prevent regressions in decoding, sampling, and KV-cache logic in GPT2 implementation.	2025-10-15 14:36:32 +03:00
Sergey Penkovsky	50d7593023	fix(gpt2, llama): proper top-k/top-p mask handling in sampling for PyTorch compatibility (bool/uint8) - Refactored token selection logic in methods of GPT2 and Llama classes. - Masks are now created with dtype=torch.bool (or torch.uint8 for legacy PyTorch). - Used True/False for mask/scatter instead of 1/0, ensuring correctness across PyTorch versions. - Fixed RuntimeError: masked_fill_ only supports boolean masks, previously raised by uint8-masks in new PyTorch. - Backward compatibility maintained: code works on PyTorch >=1.2 and for old clusters (via the else branch). Motivation: Fixes sampling errors for all modern PyTorch users while keeping research code usable on old infra.	2025-10-15 14:35:10 +03:00
Sergey Penkovsky	38682e8c9d	test(mistral): add unit tests for model generation and cache	2025-10-15 13:20:50 +03:00
Sergey Penkovsky	e791f7cd93	fix(mistral): fix top-k/top-p mask handling for PyTorch >=1.2	2025-10-15 13:20:30 +03:00
Sergey Penkovsky	d10044e4a7	refactor(core): refactor RoPE and MultiHeadAttention, add math-rich docs, expand tests, remove unused head_attention - refactor: улучшена и унифицирована реализация RoPE, теперь поддерживаются строгие проверки размерности входа; внесены улучшения и структурные изменения в MultiHeadAttention (более понятная логика, строгая спецификация входов/выходов) - docs: полностью переписаны docstrings для RoPE и MultiHeadAttention — включены математические формулы, ссылки на научные статьи, подробные пояснения по алгоритму, формату входных данных, ограничениям, примеры использования - test: добавлены отдельные unit-тесты для RoPE (корректность формы, ошибки на неверную размерность, сохранение нормы, backward/градиенты, работу с параметрами start_pos и батчами) - chore: удалён неиспользуемый модуль core/head_attention.py - fix: теперь выбрасывается AssertionError при неправильной размерности входа RoPE; это позволило полностью покрыть тест-кейсы на ошибки Этот коммит синхронизирует логику реализации базового внимания с современной практикой LLM, укрепляет документацию для инженеров и исследователей, а также расширяет надежность автотестирования библиотеки.	2025-10-15 11:04:07 +03:00
Sergey Penkovsky	ec0d2bd8d0	feat(mistral): add Mistral model implementation and configs - implement Mistral model in llm/models/mistral/mistral.py with GroupedQueryAttention, SwiGLU, RoPE, sliding window attention - add __init__.py for module export - add config files for mistral training and generation - update universal experiment runner to support Mistral model - add notebook for Mistral experiments	2025-10-14 14:53:45 +03:00
Sergey Penkovsky	e5706a690d	fix(rope, attention): корректное позиционирование RoPE при генерации с кэшем - Исправлена ошибка расчёта позиции для RoPE (Rotary Positional Embeddings) при автодополнении с использованием кэша. - В HeadAttention теперь передаётся start_pos в RoPE, вычисляемый из длины кэша. - Обновлена сигнатура и логика метода RoPE.forward. - Обновлен ноутбук llama.ipynb под новые интерфейсы и выводы. BREAKING CHANGE: переопределён метод forward у RoPE, требуется обновить код, если RoPE использовался вручную.	2025-10-14 12:03:20 +03:00
Sergey Penkovsky	3e4815fcc6	refactor(experiments): migrate to universal runner + config structure, remove legacy scripts - add universal runner run_llm_experiment.py with JSON-config driven LLM training / generation - add configs for gpt, gpt2, llama (training/generation) - remove individual train/generate scripts for each model - update README with simple how-to for experiments block BREAKING CHANGE: all llm_only experiments now run only through run_llm_experiment.py; legacy scripts removed	2025-10-14 11:57:23 +03:00
Sergey Penkovsky	0cc7850848	fix: format code	2025-10-06 23:03:01 +03:00
Sergey Penkovsky	237b86421e	doc: update docstring	2025-10-06 23:02:03 +03:00
Sergey Penkovsky	712278e33c	Рефакторинг: единообразие оформления кода (пробелы, кавычки, пустые строки), без изменения логики по всему проекту.	2025-10-06 22:57:19 +03:00
Sergey Penkovsky	332cad6159	Merge pull request #2 from pese-git/feature/llama Feature/llama	2025-10-06 22:05:45 +03:00
Sergey Penkovsky	2434d34188	docs: научная и практическая документация для всех ключевых модулей LLM - Улучшены и дополнены docstring базовых компонентов (decoder, cached_decoder, multi_head_attention, head_attention, feed_forward, token_embeddings, positional_embeddings, gelu, silu, swi_glu, rope, rms_norm) - На русском языке: объяснены алгоритмы архитектур, приведены формулы и ссылки на статьи - Для всех моделей (GPT, GPT2, LLaMA) добавлены подробные описания классов, методов forward/generate, форматы входа/выхода - Примеры использования в каждом ключевом классе - Описаны научные концепции, архитектурные отличия и причины выбора решений	2025-10-06 21:59:55 +03:00
Sergey Penkovsky	73ee3e16ec	docs: update and enhance documentation for all core components and models - Added detailed documentation for GPT, GPT2 and LLaMA models - Enhanced docstrings in base_model.py, rope.py, rms_norm.py, swi_glu.py - Updated README with architectural differences and usage examples - Added scientific references and mathematical foundations - Improved type hints and parameter descriptions	2025-10-06 20:34:02 +03:00
Sergey Penkovsky	3bc2848cf0	refactor: unify CachedDecoder implementation across models - Completely removed duplicate CachedDecoder from llama.py - Modified core CachedDecoder to support dependency injection: - Added feed_forward_layer parameter (required) - Added norm_layer parameter with LayerNorm default - Added rope parameter for RoPE support - Removed unused activation parameter - Updated GPT2 to use new CachedDecoder with FeedForward - Updated LLaMA to use new CachedDecoder with SwiGLU and RMSNorm - Fixed parameter order in constructor to follow Python syntax rules This eliminates all code duplication while maintaining architectural specificities through dependency injection.	2025-10-06 14:57:29 +03:00
Sergey Penkovsky	d99d605b35	refactor: partial removal of duplicate code by using core modules - Removed duplicate HeadAttention and MultiHeadAttention implementations from llama.py - Now importing MultiHeadAttention from core module - Added RoPE support parameter to core HeadAttention constructor - Kept LLaMA-specific CachedDecoder implementation (uses SwiGLU and RMSNorm) - Core CachedDecoder uses different components (FeedForward and LayerNorm) - Improved code reuse for attention components while maintaining LLaMA-specific decoder This is a partial refactor - attention components are now shared, but decoder remains LLaMA-specific due to different normalization and activation requirements.	2025-10-06 14:26:32 +03:00
Sergey Penkovsky	211adf574c	refactor: extract LLaMA components to separate modules in core directory - Moved GELU, RMSNorm, RoPE, SiLU, and SwiGLU implementations from llama.py to dedicated files in core/ - Updated feed_forward.py to use new modular components - Modified llama.py to import components from core modules instead of local definitions - Improved code organization and reusability of activation functions and normalization layers This refactor enables better code reuse across different model architectures and follows the single responsibility principle.	2025-10-06 14:09:19 +03:00

1 2

66 Commits