- Improved and expanded the docstrings of the core components (decoder, cached_decoder, multi_head_attention, head_attention, feed_forward, token_embeddings, positional_embeddings, gelu, silu, swi_glu, rope, rms_norm)
- Docstrings are in Russian: they explain the architecture algorithms and include formulas and references to the original papers
- Added detailed descriptions of classes, forward/generate methods, and input/output formats for all models (GPT, GPT2, LLaMA)
- Added usage examples to every key class
- Documented the scientific concepts, architectural differences, and the rationale behind design decisions
- Completely removed duplicate CachedDecoder from llama.py
- Modified core CachedDecoder to support dependency injection:
- Added feed_forward_layer parameter (required)
- Added norm_layer parameter with LayerNorm default
- Added rope parameter for RoPE support
- Removed unused activation parameter
- Updated GPT2 to use new CachedDecoder with FeedForward
- Updated LLaMA to use new CachedDecoder with SwiGLU and RMSNorm
- Fixed the constructor parameter order so that parameters with defaults follow the required ones, as Python syntax requires
This eliminates all code duplication while preserving each architecture's specific behavior through dependency injection.
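The dependency-injection refactor above can be illustrated with a minimal, framework-free sketch. The class and parameter names (`CachedDecoder`, `feed_forward_layer`, `norm_layer`, `rope`) follow the changelog, but the bodies are simplified stand-ins, not the repository's real implementation:

```python
class LayerNormStub:
    """Placeholder for the default LayerNorm; the real layer normalizes its input."""
    def __call__(self, x):
        return x

class CachedDecoder:
    def __init__(self, feed_forward_layer, norm_layer=None, rope=None):
        # feed_forward_layer is required: FeedForward for GPT2, SwiGLU for LLaMA
        self.feed_forward = feed_forward_layer
        # norm_layer defaults to LayerNorm; LLaMA injects RMSNorm instead
        self.norm = norm_layer if norm_layer is not None else LayerNormStub()
        # rope is optional: only LLaMA-style models inject a RoPE module
        self.rope = rope

    def forward(self, x):
        # simplified block: normalize, then apply the injected feed-forward
        return self.feed_forward(self.norm(x))

# GPT2-style block: default LayerNorm, no RoPE
gpt2_block = CachedDecoder(feed_forward_layer=lambda x: x)
# LLaMA-style block: SwiGLU-like FFN, RMSNorm-like norm, and a RoPE module injected
llama_block = CachedDecoder(feed_forward_layer=lambda x: x,
                            norm_layer=LayerNormStub(),
                            rope=object())
```

The point of the pattern is that one decoder class serves both architectures: the variation lives entirely in the injected components, not in duplicated decoder code.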
- Removed duplicate HeadAttention and MultiHeadAttention implementations from llama.py
- Now importing MultiHeadAttention from core module
- Added RoPE support parameter to core HeadAttention constructor
- Kept LLaMA-specific CachedDecoder implementation (uses SwiGLU and RMSNorm)
- Core CachedDecoder uses different components (FeedForward and LayerNorm)
- Improved code reuse for attention components while maintaining LLaMA-specific decoder
This is a partial refactor - attention components are now shared, but decoder remains LLaMA-specific due to different normalization and activation requirements.
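The shared-attention idea can be sketched the same way: one attention class serves both models, applying RoPE to queries and keys only when a rope module is injected. The name `HeadAttention` follows the changelog; the bodies below are illustrative stand-ins, not the actual code:

```python
class HeadAttention:
    def __init__(self, rope=None):
        # rope is None for GPT2-style absolute positional embeddings
        self.rope = rope

    def forward(self, q, k):
        # rotate queries/keys only when a RoPE module was injected
        if self.rope is not None:
            q, k = self.rope(q), self.rope(k)
        return q, k

double = lambda t: [2 * v for v in t]    # stand-in "rotation" for illustration
gpt2_attn = HeadAttention()              # GPT2: no RoPE
llama_attn = HeadAttention(rope=double)  # LLaMA: RoPE injected
```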
- Moved GELU, RMSNorm, RoPE, SiLU, and SwiGLU implementations from llama.py to dedicated files in core/
- Updated feed_forward.py to use new modular components
- Modified llama.py to import components from core modules instead of local definitions
- Improved code organization and reusability of activation functions and normalization layers
This refactor enables better code reuse across different model architectures and follows the single responsibility principle.
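Of the components moved into core/, RMSNorm is the simplest to state: the input is scaled by the reciprocal of its root mean square, then by a learned per-element gain. A framework-free sketch of the formula (the real core/rms_norm module operates on tensors; this works on a plain list for illustration):

```python
import math

def rms_norm(x, g=None, eps=1e-6):
    # g is the learned gain vector; default to all-ones for the sketch
    g = g if g is not None else [1.0] * len(x)
    # root mean square of x, with eps for numerical stability
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [gi * v / rms for gi, v in zip(g, x)]
```

Unlike LayerNorm, RMSNorm does not subtract the mean, which is part of why LLaMA-style models pair it with the decoder via injection rather than hard-coding one normalization.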
- Added LLaMA model architecture with RMSNorm and SwiGLU activation
- Implemented Rotary Positional Embeddings (RoPE) for better positional encoding
- Created training script for LLaMA with BPE tokenizer
- Fixed matplotlib dependency version in uv.lock
- Added LLaMA module initialization
The implementation includes:
- TokenEmbeddings, HeadAttention, MultiHeadAttention with RoPE support
- RMSNorm normalization layer
- SwiGLU feed-forward activation
- Cached decoder implementation for efficient generation
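The SiLU and SwiGLU formulas used by the LLaMA feed-forward block can be sketched without a framework. The real code applies these elementwise to tensors with learned projection matrices; here `w1` and `w3` are hypothetical scalar stand-ins for those projections:

```python
import math

def silu(x):
    # SiLU (a.k.a. swish): x * sigmoid(x)
    return x / (1.0 + math.exp(-x))

def swiglu(x, w1=1.0, w3=1.0):
    # gated linear unit variant: SiLU(w1*x) elementwise-times (w3*x);
    # the real block then applies a third down-projection w2
    return silu(w1 * x) * (w3 * x)
```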
- Implement Rotary Positional Embeddings (RoPE) with separate cosine/sine components
- Add vectorized computation of inverse frequencies for RoPE
- Include tensor slicing utilities for even/odd column separation
- Update dependencies in pyproject.toml and uv.lock
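The RoPE details listed above (inverse frequencies, separate cosine/sine components, even/odd column pairing) can be sketched in pure Python; the actual core/rope module computes this in vectorized form over tensors:

```python
import math

def inv_frequencies(dim, base=10000.0):
    # inverse frequency for pair i: 1 / base^(2i/dim)
    return [base ** (-2 * i / dim) for i in range(dim // 2)]

def rope_rotate(x, pos, base=10000.0):
    # rotate each (even, odd) column pair of x by angle pos * inv_freq[i]
    out = list(x)
    for i, f in enumerate(inv_frequencies(len(x), base)):
        theta = pos * f
        c, s = math.cos(theta), math.sin(theta)
        even, odd = x[2 * i], x[2 * i + 1]
        out[2 * i] = even * c - odd * s
        out[2 * i + 1] = even * s + odd * c
    return out
```

Because each pair undergoes a pure rotation, position 0 leaves the vector unchanged and the vector norm is preserved at every position, which is what makes RoPE compatible with dot-product attention.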
- Replace deprecated torch.uint8 and .byte() with torch.bool in GPT.generate
- Add save/load methods to BPETokenizer for proper merges and vocab_list serialization
- Update dependencies in pyproject.toml
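The tokenizer serialization change can be sketched as follows. The attribute names `merges` and `vocab_list` follow the changelog, but the JSON on-disk format shown here is a hypothetical illustration, not the repository's actual format:

```python
import json

class BPETokenizer:
    def __init__(self, merges=None, vocab_list=None):
        self.merges = merges or []          # ordered list of BPE merge pairs
        self.vocab_list = vocab_list or []  # index -> token string

    def save(self, path):
        with open(path, "w", encoding="utf-8") as f:
            json.dump({"merges": self.merges, "vocab_list": self.vocab_list}, f)

    @classmethod
    def load(cls, path):
        with open(path, encoding="utf-8") as f:
            data = json.load(f)
        # JSON stores pairs as lists; restore tuples so merge lookups match again
        merges = [tuple(m) for m in data["merges"]]
        return cls(merges=merges, vocab_list=data["vocab_list"])
```

The tuple-vs-list round-trip is the kind of detail "proper serialization" has to handle: JSON has no tuple type, so merge pairs must be converted back on load or they will no longer match dictionary keys built from tuples.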
- Add LLM library with GPT model implementation
- Add hf-proxy for HuggingFace integration
- Add experiments for training and generation
- Add comprehensive documentation and examples
- Configure uv workspace with proper dependencies