Project Agastya (38M Parameter Causal Engine)
Designed, engineered, and trained from scratch by Dinesh, Project Agastya is a custom 38-million-parameter autoregressive language model built in pure PyTorch. It pairs a custom byte-level Byte Pair Encoding (BPE) tokenizer with a multi-layer causal transformer, and it targets fast local streaming inference, low VRAM overhead, and quick deployment in lightweight cloud containers.
Technical Architecture Specifications
Agastya's core training lifecycle does not rely on pre-packaged wrappers such as the Hugging Face transformers classes. Its layers are instantiated explicitly from raw PyTorch modules with the following dimensions:
| Architectural Component | Specification Parameter | Functional Description |
|---|---|---|
| Total Parameters | 38,154,240 (~38M) | Active computational weight matrices |
| Layer Depth (n_layer) | 12 Blocks | Sequential Pre-LN Transformer layers |
| Attention Heads (n_head) | 8 Heads | Parallel contextual subspace windows |
| Embedding Width (n_embd) | 512 Dimensions | Hidden feature vector state width |
| Context Horizon (block_size) | 256 Sub-word Tokens | Total attention span allocation boundary |
| Vocabulary Size (vocab_size) | 2,000 Allocations | Specialized Byte-Level BPE tokenizations |
| Tensor Precision | 32-bit Floating-Point (FP32) | Core calculation resolution tracking |
| Active Memory/VRAM Load | ~184.80 MB | Full network weight footprint in execution |
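
The dimensions in the table can be collected into a small configuration object. The sketch below is illustrative only: the class name `AgastyaConfig` and the helper `weights_only_megabytes` are hypothetical and are not taken from the repository. It derives the per-head width from `n_embd` and `n_head`, and it computes the FP32 size of the weights alone from the reported parameter count.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgastyaConfig:  # hypothetical name, not from the repository
    n_layer: int = 12       # sequential Pre-LN transformer blocks
    n_head: int = 8         # attention heads per block
    n_embd: int = 512       # hidden state width
    block_size: int = 256   # maximum context length in tokens
    vocab_size: int = 2000  # byte-level BPE vocabulary entries

    @property
    def head_dim(self) -> int:
        # Each head attends over an equal slice of the hidden width.
        return self.n_embd // self.n_head


REPORTED_PARAMS = 38_154_240  # parameter count as stated in the table
BYTES_PER_FP32 = 4


def weights_only_megabytes(n_params: int) -> float:
    """Size of the raw FP32 weights in decimal megabytes (10**6 bytes)."""
    return n_params * BYTES_PER_FP32 / 1e6


cfg = AgastyaConfig()
print(cfg.head_dim)
print(round(weights_only_megabytes(REPORTED_PARAMS), 2))
```

Under these assumptions each head is 64 dimensions wide and the raw FP32 weights come to roughly 152.6 MB. The table's ~184.80 MB figure is larger; the source does not break down what else that number includes.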
Mathematical Formulation
Agastya's predictions are driven by causal scaled dot-product attention, combined with a lower-triangular causal mask matrix ($M$) that enforces autoregressive, left-to-right token ordering.
The attention computation follows this equation:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}} + M\right)V$$

Where:
- $Q$, $K$, and $V$ are the Query, Key, and Value matrices, obtained from the input hidden states via independent linear projections.
- $d_k$ is the per-head dimension used for scaling, derived from the embedding width and the head count:

$$d_k = \frac{n_{embd}}{n_{head}} = \frac{512}{8} = 64$$

- $M$ sets every entry outside the current causal horizon to $-\infty$ before the softmax, so future tokens receive zero attention weight: $M_{ij} = 0$ if $j \le i$, and $M_{ij} = -\infty$ otherwise.
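
To make the masking step concrete, here is a toy single-head version of causal scaled dot-product attention written with the Python standard library only. It is a sketch of the formula, not the repository's PyTorch implementation: the function name `causal_attention` is hypothetical, and it works on small nested lists rather than tensors.

```python
import math


def causal_attention(Q, K, V):
    """Single-head causal scaled dot-product attention on nested lists.

    Q and K hold T rows of length d_k; V holds T rows of any fixed width.
    """
    T, d_k = len(Q), len(Q[0])
    width = len(V[0])
    out = []
    for i in range(T):
        # Scaled scores for query i; positions j > i are masked to -inf.
        scores = [
            sum(q * k for q, k in zip(Q[i], K[j])) / math.sqrt(d_k)
            if j <= i
            else -math.inf
            for j in range(T)
        ]
        # Numerically stable softmax; exp(-inf) evaluates to 0.0.
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of value rows.
        out.append(
            [sum(w * V[j][c] for j, w in enumerate(weights)) for c in range(width)]
        )
    return out


Q = K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0]]
result = causal_attention(Q, K, V)
```

In this example the first output row equals the first value row exactly, because position 0 can attend only to itself. Changing later rows of `V` leaves that first output row unchanged, which is the property the causal mask is meant to guarantee.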
Repository Structural Layout
├── dataset/
│   ├── generate_chat_data.py     # Script that generates custom synthetic text pairs
│   ├── input.txt                 # Primary training corpus
│   └── large_input.txt           # Expanded corpus with more advanced contextual data
├── frontend/ (Next.js App)
│   ├── app/
│   │   ├── layout.tsx            # Root layout and viewport mapping
│   │   └── page.tsx              # Next.js Brutalist chat stream dashboard interface
│   ├── package.json              # Client dependency manifest
│   └── tailwind.config.js        # Tailwind styling configuration
├── model/ (Local Artifact Cache)
│   ├── agastya_final_chatbot.pth # Saved PyTorch model weights (binary)
│   └── agastya_tokenizer.json    # Saved custom-trained Byte-Level BPE vocabulary
├── train_tokenizer.py            # Dual-track custom BPE tokenizer training pipeline
├── finetune_agastya.py           # Causal cross-entropy training loop
├── talk_to_agastya.py            # Local interactive terminal for testing
├── main_api.py                   # FastAPI local loopback streaming server
├── register_hf_model.py          # Automated cloud artifact upload and synchronization
└── benchmark_hf_agastya.py       # Live telemetry benchmarking script for the remote cloud model
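
The layout describes `finetune_agastya.py` as a causal cross-entropy training loop. The sketch below shows the general idea at toy scale using only the standard library. It is not the script's actual code: the helper names `next_token_pairs` and `cross_entropy` are hypothetical. Each training window is paired with the same window shifted one token to the right, and the loss is the negative log-probability the model assigns to the true next token.

```python
import math


def next_token_pairs(token_ids, block_size):
    """Slice a token stream into (input, target) windows shifted by one position."""
    pairs = []
    for start in range(0, len(token_ids) - block_size, block_size):
        x = token_ids[start : start + block_size]
        y = token_ids[start + 1 : start + block_size + 1]
        pairs.append((x, y))
    return pairs


def cross_entropy(logits, target):
    """Negative log-likelihood of `target` under softmax(logits)."""
    peak = max(logits)
    # log-sum-exp with the maximum subtracted for numerical stability
    log_z = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_z - logits[target]


pairs = next_token_pairs(list(range(10)), block_size=4)
uniform_loss = cross_entropy([0.0] * 2000, target=5)
```

With a 2,000-entry vocabulary, a model that assigns equal probability to every token has a loss of ln(2000), about 7.60, which is a useful reference point for an untrained network.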