Advanced Training Configuration
This guide covers advanced training configuration options and best practices for optimizing your AlexNet model training.
python my_alexnet_cnn.py train [OPTIONS]
--learning-rate
- Type: Float
- Default: 0.001
- Description: Controls the step size during gradient descent
Guidelines:
- Start with 0.001 for most cases
- Use 0.0001 for fine-tuning or if training is unstable
- Use 0.01 for faster convergence (may be less stable)
- Monitor training loss; if it oscillates, reduce learning rate
Example:
python my_alexnet_cnn.py train --learning-rate 0.0001
--max_epochs
- Type: Integer
- Default: 100
- Description: Maximum number of training epochs
Guidelines:
- Start with 50-100 epochs for initial experiments
- Monitor validation accuracy to avoid overfitting
- Use early stopping if validation accuracy plateaus
- More epochs may be needed for larger datasets
Example:
python my_alexnet_cnn.py train --max_epochs 50
--display-step
- Type: Integer
- Default: 10
- Description: Frequency of logging training metrics
Guidelines:
- Use 1 for detailed monitoring (slower)
- Use 10 for balanced logging
- Use 50+ for large datasets to reduce I/O overhead
Example:
python my_alexnet_cnn.py train --display-step 5
--dataset_training
- Type: String
- Default: images_shuffled.pkl
- Description: Path to preprocessed training data
Guidelines:
- Always use shuffled data for better convergence
- Ensure dataset is preprocessed before training
- Use absolute or relative path from project root
Example:
python my_alexnet_cnn.py train --dataset_training custom_dataset.pkl
Train with all defaults:
python my_alexnet_cnn.py train
The same run with every option set explicitly to its default value:
python my_alexnet_cnn.py train \
--learning-rate 0.001 \
--max_epochs 100 \
--display-step 10 \
--dataset_training images_shuffled.pkl
Higher learning rate with fewer epochs and more frequent logging:
python my_alexnet_cnn.py train \
--learning-rate 0.01 \
--max_epochs 20 \
--display-step 2
Lower learning rate with more epochs:
python my_alexnet_cnn.py train \
--learning-rate 0.0001 \
--max_epochs 200 \
--display-step 5
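The options used in the commands above could be declared with argparse roughly as follows. This is a minimal sketch, not the actual parser in my_alexnet_cnn.py: the `build_parser` helper and the `dest` names are hypothetical, while the flag spellings and default values are the ones listed in this guide.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the options documented in this guide."""
    parser = argparse.ArgumentParser(prog="my_alexnet_cnn.py")
    subparsers = parser.add_subparsers(dest="command", required=True)
    train = subparsers.add_parser("train", help="train the model")
    # Flag spellings and defaults follow the guide; dest names are assumptions.
    train.add_argument("--learning-rate", dest="learning_rate", type=float, default=0.001)
    train.add_argument("--max_epochs", type=int, default=100)
    train.add_argument("--display-step", dest="display_step", type=int, default=10)
    train.add_argument("--dataset_training", default="images_shuffled.pkl")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch is self-contained.
args = build_parser().parse_args(["train", "--learning-rate", "0.0001", "--max_epochs", "50"])
print(args.learning_rate, args.max_epochs, args.display_step, args.dataset_training)
```

Options that are not passed fall back to their defaults, so the call above keeps the default display step and dataset path.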
- Initialization
  - Model loads from checkpoint (if exists) or initializes randomly
  - Training dataset is loaded from pickle file
  - Adam optimizer is configured
- Epoch Loop
  - For each epoch, all batches are processed
  - Each batch undergoes forward and backward propagation
  - Weights are updated using Adam optimizer
- Batch Processing
  - Batch size: 64 images (configurable in code)
  - Images and labels are fed to the model
  - Loss is calculated using categorical cross-entropy
  - Gradients are computed and applied
- Checkpointing
  - Model is saved after each epoch to ckpt_dir/model.ckpt
  - Can resume training from checkpoint
- Logging
  - Training metrics logged to console and FileLog.log
  - TensorBoard logs saved to ckpt_dir/
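The batch-processing step described above can be illustrated in NumPy. This is a sketch only, assuming one-hot labels and model outputs that are already probabilities; `iterate_batches` and `categorical_cross_entropy` are hypothetical helpers, not functions from my_alexnet_cnn.py, and the data is random.

```python
import numpy as np


def iterate_batches(images, labels, batch_size=64):
    """Yield consecutive (images, labels) slices; the last batch may be smaller."""
    for start in range(0, len(images), batch_size):
        yield images[start:start + batch_size], labels[start:start + batch_size]


def categorical_cross_entropy(probs, one_hot_labels, eps=1e-12):
    """Mean cross-entropy between predicted probabilities and one-hot labels."""
    probs = np.clip(probs, eps, 1.0)  # avoid log(0)
    return float(-np.mean(np.sum(one_hot_labels * np.log(probs), axis=1)))


# Toy data: 150 fake feature vectors, 4 classes, one-hot labels.
rng = np.random.default_rng(0)
images = rng.normal(size=(150, 8))
labels = np.eye(4)[rng.integers(0, 4, size=150)]

for batch_images, batch_labels in iterate_batches(images, labels):
    # Stand-in for the model: predict every class with equal probability.
    uniform = np.full((len(batch_images), 4), 0.25)
    loss = categorical_cross_entropy(uniform, batch_labels)
    print(len(batch_images), round(loss, 4))
```

A model that predicts all four classes with equal probability scores a loss of ln(4), about 1.386, which is a useful reference point when reading the first logged loss values.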
The training process outputs:
- Batch-level training loss and accuracy
- Epoch-level validation accuracy (if validation set available)
- ROC curve visualization after training completes
- Model checkpoint files
Launch TensorBoard to monitor training in real-time:
tensorboard --logdir=ckpt_dir/
Access at: http://localhost:6006
Available visualizations:
- Loss curves
- Accuracy trends
- Weight distributions
- Gradient histograms
Training logs are written to FileLog.log:
tail -f FileLog.log
Some parameters require code modifications:
Location: my_alexnet_cnn.py:22
BATCH_SIZE = 64
Guidelines:
- Reduce if encountering memory errors (e.g., 32, 16)
- Increase for faster training on powerful GPUs (e.g., 128, 256)
- Smaller batches may improve generalization
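Batch size also determines how many weight updates happen in one epoch, which is part of the trade-off above. A small worked example, assuming a hypothetical dataset of 10,000 images (`steps_per_epoch` is an illustrative helper, not part of the script):

```python
import math


def steps_per_epoch(num_images, batch_size):
    """Number of weight updates in one pass over the data (last batch may be partial)."""
    return math.ceil(num_images / batch_size)


for batch_size in (16, 32, 64, 128, 256):
    print(batch_size, steps_per_epoch(10_000, batch_size))
```

Halving the batch size roughly doubles the number of updates per epoch, while each update uses less memory.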
Location: my_alexnet_cnn.py:28-29
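input_dropout = 0.8   # Keep 80% of input neurons
hidden_dropout = 0.5  # Keep 50% of hidden neurons
Guidelines:
- These values are keep probabilities, so a lower value means more dropout
- To apply more dropout against overfitting, lower the value (e.g., 0.4 for hidden)
- To apply less dropout if the model is underfitting, raise the value (e.g., 0.7 for hidden)
- The input keep probability is typically higher than the hidden one, so inputs are dropped less often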
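Going by the comments in the snippet above, these values are keep probabilities rather than drop rates. The sketch below shows that convention with inverted dropout in NumPy. It is an illustration, not the script's actual implementation, and the `dropout` helper is hypothetical.

```python
import numpy as np


def dropout(activations, keep_prob, rng):
    """Inverted dropout: keep each unit with probability keep_prob, then rescale."""
    if not 0.0 < keep_prob <= 1.0:
        raise ValueError("keep_prob must be in (0, 1]")
    mask = rng.random(activations.shape) < keep_prob
    # Dividing by keep_prob keeps the expected activation unchanged.
    return activations * mask / keep_prob


rng = np.random.default_rng(0)
x = np.ones((1000, 100))
for keep_prob in (0.8, 0.5):
    dropped = dropout(x, keep_prob, rng)
    kept_fraction = np.count_nonzero(dropped) / dropped.size
    print(keep_prob, round(kept_fraction, 2), round(float(dropped.mean()), 2))
```

The printed kept fraction tracks the keep probability, and the mean activation stays close to 1 because of the rescaling.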
Location: my_alexnet_cnn.py:30
std_dev = 0.1
Guidelines:
- Use He initialization for ReLU: math.sqrt(2/n_input)
- Current value (0.1) works well for most cases
- Adjust if encountering gradient issues
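The He formula mentioned above can be compared with the fixed value using a short NumPy sketch. `he_std` and `init_weights` are hypothetical helpers, the layer sizes are made up for illustration, and `n_input` is taken to be the layer's fan-in.

```python
import math

import numpy as np


def he_std(n_input):
    """Standard deviation suggested by He initialization for ReLU layers."""
    return math.sqrt(2.0 / n_input)


def init_weights(n_input, n_output, rng, std_dev=None):
    """Draw a weight matrix from N(0, std_dev^2); default to the He value."""
    if std_dev is None:
        std_dev = he_std(n_input)
    return rng.normal(0.0, std_dev, size=(n_input, n_output))


rng = np.random.default_rng(0)
for n_input in (200, 4096):
    weights = init_weights(n_input, 256, rng)
    print(n_input, round(he_std(n_input), 4), round(float(weights.std()), 4))
```

A fixed std_dev of 0.1 equals the He value only for a fan-in of 200. For wider layers the He value is smaller, which is one thing to check if gradients become unstable.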
Location: Where the optimizer is created (search for "Adam" in the code)
Default epsilon: 0.1 (relatively high)
Guidelines:
- Standard epsilon: 1e-7 or 1e-8
- Higher epsilon (current) may help with numerical stability
- Reduce if optimizer seems too conservative
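To see why a large epsilon makes the optimizer more conservative, here is the textbook Adam update for a single scalar parameter. It is a sketch for intuition only: `adam_step` is a hypothetical helper, and the framework's own optimizer may differ in detail.

```python
import math


def adam_step(grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
    """One Adam update for a single scalar parameter; returns (delta, m, v)."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    delta = -lr * m_hat / (math.sqrt(v_hat) + epsilon)
    return delta, m, v


for epsilon in (1e-8, 0.1):
    for grad in (1.0, 0.01):
        delta, _, _ = adam_step(grad, m=0.0, v=0.0, t=1, epsilon=epsilon)
        print(f"epsilon={epsilon:g} grad={grad:g} step={delta:.6f}")
```

With epsilon at 1e-8 the first step has nearly the same size for both gradients. With epsilon at 0.1 the step for the small gradient shrinks by roughly a factor of ten, because epsilon dominates the denominator.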
Although not implemented by default, you can:
- Train on a larger dataset first
- Save the checkpoint
- Fine-tune on your specific dataset with lower learning rate
Consider implementing:
- Step decay: Reduce learning rate every N epochs
- Exponential decay: Gradually reduce learning rate
- Cosine annealing: Decrease the learning rate along a cosine curve, optionally with periodic warm restarts
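The three schedules listed above can each be written as a function of the epoch number. This is a framework-independent sketch; the function names and the decay constants (`drop`, `epochs_per_drop`, `decay_rate`) are illustrative assumptions, not values used by the script.

```python
import math


def step_decay(base_lr, epoch, drop=0.5, epochs_per_drop=10):
    """Multiply the learning rate by `drop` every `epochs_per_drop` epochs."""
    return base_lr * drop ** (epoch // epochs_per_drop)


def exponential_decay(base_lr, epoch, decay_rate=0.96):
    """Multiply the learning rate by `decay_rate` once per epoch."""
    return base_lr * decay_rate ** epoch


def cosine_annealing(base_lr, epoch, total_epochs, min_lr=0.0):
    """Follow half a cosine wave from base_lr down to min_lr over total_epochs."""
    cosine = 0.5 * (1 + math.cos(math.pi * epoch / total_epochs))
    return min_lr + (base_lr - min_lr) * cosine


for epoch in (0, 10, 50, 100):
    print(
        epoch,
        f"{step_decay(0.001, epoch):.6f}",
        f"{exponential_decay(0.001, epoch):.6f}",
        f"{cosine_annealing(0.001, epoch, total_epochs=100):.6f}",
    )
```

Each function returns the learning rate to use for a given epoch, so the training loop would call one of them at the start of every epoch.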
Data augmentation is currently limited to crop/pad. Consider adding:
- Random flips
- Random rotations
- Color jittering
- Random crops
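Two of the augmentations listed above, random flips and random crops, need only array slicing. The sketch below uses NumPy on a single height x width x channels image. The helper names and the 256 to 224 crop size are assumptions for illustration, not values taken from the script.

```python
import numpy as np


def random_flip(image, rng):
    """Flip an HxWxC image left-right with probability 0.5."""
    return image[:, ::-1, :] if rng.random() < 0.5 else image


def random_crop(image, crop_height, crop_width, rng):
    """Cut a random crop_height x crop_width window out of an HxWxC image."""
    height, width, _ = image.shape
    top = rng.integers(0, height - crop_height + 1)
    left = rng.integers(0, width - crop_width + 1)
    return image[top:top + crop_height, left:left + crop_width, :]


rng = np.random.default_rng(0)
image = rng.random((256, 256, 3))  # random stand-in for a real image
augmented = random_crop(random_flip(image, rng), 224, 224, rng)
print(augmented.shape)
```

Augmentations like these are normally applied to training batches only, so that validation metrics stay comparable between runs.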
Monitor validation accuracy and stop training when:
- Validation accuracy stops improving
- Validation loss starts increasing (overfitting)
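The stopping rule above can be tracked with a small helper that counts epochs without improvement. This is a minimal sketch: the `EarlyStopping` class, the `patience` and `min_delta` parameters, and the accuracy values are all made up for illustration.

```python
class EarlyStopping:
    """Signal a stop when validation accuracy has not improved for `patience` epochs."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("-inf")
        self.bad_epochs = 0

    def should_stop(self, val_accuracy):
        """Record one epoch's validation accuracy and report whether to stop."""
        if val_accuracy > self.best + self.min_delta:
            self.best = val_accuracy
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


stopper = EarlyStopping(patience=3)
history = [0.52, 0.61, 0.66, 0.65, 0.66, 0.64, 0.63]  # invented validation accuracies
for epoch, accuracy in enumerate(history, start=1):
    if stopper.should_stop(accuracy):
        print(f"stopping after epoch {epoch}, best accuracy {stopper.best}")
        break
```

In this run the best accuracy is reached in epoch 3 and the loop stops after epoch 6, once three epochs in a row have failed to beat it. The checkpoint from the best epoch is the one to keep.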
If overfitting:
- Increase dropout rates
- Add L2 regularization to weights
- Use more data augmentation
- Reduce model complexity
If underfitting:
- Decrease dropout rates
- Increase model capacity
- Train for more epochs
- Increase learning rate
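The L2 regularization suggested above amounts to adding a penalty on the squared weights to the data loss. A minimal NumPy sketch follows; the `weight_decay` value, the toy weight matrices, and the helper names are assumptions for illustration.

```python
import numpy as np


def l2_penalty(weight_matrices, weight_decay=5e-4):
    """Sum of squared weights, scaled by weight_decay, to add to the data loss."""
    return weight_decay * sum(float(np.sum(w ** 2)) for w in weight_matrices)


def l2_gradient(weights, weight_decay=5e-4):
    """Gradient of the penalty with respect to one weight matrix: 2 * weight_decay * W."""
    return 2.0 * weight_decay * weights


weights = [np.ones((3, 3)), np.full((2, 2), 2.0)]
data_loss = 1.2  # made-up cross-entropy value for illustration
total_loss = data_loss + l2_penalty(weights)
print(total_loss)
```

The penalty gradient pulls every weight toward zero in proportion to its size. A larger `weight_decay` regularizes more strongly, and too large a value can push the model toward underfitting.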
Symptoms: Training accuracy high, validation accuracy low
Solutions:
- Increase dropout rates
- Add data augmentation
- Use more training data
- Reduce model complexity
Symptoms: Loss decreases very slowly
Solutions:
- Increase learning rate
- Check data preprocessing
- Verify batch normalization is working
- Ensure data is shuffled
Symptoms: Loss jumps up and down
Solutions:
- Decrease learning rate
- Reduce batch size
- Check for data issues (corrupted images, incorrect labels)
Symptoms: Metrics stop improving
Solutions:
- Reduce learning rate (fine-tuning)
- Try different optimizer
- Check if model capacity is sufficient
- Verify data quality
- Always shuffle training data before training
- Start with default parameters for initial experiments
- Monitor both training and validation metrics
- Save checkpoints regularly to avoid losing training progress
- Use TensorBoard for real-time monitoring
- Document your experiments (parameters, results)
- Validate on a separate test set after training
- Use GPU acceleration for faster training
Approximate training time:
- Small dataset (<1000 images): 5-10 minutes per epoch (CPU)
- Medium dataset (1000-10000 images): 1-2 minutes per epoch (GPU)
- Large dataset (>10000 images): 5-10 minutes per epoch (GPU)
Accuracy ranges:
- Random baseline: 25% (4 classes)
- Good performance: 70-80%
- Excellent performance: 85-95%
- Near-perfect score: >95% (may indicate overfitting)
See Advanced Troubleshooting for detailed solutions to common training problems.