DevOps Engineer - Complete Guide¶
Specialized guide for DevOps Engineers at Appgain
Role Overview¶
DevOps Engineers manage our multi-cloud infrastructure, ensure system reliability, and maintain automated deployment pipelines across OVH, AWS, and Azure environments.
Specialized Learning¶
Required Courses¶
Additional Resources¶
Infrastructure Management¶
Monitoring Stack¶
- Prometheus: Metrics collection and storage
- Grafana: Dashboards and visualization
- Alertmanager: Alert routing and management
- Loki: Log aggregation and querying
System Status & Monitoring¶
- Appgain Status Page: https://status.instabackend.io/ - Real-time system status and service health
- Centralized Logging (Loki): http://monitor.instabackend.io:3003/explore?schemaVersion=1&panes=%7B%223u6%22:%7B%22datasource%22:%22belvga9xd44cga%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bservice%3D%5C%22admin-server%5C%22%7D%20%7C%3D%20%60%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22belvga9xd44cga%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%22now-1h%22,%22to%22:%22now%22%7D%7D%7D&orgId=1 - Centralized log aggregation and querying
Deployment Pipeline¶
- GitLab CI/CD: Automated deployment pipelines
- Docker Compose: Container orchestration
- PM2: Process management for Node.js applications
- Nginx: Reverse proxy and load balancing
Server Infrastructure Management¶
Production Servers (OVH Cloud)¶
- Core Services: ovh-appgain-server, ovh-parse-server, ovh-kong, ovh-pushsender
- Development: ovh-devops, ovh-preprod
- Analytics: ovh-growthmachine, ovh-airbyte
- Applications: ovh-shrinkit, ovh-spinsys, ovh-ikhair, ovh-wasl
Database Infrastructure¶
- MongoDB Cluster: ovh-mongo-master, ovh-mongo-slave1, ovh-mongo-slave2
- Specialized DBs: ovh-ikhair-db, ovh-wasl-db
- Cloud Storage: aws-efs, ovh-efs
Cloud Infrastructure¶
- Azure: zafra-apps, zafra-db, ihsan-app, ihsan-db
- ERP Systems: odoo-preprod, odoo-prod
Key Responsibilities¶
1. Infrastructure Management¶
- Server Provisioning: Set up and configure new servers
- Configuration Management: Maintain server configurations
- Resource Optimization: Monitor and optimize resource usage
- Security Hardening: Implement security best practices
2. Deployment & CI/CD¶
- Pipeline Management: Maintain GitLab CI/CD pipelines
- Automated Deployments: Ensure smooth deployment processes
- Rollback Procedures: Quick rollback capabilities
- Environment Management: Dev, staging, production environments
3. Monitoring & Observability¶
- System Monitoring: Prometheus metrics collection
- Log Management: Centralized logging with Loki
- Alert Management: Configure and maintain alerts
- Performance Optimization: Monitor and optimize system performance
4. Security & Compliance¶
- Access Control: Manage SSH keys and user access
- Firewall Management: Configure and maintain firewalls
- SSL/TLS: Certificate management and renewal
- Backup Strategy: Regular backups and disaster recovery
Daily Operations¶
Morning Routine¶
# Check system status
curl http://monitor.instabackend.io:9090/-/healthy
# Review alerts
curl http://monitor.instabackend.io:9093/a../alerts
# Check server resources
for server in ovh-appgain-server ovh-parse-server ovh-kong; do
  ssh "$server" "df -h && free -h"
done
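The resource check above prints raw `df` and `free` output for a person to read. As a minimal sketch, the disk part could be filtered so that only filesystems above a threshold are shown. It assumes the POSIX `df -P` column layout; the function name `flag_full_filesystems` and the 80% default are illustrative and not part of Appgain's tooling.

```shell
# Hypothetical helper: reads `df -P` output on stdin and prints every
# filesystem whose usage is at or above the given percentage.
flag_full_filesystems() {
  local threshold="${1:-80}"   # assumed default; tune to your own alerting policy
  awk -v limit="$threshold" '
    NR > 1 {                     # skip the df header row
      use = $5                   # column 5 is "Capacity", e.g. "91%"
      sub(/%/, "", use)
      if (use + 0 >= limit) printf "%s %s%%\n", $6, use
    }'
}

# Example usage (needs SSH access to the host, so it is left commented out):
#   ssh ovh-kong "df -P" | flag_full_filesystems 80
```

Mount points that contain spaces would be truncated by this sketch, since it reads only the sixth whitespace-separated column.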
Deployment Workflow¶
# Deploy new version
git push origin main
# Monitor deployment (assumes the GitLab CLI, glab, is installed; there is no gitlab-ci binary)
glab ci status --live
# Verify deployment
curl https://api.instabackend.io/health
# Rollback if needed (--no-edit keeps the revert non-interactive)
git revert --no-edit HEAD && git push origin main
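A single `curl` right after a push can run before the new version is up. The verify step above could be wrapped in a retry loop, sketched here under the assumption that the health endpoint returns a non-2xx status while the service is unhealthy. The function name `wait_until_healthy` and the attempt and delay values are hypothetical.

```shell
# Hypothetical helper: retry a check command until it succeeds or attempts run out.
# Usage: wait_until_healthy <attempts> <delay_seconds> <command...>
wait_until_healthy() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    # Pause between attempts, but not after the final one.
    if [ "$i" -lt "$attempts" ]; then sleep "$delay"; fi
    i=$((i + 1))
  done
  echo "still unhealthy after $attempts attempt(s)" >&2
  return 1
}

# Example usage (left commented out because it needs network access):
#   wait_until_healthy 10 6 curl --fail --silent https://api.instabackend.io/health \
#     || echo "health check failed, consider rolling back"
```

Passing the check as a command rather than hard-coding `curl` keeps the loop reusable for other probes, such as a `docker compose ps` check.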
Maintenance Tasks¶
# Update system packages (apt-get is the script-stable interface to apt)
for server in ovh-appgain-server ovh-parse-server; do
  ssh "$server" "sudo apt-get update && sudo apt-get upgrade -y"
done
# Restart services
docker compose restart [service_name]
# Backup databases
bash scripts/daily_backup.sh
# Clean up old logs
find /var/log -name "*.log" -mtime +30 -delete
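Because `find ... -delete` removes files without confirmation, it can help to list the matches first. The sketch below wraps the same `find` expression in a hypothetical `prune_old_logs` function that only prints by default and deletes when `--delete` is passed. It assumes a `find` that supports `-delete` (GNU or BSD), as the command above already does.

```shell
# Hypothetical helper: list, and optionally delete, *.log files older than N days.
# Usage: prune_old_logs <directory> <days> [--delete]
prune_old_logs() {
  local dir="$1" days="$2" mode="${3:-}"
  if [ "$mode" = "--delete" ]; then
    # Print each file as it is removed so the run leaves a record.
    find "$dir" -type f -name '*.log' -mtime +"$days" -print -delete
  else
    # Dry run: show what would be removed, change nothing.
    find "$dir" -type f -name '*.log' -mtime +"$days" -print
  fi
}

# Example usage:
#   prune_old_logs /var/log 30            # review the list first
#   prune_old_logs /var/log 30 --delete   # then delete
```

Adding `-type f` restricts the match to regular files, so a directory whose name happens to end in `.log` is never a deletion candidate.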
Key Operations¶
Server Access¶
# Core Production Services
ovh-appgain-server # Main application server
ovh-parse-server # Parse server and APIs
ovh-devops # DevOps environment
ovh-kong # API Gateway
# Service Management
docker compose restart [service_name]
bash update_image.sh [service] [tag]
pm2 restart all
# Health Monitoring
curl http://monitor.instabackend.io:9090/-/healthy
bash daily_backup.sh
# Database Operations
ovh-mongo-master # MongoDB primary
ovh-mongo-slave1 # MongoDB replica 1
ovh-mongo-slave2 # MongoDB replica 2
# Specialized Services
ovh-growthmachine # Growth analytics and ML
ovh-airbyte # Data pipeline orchestration
ovh-ikhair # Donation payment platform
ovh-wasl # Wasl platform
# Cloud Infrastructure
aws-efs # AWS file storage
zafra-apps # Azure applications
odoo-prod # Odoo production system
Success Metrics¶
System Reliability¶
- Uptime: 99.9% system availability
- Response Time: < 200ms average response time
- Error Rate: < 0.1% error rate
- Deployment Success: > 99% successful deployments
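To make the uptime target concrete, an availability percentage can be converted into a downtime budget. The sketch below does that arithmetic with `awk`; the function name `downtime_budget_minutes` and the 30-day default window are assumptions for illustration, since the guide does not say over which period uptime is measured.

```shell
# Hypothetical helper: allowed downtime in minutes for an availability
# target (in percent) over a window of N days.
# Usage: downtime_budget_minutes <target_percent> [days]
downtime_budget_minutes() {
  local target_percent="$1" days="${2:-30}"
  awk -v p="$target_percent" -v d="$days" \
    'BEGIN { printf "%.1f\n", d * 24 * 60 * (100 - p) / 100 }'
}

# Example usage:
#   downtime_budget_minutes 99.9 30    # prints 43.2
#   downtime_budget_minutes 99.9 365   # prints 525.6
```

So a 99.9% target leaves roughly 43 minutes of downtime in a 30-day month, or just under 9 hours over a year.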
Security & Compliance¶
- Security Patches: 100% critical patches applied within 24 hours
- Backup Success: 100% successful daily backups
- SSL Certificates: 100% valid and up-to-date certificates
- Access Control: Zero unauthorized access incidents
Server Management Best Practices¶
Connection Stability¶
- Autossh: All servers use autossh with ServerAliveInterval 60
- SSH Keys: Ubuntu user with SSH key authentication
- Keep-alive: ServerAliveCountMax 3 for connection stability
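With `ServerAliveInterval 60` and `ServerAliveCountMax 3`, the SSH client sends a probe after 60 seconds of silence and gives up after three unanswered probes, that is, after about 180 seconds. As a sketch, the settings above could be written out as an `ssh_config` stanza per host. The function name `ssh_keepalive_stanza` is hypothetical, and the stanza assumes the host alias already resolves (no `HostName` line is added).

```shell
# Hypothetical helper: print an ssh_config stanza with the keep-alive settings above.
# Usage: ssh_keepalive_stanza <host-alias> [user]
ssh_keepalive_stanza() {
  local host="$1" user="${2:-ubuntu}"   # "ubuntu" matches the SSH user named above
  cat <<EOF
Host $host
    User $user
    ServerAliveInterval 60
    ServerAliveCountMax 3
EOF
}

# Example usage (review the output before appending it to a real config):
#   ssh_keepalive_stanza ovh-kong >> ~/.ssh/config
#   autossh -M 0 ovh-kong   # -M 0 disables autossh's monitor port and relies on the SSH keep-alives
```

OpenSSH uses the first value it finds for each option, so a stanza appended after an existing entry for the same host will not override it.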
Monitoring & Alerting¶
- Prometheus: Metrics collection on all servers
- Grafana: Real-time dashboards and visualization
- Alertmanager: Automated alert routing and escalation
- Status Page: Public status page for transparency
Backup & Recovery¶
- Database Backups: Daily automated backups with point-in-time recovery
- File System: Regular snapshots and backups
- Configuration: Version-controlled configuration management
- Disaster Recovery: Tested recovery procedures
Security Hardening¶
- Firewall Rules: UFW firewall with minimal open ports
- SSL/TLS: Let's Encrypt certificates with auto-renewal
- Access Control: SSH key-based authentication only
- Regular Audits: Security assessments and penetration testing
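A "minimal open ports" policy with UFW usually means denying inbound traffic by default and allowing only named ports. The sketch below prints such a plan without executing anything, so it can be reviewed before it is run. The function name `ufw_plan` and the ports in the usage example are assumptions; the guide does not list which ports each server exposes.

```shell
# Hypothetical helper: print the ufw commands for a default-deny policy that
# opens only the listed TCP ports. Nothing is executed by this function.
# Usage: ufw_plan <port> [<port>...]
ufw_plan() {
  local port
  echo "ufw default deny incoming"
  echo "ufw default allow outgoing"
  for port in "$@"; do
    echo "ufw allow ${port}/tcp"
  done
  # --force skips the interactive confirmation when enabling the firewall.
  echo "ufw --force enable"
}

# Example usage (ports are illustrative):
#   ufw_plan 22 80 443             # review the plan
#   ufw_plan 22 80 443 | sudo sh   # apply it
```

Always include the SSH port in the list before applying the plan over an SSH session, otherwise enabling the firewall can cut off access to the server.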
Project Examples¶
1. Infrastructure Automation¶
- Goal: Automate server provisioning and configuration
- Technology: Terraform + Ansible
- Benefits: Reduced manual work, consistent configurations
- Metrics: Deployment time, configuration drift
2. Monitoring Enhancement¶
- Goal: Comprehensive system monitoring and alerting
- Technology: Prometheus + Grafana + Alertmanager
- Benefits: Proactive issue detection, reduced downtime
- Metrics: Mean time to detection, alert accuracy
3. CI/CD Pipeline Optimization¶
- Goal: Streamlined deployment process
- Technology: GitLab CI/CD + Docker
- Benefits: Faster deployments, reduced errors
- Metrics: Deployment frequency, lead time
Troubleshooting¶
Common Issues¶
- Service Down: Check service status and logs
- High Resource Usage: Monitor CPU, memory, disk usage
- Network Issues: Check connectivity and firewall rules
- Deployment Failures: Review CI/CD pipeline logs
Debug Commands¶
# Check service status
systemctl status [service_name]
docker ps -a
# Monitor resources
htop
df -h
free -h
# Check logs
journalctl -u [service_name] -f
docker logs [container_name]
# Network diagnostics
ping [server_ip]
traceroute [server_ip]
Learning Path¶
Week 1: Foundation¶
- Complete DevOps foundation courses
- Set up development environment
- Understand infrastructure architecture
- Learn basic server management
Week 2: Hands-on¶
- Deploy first service
- Set up monitoring
- Configure CI/CD pipeline
- Learn backup procedures
Week 3: Advanced¶
- Optimize system performance
- Implement security measures
- Automate routine tasks
- Document procedures
Week 4: Production¶
- Manage production systems
- Handle incidents
- Optimize for scale
- Share knowledge with team
Video Resources & Tutorials¶
Infrastructure Setup Videos¶
Email Server Setup¶
WhatsApp Lite Channel¶
Task Queue & Push Sender¶
Loki Centralized Logging¶
CI/CD & Automation Videos¶
Note: Additional CI/CD and automation videos will be added as they become available.
Troubleshooting Videos¶
Note: Additional troubleshooting videos will be added as they become available.
Quick Navigation¶
- System Architecture? → Common Knowledge
- Foundation Knowledge? → Foundation Courses
- Learning Resources? → Learning Resources
- Support? → Support & Contacts
DevOps Engineers are the backbone of our infrastructure, ensuring reliable, scalable, and secure systems that power our entire platform.