🛠️ DevOps Engineer - Complete Guide¶

Specialized guide for DevOps Engineers at Appgain

🎯 Role Overview¶

DevOps Engineers manage our multi-cloud infrastructure, ensure system reliability, and maintain automated deployment pipelines across OVH, AWS, and Azure environments.

📚 Specialized Learning¶

Required Courses¶

Additional Resources¶

🚀 Infrastructure Management¶

Monitoring Stack¶

Prometheus: Metrics collection and storage
Grafana: Dashboards and visualization
Alertmanager: Alert routing and management
Loki: Log aggregation and querying

System Status & Monitoring¶

🌐 Appgain Status Page: https://status.instabackend.io/ - Real-time system status and service health
📊 Centralized Logging (Loki): http://monitor.instabackend.io:3003/explore?schemaVersion=1&panes=%7B%223u6%22:%7B%22datasource%22:%22belvga9xd44cga%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bservice%3D%5C%22admin-server%5C%22%7D%20%7C%3D%20%60%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22belvga9xd44cga%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%22now-1h%22,%22to%22:%22now%22%7D%7D%7D&orgId=1 - Centralized log aggregation and querying
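
The Explore link above encodes the LogQL query `{service="admin-server"}`. As a rough sketch, the same kind of query can be run from a terminal through Loki's HTTP API. The `LOKI_URL` variable, its default port 3100, and the function names are assumptions for illustration; this guide does not say where the Loki API itself is exposed.

```shell
#!/usr/bin/env bash
# Build a LogQL stream selector for a service label,
# e.g. {service="admin-server"}.
logql_selector() {
  printf '{service="%s"}' "$1"
}

# Fetch recent log lines for a service via Loki's query_range endpoint.
# When start/end are omitted, Loki falls back to its default range (the last hour).
# LOKI_URL is an assumption: point it at wherever the Loki API is reachable.
loki_recent_logs() {
  local service="$1"
  local base="${LOKI_URL:-http://localhost:3100}"
  curl -sG "${base}/loki/api/v1/query_range" \
    --data-urlencode "query=$(logql_selector "$service")" \
    --data-urlencode "limit=50"
}
```

For example, `loki_recent_logs admin-server` would return a JSON document with up to 50 log lines for that service.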

Deployment Pipeline¶

GitLab CI/CD: Automated deployment pipelines
Docker Compose: Container orchestration
PM2: Process management for Node.js applications
Nginx: Reverse proxy and load balancing

🖥️ Server Infrastructure Management¶

Production Servers (OVH Cloud)¶

Core Services: ovh-appgain-server, ovh-parse-server, ovh-kong, ovh-pushsender
Development: ovh-devops, ovh-preprod
Analytics: ovh-growthmachine, ovh-airbyte
Applications: ovh-shrinkit, ovh-spinsys, ovh-ikhair, ovh-wasl

Database Infrastructure¶

MongoDB Cluster: ovh-mongo-master, ovh-mongo-slave1, ovh-mongo-slave2
Specialized DBs: ovh-ikhair-db, ovh-wasl-db
Cloud Storage: aws-efs, ovh-efs

Cloud Infrastructure¶

Azure: zafra-apps, zafra-db, ihsan-app, ihsan-db
ERP Systems: odoo-preprod, odoo-prod

🔧 Key Responsibilities¶

1. Infrastructure Management¶

Server Provisioning: Set up and configure new servers
Configuration Management: Maintain server configurations
Resource Optimization: Monitor and optimize resource usage
Security Hardening: Implement security best practices

2. Deployment & CI/CD¶

Pipeline Management: Maintain GitLab CI/CD pipelines
Automated Deployments: Ensure smooth deployment processes
Rollback Procedures: Quick rollback capabilities
Environment Management: Dev, staging, production environments

3. Monitoring & Observability¶

System Monitoring: Prometheus metrics collection
Log Management: Centralized logging with Loki
Alert Management: Configure and maintain alerts
Performance Optimization: Monitor and optimize system performance

4. Security & Compliance¶

Access Control: Manage SSH keys and user access
Firewall Management: Configure and maintain firewalls
SSL/TLS: Certificate management and renewal
Backup Strategy: Regular backups and disaster recovery

📋 Daily Operations¶

Morning Routine¶

# Check system status
curl http://monitor.instabackend.io:9090/-/healthy

# Review alerts
curl http://monitor.instabackend.io:9093/a../alerts

# Check server resources
for server in ovh-appgain-server ovh-parse-server ovh-kong; do
  ssh "$server" "df -h && free -h"
done
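
Reading raw `df -h` output for several servers is easy to skim past. A minimal sketch of a filter that only reports file systems above a usage threshold is shown here; the function name and the 85% threshold are illustrative choices, not values from this guide.

```shell
#!/usr/bin/env bash
# Print "mount usage%" for every file system at or above a usage threshold.
# Expects `df -P` output on stdin (POSIX format: one line per file system),
# so it works for local and remote hosts alike. Mount points containing
# spaces are not handled by this sketch.
disks_over_threshold() {
  awk -v limit="$1" '
    NR > 1 {
      use = $5
      sub(/%/, "", use)
      if (use + 0 >= limit + 0) print $6, use "%"
    }'
}

# Usage (hypothetical 85% threshold):
#   ssh "$server" "df -P" | disks_over_threshold 85
```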

Deployment Workflow¶

# Deploy new version
git push origin main

# Monitor deployment (watch the pipeline page in GitLab, or use the glab CLI if it is installed)
glab ci status

# Verify deployment
curl https://api.instabackend.io/health

# Rollback if needed
git revert HEAD && git push origin main
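
A single `curl` right after a push can fail simply because the service is still restarting. One way to make the verification step more robust is to retry it a few times before deciding to roll back. This is a sketch only; the attempt count and delay are assumptions to tune for the actual pipeline.

```shell
#!/usr/bin/env bash
# Run a command until it succeeds, up to a maximum number of attempts,
# sleeping between attempts. Returns 0 on the first success, 1 otherwise.
retry() {
  local attempts="$1" delay="$2"
  shift 2
  local i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    i=$((i + 1))
    [ "$i" -le "$attempts" ] && sleep "$delay"
  done
  return 1
}

# Verify the deployment. -f makes curl exit non-zero on HTTP 4xx/5xx responses.
# 10 attempts, 6 seconds apart, are illustrative values.
verify_deploy() {
  retry 10 6 curl -fsS -o /dev/null https://api.instabackend.io/health
}
```

With this in place, `verify_deploy || echo "health check failed, consider rollback"` gives a clear signal after about a minute at most.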

Maintenance Tasks¶

# Update system packages
for server in ovh-appgain-server ovh-parse-server; do
  ssh "$server" "sudo apt update && sudo apt upgrade -y"
done

# Restart services
docker compose restart [service_name]

# Backup databases
bash scripts/daily_backup.sh

# Clean up old logs
sudo find /var/log -type f -name "*.log" -mtime +30 -delete
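
Because `-delete` is irreversible, it can help to preview what the cleanup would remove first. A small sketch of a dry-run helper follows; the function name is hypothetical.

```shell
#!/usr/bin/env bash
# List (without deleting) regular *.log files older than a number of days.
list_old_logs() {
  local dir="$1" days="$2"
  find "$dir" -type f -name '*.log' -mtime +"$days" -print
}

# Usage:
#   list_old_logs /var/log 30
```

Once the list looks right, the same `find` expression can be run with `-delete` in place of `-print`.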

🚀 Key Operations¶

Server Access¶

# Core Production Services (SSH host aliases; connect with: ssh <alias>)
ovh-appgain-server    # Main application server
ovh-parse-server      # Parse server and APIs
ovh-devops           # DevOps environment
ovh-kong             # API Gateway

# Service Management
docker compose restart [service_name]
bash update_image.sh [service] [tag]
pm2 restart all

# Health Monitoring
curl http://monitor.instabackend.io:9090/-/healthy
bash daily_backup.sh

# Database Operations
ovh-mongo-master     # MongoDB primary
ovh-mongo-slave1     # MongoDB replica 1
ovh-mongo-slave2     # MongoDB replica 2

# Specialized Services
ovh-growthmachine    # Growth analytics and ML
ovh-airbyte          # Data pipeline orchestration
ovh-ikhair           # Donation payment platform
ovh-wasl             # Wasl platform

# Cloud Infrastructure
aws-efs              # AWS file storage
zafra-apps           # Azure applications
odoo-prod            # Odoo production system

📊 Success Metrics¶

System Reliability¶

Uptime: 99.9% system availability
Response Time: < 200 ms on average
Error Rate: < 0.1%
Deployment Success: > 99% successful deployments
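
To make the uptime target concrete, it helps to translate it into a downtime budget. The following sketch does the arithmetic for any availability target and window; the function name is illustrative.

```shell
#!/usr/bin/env bash
# Allowed downtime in minutes for an availability target (percent)
# over a window of the given number of days.
downtime_budget_minutes() {
  awk -v a="$1" -v d="$2" 'BEGIN { printf "%.1f\n", (1 - a / 100) * d * 24 * 60 }'
}

downtime_budget_minutes 99.9 30   # prints 43.2
```

In other words, a 99.9% target leaves roughly 43 minutes of downtime in a 30-day month.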

Security & Compliance¶

Security Patches: 100% critical patches applied within 24 hours
Backup Success: 100% successful daily backups
SSL Certificates: 100% valid and up-to-date certificates
Access Control: Zero unauthorized access incidents

🔧 Server Management Best Practices¶

Connection Stability¶

Autossh: All servers use autossh with ServerAliveInterval 60
SSH Keys: Ubuntu user with SSH key authentication
Keep-alive: ServerAliveCountMax 3 for connection stability
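
Taken together, these two settings mean ssh gives up on an unresponsive server after roughly 60 × 3 = 180 seconds, at which point autossh can restart the session. A minimal sketch of how such a session might be opened is shown here; the function names and the host argument are placeholders, and the exact autossh invocation used on the servers is not documented in this guide.

```shell
#!/usr/bin/env bash
# Seconds of server silence before ssh disconnects:
# keep-alive interval multiplied by the maximum number of missed replies.
keepalive_timeout() {
  echo $(( $1 * $2 ))
}

# Open a self-restarting session as the ubuntu user.
# -M 0 turns off autossh's own monitoring port so that it relies on
# the ssh keep-alive options below. "$1" is a host name or SSH alias.
open_session() {
  autossh -M 0 \
    -o "ServerAliveInterval 60" \
    -o "ServerAliveCountMax 3" \
    "ubuntu@$1"
}
```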

Monitoring & Alerting¶

Prometheus: Metrics collection on all servers
Grafana: Real-time dashboards and visualization
Alertmanager: Automated alert routing and escalation
Status Page: Public status page for transparency

Backup & Recovery¶

Database Backups: Daily automated backups with point-in-time recovery
File System: Regular snapshots and backups
Configuration: Version-controlled configuration management
Disaster Recovery: Tested recovery procedures
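
A daily backup job that silently stops is a common failure mode, so a freshness check is a useful complement to the backup itself. The sketch below assumes the backup lands in a single file; the path and the 26-hour window in the usage line are hypothetical, since this guide does not say where backups are stored.

```shell
#!/usr/bin/env bash
# Succeed only if the file exists, is non-empty, and was modified
# within the last max_hours hours. Relies on find's -mmin test
# (available in GNU and BSD find).
backup_is_fresh() {
  local file="$1" max_hours="$2"
  [ -s "$file" ] || return 1
  [ -n "$(find "$file" -mmin -"$((max_hours * 60))" -print)" ]
}

# Usage (hypothetical path, 26 h leaves slack around a daily schedule):
#   backup_is_fresh /backups/mongo-latest.gz 26 || echo "backup is stale or missing"
```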

Security Hardening¶

Firewall Rules: UFW firewall with minimal open ports
SSL/TLS: Let's Encrypt certificates with auto-renewal
Access Control: SSH key-based authentication only
Regular Audits: Security assessments and penetration testing
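
Auto-renewal can fail quietly, so it is worth checking how many days a served certificate has left. This is a sketch that assumes `openssl` and GNU `date` are installed; the function names are illustrative.

```shell
#!/usr/bin/env bash
# Whole days between two Unix timestamps (expiry minus now),
# truncated toward zero by shell integer division.
days_until() {
  echo $(( ($1 - $2) / 86400 ))
}

# Print the number of days left on the certificate served by host:443.
# Uses `date -d`, which is specific to GNU date.
cert_days_left() {
  local host="$1" end_date
  end_date=$(openssl s_client -connect "${host}:443" -servername "$host" </dev/null 2>/dev/null \
    | openssl x509 -noout -enddate | cut -d= -f2)
  days_until "$(date -d "$end_date" +%s)" "$(date +%s)"
}

# Usage:
#   cert_days_left api.instabackend.io
```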

🎯 Project Examples¶

1. Infrastructure Automation¶

Goal: Automate server provisioning and configuration
Technology: Terraform + Ansible
Benefits: Reduced manual work, consistent configurations
Metrics: Deployment time, configuration drift

2. Monitoring Enhancement¶

Goal: Comprehensive system monitoring and alerting
Technology: Prometheus + Grafana + Alertmanager
Benefits: Proactive issue detection, reduced downtime
Metrics: Mean time to detection, alert accuracy

3. CI/CD Pipeline Optimization¶

Goal: Streamlined deployment process
Technology: GitLab CI/CD + Docker
Benefits: Faster deployments, reduced errors
Metrics: Deployment frequency, lead time

🔧 Troubleshooting¶

Common Issues¶

Service Down: Check service status and logs
High Resource Usage: Monitor CPU, memory, disk usage
Network Issues: Check connectivity and firewall rules
Deployment Failures: Review CI/CD pipeline logs

Debug Commands¶

# Check service status
systemctl status [service_name]
docker ps -a

# Monitor resources
htop
df -h
free -h

# Check logs
journalctl -u [service_name] -f
docker logs [container_name]

# Network diagnostics
ping [server_ip]
traceroute [server_ip]
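
When many containers run on one host, scanning `docker ps -a` by eye is slow. A small sketch of a filter that lists only the containers that are not running follows; the function name and the `|` separator are illustrative choices.

```shell
#!/usr/bin/env bash
# Read "name|status" lines on stdin and print the names of containers
# whose status does not start with "Up".
containers_not_up() {
  awk -F '|' '$2 !~ /^Up/ { print $1 }'
}

# Usage:
#   docker ps -a --format '{{.Names}}|{{.Status}}' | containers_not_up
```

Each name it prints is a candidate for `docker logs [container_name]`.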

📚 Learning Path¶

Week 1: Foundation¶

Complete DevOps foundation courses
Set up development environment
Understand infrastructure architecture
Learn basic server management

Week 2: Hands-on¶

Deploy first service
Set up monitoring
Configure CI/CD pipeline
Learn backup procedures

Week 3: Advanced¶

Optimize system performance
Implement security measures
Automate routine tasks
Document procedures

Week 4: Production¶

Manage production systems
Handle incidents
Optimize for scale
Share knowledge with team

📹 Video Resources & Tutorials¶

Infrastructure Setup Videos¶

Email Server Setup¶

WhatsApp Lite Channel¶

Task Queue & Push Sender¶

Loki Centralized Logging¶

CI/CD & Automation Videos¶

Note: Additional CI/CD and automation videos will be added as they become available.

Troubleshooting Videos¶

Note: Additional troubleshooting videos will be added as they become available.


🔧 DevOps Engineers are the backbone of our infrastructure, ensuring reliable, scalable, and secure systems that power our entire platform.
