🛠️ DevOps Engineer - Complete Guide

Specialized guide for DevOps Engineers at Appgain

🎯 Role Overview

DevOps Engineers manage our multi-cloud infrastructure, ensure system reliability, and maintain automated deployment pipelines across OVH, AWS, and Azure environments.

📚 Specialized Learning

Required Courses

Additional Resources

🚀 Infrastructure Management

Monitoring Stack

  • Prometheus: Metrics collection and storage
  • Grafana: Dashboards and visualization
  • Alertmanager: Alert routing and management
  • Loki: Log aggregation and querying

System Status & Monitoring

Deployment Pipeline

  • GitLab CI/CD: Automated deployment pipelines
  • Docker Compose: Container orchestration
  • PM2: Process management for Node.js applications
  • Nginx: Reverse proxy and load balancing

🖥️ Server Infrastructure Management

Production Servers (OVH Cloud)

  • Core Services: ovh-appgain-server, ovh-parse-server, ovh-kong, ovh-pushsender
  • Development: ovh-devops, ovh-preprod
  • Analytics: ovh-growthmachine, ovh-airbyte
  • Applications: ovh-shrinkit, ovh-spinsys, ovh-ikhair, ovh-wasl

Database Infrastructure

  • MongoDB Cluster: ovh-mongo-master, ovh-mongo-slave1, ovh-mongo-slave2
  • Specialized DBs: ovh-ikhair-db, ovh-wasl-db
  • Cloud Storage: aws-efs, ovh-efs

Cloud Infrastructure

  • Azure: zafra-apps, zafra-db, ihsan-app, ihsan-db
  • ERP Systems: odoo-preprod, odoo-prod

🔧 Key Responsibilities

1. Infrastructure Management

  • Server Provisioning: Set up and configure new servers
  • Configuration Management: Maintain server configurations
  • Resource Optimization: Monitor and optimize resource usage
  • Security Hardening: Implement security best practices

2. Deployment & CI/CD

  • Pipeline Management: Maintain GitLab CI/CD pipelines
  • Automated Deployments: Ensure smooth deployment processes
  • Rollback Procedures: Quick rollback capabilities
  • Environment Management: Dev, staging, production environments

3. Monitoring & Observability

  • System Monitoring: Prometheus metrics collection
  • Log Management: Centralized logging with Loki
  • Alert Management: Configure and maintain alerts
  • Performance Optimization: Monitor and optimize system performance

4. Security & Compliance

  • Access Control: Manage SSH keys and user access
  • Firewall Management: Configure and maintain firewalls
  • SSL/TLS: Certificate management and renewal
  • Backup Strategy: Regular backups and disaster recovery

📋 Daily Operations

Morning Routine

# Check system status
curl http://monitor.instabackend.io:9090/-/healthy

# Review alerts
curl http://monitor.instabackend.io:9093/a../alerts

# Check server resources
for server in ovh-appgain-server ovh-parse-server ovh-kong; do
  echo "== $server =="
  ssh "$server" "df -h && free -h"
done
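
The resource check above prints raw df output for every host. As a minimal sketch, assuming an 80% usage threshold (an illustrative value, not a documented Appgain policy) and a hypothetical helper named disks_over_threshold, the output of df -P can be reduced to the filesystems that need attention:

```shell
# Print "<mount point> <use%>" for every filesystem at or above a usage threshold.
# Reads `df -P` output on stdin, so it works for a local or a remote host.
# Helper name and default threshold are assumptions made for this sketch.
disks_over_threshold() {
  threshold="${1:-80}"
  awk -v limit="$threshold" 'NR > 1 {
    use = $5
    sub(/%/, "", use)                     # "91%" -> "91"
    if (use + 0 >= limit + 0) print $6, use
  }'
}
```

For example, ssh ovh-kong df -P | disks_over_threshold 80 prints one line per filesystem at or above 80% and nothing when every filesystem is below it. Mount points that contain spaces are not handled by this sketch.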

Deployment Workflow

# Deploy new version
git push origin main

# Monitor deployment (assumes the GitLab CLI, glab, is installed and authenticated)
glab ci status --live

# Verify deployment
curl https://api.instabackend.io/health

# Rollback if needed (--no-edit keeps the revert non-interactive)
git revert --no-edit HEAD && git push origin main
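
A single curl right after a push can fail simply because the new version is still starting. As a hedged sketch, assuming the health endpoint answers with a 2xx status once the service is up, the verification step can be retried before a rollback is considered. The function name wait_for_healthy and the retry defaults are illustrative and not part of the existing tooling:

```shell
# Poll a health endpoint until it returns a 2xx status or attempts run out.
# Returns 0 when healthy, 1 otherwise. Name and defaults are assumptions.
wait_for_healthy() {
  wfh_url="$1"
  wfh_attempts="${2:-10}"   # number of tries
  wfh_delay="${3:-5}"       # seconds to wait between tries
  wfh_i=1
  while [ "$wfh_i" -le "$wfh_attempts" ]; do
    # -f: treat HTTP >= 400 as failure; -sS: quiet but still report errors;
    # --max-time: bound each individual request
    if curl -fsS --max-time 5 -o /dev/null "$wfh_url"; then
      echo "healthy after $wfh_i attempt(s)"
      return 0
    fi
    wfh_i=$((wfh_i + 1))
    sleep "$wfh_delay"
  done
  echo "still unhealthy after $wfh_attempts attempts" >&2
  return 1
}
```

A possible use is wait_for_healthy https://api.instabackend.io/health 10 5, running the rollback command only when it returns non-zero.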

Maintenance Tasks

# Update system packages
for server in ovh-appgain-server ovh-parse-server; do
  ssh "$server" "sudo apt update && sudo apt upgrade -y"
done

# Restart services
docker compose restart [service_name]

# Backup databases
bash scripts/daily_backup.sh

# Clean up log files older than 30 days (regular files only)
sudo find /var/log -type f -name "*.log" -mtime +30 -delete
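
Because -delete cannot be undone, it may be worth listing the matching files before removing them. A minimal dry-run sketch, where old_logs is a hypothetical helper name and the defaults mirror the command above:

```shell
# List *.log files older than N days without deleting anything (dry run).
# Helper name is an assumption; directory and age default to the values above.
old_logs() {
  log_dir="${1:-/var/log}"
  log_days="${2:-30}"
  find "$log_dir" -type f -name '*.log' -mtime +"$log_days" -print
}
```

Running old_logs /var/log 30 shows what the cleanup would remove; once the list looks right, the same find expression can be rerun with -delete.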

🚀 Key Operations

Server Access

# Core Production Services
ovh-appgain-server    # Main application server
ovh-parse-server      # Parse server and APIs
ovh-devops            # DevOps environment
ovh-kong              # API Gateway

# Service Management
docker compose restart [service_name]
bash update_image.sh [service] [tag]
pm2 restart all

# Health Monitoring
curl http://monitor.instabackend.io:9090/-/healthy
bash daily_backup.sh

# Database Operations
ovh-mongo-master     # MongoDB primary
ovh-mongo-slave1     # MongoDB replica 1
ovh-mongo-slave2     # MongoDB replica 2

# Specialized Services
ovh-growthmachine    # Growth analytics and ML
ovh-airbyte          # Data pipeline orchestration
ovh-ikhair           # Donation payment platform
ovh-wasl             # Wasl platform

# Cloud Infrastructure
aws-efs              # AWS file storage
zafra-apps           # Azure applications
odoo-prod            # Odoo production system

📊 Success Metrics

System Reliability

  • Uptime: 99.9% availability
  • Response Time: < 200 ms average
  • Error Rate: < 0.1%
  • Deployment Success: > 99% of deployments
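
To put the uptime target in concrete terms, the allowed downtime can be derived from the availability percentage. A small worked calculation, with downtime_budget_minutes as an illustrative helper name:

```shell
# Allowed downtime in minutes for an availability target over a window of days.
# budget = (100 - availability%) / 100 * days * 24 h * 60 min
downtime_budget_minutes() {
  awk -v pct="$1" -v days="${2:-30}" \
    'BEGIN { printf "%.1f\n", (100 - pct) / 100 * days * 24 * 60 }'
}

downtime_budget_minutes 99.9 30    # prints 43.2  (30-day month)
downtime_budget_minutes 99.9 365   # prints 525.6 (full year)
```

In other words, a 99.9% target leaves roughly 43 minutes of downtime per 30-day month.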

Security & Compliance

  • Security Patches: 100% of critical patches applied within 24 hours
  • Backup Success: 100% successful daily backups
  • SSL Certificates: 100% valid and up-to-date certificates
  • Access Control: Zero unauthorized access incidents

🔧 Server Management Best Practices

Connection Stability

  • Autossh: All servers use autossh with ServerAliveInterval 60
  • SSH Keys: Ubuntu user with SSH key authentication
  • Keep-alive: ServerAliveCountMax 3 for connection stability
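
With these values the SSH client sends a keep-alive probe every 60 seconds and gives up after 3 unanswered probes, so a dead connection is detected after roughly 180 seconds. The sketch below generates a matching ssh_config stanza from shell. The login name ubuntu is an assumption based on the "Ubuntu user" bullet, and ssh_keepalive_stanza is a hypothetical helper:

```shell
# Print an ssh_config stanza carrying the keep-alive settings listed above.
# "User ubuntu" is an assumed login name; adjust it to the real account.
ssh_keepalive_stanza() {
  host_alias="$1"
  cat <<EOF
Host $host_alias
    User ubuntu
    ServerAliveInterval 60
    ServerAliveCountMax 3
EOF
}
```

For example, ssh_keepalive_stanza ovh-kong >> ~/.ssh/config appends the stanza for one host. When autossh is used, autossh -M 0 ovh-kong turns off autossh's own monitoring port and relies on these ServerAlive options to detect a dropped connection.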

Monitoring & Alerting

  • Prometheus: Metrics collection on all servers
  • Grafana: Real-time dashboards and visualization
  • Alertmanager: Automated alert routing and escalation
  • Status Page: Public status page for transparency

Backup & Recovery

  • Database Backups: Daily automated backups with point-in-time recovery
  • File System: Regular snapshots and backups
  • Configuration: Version-controlled configuration management
  • Disaster Recovery: Tested recovery procedures
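
A daily backup job is only useful if someone notices when it stops producing files. As a minimal sketch, assuming backups land as regular files in a single directory (the location is not stated in this guide, so any path used here is hypothetical), freshness can be checked like this:

```shell
# Succeed only if the directory holds at least one regular file modified
# within the last N hours. Helper name and 24-hour default are assumptions.
# Uses find -mmin, which is available in GNU, BSD, and BusyBox find.
backup_is_fresh() {
  backup_dir="$1"
  max_age_hours="${2:-24}"
  recent=$(find "$backup_dir" -type f -mmin -"$((max_age_hours * 60))" -print | head -n 1)
  [ -n "$recent" ]
}
```

A check such as backup_is_fresh /path/to/backups 24 || echo "no backup in the last 24 hours" >&2 could run from cron after the backup window. It confirms only that a recent file exists, not that the backup can be restored, so it complements rather than replaces tested recovery procedures.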

Security Hardening

  • Firewall Rules: UFW firewall with minimal open ports
  • SSL/TLS: Let's Encrypt certificates with auto-renewal
  • Access Control: SSH key-based authentication only
  • Regular Audits: Security assessments and penetration testing
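
Auto-renewal can fail silently, so it may help to check how many days a served certificate has left. The sketch below assumes GNU date (for the -d option) and the openssl command-line tool. Both function names are illustrative:

```shell
# Whole days between a reference time and a certificate "notAfter" date.
# $1: date string as printed by openssl, e.g. "Jan 31 00:00:00 2030 GMT"
# $2: optional reference time as a Unix epoch (defaults to now)
days_until() {
  end_epoch=$(date -u -d "$1" +%s) || return 1    # GNU date syntax
  now_epoch="${2:-$(date -u +%s)}"
  echo $(( (end_epoch - now_epoch) / 86400 ))
}

# Fetch the notAfter date of the certificate served on <host>:443.
cert_end_date() {
  echo | openssl s_client -connect "$1:443" -servername "$1" 2>/dev/null \
    | openssl x509 -noout -enddate | cut -d= -f2
}
```

For example, days_until "$(cert_end_date api.instabackend.io)" prints the number of days remaining. Any alerting threshold built on top of it would be a team decision rather than something this guide defines.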

🎯 Project Examples

1. Infrastructure Automation

  • Goal: Automate server provisioning and configuration
  • Technology: Terraform + Ansible
  • Benefits: Reduced manual work, consistent configurations
  • Metrics: Deployment time, configuration drift

2. Monitoring Enhancement

  • Goal: Comprehensive system monitoring and alerting
  • Technology: Prometheus + Grafana + Alertmanager
  • Benefits: Proactive issue detection, reduced downtime
  • Metrics: Mean time to detection, alert accuracy

3. CI/CD Pipeline Optimization

  • Goal: Streamlined deployment process
  • Technology: GitLab CI/CD + Docker
  • Benefits: Faster deployments, reduced errors
  • Metrics: Deployment frequency, lead time

🔧 Troubleshooting

Common Issues

  • Service Down: Check service status and logs
  • High Resource Usage: Monitor CPU, memory, disk usage
  • Network Issues: Check connectivity and firewall rules
  • Deployment Failures: Review CI/CD pipeline logs

Debug Commands

# Check service status
systemctl status [service_name]
docker ps -a

# Monitor resources
htop
df -h
free -h

# Check logs
journalctl -u [service_name] -f
docker logs [container_name]

# Network diagnostics
ping [server_ip]
traceroute [server_ip]

📚 Learning Path

Week 1: Foundation

  • Complete DevOps foundation courses
  • Set up development environment
  • Understand infrastructure architecture
  • Learn basic server management

Week 2: Hands-on

  • Deploy first service
  • Set up monitoring
  • Configure CI/CD pipeline
  • Learn backup procedures

Week 3: Advanced

  • Optimize system performance
  • Implement security measures
  • Automate routine tasks
  • Document procedures

Week 4: Production

  • Manage production systems
  • Handle incidents
  • Optimize for scale
  • Share knowledge with team

📹 Video Resources & Tutorials

Infrastructure Setup Videos

Email Server Setup

WhatsApp Lite Channel

Task Queue & Push Sender

Loki Centralized Logging

CI/CD & Automation Videos

Note: Additional CI/CD and automation videos will be added as they become available.

Troubleshooting Videos

Note: Additional troubleshooting videos will be added as they become available.


🔧 DevOps Engineers are the backbone of our infrastructure, ensuring reliable, scalable, and secure systems that power our entire platform.
