Archive service keeps failing
Incident Report for Intempt
Postmortem

Problem: Executors of the Scala Spark applications fail periodically, and the number of driver pod restarts correlates directly with the number of failed executors.

Reason: The failures are out-of-memory (OOM) errors.

Status: Resolved

Solution: Increase memory for the service

Preventive steps: Monitor the health of the repo
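
As a rough aid for sizing the memory increase, the sketch below estimates how much container memory a Spark pod requests for a given heap size. It assumes the applications run on Spark on Kubernetes (the report mentions driver pods) and that Spark's documented defaults for JVM jobs apply: a memory overhead factor of 0.1 with a 384 MiB minimum. The function name `pod_memory_mib` and the sample heap sizes are hypothetical and not taken from the incident.

```python
# Sketch: estimate the container memory a Spark-on-Kubernetes pod requests.
# Assumption: Spark's documented defaults for JVM jobs, i.e. an overhead
# factor of 0.1 and a minimum overhead of 384 MiB. Actual cluster settings
# may differ.

MIN_OVERHEAD_MIB = 384
DEFAULT_OVERHEAD_FACTOR = 0.1


def pod_memory_mib(heap_mib: int, overhead_factor: float = DEFAULT_OVERHEAD_FACTOR) -> int:
    """Return heap plus non-heap overhead, in MiB."""
    # Overhead is a fraction of the heap, but never below the minimum.
    overhead = max(int(heap_mib * overhead_factor), MIN_OVERHEAD_MIB)
    return heap_mib + overhead


if __name__ == "__main__":
    # Hypothetical heap sizes, starting from a 1G driver heap.
    for heap in (1024, 2048, 4096):
        print(f"heap={heap} MiB -> pod request ~ {pod_memory_mib(heap)} MiB")
```

Under these assumptions, a 1G heap already implies a pod request of about 1408 MiB, so a container limit pinned at exactly 1G would leave no room for non-heap memory.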

Posted Jan 26, 2022 - 01:32 PST

Resolved
Executors and drivers keep restarting a few times a day. The first likely cause is insufficient memory (the driver has a strict memory limit of 1G).
Posted Jan 23, 2022 - 01:30 PST