Service restart count exceeded for Customer Data Platform
Incident Report for Intempt
Postmortem

Problem: Executors of Scala Spark applications fail periodically, and the number of driver pod restarts is directly tied to the number of failed executors.

Reason: Not yet found. The executors fail with out-of-memory (OOM) errors, but the cause of the OOM is not yet clear.

Status: In progress

Solution: Not yet found. We are investigating the root cause of the executor failures; the actions to take will depend on that cause.

Preventive steps: -

Posted Jan 26, 2022 - 01:26 PST

Resolved
Executors of Scala Spark applications are failing periodically, and the number of driver pod restarts is directly tied to the number of failed executors. We need to investigate the root cause.
Posted Jan 15, 2022 - 01:25 PST