BigPanda

Site Reliability Architect

Illinois | United States

About The Position

BigPanda helps NOC and Operations teams make faster, smarter decisions when (or before) outages occur. Some of the world's largest enterprises, including retailers, airlines, media companies and technology vendors, rely on us to provide a SaaS platform that is highly available and highly scalable, processing thousands of alerts per second. Both the number and the size of our customers are growing fast, as is the complexity of the systems they monitor. Our customers rely on BigPanda as their single source of truth for issues with their digital operations, which makes BigPanda a critical component of their IT infrastructure that has to be up and running 24/7.

At BigPanda, we believe that availability is a critical product feature, and that effective, scalable software and infrastructure are the keys to building and operating our systems. We believe that failure is a fact of life, and is something to be managed thoughtfully and elegantly, not avoided. As Site Reliability Architect, you should be obsessed with availability. You will act as a consultant for engineers and product managers when new products and services are getting ready to launch. You will work directly with our Engineering teams to support our "always available" platform.

We have a DevOps culture, and all reliability work happens in the development teams. While this provides a lot of agility and accountability, it also means you will need to find ways to drive meaningful change without directly managing developers (there is no dedicated SRE team). That said, we have a very strong engineering team that you will be able to leverage with the right strategy in mind. The name of the game is automating everything, because hiring linearly with our traffic growth is unsustainable.

What would make you a good fit?
  • You are capable of strategic thinking, finding the balance between addressing tactical concerns and driving long-term initiatives. You are driven mainly by the ROI of your initiatives on the company's business.
  • You're a clear communicator: you drive change by presenting problems and collaborating with peers and stakeholders to find the most cost-effective solution.
  • You are comfortable identifying technical and process-related shortcomings, and can lay out a vision to fix them. You aren't afraid to institute change by experimentation.
  • You're obsessed with metrics, leading a data-driven culture for measuring operational success or failure.
  • You understand that incidents in production happen, and you use those incidents for learning and improving. You follow through on post-mortems and remediation plans.
  • You have a "can-do" approach and know how to get things done.
  • You acknowledge that we are still a startup and hands-on work is part of the job.



Requirements


  • Previous experience running production-grade software at scale and an appreciation for the complex and emergent behaviors inherent to distributed systems, preferably SaaS.
  • Previous experience running the IT operations of a mission-critical B2B product, preferably SaaS. This includes a technical understanding of solutions that help monitor and maintain the availability of such products.
  • Proven experience in driving strategic changes in the organization, product or architecture to improve the overall availability and performance of the system.
  • Experience with the technical design and deployment of monitoring systems, including time-series databases and alerting systems, that monitor a large-scale environment: hundreds of servers, a dozen microservices, and 10 different clusters of databases and queueing systems. We currently use Graphite/Grafana/Nagios as our main monitoring stack, so experience with any of them is a plus.
  • Experience with managing large-scale containerized infrastructure (Docker and Kubernetes) is a big plus.
  • Experience with APM of Node.js and/or Scala (JVM) applications is a big plus.


