
Chegg reduces MTTR by 87%

Business Challenge

To get the most out of their education, students need reliable, affordable access to learning tools and reference materials. That’s exactly what Chegg delivers—100% digitally.

Chegg has built a comprehensive online learning platform that provides students with on-demand access to a wide range of digital study aids, math and writing helpers, and subject-specific tutoring, as well as low-cost textbook rentals and access to internship opportunities. Running entirely on Amazon Web Services (AWS), the end-to-end platform supports students throughout their educational journeys. As such, a top priority for Chegg is ensuring that its study tools and services are available whenever students need them.

Chegg Vice President of Engineering Services Steve Evans notes, “In serving college students, we have a very tech-savvy group of customers with high expectations of their digital experience. For example, Sunday nights are the biggest nights during the school year for students using our Chegg Study product. If that study aid is not available, it creates distrust with our students and raises questions as to whether they can depend on us to help them in their college experience. The need for us to be resilient and available for students is super-critical.”

Chegg runs hundreds of hosts in AWS, with about 80% of the compute workload containerized with Docker via Amazon Elastic Container Service (ECS). In total, the company has more than 500 services in production, all instrumented with New Relic. Because Chegg must ensure that its platform is a consistently reliable resource for students, the drive to assure resilience in its 24x7 application environment is pervasive throughout the DevOps lifecycle. One challenge, however, is the size and complexity of the environment, precipitated by the self-service tools for launching Docker microservices. Chegg currently runs about 500 microservices and is constantly adding more.

“New Relic APM is critical for us because of the dynamic nature of our tech stack. It gives us one place to go to understand the state of our applications. Every engineer here relies on New Relic, day in and day out. About 80% of our alerts are sourced from New Relic. It's the backbone of our monitoring capabilities.”

Evans continues, “The thing I really love about New Relic APM is that it takes a bunch of raw data about an application, and it turns it into insights. Out of the box, I can go into the New Relic APM console and see that this application is red, that application is green or yellow. Then I can drill down and look at why this application is red or why the error rate is so high. I can start solving the issue.”

Providing end-to-end visibility across multiple teams

When incidents do arise, Evans and his team track how long it takes to detect the issue as well as the time required to determine the cause and resolve the problem. In the past, one of the challenges to shrinking this time was directly related to Chegg’s distributed organization. For example, an outage that occurred in January 2018 involved a frontend page issuing too many API calls to a backend system, which in turn brought down a database.

“It’s very rare that we have an incident contained within a single team at Chegg,” says Evans. “Having New Relic means a frontend engineer can start troubleshooting an incident and slide all the way through to the data layer. It’s that whole end-to-end visibility that is key to reducing the time it takes to detect and resolve incidents.”

In fact, following the deployment of New Relic, Evans saw MTTR shrink in 2018 from 197 minutes to 33 minutes. In 2019, MTTR moved even lower to 24 minutes. To further streamline the troubleshooting process, Evans is now rolling out New Relic logs to gain a consolidated view of log messages in context with event and trace data. This will eliminate the added time and effort of moving among multiple consoles to assemble a complete picture of an incident.
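The headline figure can be checked against these numbers. The sketch below computes the percentage reduction relative to the 197-minute baseline. The function and variable names are illustrative only and do not come from Chegg or New Relic.

```python
def percent_reduction(before: float, after: float) -> float:
    """Return the drop from `before` to `after` as a percentage of `before`."""
    if before <= 0:
        raise ValueError("baseline must be positive")
    return (before - after) / before * 100


# MTTR figures quoted in the text, in minutes.
baseline_mttr = 197
mttr_2018 = 33
mttr_2019 = 24

reduction_2018 = percent_reduction(baseline_mttr, mttr_2018)
reduction_2019 = percent_reduction(baseline_mttr, mttr_2019)

print(f"2018: {reduction_2018:.1f}% below baseline")
print(f"2019: {reduction_2019:.1f}% below baseline")
```

The 2019 value works out to just under 88%, which is consistent with the 87% reduction cited in the title.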

“With New Relic logs, we’ll be able to have a frontend engineer look at a backend API provided by the commerce or identity team, and immediately transition to see the specific log messages for an application, even if that application is owned by another team,” says Evans. “That significantly lowers the bar for how much an individual engineer needs to know about the broader ecosystem to be successful at understanding a situation. Being able to fly back and forth between APM and logs is going to be really powerful for us.”

Extending New Relic with programmability

Chegg continues to explore the broad range of capabilities New Relic brings to its engineering organization, including programmability. The team is kick-starting the development of custom applications, building on open source applications from New Relic such as Groundskeeper, which helps keep services up to date. Evans says, “We went through a big agent update exercise in 2018 to get accurate licensing data around containerized apps, and it was not fun. Groundskeeper is a huge win for us.”

The team is also looking at using the automation and orchestration capabilities of the New Relic programmability feature for workflow replication. Evans explains, “Say one team creates some really cool dashboards around a particular application, and later on, a different team runs into the same kinds of problems for another application. We want to easily iterate on those dashboards and programmatically replicate them to any other team.”

Instrumenting AWS Lambda

As Chegg continues to evolve its AWS environment, the company is increasingly moving toward a serverless computing environment using AWS Lambda. Evans is working with New Relic to instrument and monitor that environment.

“We're pushing towards serverless as much as we can,” says Evans. “Our New Relic and AWS account teams recently partnered to do an immersion day on-site with us around AWS Lambda development and observability. This was very valuable in helping us truly understand how we need to approach AWS Lambda instrumentation, best practices, and how to get that same APM experience we’ve been enjoying in containers, but on AWS Lambda.”

Evans adds, “With the support we've gotten from New Relic in this journey, we really feel like we have partners standing alongside us, listening to our feedback, and helping us understand how we can do better.”


Ensuring quality digital customer experience

At the end of the day, it is all about ensuring a positive digital experience for Chegg’s college student customers. Using New Relic browser and mobile, the engineering team can track that digital customer experience and monitor things such as page-load performance and the overall experience interacting with Chegg online and in mobile apps.

For example, Chegg uses browser to track a standardized set of key performance indicators (KPIs), capturing how Chegg’s React application responds to user interaction, from JavaScript and DOM events to network activity. The KPIs include Time to First Byte (TTFB) with a target of 1 second, Time to DOM Interactive (TTDI) at 2 seconds, and Page Load Time and Real User Monitoring (RUM) at 4 seconds. The Chegg team also collaborated with New Relic to build a set of dashboards from browser to visualize how the company is performing against those KPIs.
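To make the targets concrete, here is a minimal sketch of checking measured timings against the three thresholds quoted above. It is not Chegg's or New Relic's implementation: the KPI keys, the `KpiResult` type, and the sample measurements are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

# Targets quoted in the text, in seconds. The keys are illustrative
# labels, not New Relic attribute names.
KPI_TARGETS_SECONDS = {
    "time_to_first_byte": 1.0,
    "time_to_dom_interactive": 2.0,
    "page_load_time": 4.0,
}


@dataclass
class KpiResult:
    name: str
    measured: float
    target: float

    @property
    def within_target(self) -> bool:
        return self.measured <= self.target


def evaluate_kpis(measurements: dict[str, float]) -> list[KpiResult]:
    """Compare each measured KPI with its target; KPIs with no measurement are skipped."""
    return [
        KpiResult(name, measurements[name], target)
        for name, target in KPI_TARGETS_SECONDS.items()
        if name in measurements
    ]


# Hypothetical sample values, not Chegg data.
sample = {
    "time_to_first_byte": 0.8,
    "time_to_dom_interactive": 2.4,
    "page_load_time": 3.9,
}

for result in evaluate_kpis(sample):
    status = "ok" if result.within_target else "over target"
    print(f"{result.name}: {result.measured:.1f}s (target {result.target:.1f}s) {status}")
```

Keeping the targets in one table means a dashboard or alert rule can be derived from the same source of truth instead of repeating the numbers in several places.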

87% reduction in MTTR

50% reduction in page load times in Australia

80% of alerts sourced from New Relic

Chegg dashboard

The dashboard provides a single dynamic view of aggregated data over the entire Chegg portfolio, as well as the ability to filter information to focus on a single product. In addition, with just a few clicks, the team set up alerts on important KPIs and integrated Slack and VictorOps, so that New Relic sends an alert whenever a KPI exceeds its threshold.

This level of visibility and insight is also helping Chegg to ensure a quality digital experience for international customers as it expands into new markets. For example, a recent issue affecting Canadian customers represented a small percentage of all transactions for Chegg. However, 100% of all transactions in Canada were affected. Being able to break out specific cohorts of customers based on country code enables Evans and team to better understand the user experience on a country-by-country basis.
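The Canadian example shows why an aggregate error rate can hide a regional outage. The sketch below uses made-up data (95 successful US transactions and 5 failed Canadian ones) to show how grouping by country code surfaces the problem. The function name and data shape are assumptions for illustration, not how New Relic stores or queries events.

```python
from collections import defaultdict


def error_rate_by_country(transactions):
    """Return {country_code: fraction of failed transactions}.

    `transactions` is an iterable of (country_code, failed) pairs.
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for country_code, failed in transactions:
        totals[country_code] += 1
        if failed:
            failures[country_code] += 1
    return {code: failures[code] / totals[code] for code in totals}


# Hypothetical data, not Chegg's: every Canadian transaction fails,
# but Canada is only a small share of overall traffic.
transactions = [("US", False)] * 95 + [("CA", True)] * 5

overall = sum(failed for _, failed in transactions) / len(transactions)
by_country = error_rate_by_country(transactions)

print(f"overall error rate: {overall:.0%}")
for code, rate in sorted(by_country.items()):
    print(f"{code}: {rate:.0%}")
```

In this toy data the overall rate looks modest, while the per-country breakdown shows that every Canadian transaction failed, which is the pattern described above.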

Another way New Relic helps Chegg with its international expansion is by breaking out page performance by country. This was especially evident with the implementation of Amazon CloudFront. “We’re actually able to use New Relic data to show the improvements from our CloudFront implementation,” says Evans. “In Australia, it moved us from page loads of about 11 seconds down to around 7 seconds—a huge impact. Being able to measure that data accurately allows us to now refocus our engineering efforts on the digital experience rather than just page-load times.”

Embedding observability within the engineering culture

Chegg’s primary goal is to ensure that every college student has a positive experience accessing its education tools and services. Evans is confident they will: “Customers can expect a good experience with Chegg because New Relic gives our engineers insights into the behavior of their applications, so they can continuously improve that experience.”

Evans concludes, “New Relic is such an embedded part of our engineering culture now. It’s hard to imagine what life would be like without it.”