How Light & Wonder built a predictive maintenance solution for gaming machines on AWS

This post is co-written with Aruna Abeyakoon and Denisse Colin from Light & Wonder (L&W).

Headquartered in Las Vegas, Light & Wonder, Inc. is the leading cross-platform global game company that provides gambling products and services. Working with AWS, Light & Wonder recently developed an industry-first secure solution, Light & Wonder Connect (LnW Connect), which at full rollout will stream telemetry and machine health data from approximately half a million electronic gaming machines distributed across its global casino customer base. Over 500 machine events are monitored in near-real time to give a full picture of machine conditions and their operating environments. Using data streamed through LnW Connect, L&W aims to create a better gaming experience for its end-users and bring more value to its casino customers.

Light & Wonder teamed up with the Amazon ML Solutions Lab to use events data streamed from LnW Connect to enable machine learning (ML)-powered predictive maintenance for slot machines. Predictive maintenance is a common ML use case for businesses with physical equipment or machinery assets. With predictive maintenance, L&W can get advance warning of machine breakdowns and proactively dispatch a service team to inspect the issue. This reduces machine downtime and avoids significant revenue loss for casinos. Without a remote diagnostic system in place, issue resolution by the Light & Wonder service team on the casino floor can be costly and inefficient, while severely degrading the customer gaming experience.

The nature of the project is highly exploratory: this is the first attempt at predictive maintenance in the gaming industry. The Amazon ML Solutions Lab and L&W team embarked on an end-to-end journey from formulating the ML problem and defining the evaluation metrics, to delivering a high-quality solution. The final ML model combines CNN and Transformer, which are the state-of-the-art neural network architectures for modeling sequential machine log data. This post presents a detailed description of this journey, and we hope you will enjoy it as much as we do!

In this post, we discuss the following:

  • How we formulated the predictive maintenance problem as an ML problem with a set of appropriate metrics for evaluation
  • How we prepared data for training and testing
  • Data preprocessing and feature engineering techniques we employed to obtain performant models
  • How we performed hyperparameter tuning with Amazon SageMaker Automatic Model Tuning
  • Comparisons between the baseline model and the final CNN+Transformer model
  • Additional techniques we used to improve model performance, such as ensembling

Background

In this section, we discuss the issues that necessitated this solution.

Dataset

Slot machines are highly regulated and are deployed in an air-gapped environment. In LnW Connect, an encryption process was designed to provide a secure and reliable mechanism for bringing the data into an AWS data lake for predictive modeling. The aggregated files are encrypted, and the decryption key is only available in AWS Key Management Service (AWS KMS). A cellular-based private network into AWS is set up, through which the files are uploaded into Amazon Simple Storage Service (Amazon S3).

LnW Connect streams a wide range of machine events, such as start of game, end of game, and more. The system collects over 500 different types of events. As shown in the following table, each event is recorded along with a timestamp of when it happened and the ID of the machine recording the event. LnW Connect also records when a machine enters a non-playable state; the machine is marked as a failure or breakdown if it doesn't recover to a playable state within a sufficiently short time span.

Machine ID   Event Type ID   Timestamp
0            E1              2022-01-01 00:17:24
0            E3              2022-01-01 00:17:29
1000         E4              2022-01-01 00:17:33
114          E234            2022-01-01 00:17:34
222          E100            2022-01-01 00:17:37

In addition to dynamic machine events, static metadata about each machine is also available. This includes information such as the machine's unique identifier, cabinet type, location, operating system, software version, game theme, and more, as shown in the following table. (All the names in the table are anonymized to protect customer information.)

Machine ID   Cabinet Type   OS        Location                  Game Theme
276          A              OS_Ver0   AA Resort & Casino        StormMaiden
167          B              OS_Ver1   BB Casino, Resort & Spa   UHMLIndia
13           C              OS_Ver0   CC Casino & Hotel         TerrificTiger
307          D              OS_Ver0   DD Casino Resort          NeptunesRealm
70           E              OS_Ver0   EE Resort & Casino        RLPMealTicket

Problem definition

We treat the predictive maintenance problem for slot machines as a binary classification problem. The ML model takes in the historical sequence of machine events and other metadata and predicts whether a machine will encounter a failure in a 6-hour future time window. If a machine will break down within 6 hours, it is deemed a high-priority machine for maintenance. Otherwise, it is low priority. The following figure gives examples of low-priority (top) and high-priority (bottom) samples. We use a fixed-length look-back time window to collect historical machine event data for prediction. Experiments show that longer look-back time windows improve model performance significantly (more details later in this post).

low priority and high priority examples
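The labeling rule described above can be sketched in a few lines. This is a minimal illustration, not the production pipeline: the function name `build_sample`, the tuple layouts, the 12-hour look-back value, and the breakdown record are all assumptions made for the example. Only the 6-hour prediction horizon comes from the problem definition.

```python
from datetime import datetime, timedelta

LOOK_BACK = timedelta(hours=12)  # assumed look-back length; tuned in practice
HORIZON = timedelta(hours=6)     # prediction window from the problem definition


def build_sample(events, failures, machine_id, t):
    """Return (look-back events, label) for one machine at prediction time t.

    events:   list of (machine_id, event_type, timestamp) tuples
    failures: list of (machine_id, timestamp) tuples marking breakdowns
    label:    1 = high priority (failure within the next 6 hours), else 0
    """
    history = [
        (event_type, ts)
        for mid, event_type, ts in events
        if mid == machine_id and t - LOOK_BACK <= ts < t
    ]
    label = int(any(
        mid == machine_id and t <= ts < t + HORIZON
        for mid, ts in failures
    ))
    return history, label


events = [
    (0, "E1", datetime(2022, 1, 1, 0, 17, 24)),
    (0, "E3", datetime(2022, 1, 1, 0, 17, 29)),
    (1000, "E4", datetime(2022, 1, 1, 0, 17, 33)),
]
failures = [(0, datetime(2022, 1, 1, 4, 0, 0))]  # hypothetical breakdown record

history, label = build_sample(events, failures, 0, datetime(2022, 1, 1, 1, 0, 0))
```

Here machine 0 has two events in its look-back window and a breakdown three hours after the prediction time, so the sample is labeled high priority.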

Modeling challenges

We faced several challenges solving this problem:

  • We have a huge volume of event logs, containing around 50 million events a month (from approximately 1,000 game samples). Careful optimization is needed in the data extraction and preprocessing stage.
  • Event sequence modeling was challenging due to the extremely uneven distribution of events over time. A 3-hour window can contain anywhere from tens to thousands of events.
  • Machines are in a good state most of the time and high-priority maintenance is a rare class, which introduced a class imbalance issue.
  • New machines are added continuously to the system, so we had to make sure our model can handle prediction on new machines that have never been seen in training.

Data preprocessing and feature engineering

In this section, we discuss our methods for data preparation and feature engineering.

Feature engineering

Slot machine feeds are streams of unequally spaced time series events; for example, the number of events in a 3-hour window can range from tens to thousands. To handle this imbalance, we used event frequencies instead of the raw sequence data. A straightforward approach is aggregating the event frequency over the entire look-back window and feeding it into the model. However, with this representation the temporal information is lost, and the order of events is not preserved. We instead used temporal binning: we divide the time window into N equal sub-windows and calculate the event frequencies in each. The final features of a time window are the concatenation of all its sub-window features. Increasing the number of bins preserves more temporal information. The following figure illustrates temporal binning on a sample window.

temporal binning on a sample window

First, the sample time window is split into two equal sub-windows (bins); we used only two bins here to keep the illustration simple. Then, the counts of the events E1, E2, E3, and E4 are calculated in each bin. Lastly, they are concatenated and used as features.
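The binning steps above can be sketched as follows. This is a minimal standard-library illustration. The function name `temporal_bin_features` and the sample timestamps are assumptions made for the example, not values from the L&W dataset.

```python
from collections import Counter
from datetime import datetime


def temporal_bin_features(events, window_start, window_end, num_bins, event_types):
    """Count each event type per equal-width sub-window and concatenate the counts.

    events: list of (event_type, timestamp) tuples
    """
    bin_width = (window_end - window_start) / num_bins
    counts = [Counter() for _ in range(num_bins)]
    for event_type, ts in events:
        if window_start <= ts < window_end:
            # Index of the sub-window that contains this timestamp
            idx = min(int((ts - window_start) / bin_width), num_bins - 1)
            counts[idx][event_type] += 1
    # Bin 0 counts first, then bin 1, and so on
    return [counts[b][e] for b in range(num_bins) for e in event_types]


# Hypothetical 2-hour window split into two bins
sample_events = [
    ("E1", datetime(2022, 1, 1, 0, 10)),
    ("E2", datetime(2022, 1, 1, 0, 20)),
    ("E1", datetime(2022, 1, 1, 0, 50)),
    ("E3", datetime(2022, 1, 1, 1, 10)),
    ("E4", datetime(2022, 1, 1, 1, 30)),
    ("E4", datetime(2022, 1, 1, 1, 45)),
]
features = temporal_bin_features(
    sample_events,
    datetime(2022, 1, 1, 0, 0),
    datetime(2022, 1, 1, 2, 0),
    num_bins=2,
    event_types=["E1", "E2", "E3", "E4"],
)
# features == [2, 1, 0, 0, 0, 0, 1, 2]
```

With a single bin the same function reduces to the plain whole-window frequency count, which is the representation that loses event order.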

Along with the event frequency-based features, we used machine-specific features like software version, cabinet type, game theme, and game version. Additionally, we added timestamp-related features to capture seasonality, such as hour of the day and day of the week.

Data preparation

To extract data efficiently for training and testing, we use Amazon Athena and the AWS Glue Data Catalog. The events data is stored in Amazon S3 in Parquet format and partitioned according to day/month/hour. This facilitates efficient extraction of data samples within a specified time window. We use data from all machines in the latest month for testing and the rest of the data for training, which helps avoid potential data leakage.
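The time-based split described above can be sketched as follows. This is an illustrative sketch only: the `prediction_time` key and the `split_by_time` helper are hypothetical names. The split date matches the 7/1/2022 split point reported in the results section.

```python
from datetime import datetime


def split_by_time(samples, split_point):
    """Put every sample before split_point in train and the rest in test.

    samples: list of dicts that each carry a 'prediction_time' datetime
    """
    train = [s for s in samples if s["prediction_time"] < split_point]
    test = [s for s in samples if s["prediction_time"] >= split_point]
    return train, test


samples = [
    {"machine_id": 276, "prediction_time": datetime(2022, 5, 14, 8, 0)},
    {"machine_id": 167, "prediction_time": datetime(2022, 6, 30, 23, 0)},
    {"machine_id": 276, "prediction_time": datetime(2022, 7, 1, 0, 0)},
    {"machine_id": 13, "prediction_time": datetime(2022, 7, 9, 12, 0)},
]
train, test = split_by_time(samples, datetime(2022, 7, 1))
```

Splitting on time rather than at random keeps every test sample later than all training samples, so the model is never evaluated on a period it has already seen.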

ML methodology and model training

In this section, we discuss our baseline model with AutoGluon and how we built a customized neural network with SageMaker automatic model tuning.

Building a baseline model with AutoGluon

With any ML use case, it's important to establish a baseline model to be used for comparison and iteration. We used AutoGluon to explore several classic ML algorithms. AutoGluon is an easy-to-use AutoML tool that uses automatic data processing, hyperparameter tuning, and model ensembling. The best baseline was achieved with a weighted ensemble of gradient boosted decision tree models. The ease of use of AutoGluon helped us in the discovery stage to navigate quickly and efficiently through a wide range of possible data and ML modeling directions.

Building and tuning a customized neural network model with SageMaker automatic model tuning

After experimenting with different neural network architectures, we built a customized deep learning model for predictive maintenance. Our model surpassed the AutoGluon baseline model by 121% in recall at 80% precision. The final model ingests historical machine event sequence data, time features such as hour of the day, and static machine metadata. We use SageMaker automatic model tuning jobs to search for the best hyperparameters and model architectures.

The following figure shows the model architecture. We first normalize the binned event sequence data by the average frequency of each event in the training set to remove the overwhelming effect of high-frequency events (start of game, end of game, and so on). The embeddings for individual events are learnable, while the temporal feature embeddings (day of the week, hour of the day) are extracted using the GluonTS package. Then we concatenate the event sequence data with the temporal feature embeddings as the input to the model. The model consists of the following layers:

  • Convolutional layers (CNN) – Each CNN layer consists of two 1-dimensional convolutional operations with residual connections. The output of each CNN layer has the same sequence length as the input to allow for easy stacking with other modules. The total number of CNN layers is a tunable hyperparameter.
  • Transformer encoder layers (TRANS) – The output of the CNN layers is fed together with the positional encoding to a multi-head self-attention structure. We use TRANS to directly capture temporal dependencies instead of using recurrent neural networks. Here, binning of the raw sequence data (reducing length from thousands to hundreds) helps alleviate GPU memory bottlenecks, while keeping the chronological information to a tunable extent (the number of bins is a tunable hyperparameter).
  • Aggregation layers (AGG) – The final layer combines the metadata information (game theme type, cabinet type, locations) to produce the priority level probability prediction. It consists of several pooling layers and fully connected layers for incremental dimension reduction. The multi-hot embeddings of metadata are also learnable, and don't go through the CNN and TRANS layers because they don't contain sequential information.
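To make the length-preserving property of the CNN layers concrete, here is a toy single-channel sketch in plain Python. It is not the production model, which would use a deep learning framework with many channels and learned kernels. The placement of the ReLU activation and the function names are assumptions made for the illustration.

```python
def conv1d_same(x, kernel):
    """1-D convolution with zero padding so that len(output) == len(x)."""
    k = len(kernel)
    pad = k // 2  # assumes an odd kernel size, as in the 3/5/7/9 search range
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [
        sum(kernel[j] * padded[i + j] for j in range(k))
        for i in range(len(x))
    ]


def relu(x):
    return [max(0.0, v) for v in x]


def residual_cnn_layer(x, kernel_a, kernel_b):
    """Two convolutions plus a skip connection: x + conv_b(relu(conv_a(x)))."""
    hidden = relu(conv1d_same(x, kernel_a))
    return [xi + hi for xi, hi in zip(x, conv1d_same(hidden, kernel_b))]


binned_sequence = [1.0, 2.0, 3.0, 4.0]
identity_kernel = [0.0, 1.0, 0.0]
out = residual_cnn_layer(binned_sequence, identity_kernel, identity_kernel)
# out == [2.0, 4.0, 6.0, 8.0]; same length as the input
```

Because every layer returns a sequence of the input length, any number of such layers (including zero) can be stacked in front of the Transformer encoder without changing the downstream shapes.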

customized neural network model architecture
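The frequency normalization applied before the model input can be sketched as below. The post states only that binned counts are normalized by each event's average frequency in the training set. Dividing by the per-bin training mean, the small `eps` guard, and the function names are assumptions of this sketch.

```python
def mean_event_frequencies(train_windows):
    """Average per-bin count of each event type over all training windows.

    train_windows: list of windows; each window is a list of bins;
    each bin is a list of counts, one per event type.
    """
    num_types = len(train_windows[0][0])
    totals = [0.0] * num_types
    num_bins = 0
    for window in train_windows:
        for bin_counts in window:
            num_bins += 1
            for i, count in enumerate(bin_counts):
                totals[i] += count
    return [total / num_bins for total in totals]


def normalize_window(window, means, eps=1e-8):
    """Divide each count by its event type's training-set mean."""
    return [
        [count / (mean + eps) for count, mean in zip(bin_counts, means)]
        for bin_counts in window
    ]


# One training window, two bins, two event types:
# a frequent event (for example, start of game) and a rare one
train_windows = [[[100, 1], [300, 3]]]
means = mean_event_frequencies(train_windows)         # [200.0, 2.0]
normalized = normalize_window(train_windows[0], means)
```

After normalization both event types vary around 1.0, so the frequent event no longer dominates the input purely because of its raw count.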

We use the cross-entropy loss with class weights as tunable hyperparameters to adjust for the class imbalance issue. In addition, the numbers of CNN and TRANS layers are crucial hyperparameters whose search ranges include 0, which means a given layer type may not always exist in the model architecture. This way, we have a unified framework in which the model architectures are searched along with other usual hyperparameters.
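A class-weighted cross-entropy for the binary case can be sketched as follows. How the tuned weight maps onto the two classes is not spelled out in the post. This sketch assumes a single weight `loss_weight` in (0, 1) applied to the high-priority class and `1 - loss_weight` applied to the low-priority class.

```python
import math


def weighted_bce(probs, labels, loss_weight, eps=1e-12):
    """Mean binary cross-entropy with class weights.

    probs:       predicted probability of the high-priority class
    labels:      1 = high priority, 0 = low priority
    loss_weight: weight on the positive class (assumption of this sketch);
                 the negative class gets 1 - loss_weight
    """
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        if y == 1:
            total += -loss_weight * math.log(p)
        else:
            total += -(1.0 - loss_weight) * math.log(1.0 - p)
    return total / len(labels)


# With loss_weight=0.9, missing a failure costs far more than a false alarm
missed_failure = weighted_bce([0.1], [1], loss_weight=0.9)
false_alarm = weighted_bce([0.9], [0], loss_weight=0.9)
```

Raising the weight on the rare class pushes the model toward recall, while lowering it favors precision, which is why the weight is worth tuning rather than fixing.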

We use SageMaker automatic model tuning, also known as hyperparameter optimization (HPO), to efficiently explore model variations and the large search space of all hyperparameters. Automatic model tuning receives the customized algorithm, training data, and hyperparameter search space configurations, and searches for the best hyperparameters using strategies such as Bayesian optimization, Hyperband, and more, with multiple GPU instances in parallel. After evaluating on a hold-out validation set, we obtained the best model architecture with two layers of CNN, one layer of TRANS with four heads, and an AGG layer.

We used the following hyperparameter ranges to search for the best model architecture:

from sagemaker.tuner import CategoricalParameter, ContinuousParameter, IntegerParameter

hyperparameter_ranges = {
    # Learning rate
    "learning_rate": ContinuousParameter(5e-4, 1e-3, scaling_type="Logarithmic"),
    # Class weights
    "loss_weight": ContinuousParameter(0.1, 0.9),
    # Number of input bins
    "num_bins": CategoricalParameter([10, 40, 60, 120, 240]),
    # Dropout rate
    "dropout_rate": CategoricalParameter([0.1, 0.2, 0.3, 0.4, 0.5]),
    # Model embedding dimension
    "dim_model": CategoricalParameter([160, 320, 480, 640]),
    # Number of CNN layers (0 means no CNN layers)
    "num_cnn_layers": IntegerParameter(0, 10),
    # CNN kernel size
    "cnn_kernel": CategoricalParameter([3, 5, 7, 9]),
    # Number of transformer layers (0 means no transformer layers)
    "num_transformer_layers": IntegerParameter(0, 4),
    # Number of transformer attention heads
    "num_heads": CategoricalParameter([4, 8]),
    # Number of RNN layers (optional)
    "num_rnn_layers": IntegerParameter(0, 10),
    # RNN input dimension size
    "dim_rnn": CategoricalParameter([128, 256]),
}

To further improve model accuracy and reduce model variance, we trained the model with multiple independent random weight initializations and averaged the resulting probabilities to obtain the final prediction. There is a trade-off between more computing resources and better model performance, and we observed that 5–10 models is a reasonable number for the current use case (results shown later in this post).
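The averaging step can be sketched as follows. The function name `ensemble_mean` and the probabilities are illustrative only. In practice each inner list would come from one trained model scored on the same samples.

```python
from statistics import mean


def ensemble_mean(per_model_probs):
    """Average predicted probabilities across models, sample by sample.

    per_model_probs: one list of probabilities per model, all the same length
    """
    return [mean(sample_probs) for sample_probs in zip(*per_model_probs)]


# Three models trained from different random initializations (made-up outputs)
per_model_probs = [
    [0.2, 0.9],
    [0.4, 0.7],
    [0.3, 0.8],
]
final_probs = ensemble_mean(per_model_probs)  # approximately [0.3, 0.8]
```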

Model performance results

In this section, we present the model performance evaluation metrics and results.

Evaluation metrics

Precision is very important for this predictive maintenance use case. Low precision means reporting more false maintenance calls, which drives costs up through unnecessary maintenance. Because average precision (AP) doesn't fully align with the high precision objective, we introduced a new metric named average recall at high precisions (ARHP). ARHP is equal to the average of the recalls at the 60%, 70%, and 80% precision points. We also used precision at top K% (K=1, 10), AUPR, and AUROC as additional metrics.
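ARHP and precision at top K% can be computed from ranked model scores, as in the sketch below. The post defines ARHP only as the average of the recalls at three precision points. Taking the recall at a precision point to be the highest recall over all score thresholds that meet that precision is an assumption of this sketch, as are the function names and the toy data. The code also assumes at least one positive label.

```python
def recall_at_precision(scores, labels, min_precision):
    """Highest recall over all score thresholds with precision >= min_precision."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    total_pos = sum(labels)
    best_recall, tp = 0.0, 0
    for k, (score, label) in enumerate(ranked, start=1):
        tp += label
        # Only cut where the score changes, so tied scores stay together
        is_threshold = k == len(ranked) or ranked[k][0] != score
        if is_threshold and tp / k >= min_precision:
            best_recall = max(best_recall, tp / total_pos)
    return best_recall


def arhp(scores, labels, precisions=(0.6, 0.7, 0.8)):
    """Average recall at high precisions."""
    recalls = [recall_at_precision(scores, labels, p) for p in precisions]
    return sum(recalls) / len(recalls)


def precision_at_top_k_percent(scores, labels, k_percent):
    """Precision among the k_percent highest-scored samples."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    n = max(1, int(len(ranked) * k_percent / 100))
    return sum(label for _, label in ranked[:n]) / n


scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0]
metric = arhp(scores, labels)  # recalls 1.0, 0.75, 0.5 -> 0.75
```

Unlike average precision, which integrates over the whole precision-recall curve, this metric looks only at the high-precision region that matters when every false alarm triggers a service call.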

Results

The following table summarizes the results using the baseline and the customized neural network models, with 7/1/2022 as the train/test split point. Experiments show that increasing the window length and sample data size both improve the model performance, because they contain more historical information to help with the prediction. Regardless of the data settings, the neural network model outperforms AutoGluon in all metrics. For example, in the 48-hour setting, recall at the fixed 80% precision increases by 121% (from 11.5 to 25.4), which means the neural network model lets you quickly identify more malfunctioning machines.

Model                Window length/Data size   AUROC   AUPR   ARHP   Recall@Prec0.6   Recall@Prec0.7   Recall@Prec0.8   Prec@top1%   Prec@top10%
AutoGluon baseline   12H/500k                  66.5    36.1   9.5    12.7             9.3              6.5              85           42
Neural Network       12H/500k                  74.7    46.5   18.5   25               18.1             12.3             89           55
AutoGluon baseline   48H/1mm                   70.2    44.9   18.8   26.5             18.4             11.5             92           55
Neural Network       48H/1mm                   75.2    53.1   32.4   39.3             32.6             25.4             94           65
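As a quick arithmetic check on the table, the snippet below recomputes ARHP from the three recall columns and derives the relative gain in recall at 80% precision. The numbers are copied from the table. The dictionary layout and variable names are just for illustration.

```python
# (Recall@Prec0.6, Recall@Prec0.7, Recall@Prec0.8, reported ARHP)
rows = {
    "AutoGluon 12H/500k": (12.7, 9.3, 6.5, 9.5),
    "Neural Network 12H/500k": (25.0, 18.1, 12.3, 18.5),
    "AutoGluon 48H/1mm": (26.5, 18.4, 11.5, 18.8),
    "Neural Network 48H/1mm": (39.3, 32.6, 25.4, 32.4),
}

# ARHP is the mean of the three recall values in each row
recomputed_arhp = {
    name: round(sum(values[:3]) / 3, 1) for name, values in rows.items()
}

# Relative improvement in Recall@Prec0.8 for the 48-hour setting
baseline = rows["AutoGluon 48H/1mm"][2]
neural = rows["Neural Network 48H/1mm"][2]
relative_gain_pct = (neural - baseline) / baseline * 100  # about 120.9
```

The recomputed ARHP values agree with the reported column after rounding, and the 48-hour gain rounds to the 121% quoted in the text.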

The following figures illustrate the effect of using ensembles to boost the neural network model performance. All the evaluation metrics shown on the y-axis are improved, with higher mean (more accurate) and lower variance (more stable). Each box plot is from 12 repeated experiments, from no ensembles to 10 models in ensembles (x-axis). Similar trends persist in all metrics beyond the Prec@top1% and Recall@Prec80% shown.

After factoring in the computational cost, we observe that using 5–10 models in ensembles is suitable for Light & Wonder datasets.

Conclusion

Our collaboration has resulted in the creation of a groundbreaking predictive maintenance solution for the gaming industry, as well as a reusable framework that could be applied to a variety of predictive maintenance scenarios. The adoption of AWS technologies such as SageMaker automatic model tuning helps Light & Wonder navigate new opportunities using near-real-time data streams. Light & Wonder is starting the deployment on AWS.

If you would like help accelerating the use of ML in your products and services, please contact the Amazon ML Solutions Lab program.


About the authors

Aruna Abeyakoon is the Senior Director of Data Science & Analytics at Light & Wonder Land-based Gaming Division. Aruna leads the industry-first Light & Wonder Connect initiative and supports both casino partners and internal stakeholders with consumer behavior and product insights to make better games, optimize product offerings, manage assets, and enable health monitoring and predictive maintenance.

Denisse Colin is a Senior Data Science Manager at Light & Wonder, a leading cross-platform global game company. She is a member of the Gaming Data & Analytics team, helping develop innovative solutions to improve product performance and customers' experiences through Light & Wonder Connect.

Tesfagabir Meharizghi is a Data Scientist at the Amazon ML Solutions Lab, where he helps AWS customers across various industries such as gaming, healthcare and life sciences, manufacturing, automotive, and sports and media accelerate their use of machine learning and AWS cloud services to solve their business challenges.

Mohamad Aljazaery is an Applied Scientist at the Amazon ML Solutions Lab. He helps AWS customers identify and build ML solutions to address their business challenges in areas such as logistics, personalization and recommendations, computer vision, fraud prevention, forecasting, and supply chain optimization.

Yawei Wang is an Applied Scientist at the Amazon ML Solutions Lab. He helps AWS business partners identify and build ML solutions to address their organization's business challenges in real-world scenarios.

Yun Zhou is an Applied Scientist at the Amazon ML Solutions Lab, where he helps with research and development to ensure the success of AWS customers. He works on pioneering solutions for various industries using statistical modeling and machine learning techniques. His interests include generative models and sequential data modeling.

Panpan Xu is an Applied Science Manager with the Amazon ML Solutions Lab at AWS. She works on research and development of machine learning algorithms for high-impact customer applications in a variety of industrial verticals to accelerate their AI and cloud adoption. Her research interests include model interpretability, causal analysis, human-in-the-loop AI, and interactive data visualization.

Raj Salvaji leads Solutions Architecture in the Hospitality segment at AWS. He works with hospitality customers by providing strategic guidance and technical expertise to create solutions to complex business challenges. He draws on 25 years of experience in multiple engineering roles across the hospitality, finance, and automotive industries.

Shane Rai is a Principal ML Strategist with the Amazon ML Solutions Lab at AWS. He works with customers across a diverse spectrum of industries to solve their most pressing and innovative business needs using AWS's breadth of cloud-based AI/ML services.
