Control Room Five Whys with ChatGPT
Step into the control room of a Methanol chemical plant to troubleshoot faults using 'five whys' with ChatGPT.
Last week, Exponential Industry welcomed David Ariens, publisher of The IT/OT Insider, to discuss IT/OT Convergence with ‘Henry Elon Rogo’, a custom ChatGPT built by Exponential Industry. The GPT provided numerous tips for building a fully digital factory. Once the digital technologies are converged, the GPT can help us diagnose factory issues in real time.
This week, using the ‘Five whys’ technique, we will use the Exponential Industry GPT to investigate ACME Methanol’s plant from ControlRooms.ai. There are known criticisms of this technique. However, using a GPT as an assistant may be enough to overcome them and to evaluate the problem space holistically, even when the investigator has limited knowledge.
Note: For context on Methanol plants, check out “Designing Basic & Detailed Processes for Mega Methanol Plants” at Valve World. It’s quite amazing that ChatGPT can guess some of the physical components based solely on the limited information in a process tag name (like BFW).
Troubleshooting the Night Shift
I’ve set the demo environment time range from 12/13/2023 20:00:00 to 12/14/2023 08:00:00 Plant time if you’d like to play along with me.
Prompt: Imagine you are a plant manager for ACME Methanol, a chemical plant producing Methanol. You are going to identify the root cause of plant interruptions using the five whys technique. I will show you an image of a plant dashboard and ask you to identify why.
Understood. Please go ahead and share the image of the plant dashboard. Once I have the image, I'll use the Five Whys technique to help identify potential root causes of the interruptions in the ACME Methanol plant's operations.
Why did the plant experience disruptions?
Prompt: Based on this plant status dashboard image of the December 13th night shift, why did the plant experience disruptions?
The image you've shared appears to be a status dashboard for a Methanol plant, detailing the operation of various systems over a time period from the evening of December 13th to the morning of December 14th. The dashboard is showing timestamps where certain systems had events, which could be normal operations or could indicate disruptions.
To initiate the Five Whys analysis, we need to start with a specific event or problem. I'll use the information visible in the image to identify a hypothetical issue and then perform the Five Whys technique. Here's an example based on the provided dashboard:
Why was there a disruption in the plant?
There were multiple timestamps where the systems such as Reaction North, Reaction South, and Cooling had events that could indicate disruptions.
Why did those systems have events at those times?
Potential causes could be equipment failure, process deviation, or external factors affecting those systems.
Why did equipment failure or process deviation occur?
Equipment failure could be due to a lack of maintenance or aging equipment. Process deviation could be due to operator error or unexpected variations in feedstock quality.
Why was there a lack of maintenance or aging equipment?
The maintenance schedule may not be adequate, or there may have been a delay or oversight in performing scheduled maintenance. Aging equipment might be due to capital investment deferral.
Why was the maintenance schedule not adequate, or why was capital investment deferred?
This could be due to budget constraints, management decisions prioritizing other investments, or a lack of awareness of the importance of timely maintenance and equipment updates.
This iterative questioning helps to dig deeper into the issue. However, to identify the actual root cause, plant data, logs, maintenance records, and input from the operation team would be required. The events marked on the dashboard would need to be cross-referenced with actual incident reports to confirm if they indeed were disruptions and to gather more details about each event.
Prompt: Only answer my first why question and nothing more. I will prompt you for the second question.
Why was there a disruption in the plant?
Based on the dashboard image, disruptions in the plant could be indicated by the events marked at various systems. The "why" for these disruptions could be due to several reasons such as equipment malfunction, process upsets, control system anomalies, or fluctuations in feedstock quality. The exact nature of the disruption would require further investigation into the specific incidents that these markers represent, including checking logs, system alerts, and operator reports from the time of the events.
Prompt: Which systems caused the most severe disruptions based on the dashboard image?
The dashboard displays various systems and timestamps where events occurred. Severity is typically indicated by color-coding; however, in the provided image, there's no explicit indication of severity attached to the events. Typically, we would look for red markers or similar indicators to signify high severity.
Without such color differentiation or additional context, we must infer based on the frequency and pattern of the events. The systems with the most frequent event markers could potentially be the ones causing the most severe disruptions. From the image, the "Cooling" and "Reaction North" systems show a higher frequency of events compared to other systems. This could suggest that these systems experienced more severe disruptions or were more prone to issues during the timeframe displayed. However, it is important to cross-reference this with the plant's operational logs or incident reports for accurate interpretation of the dashboard data.
Drill Down into ‘Reaction North’ Process Parameters
Prompt: Which process parameters based on the "Reaction North" dashboard image are anomalous?
Based on the "Reaction North" dashboard image, several process parameters exhibit anomalies indicated by spikes or drops in their trend lines. Here's a brief analysis:
FI-9844.PV (PURGE6 #10 FLOW PPH) - There is a significant spike which is an anomaly indicating a sudden increase in flow. This could be due to a valve opening unexpectedly or a control issue.
FIC-21106.PV (BFW FLOW PPH) - This also shows a spike similar to FI-9844.PV, suggesting an unusual flow behavior that could be related to boiler feedwater (BFW) flow.
FIC-24865.PV (STEAM FLOW PPH) - There is a noticeable spike indicating a sudden increase in steam flow. This could be a result of control system actions or a manual intervention.
TT-27715.S.PV (MEOH INLET DEGC) - The spike here is quite severe and goes beyond the scale of the graph. This suggests a rapid temperature change at the methanol inlet, which is quite concerning and could impact reaction kinetics.
TT-27715.6-PV (OUT HEATUP DEGC) - Although less severe than the MEOH INLET spike, there is an anomalous peak indicating a temperature variation.
These anomalies could indicate issues such as instrumentation faults, control loop stability problems, or actual process upsets. Each of these parameters would need to be investigated to understand the underlying cause of the anomaly and to determine the impact on the Reaction North system's operation. It would be essential to cross-reference these anomalies with operational logs, alarm histories, and maintenance records to diagnose the issues accurately.
Why did the flow parameters correlate with a spike?
Prompt: Why would the PURGEG #10 Flow, BFW Flow, and Steam Flow all spike simultaneously?
Simultaneous spikes in PURGE6 #10 Flow, Boiler Feedwater (BFW) Flow, and Steam Flow could be indicative of a systemic event affecting multiple process streams. Here are some potential reasons for this occurrence:
Control System Response: A control system could have reacted to a process deviation by opening valves to increase flow in an attempt to bring the process back within desired parameters.
Instrumentation or Sensor Fault: A common fault in the instrumentation system, such as a power surge, could cause erroneous readings simultaneously if multiple flow meters share the same power supply or signal processing unit.
Process Upset: An upstream process upset could result in a cascade effect where multiple streams are affected simultaneously. For example, a sudden change in reactor conditions could necessitate an increase in coolant flow, which would impact the steam system and necessitate adjustments in the purge flow to maintain pressure balance.
Operational Changes or Manual Intervention: A manual change made by the operating staff, like adjusting setpoints, or an operational change such as a start-up or shutdown procedure, could cause these flows to increase as part of a coordinated action.
Safety System Activation: If a safety system detected an abnormal condition, it might initiate a purge to prevent overpressure or other dangerous conditions, affecting steam and BFW flows as part of the safety procedure.
To determine the exact cause, plant personnel would need to investigate the control system's logs, review operator actions, and check for any alarm or safety system activations that occurred at the time of the spikes.
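One way to check whether several tags really did spike together, rather than trusting the eye on a dashboard image, is to flag spikes per tag and intersect the flagged positions. The following is a minimal sketch in Python, assuming the readings have been exported as per-tag lists sampled on a shared clock. The numeric values, the threshold, and the function names are made up for illustration and do not come from the ControlRooms.ai demo.

```python
from statistics import median

def spike_indices(values, threshold=5.0):
    """Return indices whose distance from the median exceeds threshold * MAD."""
    med = median(values)
    mad = median(abs(v - med) for v in values)  # median absolute deviation
    if mad == 0:
        return set()  # flat signal: nothing can be scored
    return {i for i, v in enumerate(values) if abs(v - med) / mad > threshold}

def simultaneous_spikes(series_by_tag, threshold=5.0):
    """Return the sample indices at which every tag is flagged as a spike."""
    flagged = [spike_indices(v, threshold) for v in series_by_tag.values()]
    return sorted(set.intersection(*flagged)) if flagged else []

# Hypothetical readings for the three flow tags discussed above.
readings = {
    "FI-9844.PV":   [100, 101, 99, 100, 180, 100, 101, 99],
    "FIC-21106.PV": [500, 498, 502, 501, 760, 499, 500, 501],
    "FIC-24865.PV": [300, 301, 299, 300, 450, 300, 302, 298],
}
print(simultaneous_spikes(readings))  # [4]
```

A shared index across otherwise unrelated tags is a hint, not proof, of a common cause such as shared instrumentation. It narrows where to look in the logs.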
Why would a sensor fault occur?
Prompt: We can rule out control system response, process upset, operational changes, and safety system activation. No power surge was detected either. Why would a sensor fault in the reactor occur?
If a sensor fault occurred in the absence of a control system response, process upset, operational changes, safety system activations, or a power surge, there are several potential reasons to consider:
Aging or Degraded Sensors: Over time, sensors can drift from their calibrated settings due to normal wear and tear or due to harsh operating conditions like high temperatures, pressures, or corrosive environments, common in chemical plants.
Wiring or Connection Issues: Faulty wiring, loose connections, or corrosion at connection points can lead to intermittent or false readings.
Signal Interference: Electromagnetic interference from nearby equipment or improperly shielded cables can cause erratic sensor signals.
Component Failure: Failures in components associated with the sensors, such as transmitters or power supply units, can lead to simultaneous anomalies if multiple sensors share the same components.
Software Glitches: Software issues within the data acquisition or signal processing systems can lead to simultaneous spikes in readouts if the fault affects the common software handling multiple sensor outputs.
Calibration Error: If sensors were recently calibrated and a mistake was made, it could affect readings. This is especially true if a single calibration standard or procedure is applied to multiple sensors.
Maintenance Activities: Maintenance work on or near the sensors or their wiring can inadvertently affect sensor readings, especially if not properly isolated or if procedures were not followed.
Environmental Factors: Sudden changes in environmental conditions, such as humidity, temperature, or vibration, can affect sensor performance, particularly if these changes are rapid and sensors are not designed to cope with such fluctuations.
Identifying a sensor fault would typically involve cross-checking with other process parameters, running diagnostics on the sensors and their associated circuitry, and potentially replacing or recalibrating the sensors to see if the problem persists.
Why would environmental factors come into play?
Prompt: It looks like the anomaly occurred precisely at 11:00 PM local time. Which environmental factors would be more likely at that time to cause the sensor fault?
If the sensor fault occurred precisely at 11:00 PM, several environmental factors could be more prevalent at this time that might influence the sensor readings:
Temperature Drop: As the ambient temperature often drops at night, temperature-sensitive sensors might produce different readings if they are not properly compensated for temperature variations.
Condensation: Cooler nighttime temperatures can lead to condensation on sensor surfaces or within sensor housings, especially if the sensors are exposed to the outdoor environment or are in areas with fluctuating temperature and humidity levels.
Shift Change: If there is a shift change around that time, the new crew might be starting up equipment or systems that could cause transient conditions in the plant, such as steam hammer or changes in pressure that momentarily affect sensor readings.
Electrical Load Variations: There may be changes in the electrical load of the plant or surrounding area at this time due to shifts in demand, which could potentially cause fluctuations in power quality (although a power surge was ruled out, other subtle disturbances might still be relevant).
Lighting Systems: The switching on or off of large industrial lighting systems can sometimes cause electrical disturbances or a momentary change in temperature that could affect sensitive equipment.
Scheduled Maintenance or Testing: Some plants have automated testing or calibration checks scheduled during off-peak hours, which could temporarily affect sensor outputs.
Atmospheric Pressure Changes: Changes in atmospheric pressure, which can be more pronounced during night-time temperature transitions, might affect certain types of sensors, particularly those measuring flows and levels.
To pinpoint the exact environmental factor, it would be necessary to review the environmental monitoring data, check for any scheduled activities or changes in the plant, and consider any reports from the operations crew on duty at that time.
Responding to Criticisms of Five Whys
ChatGPT’s expressive nature is annoying in most contexts. Diagnosing complex systems like manufacturing lines is not one of them. Many of the criticisms of ‘five whys’ are due to knowledge or time limitations. Let’s break down each one:
Tendency for investigators to stop at symptoms rather than going on to lower-level root causes.
As demonstrated above, the GPT can analyze dashboard images! The demo does not expose any accessible time series data feeds, but the GPT could also analyze those with Code Interpreter. The multi-modal nature of the GPT makes it easy for an investigator to add more context about the problem and simply ask again. This reduces the temptation to stop early because of time constraints or because data has to be crunched before the next question can be asked.
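To make the time series point concrete, here is a rough sketch of the kind of check that could be run over an exported feed: score each sample against the mean and standard deviation of the samples just before it, and report the timestamps that stand out. The five-minute sampling interval, the values, and the `rolling_zscore_anomalies` helper are assumptions for illustration. The demo does not provide this data.

```python
from datetime import datetime, timedelta
from statistics import mean, stdev

def rolling_zscore_anomalies(samples, window=6, z_limit=4.0):
    """Flag (timestamp, value) pairs far from the mean of the preceding window."""
    anomalies = []
    for i in range(window, len(samples)):
        history = [value for _, value in samples[i - window:i]]
        mu, sigma = mean(history), stdev(history)
        ts, value = samples[i]
        if sigma > 0 and abs(value - mu) / sigma > z_limit:
            anomalies.append((ts, value))
    return anomalies

# Hypothetical steam-flow samples every 5 minutes, starting at 22:00 plant time.
start = datetime(2023, 12, 13, 22, 0)
values = [300, 302, 299, 301, 300, 298, 301, 300,
          299, 302, 301, 300, 455, 301, 300, 299]
samples = [(start + timedelta(minutes=5 * i), v) for i, v in enumerate(values)]

for ts, value in rolling_zscore_anomalies(samples):
    print(ts.isoformat(), value)  # 2023-12-13T23:00:00 455
```

Because the baseline only looks backward, the samples right after a spike are scored against an inflated spread and are not flagged. That is acceptable for finding when an event started, which is what the next ‘why’ needs.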
Inability to go beyond the investigator's current knowledge – the investigator cannot find causes that they do not already know.
The GPT generated context-aware listicles of possible causes along the way, giving the investigator many options to consider potentially outside of their current knowledge. Especially with an aging workforce in manufacturing, a GPT can facilitate knowledge transfer and mentorship in a new way.
Lack of support to help the investigator provide the right answer to "why" questions.
The GPT provided plausible reasons why almost instantly. The investigator didn’t need outside support to be able to keep probing.
Results are not repeatable – different people using five whys come up with different causes for the same problem.
Results will still vary, both between runs of the GPT and between the investigators prompting it. However, they may be more consistent if the same specific context is provided each time.
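One low-tech way to keep that context consistent is to render every ‘why’ from the same template, so that two investigators who know the same facts send the same text. The sketch below only builds a prompt string and calls no model. The `build_why_prompt` function and its field names are hypothetical, and the context values are taken from the session above.

```python
def build_why_prompt(step, question, context):
    """Render one 'why' question with a fixed, alphabetically sorted context block."""
    lines = [
        "You are a plant manager using the five whys technique.",
        f"Answer only why #{step} and nothing more.",
        "Known context:",
    ]
    # Sorting the keys makes the output independent of insertion order.
    lines += [f"- {key}: {context[key]}" for key in sorted(context)]
    lines.append(f"Question: {question}")
    return "\n".join(lines)

context = {
    "time_range": "12/13/2023 20:00:00 to 12/14/2023 08:00:00 plant time",
    "system": "Reaction North",
    "ruled_out": "control system response, process upset, operational changes, "
                 "safety system activation, power surge",
}
print(build_why_prompt(4, "Why would a sensor fault in the reactor occur?", context))
```

A fixed template does not make the model deterministic, but it removes one source of variation: how each person happens to phrase what is already known.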
Tendency to isolate a single root cause, whereas each question could elicit many different root causes.
No tendency to isolate here! The GPT is quite expressive and gives the investigator a wide field of view to avoid single root cause isolation.
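To keep that wide field of view from getting lost between prompts, the candidate causes can be recorded as a tree rather than a single chain, with ruled-out branches marked instead of deleted. This is a minimal sketch, assuming Python 3.9 or later. The `Cause` class and `open_leaves` helper are illustrative names, and the entries are a subset of the causes the GPT listed in the session above.

```python
from dataclasses import dataclass, field

@dataclass
class Cause:
    description: str
    ruled_out: bool = False
    children: list["Cause"] = field(default_factory=list)

    def add(self, description, ruled_out=False):
        """Attach a deeper 'why' under this cause and return it."""
        child = Cause(description, ruled_out)
        self.children.append(child)
        return child

def open_leaves(node):
    """Return leaf causes that are not ruled out anywhere along their path."""
    if node.ruled_out:
        return []
    if not node.children:
        return [node.description]
    leaves = []
    for child in node.children:
        leaves.extend(open_leaves(child))
    return leaves

root = Cause("Simultaneous spikes in three Reaction North flow tags")
root.add("Control system response", ruled_out=True)
root.add("Process upset", ruled_out=True)
sensor = root.add("Instrumentation or sensor fault")
sensor.add("Wiring or connection issues")
env = sensor.add("Environmental factors")
env.add("Condensation")
env.add("Temperature drop")

print(open_leaves(root))
# ['Wiring or connection issues', 'Condensation', 'Temperature drop']
```

Each open leaf is a line of inquiry that still needs evidence, which makes it harder to declare one root cause while others remain unexamined.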
Industrial Productivity Gains with ChatGPT
There are two major problems with any new technology or good: how to build it and how to harness it.
For GPTs, the foundation models are currently being built by large tech companies with access to millions of GPUs. This is unlikely to change in the immediate future.
On the other hand, many industrial companies have huge corpora of designs, operating procedures, telemetry data, and more that they cannot analyze fast enough. Being able to use GPTs to efficiently search this body of knowledge is the future.
The last three weeks demonstrated primarily surface-level capabilities the foundation models already have that can help industrial workers be more productive. In the coming weeks, Exponential Industry will break down some of the underlying concepts of GPTs and how industrial companies can structure their websites, applications, products, etc. to harness GPTs even more effectively.