Thermoelectric coolers (TECs) offer a promising solution for direct cooling of local hotspots and active thermal management in advanced electronic systems. However, TECs present significant trade-offs among spatial cooling, heating, and power consumption. The optimization of TECs requires extensive simulations, which are impractical for managing actual systems with multiple hotspots under spatial and temporal variations. In this study, we present a novel machine learning-assisted optimization algorithm for thermoelectric coolers that can achieve global optimal temperature by individually controlling TEC units based on real-time multi-hotspot conditions across the entire domain. We train a convolutional neural network with a combination of the inception module and multi-task learning approach to comprehend the coupled thermal-electrical physics underlying the system and attain accurate predictions for both temperature and power consumption with and without TECs. Due to the intricate interaction among passive thermal gradient, Peltier effect and Joule effect, a local optimal TEC control experiences spatial temperature trade-off which may not lead to a global optimal solution. To address this issue, we develop a backtracking-based optimization algorithm using the machine learning model to iterate all possible TEC assignments for attaining global optimal solutions. For any m × n matrix with NHS hotspots (n, m ≤ 10, 1 ≤ NHS ≤ 20), our algorithm is capable of providing 52.4% peak temperature reduction and its corresponding TEC array control within an average of 1.64 s while iterating through tens of temperature predictions behind-the-scenes. This represents a speed increase of over three orders of magnitude compared to traditional finite element method strategies which take approximately 27 min.
I. INTRODUCTION
Despite great advancements in semiconductor technology beyond the sub-3 nm node,1 most thermal management techniques nowadays are limited to the macroscale operation. The trend toward device miniaturization and the rapid emergence of System-on-Chip (SoC) inevitably complicate the thermal behavior within microelectronic devices.2,3 Specifically, multiple on-chip hotspots exhibit spatial and temporal changes due to workload variations, environmental fluctuations, device defects and aging, which can occur among modules,4,5 cores (processors),6,7 and transistors.8 The complexity of the hotspot behavior presents unprecedented challenges for conventional thermal management methods which only rely on uniform control, necessitating a more efficient, sophisticated, and intelligent approach capable of on-demand thermal management to ensure optimal functionality and longevity of microelectronic devices.9
Among various active cooling techniques, thermoelectric coolers (TECs) offer distinctive local cooling capability as well as several other advantages,10–12 making them a promising solution to hotspot thermal management. In recent years, there have been emerging designs utilizing single TECs13–16 and TEC arrays17–19 for on-chip hotspot cooling in microelectronic devices. Revolutionary materials, including nanostructured Si,20–22 self-hygroscopic hydrogel,23 and flexible inorganics,24,25 are extensively studied to improve TEC cooling performance. However, TEC cooling exhibits significant trade-offs in spatial temperature and power consumption, and its performance relies on multiple variables including TEC voltages and hotspot conditions.12,26 Because TEC behavior is highly nonlinear, optimization requires solving the model many times, which makes conventional finite element method (FEM) simulations computationally expensive. Furthermore, in actual applications where multiple hotspots undergo spatial and temporal evolution, real-time optimal TEC control becomes difficult or even impossible to achieve with these traditional techniques.
The thriving field of machine learning offers a powerful tool for thermoelectric research by providing neural network models that greatly expedite the process of thermoelectric material selection,27 TEC design,28,29 and optimization.30,31 However, these models primarily focus on the analysis of back-end designs by considering an individual, isolated TEC device,28,29 or use oversimplified optimization logic such as linear control30 and uniform control.31 There remains an unmet and urgent need for a more comprehensive model that can capture the coupled thermal-electrical physics in TECs while predicting their spatial interplay with multiple hotspots undergoing dynamic evolution across the entire domain. Ultimately, this model should be capable of providing responsive TEC control and the corresponding power consumption over the entire domain to achieve real-time global optimal temperature.
In this study, we present a machine learning-assisted optimization algorithm for TECs that fulfills the aforementioned on-demand thermal management. We utilize our previous holey silicon-based TEC array with independent TEC control32 as an illustrative example for conducting the analysis. We develop a convolutional neural network (CNN) with the inception module and multi-task learning (MTL) approach to perceive the spatial correlation of TECs and hotspots, thereby accurately predicting temperature and power consumption by comprehending the thermal-electrical physics underlying the system. During the TEC optimization process, the major challenge lies in the intricate thermal-electrical interaction among multiple hotspots and TECs, since a local optimal TEC control may not lead to a global optimal solution due to temperature redistribution. Therefore, we develop a backtracking-based optimization algorithm that efficiently explores all potential TEC assignments in order to obtain the global optimal temperature based on real-time hotspot conditions. Note that this methodology can be applied to general TEC/TEC array designs with a wide range of thermoelectric materials (e.g., Bi2Te3/Sb2Te3), configurations (e.g., lateral- and vertical-oriented TECs), and device scales (e.g., module-scale and transistor-scale). Consequently, this approach can hopefully provide efficient TEC/TEC array control logics for the future TEC-incorporated electronic systems.
II. TEC MODELING
To demonstrate our study, we adopt our previous theoretical designs,32 namely the holey silicon-based single TEC [Fig. 1(a)] and its scaled-up array [Fig. 1(b)], as our TEC model. The model features a lateral orientation of the TEC components (i.e., Peltier electrodes and holey silicon region) along with the central hotspot. Holey silicon is chosen as the thermoelectric material because it is compatible with microfabrication processes and because its vertical nanoholes substantially reduce the in-plane thermal conductivity through phonon boundary scattering while retaining the excellent electrical properties (electrical conductivity and Seebeck coefficient) of p-type silicon.15,21,33–35 When a positive voltage is applied to the cooler, lateral heat redistribution occurs, which provides active heat removal from the hotspot to the in-plane surroundings. Compared to a single TEC, the TEC array shown in Fig. 1(b) has multiple coolers enclosed by a single ground, which allows for independent TEC control. Based on the existing hotspot conditions, different coolers can accept different input voltages to achieve on-demand thermal management.
Holey silicon-based lateral TEC and its array model. (a) The schematic of a single TEC. (b) The schematic of an arbitrary m × n TEC array with NHS assigned hotspots and NTEC assigned TECs (1 ≤ m, n ≤ 10; 0 ≤ NHS, NTEC ≤ min[m × n, 20]). The intensities of hotspots and TECs are in nine discrete levels. (c) FEM modeling examples of a 3 × 3 hotspot-TEC array with three scenarios: hotspots only, TECs only, and hotspots + TECs.
III. NEURAL NETWORK DEVELOPMENT
Figure 2 illustrates the research workflow, which includes massive data generation and data postprocessing. Here, we utilize MATLAB R2021a to generate random hotspot and TEC inputs based on the given constraints, which are then fed to COMSOL 6.0 to conduct FEM simulations and evaluate temperature maps and power consumption as outputs. In MATLAB, random values of rows (m) and columns (n) in the TEC array are defined first. Then, two random m × n matrices with values from 0 to 8 are generated, representing the hotspot and TEC intensities. The maximum number of assigned hotspots (NHS) and TECs (NTEC) depends not only on the model constraints but also on the robustness of the training process. Here, we select min[m × n, 36] as the maximum number since it exceeds the original span (0 ≤ NHS, NTEC ≤ min[m × n, 20]), which makes the training more robust, while not being so large as to weaken the independence of individual units. With the defined hotspot and TEC inputs, steady-state FEM simulations are conducted, followed by the evaluation of temperature maps. Finally, the temperature is stored in an m × n output and the power consumption at the eight intensities is stored in an 8 × 1 output. The autonomous program generates 100 000 random samples, a number chosen by considering the complexity of the model, the desired accuracy, and the computational resources. After generating the original data set, the data are split into a training set (70%), a development set (10%), and a test set (20%). The training set is further augmented by flipping (horizontal and vertical), rotation (90°, 180°, and 270°), and their combinations, resulting in eight times as many training samples as the original training set (560 000). This augmentation offers a cost-effective way to create new data, improve the model’s accuracy and robustness,36 and facilitate the learning of inherent symmetry.
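The input generation and the eightfold augmentation described above can be sketched in Python with NumPy. This is only an illustration under stated assumptions: the original pipeline uses MATLAB and COMSOL, and the function names `random_intensity_map`, `dihedral_variants`, and `augment` are hypothetical. The sketch assumes that the same geometric transformation is applied to the hotspot map, the TEC map, and the temperature map of a sample, and that the 8 × 1 power output is left unchanged.

```python
import numpy as np

def random_intensity_map(m, n, max_count, rng):
    """Hypothetical generator: an m x n map with at most max_count active
    cells, each carrying a random intensity level from 1 to 8 (0 = inactive)."""
    count = rng.integers(0, min(m * n, max_count) + 1)
    grid = np.zeros(m * n, dtype=int)
    cells = rng.choice(m * n, size=count, replace=False)
    grid[cells] = rng.integers(1, 9, size=count)
    return grid.reshape(m, n)

def dihedral_variants(a):
    """Return the 8 flip/rotation variants of a 2-D array
    (4 rotations, each with and without a left-right flip)."""
    variants = []
    for k in range(4):
        rotated = np.rot90(a, k)
        variants.append(rotated)
        variants.append(np.fliplr(rotated))
    return variants

def augment(hotspots, tecs, temperature, power):
    """Apply the same transformation to both inputs and to the temperature
    map; the power vector is assumed to be invariant under these symmetries."""
    return [
        (h, t, temp, power)
        for h, t, temp in zip(
            dihedral_variants(hotspots),
            dihedral_variants(tecs),
            dihedral_variants(temperature),
        )
    ]

rng = np.random.default_rng(0)
m, n = 4, 6
hotspots = random_intensity_map(m, n, 36, rng)
tecs = random_intensity_map(m, n, 36, rng)
temperature = np.zeros((m, n))   # placeholder for an FEM temperature map
power = np.zeros(8)              # placeholder for the 8 x 1 power output
samples = augment(hotspots, tecs, temperature, power)
print(len(samples))
```

Note that the 90° and 270° rotations swap m and n, so the augmented set also covers the transposed array shape.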
Research workflow. (Left) Massive data generation based on autonomous FEM simulations and (right) data postprocessing, including data splitting, data augmentation, neural network training, and TEC optimization algorithm design.
CNN based on inception module and multi-task learning. The input matrices and output temperature matrix are padded into 10 × 10 arrays for uniform shapes. The output power matrix is an 8 × 1 array corresponding to eight TEC intensities.
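The padding to a uniform 10 × 10 shape mentioned in this caption can be illustrated with a short Python sketch. The source does not state where the m × n block sits inside the padded array or what fill value is used, so the top-left placement and the zero fill below are assumptions, and the helper names are hypothetical.

```python
import numpy as np

GRID = 10  # maximum array dimension considered in this study

def pad_to_grid(a, size=GRID):
    """Embed an m x n array in the top-left corner of a size x size array
    filled with zeros (placement and fill value are assumptions)."""
    a = np.asarray(a, dtype=float)
    m, n = a.shape
    if m > size or n > size:
        raise ValueError("array exceeds the padded grid size")
    out = np.zeros((size, size))
    out[:m, :n] = a
    return out

def crop_from_grid(padded, m, n):
    """Recover the original m x n block from a padded array."""
    return padded[:m, :n]

hotspots = np.array([[0, 3, 0], [8, 0, 1]])
padded = pad_to_grid(hotspots)
restored = crop_from_grid(padded, *hotspots.shape)
print(padded.shape, restored.shape)
```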
Loss analysis of 4200 random samples. (a)–(c) MAE loss as a function of array dimensions (1 ≤ m, n ≤ 10) in 3000 samples. (a) Hotspots only. (b) TECs only. (c) Hotspots + TECs. (d) MAE loss as a function of hotspot and TEC counts (1 ≤ NHS, NTEC ≤ 10) in a 6 × 6 TEC array in 1200 samples. (e) Error of power consumption between ground-truth and predicted values for all samples.
Summary of MSE loss.
Metric | Training loss (70% data) | Validation loss (10% data) | Test loss (20% data) |
---|---|---|---|
MSE (temperature) | 0.604 | 0.936 | 0.951 |
MSE (power) | 0.258 | 0.447 | 0.443 |
The predictions of a 1 × 1 TEC array (i.e., a single TEC) under hotspot-only and TEC-only scenarios are illustrated in Fig. 6. For temperature predictions, the MAE loss can be simply interpreted as the local error in a single TEC. The results show that the CNN model not only captures the proportional relationship between temperature and hotspot intensity but also successfully predicts the parabolic TEC cooling as a function of input voltage50 due to the coupled Joule and Peltier effects.
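The parabolic trend can be made concrete with a lumped toy model in which Peltier cooling grows linearly with the drive level and Joule heating grows quadratically. The sketch below is not the FEM or CNN model of this study; the coefficients and function names are hypothetical and serve only to show why an optimal drive level exists.

```python
def net_temperature_change(v, peltier_coeff=20.0, joule_coeff=2.5):
    """Toy model: dT(v) = -a*v + b*v**2, where the linear term stands for
    Peltier cooling and the quadratic term for Joule heating.
    Both coefficients are made-up illustrative values."""
    return -peltier_coeff * v + joule_coeff * v**2

def optimal_drive(peltier_coeff=20.0, joule_coeff=2.5):
    """Vertex of the parabola: v* = a / (2*b)."""
    return peltier_coeff / (2.0 * joule_coeff)

v_opt = optimal_drive()
for v in range(0, 9):
    print(v, net_temperature_change(v))
print("optimal drive level:", v_opt)
```

Beyond the vertex, additional drive produces more Joule heat than Peltier cooling, so the net temperature reduction shrinks again.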
Predictions of 1 × 1 array. (a) Temperature prediction for hotspot only. (b) Temperature prediction for TEC only. (c) Power consumption prediction for TEC only.
To demonstrate the multi-hotspot scenarios, Fig. 7 illustrates four prediction examples of a 6 × 6 TEC array with the following: (a) random hotspots, (b) random hotspots + TEC cooling, (c) clustered hotspots, and (d) clustered hotspots + TEC cooling. In the first two scenarios, nine arbitrary hotspots are assigned with random intensities to represent a system incorporating different modules that experience various local heating conditions. In the last two scenarios, on the other hand, nine assigned hotspots are clustered within a 3 × 3 region with equal intensity to mimic a system consisting of similar components that undergo simultaneous workloads. It is observed that the orderly and clustered scenarios exhibit slightly higher MAE loss compared to the scattered and random scenarios due to the lower likelihood of generating well-organized data during data generation. Furthermore, for Figs. 7(b) and 7(d), larger MAE loss is identified owing to the introduction of TEC cooling mechanism. Nevertheless, the key features of the TEC array can be safely captured: First of all, significant lateral heat redistribution can be observed from the predictions, where higher temperature occurs near the TEC-assigned regions compared to their hotspots-only counterparts. Second, the clustered TECs are predicted to have poorer effectiveness compared to the isolated TECs with the same intensity. This is because the adjacent TECs tend to generate active heat flow against each other, resulting in ineffective cooling. Lastly, local TEC cooling can be influenced by its corresponding hotspot conditions. A TEC is more likely to provide greater temperature reduction when its local hotspot has higher intensity. In summary, local TEC cooling will impact and be impacted by the surrounding TECs and hotspots. Therefore, achieving a global optimal solution is almost impossible by simply considering the local optimal TEC control.
Case studies of a 6 × 6 array for temperature predictions (unit: °C). (a) Arbitrary hotspots without TECs. (b) Arbitrary hotspots with TECs (not optimal). (c) Clustered hotspots without TECs. (d) Clustered hotspots with TECs (not optimal).
IV. TEC OPTIMIZATION ALGORITHM
Due to the complex dependency between the TEC array and multiple hotspots, using traditional FEM-based techniques to enumerate all possible solutions is difficult or even impossible. However, with an efficient optimization algorithm based on the machine learning model, a real-time global optimal solution becomes feasible. In this study, we set our target to find the global optimal temperature (i.e., the smallest peak temperature) based on the existing hotspot conditions, while other targets, such as finding the minimum TEC power or TEC count that achieves an acceptable temperature, are also possible. Figure 8 shows the flow chart of the backtracking-based52,53 TEC decision-making algorithm. This algorithm can compute the lowest peak temperature across the multi-hotspot system given the number of available TEC intensities (i.e., optimization level, K). The established CNN model serves as a function to efficiently evaluate the current status. To improve efficiency and reduce unnecessary iterations, two assumptions are made: first, the highest-temperature grid has the highest priority for TEC assignment; second, the next TEC assignment must lead to a lower peak temperature than the current one. Only when these two assumptions hold will the algorithm look for a deeper solution based on the existing ones. Figure 9 demonstrates three cases of the 9 × 9 TEC array control using the developed algorithm: (a) random sparse hotspots, (b) random dense hotspots, and (c) clustered hotspots. The peak temperatures of the three samples (i.e., 348, 362, and 349 °C, respectively) experience substantial reduction (i.e., drops of 177, 172, and 176 °C, respectively) after the single-level optimal TEC control. Here, we define cooling effectiveness = (Tpeak,0 − Tpeak,opt)/(Tpeak,0 − 30 °C), where Tpeak,0 and Tpeak,opt are the original and optimal peak temperatures, respectively.
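As a worked example of this definition, the following Python sketch evaluates the cooling effectiveness for the three Fig. 9 samples from the rounded peak temperatures and reductions quoted above. The function name is hypothetical, and the 30 °C reference is taken directly from the definition. Because the inputs are rounded, the results should agree with the reported percentages only to within about one percentage point.

```python
REFERENCE_C = 30.0  # reference temperature used in the definition

def cooling_effectiveness(t_peak_0, t_peak_opt, t_ref=REFERENCE_C):
    """(Tpeak,0 - Tpeak,opt) / (Tpeak,0 - t_ref), returned as a fraction."""
    return (t_peak_0 - t_peak_opt) / (t_peak_0 - t_ref)

# (original peak temperature, peak temperature reduction), both in deg C
fig9_samples = [(348.0, 177.0), (362.0, 172.0), (349.0, 176.0)]

results = []
for t0, drop in fig9_samples:
    eff = cooling_effectiveness(t0, t0 - drop)
    results.append(eff)
    print(f"{t0:.0f} C -> {t0 - drop:.0f} C, effectiveness {100 * eff:.1f}%")
```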
At this point, the cooling effectiveness values of the three samples are 55%, 52%, and 55%, respectively. The total iteration counts (and times) are 26 (936 ms), 44 (1584 ms), and 47 (1692 ms), respectively. A greater K can lead to higher temperature reduction and cooling effectiveness at the expense of more computational cost. Additionally, all samples show a trend where the hotspot moves toward the center as the optimization process evolves. This is because TEC cooling inherently drives the system toward a more uniform temperature field, which manifests as the formation of a centralized hotspot with a mitigated temperature gradient. It is worth noting that only small and moderate TEC intensities (i.e., 1–5) are found in the optimal TEC assignments, while large TEC intensities (i.e., 6–8) may either impose too much penalty (i.e., temperature rise) on neighboring units or generate too much local Joule heat, and are therefore rejected by the optimization algorithm. This again demonstrates that a local optimal TEC control may not suffice for the global optimal temperature. Interestingly, for clustered hotspots, optimal TEC assignments follow a staggered “checkerboard” pattern. This observation motivates a novel strategy for TEC array placement against uniform heat flux.
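The backtracking search with the two pruning assumptions can be sketched as follows. This is a minimal illustration, not the authors' implementation: `toy_predict` is a hypothetical analytic surrogate standing in for the trained CNN, with made-up coefficients, and the choice of the hottest cell among those without a TEC is one possible reading of the first assumption. The counter `calls` plays the role of the iteration count, i.e., the number of temperature predictions.

```python
import numpy as np

def neighbor_sum(a):
    """Sum of the four edge-adjacent neighbors of every cell (zero outside)."""
    p = np.pad(a, 1)
    return p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:]

def toy_predict(hotspots, tecs, ambient=30.0):
    """Hypothetical surrogate for the CNN: temperature map in deg C."""
    heating = 35.0 * hotspots + 5.0 * neighbor_sum(hotspots)
    cooling = 20.0 * tecs - 2.5 * tecs**2   # Peltier (linear) minus Joule (quadratic)
    penalty = 3.0 * neighbor_sum(tecs)      # heat pushed to adjacent cells
    return ambient + heating - cooling + penalty

def optimize_tecs(hotspots, predict, levels):
    """Backtracking search for the TEC assignment with the lowest peak
    temperature. Returns (assignment, peak temperature, prediction count)."""
    hotspots = np.asarray(hotspots, dtype=float)
    best = {"tecs": np.zeros_like(hotspots), "calls": 0}
    temp0 = predict(hotspots, best["tecs"])  # baseline, not counted
    best["peak"] = float(temp0.max())

    def search(tecs, temp, peak):
        free = tecs == 0
        if not free.any():
            return
        # Assumption 1: the hottest cell without a TEC is assigned next.
        cell = np.unravel_index(np.argmax(np.where(free, temp, -np.inf)), temp.shape)
        for level in levels:
            trial = tecs.copy()
            trial[cell] = level
            trial_temp = predict(hotspots, trial)
            best["calls"] += 1
            trial_peak = float(trial_temp.max())
            # Assumption 2: go deeper only if the peak temperature drops.
            if trial_peak < peak:
                if trial_peak < best["peak"]:
                    best["peak"], best["tecs"] = trial_peak, trial
                search(trial, trial_temp, trial_peak)

    search(best["tecs"].copy(), temp0, best["peak"])
    return best["tecs"], best["peak"], best["calls"]

hotspots = np.array([[0, 8, 0], [0, 0, 0], [3, 0, 0]])
tecs, peak, calls = optimize_tecs(hotspots, toy_predict, levels=(2, 4))
print(tecs)
print(peak, calls)
```

In this toy case the search stops early because cooling any remaining cell no longer lowers the peak, which is the pruning behavior that keeps the iteration count small at low optimization levels.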
Case studies of a 9 × 9 array for backtracking-based decision-making (unit: °C). The pink matrix represents the real-time hotspot conditions, the blue matrix represents the optimal TEC assignment based on the optimization level, and the red-green matrix denotes the corresponding global optimal temperature map. Three scenarios are discussed: (a) random sparse hotspots, (b) random dense hotspots, and (c) clustered hotspots.
Figure 10 evaluates the as-achieved cooling performance within 1800 random samples using the machine learning-assisted TEC optimization algorithm. A total of 600 samples are generated for each array size of 6 × 6, 8 × 8, and 10 × 10. Among these samples, there are 30 samples for every NHS ranging from 1 to 20. Here, the maximum optimization level is set at six, which allows up to six discrete intensities for the assigned TEC voltage. As a result, the average peak temperature reductions for the six levels are {206, 222, 226, 226, 226} °C, the average cooling effectiveness values are {52.3%, 56.5%, 57.4%, 57.6%, 57.6%, 57.6%}, and the corresponding power consumption values are {24.4, 29.0, 29.4, 29.7, 29.8, 29.8} mW, respectively. For 0 < K ≤ 3, an increase in K provides a greater peak temperature reduction and higher cooling effectiveness at the expense of increased power consumption. However, for 3 < K ≤ 6, the cooling reaches a plateau. In general, larger arrays and smaller hotspot counts can result in more significant cooling due to more available TECs and larger space for heat redistribution.
As-achieved cooling performance and power consumption of 1800 random samples using the machine learning-assisted TEC optimization algorithm. (a) and (b) Peak temperature reduction as a function of the optimization level. (c) and (d) The corresponding cooling effectiveness. (e) and (f) The corresponding power consumption. The first column is varied by array dimensions and the second is by hotspot counts.
Figure 11 demonstrates the efficiency analysis based on the aforementioned 1800 samples using the machine learning-assisted TEC optimization algorithm. Here, the iteration count indicates the total number of predictions required to perform a single optimization, which reflects the computational cost. From level one to six, the average iteration counts are {36, 344, 1400, 2951, 3906, 4105}, respectively. The iteration count increases with the increased optimization level and becomes excessively large when NHS is large. Given the decreasing margin of cooling improvements, it is highly recommended to apply the TEC optimization algorithm with K ≤ 3 in order to achieve balanced computational cost and TEC cooling performance.
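The diminishing return behind this recommendation can be checked with simple arithmetic on the averages reported in the text. The Python sketch below, with a hypothetical helper name, lists the gain in average cooling effectiveness and the extra predictions required for each additional optimization level.

```python
# Average values per optimization level K = 1..6, as quoted in the text
effectiveness_pct = [52.3, 56.5, 57.4, 57.6, 57.6, 57.6]
iteration_counts = [36, 344, 1400, 2951, 3906, 4105]

def marginal_gains(effectiveness, iterations):
    """For each step K-1 -> K, return (K, gain in percentage points,
    additional predictions needed)."""
    rows = []
    for k in range(1, len(effectiveness)):
        gain = round(effectiveness[k] - effectiveness[k - 1], 1)
        extra = iterations[k] - iterations[k - 1]
        rows.append((k + 1, gain, extra))
    return rows

for level, gain, extra in marginal_gains(effectiveness_pct, iteration_counts):
    print(f"K={level}: +{gain} points for {extra} additional predictions")
```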
Efficiency analysis of 1800 random samples using the machine learning-assisted TEC optimization algorithm. (a) Iteration statistics of various array dimensions. (b) Iteration statistics of various hotspot counts.
To further investigate the efficiency of the machine learning-based TEC optimization algorithm, we record the single prediction time in Table II for both the FEM simulation and the CNN model for the 3000 samples mentioned in Fig. 5. Following this, in Table III, we summarize the running time for the optimization algorithm using the 1800 samples mentioned in Figs. 10 and 11. All FEM simulations are performed using COMSOL 6.0 with CPU computation on an AMD Ryzen 9 3950X processor (16-core, 3.5 GHz) and 128 GB memory, with a maximum of 1 582 714 degrees of freedom. The CNN model predictions are computed using GPU acceleration on an NVIDIA GeForce RTX 2080Ti (11 GB) with a total of 124 157 612 parameters. Based on the statistical results, the average FEM simulation time for a single prediction is found to be 45 s. Larger array sizes generally result in a longer computational time due to the increased degrees of freedom. Conversely, the CNN prediction time is nearly constant across the input variables, with an average of only 42 ms. A speed increase of over three orders of magnitude is found when using the CNN model to conduct a single prediction compared to the traditional FEM methods. Furthermore, with the acceleration of the CNN model, the single-level, double-level, and triple-level TEC optimization can be carried out within an average time of 1.5, 14.5, and 58.8 s, respectively, whereas the same amount of FEM computation would take about 26.8 min, 4.3 h, and 17.4 h, respectively. The significant increases in speed pave the way for on-demand thermal management using realistic TEC systems. Future research will focus on the practical integration of TEC arrays into complex SoC systems, exploring ways to leverage the machine learning-assisted TEC optimization algorithm to ensure efficient and reliable operation.
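The timing figures quoted in this paragraph can be cross-checked from the tabulated averages. The Python sketch below uses the sample counts and average simulation times of Table II and the average iteration counts given in the text; the variable and function names are hypothetical. Since the tabulated values are rounded, the FEM-equivalent times come out close to, but not exactly at, the quoted 26.8 min, 4.3 h, and 17.4 h.

```python
# (number of test samples, average FEM simulation time in s) per size bin, Table II
table_ii = [(1530, 20.0), (900, 53.0), (390, 89.0), (180, 123.0)]
cnn_time_s = 0.042  # average CNN prediction time (42 ms)

def weighted_mean(rows):
    """Sample-weighted mean of the per-bin average times."""
    total = sum(count for count, _ in rows)
    return sum(count * value for count, value in rows) / total

fem_avg_s = weighted_mean(table_ii)
speedup = fem_avg_s / cnn_time_s

def fem_equivalent_seconds(iterations):
    """Time the same number of predictions would take with FEM."""
    return iterations * fem_avg_s

print(f"average FEM time: {fem_avg_s:.2f} s, speedup: {speedup:.0f}x")
for level, iterations in [(1, 36), (2, 344), (3, 1400)]:
    seconds = fem_equivalent_seconds(iterations)
    print(f"K={level}: {seconds / 60:.1f} min ({seconds / 3600:.2f} h) with FEM")
```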
Summary of single prediction time between simulation and the CNN model.
Array size m × n (cells) | Test samples | Maximum simulation time (s) | Minimum simulation time (s) | Average simulation time (s) | Average CNN prediction time (ms) |
---|---|---|---|---|---|
[1,25] | 1530 | 39 | 4 | 20 | 42 |
[26,50] | 900 | 76 | 29 | 53 | 43 |
[51,75] | 390 | 108 | 69 | 89 | 42 |
[76,100] | 180 | 149 | 102 | 123 | 42 |
Summary of running time for machine learning-assisted TEC optimization.
Optimization level | Test samples | Maximum time (iterations) | Minimum time (iterations) | Average time (iterations) |
---|---|---|---|---|
1 | 1800 | 3.3 s (78) | 0.3 s (8) | 1.5 s (36) |
2 | 1800 | 51 s (1212) | 0.4 s (9) | 14 s (344) |
3 | 1800 | 282 s (6702) | 0.4 s (9) | 59 s (1440) |
4 | 1800 | 815 s (19 403) | 0.4 s (9) | 124 s (2951) |
5 | 1800 | 1356 s (32 286) | 0.4 s (9) | 164 s (3906) |
6 | 1800 | 1545 s (36 780) | 0.4 s (9) | 172 s (4105) |
V. CONCLUSIONS
In this study, we present a novel machine learning-assisted TEC optimization algorithm aimed at achieving global optimal temperature control for on-demand multi-hotspot thermal management in microelectronic systems. Our findings demonstrate the ability of the machine learning-assisted algorithm to dynamically adapt to the evolving thermal landscape of microelectronic devices, efficiently offering optimal TEC control for managing the spatial and temporal variations of hotspots. The algorithm not only mitigates the computational burdens associated with the traditional FEM-based optimization techniques but also heralds a significant leap toward achieving the on-demand thermal management imperative for the sustainability and performance of advanced semiconductor devices.
AUTHOR DECLARATIONS
Conflict of Interest
The authors have no conflicts to disclose.
Author Contributions
Jiajian Luo: Conceptualization (lead); Data curation (lead); Formal analysis (lead); Funding acquisition (lead); Investigation (lead); Methodology (lead); Project administration (lead); Resources (lead); Software (lead); Validation (lead); Visualization (lead); Writing – original draft (lead). Jaeho Lee: Supervision (lead); Writing – review & editing (lead).
DATA AVAILABILITY
The data that support the findings of this study are available from the corresponding author upon reasonable request.