A Mutual Information-Based Assessment of Reverse Engineering on Rewards of Reinforcement Learning

Rewards are critical hyperparameters in reinforcement learning (RL), since in most cases different reward values lead to greatly different performance. Due to their commercial value, RL rewards have become the target of reverse engineering by the inverse reinforcement learning (IRL) algorithm family. Existing efforts typically utilize two metrics to measure IRL performance: the expected value difference (EVD) and the mean reward loss (MRL). Unfortunately, in some cases, EVD and MRL give completely opposite results, because MRL focuses on whole state-space rewards while EVD only considers partly sampled rewards. This situation naturally gives rise to a fundamental question: are the current metrics and assessments sufficient and accurate for more general use? Thus, in this article, we propose a novel IRL assessment based on a metric called the normalized mutual information of reward clusters (C-NMI); we aim to fill this research gap by considering a middle-granularity state space between the entire state space and the specific sampling space. We utilize the agglomerative nesting algorithm (AGNES) to control dynamical C-NMI computing via a fourth-order tensor model with injected manipulated trajectories. With such a model, we can uniformly capture different-dimension values of MRL, EVD, and C-NMI, and perform more comprehensive and accurate assessments and analyses. Extensive experiments on several mainstream IRL algorithms in the object world benchmark reveal that the assessing accuracy of our method increases by 110.13% and 116.59% when compared with EVD and MRL, respectively. Meanwhile, C-NMI is more robust than EVD and MRL under different demonstrations.


I. INTRODUCTION
Reinforcement learning (RL), together with unsupervised and supervised learning, constitutes the complete framework of machine learning in the AI field. Compared with the expert supervision utilized in supervised learning, RL relies on step-by-step reward feedback from the interaction between the autonomous agent and the environment. An RL algorithm can learn the corresponding optimal policy from a large number of trajectories via a series of action attempts by the agent during training. Because of this unique characteristic, RL can be regarded as autonomous learning, and it greatly drives AI development toward grounded applications, including the Boston Dynamics robots [1] and the AlphaGo Zero [2] software agent.
In RL, different rewards usually lead to significant differences in performance. As the critical hyperparameter in RL, the reward refers to the feedback given by the environment according to the agent's action selection at each state. The reward guides the agent to learn an optimal policy, which maximizes the sum of expected rewards. Taking simple car driving as an example, we may design two rewards: +1 means taking an action into a state that keeps a predefined safe distance to the car in front, while -1 means taking an action into a state without a safe distance. However, such hand-crafted rewards do not work well in many cases, such as autonomous driving and unmanned aerial vehicles. From this, we can see that well-designed rewards have high commercial value, and need to be carefully predesigned according to expert knowledge.
Unfortunately, RL rewards have become the target of reverse engineering through inverse reinforcement learning (IRL) [17]. IRL algorithms aim to learn rewards by imitating historical expert trajectories; IRL was originally designed to make the manual design of rewards easier. More specifically, given a certain number of expert trajectories, IRL can obtain approximate rewards via model learning. This makes IRL a double-edged sword: instead of providing help for RL training, IRL can also be used as a kind of reverse engineering to obtain approximated rewards. In recent years, with the trend of open online AI, the reverse engineering vulnerability exposed by IRL has become a serious challenge. For example, Facebook [8] has opened up a large number of expert trajectories of several RL models on GitHub to encourage researchers to retrain them and eventually upload their new models back to Facebook. Hence, it is necessary to explore accurate assessments of IRL performance, and further expose the threat of reward reverse engineering.
IRL methods are mainly classified into two categories: linear methods, including MMP [34], MWAL [35], MaxEnt [33], and AN [19]; and nonlinear kernel function-based ones, including GPIRL [32], LEARCH [18], and FIRL [20]. For assessing IRL performance, there are generally two metrics: the expected value difference (EVD) and the mean reward loss (MRL). EVD pays attention to the expected value of several sampled trajectories, while MRL focuses on the entire state space. However, as the motivating example in Section III shows, EVD and MRL can give completely opposite assessing results. In that case, which assessment metric should be trusted? This is the fundamental question that inspires us to rethink whether the current metrics are sufficient for accurate evaluation. Thus, it is necessary to develop a novel metric that fills the gap with a middle granularity between the entire state space and the specific sampling space.
In this work, we propose a novel assessment of reverse engineering on RL rewards, built around a metric called the normalized mutual information of reward clusters (C-NMI). We employ the agglomerative nesting algorithm (AGNES) [31] for dynamical C-NMI computing, quantifying the similarity of the learned reward clusters to the reward ground truth. We build a fourth-order tensor model embedded with manipulated trajectories, which are formed from both suboptimal [22] and inverted [21] trajectories. Based on this fourth-order tensor model, we can uniformly capture and store the different-dimension metric values of MRL, EVD, and C-NMI. We can also perform a comprehensive assessment by computing a lower 1.5 × interquartile range (IQR) [12] whisker for MRL and EVD, and an upper 1.5 × IQR whisker for C-NMI.
In our experiments, we target object world (OW) as the benchmark, and implement seven mainstream IRL algorithms: MaxEnt, MMP, GPIRL, MWAL, FIRL, AN, and LEARCH. For setting the ground truth, the consistency between MRL and EVD is analyzed. On our datasets, statistical analysis shows that C-NMI has better assessing accuracy than MRL and EVD. In detail, C-NMI-based assessment achieves the highest accuracy of 0.89, and is robust under different cases as well.
We summarize our contributions as follows:

II. RELATED WORK
Model Stealing: In earlier studies [3], researchers found attacks that can steal a machine learning model almost perfectly, while our work makes the first attempt to assess the reverse engineering threats against open RL platforms. Wang et al. [9] focused on revealing hyperparameter stealing, in which the stolen hyperparameter is highly important for model performance. Similarly, for an RL model, the reward function is the critical hyperparameter, which must be designed before training begins. However, compared with the hyperparameters of traditional machine learning models, the number of rewards is unknown, which is a challenge for IRL reverse engineering. Thus, we pay attention both to the unknown number of hyperparameters and to the specific values of the reward function in an RL model.
IRL Assessment: Many studies [32], [19] have introduced measures with various names for IRL performance, including the EVD, MRL, approximation error, percent misprediction, feature expectation distance, policy loss, and learning score. In earlier works, the approximation error was usually utilized to assess the similarity between the original reward function and the reward function approximated by IRL. Currently, EVD and MRL are used to evaluate the reward function's similarity; in our work, we also employ them as highly important measures. Some other measures, such as percent misprediction and feature expectation distance, approach the problem from a very different angle (the state-action value function) to indirectly reflect the reward function's similarity, that is, by comparing the newly learned policy's actions with an expert's actions along trajectories. Furthermore, an RL algorithm's learning score is obtained by using an approximated reward to actually interact with the environment. However, from the reverse engineering perspective, the percent misprediction, feature expectation distance, and learning score are indirect and unsuitable. Therefore, we introduce mutual information to assess the discreteness similarity between the original rewards and the rewards approximated by IRL. Li et al. [10] proposed a mutual information-based method for improving IRL, but they only used mutual information for feature ranking based on the relevant evaluation results during IRL training; this is totally different from our use in assessing discrete reward clusters.

III. MOTIVATING EXAMPLE
We choose OW, a popular game that has appeared in many IRL-related experiments, and two different IRL algorithms, LEARCH and MWAL, to analyze the conflicting IRL results under the metrics MRL and EVD. Fig. 1 presents two graphs for comparison. The left graph shows the ground truth of the true discrete rewards. There are only three types of rewards: +1 for an agent reaching a white square, -1 for a black square, and 0 for a gray square. The IRL results are shown in the right part of each graph. For comparison, we use a red dotted line to surround each cluster of +1 rewards, and a green dotted line for -1 reward clusters. Moreover, for the right graphs in Fig. 1(a) and (b), we utilize dotted lines of different colors to represent the clustering results given by the IRL algorithms LEARCH and MWAL, respectively.
We can see that the MRL metric indicates that MWAL's performance is better (MRL = 4.5314), while the EVD assessment shows that LEARCH gives the better result (EVD = 16.6681). Obviously, under such a scenario, MRL and EVD give completely opposite assessing results; which metric should we trust to determine the effectiveness of an IRL algorithm? This is the fundamental question that inspires us to rethink whether the current metrics are sufficient for accurate evaluation. One possible reason is that MRL focuses on the entire state space, while EVD pays attention to the expected value of several sampled trajectories. Thus, it is necessary to develop a novel metric with a middle granularity between the entire state space and the specific sampling space, so as to give an accurate evaluation result. In fact, from the clustering point of view, we can see that the result of LEARCH is much more similar to the original ground truth, which suggests that reward cluster-related features should be considered in an accurate assessment.

IV. BACKGROUND

A. Reinforcement Learning (RL)
The five-tuple M = {S, A, P, γ, r} is generally utilized to represent a Markov decision process, in which S denotes the state space, A represents the action space, P(s'|s, a) denotes the transition probability of transferring from state s to s' (s, s' ∈ S) with action a ∈ A, γ ∈ [0, 1] denotes the discount factor, and the fifth element r denotes the reward function. The purpose of RL is to obtain a policy whose input is the observation s and whose output is the action selection probability at each state. At time t, the selected action is a_t, and the policy can be computed as the probability p(a_t|s_t). The optimal policy π* can be denoted as π* = argmax_π E(Σ_{t=0}^{∞} γ^t r(s_t) | π).
From the beginning to the end of a task, the agent generates a trajectory τ = {s_1, a_1, . . ., s_T, a_T}. The state-action value function can be calculated as Q^π(s_t, a_t) = E(r_{t+1} + γ r_{t+2} + γ² r_{t+3} + . . . | s_t, a_t), which can be simplified as Q^π(s_t, a_t) = E(Σ_{t'=t}^{T} γ^{t'−t} r_{t'+1} | s_t, a_t). According to the Bellman equation (dynamic programming equation), the state value function can be represented as V^π(s_t) = Σ_{a_t} π(a_t|s_t) Q^π(s_t, a_t); thus, the state-action value function can also be represented as Q^π(s_t, a_t) = E(r_{t+1} + γ V^π(s_{t+1}) | s_t, a_t). In order to obtain a good policy, the expected return R̄_θ = Σ_τ R(τ) p_θ(τ) should be maximized through a gradient ascent-based update θ ← θ + η∇_θ R̄_θ.
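As a concrete illustration of these definitions, the following minimal sketch evaluates V^π and Q^π for a tabular MDP by iterating the Bellman equation. This is an illustrative toy implementation, not code from this article; the transition tensor layout (state, action, next state) is our assumption.

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma=0.9, tol=1e-8):
    """Iterate the Bellman equation V(s) = sum_a pi(a|s) Q(s, a),
    with Q(s, a) = R(s) + gamma * sum_{s'} P(s'|s, a) V(s')."""
    n_states = P.shape[0]
    V = np.zeros(n_states)
    while True:
        # Q(s, a) = immediate reward + discounted expected next value
        Q = R[:, None] + gamma * np.einsum("san,n->sa", P, V)
        V_new = (pi * Q).sum(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q
        V = V_new
```

For instance, in a two-state chain that always moves to an absorbing state with reward 1 under γ = 0.5, the fixed point gives V = (1, 2).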

B. Inverse Reinforcement Learning (IRL)
IRL can be described as M\r together with the expert demonstrations D* = {τ_1, τ_2, . . .}, where each τ_i = {s_{i,1}, a_{i,1}, . . ., s_{i,T}, a_{i,T}}. The purpose of IRL is to find a reward function r under which the optimal policy π* gives the maximum probability to the given expert trajectories. The probability of choosing action a at state s can be represented as P(a|s) ∝ exp(Q_r(s, a)), in which Q_r(s, a) = E[r + V_r(s')]. Since the state value function V_r(s) can be calculated as Σ_a p(a|s) Q_r(s, a), P(a|s) can be represented as exp(Q_r(s, a) − V_r(s)). Under reward function r, the log likelihood of the given expert trajectories can be represented as log P(D*|r) = Σ_{τ_i ∈ D*} Σ_{(s,a) ∈ τ_i} (Q_r(s, a) − V_r(s)), (1) and maximizing Equation (1) directly yields the reward function r.
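To make the softmax form concrete: with V_r taken as a log-sum-exp over actions (the usual MaxEnt-style normalization, which we assume here so that the probabilities sum to 1), P(a|s) and the demonstration log-likelihood can be sketched as follows. The Q-table and function names are illustrative.

```python
import numpy as np

def action_probs(Q):
    """P(a|s) = exp(Q(s,a) - V(s)), with V(s) = log sum_a exp(Q(s,a))."""
    V = np.log(np.exp(Q).sum(axis=1, keepdims=True))  # soft state value
    return np.exp(Q - V)

def demo_log_likelihood(Q, demos):
    """Sum of log P(a|s) over every (state, action) pair in the demos."""
    logP = Q - np.log(np.exp(Q).sum(axis=1, keepdims=True))
    return sum(logP[s, a] for traj in demos for s, a in traj)
```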

C. Classical Metrics: MRL and EVD
MRL measures the average difference between the learned rewards and the true rewards, and can be calculated as MRL = (1/|S|) Σ_{s∈S} |r_e(s) − r_u(s)|, in which r_e denotes the true reward and r_u represents the learned reward. EVD measures the difference between the expected cumulative reward under the ground truth rewards and under the learned rewards, which can be represented as EVD = |E(Σ_{t=0}^{T} γ^t r_e(τ*)) − E(Σ_{t=0}^{T} γ^t r_e(τ_u))|, averaged over ζ selected trajectories, where π* denotes the optimal policy under the true rewards r_e, π_u is derived from the IRL's inverse rewards, ζ is the number of selected trajectories, and τ*/τ_u denotes a T-step trajectory generated by policy π*/π_u.
MRL and EVD both range from 0 to +∞; the smaller the value, the better the IRL performance.
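The two classical metrics can be sketched directly from the definitions above. This is an illustrative implementation, not the article's code; the exact averaging (e.g., how the ζ trajectories are drawn) may differ.

```python
import numpy as np

def mrl(r_true, r_learned):
    """Mean reward loss: mean absolute reward gap over the whole state space."""
    return float(np.mean(np.abs(np.asarray(r_true) - np.asarray(r_learned))))

def evd(r_true, trajs_opt, trajs_learned, gamma=0.9):
    """Expected value difference: gap between the mean discounted true-reward
    returns of trajectories from pi* and from the recovered policy pi_u."""
    def mean_return(trajs):
        # each trajectory is a sequence of visited state indices
        return np.mean([sum(gamma**t * r_true[s] for t, s in enumerate(tr))
                        for tr in trajs])
    return abs(mean_return(trajs_opt) - mean_return(trajs_learned))
```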

V. MUTUAL INFORMATION-BASED IRL ASSESSMENT
In this section, as shown in Fig. 2, we present our mutual information-based assessment in detail. We define and compute the reward clustering-based metric, and present our comprehensive fourth-order tensor model embedded with manipulated trajectories.

A. Generating Reward Clusters
Given the reward set r, we utilize AGNES to obtain reward-close clusters. As shown in Algorithm 1, the procedure begins with |r| singleton clusters. In line 5, we calculate the minimum distance between clusters C_α and C_β, which can be calculated as d_min(C_α, C_β) = min_{r_i ∈ C_α, r_j ∈ C_β} d(r_i, r_j), where C_α and C_β are disjoint clusters (C_α ∩ C_β = ∅), and d(r_i, r_j) is the Euclidean distance [14] between two rewards r_i ∈ C_α and r_j ∈ C_β. In line 6, we merge the two individual clusters whose distance is minimal. We then repeat the operations from line 4 to line 8 until the number of clusters reaches |k|.
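Algorithm 1's bottom-up merging can be sketched in a few lines. This toy single-linkage version clusters scalar rewards; the names and the cubic-time loop are illustrative, not the article's implementation.

```python
def agnes(rewards, k):
    """AGNES-style agglomeration: start from |r| singleton clusters and
    repeatedly merge the pair with minimum single-linkage distance
    d_min(Ca, Cb) = min |r_i - r_j| until only k clusters remain."""
    clusters = [[i] for i in range(len(rewards))]
    while len(clusters) > k:
        best, pair = float("inf"), None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(abs(rewards[i] - rewards[j])
                        for i in clusters[a] for j in clusters[b])
                if d < best:
                    best, pair = d, (a, b)
        a, b = pair
        clusters[a] += clusters.pop(b)   # line 6: merge the closest pair
    return clusters
```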

B. C-NMI
Given the ground truth reward set r_e, if the number of clusters is k_x ∈ k, then the clusters are C^e_x = {C^e_{x,1}, C^e_{x,2}, . . ., C^e_{x,k_x}}, with C^e_{x,i} ∩ C^e_{x,j} = ∅ (i, j ∈ {1, 2, . . ., k_x}) and ∪_{i=1}^{k_x} C^e_{x,i} = r_e. In order to compute the mutual information of reward clusters, we first conduct state space sampling so that the cluster comparison is made over the same space; Fig. 3 illustrates this sampling. To rescale C-MI, we calculate the normalized mutual information of reward clusters (C-NMI) between C^u_{x,y} and C^e_{x,y}, which can be represented as C-NMI(C^u_{x,y}, C^e_{x,y}) = 2 · C-MI(C^u_{x,y}; C^e_{x,y}) / (H(C^u_{x,y}) + H(C^e_{x,y})).
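Given two cluster labelings over the same sampled states, the C-MI/C-NMI computation reduces to mutual information over the joint label distribution. Below is a minimal sketch; the label encoding and the 2I/(H+H) normalization are our reading of the definitions, and the function name is illustrative.

```python
import math
from collections import Counter

def c_nmi(labels_u, labels_e):
    """Normalized mutual information of two labelings of the same states:
    2 * I(U; E) / (H(U) + H(E)), ranging over [0, 1]."""
    n = len(labels_u)
    pu, pe = Counter(labels_u), Counter(labels_e)
    joint = Counter(zip(labels_u, labels_e))
    # I(U; E) = sum_{u,e} p(u,e) * log(p(u,e) / (p(u) p(e)))
    mi = sum((c / n) * math.log((c / n) / ((pu[u] / n) * (pe[e] / n)))
             for (u, e), c in joint.items())
    entropy = lambda cnt: -sum((c / n) * math.log(c / n) for c in cnt.values())
    hu, he = entropy(pu), entropy(pe)
    return 1.0 if hu + he == 0 else 2 * mi / (hu + he)
```

Identical clusterings (up to relabeling) give 1, and independent ones give 0.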

C. Fourth-Order Tensor Model
Towards dynamic compositions of variables, including the cluster number, the top K of cluster ranking, and the percentage of manipulated trajectories, we design a fourth-order tensor model to uniformly capture and store the different-dimension metric values of MRL, EVD, and C-NMI. Hence, we can conduct a more comprehensive assessment, as well as evaluate the accuracy of our proposed metric C-NMI.
Fig. 4 depicts the overall architecture of the fourth-order tensor model. The architecture contains three layers: the fourth-order tensor layer is responsible for capturing and storing multidimensional metrics; the second layer extracts key metric values based on the IQR statistic; and the output layer produces the final assessment results.
Fourth-Order Tensor: We construct a fourth-order tensor B ∈ R^{X×Y×Z×V}, where X = |k|, Y = |o|, Z = |m|, and V = |g|. For the third order, we implement a method (see Algorithm 2) to inject suboptimal and inverted trajectories as manipulated ones. We randomly choose a subspace π̃ ⊂ π, where π̃ contains the optimal policy π*. The operation command cmd == 0 means injecting suboptimal trajectories as the manipulated ones; otherwise, inverted trajectories are injected. In lines 2 to 3, we compute the suboptimal policy π_sub and generate the corresponding suboptimal trajectories τ^sub_i. In line 5, we invert r_e and obtain the inverted reward set r_inv. In lines 6 to 7, we compute the inverted policy and generate the inverted trajectories. We then repeat the operations from line 9 to line 14 to inject τ^sub_i and τ^inv_i as manipulated trajectories until the control percentage reaches p_|p|. Finally, we output the manipulated demonstrations D^s_* and D^i_*. Thus, the third order m contains |p|-dimension features, which can be represented as (m_1, . . ., m_|p|) = (p_1, p_2, . . ., p_|p|); meanwhile, the fourth order g has three-dimension features separately representing the three metrics, (g_1, g_2, g_3) = (MRL, EVD, C-NMI). By expanding tensor B along g, we can obtain the matrix B_(g). We extract the key metric values of MRL, EVD, and C-NMI based on the IQR of the corresponding multidimensional sequence, because the IQR is a commonly used robust measure of scale [4]. The IQR of B_{x,y,z,v} can be computed as IQR(B_{x,y,z,v}) = Q^v_3 − Q^v_1, where Q^v_3 and Q^v_1 indicate the upper and lower quartiles of B_{x,y,z,v}, respectively. Thus, the upper and lower 1.5 × IQR whiskers can be represented as Q^v_3 + 1.5 × IQR(B_{x,y,z,v}) and Q^v_1 − 1.5 × IQR(B_{x,y,z,v}). These separately characterize the highest and lowest occurring values of B_{x,y,z,v}, thereby avoiding the influence of outliers.
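A toy version of the injection step: we assume here that "inverting" r_e means negating it, and that a fraction p of the demonstrations is swapped for manipulated trajectories. Both assumptions are our reading of Algorithm 2, with the sampling details simplified.

```python
def invert_rewards(r_e):
    """Inverted reward set r_inv, taken here as the negation of r_e."""
    return [-x for x in r_e]

def inject(demos, manipulated, p):
    """Replace a fraction p of the demonstrations with manipulated
    (suboptimal or inverted) trajectories."""
    n = int(round(p * len(demos)))
    return list(manipulated[:n]) + list(demos[n:])
```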
According to the definitions of MRL, EVD, and C-NMI, we take the lower 1.5 × IQR whiskers of B_{x,y,z,1} and B_{x,y,z,2} as the key metric values of MRL and EVD, respectively, and the upper 1.5 × IQR whisker of B_{x,y,z,3} as the key metric value of C-NMI.
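The whisker extraction can be sketched directly with quartiles. Interpolation conventions for quartiles vary; numpy's default linear interpolation is assumed here.

```python
import numpy as np

def tukey_whiskers(values):
    """Return (lower, upper) 1.5*IQR whiskers of a metric sequence.
    MRL/EVD use the lower whisker as the key value; C-NMI uses the upper."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr
```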

VI. EXPERIMENTS

A. Experimental Setup
Benchmark: The parameters of the experiment platform are shown in Table II. Since OW is the environment most commonly used in IRL experiments, we utilize it as our experimental benchmark. OW is a gridworld where dots of primary colors (e.g., blue and green) and secondary colors (e.g., red and cyan) are placed on the grid randomly, as shown in Fig. 1. In OW, the agent maximizes its expected discounted reward by following a policy that provides the probabilities of actions (moving up/down/left/right or staying still) at each state, with each action subject to a transition probability. Each state, i.e., grid block, is described by the shortest distances to dots within each color group. The reward is assigned such that if a block is within 1 step of a green dot and within 3 steps of a blue dot, the reward is +1; if it is within 3 steps of a blue dot only, the reward is −1; and 0 otherwise. We consider Col_1 = 2 primary colors and Col_2 = 2 secondary colors. To construct the OW benchmark, we use the common MATLAB expression round(rand(1)) to assign each grid cell a value of 0 or 1; if the value is 1, we place an object on that cell. Meanwhile, each object is randomly assigned one of the Col_1 primary or Col_2 secondary colors.
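The placement rule can be ported to Python as follows, as a sketch mirroring the round(rand(1)) rule; encoding empty cells as -1 and colors as 0..Col_1+Col_2-1 is our own convention.

```python
import random

def make_object_world(n=10, col1=2, col2=2, seed=0):
    """n x n OW grid: each cell gets an object with probability ~1/2
    (round of a uniform draw), colored with one of col1+col2 colors."""
    rng = random.Random(seed)
    grid = [[-1] * n for _ in range(n)]           # -1 marks an empty cell
    for x in range(n):
        for y in range(n):
            if round(rng.random()) == 1:          # MATLAB round(rand(1))
                grid[x][y] = rng.randrange(col1 + col2)
    return grid
```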
Datasets: According to the grid map size, we build six datasets for experimental analysis, recorded as DS = {DS_{i×i}}. For each size, we follow [32] to generate 20 grid map samples randomly, represented as DS_{i×i} = {ds^1_i, ds^2_i, . . ., ds^20_i}.
Baseline: In the field of IRL algorithm assessment, there is no existing "ground truth" that gives an expert evaluation, since different researchers use different evaluation metrics, e.g., only EVD, only MRL, or both. Thus, in order to evaluate the assessing accuracy of C-NMI, we combine MRL and EVD to design a baseline for accuracy comparison among the different metrics. By ranking the MRL and EVD values in ascending order, we obtain two ordered sets mrl and evd. Then, we divide mrl and evd into three subsets of the same size, and use tertiles_1() and tertiles_2() to represent the lower and upper tertiles [7], respectively.
Using IRL algorithm A, we define the function MRL(A) = Q^1_1 − 1.5 × IQR(B_{i,j,z,1}). If MRL(A) ≤ tertiles_1(mrl), A has good performance and is tagged with l_mrl(A) = 1; MRL(A) ≥ tertiles_2(mrl) shows that A performs poorly, so A is tagged with l_mrl(A) = −1; otherwise, l_mrl(A) = 0 indicates moderate performance. Similarly, we also have l_evd(A) = 1, −1, or 0. Finally, we define the baseline function F_bl in Table III, in which metric ∈ {mrl, evd, c-nmi, ws}, and ws represents the weighted sum of the metrics MRL and EVD, denoted as ω_1·MRL + ω_2·EVD.
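The tertile-based tagging rule can be sketched as below. The tertile index convention is illustrative; the smaller_is_better=False branch covers metrics ranked in descending order, such as C-NMI.

```python
def tag(value, ordered_values, smaller_is_better=True):
    """Return 1 (good), -1 (poor), or 0 (moderate) by comparing a key
    metric value against the lower/upper tertiles of the sorted values."""
    n = len(ordered_values)
    t1 = ordered_values[n // 3]            # lower tertile (illustrative index)
    t2 = ordered_values[(2 * n) // 3]      # upper tertile
    if smaller_is_better:                  # MRL / EVD
        good, bad = value <= t1, value >= t2
    else:                                  # C-NMI-like metrics
        good, bad = value >= t2, value <= t1
    return 1 if good else (-1 if bad else 0)
```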

B. Average Assessing Accuracy Analysis of C-NMI
In this section, to compare the average assessing accuracy of different metrics, we take MRL, EVD, C-NMI, and the weighted-sum metric ω_1·MRL + ω_2·EVD into consideration. As shown in Table V, we can make several observations: under the LEARCH algorithm, C-NMI increases the accuracy by 116.95% and 132.69% when compared with the EVD and MRL metrics, respectively. In addition, averaged over the different IRLs, the accuracy increases by 110.13% and 116.59% compared with EVD and MRL, respectively. Under the GPIRL algorithm, C-NMI gives the highest accuracy.
For the ws metric, the combination (ω_1, ω_2) = (2/5, 3/5) results in the highest accuracy (0.89) with GPIRL. Under most algorithms, C-NMI performs better than ws. Furthermore, even against the best-performing combination (2/5, 3/5), the average assessing accuracy of C-NMI improves the mean by up to 72.69%. Thus, generally speaking, C-NMI has the best performance among all of the assessing metrics.
In order to verify the above observations more intuitively, we visualize the rewards learned through the seven mainstream IRLs for a case study, targeting the same 40 × 40 grid map for comparison. Eight subgraphs are presented in Fig. 5.

C. Hyperparameters Analysis
Cluster number (k): Under different cluster numbers, the performance of the C-NMI metric is shown in Fig. 6(a). We vary the cluster number k in the range {2, 3, . . ., 10} and record the average C-NMI assessing accuracy for each k. It can be seen that a higher average assessing accuracy is obtained for a larger k, which reveals that, as the cluster number k increases, the reward division becomes finer and C-NMI evaluates the IRL algorithms better. When k ≥ 8, C-NMI reaches the optimal average assessing accuracy. Thus, 8 is a good choice for k in the C-NMI computation.
Top K of cluster ranking (o): In order to verify o's influence on the average assessing accuracy of C-NMI, we vary o in the range {2, 3, . . ., 10} and compare the seven mainstream IRLs. As shown in Fig. 6(b), o = 7 is a good choice, since the optimal average assessing accuracy is achieved for all seven IRL algorithms. As the top K of cluster ranking o increases, the performance of C-NMI increases as well, which indicates that the larger the state space sampling, the better the performance of C-NMI.

D. Robustness Comparison
To validate the robustness of the different assessment metrics, we compare the performance of MRL, EVD, C-NMI, and ω_1·MRL + ω_2·EVD under different demonstrations. The results are shown in Fig. 7, in which we use a dotted red line to surround the smallest accuracy variance among the different demonstrations. It can be seen that, for all of the algorithms, the assessing accuracy of C-NMI varies over the smallest range, indicating that C-NMI has the highest robustness among the metrics. In other words, C-NMI gives a stably high assessing accuracy for any D ∈ D. One possible reason is that when D is mixed with manipulated trajectories, IRL algorithms may produce "unbalanced" policies that learn well in some states but poorly in others. C-NMI samples multiple state spaces within the whole state space to calculate the reward clusters' similarity, which resolves the problem of the "unbalanced" policy to some extent.
In addition, we also find that the robustness of MRL and EVD is not good enough. The reason is that MRL averages reward differences over the global state space, and thus cannot handle the "unbalanced" policy in extreme situations, for instance, when the learned rewards are extremely close to the ground truth in a few states and very distant in others. Meanwhile, EVD may randomly sample initial states located in the well-learned space, and therefore give a good assessment despite the "unbalanced" policy.

VII. CONCLUSION
In this work, we pay attention to the inconsistency problem of MRL- and EVD-based IRL assessment. There are two main challenges to address: 1) how to design a novel metric that combines the advantages of both MRL and EVD, and 2) how to construct a comprehensive assessment method for accurate comparison and analysis. To address these challenges, we craft a novel IRL assessment based on the metric called the normalized mutual information of reward clusters (C-NMI). We attempt to fill the existing research gap by considering a middle-granularity state space between the entire state space and the specific sampling space. We utilize AGNES to control dynamical C-NMI computing via a fourth-order tensor model with injected manipulated trajectories. Furthermore, we carry out extensive experiments on seven mainstream IRLs, and analyze the experimental results in various aspects, including accuracy and robustness. The experimental results show that the assessing accuracy of our method increases by 110.13% and 116.59% when compared with EVD and MRL, respectively. Meanwhile, C-NMI is more robust than EVD and MRL under different demonstrations.

Fig. 1. Opposite results given by EVD and MRL under the OW benchmark: (a) LEARCH algorithm and (b) MWAL algorithm.

Fig. 2. Illustration of the mutual information-based IRL assessment. The framework contains two parts: the first is the C-NMI computing, and the second is the fourth-order tensor model.

Fig. 3. Illustration of state space sampling. For the left map, different colors denote different clusters: "blue" represents cluster C^e_{x,1}, "green" denotes cluster C^e_{x,2}, "black" is cluster C^e_{x,3}, and "red" is cluster C^e_{x,4}. We then rank C^e_x by cluster size and sample the top 3 clusters to form a new state space S', colored "gray," while the remaining states are colored "white."
Here H(C^u_{x,y}) = −Σ_{i=1}^{|C^u_{x,y}|} p(C^u_{x,y,i}) log p(C^u_{x,y,i}), H(C^e_{x,y}) = −Σ_{j=1}^{|C^e_{x,y}|} p(C^e_{x,y,j}) log p(C^e_{x,y,j}), and C-NMI(C^u_{x,y}, C^e_{x,y}) ∈ [0, 1]; C-NMI = 1 means that the two clusterings exactly coincide.

Fig. 4. Framework of the fourth-order tensor model. The framework contains three layers: the first is the fourth-order tensor layer, the second is the extraction layer, and the last is the output layer.

Fig. 5. Reward visualization for the case study of (a) ground truth, (b) MMP, (c) MWAL, (d) MaxEnt, (e) AN, (f) GPIRL, (g) LEARCH, and (h) FIRL under a 40 × 40 grid map. For each subfigure, the x-axis and y-axis denote the horizontal and vertical positions on the grid map, both ranging over {0, 1, . . ., 39}.

Fig. 6. Comparison of the average assessing accuracy of C-NMI under (a) various cluster numbers k and (b) various top K of cluster ranking o.

Here ds^j_i denotes the jth sample in DS_{i×i}, and F_acc is defined in Table IV: if l(A_D(ds^j_i)) − l_metric(A_D(ds^j_i)) = 0, then F_acc equals 1; otherwise, F_acc equals 0. The average assessing accuracy across different demonstrations is then computed as the mean of F_acc over all samples. For l_c-nmi(A), we similarly rank the C-NMI values in descending order to obtain the ordered set c-nmi, and compare C-NMI(C^e, C^A) with tertiles_{1/2}(c-nmi) for the given learned result C^A. Thus, we can tag A with l_c-nmi(A) = 1, −1, or 0; in the same way, we obtain l_ws(A). Note that the accuracy values range from 0 to 1, and larger values represent better performance.
The eight subgraphs in Fig. 5 are presented for comparison: subfigure (a) shows the ground truth rewards, and subfigures (b) to (h) show the learned rewards under the different IRL algorithms. Table VI compares the corresponding key metric values of MRL, EVD, and C-NMI with the baseline. We mark a key metric value with an underline if l_metric(A) = −1, and in bold if l_metric(A) = 1. It can be seen that MRL gives a wrong assessment of MWAL, and EVD makes
Fig. 3 illustrates the state space sampling. We first sort all of the clusters in C^e_x in descending order of cluster size |C^e_{x,i}|. According to the top K of cluster ranking o_y ∈ o, we form C^e_{x,y} = {C^e_{x,y,1}, . . ., C^e_{x,y,o_y}} and obtain the corresponding sampled state space S'. Over S', C^u_{x,y} is the clustering result for r_u; p(C^u_{x,y,i}) = |C^u_{x,y,i}|/|S'| represents the probability that a reward r belongs to C^u_{x,y,i}, and similarly p(C^e_{x,y,j}) = |C^e_{x,y,j}|/|S'|. The C-MI(C^u_{x,y}; C^e_{x,y}) ranges from 0 to +∞ and increases with higher dependence between the clusterings C^u_{x,y} and C^e_{x,y}; C-MI = 0 if and only if C^u_{x,y} and C^e_{x,y} are completely independent.

TABLE III. DEFINITION OF FUNCTION F_bl

TABLE IV. DEFINITION OF FUNCTION F_acc

In lines 2 to 3 of Algorithm 2, we compute the suboptimal policy π_sub and generate the corresponding trajectories τ^sub_i, where π_sub = argmax_{π\{π*}} E(Σ_{t=0}^{∞} γ^t r_e(s_t) | π). (9)