Python – How to calculate the mean and standard deviation of a similarity matrix

dataframe, numpy, pandas, python, similarity

I am working with CSV files and have code that calculates the similarity between documents. Post 1 provides the code and the details of the data; the data and output are as follows:

The data.csv looks as:

idx    messages
112    I have a car and it is blue
114    I have a bike and it is red
115    I don't have any car
117    I don't have any bike

The output is:

id       112    114    115    117
id
112    100.0   78.0   51.0   50.0
114     78.0  100.0   47.0   54.0
115     51.0   47.0  100.0   83.0
117     50.0   54.0   83.0  100.0

Now I would like to calculate the mean and standard deviation of the lower triangle of the similarity matrix (the matrix is symmetric, so the upper and lower triangles hold the same values), excluding the diagonal self-similarity entries (100.0).

I tried to use the pandas built-in mean and std:

df_std = df.std()
df_Mean = df.mean()

But this uses all the data in the output, including the diagonal and the upper triangle.

I would like to know if there is any way that I can calculate the mean and standard deviation the way that I mentioned.

Best Solution

Use numpy.tril with k=-1 to keep only the entries below the diagonal, then drop the zeros that remain:

import numpy as np
ltri = np.tril(df.values, -1)
ltri = ltri[np.nonzero(ltri)]

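Output of np.tril(df.values, -1), before the zeros are dropped:

array([[ 0.,  0.,  0.,  0.],
       [78.,  0.,  0.,  0.],
       [51., 47.,  0.,  0.],
       [50., 54., 83.,  0.]])

After indexing with np.nonzero, ltri is the 1-D array [78., 51., 47., 50., 54., 83.].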

And now you can do ltri.std(), ltri.mean():

ltri.std(), ltri.mean()
# (14.361406616345072, 60.5)
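
One caveat with filtering on nonzero values: a pair of documents whose similarity is exactly 0 would be dropped along with the padding zeros. A possible alternative, sketched below, selects the strictly lower triangle by position with np.tril_indices instead. The DataFrame here is rebuilt by hand from the output in the question purely for illustration, and the variable names are not from the original post. Note also that NumPy's std defaults to the population formula (ddof=0), while pandas' std defaults to the sample formula (ddof=1), so the two libraries give different numbers unless ddof is set explicitly.

```python
import numpy as np
import pandas as pd

# Similarity matrix rebuilt from the question's output (illustrative only).
ids = [112, 114, 115, 117]
df = pd.DataFrame(
    [[100.0, 78.0, 51.0, 50.0],
     [78.0, 100.0, 47.0, 54.0],
     [51.0, 47.0, 100.0, 83.0],
     [50.0, 54.0, 83.0, 100.0]],
    index=ids, columns=ids,
)

# Row/column positions of the strictly lower triangle (k=-1 skips the diagonal).
rows, cols = np.tril_indices(len(df), k=-1)

# Select by position, so genuine zero similarities would be kept.
lower = df.to_numpy()[rows, cols]

mean = lower.mean()
std_population = lower.std()        # ddof=0, NumPy's default
std_sample = lower.std(ddof=1)      # ddof=1, matches pandas' default

print(lower)
print(mean, std_population, std_sample)
```

For this data both approaches select the same six values, since none of the off-diagonal similarities is zero.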