How to compute and plot the pdf from the empirical cdf?

I have two numpy arrays, one is an array of x values and the other an array of y values and together they give me the empirical cdf. E.g.:

plt.plot(xvalues, yvalues) plt.show() 

enter image description here

I assume the data needs to be smoothed somehow in order to give a smooth pdf. I would like to plot the pdf. How can I do that? The raw data is at: http://dpaste.com/1HVK5DR .

asked Jul 12, 2019 at 19:34 21k 47 47 gold badges 150 150 silver badges 304 304 bronze badges

2 Answers 2

There are two main problems: Your data seems to be quite noisy, and it is not equally spaced: The points at the low end are sampled quite densly, while the ponts at the high end are sampled quite sparsely. This can cause numerical issues.

So first I suggest resampling the data using a linear interpolation to get equaly spaced samples: (Note that all the snippets appended to eachother form the content of one python file.)

import matplotlib.pyplot as plt import numpy as np from data import xvalues, yvalues #load data from file print("#datapoints: <>".format(len(xvalues))) #don't use every point if your computer is not very fast xv = np.array(xvalues)[::5] yv = np.array(yvalues)[::5] #interpolate to have evenly space data xi = np.linspace(xv.min(), xv.max(), 400) yi = np.interp(xi, xv, yv) 

Then, to smoothen the data, I suggest performing a RBF regression (=using an "RBF Network"). The idea is fiting a curve of the form

 c(t) = sum a(i) * phi(t - x(i)) #(not part of the program) 

where phi is some radial basis function. (In theory we could use any functions.) To have a very smooth result I choose a very smooth function, namely a gaussian: phi(x) = exp( - x^2/sigma^2) where sigma is yet to be determined. The x(i) are just some nodes that we can define. If we have a smooth function, we just need a few nodes. The number of nodes also determines how much computation needs to be done. The a(i) are the coefficients we can optimize to get the best fit. In this case I just use a least squares approach.

Note that IF we can write a function in the form above, it is very easy to compute the derivative, it is just

 c(t) = sum a(i) * phi'(t - x(i)) 

where phi' is the derivative of phi . #(not part of the program)

Regarding sigma : It is usually a good idea to choose it as a multiple of the step between the nodes we chose. The greater we choose sigma , the smoother the resulting function gets.

#set up rbf network rbf_nodes = xv[::50][None, :]#use a subset of the x-values as rbf nodes print("#rbfs: <>".format(rbf_nodes.shape[1])) #estimate width of kernels: sigma = 20 #greater = smoother, this is the primary parameter to play with sigma *= np.max(np.abs(rbf_nodes[0,1:]-rbf_nodes[0,:-1])) # kernel & derivative rbf = lambda r:1/(1+(r/sigma)**2) Drbf = lambda r: -2*r*sigma**2/(sigma**2 + r**2)**2 #compute coefficients of rbf network r = np.abs(xi[:, None]-rbf_nodes) A = rbf(r) coeffs = np.linalg.lstsq(A, yi, rcond=None)[0] print(coeffs) #evaluate rbf network N=1000 xe = np.linspace(xi.min(), xi.max(), N) Ae = rbf(xe[:, None] - rbf_nodes) ye = Ae @ coeffs #evaluate derivative N=1000 xd = np.linspace(xi.min(), xi.max(), N) Bd = Drbf(xe[:, None] - rbf_nodes) yd = Bd @ coeffs fig,ax = plt.subplots() ax2 = ax.twinx() ax.plot(xv, yv, '-') ax.plot(xi, yi, '-') ax.plot(xe, ye, ':') ax2.plot(xd, yd, '-') fig.savefig('graph.png') print('done')