TY - JOUR
T1 - MR plot: A big data tool for distinguishing distributions
AU - Ardakani, Omid
AU - Asadi, Majid
AU - Ebrahimi, Nader
AU - Soofi, Ehsan
N1 - Publisher Copyright: © 2020 Wiley Periodicals LLC.
PY - 2020/6/9
Y1 - 2020/6/9
N2 - Big data enables reliable estimation of continuous probability density, cumulative distribution, survival, hazard rate, and mean residual functions (MRFs). We illustrate that a plot of the MRF provides the best resolution for distinguishing between distributions. At each point, the MRF gives the mean excess of the data beyond the threshold. A graph of the empirical MRF, called here the MR plot, provides an effective visualization tool. A variety of theoretical and data-driven examples illustrate that MR plots of big data preserve the shape of the MRF and that complex models require bigger data. The MRF is an optimal predictor of the excess of the random variable. With a suitable prior, the expected MRF gives the Bayes risk in the form of the entropy functional of the survival function, called here the survival entropy. We show that the survival entropy is dominated by the standard deviation (SD) and that equality between the two measures characterizes the exponential distribution. The empirical survival entropy provides a data concentration statistic that is strongly consistent, easy to compute, and less sensitive than the SD to heavy-tailed data. An application uses the New York City Taxi database with millions of trip times to illustrate the MR plot as a powerful tool for distinguishing distributions.
AB - Big data enables reliable estimation of continuous probability density, cumulative distribution, survival, hazard rate, and mean residual functions (MRFs). We illustrate that a plot of the MRF provides the best resolution for distinguishing between distributions. At each point, the MRF gives the mean excess of the data beyond the threshold. A graph of the empirical MRF, called here the MR plot, provides an effective visualization tool. A variety of theoretical and data-driven examples illustrate that MR plots of big data preserve the shape of the MRF and that complex models require bigger data. The MRF is an optimal predictor of the excess of the random variable. With a suitable prior, the expected MRF gives the Bayes risk in the form of the entropy functional of the survival function, called here the survival entropy. We show that the survival entropy is dominated by the standard deviation (SD) and that equality between the two measures characterizes the exponential distribution. The empirical survival entropy provides a data concentration statistic that is strongly consistent, easy to compute, and less sensitive than the SD to heavy-tailed data. An application uses the New York City Taxi database with millions of trip times to illustrate the MR plot as a powerful tool for distinguishing distributions.
KW - Bayes risk
KW - concentration measures
KW - distributional plots
KW - mean residual plot
KW - survival entropy
KW - taxi trip time
UR - https://digitalcommons.georgiasouthern.edu/economics-facpubs/127
UR - https://doi.org/10.1002/sam.11464
U2 - 10.1002/sam.11464
DO - 10.1002/sam.11464
M3 - Article
VL - 13
JO - Statistical Analysis and Data Mining: The American Statistical Association Data Science Journal
JF - Statistical Analysis and Data Mining: The American Statistical Association Data Science Journal
ER -