My Math Forum Distribution-free test for outliers

August 27th, 2018, 02:52 AM   #1
Newbie

Joined: Jun 2017
From: italia

Posts: 13
Thanks: 2

Distribution-free test for outliers

My data are obtained by subtracting two vectors $V_{1}$ and $V_{2}$ in a 3D space and taking the length of the difference:
$$v=\sqrt{(V_{1_{x}}-V_{2_{x}})^2+(V_{1_{y}}-V_{2_{y}})^2+(V_{1_{z}}-V_{2_{z}})^2}$$
I don't know the distribution of $v$, but it vaguely resembles a very long-tailed log-normal distribution. Since I cannot justify any assumption, I treat the distribution as unknown.

My current method to find the outliers is based on Chebyshev's inequality; I say that $v$ is an outlier if $v-\bar{v} > 10 s$ (where $\bar{v}$ is the sample average and $s$ is the sample standard deviation). Is that method reasonably valid?

I found a paper that explains the Bootlier test to find outliers: https://www.econstor.eu/bitstream/10.../735352828.pdf, but it's not clear to me how to write a working procedure for that test. Could anyone please explain the Bootlier test in practical terms?

August 27th, 2018, 05:16 AM   #2
Senior Member

Joined: Oct 2009

Posts: 733
Thanks: 246

Quote:
 Originally Posted by Cristiano My data are obtained by subtracting two vectors $V_{1}$ and $V_{2}$ in a 3D space: $$v=\sqrt{(V_{1_{x}}-V_{2_{x}})^2+(V_{1_{y}}-V_{2_{y}})^2+(V_{1_{z}}-V_{2_{z}})^2}$$ I don't know the distribution of $v$, but it vaguely resembles a very long-tailed log-normal distribution. Without any valid assumption, I assume an unknown distribution.
Do we know the distribution of $V_1$ and $V_2$? Does it look like anything familiar to you, such as a normal distribution?

Quote:
 My current method to find the outliers is based on Chebyshev's inequality; I say that $v$ is an outlier if $v-\bar{v} > 10 s$ (sample average and sample standard deviation).
This is definitely a valid method, but likely way too conservative. You won't flag many outliers this way.
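
To make "too conservative" concrete: Chebyshev's inequality bounds the probability of landing $k$ or more standard deviations from the mean by $1/k^2$, so $k = 10$ allows at most 1% of the mass that far out for any distribution with finite variance. There is also a purely arithmetic limit: no point in a sample of size $n$ can lie more than $(n-1)/\sqrt{n}$ sample standard deviations from the sample mean, so the $10 s$ rule cannot flag anything at all unless $n \ge 102$. Below is a minimal C++ sketch of the rule quoted above. The function name `flag_upper_outliers` and the parameter `k` are made up for illustration; it assumes the data fit in memory and that $s$ uses the $n-1$ denominator.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Returns the indices i for which data[i] - mean > k * s, where mean and s
// are the sample average and the sample standard deviation (n - 1 denominator).
// Hypothetical helper; k = 10 reproduces the rule from the first post.
std::vector<std::size_t> flag_upper_outliers(const std::vector<double>& data, double k)
{
    std::vector<std::size_t> flagged;
    const std::size_t n = data.size();
    if (n < 2) return flagged;  // the standard deviation is undefined

    double mean = 0.0;
    for (std::size_t i = 0; i < n; ++i) mean += data[i];
    mean /= static_cast<double>(n);

    double ss = 0.0;  // sum of squared deviations from the mean
    for (std::size_t i = 0; i < n; ++i) ss += (data[i] - mean) * (data[i] - mean);
    const double s = std::sqrt(ss / static_cast<double>(n - 1));

    // Only the upper tail is checked, as in the original rule.
    for (std::size_t i = 0; i < n; ++i)
        if (data[i] - mean > k * s) flagged.push_back(i);
    return flagged;
}
```

For example, with nineteen values equal to 1 and one value equal to 100, the large value sits about 4.25 sample standard deviations above the mean, so it is flagged with `k = 3` but not with `k = 10`.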

The bootlier is a good test, and there are R scripts written to make it work. Definitely don't write your own script; the procedures you can find online usually do the job very well.
https://github.com/jodeleeuw/Bootlie...ter/bootlier.R

Note that the bootlier works well with skewed distributions, but only if you have a generous number of data points.
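
For a rough idea of what such a script does, here is a minimal C++ sketch of the bootstrap step only. It rests on my understanding of the bootlier idea (Singh and Xie's bootlier plot): resample the data with replacement many times, compute "mean minus trimmed mean" on each resample, and look at the histogram of those values; several distinct modes suggest outliers. It has not been checked against the linked paper or the R script, the names `mean_minus_trimmed_mean` and `bootlier_values` are made up, and the formal multimodality test applied to the bootstrap values is not shown.

```cpp
#include <algorithm>
#include <cstddef>
#include <random>
#include <vector>

// Mean of a sample minus its trimmed mean, dropping `trim` values at each end.
// Assumes sample.size() > 2 * trim. The sample is taken by value and sorted.
double mean_minus_trimmed_mean(std::vector<double> sample, std::size_t trim)
{
    std::sort(sample.begin(), sample.end());
    const std::size_t n = sample.size();
    double total = 0.0, kept = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        total += sample[i];
        if (i >= trim && i < n - trim) kept += sample[i];
    }
    return total / static_cast<double>(n) - kept / static_cast<double>(n - 2 * trim);
}

// Draws `resamples` bootstrap samples (with replacement, same size as data)
// and returns the statistic computed on each one.
std::vector<double> bootlier_values(const std::vector<double>& data,
                                    std::size_t resamples,
                                    std::size_t trim,
                                    unsigned seed)
{
    std::vector<double> values;
    const std::size_t n = data.size();
    if (n <= 2 * trim) return values;  // nothing would be left after trimming

    std::mt19937 rng(seed);
    std::uniform_int_distribution<std::size_t> pick(0, n - 1);
    std::vector<double> sample(n);
    values.reserve(resamples);
    for (std::size_t b = 0; b < resamples; ++b) {
        for (std::size_t i = 0; i < n; ++i) sample[i] = data[pick(rng)];
        values.push_back(mean_minus_trimmed_mean(sample, trim));
    }
    return values;
}
```

The intuition is that a resample containing an outlier shifts the mean much more than the trimmed mean, while a resample without one leaves the two close together, so the bootstrap values cluster into separate groups when outliers are present. The number of resamples and the trimming amount are tuning choices that this sketch leaves to the caller.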

Another method you might want to try is an isolation forest. It is also distribution-free and works quite well.
There are many distribution-free methods for finding outliers, but I really like these two.

August 27th, 2018, 06:32 AM   #3
Newbie

Joined: Jun 2017
From: italia

Posts: 13
Thanks: 2

Quote:
 Originally Posted by Micrm@ss Do we know the distribution of $V_1$ and $V_2$? Does it look like anything familiar to you, such as a normal distribution?
I don't know how to check the distribution of the vectors.

Quote:
 The bootlier is a good test, and there are R scripts written to make it work. Definitely don't write your own script; the procedures you can find online usually do the job very well. https://github.com/jodeleeuw/Bootlie...ter/bootlier.R
I need to include the code in my program (written in C++), but I found no source code, and I strongly doubt that I can write the program without a step-by-step explanation.

Quote:
 Another method you might want to try is an isolation forest. It is also distribution-free and works quite well. There are many distribution-free methods for finding outliers, but I really like these two.
I found two small C++11 packages, but they don't compile (I use MSVC++ 2013).
