Context

I have a database containing species records (e.g. 10 different species, with 100 rows per species; the columns are quantitative variables). I want to compute Euclidean distances (considering all variables) between 20 randomly sampled rows per species, between each species and species h. I want to bootstrap this calculation an increasing number of times to assess the effect of increasing the number of iterations on the linearity of the results (that is, to be able to say: OK, we have reached linearity, so the results should be OK). The aim is to show a figure like this one (1 line color = Euclidean distance of 1 species to species h):

To explain the process, I illustrate it with the distance calculation between species *a* and species *h*:

We define sets *Ra* and *Rh* as the original species records.

\[R_\alpha = \left \{ 1,2,3,4,...,n | n\in \mathbb{N} \: and\: n\geq 21 \right \}\]

\[R_h = \left \{ 1,2,3,4,...,n | n\in \mathbb{N} \: and\: n\geq 21 \right \}\]

Then we define *Sa* and *Sh* as proper subsets of *Ra* and *Rh*, composed of 20 records randomly sampled from *Ra* and *Rh* without replacement, so that the probability *P(r)* for records to be selected is:

\[P(r)=\frac{(N-n)!}{N!}\]

\[S_\alpha \subset R_\alpha \: and\: S_h \subset R_h\:, with\: n(S)=20\]
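As a minimal sketch of this sampling step in Python with NumPy: the arrays `records_a` and `records_h` are hypothetical stand-ins for my real data (100 records by 20 variables each, filled with random numbers), and `sample_records` is just an illustrative helper name.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed, only for reproducibility

# Hypothetical stand-in data: 100 records x 20 variables per species.
records_a = rng.normal(size=(100, 20))
records_h = rng.normal(size=(100, 20))

def sample_records(records, size, rng):
    """Draw `size` distinct rows from `records` (sampling without replacement)."""
    idx = rng.choice(records.shape[0], size=size, replace=False)
    return records[idx]

sample_a = sample_records(records_a, 20, rng)  # plays the role of S_a
sample_h = sample_records(records_h, 20, rng)  # plays the role of S_h
```

With `replace=False`, no record can appear twice within one sample, which is what "without replacement" means here.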

Then we define the following function to compute the mean Euclidean distance between all records of *Sa* and *Sh*:

\[f(x,y)=\frac{1}{n'}\sum_{j=1}^{n'}\left ( \sqrt{\sum_{i=1}^{n}(y_i-x_i)^{2}} \right )_j\]

With n = 20 (variables) and n' = 20 (randomly sampled records; size of *Sa* and *Sh*).
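To make the formula concrete, here is a small Python sketch of f(x, y) as I wrote it, under the assumption that record j of the first sample is paired with record j of the second sample (the formula sums over a single index j, so it does not cover every cross pair). The function name `mean_paired_distance` is only illustrative.

```python
import numpy as np

def mean_paired_distance(x, y):
    """Mean Euclidean distance between row j of x and row j of y.

    x and y are arrays of shape (n_records, n_variables).
    """
    diffs = y - x                               # per-variable differences
    dists = np.sqrt((diffs ** 2).sum(axis=1))   # one distance per record pair
    return dists.mean()                         # average over the n' pairs

# Tiny worked example with 2 records and 2 variables:
x = np.zeros((2, 2))
y = np.array([[3.0, 4.0], [6.0, 8.0]])
print(mean_paired_distance(x, y))  # distances are 5 and 10, so the mean is 7.5
```

If the mean over all 20 x 20 cross pairs were intended instead, the inner computation would need to change accordingly.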

Then we define set *D*, which contains the Euclidean distances between the records of *Sa* and *Sh*:

\[D_{(\alpha ,h)}=\left \{ f(x,y)|x\in S_\alpha \: and\: y\in S_h \right \}\]

Finally, we define set *B*, containing the numbers of iterations of the whole process, from the sampling event (with replacement between iterations, giving a probability *P(r)=1/N* for records to be selected between iterations) to the computation of set *D*. The following formula f(x) allows computing set *M*:

\[B\approx \left \{ 1*1.6^x | x\in \mathbb{N}_0\: and\: 0\leq x\leq 20\right \}\]

\[f(x)=\frac{1}{n''}\sum_{l=1}^{n''}x_{l}\]

\[M_{(\alpha ,h)}=\left \{ f(x)|x\in D\: and\: n''\in B \right \}\]

With *B* = rounded values.
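Here is a rough Python sketch of how I picture the whole loop, to make *B* and *M* concrete. It rests on several assumptions: the data are random stand-ins (`records_a`, `records_h`), records are paired row by row within an iteration, and each mean in *M* is taken over the first n'' values of one single sequence of distances rather than over freshly drawn iterations.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed, only for reproducibility

# Hypothetical stand-in data: 100 records x 20 variables per species.
records_a = rng.normal(loc=0.0, size=(100, 20))
records_h = rng.normal(loc=1.0, size=(100, 20))

def one_iteration(rng, n_sample=20):
    """Sample 20 records per species without replacement, return f(x, y)."""
    ia = rng.choice(len(records_a), size=n_sample, replace=False)
    ih = rng.choice(len(records_h), size=n_sample, replace=False)
    diffs = records_h[ih] - records_a[ia]
    return np.sqrt((diffs ** 2).sum(axis=1)).mean()

# B: rounded values of 1.6^x for x = 0..20 (1, 2, 3, 4, 7, 10, ...).
iteration_counts = np.rint(1.6 ** np.arange(21)).astype(int)

# One distance per iteration; records return to the pool between iterations.
distances = np.array([one_iteration(rng) for _ in range(iteration_counts.max())])

# M: mean of the first n'' distances, for each n'' in B.
running_means = {int(b): distances[:b].mean() for b in iteration_counts}
```

Plotting `running_means` against the iteration counts, once per species, would give the kind of figure described in the context.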

Mainly, I am not sure that I have the right to build *M(a,h)* this way...

Could you please tell me whether it is OK to call functions this way inside set definitions? And whether you spot any mistakes in the process?

Many thanks for your help!