Hi,
Can anyone tell me how to randomly draw subsets from an existing data set?
For instance, I have a variable with 100 observations, and I want to keep only 50 of them (randomly drawn). It would be good if the subset mimics the main features of the original data.
I did not find any routine that deals with this specifically. Can you tell me how you handle this in general?
Many thanks!
3 Answers
GAUSS allows you to index into a matrix with a vector of indices, for example:
    x = { 5 1,
          2 9,
          3 7,
          6 4,
          8 0 };
    idx = { 2, 4, 5 };
    z = x[idx, .];
After the code above, z will equal the second, fourth and fifth rows of x:
        2 9
    z = 6 4
        8 0
Now all we need to do is to create some random integers between 1 and the number of observations in our dataset to draw randomly from it. You can do that by multiplying a series of uniform random numbers by the number of observations in your data and rounding up. Here is an example:
    //create a dataset for this example
    my_dataset = rndn(100, 5);

    //how many observations to draw at a time
    num_draws = 50;

    //create index for random draws:
    //'num_draws' integers between 1 and the number of rows in the data
    //(edited to fix bug reported in this thread)
    idx = ceil(rows(my_dataset) * rndu(num_draws, 1));

    //draw sample
    my_sub_sample = my_dataset[idx, .];
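Note that an index built this way is drawn with replacement, so the same row can appear more than once in my_sub_sample. If each observation should appear at most once, one possible approach is sketched below. It is not part of the original answer: it assumes the sortind function (which returns the indices that would sort a vector) and introduces the illustrative name perm. Sorting a vector of independent uniform draws gives a random ordering of the row numbers, and keeping the first num_draws of them gives a sample without repeats.

```gauss
//create a dataset for this example
my_dataset = rndn(100, 5);

//how many observations to keep
num_draws = 50;

//random ordering of the row numbers 1..rows(my_dataset):
//the sort order of independent uniform random numbers
perm = sortind(rndu(rows(my_dataset), 1));

//keep the first 'num_draws' row numbers, so no row is repeated
idx = perm[1:num_draws, .];

//draw sample without replacement
my_sub_sample = my_dataset[idx, .];
```

Whether drawing with or without replacement is preferable depends on the use: drawing with replacement is what a bootstrap needs, while keeping 50 distinct observations out of 100 calls for the version without replacement.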
Hi, Aptech
I do not think the program you provided works correctly. When I ran it, the sub-sample size was still 100, the same as the number of observations in my_dataset. I want to keep 50 observations in the sub-sample.
Also, should

    //create index for random draws
    idx = ceil(num_draws * rndu(rows(my_dataset), 1));

be

    //create index for random draws
    idx = ceil(num_draws * rndu(rows(num_draws), 1));

instead?
Although the size will be correct using the code above, it does not actually make sense to me.
Could you please explain more? Maybe I misunderstood your program.
Thank you very much!!
Yes, there is a bug in that post. That code draws a random sample the same size as your original data, which is not what you want. Because the variable num_draws in that code snippet is a scalar, rows(num_draws) returns 1. The code you proposed:
idx = ceil(num_draws * rndu(rows(num_draws), 1));
will draw a random sample of only one observation. What you actually want the line to read is:

    idx = ceil(rows(my_dataset) * rndu(num_draws, 1));

Here num_draws sets how many indices are drawn, and rows(my_dataset) sets the range they are drawn from, so that every row of the data can be selected. I think if I break the assignment of idx into separate statements, it will be clear what is going on. The corrected line could be rewritten as follows:

    //Create 'num_draws' uniform random numbers between 0 and 1
    r = rndu(num_draws, 1);

    //Change the scale of our uniform random numbers
    //from 0-1 to 0-(number of rows in 'my_dataset')
    r_scaled = rows(my_dataset) * r;

    //Force the scaled uniform random numbers
    //to integers from 1 to the number of rows in 'my_dataset'
    idx = ceil(r_scaled);
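The sample sizes discussed in this thread can also be checked directly. The short sketch below only prints the number of elements that each variant of the rndu call produces, using the same my_dataset and num_draws as in the earlier example:

```gauss
//same setup as in the earlier example
my_dataset = rndn(100, 5);
num_draws = 50;

//rows(my_dataset) is 100, so this vector has 100 elements
print rows(rndu(rows(my_dataset), 1));

//num_draws is a scalar, so rows(num_draws) is 1: a single element
print rows(rndu(rows(num_draws), 1));

//passing num_draws directly gives the 50 elements we want
print rows(rndu(num_draws, 1));
```

The first argument of rndu is what controls how many observations end up in the sub-sample.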
Let us know if this clears up the issue or if you have any more questions!