3.6 Random subsets

when there are a lot of matches, e.g.
> A = "time";
> size A;
it is often desirable to look at a random selection to get a quick overview (rather than just seeing matches from the first part of the corpus); one possibility is to do a sort randomize and then go through the first few pages of random matches:
> sort A randomize;
however, this cannot be combined with other sort options such as alphabetical sorting on match or left/right context; it also doesn't speed up frequency lists, set target and other post-processing operations, since these still run on the full query result
as an alternative to randomized ordering, the reduce command randomly selects a given number or proportion of matches, deleting all other matches from the named query; since this operation is destructive, it may be necessary to make a copy of the original query result first (see above)
> reduce A to 10%;
> size A;
> sort A by word %cd on match .. matchend[42];
> reduce A to 100;
> size A;
> sort A by word %cd on match .. matchend[42];
this allows arbitrary further operations to be carried out on a representative sample rather than the full query result
to make the selection reproducible, set the random number generator seed before the reduce command
> randomize 42; (use any positive integer as seed)
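the effect of a seeded reduce can be pictured with a small Python sketch; it is an illustrative model only, not CQP's implementation: the function reduce_matches, the list A of (match, matchend) positions and the use of Python's random module are all assumptions made for this example

```python
import random

def reduce_matches(matches, n, seed=None):
    """Return a random subset of n matches, restored to corpus order (hypothetical helper)."""
    rng = random.Random(seed)                      # fixed seed -> reproducible selection
    chosen = rng.sample(matches, min(n, len(matches)))
    return sorted(chosen)                          # corpus order = ascending corpus position

# hypothetical query result: (match, matchend) corpus positions of 1000 single-token matches
A = [(pos, pos) for pos in range(0, 5000, 5)]

sample = reduce_matches(A, 100, seed=42)                      # like: reduce A to 100;
ten_percent = reduce_matches(A, len(A) * 10 // 100, seed=42)  # like: reduce A to 10%;
```

calling reduce_matches again with the same seed returns the same subset, which is the property the randomize command provides for reduce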
a second method for obtaining a random subset of a named query result is to sort the matches in random order, and then take the first n matches from the sorted query; the example below has the same effect as reduce A to 100; (though it will not select exactly the same matches)
> sort A randomize;
> cut A 100; (NB: this restores corpus order, as with the reduce command)
reproducible subsets can be obtained with a suitable randomize command before the sort; the main difference from the reduce command is that cut cannot be used to select a percentage of matches (i.e., you have to determine the number of matches in the desired subset yourself)
the most important advantage of the second method is that it can produce stable and incremental random samples
for a stable random ordering, specify a positive seed value directly in the sort command:
> sort A randomize 42;
different seeds give different, reproducible orderings; if you randomize a subset of A with the same seed value, the matches will appear in exactly the same order as in the randomized version of A:
> A = "interesting" cut 20; (just for illustration)
> B = A;
> reduce B to 10; (an arbitrary subset of A)
> sort A randomize 42;
> sort B randomize 42;
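one way to see why a subset keeps its relative order is to model the stable randomization as sorting by a pseudo-random key that depends only on the seed and on the match itself; the Python sketch below makes that assumption explicit (the hash-based key and the names random_key and sort_randomize are hypothetical and not taken from CQP, whose actual algorithm is not described here)

```python
import hashlib

def random_key(match, seed):
    """Pseudo-random sort key that depends only on the seed and the match positions."""
    start, end = match
    return hashlib.sha256(f"{seed}:{start}:{end}".encode()).digest()

def sort_randomize(matches, seed):
    """Order matches by their seed-dependent key (model of: sort A randomize <seed>;)."""
    return sorted(matches, key=lambda m: random_key(m, seed))

A = [(pos, pos) for pos in range(0, 200, 10)]   # 20 hypothetical matches
B = A[::2]                                      # an arbitrary subset of A (10 matches)

A_random = sort_randomize(A, 42)
B_random = sort_randomize(B, 42)

# the matches of B, read off in the order they occupy in the randomized A
B_within_A = [m for m in A_random if m in B]
```

because each match always receives the same key for a given seed, B_random and B_within_A list the matches in the same order, regardless of which other matches are present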
in order to build incremental random samples from a query result, sort it randomly (but with a fixed seed value to ensure reproducibility) and then take the first n matches as sample #1, the next n matches as sample #2, etc.; unlike two subsets generated with reduce, the first two samples are disjoint and together form a random sample of size 2n:
> A = "time";
> sort A randomize 7;
> Sample1 = A;
> cut Sample1 0 99; (random sample of 100 matches)
> Sample2 = A;
> cut Sample2 100 199; (random sample of 100 matches)
note that the cut removes the randomized ordering; you can reapply the stable randomization to achieve full correspondence to the randomized query result A:
> sort Sample2 randomize 7;
> cat Sample2;
> cat A 100 199;
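the incremental sampling procedure can be modelled in the same way; in this Python sketch the hash-based key and the helpers sort_randomize and cut are assumptions chosen to mimic the behaviour described above (cut keeps an inclusive range and restores corpus order), not CQP internals

```python
import hashlib

def random_key(match, seed):
    """Pseudo-random sort key that depends only on the seed and the match positions."""
    start, end = match
    return hashlib.sha256(f"{seed}:{start}:{end}".encode()).digest()

def sort_randomize(matches, seed):
    """Model of: sort A randomize <seed>;"""
    return sorted(matches, key=lambda m: random_key(m, seed))

def cut(matches, first, last):
    """Keep matches first..last (inclusive) of the current ordering, then restore corpus order."""
    return sorted(matches[first:last + 1])

# hypothetical query result with 1000 matches, in stable random order
A = sort_randomize([(pos, pos) for pos in range(0, 3000, 3)], 7)

sample1 = cut(A, 0, 99)       # first 100 matches of the random ordering
sample2 = cut(A, 100, 199)    # next 100 matches

disjoint = not (set(sample1) & set(sample2))
combined = sorted(sample1 + sample2)            # a random sample of 200 matches
sample2_random = sort_randomize(sample2, 7)     # reapply the stable randomization
```

in this model, combined is identical to cutting A at 0..199, and sample2_random reproduces the slice of A at positions 100..199, which mirrors the correspondence between cat Sample2; and cat A 100 199;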
stability of the randomization ensures that random samples are reproducible even after the initial query has been refined or spurious matches have been deleted manually