## Using Simple Statistics to Detect Cheating

Imagine you have a class of 110 students. Further imagine that you ask each student to complete a task on their own and bring it to class as part of an in-class assignment. Suppose the task consists of producing a chart from source data and presenting it in a professional manner. This sounds rather simple, and it is. But just think of all of the choices a student must make when collecting data and then presenting it. The following list merely scratches the surface:

• the font
• the font size
• the scale of the axis
• units for the data
• the title of the data labels
• where to put data labels
• where to put legends
• the size of the chart itself
• the design of the lines (e.g. dashed or solid, how thick, whether to include markers)
• whether to use lines or some other indicator (such as bars)
• how to cite the data source
• where to put the data source
• and so on …

Make the extreme simplifying assumption that every student must make only these 12 choices when producing a chart (there are literally dozens more). Make the even more conservative assumption that each choice is not continuous but discrete, with only two options (e.g. two fonts to choose from, or two styles of lines). What would be the probability that any two randomly chosen charts look identical?

Ignore the size of the class for the moment. In the example above, we are asking what the probability is that two people make the same 12 choices. If each option has a 50% chance of being chosen (in reality, the probabilities are substantially lower), then there is a 0.02% chance (about two in 10,000) that a student produces any one particular chart (the calculation is simply 50% to the twelfth power, or 1 in 4,096). The chance that two out of two students make the same choices is the chance that the second student happens to match whatever the first one produced, which is that same 0.02%. (The chance that both land on one specific chart named in advance is far smaller still: (1/4,096) x (1/4,096), or roughly one in 17 million.) Not very good at all.

In a class of 110 students, what is the chance that two of the 110 charts look alike? Take any one chart: the maximum chance that at least one of the other 109 charts matches it is the sum of the individual match probabilities, or something like 2% to 3%. Across the whole class there are 110 x 109 / 2 = 5,995 distinct pairs of charts, so even under these deliberately generous assumptions chance alone would produce only one or two matching pairs (5,995 x 0.024% ≈ 1.5), and with realistic assumptions far fewer. If I were to detect 40 charts that were nearly identical, and 20 that were perfectly identical, it would be a fairly convincing sign that more than chance was at work in the production of such charts. Indeed, these are the types of ratios I typically find when I assign these sorts of projects.
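For readers who want to check the arithmetic, here is a minimal sketch in Python. It assumes exactly the toy model described above: 12 independent choices, two equally likely options each, and 110 students working independently. The variable names are mine and purely illustrative; a real class would need far more choices and far less even probabilities.

```python
from math import comb, prod

# Assumptions of the toy model (not measured data):
N_CHOICES = 12     # design decisions per chart, from the list above
N_OPTIONS = 2      # assumed options per decision, each equally likely
CLASS_SIZE = 110   # students, each assumed to work independently

# Number of distinct charts and the chance two given students match.
n_charts = N_OPTIONS ** N_CHOICES
p_match = 1 / n_charts

# Chance that one particular chart is matched by at least one other chart.
p_one_chart_matched = 1 - (1 - p_match) ** (CLASS_SIZE - 1)

# Number of distinct pairs and the expected number of coincidental matches.
n_pairs = comb(CLASS_SIZE, 2)
expected_matching_pairs = n_pairs * p_match

# Birthday-problem calculation: chance that at least one pair matches.
p_all_distinct = prod(1 - k / n_charts for k in range(CLASS_SIZE))
p_any_pair = 1 - p_all_distinct

print(f"distinct charts:              {n_charts}")
print(f"two given students match:     {p_match:.4%}")
print(f"a given chart is matched:     {p_one_chart_matched:.2%}")
print(f"pairs in the class:           {n_pairs}")
print(f"expected matching pairs:      {expected_matching_pairs:.2f}")
print(f"at least one matching pair:   {p_any_pair:.0%}")
```

Under these assumptions a single coincidental pair somewhere in the class is not surprising, but the expected number of matching pairs stays below two. A cluster of 20 identical charts is a different matter entirely.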

In a future post I’ll reflect on why or whether it makes sense to put students in a position to “avoid randomness” on a regular basis.

### 6 Responses to “Using Simple Statistics to Detect Cheating”

1. Harry says:

Are you now teaching at Fader?

2. CJ says:

If the charts in question are using, say, the Excel default settings or the MATLAB default settings, I would imagine the chances of identical charts would be quite high.

3. wintercow20 says:

Even with everyone using the same version of Excel and all versions having the same default settings, the chances are still virtually zero. Most (all) of the decisions they had to make had little to do with the chart itself (such as the particular data series chosen). Your point is well taken of course – but even with Excel default settings, copying and pasting charts into word processing programs should lead to slight differences – particularly for students who have not done much of this in the past.

4. Harry says:

I meant to write, “Faber.”

5. Edward says:

This explains the rationale behind your rather unique testing format. I knew there had to be some method to the madness!

Though I’m somewhat surprised by the ratio of duplicitous assignments…perhaps Google images is responsible?

6. SusanC says:

Let’s not even get into people making the same typos. I get around this problem by assigning a take-home Excel problem (using some forecasting techniques) and letting students work in groups of up to 4, but then asking them questions about it as part of the in-class midterm. It’s amusing to see the number of students who get 100% on the external assignment and then can’t answer even the most basic questions about what they supposedly did.

Also, it saves on grading time. Why grade 100 individual assignments when you can grade 25, especially if many of those 100 are just mooching off their friends?