This project develops a system for detecting plagiarism in sets of student assignments written in Java. We view plagiarism as a form of code obfuscation: a student deliberately applies semantics-preserving transformations to an original working program in order to pass it off as their own. To detect such obfuscation, we take a set of submitted programs and look for evidence that these transformations have been applied between them. We investigate tools for static analysis and transformation of Java programs and use them to build a plagiarism detection system.

Background

Plagiarism is a serious problem in Computer Science education. Students must write their own programs to gain an understanding of programming, but it is very easy to copy another student’s source code and modify it to pass it off as their own. A 2000 JISC survey into source-code plagiarism in UK higher education institutions confirmed that this is a definite problem and recommended the use of detection systems.

Faidhi and Robinson defined six levels of source code plagiarism, ranging from simple to complex:

  1. Comments: adding, removing or changing comments
  2. Identifiers: renaming variables and methods
  3. Code positions: moving declarations around
  4. Procedure combination: inlining methods
  5. Program statements: rearranging statements
  6. Control logic: e.g. changing for-loops to while-loops

However, detecting plagiarism is non-trivial because even non-plagiarised programs can look very similar due to common algorithms, formulaic assignments, shared identifier naming conventions, and code generated by tools such as GUI builders.
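
To make the levels above concrete, the following sketch shows two hypothetical submissions that differ only by a level 2 change (renamed identifiers) and a level 6 change (a for-loop rewritten as a while-loop). The class and method names are invented for illustration and do not come from the student corpora; the point is only that both methods compute the same result while looking different on the page.

```java
// Hypothetical example: two "submissions" related by semantics-preserving changes.
final class ObfuscationExample {

    // "Original" submission: sums the integers 1..n with a for-loop.
    static int sumOriginal(int n) {
        int total = 0;
        for (int i = 1; i <= n; i++) {
            total += i;
        }
        return total;
    }

    // "Disguised" submission: identifiers renamed (level 2) and the
    // for-loop rewritten as an equivalent while-loop (level 6).
    static int sumDisguised(int limit) {
        int acc = 0;
        int k = 1;
        while (k <= limit) {
            acc += k;
            k++;
        }
        return acc;
    }

    public static void main(String[] args) {
        // Both versions behave identically for the same input.
        System.out.println(sumOriginal(10) == sumDisguised(10)); // prints "true"
    }
}
```

A purely textual comparison would treat these two methods as largely different, which is why structural measures are of interest.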

Approach

We evaluated five static analysis tools for Java (ANTLR, TXL, Eclipse JDT, JavaCC, and javac) through a series of tasks including counting variables, listing methods, generating call hierarchy graphs, and producing AST graphs. Eclipse JDT proved to be the best fit: it provided a complete, accurate Java parser with a convenient API, and it seemed appropriate to transform Java programs using a Java program.

Using Eclipse JDT, we built a plagiarism detection system that computes multiple similarity measures for each pair of programs:

  • Document length
  • Variable, method, method invocation, loop, and if-statement counts
  • Document fingerprinting
  • AST node count
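
Of the measures listed above, document fingerprinting is perhaps the least self-explanatory. The summary does not say which fingerprinting scheme was used, so the following is only a minimal sketch of one common approach, assuming whitespace is stripped, every k-character window is hashed, and two fingerprints are compared with Jaccard similarity. The class name, the choice of k, and the use of `String.hashCode` are all assumptions made for illustration.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch only: not necessarily the scheme used in the project.
final class FingerprintSketch {

    // Hash every k-character window of the whitespace-stripped source.
    static Set<Integer> fingerprint(String source, int k) {
        String normalised = source.replaceAll("\\s+", "");
        Set<Integer> hashes = new HashSet<>();
        for (int i = 0; i + k <= normalised.length(); i++) {
            hashes.add(normalised.substring(i, i + k).hashCode());
        }
        return hashes;
    }

    // Jaccard similarity: |A intersect B| / |A union B|, in the range [0, 1].
    static double jaccard(Set<Integer> a, Set<Integer> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0; // two empty fingerprints are treated as identical
        }
        Set<Integer> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<Integer> union = new HashSet<>(a);
        union.addAll(b);
        return (double) intersection.size() / union.size();
    }

    public static void main(String[] args) {
        // Same code with different layout: identical after normalisation.
        String a = "int total = 0; for (int i = 0; i < n; i++) { total += i; }";
        String b = "int total=0;\nfor (int i = 0; i < n; i++) {\n  total += i;\n}";
        System.out.println(jaccard(fingerprint(a, 5), fingerprint(b, 5))); // prints "1.0"
    }
}
```

A fingerprint of this kind is insensitive to layout changes but not to renamed identifiers, so it complements rather than replaces the AST-based measure.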

These measures are combined using Euclidean distance to produce a single score for each pair, where a smaller distance indicates greater similarity. Results are visualised to help markers quickly identify suspicious pairs.
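
As a rough illustration of the combination step, the sketch below treats each program as a vector of counts and computes the Euclidean distance between two such vectors. The feature order, the example values, and the absence of any scaling or weighting are assumptions; the summary does not describe how the individual measures were normalised.

```java
// Illustrative sketch only: feature choice and scaling are assumptions.
final class PairDistance {

    // Euclidean distance between two equal-length feature vectors.
    static double euclidean(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("feature vectors must have equal length");
        }
        double sumOfSquares = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sumOfSquares += diff * diff;
        }
        return Math.sqrt(sumOfSquares);
    }

    public static void main(String[] args) {
        // Hypothetical counts: {variables, methods, invocations, loops, if-statements}
        double[] p = {12, 4, 9, 3, 5};
        double[] q = {12, 4, 9, 3, 5};
        double[] r = {9, 4, 5, 3, 5};
        System.out.println(euclidean(p, q)); // prints "0.0": identical counts
        System.out.println(euclidean(p, r)); // prints "5.0": sqrt(3^2 + 4^2)
    }
}
```

Without scaling, a measure with a large numeric range (such as document length) would dominate the distance, so some form of normalisation would likely be needed in practice.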

Results

We tested the system on eight real student assignment corpora (Maze, Rolling a Die, Leap Year, Drawing a Square, How Old Are You?, Guessing Game, Hangman, and Roman Numerals). Key findings:

  • Structural methods outperformed simple counting: the AST-based measure was the most effective in the automated tests.
  • Beginner assignments are inherently hard: simple programs like “Rolling a Die” (only a few lines of code) produce naturally similar submissions, making plagiarism detection nearly impossible.
  • Teacher-provided code is a confounding factor: when students are given starter code as part of the assignment, all submissions contain identical sections, inflating similarity scores. Pre-processing to remove known shared code would improve accuracy.
  • The system successfully identified plagiarised pairs but suffered from false positives.
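
The pre-processing suggested above for teacher-provided code was not part of the system as described, so the following is only a sketch of how it might look, assuming the starter code is available as text and that matching whole lines (ignoring indentation) is good enough. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.stream.Collectors;

// Illustrative sketch only: a line-based filter for known starter code.
final class StarterCodeFilter {

    // Remove every line of the submission whose trimmed text also
    // appears as a non-blank line in the starter code.
    static String stripStarterLines(String submission, String starter) {
        Set<String> known = Arrays.stream(starter.split("\\R"))
                .map(String::trim)
                .filter(line -> !line.isEmpty())
                .collect(Collectors.toSet());
        return Arrays.stream(submission.split("\\R"))
                .filter(line -> !known.contains(line.trim()))
                .collect(Collectors.joining("\n"));
    }

    public static void main(String[] args) {
        String starter = "public class Maze {\n    // TODO: implement solve\n}";
        String submission =
                "public class Maze {\n    int steps = 0;\n    // TODO: implement solve\n}";
        // Only the student's own line survives the filter.
        System.out.println(stripStarterLines(submission, starter));
    }
}
```

A line-based filter is crude: common lines such as a lone closing brace are removed from the student's own code as well, and the result no longer parses. Because the same filter is applied to every submission, this may be acceptable for text-based measures, but an AST-based measure would need the shared code removed at the level of declarations instead.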

Conclusion

Eclipse JDT proved to be an excellent foundation for Java static analysis. The project confirmed that plagiarism detection is non-trivial even in structured programming languages, and harder still in natural language. While our system was not as accurate as we would have liked, it provided a solid basis of knowledge for further work in static analysis, which continued into PhD research on dependence communities in source code.