
Dissertation Defence: Parameter Efficient Code Representation Learning
April 9 at 9:00 am - 1:00 pm

Iman Saberi Tirani, supervised by Dr. Fatemeh Fard, will defend their dissertation titled “Parameter Efficient Code Representation Learning” in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science.
An abstract for Iman Saberi Tirani’s dissertation is included below.
Examinations are open to all members of the campus community as well as the general public. This examination will be offered in hybrid format. Registration is not required to attend in person; however, please email fatemeh.fard@ubc.ca to receive the Zoom link for this exam.
ABSTRACT
Code representation learning is an area of machine learning and Natural Language Processing (NLP) whose primary objective is to learn representations, or embeddings, of code fragments, source code, and programming-related text. The aim is to convert raw code or programming-related text into vector representations that capture the meaning, structure, and context of the code. These representations can later be used in downstream tasks such as code summarization and code generation. In this thesis, we propose methods for (1) knowledge transformation, (2) knowledge aggregation, (3) knowledge injection, and (4) knowledge retrieval in code representations. By knowledge we mean the vector embedding belonging to each sub-word, representing its semantic meaning in a given context.
Objective: There are four main objectives regarding learning code representations. First, conducting experiments on using adapters, trainable lightweight modules that modify a model's behavior, to adapt natural language models such as RoBERTa by transforming their embeddings for software engineering tasks (i.e., knowledge transformation). Second, using adapters to aggregate knowledge of different programming languages to improve the code representations of a target programming language (i.e., knowledge aggregation). Third, exploring the capacity of adapters to inject syntactic information into existing programming language models such as CodeBERT (i.e., knowledge injection). Finally, exploring various techniques to enhance the accuracy of the retrieval component in Retrieval Augmented Generation (RAG) for code generation tasks (i.e., knowledge retrieval).
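To make the adapter idea concrete, the following is a minimal sketch of a Houlsby-style bottleneck adapter in PyTorch: a small down-projection/up-projection pair with a residual connection, trained while the host model (e.g., RoBERTa) stays frozen. It is illustrative only; the class name, hidden size, and bottleneck size are assumptions, not the exact architecture from the thesis.

    import torch
    import torch.nn as nn

    class BottleneckAdapter(nn.Module):
        """Minimal bottleneck adapter (illustrative sketch; sizes assumed).

        Inserted after a transformer sub-layer, it learns a small
        down-project -> nonlinearity -> up-project transformation, so
        only a tiny fraction of the model's parameters is trained.
        """

        def __init__(self, hidden_size: int = 768, bottleneck_size: int = 64):
            super().__init__()
            self.down = nn.Linear(hidden_size, bottleneck_size)
            self.act = nn.GELU()
            self.up = nn.Linear(bottleneck_size, hidden_size)

        def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
            # The residual connection preserves the frozen model's
            # representation and adds a learned, task-specific adjustment.
            return hidden_states + self.up(self.act(self.down(hidden_states)))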
Method: Regarding the first objective, we assess how adapters perform on code summarization tasks when they are plugged into both natural language models and programming language models. Following that, we utilize AdapterFusion to combine knowledge from various programming languages for a specific target programming language. To enhance the effectiveness of AdapterFusion, we introduce a novel fusion adapter approach called AdvFusion. Furthermore, we introduce a new type of adapter, Named Entity Recognition (NER) adapters, designed to inject syntactic information into the network. Finally, we propose a novel Programming Knowledge Graph (PKG) that organizes a code-related corpus into a structured knowledge graph. For code generation, the PKG retrieves the most relevant path corresponding to a given query and incorporates it into the prompt to enhance the accuracy of the generation process.
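The fusion step can be pictured as attention over the outputs of several frozen, language-specific adapters: the layer's hidden state forms the query, and each adapter's output supplies a key and value. The sketch below follows the published AdapterFusion mechanism in simplified form; it is not the AdvFusion variant proposed in the thesis, and all names and shapes are assumptions.

    import torch
    import torch.nn as nn

    class FusionOverAdapters(nn.Module):
        """Simplified AdapterFusion-style layer (illustrative sketch).

        Learns attention weights that decide how much each source
        language's adapter contributes to the target-language
        representation.
        """

        def __init__(self, hidden_size: int = 768):
            super().__init__()
            self.query = nn.Linear(hidden_size, hidden_size)
            self.key = nn.Linear(hidden_size, hidden_size)
            self.value = nn.Linear(hidden_size, hidden_size)

        def forward(self, hidden: torch.Tensor, adapter_outs: torch.Tensor) -> torch.Tensor:
            # hidden:       (batch, seq, hidden)
            # adapter_outs: (batch, seq, num_adapters, hidden)
            q = self.query(hidden).unsqueeze(2)                 # (b, s, 1, h)
            k = self.key(adapter_outs)                          # (b, s, n, h)
            v = self.value(adapter_outs)                        # (b, s, n, h)
            scores = (q * k).sum(-1, keepdim=True) / k.size(-1) ** 0.5
            weights = torch.softmax(scores, dim=2)              # attend over adapters
            return hidden + (weights * v).sum(dim=2)            # fused representation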
Results: Our findings indicate that by employing adapters, we can effectively tailor natural language models such as RoBERTa to code summarization tasks, achieving results that are superior to, or on par with, dedicated programming language models such as CodeBERT. For example, RoBERTa equipped with adapters fine-tuned for Ruby yielded a BLEU score of 12.79, surpassing the 12.16 BLEU score obtained by fine-tuning CodeBERT specifically for Ruby.
As for our second objective, applying AdvFusion to Ruby code summarization led to a BLEU score of 16.53, significantly outperforming standard multilingual fine-tuning for Ruby, which achieved a BLEU score of 14.75.
Regarding the third objective, we evaluate NER adapters on both code refinement and code summarization tasks. In code refinement, CodeBERT with NER adapters demonstrated superior performance on Java, achieving a BLEU score of 78.2 and an accuracy of 17.8, compared to a BLEU score of 77.42 and an accuracy of 16.4 without NER adapters. For code summarization, NER adapters yield a substantial increase in BLEU score, from 18.07 with CodeBERT alone to 23.34.
In addressing our last objective, we introduced the PKG for the code generation task and evaluated our approach on standard Python benchmarks. The PKG enables us to retrieve code at a fine-grained level, focusing on highly relevant segments. Our method demonstrates up to an 8% increase in accuracy on HumanEval and up to a 34% improvement on MBPP, two widely recognized Python benchmarks.
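As a rough illustration of the retrieval step, the sketch below scores the paths of a toy knowledge graph against a query using a simple token-overlap measure and prepends the best path to the generation prompt. The path format, scoring function, and prompt template are hypothetical simplifications, not the thesis's actual PKG pipeline.

    def score(query: str, path: str) -> float:
        """Toy relevance score: Jaccard overlap of word sets.
        (A stand-in for the PKG's real retrieval scoring.)"""
        q, p = set(query.lower().split()), set(path.lower().split())
        return len(q & p) / len(q | p) if q | p else 0.0

    def retrieve_and_prompt(query: str, pkg_paths: list[str]) -> str:
        """Pick the most relevant path from a (hypothetical) PKG and
        incorporate it into the prompt for the code generator."""
        best = max(pkg_paths, key=lambda p: score(query, p))
        return f"# Relevant knowledge: {best}\n# Task: {query}\n"

    # Hypothetical usage:
    paths = [
        "string -> reverse -> s[::-1]",
        "list -> sort -> sorted(iterable, key=...)",
    ]
    print(retrieve_and_prompt("reverse a string", paths))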