Proceedings of EUROMICRO 45th Conference on Software Engineering and Advanced Applications (SEAA)
Syntactic parsing is a time-consuming task in natural language processing, particularly when a large number of text files must be processed. Parsing algorithms are conventionally designed to run sequentially on a single machine and, as a consequence, fail to benefit from the high-performance and parallel computing resources available in the cloud. We designed and implemented a scalable cloud-based architecture supporting parallel and distributed syntactic parsing for large datasets. The main architecture consists of a syntactic parser (constituency and dependency parsing) and a MapReduce framework running on clusters of machines. The resulting cloud-based MapReduce parsing builds a map in which the syntactic trees of the same input file share the same key, and collects them into a single file containing the sentences along with their corresponding trees. Our experimental evaluation shows that the architecture scales well with regard to the number of processing nodes and the number of cores per node. In the fastest tested cloud-based setup, the proposed design runs 7 times faster than a local setup. In summary, this study takes an important step toward providing and evaluating a cloud-hosted solution for efficient syntactic parsing of natural language datasets consisting of a large number of files.
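The keying scheme described in the abstract can be illustrated with a minimal, single-process sketch in Python. It is not the authors' implementation: the function names (`toy_parse`, `map_phase`, `shuffle`, `reduce_phase`, `run_job`) are hypothetical, the parser is a trivial stand-in for a real constituency or dependency parser, and sentences are naively split on periods. It only shows how keying each tree by its input file lets the reduce step gather all sentences and trees of that file into one output.

```python
from collections import defaultdict


def toy_parse(sentence):
    # Hypothetical stand-in for a real syntactic parser (assumption):
    # wraps every token in a flat bracketed "tree".
    tokens = sentence.split()
    return "(S " + " ".join(f"(X {t})" for t in tokens) + ")"


def map_phase(filename, text):
    # Emit one (key, value) pair per sentence; the key is the input file name,
    # so all trees of the same file end up under the same key.
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    for idx, sentence in enumerate(sentences):
        yield filename, (idx, sentence, toy_parse(sentence))


def shuffle(pairs):
    # Group values by key, as a MapReduce framework does between map and reduce.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(filename, values):
    # Restore sentence order and join each sentence with its tree,
    # producing the content of a single output file per input file.
    return "\n".join(f"{sentence}\t{tree}" for _, sentence, tree in sorted(values))


def run_job(files):
    pairs = [pair for name, text in files.items() for pair in map_phase(name, text)]
    return {name: reduce_phase(name, vals) for name, vals in shuffle(pairs).items()}


corpus = {"a.txt": "Dogs bark. Cats sleep.", "b.txt": "Birds sing."}
output = run_job(corpus)
```

In a real deployment the map calls would be distributed across cluster nodes and cores by the MapReduce framework; here they run sequentially only to make the data flow visible.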
Y. Woldemariam, S. Pletschacher, C. Clausner, J. Bass, "A cloud-hosted MapReduce architecture for syntactic parsing", Proceedings of EUROMICRO 45th Conference on Software Engineering and Advanced Applications (SEAA)