深圳大学新葡的京集团350vip8888
College of Computer Science and Software Engineering, SZU

Speeding up Data Manipulation Tasks with Alternative Implementations: An Exploratory Study

ACM Transactions on Software Engineering and Methodology(TOSEM)

 

Yida Tao1    Shan Tang1    Yepang Liu2     Zhiwu Xu1     Shengchao Qin3

Shenzhen University1    Southern University of Science and Technology2     Teesside University, UK     Shenzhen University3

 

Abstract

As data volume and complexity grow at an unprecedented rate, the performance of data manipulation programs is becoming a major concern for developers. In this paper, we study how alternative API choices could improve data manipulation performance while preserving task-specific input/output equivalence. We propose a lightweight approach that leverages the comparative structures in Q&A sites to extracting alternative implementations. On a large dataset of Stack Overflow posts, our approach extracts 5,080 pairs of alternative implementations that invoke different data manipulation APIs to solve the same tasks, with an accuracy of 86%. Experiments show that for 15% of the extracted pairs, the faster implementation achieved >10x speedup over its slower alternative.We also characterize 68 recurring alternative API pairs from the extraction results to understand the type of APIs that can be used alternatively. To put these findings into practice, we implement a tool, AlterApi, to automatically optimize real-world data manipulation programs. In the 1,267 optimization attempts on the Kaggle dataset, 76% achieved desirable performance improvements with up to orders-of-magnitude speedup. Finally, we discuss notable challenges of using alternative APIs for optimizing data manipulation programs. We hope that our study offers a new perspective on API recommendation and automatic performance optimization.

Fig. 1. Computing the magnitude of vectors using different NumPy APIs (excerpt from Stack Overflow post 9184560). One implementation is significantly faster than the other.

 

Fig. 2. Overview of the methodology.

 

Fig. 3. Code block from SO post 18365665. Line 4 and line 8 are not considered as consecutive since line 7 alters the input data.

 

Fig. 4. Code block from SO post 32340834. The 4 consecutive implementations can all be used to sum values in a dataframe. We extract 6 pairs of alternative implementations from this code block.

 

Fig. 5. An example of extracting alternative implementations using POS tagging, dependency parsing, and Semgrex matching at the sentence level.

 

Fig. 6. Coreference resolution is also used if a pronoun is identified as the compared entity.

 

Fig. 7. The code template for random testing.1 and 2 are for randomizing different types and values of the input;3 is the name of the input variable extracted from the candidate implementation pair, which is used to instantiate output variables in 4. 5 verifies the output equivalence.

 

Fig. 8. The random testing code generated for validating the implementation pair from Fig. 1.

 

Fig. 9. An example of API usage abstraction, which identifies a same alternative API usage pattern from two different implementation pairs.

 

Fig. 10. Avg. execution time of validated alternative implementation pairs under varied workload.

 

Fig. 11. Runtime speedups of alternative implementations.

 

Fig. 12. # of alternative implementations generated from each recurring alternative API pair.

 

Fig. 13. Distribution of the execution time of the original implementations.

 

Fig. 14. Avg. speedup/slowdown of alternative impl. generated from recurring alternative API pairs.

 

Fig. 15. Examples of alternative implementations whose performance differences are sensitive to the input data size.

 

Acknowledgements

We sincerely thank the editor and anonymous reviewers for their constructive suggestions and insightful comments on improving this manuscript.This work is partially supported by the National Key Research and Development Program of China under Grant No. 2019YFE0198100, the National Natural Science Foundation of China under Grants No. 61772347, 61972260, 61836005, 61932021, 61802164, and the Guangdong Basic and Applied Basic Research Foundation under Grant No. 2019A1515011577. Shengchao Qin is the corresponding author.

 

Bibtex

 

@article{DBLP:journals/tosem/TaoTLXQ21,
  author    = {Yida Tao and
               Shan Tang and
               Yepang Liu and
               Zhiwu Xu and
               Shengchao Qin},
  title     = {Speeding Up Data Manipulation Tasks with Alternative Implementations:
               An Exploratory Study},
  journal   = {{ACM} Trans. Softw. Eng. Methodol.},
  volume    = {30},
  number    = {4},
  pages     = {49:1--49:28},
  year      = {2021},
 }

Downloads

XML 地图