Profiling Python

Yesterday I spent a little bit of time improving the speed of adding a new printer using system-config-printer. The main problem was that several bugs had conspired to make it search for all printer drivers three times instead of just once (oops). After fixing that I tried profiling it to see what was taking the most time.

I used the cProfile/pstats modules to do this. It’s quite easy to do. I replace the part where my program says:

gtk.main()

with:

import cProfile
cProfile.run("gtk.main()", "profile.out")

After running the program I used an interactive Python shell to display the profiling statistics, ordered by cumulative time taken in each function:

$ python
>>> import pstats
>>> p=pstats.Stats("profile.out")
>>> p.strip_dirs().sort_stats("cumulativ").print_stats()
Wed Jul 15 17:57:41 2009    profile.out

         1962150 function calls (1959323 primitive calls) in 21.377 CPU seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   21.377   21.377 <string>:1(<module>)
        1    3.968    3.968   21.377   21.377 {gtk._gtk.main}
        1    0.000    0.000   12.303   12.303 system-config-printer.py:4109(on_btnNPForward_clicked)
        1    0.000    0.000   12.303   12.303 system-config-printer.py:4112(nextNPTab)
        1    0.000    0.000    6.549    6.549 system-config-printer.py:4052(loadPPDs)
        1    0.001    0.001    6.549    6.549 system-config-printer.py:4018(fetchPPDs)
       60    6.515    0.109    6.515    0.109 {time.sleep}
        1    0.000    0.000    4.631    4.631 system-config-printer.py:3360(on_new_printer_activate)
        1    0.000    0.000    4.539    4.539 system-config-printer.py:3720(init)
        1    0.000    0.000    4.342    4.342 ppds.py:475(getPPDNameFromDeviceID)
        1    0.024    0.024    4.279    4.279 system-config-printer.py:4850(fillDeviceTab)
        1    0.000    0.000    4.208    4.208 system-config-printer.py:4538(fetchDevices)
        1    0.000    0.000    4.096    4.096 cupshelpers.py:511(getDevices)
        1    0.000    0.000    4.096    4.096 authconn.py:184()
        1    2.069    2.069    4.096    4.096 authconn.py:186(_authloop)
        4    0.083    0.021    3.985    0.996 ppds.py:816(_init_makes)
    11133    0.753    0.000    3.933    0.000 ppds.py:34(ppdMakeModelSplit)
   202792    0.463    0.000    2.056    0.000 re.py:144(sub)
...

As system-config-printer is an interactive graphical application, a lot of time was spent waiting for me to react to the interface. The functions at the top of the list are there for that reason. However, getPPDNameFromDeviceID is not interactive yet is taking an enormous amount of time: over four seconds! This time does not include the time taken to retrieve the list of available PPDs from CUPS (see loadPPDs), but is entirely spent on processing that data and trying to find a matching driver for the printer I was using.

In a lot of cases it would not take as long as this. In this instance, however, there is no exact match for my printer so it has to perform a lot more processing in order to find a “close” match. Part of that processing involves splitting the (often messy) ppd-make-and-model attributes for the PPDs into manufacturer names and model names.

The reason this was taking so long is apparent from the profiling, in conjunction with reading the code. It is the fact that regular expressions are used extensively to process the ppd-make-and-model attributes. This is confirmed by running p.print_callers(20):

...
Function                  was called by...
                              ncalls  tottime  cumtime
...
re.py:144(sub)            <-  202791    0.463    2.055  ppds.py:34(ppdMakeModelSplit)
...

One way of making this more efficient would be to "compile" the regular expressions once, then execute the compiled expressions on each call. However, by using simple string operations instead of regular expressions in that function, the cumulative time used by ppdMakeModelSplit was reduced from 3.93s to 0.98s.

The lesson: regular expressions are expensive!


Posted

in

by