
Strings in Python 2 and Python 3

The goal of this post is to show you how to use encode and decode properly in Python 2 and Python 3. It is built around small examples that will (hopefully) give you a better understanding of how strings work in both versions.

A bit of background on Unicode and UTF-8:

Unicode has a different way of thinking about characters. In Unicode, the letter “A” is a platonic ideal. It’s just floating in “heaven”. Every platonic letter in every alphabet is assigned a magic number by the Unicode Consortium, which is written like this: U+0639 (in Python, “\u0639”).

UTF-8 is a system for storing your string of Unicode code points (those magic “U+number” values) in memory using 8-bit bytes.
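To make the difference between a code point and its UTF-8 bytes concrete, here is a minimal Python 3 sketch. The variable names are only illustrative and are not part of the original examples.

```python
# Each character corresponds to one Unicode code point (an integer).
letter = "\u0639"                 # the character written as U+0639
code_point = ord(letter)          # the "magic number" as a plain integer

# UTF-8 stores that code point as one or more 8-bit bytes.
utf8_bytes = letter.encode("utf-8")

print(hex(code_point))            # 0x639
print(utf8_bytes)                 # b'\xd8\xb9'  (two bytes for one character)
print("A".encode("utf-8"))        # b'A'         (ASCII characters need one byte)
print("£".encode("utf-8"))        # b'\xc2\xa3'
```

One character can therefore occupy several bytes, which is exactly what trips up the byte-level operations shown later in this post.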

One of the common questions about Python 3 is when to use a bytestring and when to use a string. When you are manipulating text (e.g. “reversed(my_string)”) you always use a string object and never a bytestring. Why? Here is an example:

my_string = "I owe you £100"
my_bytestring = my_string.encode()

>>> print(''.join([c for c in reversed(my_string)]))
001£ uoy ewo I
>>> print(''.join([chr(c) for c in reversed(my_bytestring)]))
001£Â uoy ewo I

 

The first print is what we expect, but the second is not. Why is that? The “reversed” function iterates over a sequence, which in the second case is the bytestring b'I owe you \xc2\xa3100'. Reversing it swaps the two bytes that together encode “£”, and chr() then turns each byte into a separate character. We can also verify this by checking the length of “my_bytestring” and “my_string”:

>>> print(len(my_string))
14
>>> print(len(my_bytestring))
15

 

If I always just add “.encode()”, everything will be fine, right? No! For a start, you should never call encode without specifying which encoding to use, because then the interpreter picks one for you. In Python 3 that default is UTF-8, but in Python 2 it is the interpreter’s default encoding (normally ASCII), and you will spend a lot of time finding the resulting bug. So ALWAYS specify which encoding to use (e.g. “.encode('utf-8')”) and decode with the same one. Here is what happens when the two do not match:

>>> print('I owe you £100'.encode('utf-8').decode('latin-1'))
I owe you Â£100

 

The other, even bigger problem with “sprinkling” “.encode()” everywhere is that if the string is already encoded you will get an error (in Python 3) or, even worse (in Python 2), you will end up doing string operations on a bytestring.
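One way to avoid sprinkling “.encode()” is to convert only at the edges of your program: decode once on the way in, work with strings, encode once on the way out. The sketch below is for Python 3. The helper names to_text and to_bytes are made up for this illustration and are not from any library.

```python
def to_text(value, encoding="utf-8"):
    """Return value as str, decoding only if it is still bytes."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def to_bytes(value, encoding="utf-8"):
    """Return value as bytes, encoding only if it is still str."""
    if isinstance(value, str):
        return value.encode(encoding)
    return value


raw = "I owe you £100".encode("utf-8")   # stands in for data read from a file or socket
text = to_text(raw)                      # decode once, on the way in
shouted = text.upper()                   # string operations happen on str only
payload = to_bytes(shouted)              # encode once, on the way out

print(text)
print(payload)
```

Because both helpers check the type first, calling them on a value that is already converted is harmless, which is not true of a bare “.encode()”.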

In Python 2, “str” is for strings of bytes and “unicode” is for strings of Unicode code points. The problem is that Python 2 implicitly converts between the two types… sometimes. It allows things like this:

>>> print((u'I owe you £100'.encode('utf-8') + 'Plus another $100').decode('latin-1'))
I owe you Â£100Plus another $100

 

This will quickly raise an error when “Plus another $100” becomes something that is not ASCII. If you try this in Python 3 you get “TypeError: can’t concat bytes to str”.
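Here is a small Python 3 sketch of that failure and of the usual fix, which is to decode the bytes before concatenating. The variable names and the extra text are only illustrative, and the exact wording of the error message differs between Python 3 versions.

```python
encoded = "I owe you £100".encode("utf-8")   # bytes
extra = " plus another €100"                 # str containing a non-ASCII character

mixing_error = None
try:
    combined = encoded + extra               # Python 3 refuses to mix bytes and str
except TypeError as error:
    mixing_error = error
    # Decode the bytes first, then concatenate two str objects.
    combined = encoded.decode("utf-8") + extra

print(type(mixing_error).__name__)
print(combined)
```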

If you need your code to run on both Python 2 and Python 3, a rule of thumb is to write the code for Python 3 first and then try it in Python 2.

Lessons Learned from PyMunich 2016

At the end of October there was a Python conference in Munich (PyMunich). For a regional conference it was quite big in my opinion. There were 3 tracks and more than 40 speakers.

As always, I won’t cover all the talks, just the ones that I found the most interesting and educational. After all, this is the biggest reason why I go to these conferences.

The first talk I attended was by Dmitry Trofimov. He talked about profiling (“Profiling the unprofilable”). There are two approaches to profiling your code, and it is important to know both so you can choose the right one. The first is statistical (sampling) profiling (e.g. vmprof), which inspects the running program at intervals; the second is deterministic profiling (e.g. cProfile), which records every function call. For more details about the differences I strongly suggest doing some research of your own.
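As a starting point for that research, here is a minimal sketch of deterministic profiling with cProfile from the standard library. The function sum_of_squares is a made-up workload and is not from the talk.

```python
import cProfile
import io
import pstats


def sum_of_squares(n):
    """Deliberately simple workload so there is something to measure."""
    total = 0
    for i in range(n):
        total += i * i
    return total


profiler = cProfile.Profile()
profiler.enable()                      # start recording every function call
total = sum_of_squares(10000)
profiler.disable()                     # stop recording

# Collect the report in a string, sorted by cumulative time.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```

For a whole script you can get the same kind of report without touching the code by running python -m cProfile -s cumulative your_script.py.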

When you need to optimize your code you should be aware of the optimization levels. Developers often want to be smart and go straight to optimizing their algorithms, but this doesn’t have the biggest impact. The design (architecture) has the biggest impact on performance, so this should be your main focus. After that you can start looking at algorithms and data structures and, at the very end, line profiling. See the “Effective Python” section of the Lessons learned from EuroPython 2016 post for more details on this.

You can also use Cython for even better optimization (i.e. when you would otherwise need to write C code), but in most cases this isn’t necessary when building for the web because the bottleneck is network I/O. Stefan Behnel gave a great talk, “Getting Native with Cython”, in which he showed us how easy it is to write pure Python code and then transform it to Cython. If you have performance issues and you have already done all the optimization you could think of, I strongly suggest trying Cython. I realise that it is probably harder than Stefan made it look, but it is still worth looking into in my opinion.

Encryption is awesome. We all like it, but are we all using it? I admit that I don’t have it on my site, but I should. And now, with the Let’s Encrypt certificate authority, there is no reason left for any of us not to use encryption on our sites. Markus Holtermann, who gave a talk about SSL encryption (“SSL all the things”), pointed out a few things that we should probably all know:

  • SSL 2 and 3 are broken, so don’t use them,
  • also don’t use TLS 1.0/1.1,
  • get a fresh certificate every 90 days,
  • disable plain HTTP (redirect it to HTTPS) because it can leak information you don’t want to leak.
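These rules are usually enforced in the web server configuration, but the same protocol limits can be applied to Python code that opens TLS connections itself. The following is a sketch using the standard ssl module (Python 3.7 or newer). It was not part of the talk.

```python
import ssl

# create_default_context() already disables SSL 2/3 and verifies certificates.
context = ssl.create_default_context()

# Additionally refuse TLS 1.0 and TLS 1.1.
context.minimum_version = ssl.TLSVersion.TLSv1_2

print(context.minimum_version)
print(context.verify_mode)
```

The resulting context can then be handed to whichever client or server code accepts an ssl.SSLContext.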

There are many open source tools that can help you achieve nice and tidy encryption on your site. One of them is `acme-tiny` (https://github.com/diafygi/acme-tiny). It is a very small script (less than 200 lines), which means you can easily read every line of the code. You should, because you need to trust this tool with your private keys.

The last talk I want to mention was by far my favourite one. Philip Bauer showed us how to debug like a pro (“Debug like a pro. How to become a better programmer through pdb-driven development“).

His bread and butter tool is `pdbpp` or `pdb++`, which is a drop-in replacement for `pdb`. This means that you set a breakpoint just like with pdb, but if you have pdb++ installed it will automatically get called instead.

Here are the basic commands for pdb that Philip highlighted:

  • l[ist] (list source code of current file)
  • n[ext] (continue execution until next line)
  • s[tep] (execute the current line, stop at the first possible occasion)
  • r[eturn] (continue execution until the current function returns)
  • c[ontinue] (continue execution, only stop when a breakpoint is encountered)
  • w[here] (show stack trace, most recent frame at the bottom)
  • u[p] (move up the stack)
  • d[own] (move down the stack)
  • b[reak] (set a new breakpoint; use `tbreak` for temporary breakpoints)
  • a[rgs] (print the argument list of the current function)
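To try a few of these commands, here is a small sketch. Normally you would just put import pdb; pdb.set_trace() in the function and type the commands yourself. In this example the commands are fed from a string so that it runs unattended, and the function average is made up for the demonstration.

```python
import io
import pdb

session_output = io.StringIO()


def average(values):
    total = sum(values)
    # Interactive use would simply be: import pdb; pdb.set_trace()
    # Here the commands "a", "pp total" and "c" are scripted via stdin.
    debugger = pdb.Pdb(
        stdin=io.StringIO("a\npp total\nc\n"),
        stdout=session_output,
        nosigint=True,
        readrc=False,
    )
    debugger.set_trace()
    return total / len(values)


result = average([1, 2, 3])
transcript = session_output.getvalue()
print(transcript)   # shows the (Pdb) prompts and the replies to each command
print(result)
```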

The nice thing about pdbpp is that it has a long list command (`ll`), which displays the whole function you are in (note: ipdb also has a long list command).

Other Python debugging tricks you should know about are:

  • use ? for getting additional information about a lib/class/function/… (e.g. os?)
  • use ?? for displaying the source code of the lib/class/function you want to inspect (e.g. os.path.join??)
  • pp (pretty-print) is already in pdb so you should always use it
  • pp locals() will pretty-print local variables

One of the best tricks is the `help` function, which accepts an object and returns a generated help page for it. The command !help(obj.__class__) will generate a help page which contains all the methods (including class methods and static methods) with docstrings, the method resolution order, data descriptors, attributes, data and other inherited attributes, and much more.

Note: The reason you need to put ! before the help function in pdbpp/ipdb is that if you don’t, you will call the internal pdbpp/ipdb help command, which is not the Python built-in help function.
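Outside the debugger you can get the same page as a string with pydoc.render_doc, which is handy for a quick look at what help would show. The classes Base and Child below are invented for this sketch.

```python
import pydoc


class Base:
    def greet(self):
        """Return a greeting."""
        return "hello"


class Child(Base):
    @classmethod
    def create(cls):
        """Alternative constructor."""
        return cls()


obj = Child.create()

# help(obj.__class__) would page this text interactively;
# render_doc returns the same documentation as a plain string.
page = pydoc.render_doc(obj.__class__, renderer=pydoc.plaintext)
print(page)
```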

You can also use the --pdb option when running unit tests with pytest or nosetests; it drops you into pdb whenever a test fails or errors. From there you can write code that makes the test pass, copy and paste that code into your file, and you are done. This is the basic principle of Test-Driven / Debug-Driven Development (TDD).


A dev’s MacBook from scratch

I’ve been a long-time Apple user. I hate a lot about the company’s policy and how they treat their power users, but I love the tight integration between their software and hardware. Another thing to love is their migration tools. You buy new hardware, you click Restore from backup and you are done. Safari even opens up the tabs you had open on the old device. Recently, however, I splurged on a new MacBook 12″ and decided to set it up from scratch. For the fun of it. Here are some notes on how I set it up for myself, for future reference and in case someone is in a similar position.

Tips:

  • Don’t sign into iCloud during installation as that starts syncing everything to iCloud and you might not want that.
  • I moved over some files manually from a Time Machine external disk and they got “locked” i.e. I had to enter the admin password for any change to them. This is how I “unlocked” them: xattr -c -r FOLDER_WITH_LOCKED_ITEMS/ && chmod -RN FOLDER_WITH_LOCKED_ITEMS/

System configuration:

  • First off, update to the latest version of OS X, since every major update overwrites some system configuration and you don’t want to duplicate your work.
  • Turn on auto updates. Doh.
  • Go through all System preferences panes and see what works for you. Take your time to see what’s there, it pays off.
  • I disabled Location services, because I use VPNs a lot and then Location Services get totally confused.
  • Enable sending/receiving SMS and calls on OS X — a killer Apple feature for me.
  • Disabled Document Handoff because I don’t want all my docs in the cloud by default.
  • On a MacBook 12″ moving the Dock to the right makes the most sense in my eyes.
  • Set a nice “return for reward” message to be displayed on Locked screen. Something along the lines of “If you have found this laptop, please call me on MY NUMBER or send me an email to MY EMAIL and get a sweet reward! Thanks!”
  • Check Require an administrator password to access system-wide preferences. Doh.
  • Turn on FileVault and Firewall. Double-doh.
  • Firewall -> Advanced -> enable Stealth Mode. Though need to remember to turn it off when diagnosing network problems.

Finder preferences:

  • Show extensions.
  • When performing a search: Search the Current Folder, otherwise it searches the entire computer by default and almost kills Finder.
  • New Finder windows show: my home folder. I hate the “All My Files” default view. Absolutely hate it.

Various tools and apps:

  • Resilio Sync: fantastic app for sharing files among team members, based on bittorrent.
  • Slack: team communication, we use it religiously.
  • Crypho: secure team communication. I’m looking forward to the day when we can replace Slack with Crypho, so we have all communication secure, but as it is, Slack is just way more convenient for everyone to use.
  • LittleSnitch: allow/disable connections per app/port/protocol/address. Fantastic to prevent apps from contacting ads/tracking services and getting more insight into what goes on in the background.
  • Alfred: great productivity app, “replaces” Spotlight and then some!
  • Bartender: get that Menu Bar under control!
  • Flux: same as Redshift on Linux, adjusts screen colours for late night hacking sessions.
  • AppTrap: automatic cleanup of files that apps leave lying around after you delete them.
  • iStat Menus: to always be able to see what my system is doing at a glance.
  • Seashore: GIMP/Photoshop clone with a Mac-style UI. But seems an abandoned project, need to find a replacement …
  • Calibre: eBook management.
  • iBank: keeping my finances in check.
  • LibreOffice. And removed Apple’s Numbers & Pages.

Development environment:

  • Homebrew: the quintessential package manager for OS X.
  • Twitter: funny as it sounds, but Twitter is a great way to stay on top of latest patches/releases/news in tech.
  • Colloquy: a lot of Open Source still happens on IRC and this is how I keep in touch.
  • Chrome: been using it a few years now for browsing and development, but I want to switch back to Firefox soon. Extensions I cannot live without: BackStop, The Great Suspender, Send to Kindle, StayFocusd and Full Page Screen Capture.
  • Tunnelblick: the OS X OpenVPN client.
  • ExtFS for Mac: so I am able to mount ExtFS volumes (Linux drives, Raspberry PI SD cards, etc.)
  • pgAdmin3 and pgweb: admin interfaces for PostgreSQL, lately pgweb sees way more usage than pgAdmin3. Also sqlite browser for SQLite.
  • dotfiles: I keep a private git repo with all my “dotfiles” so history is tracked.
  • travis-cli & heroku-cli: working with Travis and Heroku from the comfort of the terminal window.
  • Vagrant: for simple virtualization needs, when I want to test out something without polluting my main environment.
  • Shush: a vital tool for any remote worker, to keep unwanted background noise from polluting teleconferencing.
  • Sublime Text: I’ve been a TextMate user for quite a while but I jumped ship when I saw how much faster ST is. That was years ago and I’m sticking with ST for now, got used to it and it works for me. I did migrate to ST3 recently though. The list of plugins I use:
    • GitGutter
    • SideBar Enhancements
    • Requirements Txt
    • Color Highlighter
    • CSS3
    • jQuery
    • SublimeLinter
    • SublimeLinter-annotations
    • SublimeLinter-pydocstyle (sudo pip2/3 install pydocstyle)
    • SublimeLinter-flake8 (sudo pip2/3 install flake8)
    • SublimeLinter-jshint (npm install -g jshint)
    • SublimeLinter-shellcheck (brew install shellcheck)
    • SublimeLinter-pyyaml (sudo pip3 install pyyaml)
    • SublimeLinter-json
    • BracketHighlighter
    • Jedi – Python Autocompletion
    • theme: SoDaReloaded Light.sublime-theme
    • pdb snippet: https://gist.github.com/phalt/72117041fbb7cf4c4697
    • starting ST from the current dir in console by typing subl -n .: https://www.sublimetext.com/docs/2/osx_command_line.html

Lessons learned from EuroPython 2016

This was my first EuroPython conference and I had high expectations because I had heard a lot of good things about it. I must say that overall it didn’t let me down. I learned several new things and met a lot of new people. So let’s dive straight into the most important lessons.

On Tuesday I attended “Effective Python for High-Performance Parallel Computing” training session by Michael McKerns. This was by far my favorite training session and I have learned a lot from it. Before Michael started with code examples and code analysis he emphasized two things:

  1. Do not assume what you hear/read/think. Time it and measure it.
  2. Stupid code is fast! Intelligent code is slow!

At this point I knew that the session was going to be amazing. He gave us a GitHub link (https://github.com/mmckerns/tuthpc) where all the examples with profiler results are located. He stressed that we shouldn’t believe him and that we should test them ourselves (lesson #1).

I strongly suggest cloning his GitHub repo (https://github.com/mmckerns/tuthpc) and testing those examples yourself. Here are my quick notes (TL;DR):

  • always compile regular expressions
  • use local variables (true = True, local = GLOBAL)
  • if you know how many elements there will be in your list, create it with None elements and then fill it (L = [None] * N)
  • when inserting items at index 0 in a list, use append and then reverse (insert is O(n) per item, append is O(1))
  • use built-in functions, use built-in functions, use built-in functions!!! (they are written in C)
  • when extending a list use .extend() and not +
  • searching in a set (hash map) is a lot faster than searching in a list (O(1) vs O(n))
  • constructing a set is much slower than constructing a list, so you usually don’t want to transform a list into a set just to search in it, because it will be slower overall. But again, you should test it
  • += doesn’t create a new instance of an object, so use it in loops
  • a list comprehension is better than a generator. A for loop is better than a generator and sometimes also better than a list comprehension (you should test it!)
  • importing is expensive (e.g. numpy takes 0.1 sec)
  • switching between Python arrays and numpy arrays is very expensive
  • if you start writing intelligent and complex code you should stop and think about whether there is a more stupid way of achieving your goal (see lesson #2)
  • optimize the code you want to run in parallel. This is more important than just running it in parallel.
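In the spirit of lesson #1, here is a small sketch of how you might time two of these claims yourself with the standard timeit module. The sizes and function names are arbitrary choices for this illustration, and the numbers you get will depend on your machine and Python version.

```python
import timeit

N = 2000


def build_with_insert():
    items = []
    for i in range(N):
        items.insert(0, i)      # shifts every existing element each time
    return items


def build_with_append_reverse():
    items = []
    for i in range(N):
        items.append(i)         # cheap append at the end
    items.reverse()             # one reversal at the very end
    return items


haystack_list = list(range(N))
haystack_set = set(haystack_list)

timings = {
    "insert(0, x)": timeit.timeit(build_with_insert, number=20),
    "append + reverse": timeit.timeit(build_with_append_reverse, number=20),
    "x in list": timeit.timeit(lambda: (N - 1) in haystack_list, number=2000),
    "x in set": timeit.timeit(lambda: (N - 1) in haystack_set, number=2000),
}

for label, seconds in timings.items():
    print("{:>18}: {:.4f} s".format(label, seconds))
```

Both builder functions produce the same list, so the comparison is fair. Only the time they take differs.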

Threading and multiprocessing:

  • you should always measure whether (and when) threading/multiprocessing is actually faster. If you are using simple functions it will probably be slower
  • in parallel computing you need to catch and log errors
  • in parallel computing you always want your functions to return a value
  • in parallel computing you never want your code to “die”. Always try to return a reasonable default value even if an exception is raised. Slightly wrong is better than not getting an answer!
  • when using threading/multiprocessing use .map(), and if you don’t care about the order use .imap_unordered(). It is the fastest because it returns the first available value.
  • if you have a stop condition use .imap_unordered()
  • be aware of random module problems. The random seed gets copied to all processes, so the result is “random doesn’t work”. You need to create a random_seed function and ensure that each process is in a different random state.
  • is there any general rule for when to use threads and when to use multiprocessing? Use threads if you have light jobs (i.e. they execute in 0-1 sec)
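A minimal sketch of several of these points together: the worker catches and logs its own errors, always returns a value, and is run with both .map() and .imap_unordered(). It uses ThreadPool from the standard library so that it runs anywhere without pickling concerns. The function safe_inverse and the fallback value are assumptions made for this example and are not code from the training.

```python
import logging
from multiprocessing.pool import ThreadPool

DEFAULT = 0.0   # hypothetical "reasonable default" returned on failure


def safe_inverse(x):
    """Never let the worker die: log the error and return a default."""
    try:
        return 1.0 / x
    except ZeroDivisionError:
        logging.exception("inverse failed for %r", x)
        return DEFAULT


inputs = [1, 2, 0, 4]

with ThreadPool(processes=4) as pool:
    # map() keeps the results in input order.
    ordered = pool.map(safe_inverse, inputs)
    # imap_unordered() yields each result as soon as it is available.
    unordered = list(pool.imap_unordered(safe_inverse, inputs))

print(ordered)
print(sorted(unordered))
```

multiprocessing.Pool offers the same .map() and .imap_unordered() methods for process-based workers.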

Another interesting talk was about code review (“Another pair of eyes: Reviewing code well” by Adam Dangoor). He pointed out that one of the most important purposes of code review is to share knowledge. When you review other people’s code you learn a lot, especially if you take your time and really try to understand what the author was trying to achieve. It is also recommended to always say something nice about the code, especially when reviewing the code of a junior developer. And when you think that the code you are reviewing has a bug, write a test that proves it.

EuroPython 2016 was really an amazing experience that every Python developer/scientist should experience. I’m really looking forward to EuroPython 2017!

Setuptools – run custom code in setup.py

A week or so ago I started developing an experimental Python package for one of our projects. At some point I realized that it would be convenient to automatically execute some additional initialization code during the package installation process (i.e. when “python setup.py install” is run).

This can be achieved by subclassing the setuptools.command.install class and overriding its run() method, like this (in setup.py):

from setuptools import setup
from setuptools.command.install import install


class CustomInstallCommand(install):
    """Customized setuptools install command - prints a friendly greeting."""
    def run(self):
        print("Hello, developer, how are you? :)")
        install.run(self)


setup(
    ...

NOTE: We reference the parent class’ run method directly. In Python 2 we can’t use super(…).run(), because setuptools commands are old-style Python classes there and super() does not support them.

Now that we have a customized install class, we must tell the setuptools machinery to actually use it instead of the built-in version. We do this through the cmdclass parameter of the setup() function:

...

setup(
    ...

    cmdclass={
        'install': CustomInstallCommand,
    },

    ...
)

The value of the cmdclass parameter should be a dictionary whose keys are the names of the setuptools commands we’re customizing (‘install’ in our case), while the corresponding values are the custom command classes we defined earlier (CustomInstallCommand in this example).

BONUS

Sometimes you will want to apply the same modification to more than a single command class. For instance, your package could also be installed in development mode (by running python setup.py develop), meaning that the setuptools.command.develop class should be overridden as well in order for your modifications of the installation procedure to have any effect in this scenario, too.

A straightforward approach would be to implement another class (e.g. CustomDevelopCommand) similar to the existing CustomInstallCommand class, but this would violate the DRY principle (“don’t repeat yourself”). What you can do instead is define a decorator which accepts a command class as a parameter, modifies its run() method and returns the modified version of the class.

Here’s an example:

from setuptools import setup
from setuptools.command.develop import develop
from setuptools.command.install import install


def friendly(command_subclass):
    """A decorator for classes subclassing one of the setuptools commands.

    It modifies the run() method so that it prints a friendly greeting.
    """
    orig_run = command_subclass.run

    def modified_run(self):
        print("Hello, developer, how are you? :)")
        orig_run(self)

    command_subclass.run = modified_run
    return command_subclass

...

@friendly
class CustomDevelopCommand(develop):
    pass

@friendly
class CustomInstallCommand(install):
    pass


setup(
    ...

It’s very simple – we just replace the run() method of a command class with our customized version of it and then apply the decorator where necessary. If we later need to replace the greeting with something different, we only have to change the code in one place.

NOTE: Do not forget to provide the right value of the cmdclass parameter to the setup() function.

By the way – you might be looking at the decorator code and wondering why we explicitly store a reference (‘orig_run’) to the original run method. The reason is we can’t simply call command_subclass.run() in modified_run function directly, because that would cause an infinite loop!
Just look at the code carefully: at the end of the decorator, command_subclass.run becomes a reference to modified_run. If modified_run then calls command_subclass.run(self) in its body, it actually calls itself, again and again and again, until the maximum recursion depth is exceeded. Explicitly storing a reference to the original run() method is thus not redundant at all, it’s simply necessary.
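You can see the difference without running an installation at all. The sketch below uses a plain stand-in class instead of a real setuptools command. The names Command and broken_friendly are invented for this demonstration, and run() returns a list only so that the effect is easy to inspect.

```python
class Command:
    """Stand-in for a setuptools command class (hypothetical, for illustration)."""

    def run(self):
        return ["original run"]


def friendly(command_class):
    orig_run = command_class.run          # captured BEFORE run is replaced

    def modified_run(self):
        return ["greeting"] + orig_run(self)

    command_class.run = modified_run
    return command_class


def broken_friendly(command_class):
    def modified_run(self):
        # run is looked up at call time, when it already points to modified_run.
        return ["greeting"] + command_class.run(self)

    command_class.run = modified_run
    return command_class


@friendly
class GoodCommand(Command):
    pass


@broken_friendly
class BadCommand(Command):
    pass


good_result = GoodCommand().run()

try:
    BadCommand().run()
    recursed = False
except RecursionError:
    recursed = True

print(good_result)
print(recursed)
```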