
Strings in Python 2 and Python 3

The goal of this post is to show you how to properly use encode and decode in Python 2 and Python 3. It is built around small examples that will (hopefully) help you better understand how strings work in both versions.

A bit of background on Unicode and UTF-8:

Unicode has a different way of thinking about characters. In Unicode, the letter “A” is a platonic ideal: it just floats in “heaven”. Every platonic letter in every alphabet is assigned a magic number by the Unicode consortium, written like this: U+0639 (in Python: “\u0639”).

UTF-8 is a system for storing your string of Unicode code points (those magic “U+number” values) in memory using 8-bit bytes.
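
For example, here is the code point U+0639 mentioned above, first as a plain number and then stored as UTF-8 bytes (Python 3):

>>> ord('\u0639')  # the integer value of the code point
1593
>>> '\u0639'.encode('utf-8')  # the same character as 8-bit bytes
b'\xd8\xb9'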

One of the common questions about Python 3 is when to use bytestrings and when to use string objects. When you are manipulating text (e.g. “reversed(my_string)”) you should always use string objects and never bytestrings. Why? Here is an example:

>>> my_string = "I owe you £100"
>>> my_bytestring = my_string.encode()
>>> print(''.join([c for c in reversed(my_string)]))
001£ uoy ewo I
>>> print(''.join([chr(c) for c in reversed(my_bytestring)]))
001£Â uoy ewo I

The first print is what we expect, but the second is not. Why is that? The “reversed” function iterates over a sequence, which in the second case is the bytestring b'I owe you \xc2\xa3100'. We can verify this by checking the lengths of “my_string” and “my_bytestring”:

>>> print(len(my_string))
14
>>> print(len(my_bytestring))
15

If I always just add “.encode()” everything will be fine, right? No! For a start, you should never call encode without specifying which encoding to use, because then the interpreter picks one for you: in Python 3 the default is UTF-8, but in Python 2 it is ASCII, and you can spend a lot of time hunting down the resulting bugs. So ALWAYS specify which encoding to use (e.g. “.encode('utf-8')”). Example:

>>> print('I owe you £100'.encode('utf-8').decode('latin-1'))
I owe you Â£100

The other, even bigger problem with “sprinkling” “.encode()” everywhere is that if you already have an encoded string you will get an error (in Python 3) or, even worse (in Python 2), you will silently do string operations on a bytestring.
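
For example, in Python 3, calling encode on something that is already encoded fails immediately:

>>> data = 'I owe you £100'.encode('utf-8')
>>> data.encode('utf-8')
Traceback (most recent call last):
  ...
AttributeError: 'bytes' object has no attribute 'encode'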

In Python 2, “str” is for strings of bytes and “unicode” is for strings of Unicode code points. The problem is that Python 2 implicitly converts between the two types… sometimes. It allows things like this:

>>> print((u'I owe you £100'.encode('utf-8') + 'Plus another $100').decode('latin-1'))
I owe you Â£100Plus another $100

This will quickly raise an error when “Plus another $100” becomes something that is not ASCII. If you try this in Python 3 you get “TypeError: can’t concat bytes to str”.
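
Here is a minimal Python 2 sketch of how that failure looks: concatenating UTF-8 encoded bytes with a unicode string containing a non-ASCII character forces an implicit ASCII decode of the bytes, which fails.

>>> u'I owe you £100'.encode('utf-8') + u'Plus another €100'
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 10: ordinal not in range(128)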

If you need your code to run on both Python 2 and Python 3, a rule of thumb is to first write the code for Python 3 and then try it in Python 2.

Lessons Learned from PyMunich 2016

At the end of October there was a Python conference in Munich (PyMunich). For a regional conference it was quite big in my opinion: there were 3 tracks and more than 40 speakers.

As always, I won’t cover all the talks, just the ones that I found the most interesting and educational. After all, this is the biggest reason why I go to these conferences.

The first talk I attended was by Dmitry Trofimov. He talked about profiling (“Profiling the unprofilable”). There are two approaches to profiling your code, and it is important to know them both so you can choose the right one. The first is statistical (sampling) profiling (e.g. vmprof) and the second is deterministic profiling (e.g. cProfile). For more details about the differences I strongly suggest doing some research on your own.
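
To make the difference concrete, here is a minimal sketch of the deterministic approach using the standard-library cProfile (the busy() function is just a made-up workload):

import cProfile

def busy():
    # An arbitrary CPU-bound workload to profile.
    return sum(i * i for i in range(10**6))

# Deterministic profiling: cProfile hooks every function call and
# reports exact call counts and timings (at some runtime overhead).
cProfile.run('busy()')

A statistical profiler such as vmprof instead samples the call stack at regular intervals, which adds far less overhead but yields approximate results.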

When you need to optimize your code you should be aware of the optimization levels. Developers often want to be smart and go straight into optimizing their algorithms, but this doesn’t have the biggest impact. Design (architecture) has the biggest impact on performance, so that should be your main focus. After that you can start looking at algorithms and data structures, and only at the end at line-level profiling. See the “Effective Python” section in the lessons learned from EuroPython 2016 blog post for more details on this.

You can also use Cython for even better optimization (i.e. when you would otherwise need to write C code), but in most cases this isn’t necessary when building for the web because the bottleneck is network I/O. Stefan Behnel had a great talk (“Getting Native with Cython”) in which he showed us how easy it is to write pure Python code and then compile it with Cython. If you have performance issues and you have already done all the optimization you could think of, I strongly suggest trying Cython. I realise that it is probably harder than Stefan made it look, but it is still worth looking into in my opinion.
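
To give a feel for this, here is a sketch: the file below is plain Python, and compiling it with Cython’s cythonize tool (e.g. `cythonize -i fib.py`, assuming Cython is installed; the file name and function are made up) typically speeds up such tight loops without changing a line of code.

# fib.py -- plain Python, no Cython-specific syntax needed
def fib(n):
    # Iterative Fibonacci: a tight numeric loop that benefits from compilation.
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a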

Encryption is awesome. We all like it, but are we all using it? I admit that I don’t have it on my site, but I should. And now, with the Let’s Encrypt Certificate Authority, there are no more excuses for not using encryption on your site(s). Markus Holtermann, who gave a talk about SSL encryption (“SSL all the things”), pointed out a few things that we should probably all know (a small Python sketch of the protocol advice follows the list):

  • SSL 2 and 3 are broken, so don’t use them,
  • also don’t use TLS 1.0/1.1,
  • get a fresh certificate every 90 days,
  • disable HTTP (redirect to HTTPS) because it can leak information you don’t want.
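
Here is a minimal sketch of what that protocol advice can look like with Python’s standard ssl module (the certificate paths are hypothetical):

import ssl

# Negotiate the best protocol version both sides support...
ctx = ssl.SSLContext(ssl.PROTOCOL_SSLv23)
# ...but explicitly switch off the broken/legacy versions.
ctx.options |= ssl.OP_NO_SSLv2 | ssl.OP_NO_SSLv3
ctx.options |= ssl.OP_NO_TLSv1 | ssl.OP_NO_TLSv1_1
# Hypothetical paths to a Let's Encrypt certificate and key.
ctx.load_cert_chain(certfile='fullchain.pem', keyfile='privkey.pem')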

There are many open source tools that can help you achieve nice and tidy encryption on your site. One of them is `acme-tiny` (https://github.com/diafygi/acme-tiny). It is a very small script (less than 200 lines), which means you can easily read every line of the code, and you should, because you need to trust this tool with your private keys.

The last talk I want to mention was by far my favourite one. Philip Bauer showed us how to debug like a pro (“Debug like a pro. How to become a better programmer through pdb-driven development“).

His bread-and-butter tool is `pdbpp` (aka `pdb++`), which is a drop-in replacement for `pdb`. This means that you create breakpoints just like with pdb, but if you have pdb++ installed it automatically gets used instead.
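
For instance, the usual way of setting a breakpoint keeps working unchanged:

# Anywhere in your code, set a breakpoint the classic way:
import pdb; pdb.set_trace()
# With pdbpp installed, this same line drops you into pdb++ instead,
# with extras such as syntax highlighting and the `ll` long-list command.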

Here are the basic commands for pdb that Philip highlighted:

  • l[ist] (list source code of the current file)
  • n[ext] (continue execution until the next line)
  • s[tep] (execute the current line, stop at the first possible occasion)
  • r[eturn] (continue execution until the current function returns)
  • c[ontinue] (continue execution, only stop when a breakpoint is encountered)
  • w[here] (show the stack trace, most recent frame at the bottom)
  • u[p] (move up the stack)
  • d[own] (move down the stack)
  • b[reak] (set a new breakpoint; `tbreak` for temporary breakpoints)
  • a[rgs] (print the argument list of the current function)

The nice thing about pdbpp is that it has a long-list command (`ll`) which displays the whole function you are in (note: ipdb also has a long-list command).

Other Python debugging tricks you should know about are:

  • use ? for getting additional information about a lib/class/function/… (e.g. os?)
  • use ?? for displaying the source code of the lib/class/function you want to inspect (e.g. os.path.join??)
  • pp (pretty-print) is already in pdb so you should always use it
  • pp locals() will pretty-print local variables

One of the best tricks is the `help` function, which accepts an object and returns a generated help page for it. The !help(obj.__class__) command will generate a help page containing all the methods (including class methods and static methods) with docstrings, the method resolution order, data descriptors, attributes, inherited attributes and much more.

Note: The reason you need to put ! before the help function in pdbpp/ipdb is that if you don’t, you will call the pdbpp/ipdb internal help function, which is not the Python built-in help function.
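
Outside the debugger you can see the kind of page help generates by trying it on a toy class (a made-up example):

class Account:
    """A toy class, used here only to demo the built-in help."""

    def __init__(self, owner):
        self.owner = owner

    def deposit(self, amount):
        """Add `amount` to the account balance."""

help(Account)  # prints the methods with docstrings, the MRO, inherited attributes, ...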

You can also use the --pdb option when running unit tests with pytest or nosetests, and this will drop you into pdb whenever a test fails or errors. From there you can write the code that will make the test pass, copy-paste that code into your file, and you are done. This is the basic principle of Test-Driven / Debug-Driven Development (TDD).
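
A sketch of that workflow with pytest (the test file and function are made up):

# test_refunds.py -- run with: pytest --pdb test_refunds.py
def test_refund_total():
    # refund_total() doesn't exist yet, so this test errors and
    # pytest drops us straight into the debugger at the failure.
    assert refund_total([10, 20]) == 30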

Any questions? Send us an email.

Writing The Docs – Prague 2016

On September 19th and 20th the Write the Docs meeting took place in Prague. This year I had the pleasure to attend. More than 250 people came, which is about 40% more than last year. To my surprise, the majority of the people were actual tech writers, or ‘documentarians’ as they call themselves (though several talks pointed out that they don’t really have a good, recognisable name).

All of the speakers were tech writers, so there wasn’t much overlap with actual coding or software development. Nonetheless, there were quite a few tips that I picked up.

One of the first talks that I found interesting was about writing as a non-native speaker (by Szabó István Zoltán, aka Steve). Although his main focus was on language differences and how they affect a non-native speaker who needs to write something (e.g. documentation), he also gave a few tips on how to write in general. He said you should “Write drunk; edit sober”, which I think is a very interesting idea (unfortunately I’m not drunk while writing this). The more technical suggestion was to first do the writing part and then the editing part. In the writing part you should:

  1. Create the structure,
  2. estimate the word count,
  3. focus on the flow; don’t mind the grammar.

And in the editing part you should:

  1. Self-edit,
  2. run grammar checkers,
  3. read it out loud,
  4. send it to the editor.

He also suggested that in order to improve your language skills you should start writing your own blog, attend IRL events / conferences and, of course, read lots of books.

Another interesting talk was about screenshots. Apparently there are some people who can’t create proper screenshots, so tech writers must make them themselves. The speaker suggested that we should have a screenshot policy in our style guide, which I think is actually a good idea.

There was also one talk about documentation quality and what documentation quality actually is. We can distinguish structural and functional quality. Structural quality is about grammar, style, navigation, etc. Functional quality answers the following questions:

  1. Does it do what it’s supposed to do?
  2. Does it satisfy requirements?
  3. Does it achieve what it sets out to achieve for users?

Functional quality brings value, and this should be our goal. Our goal as technical writers shouldn’t be text full of fancy words that make it harder for the average reader to understand. It should be plain and simple, so that it transports the information as quickly as possible. We should be aware that every sentence has a “user journey”: when you read a sentence, your brain needs to interpret its information, and the more complex the sentence is, the harder that is for your brain. We usually label this as “light” (e.g. comics) and “heavy” (e.g. Tolstoy’s War and Peace) reading. Let us look at this sentence: “The defendant examined by the lawyer was unreliable”. When we read it, we first think that the defendant was doing the examining, but after we read the whole sentence our brain realizes that it was actually the lawyer doing the examining. This is called “temporary ambiguity”.

The last talk I want to mention is about (mental) checklists. I found this talk interesting because people (including me) often confuse checklists with todo lists, but they are very different: checklists contain tasks that are repeatable. Checklists are used on a daily basis in a lot of industries. E.g. when an airplane pilot takes off or lands a plane, he uses a checklist to verify that he did everything he is supposed to do before taking off or landing.

We have two types of checklists: read-do and do-confirm. They are pretty self-explanatory so I will not discuss the differences here. One thing about checklists we should always keep in mind is that if we are using them, we should keep them up to date. One of the worst things that can happen to a checklist is that it contains incorrect items while the current process actually checks for different things (because we found a better way of doing things but didn’t update the checklist). That is why I still prefer automation over checklists (e.g. when deploying, I prefer to use TravisCI).

To wrap up this post I would like to mention a comment that someone at the conference (unfortunately I don’t remember his name) made about GitHub projects. He said that when you are looking at a GitHub project, one of the most important things about it is the README file. We then discussed how many README files are incomplete and/or not updated. In fact, every user that opens a GitHub project will first look at the README file, and if he doesn’t see it, there is little chance that he will actually look into the source code (you know this is true).

So to help you out, here is a nice README.rst template that you can use for your projects:

$project
========

$project will solve your problem of where to start with documentation,
by providing a basic explanation of how to do it easily.

Look how easy it is to use::

    import project
    # Get your stuff done
    project.do_stuff()

Features
--------

- Be awesome
- Make things faster

Installation
------------

Install $project by running::

    install project

Contribute
----------

- Issue Tracker: github.com/$project/$project/issues
- Source Code: github.com/$project/$project

Support
-------

If you are having issues, please let us know.
We have a mailing list located at: [email protected]

License
-------

The project is licensed under the BSD license.

And if you want to read more about writing documentation, this is a good place to start. I must warn you that if you choose the blue pill there is a high chance that you will start writing better documentation :).