measuring puppet
2013-01-24
For one of work's projects, we'll soon be working on scaling our platform more, which will require deploying a bunch more machines. For this project, we basically have a local dev platform, an online dev platform, a preproduction platform, and a production platform.
Right now, all these platforms have exactly one host. There are two puppetmasters, one for the dev platforms and one for the pre/pro platforms.
Since deploying a bunch more machines is going to require a lot more puppet running, I want to remove as much friction as I can from my puppet work. I do the runs manually as we upgrade platforms during deployment, and a run typically takes well over a minute. For me, that's too long: it makes me waste time, lose focus, switch tasks, and forget I should be following up on puppet runs. It also makes fine-tuning puppet modules a chore as I hack on them.
So I wanted to start by trimming some of the obvious fat before I segment my puppet config into separately testable pieces. I would have expected puppet apply to actually have something to help with that, but it doesn't.
After thinking it through, I realized I wanted some kind of tool that would timestamp the output of puppet apply --debug so I could see which steps take more time than others.
I wasn't sure what to google for, but "timestamp stdout" brought up some results, and I hit on http://joeyh.name/code/moreutils/, which includes a tool called 'ts': a simple pipe filter that timestamps lines going to stdout.
That was almost good enough. What I really wanted, though, was to know how much time elapsed since the last line was printed. My Perl is rusty, but I managed to quickly cook up a patch that makes it print incremental timestamps.
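If you don't want to rebuild moreutils, the same idea can be approximated with plain gawk, albeit at one-second precision only (a rough sketch, not what I actually use):

# rough gawk stand-in for 'ts -i': prefix each line with the whole seconds
# elapsed since the previous line (the ts patch does microseconds)
puppet apply manifests/site.pp --debug | gawk '
  BEGIN { prev = systime() }
  { now = systime(); printf "%3ds %s\n", now - prev, $0; prev = now }'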
Now I can do a puppet run like this:
puppet apply --modulepath=`pwd`/modules:`pwd`/dev/modules manifests/site.pp --debug | ts -i "%H:%M:%.S" | egrep -B 1 "^00:00:0[^0]"
00:00:00.001066 debug: Executing 'test -e ../commit && ( test xorigin/master == `cat ../commit` || ( git fetch -a; test `git rev-parse --verify origin/master^0 | head -n 1` == `cat ../commit` ) )'
00:00:02.646908 debug: Service[postfix](provider=redhat): Executing '/sbin/service postfix status'
--
00:00:00.000987 debug: Executing 'test -e ../commit && ( test xorigin/release-1.4.x == `cat ../commit` || ( git fetch -a; test `git rev-parse --verify origin/release-1.4.x^0 | head -n 1` == `cat ../commit` ) )'
00:00:01.871258 debug: Exec[git-checkout-/var/www/partner-test](provider=posix): Executing check 'test -e ../commit && ( test xorigin/release-1.4.x == `cat ../commit` || ( git fetch -a; test `git rev-parse --verify origin/release-1.4.x^0 | head -n 1` == `cat ../commit` ) )'
00:00:00.000942 debug: Executing 'test -e ../commit && ( test xorigin/release-1.4.x == `cat ../commit` || ( git fetch -a; test `git rev-parse --verify origin/release-1.4.x^0 | head -n 1` == `cat ../commit` ) )'
00:00:01.886606 debug: Prefetching aliases resources for mailalias
--
00:00:00.000957 debug: Executing '/usr/sbin/semanage fcontext -l | grep -q '^/home/git/dev(/.*)?''
00:00:01.750281 debug: /Stage[main]/Dexter::Apache/Selinux::Set_fcontext[home-httpd]/Exec[semanage-/home/git/dev-httpd_sys_content_t]/unless: /usr/sbin/semanage: Broken pipe
--
00:00:00.000855 debug: Executing 'test -e ../commit && ( test xorigin/release-1.4.x == `cat ../commit` || ( git fetch -a; test `git rev-parse --verify origin/release-1.4.x^0 | head -n 1` == `cat ../commit` ) )'
00:00:02.064475 debug: /Schedule[puppet]: Skipping device resources because running on a host
--
00:00:00.001048 debug: Executing '/usr/sbin/semanage fcontext -l | grep -q '^/srv/merchant(/.*)?''
00:00:01.750129 debug: /Stage[main]/Partner::Install/Selinux::Set_fcontext[srv-merchant-httpd]/Exec[semanage-/srv/merchant-httpd_sys_content_t]/unless: /usr/sbin/semanage: Broken pipe
--
00:00:00.000861 debug: Executing 'test -e ../commit && ( test xmaster == `cat ../commit` || ( git fetch -a; test `git rev-parse --verify master^0 | head -n 1` == `cat ../commit` ) )'
00:00:01.841316 debug: Exec[git-checkout-/var/www/merchant](provider=posix): Executing check 'test -e ../commit && ( test xorigin/release-1.4.x == `cat ../commit` || ( git fetch -a; test `git rev-parse --verify origin/release-1.4.x^0 | head -n 1` == `cat ../commit` ) )'
00:00:00.000955 debug: Executing 'test -e ../commit && ( test xorigin/release-1.4.x == `cat ../commit` || ( git fetch -a; test `git rev-parse --verify origin/release-1.4.x^0 | head -n 1` == `cat ../commit` ) )'
00:00:01.858206 debug: Service[httpd](provider=redhat): Executing '/sbin/service httpd status'
Some explanation is in order.
The puppet apply part is straightforward if you know puppet a little: it applies a manifest and spits out a lot of debug info.
The output gets piped into ts, which will do incremental timestamping (with -i, which is what my patch adds) according to the specified format (ts by default uses seconds precision, but can do microsecond precision if you use %.S in the format).
Then I grep for all lines that took at least 1 second to appear, and display the preceding line as well (puppet generates output either right before or right after a possibly long-running task, so the culprit is either the line that took too long or the one before it).
In the first section, I doubt service postfix status is to blame, so it's probably my convoluted git updating that takes too long. I need to rework that module so it doesn't fetch on every run.
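One possible rework (just a sketch, and ${target_commit} is a hypothetical parameter): pass the exact commit to deploy into the module, so the check becomes a cheap local comparison and the fetch only happens when an update is actually needed:

# hypothetical rework: the unless check is now purely local, and
# 'git fetch' only runs inside the command when a deploy is due
exec { 'git-checkout-/var/www/merchant':
  cwd     => '/var/www/merchant',
  path    => ['/usr/bin', '/bin'],
  command => "git fetch -a && git checkout ${target_commit} && echo ${target_commit} > ../commit",
  unless  => "test x${target_commit} == x`cat ../commit`",
}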
In the third section, semanage is to blame. Hm, maybe I need to find a different way to look up whether the particular fcontext rule I want to add is already there. I've considered converting it to facts, although that sounds like it would be stretching facts a little - that's a lot of info to store in a fact.
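One cheaper option I can think of (a sketch, with a made-up cache path): run semanage once per puppet run and have every unless grep the cached copy instead:

# hypothetical: dump the fcontext list once per run...
semanage fcontext -l > /var/tmp/fcontexts.cache
# ...then each check greps the fast local cache instead of semanage
grep -q '^/srv/merchant(/.*)?' /var/tmp/fcontexts.cache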
The others are repeats of both, so I know where to start trimming the fat now!
And once all the items over 1 second are gone, it's time to shave off more below that.
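Lowering the bar is just a matter of loosening the grep; for instance, everything that took at least 0.1 seconds:

puppet apply --modulepath=`pwd`/modules:`pwd`/dev/modules manifests/site.pp --debug \
  | ts -i "%H:%M:%.S" | egrep -B 1 "^00:00:0[^0]|^00:00:00\.[1-9]"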
If you want to try out ts with incremental timestamping, it's available in the rebuilt moreutils rpm in my package repositories for CentOS 6 and F16/17/18.
If any puppetmaster (hah!) has good tips on how to debug and measure the catalog generation step (the one on the master), let me know!
Nice, going to have to try this out as well.
For semanage fcontext, I’ve put up my puppet class here: https://gist.github.com/4628864
Yes, it abuses grep but as you’ve covered, I’ll gladly accept the maintenance burden in exchange for not doing multiple 5-second-per-context checks.
Comment by Stijn Hoop — 2013-01-24 @ 23:37
Yeah, pretty much have the same thing, including the comment about the ‘right’ way to do it :)
Comment by Thomas — 2013-01-25 @ 18:38
Thomas,
Take a look at the output of using --evaltrace on your puppet runs.
This *should* provide you with all of the information that you need.
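For example, just tack it onto the command you already have (untested against your exact setup):

puppet apply --modulepath=`pwd`/modules:`pwd`/dev/modules manifests/site.pp --debug --evaltrace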
Also, if you have a lot of Service objects, they will take quite a bit of time and there’s not much you can do about it. Additionally, I’ve found that Exec and Service *seem* to use the same execution queue and therefore moving your Exec statements to native types can really help your execution time. See the puppetlabs-stdlib code for examples.
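For instance, stdlib's file_line type covers the classic grep-and-append Exec pattern (illustrative values, obviously):

# ensure a line is present in a file, without an Exec wrapping grep/echo
file_line { 'proxy-env':
  path => '/etc/environment',
  line => 'http_proxy=http://proxy.example.com:3128',
}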
Finally, get rid of any recursive File statements working on large directories as they have been shown to be massively inefficient.
Good Luck!
Comment by Trevor Vaughan — 2013-01-25 @ 00:40
Thanks for all the tips! Trying --evaltrace as we speak.
OK, I see what it is doing. The git commands score high there. My biggest surprise is a bunch of File resources taking around 0.4 seconds to evaluate; e.g.
Any idea why that would be? That seems really slow for a simple resource!
Comment by Thomas — 2013-01-25 @ 18:33
Hi there,
I read your post quickly, but… I have a few comments, and hopefully they will be helpful to you (they’re free anyways):
1) AFAICT, when puppet is running something, it actually prints the output of what it *just* ran *after* it ran it. I always found this a little stupid, because some of my puppet scripts have potentially very long running ‘exec’s, and I only know which one was taking forever *after* it finishes. So I have to use ps and other tools to debug when needed. This might be useful in case you think one statement is taking timelist[x] seconds when it’s actually taking timelist[x-1]… Right?
2) ‘puppet apply’ has a --tags argument which you can use to specify partial amounts of puppet code to run. I think. I don’t actually use this, although it might be useful, and if you like, feel free to test it out and let me know how it went. Sadly puppet doesn’t have properly working manpages (at least for me) and I think they’ve started keeping it all on their website ever since the license change away from the GPL. (http://docs.puppetlabs.com/man/apply.html) Hopefully I’m wrong about their intentions, because I use and love puppet a lot.
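If it works the way I think it does, usage would be something like this (class names double as tags):

# apply only resources tagged 'apache', skip everything else
puppet apply --tags apache manifests/site.pp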
3) Personally when I started using puppet, I was annoyed because my programs would normally compile or just run (eg python) right away! Now I have to sit there for 30 seconds. My solution, and I don’t mean this in a rude way: I got more patient! Now these sorts of things are more appropriate again: http://xkcd.com/303/
Hope this helped!
James
PS: In case it interests you, I’m actually working on puppet code and infrastructure to deploy GNOME and related components. It’s all going to be Free Software when I finish a v0.1 I’ll be writing about it on my blog.
Comment by James — 2013-01-25 @ 04:25
Hi,
1) in normal mode, yes. Or rather – I try my best to have a normal run not output *anything* so that I can easily see if something broke during refactoring. However, as you can see from the command, I run it in debug mode for the timing, and in debug mode it’s telling you before it performs an Exec.
2) I know I need to go back to tags and understand them better, because they may be a solution for breaking up puppet configs. For now, I’m moving towards making it possible to apply just one module at a time for testing purposes from the command line with puppet apply (see the sketch below).
3) When I’m deploying instead of actually changing my puppet config, I’m fine with it taking 30 seconds. It’s when I’m developing new puppet code or refactoring that 30 seconds is simply too long. Another reason why I’ve gone towards setting up virtual machines to test modules in and forcing myself to test modules in isolation before putting them together.
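As for the one-module-at-a-time testing in 2), that looks roughly like this (using one of the classes from the run above):

# apply a single class from the module tree and nothing else
puppet apply --modulepath=`pwd`/modules -e 'include partner::install'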
Comment by Thomas — 2013-01-25 @ 18:32