Present Perfect


Picture Gallery
Present Perfect

Puppet/puppetdb/storeconfigs validation issues

Filed under: puppet,sysadmin — Thomas @ 21:31


Over the past year I've chipped away at setting up new servers for apestaart and managing the deployment in puppet as opposed to a by now years old manual single server configuration that would be hard to replicate if the drives fail (one of which did recently, making this more urgent).

It's been a while since I felt like I was good enough at puppet to love and hate it in equal parts, but mostly manage to control a deployment of around ten servers at a previous job.

Things were progressing an hour or two here and there at a time, and accelerated when a friend in our collective was launching a new business for which I wanted to make sure he had a decent redundancy setup.

I was saving the hardest part for last - setting up Nagios monitoring with Matthias Saou's puppet-nagios module, which needs External Resources and storeconfigs working.

Even on the previous server setup based on CentOS 6, that was a pain to set up - needing MySQL and ruby's ActiveRecord. But it sorta worked.

It seems that for newer puppet setups, you're now supposed to use something called PuppetDB, which is not in fact a database on its own as the name suggests, but requires another database. Of course, it chose to need a different one - Postgres. Oh, and PuppetDB itself is in Java - now you get the cost of two runtimes when you use puppet!

So, to add useful Nagios monitoring to my puppet deploys, which without it are quite happy to be simple puppet apply runs from a local git checkout on each server, I now need storedconfigs which needs puppetdb which pulls in Java and Postgres. And that's just so a system that handles distributed configuration can actually be told about the results of that distributed configuration and create a useful feedback cycle allowing it to do useful things to the observed result.

Since I test these deployments on local vagrant/VirtualBox machines, I had to double their RAM because of this - even just the puppetdb java server by default starts with 192MB reserved out of the box.

But enough complaining about these expensive changes - at least there was a working puppetdb module that managed to set things up well enough.

It was easy enough to get the first host monitored, and apart from some minor changes (like updating the default Nagios config template from 3.x to 4.x), I had a familiar Nagios view working showing results from the server running Nagios itself. Success!

But all runs from the other vm's did not trigger adding any exported resources, and I couldn't find anything wrong in the logs. In fact, I could not find /var/log/puppetdb/puppetdb.log at all...

fun with utf-8

After a long night of experimenting and head scratching, I chased down a first clue in /var/log/messages saying puppet-master[17702]: Ignoring invalid UTF-8 byte sequences in data to be sent to PuppetDB

I traced that down to puppetdb/char_encoding.rb, and with my limited ruby skills, I got a dump of the offending byte sequence by adding this code:

Puppet.warning "Ignoring invalid UTF-8 byte sequences in data to be sent to PuppetDB"
File.open('/tmp/ruby', 'w') { |file| file.write(str) }
Puppet.warning "THOMAS: is here"

(I tend to use my name in debugging to have something easy to grep for, and I wanted some verification that the File dump wasn't triggering any errors)
It took a little time at 3AM to remember where these /tmp files end up thanks to systemd, but once found, I saw it was a json blob with a command to "replace catalog". That could explain why my puppetdb didn't have any catalogs for other hosts. But file told me this was a plain ASCII file, so that didn't help me narrow it down.

I brute forced it by just checking my whole puppet tree:

find . -type f -exec file {} \; > /tmp/puppetfile
grep -v ASCII /tmp/puppetfile | grep -v git

This turned up a few UTF-8 candidates. Googling around, I was reminded about how terrible utf-8 handling was in ruby 1.8, and saw information that puppet recommended using ASCII only in most of the manifests and files to avoid issues.

It turned out to be a config from a webalizer module:

webalizer/templates/webalizer.conf.erb: UTF-8 Unicode text

While it was written by a Jesús with a unicode name, the file itself didn't have his name in it, and I couldn't obviously find where the UTF-8 chars were hiding. One StackOverflow post later, I had nailed it down - UTF-8 spaces!

00004ba0 2e 0a 23 c2 a0 4e 6f 74 65 20 66 6f 72 20 74 68 |..#..Note for th|
00004bb0 69 73 20 74 6f 20 77 6f 72 6b 20 79 6f 75 20 6e |is to work you n|

The offending character is c2 a0 - the non-breaking space

I have no idea how that slipped into a comment in a config file, but I changed the spaces and got rid of the error.

Puppet's error was vague, did not provide any context whatsoever (Where do the bytes come from? Dump the part that is parseable? Dump the hex representation? Tell me the position in it where the problem is?), did not give any indication of the potential impact, and in a sea of spurious puppet warnings that you simply have to live with, is easy to miss. One down.

However, still no catalogs on the server, so still only one host being monitored. What next?

users, groups, and permissions

Chasing my next lead turned out to be my own fault. After turning off SELinux temporarily, checking all permissions on all puppetdb files to make sure that they were group-owned by puppetdb and writable for puppet, I took the last step of switching to that user role and trying to write the log file myself. And it failed. Huh? And then id told me why - while /var/log/puppetdb/ was group-writeable and owned by puppetdb group, my puppetdb user was actually in the www-data group.

It turns out that I had tried to move some uids and gids around after the automatic assignment puppet does gave different results on two hosts (a problem I still don't have a satisfying answer for, as I don't want to hard-code uids/gids for system accounts in other people's modules), and clearly I did one of them wrong.

I think a server that for whatever reason cannot log should simply not start, as this is a critical error if you want a defensive system.

After fixing that properly, I now had a puppetdb log file.

resource titles

Now I was staring at an actual exception:

2016-10-09 14:39:33,957 ERROR [c.p.p.command] [85bae55f-671c-43cf-9a54-c149cede
c659] [replace catalog] Fatal error on attempt 0
java.lang.IllegalArgumentException: Resource '{:type "File", :title "/var/lib/p
uppet/concat/thomas_vimrc/fragments/75_thomas_vimrc-\" allow adding additional
config through .vimrc.local_if filereadable(glob(\"~_.vimrc.local\"))_\tsource
~_.vimrc.local_endif_"}' has an invalid tag 'thomas:vimrc-" allow adding additi
onal config through .vimrc.local
if filereadable(glob("~/.vimrc.local"))
source ~/.vimrc.local
'. Tags must match the pattern /\A[a-z0-9_][a-z0-9_:\-.]*\Z/.
at com.puppetlabs.puppetdb.catalogs$validate_resources.invoke(catalogs.
clj:331) ~[na:na]

Given the name of the command (replace catalog), I felt certain this was going to be the problem standing between me and multiple hosts being monitored.

The problem was a few levels deep, but essentially I had code creating fragments of vimrc files using the concat module, and was naming the resources with file content as part of the title. That's not a great idea, admittedly, but no other part of puppet had ever complained about it before. Even the files on my file system that store the fragments, which get their filename from these titles, happily stored with a double quote in its name.

So yet again, puppet's lax approach to specifying types of variables at any of its layers (hiera, puppet code, ruby code, ruby templates, puppetdb) in any of its data formats (yaml, json, bytes for strings without encoding information) triggers errors somewhere in the stack without informing whatever triggered that error (ie, the agent run on the client didn't complain or fail).

Once again, puppet has given me plenty of reasons to hate it with a passion, tipping the balance.

I couldn't imagine doing server management without a tool like puppet. But you love it when you don't have to tweak it much, and you hate it when you're actually making extensive changes. Hopefully after today I can get back to the loving it part.

measuring puppet

Filed under: puppet — Thomas @ 20:58


For one of work's projects, we'll soon be working on scaling our platform more, which will require deploying a bunch more machines. For this project, we basically have a local dev platform, an online dev platform, a preproduction platform, and a production platform.

Right now, all these platforms have exactly one host. There are two puppetmasters, one for the dev platforms and one for the pre/pro platforms.

Since deploying a bunch more machines is going to require a lot more puppet running, I want to work on removing as much friction as I can from my puppet work. I do the runs manually as we upgrade platforms during deployment, and a run typically takes well over a minute. For me, that's too long - it causes me to waste time, lose focus, task switch, and forget I should be following up on puppet runs. It makes finetuning puppet modules a chore as I hack on them.

So I wanted to start by trimming some of the obvious fat before I segment my puppet config into separately testable pieces. I would have expected puppet apply to actually have something to help with that, but it doesn't.

After thinking it through, I realized I wanted some kind of tool that would timestamp output of puppet apply --debug so I could see which things it does take more time than others.

I wasn't sure what to google for, but timestamp stdout seemed to bring up some results, and I hit on http://joeyh.name/code/moreutils/ which includes a tool called 'ts' and is a simple pipe filter that timestamps lines going to stdout.

That was almost good enough. What I really wanted though was to know how much time elapsed since printing the last line. My perl is rusty, but I managed to quickly cook up a patch that makes it print incremental timestamps.

Now I can do a puppet run like this:

puppet apply --modulepath=`pwd`/modules:`pwd`/dev/modules manifests/site.pp --debug | ts -i "%H:%M:%.S" | egrep -B 1 "^00:00:0[^0]"
00:00:00.001066 debug: Executing 'test -e ../commit && ( test xorigin/master == `cat ../commit` || ( git fetch -a; test `git rev-parse --verify origin/master^0 | head -n 1` == `cat ../commit` ) )'
00:00:02.646908 debug: Service[postfix](provider=redhat): Executing '/sbin/service postfix status'
00:00:00.000987 debug: Executing 'test -e ../commit && ( test xorigin/release-1.4.x == `cat ../commit` || ( git fetch -a; test `git rev-parse --verify origin/release-1.4.x^0 | head -n 1` == `cat ../commit` ) )'
00:00:01.871258 debug: Exec[git-checkout-/var/www/partner-test](provider=posix): Executing check 'test -e ../commit && ( test xorigin/release-1.4.x == `cat ../commit` || ( git fetch -a; test `git rev-parse --verify origin/release-1.4.x^0 | head -n 1` == `cat ../commit` ) )'
00:00:00.000942 debug: Executing 'test -e ../commit && ( test xorigin/release-1.4.x == `cat ../commit` || ( git fetch -a; test `git rev-parse --verify origin/release-1.4.x^0 | head -n 1` == `cat ../commit` ) )'
00:00:01.886606 debug: Prefetching aliases resources for mailalias
00:00:00.000957 debug: Executing '/usr/sbin/semanage fcontext -l | grep -q '^/home/git/dev(/.*)?''
00:00:01.750281 debug: /Stage[main]/Dexter::Apache/Selinux::Set_fcontext[home-httpd]/Exec[semanage-/home/git/dev-httpd_sys_content_t]/unless: /usr/sbin/semanage: Broken pipe
00:00:00.000855 debug: Executing 'test -e ../commit && ( test xorigin/release-1.4.x == `cat ../commit` || ( git fetch -a; test `git rev-parse --verify origin/release-1.4.x^0 | head -n 1` == `cat ../commit` ) )'
00:00:02.064475 debug: /Schedule[puppet]: Skipping device resources because running on a host
00:00:00.001048 debug: Executing '/usr/sbin/semanage fcontext -l | grep -q '^/srv/merchant(/.*)?''
00:00:01.750129 debug: /Stage[main]/Partner::Install/Selinux::Set_fcontext[srv-merchant-httpd]/Exec[semanage-/srv/merchant-httpd_sys_content_t]/unless: /usr/sbin/semanage: Broken pipe
00:00:00.000861 debug: Executing 'test -e ../commit && ( test xmaster == `cat ../commit` || ( git fetch -a; test `git rev-parse --verify master^0 | head -n 1` == `cat ../commit` ) )'
00:00:01.841316 debug: Exec[git-checkout-/var/www/merchant](provider=posix): Executing check 'test -e ../commit && ( test xorigin/release-1.4.x == `cat ../commit` || ( git fetch -a; test `git rev-parse --verify origin/release-1.4.x^0 | head -n 1` == `cat ../commit` ) )'
00:00:00.000955 debug: Executing 'test -e ../commit && ( test xorigin/release-1.4.x == `cat ../commit` || ( git fetch -a; test `git rev-parse --verify origin/release-1.4.x^0 | head -n 1` == `cat ../commit` ) )'
00:00:01.858206 debug: Service[httpd](provider=redhat): Executing '/sbin/service httpd status'

Some explanation is in order.

The puppet apply is straightforward if you know puppet a little - it will apply a manifest and spit out a lot of debug info.

The output gets piped into ts, which will do incremental timestamping (with -i, which is what my patch adds) according to the specified format (ts by default uses seconds precision, but can do microsecond precision if you use %.S in the format).

Then I grep for all lines that take at least 1 second to be output, and display the line before that too (since puppet is generating output either before or after a possibly long-running task, so either on the line before or the line that took too long).

In the first section, I doubt service postfix status is to blame, so it's probably my convoluted git updating that takes too long. I need to rework that module so it doesn't fetch on every run.

In the third section, semanage is to blame. Hm, maybe I need to find a different way to look up whether the particular fcontext rule I want to add is already there. I've considered converting it to facts, although that sounds like it would be stretching facts a little - that's a lot of info to store in a fact.

The others are repeats of both, so I know where to start trimming the fat now!

And when all > 1 sec items are gone, time to shave off more below that.

If you want to try out ts with incremental timestamping, it's available in the rebuilt moreutils rpm in my package repositories for CentOS 6 and F16/17/18.

If any puppetmaster (hah!) has good tips on how to debug and measure the catalog generation step (the one on the master), let me know!