Nov 25, 2015

INACCESSIBLE_BOOT_DEVICE - aka: how much I hate Intel Rapid RAID


One beer and two episodes of DBZ Super later, I am finally back to a functional Windows 10 machine. Also known as: wasting two hours beating your head against the useless recovery capabilities of Windows 10 and Intel's incompetence at writing drivers.

Hardware:
GA-EX58-DS4 with 4 drives in RAID 5.

Scenario:
Upgraded from Windows 7 to Windows 10, and because I am an idiot, upgraded the Intel Rapid RAID driver from 14.5 to 14.6.

After the reboot I got the :( INACCESSIBLE_BOOT_DEVICE. Instantly the memories of how much I hate upgrading Intel drivers rushed back into my head, and I remembered every time I swore never to upgrade their piece of %#@ drivers again...

Solution:
Most of what I did I borrowed from janbambas.cz

1: Boot into Windows 10 recovery mode. Advanced -> very Advanced -> Command prompt
2: Go to <system drive>\Windows\System32\drivers  (example: C:\Windows\System32\drivers)
3: Make a backup of existing driver files: ias*
3a: mkdir bad_intel
3b: copy ias*.*   bad_intel
4: get a piece of paper and a pen.
5: go to C:\Windows\System32\DriverStore\FileRepository and search for all folders with older versions of the driver in them (dir /s iastora.sys)
6: write down a couple of characters to uniquely identify each folder (example: iastorac.inf_amd64_47ebd65d436e75d0 - take the _47)
In my case I had 5 folders.
7: Start the recovery process:
7a: Take a look at the timestamps of iaStorA.sys in all of the FileRepository folders the search returned.
7b. Copy the newest one over to C:\Windows\System32\drivers
7c. Exit the command prompt (literally type exit)
7d. Click the button to continue booting into Windows 10 and cross your fingers.
7e. If it doesn't work, go back to the start and repeat with the next file. (This is why you have the paper - so you don't forget where you are in the process.)

This process worked for me on the 3rd file, which was from a few months ago.

Good luck, and if this works, have a beer in honor of janbambas.cz

useful links:
* http://www.janbambas.cz/inaccessible_boot_device-on-windows-10-boot-after-update-of-the-intel-rapid-storage-techonology-driver/

* https://communities.intel.com/thread/78198?wapkw=intel+matrix+storage+and+windows+10+bsod+at+boot

Oct 21, 2015

Provisioning a Windows box with Chef-provisioning on Azure from a Mac

After spending about half a day trying to get vagrant-azure to work, it became very clear that, as of this writing, the driver is just not mature enough. It works pretty well for Ubuntu/Linux, but the moment you try to provision Windows boxes, it sets your laptop on fire.

Instead of wasting any more time on it, I decided to give the v1 and v2 chef-provisioning drivers a chance, followed by Test Kitchen. IIRC they all use different drivers, and while all are pretty solid at provisioning Linux boxes, support for WinRM is very spotty.


Authentication:

The first challenge is to authenticate successfully via the provisioning driver. While Vagrant accepts a subscription id and a path to a .pem as parameters, chef-provisioning needs an azureProfile.json.

To get that file generated, I installed azure-cli via brew `brew cask install azure-cli`

Next, import azure creds with `azure account import ../../Projects/Azure/myazure.publishsettings`
This command will generate the missing azureProfile.json in ~/.azure

Next, validate it works with `azure account list`

Chef-Provisioning piece:

Get the name of the image (Azure's rough equivalent of an AMI) you'll be using: `azure vm image list | grep -i Win2012`

Next, hack up the simplest recipe that'll spin up a box:

`knife cookbook create azure_old`
content of recipes/default.rb:

require 'chef/provisioning/azure_driver'
with_driver 'azure'
machine_options = {
    :bootstrap_options => {
      :cloud_service_name => 'alexvinyar', #required
      :storage_account_name => 'alexvinyar', #required
      :vm_size => "Standard_D1", #required
      :location => 'West US', #required
      :tcp_endpoints => '80:80' #optional
    },
    :image_id => 'b39f27a8b8c64d52b05eac6a62ebad85__Ubuntu-14_04_2-LTS-amd64-server-20150706-en-us-30GB', #required
    # :image_id => 'a699494373c04fc0bc8f2bb1389d6106__Windows-Server-2012-R2-20150916-en.us-127GB.vhd', #next step
    # Until SSH keys are supported (soon)
    :password => 'Not**RealPass' #required
}
machine 'toad' do
  machine_options machine_options
  ohai_hints 'azure' => { 'a22' => 'b33' }
end
Finally, run chef-zero (chef-client in local mode): `chef-client -z -r azure_old`

If the above recipe fails, don't despair. Check the output and see if it gets past the authentication piece. If it does, it's just a matter of getting the chef-provisioning syntax correct.

Once the run finishes (Azure is slow), connect to the box with `ssh root@12.23.34.45` for CentOS, or ubuntu@ip for Ubuntu boxes.

Now the Windows piece

With the `azure vm image list | grep -i Win2012` command I got a list of Windows images, and once the test run with Ubuntu succeeds, I'll move on to Windows.

This is where I took a break and had a beer. But I published this post anyway because I'll finish it eventually.
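For when I pick this back up, here is a rough, untested sketch of where the Windows attempt will probably start. Only the image_id comes from the `azure vm image list` output (it's the commented-out line in the recipe above); the WinRM endpoint and everything else are my assumptions, copied straight from the Ubuntu run:

require 'chef/provisioning/azure_driver'
with_driver 'azure'
machine_options = {
    :bootstrap_options => {
      :cloud_service_name => 'alexvinyar', #required
      :storage_account_name => 'alexvinyar', #required
      :vm_size => "Standard_D1", #required
      :location => 'West US', #required
      :tcp_endpoints => '5985:5985' #assumption: expose WinRM instead of port 80
    },
    :image_id => 'a699494373c04fc0bc8f2bb1389d6106__Windows-Server-2012-R2-20150916-en.us-127GB.vhd', #from azure vm image list
    :password => 'Not**RealPass' #required
}
machine 'toad_win' do
  machine_options machine_options
end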





Useful links:
http://azure.microsoft.com/en-us/documentation/articles/xplat-cli/
http://brew.sh/
https://unindented.org/articles/provision-azure-boxes-with-vagrant/


chef-base repo and workstation cookbook


A "chef-base" or "chef-repo" is a git repository which maps 1:1 to Chef organization hosted on the Chef server.  An organization in Chef server 12 is analogous to a single Chef server. Each of these "chef-base" Git repositories becomes the system of record for the global Chef objects (Environments, Roles, Data Bags) in a given organization.  This Git repository typically* does not contain cookbooks.

To set up chef-base, a user should first create an empty git repository on VSO / GitHub / GitLab / etc.
It makes things slightly easier if none of the files are initialized, including the readme and .gitignore.

Next, the user should execute the "chef generate repo <name of github repo>" command. This will generate the skeleton for the repo.
The resulting skeleton folder should be pushed in its entirety to the git repo.

Workstation cookbook

* One exception to not having cookbooks in chef-base is the workstation cookbook.
The workstation cookbook is a shared cookbook for anyone using Chef in an organization and provides a standardized way to work with Chef. It also allows rapid on-boarding of new team members and the ability to safely experiment with new tools.
It works well in Vagrant, but there is a major limitation: you can't run Test Kitchen inside a Vagrant VM. For best results, encourage teams to leverage an internal or external cloud VM, where kitchen runs will create additional VMs in the same cloud.
A Vagrantfile can be placed in the root of the cookbook. This Vagrantfile has a couple of purposes:
  • creating / destroying the workstation VM
  • kicking off the chef-client run
  • easy access into the box via vagrant ssh
  • mounting the local chef-base as a folder in the VM
The .gitignore file should be modified to exclude all cookbooks with the exception of the workstation cookbook.
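For illustration, here is a minimal sketch of what that Vagrantfile could look like. The box name, the paths, and the use of Vagrant's chef_zero provisioner are assumptions, not the actual file; `vagrant up` / `vagrant destroy` cover the create/destroy piece, and `vagrant ssh` gets you into the box:

# Vagrantfile at the root of the workstation cookbook (hypothetical sketch)
Vagrant.configure('2') do |config|
  config.vm.box = 'bento/ubuntu-14.04' # assumed base box

  # mount the local chef-base as a folder in the VM
  config.vm.synced_folder '../..', '/home/vagrant/chef-base'

  # kick off a chef-client (local mode) run against the workstation cookbook
  config.vm.provision 'chef_zero' do |chef|
    chef.cookbooks_path = '../../cookbooks'
    chef.nodes_path     = '../../nodes'
    chef.add_recipe 'workstation'
  end
end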

Places to learn more:
<add yours here> or in the comments.

Oct 3, 2015

Random observations of a new publicly facing Chef website.

First time using speakr.chef.co – musings and observations

I hope I won't hurt anyone's feelings with what's below; it is what I see as an engineer. Every time I see similar pages, I make a conscious choice to overlook these defects, either because I trust the site or because I found the thing I need.

There is no way in hell I would know how to write such a page myself, or actually implement the changes I note. But what I find most fascinating about my job is that there is a guy somewhere in the company - every company - who knows exactly which comma to change to address the issue. If I were a business, I would seek these guys out and reward them with titles, work-from-home schedules, "work on your own problem" time, etc... It's just so un-economic and un-business-like to lose them.

To business:

The experience has been an exercise in patience, but only due to an unfortunate coincidence of API incompatibility:

                The GeekWire event was announced using a Seattle address which excluded the ZipCode:
                "Oct. 1-2, 2015, Sheraton Seattle, 1400 Sixth Ave."
                ( URL: http://www.geekwire.com/events/geekwire-summit-2015/ )


Executive Summary:
This experience instantly demonstrated the inferiority of this form of entry compared to the auto context/syntax entry offered by modern companies. If this is an internally developed tool for anything other than a personal project, it should be replaced with a real tool meant for the job.


Error 1:
The speakr input fields request ZipCode as a mandatory field.

Result 1:
I had to visit google maps, enter the partial address to get the ZipCode to unblock myself.
Pretty sure my mom would get past this now.


Error2:
As @echohack says - defaults matter. There is a non-primary field that requests the event start time. The defaults of all 4 fields are set to 23:00, meaning the entries are a valid data type, but the values for the start date are totally off.

Musing:
I think 8am is a nice default for "start time" on "start date".

Possible scenario: a study of booking data found that most people fly in a day before, and they actually do want the start time to be 11pm the previous day for a networking dinner.
After thinking it over, the above doesn't make sense, because this isn't an expense system. An event system should specify the actual start time.

Result 2:
Had to make a couple of extra clicks to change the start time.


Error 3:
On initial event creation the webpage threw errors: "Invalid start date", "Invalid end date". Clicking on the start/end date fields again and resubmitting the form resulted in a successful creation message.

Result / Assumptions
The drop-off rate here is probably very high. I actually almost gave up here.
I wonder if there is monitoring or metrics in place to see this kind of drop-off. Unlikely, but I do wonder if there is an easy-to-implement "business flow" monitoring solution for that, like Zabbix.

Personal research todo: I wonder if the paid version of Google Analytics is significantly faster at page load times than the free one.


Error 4:
Allowed creation of events which have already occurred.

Possible scenario:
Could be a feature too I guess.

Musings:
Might be a good idea to check if there is an anti-spam mechanism on event creation button.
Wonder if vanilla code coverage would pick something like this up, or if you need something like Fortify.


Error 5:
After successful event creation, the event would not show up in the search results on events.chef.io.
Possible causes: the refresh job for events is not triggered instantly, the page is not yet hooked up to events, past events are ignored as a result of a conscious choice (possibly even from the business), or something else entirely.


Overall Conclusion:
This experience instantly demonstrated the inferiority of this form of entry, as compared to the auto context/syntax detection offered by modern companies. If this is an internally developed tool for anything other than a personal project, it should be replaced with a real tool meant for the job.



Sep 21, 2015

Continuous key rotation with Chef

Let's see if I can get this down on paper in a meaningful way..

Players:
a) some server (has to be a Chef Server) - aka: the Key Master
b) the rest of the infrastructure

Tools needed:
a) chef-vault
b) an admin key for the Key Master
c) Sublime Text

The flow:
The Key Master converges a recipe that does a global search for all of the nodes. For each node it generates a new key pair. It rotates the key and places the new key into a vault with a search criteria of only itself and the node. Each node, on converge, accesses the vault and retrieves the new key, then marks the vault as converged or deletes the vault after consumption.

Faults:
What happens if a node doesn't converge for a long time? How does key rotation actually work? Can a node even converge if its key has been rotated?

>> probably this is the way <<
Perhaps the node has to generate the key and set the search criteria to itself and the Key Master. The Key Master consumes the key and runs the ctl command. Do nodes continue to fail converges until the Key Master updates the key? How does key rotation actually work?

Result: every converge, the node rotates its own key. The same model can probably be used for SSH keys.

Final thought(s): What does it actually buy? I don't know, but many customers ask about it. Should it be done? Should each node have a unique, individual vault? Most likely, if you really think about it, there isn't a reason. Nodes should be grouped and each group should run off the same vault. Having 1 vault per node with identical info is meaningless, especially if there is an admin who has access to all of the nodes anyway.
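To make the flow above a bit more concrete, here is a very rough sketch of the Key Master recipe using the chef-vault Ruby API. The data bag name, the item naming, and stuffing a freshly generated private key into the vault are all assumptions for illustration; the actual rotation of the client key against the Chef server (the ctl piece) is not shown:

# Key Master recipe sketch - assumes the chef-vault gem is available to chef-client
require 'openssl'
require 'chef-vault'

search(:node, 'name:*').each do |target|
  new_key = OpenSSL::PKey::RSA.new(2048) # new key pair for this node

  item = ChefVault::Item.new('client_keys', target.name) # hypothetical data bag / item
  item['private_key'] = new_key.to_pem
  # only the target node and the Key Master itself can decrypt this item
  item.clients("name:#{target.name} OR name:#{node.name}")
  item.save
end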

Aug 14, 2015

Pulling Pega space requirements out of prpcUtils.xml

The idea is to decouple the Pega-managed file from your own automation. If your Pega team decides to make changes, you don't want to own those changes, but you do want the deployments to continue running regardless of who made the change. In this particular case, it's the space requirements for deployment.

Step one is to parse the XML and pull out the space requirement (via Ruby, 'cause we're running deployments via Chef).
Step two is to use that value in some meaningful way.
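A minimal sketch of step one, using REXML from the Ruby standard library. The file location and the property name holding the space requirement are hypothetical; point the XPath at whatever prpcUtils.xml actually uses:

require 'rexml/document'

xml_path = '/opt/pega/scripts/prpcUtils.xml' # hypothetical location
doc = REXML::Document.new(File.read(xml_path))

# hypothetical property name - adjust to the real element in prpcUtils.xml
prop = REXML::XPath.first(doc, "//property[@name='deploy.space.requirement']")
required_mb = prop ? prop.attributes['value'].to_i : 0

puts "Pega deployment wants #{required_mb} MB of free space"

Step two could then feed that value into a guard on the deployment resource in the Chef recipe, so the deploy only runs when enough space is actually available.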

Jul 11, 2015

Weblogic + Chef + Automation in General - thoughts and reflections

This is a brain dump written on an airplane in a rather sleep-deprived state. Since alcohol is not free on domestic flights, I opted for coffee and pounded away on the keyboard for a few minutes before legroom and awkward sleeping positions stopped being an issue.



To start, I will say that I am not an expert on WebLogic, which means I am not burdened by years of learning and perfecting the art of managing it. I came to learn it on Monday, and today is my first Friday after having been exposed to this technology.

So far I hear that my approach to managing WebLogic would work perfectly with JBoss, a plain JVM, or Tomcat… but would absolutely not work here.

What I do know so far is that WL has a central API that has the ability to manage the entire cluster of boxes. It also has the ability to act as a load balancer, as well as the source of information and a central registry.

That last point is very powerful, and from what I have seen so far, the most underutilized aspect of WL. Everyone is interested in the centralized functionality, completely ignoring the ability to decentralize it.

Let's break it down.

The common approach that I have seen "sold" so far, is to run all commands from the Admin (central) server. The central server will take care of all distribution of packages, starting and stopping of the cluster and all other deployment related functionality. Great. But just how useful is that in actuality?

WL allows you to have a Domain which is distributed across multiple physical machines. A Domain can have multiple clusters distributed across multiple physical machines. Each cluster can have multiple applications installed. Which means a physical machine can have a whole bunch of MS (managed servers, each its own JVM) running on it, each belonging to a different domain and a different app.

That's a lot of moving pieces. So when we try to automate a system like that, we will never talk about ONLY WL; we will also talk about patches, modifying property files, modifying port numbers on the host, auto scaling up and down, and a host of other admin functions which have nothing to do with WL, but which have to account for the fact that WL is distributed across machines.

After a week of dissecting WL, I presently believe that the admin server is a fantastic service discovery tool for WL management.

With regard to managing WL with Chef:
Chef-client runs on each physical server.
Physical servers are grouped by environments - prod / dev / test.
Each server's run_list includes the applications which are running on that server or in that environment.
The recipe for that application pulls information for the current environment from some construct, like an attribute, a data bag, or the environment where the node resides.
That info includes:
the admin server for the particular app that recipe is responsible for
the application version number

The recipe has a set of actions (LWRP) - deploy, undeploy, start, stop, etc.
Each chef-client run executes independently on each of the physical servers.
The run_list is either a list of applications on that server, or an LWRP with a list of admin servers and application data bags.
Each LWRP or recipe hits the API and finds out which MS are running locally on the host where the recipe is being executed.
Each recipe executes an LWRP which hits the API and performs the needed commands (stop, query, etc.).
The chef-client at the machine level pulls the EAR file down locally and tells the API the location of the EAR. This ensures that the EAR is physically located on the host and is accessible to the MS.
It then starts the MS via the API.
It then does whatever local changes need to be performed on the MS - server-level config - thereby ensuring that all of the changes are in fact done, idempotent, and consistent.
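To make that concrete, here is a hypothetical sketch of what using such an LWRP from a recipe could look like. The resource name, its properties, and the data bag layout are all made up for illustration; nothing here is a real WebLogic or community cookbook API:

# hypothetical recipe for one application on this host
app = data_bag_item('weblogic_apps', 'billing') # admin server + version per environment

ear_local = "#{Chef::Config[:file_cache_path]}/billing-#{app['version']}.ear"

# pull the EAR down locally so it is physically accessible to the MS on this host
remote_file ear_local do
  source app['ear_url']
end

# hypothetical LWRP that talks to the admin server API for the MS running locally
weblogic_app 'billing' do
  admin_url app['admin_url']
  version   app['version']
  ear_path  ear_local
  action    [:stop, :deploy, :start]
end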

If the API / MS allows multi-version support, this style of deployment would allow zero-downtime deploys.
If the WebLogic API is not built to handle a lot of API calls, perhaps there is a way to optimize the MS or load balance multiple Admin servers. However, if 40 physical servers each making 20 API calls over the course of a 15-minute deployment is too much for the Admin server to handle, perhaps it's a good time to look at automating WebLogic out of the company.

Feb 25, 2015

Single chef-client run with multiple reboots on Windows

To teach is to learn...

...or something along these lines. "How do I manage reboots with chef-client on Windows" is a question I hear every so often. 

So, this time around, I decided to buckle down and write down as many ways as I could remember to reboot a server and continue a chef-client run. No mucking around with the run_list, no messing around with multiple run_lists, definitely no manual steps, and most definitely no knife exec.

Here is my brain child - input and feedback are most welcome!

https://github.com/vinyar/chef_win_reboots


In my experience I found a couple of common situations where Windows needs to be defibrillated:
  • something has been installed and a reboot is needed
  • a bunch of somethings have been installed and a reboot is needed
  • something needs to be installed and a reboot is pending
  • a series of somethings needs to be installed and they have various reboot state requirements
  • a week has passed since a reboot was performed
  • the server joined a domain

With Chef managing your infrastructure there is a new reboot scenario:
  • reboot immediately without aborting a chef-client run

The patterns in the Github repo allow users to manage reboots at the resource level, or as a wrapper cookbook pattern.
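For context, the simplest building block is Chef's built-in reboot resource (Chef 12+); the package below is made up. Note that :reboot_now ends the current run cleanly before the node restarts, so getting the run to pick back up afterwards (chef-client running as a service or a scheduled task) is exactly the part the patterns in the repo deal with:

# hypothetical install that needs an immediate reboot
windows_package 'Some Agent' do
  source 'c:/installers/agent.msi'
  action :install
  notifies :reboot_now, 'reboot[post-install]', :immediately
end

reboot 'post-install' do
  reason 'finish agent install'
  action :nothing
end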

A real example can be seen in pattern two - which was really the genesis for this repo from way back when - https://github.com/vinyar/chef_win_reboots/blob/master/reboot_demo/recipes/pattern2.rb

Patterns with cats: