Bear with me. I've not gone completely insane.
When I first started doing IT stuff professionally I found it reasonably common that the go-to excuse from other technical people, when they didn't know the answer and didn't know how to find out the answer and didn't want to admit that they didn't know, was that the program or data had become corrupt. To begin with this made me laugh, especially when it was something completely innocuous, such as a shortcut being wrong (seriously, that happened).
Over time I developed utter disdain for technical staff who reached for this excuse instead of admitting that they didn't know the answer, or spending the time to understand the problem properly.
About 2 years ago I saw this excuse disappear. I rejoiced. Unfortunately it was not so. I simply had failed to notice that I just wasn't dealing with these sorts of problems any more.
The last few months I've had to get in touch with one particular customer's third party support for their financials, stock and reporting software a few times. This company is now also our partner, of sorts.
Apparently it is now cool to blame everything on file system permissions. Unhelpful when it's just their software not being configured at all, or even better, a prerequisite of their software not being installed.
Not knowing is not a crime in the company that I work at (unless the staff member has been told repeatedly). Not admitting to it, or not asking for help is, and I do wish that was more common.
I'm afraid that for this blog entry you're going to have to sit through some "bullshit-back-story".
For whatever reason our live cluster, running under System Center Virtual Machine Manager (SCVMM) 2008 R2, had gained duplicate copies of 3 virtual machines (VMs), and all the duplicates were marked as missing.
SCVMM is a relatively new "toy" for me, however I'm already starting to feel the love-hate relationship growing, probably indicated by the fact that I'm referring to it as "scum" to my co-workers and friends. My point being: forgive me if this is common knowledge. It doesn't appear to be.
Delving deeper, the Failover Cluster Manager MMC was showing only a single copy of the virtual machines in question. That narrowed it down to just SCVMM being the problem child. All attempts to perform a repair, or even an attempt at removal (despite my better judgement), in SCVMM itself resulted in a bullshit error message. Frankly a restore at 23:00 isn't exactly what I'd want to have been doing anyway, so I was somewhat pleased by that.
Poking further, there appeared to be a MS KB (KB2308590) that directly addressed this problem. No joy.
So, doing what I always do in these situations, I started prodding at the database that powers SCVMM, using SQL Server Management Studio. Yeah it's a GUI, but it was nearly midnight, and it was easier.
If you're using the default database you'll want to connect to COMPUTERNAME\MICROSOFT$VMM$. Otherwise it'll be wherever you specified at install. The table that interests us most of all is tbl_WLC_VObject. If you select with a WHERE clause to find your problem machines you should find that you have duplicate entries. Carefully choose the entry that is not running, and delete it. Luckily our duplicates had no tags, and no owners, so it was actually fairly easy to figure out which ones to remove. Close and reload the SCVMM console and you should find that you have a less scary looking SCVMM administration console.
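If you'd rather see the shape of that duplicate hunt before pointing anything at a live database, here's a rough sketch in Python. It uses an in-memory SQLite table as a stand-in for the SCVMM database, and the column names (ObjectId, Name, Owner) and sample rows are assumptions for illustration only. Check the real schema of tbl_WLC_VObject, and take a backup, before running anything like it for real.

```python
import sqlite3

# In-memory stand-in for the SCVMM database.
# Column names and rows are hypothetical, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tbl_WLC_VObject (ObjectId TEXT PRIMARY KEY, Name TEXT, Owner TEXT)"
)
conn.executemany(
    "INSERT INTO tbl_WLC_VObject VALUES (?, ?, ?)",
    [
        ("a1", "web01", "DOMAIN\\admin"),  # the real, running VM
        ("a2", "web01", None),             # the stray duplicate, no owner
        ("b1", "sql01", "DOMAIN\\admin"),  # a VM with no duplicate
    ],
)


def find_duplicates(connection):
    """Return every row whose Name appears more than once in the table."""
    query = """
        SELECT ObjectId, Name, Owner
        FROM tbl_WLC_VObject
        WHERE Name IN (
            SELECT Name FROM tbl_WLC_VObject
            GROUP BY Name HAVING COUNT(*) > 1
        )
        ORDER BY Name, ObjectId
    """
    return connection.execute(query).fetchall()


duplicates = find_duplicates(conn)
for row in duplicates:
    print(row)
```

This only lists the candidates. Deciding which of each pair is the stray one (no owner, no tags, not running) is still a job for a human.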
I'm sure that there are some bits left behind in other tables that are referencing the VMs, however I'm going to bed. It can be tidied in the morning.
At work we've been virtualising stuff for a long time, perhaps longer than most, but I'm still coming across companies and sysadmins that either don't know about virtualisation (inconceivable!) or have written it off as a fad. That really does surprise me.
The feature that many of these people and companies are missing, in my opinion, is that virtualisation is not just a solution (for business needs). It should be a core part of your toolbox. You need to tear down that physical SQL server and upgrade the OS and application version numbers? It's a core box, but the customer doesn't have the budget for another one? Having a back out solution is highly desirable? Virtualisation.
Run your backups as normal, convert that bad boy to a virtual machine, test it, and then start ripping that physical server a new install. If everything goes tits up (and it wasn't your fault), come Monday morning you won't have an upset customer, and you can continue to work in (relative) peace and look awesome.
If you know about virtualisation, but don't do this because you think it's hard work then I suggest you need to re-evaluate the state of P2V tools. If you're running Hyper-V and the source install is also Windows then I highly recommend Sysinternals Disk2VHD. Occasionally you need to re-create the VHD or massage its output with testdisk, but you always test first.
Clonezilla is another great tool, and not just for P2V. There are a whole host of other tools designed just for this purpose. Some free (in both senses of the word), and some not so.
Unfortunately this isn't one of those success stories. But then again if I wrote about those I'd be hitting a few thousand posts a year, and plus they're really boring to write about.
We began the project by powering up some virtual machines and test importing the configuration from ISA 2006 to Forefront TMG 2010, and all appeared fine. The ruleset was there, the VPN configurations were there, and so on. Test data seemed to pass through nicely.
The migration went through and we put the box live, decommissioning the old ISA 2006 hardware. Everything seemed fine until larger quantities of traffic started passing through the box. The logging was showing a lot of packets getting dropped on the floor, but with no source, destination or protocol. Active FTP and SIP traffic was also being problematic, and the box would randomly decide to stop passing everything, like the service had stopped. The irritating thing was that it simply wasn't consistent.
After poking into the configuration we started noticing a lot of problems:
- The domain controllers computer set had entries that were flat out wrong and not present in the ISA configuration
- The Web Proxy Auto-Discovery Protocol (WPAD) file was wrong
- DNS was starting to go down VPN tunnels, but there were no DNS addresses configured on the interfaces
- And a whole host of other niggly issues
After fixing these the box was still randomly dropping things, and as the data flow increased (and not to extreme levels - we're talking a 10Mbit/s leased line here) so did the drop outs. At this point it was becoming less of an irritation and more of a service affecting problem. I elected to rebuild it with non-R2 Windows Server 2008, and to manually create the configuration from documentation. Although I would've loved to have got to the bottom of the problem, rolling back would've been as much of a pain at this point, and the customer was rightly beginning to get fidgety.
So why non-R2 Windows Server 2008? A couple of reasons: all our other deployments of TMG 2010 are on non-R2 and are stable, we noticed our original test box for this project was non-R2, and there are also rumblings of other people having issues with R2 on a couple of TechNet threads. Although I'm not 100% convinced that R2 is to blame here, frankly we didn't need R2, and I only wanted to do this the once as the whole job needed to be done out of working hours.
Since the OS rebuild and manual build of the configuration, touch wood, it seems to be a lot more stable. No more weird packets getting logged, no more weird FTP or SIP problems, no more random drop outs.
My thoughts on TMG 2010 aren't favourable at this point, but it's not just because of the problems. Ostensibly it feels like ISA 2006 with a few interesting bits bolted on, but unless you require ISA or TMG in your environment, I wouldn't recommend it. There's still no real IPv6 support, without SP1 it feels very wobbly, and for a few features that you might not need it's an expensive upgrade.
Realistically you can pull off the same feature set with a different combination of products; a "real" firewall, and an internal proxy server, for example. This isn't to say that you shouldn't put TMG 2010 in anywhere. It does have some very useful features, but just look at your options carefully. Perhaps you don't need to upgrade. Perhaps you may find a better fit solution.
I've recently taken on a new role at work, and as part of that I've now got a big thing for change management and documentation.
I should cover a bit of background. At work we're a bit different from normal IT departments. Mostly because we're not a department, although we are treated as such by many of our customers. Ultimately we look after multiple, distinct, systems in multiple areas of business, in multiple locations - none of which are inter-linked at all. This makes it exceptionally important to document and to pass on information. It's unacceptable for us to say "X has gone to lunch, he's the only guy who knows your system... Can it wait?".
One thing that we've always done is to document everything on our online helpdesk software. Even if the customer phones it in, it has to go into the helpdesk. This is great for change management and finding culpability, but it's terrible for keeping configuration files, and information on the overall architecture of systems. Over the last year we've been supplementing this with a wiki (the excellent Dokuwiki to be exact) to help record this sort of information. Combined with regular group briefings (read: informal chats) it's generally been working reasonably well, especially now we're coming to the end of a major re-organisation of the data in there.
My main issue with what we currently have is keeping up to date configuration files for routers, switches, daemons, and so on, particularly in combination with lots of rapid changes. It's all well and good to have procedures stating that "you must 'check in' the most recent change", but if it's too busy and it gets forgotten then you're screwed. I really want to automate this process. There are some tools out there for this: Kiwi CatTools, Ziptie (abandoned, which I realised far too late after dicking about with it for an hour and wondering why some things didn't work), and RANCID.
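To give a feel for the core of what these tools automate, here's a minimal sketch in Python of the "only check it in if it changed" step. It isn't how any of the tools above work internally. The function name, the directory layout and the sample configs are all made up for illustration, and actually fetching the config from a device is left out entirely.

```python
import hashlib
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def archive_if_changed(archive_dir: Path, device: str, config_text: str) -> bool:
    """Store config_text for a device only if it differs from the last stored copy.

    Returns True when a new snapshot was written, False when nothing changed.
    """
    device_dir = archive_dir / device
    device_dir.mkdir(parents=True, exist_ok=True)

    digest = hashlib.sha256(config_text.encode("utf-8")).hexdigest()
    last_hash_file = device_dir / "latest.sha256"

    # Same hash as last time means the config is unchanged, so skip it.
    if last_hash_file.exists() and last_hash_file.read_text() == digest:
        return False

    # Timestamped filename (UTC, with microseconds) keeps every version.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S%fZ")
    (device_dir / f"{stamp}.cfg").write_text(config_text)
    last_hash_file.write_text(digest)
    return True


# Hypothetical usage with a throwaway directory and invented configs.
archive = Path(tempfile.mkdtemp())
first = archive_if_changed(archive, "router1", "hostname router1\n")
second = archive_if_changed(archive, "router1", "hostname router1\n")
third = archive_if_changed(archive, "router1", "hostname router1\nntp server 192.0.2.1\n")
print(first, second, third)
```

Run on a schedule against whatever pulls the configs down, something like this takes the "someone forgot to check it in" failure out of the loop. It does nothing for pushing changes back out, which is the harder half of the problem.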
These'll work fine to varying degrees, but here's my niggling problem - I'd love to be able to stick something else next to whatever system we deploy, in order to push configuration changes back. With RANCID I can do this, but we've still got very much of an anti-*ix sentiment, and although it is changing very slowly, in the short-to-medium term it would cause the same problem that I'm trying to get rid of - knowledge partitioning. Hiring someone else with the knowledge we need isn't an option at the moment (we've just taken on an additional member of staff who doesn't have the knowledge or skills).
It's got me thinking. What trick am I missing here? I know I should just be worried about configuration files right now, but the part of me that loves hacking something together really wants to find or to put something together to solve both problems in one go. However, realistically this isn't something I have the time to do right now.
How do you do it?