* Functional testing took 17 months, and began very late due to project delays.
* The board was told that there were only 800 bugs, yet the independent review found that there were closer to 2,000.
* Non-functional testing seems to have been rushed.
* The Sabadell COO reported that SABIS, one of its subsidiaries and the major IT contractor on the migration, didn't have the capacity to respond to and resolve incidents after go-live.
And the TSB board decided to Go Live anyway.
Looks like a classic IT failure story: the project is late, costs are cut on operations, everyone gets tired of testing and fixing, and it's declared good enough to release. But there are also some very weird incentives: SABIS was apparently not the best vendor for the job, but it is part of Sabadell, the bank that bought TSB.
The "Integrated Master Plan" Gantt chart of the project on page 44 is Waterfall as hell, and they still failed to plan for non-functional testing.
Those are some first impressions; it looks like an interesting read.
In UK/EU banking the way to get promoted is to be a yes-man. The business has complete control and abhors IT. Impossible deadlines and a hands-off approach make for atrocious code bases. And this becomes a never-ending feedback loop going on for decades. There’s never time to pay down technical debt. And now with the replacement of commercial staff for IT staff there’s a growing divide. Stay away.
I have experience of this and can confirm. Banking is not a sector for young smart people because it will wear you down. It's a constant battle with change approval boards, fighting their risk-averse fear of causing service disruption while at the same time expecting things to be delivered. Save these types of roles for when you are old and want a boring job.
It’s the same in the US Federal government IT space, especially when dealing with archaic legacy systems. Innovation is nearly impossible and change approval boards reign. Everyone is risk-averse because of previous breaches, even to the point where engineers are not allowed to run the code they wrote themselves. You heard that right. We can write code, but we can’t run it ourselves.
Had similar experience talking to ... someone at fidelity. I can't remember his title, but we were trying to negotiate some data exchange with them. He asked for our development workflow/process, and balked when he saw that a "developer" would also be the same person who had "access" to the database with customer information.
"Developers can not touch the database"
Umm... someone needs to, and we're a small team.
"Developers can not touch the database"
"Why not?"
"They can exfiltrate data".
"So could someone else who has access to the database"
"Developers can not touch the database"
"If we give a non-developer access to the database, by definition, they have access, and therefore the ability to exfiltrate data"
"Developers can not touch the database"
"Can you provide some examples of how other companies who've passed your inspection are making any sort of database changes/upgrades without someone having access to the database?"
This is a pretty normal requirement. I believe what they are saying is that you should not be able to access sensitive customer data by virtue of “just” being a developer, and you should audit who has access to such data and why. And by “touch” they mean unfiltered access, not “you cannot do anything with the database and therefore cannot work”.
This is pretty basic information hygiene, and if you don’t have enough care to implement even that then why should they partner with you? Sounds like a liability waiting to happen.
They would not give us any indication of what they expected.
> I believe what they are saying is that you should not be able to access sensitive customer data by virtue of “just” being a developer, and you should audit who has access to such data and why.
It should be pretty straightforward for them to provide a "here's our initial base minimum requirements guidelines" and work from there.
The parameters they laid out were similar to "anyone who would be capable of running any code on the database should not have access to any data". Every clarifying question about "what about X? what about Y?" was met with basically, "no, that person could exfiltrate the data". Which, yes, anyone who has access to the data could, theoretically, exfiltrate the data.
I showed a demonstration (from laptop) and the guy got really upset that I had 'customer data' on the screen and in my database. It took several minutes of back and forth to explain to him that it was "faked" data.
It sounds like your team has no clue how to approach these discussions and lost out on the partnership. Fidelity don’t know you from Adam, and you seemingly gave no impressions that you had any idea what they wanted or how this process works.
They shouldn’t have to spell out how to protect data to you, and simply asking “what about X”, as if you are the only one ever to point out to them that there are of course ways data could be exfiltrated, betrayed your ignorance, not theirs.
Did you expect them to say “my god! Of course! Why didn’t I see this before? This regulation we are legally and contractually bound to obey is not perfect, so hey why don’t you just ignore it.”?
And to top it off you started showing realistic looking data without being upfront and clear to them that it was synthetic? Come on.
Not the same as "we're doing this and..." being cut off and being told "this is 100% wrong, and we will not tell you any scenario that would bring you into compliance with our view of secure".
We had plenty of secure process in place. The moment he heard the word "developer" and "database", he basically shut down and refused any meaningful engagement.
It is not his job to educate on secure practices. It was his job to inspect what we had in place, and he stopped doing that the moment he heard those two words.
"Would having a non-developer administrator connect and process the changes work?"
"Developers can not have any access to any production data".
That's not a meaningful response to the question.
> And to top it off you started showing realistic looking data without being upfront and clear to them that it was synthetic? Come on.
I'm not sure how much more 'up front' you can be besides saying "we develop with faked sample data, this is faked and unrelated to or connected with the production system" at the start of that walkthrough. Perhaps if we'd opened up a Northwind database it might have looked more plausible?
No one was expecting them to ignore anything. We were expecting someone to ask "are you doing X? how are you doing Y?", then followup with "you need to remediate these aspects of your system before we proceed further". That's what I would expect.
To be fair, it's not their job to educate you on good security practices. This was an obvious red flag (developers should indeed not have unfettered access to production databases), so if you fail at a basic separation of duties on the team, there's no point in progressing any further.
If you're expecting some "detailed guidelines" so you can "fix" your system to pass their audit, you're missing the point entirely.
Also when showing anything that remotely looks like customer PII, it's your job to clarify that this is not real data. Given the precedent, I can see why they thought you were being careless and showing real production data.
Wait a minute. How would they be showing production data if they couldn't access the database where, presumably, this data was stored?
It sounds a lot more like the smaller company was interfacing with someone who was not technical or at the very least did not have a clear understanding of the internal processes of the big company, and expected the smaller company to have prior experience of this thing and to guide them through it.
Most likely, the process would have involved the smaller company communicating with someone inside the larger company, like a DBA, and asking the large company's DBA to make the changes they needed. In large financial corps, the company's own developers are not allowed to touch the database either; in fact, where I worked, we were not even allowed to create a database for testing an internal-facing project. Processes are that anal.
> How would they be showing production data if they couldn't access the database where, presumably, this data was stored?
The entire anecdote was about them telling the representative that they DO in fact have access to the production DB. Why would you expect him to act as if they didn't have access after being told they do?
In large companies, especially large companies that are not technology companies, IT people have very clearly delineated roles, so a "developer" and a "DBA" are different people and when a "developer" needs a database, he or she must ask a "DBA" to make one for them.
In small companies, a "developer" is, by comparison anyway, a jack of all trades, who is probably expected and required to know how to do every possible job, including database management. So when the large company said "no developer" they possibly meant that, internally, their developers could not touch databases and only DBAs were allowed to, and they expected the smaller company to have similar processes in place. Which is quite unrealistic.
At the end of the day, paranoid as that may sound, the big company person was probably getting stuck on the role description of "developer" and would not have freaked out if the smaller company had used the term "DBA".
I don't know how to fulfill this in a way that makes every auditor happy, but so far what we're doing is: we have a restricted-access role on the DB that denies access to certain sensitive tables, we create a time-restricted temporary user for the developer, and then give them access to the read replica.
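For what it's worth, the grant logic above can be sketched roughly like this: Python generating Postgres-style SQL, where the role name, table names, and deny-list are made up for illustration:

```python
from datetime import datetime, timedelta, timezone

SENSITIVE_TABLES = {"customers", "payment_details"}  # hypothetical deny-list

def temp_read_grant(dev, tables, hours=4):
    """Build Postgres-style SQL for a time-limited, read-only grant on the
    read replica, skipping any table on the sensitive deny-list."""
    allowed = sorted(set(tables) - SENSITIVE_TABLES)
    expires = datetime.now(timezone.utc) + timedelta(hours=hours)
    stmts = [f"CREATE ROLE tmp_{dev} LOGIN VALID UNTIL '{expires:%Y-%m-%d %H:%M}+00';"]
    stmts += [f"GRANT SELECT ON {t} TO tmp_{dev};" for t in allowed]
    return stmts

for s in temp_read_grant("alice", {"orders", "customers"}):
    print(s)  # the sensitive "customers" table never appears in a GRANT
```

Whether this satisfies an auditor probably depends more on the audit trail around it than on the SQL itself.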
This sounds like the right approach, and it’s similar to other companies I’ve worked at. The time-limited nature is nice, and presumably there is an audit log of some kind tied to that.
While I get the 'atmosphere' of the meeting just by reading this, there are quite a few valid points raised. Developers should not have access to real user data. Devs should be provided with DB schema definitions, info on data volumes, and any other relevant information or metadata that is required to develop new features, migrate, etc. When it comes to databases with extremely sensitive data, i.e. banking, healthcare, criminal records, there must be an additional security layer that logs anyone accessing particular records.
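That extra logging layer can be as simple as a wrapper that records who read which record, and why, before the data is returned. A minimal sketch (the field names and in-memory store are invented; a real system would use an append-only audit store):

```python
from datetime import datetime, timezone

AUDIT_LOG = []  # stand-in for an append-only audit store

def audited_fetch(user, record_id, reason, fetch):
    """Record who accessed which sensitive record, and why, before returning it."""
    AUDIT_LOG.append({
        "user": user,
        "record": record_id,
        "reason": reason,
        "at": datetime.now(timezone.utc).isoformat(),
    })
    return fetch(record_id)

records = {42: {"name": "J. Doe", "balance": 100}}  # fake data, not real PII
row = audited_fetch("alice", 42, "support ticket #1234", records.get)
print(AUDIT_LOG[0]["user"], "read record", AUDIT_LOG[0]["record"])
```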
Separation of duties regulation. Nothing wrong with it. The idea is to have a wall so that you can't have one person able to write and push code to production.
Now if you mean "run code" at all that's another story.
Unfortunately, the "finance is only for old and boring people" also leads to a certain prestige for finance - startups are for kids, finance is for adults. Many in finance almost pride themselves on how much time they spend in meetings, and how long it takes to release anything.
For startups, the myth is that success comes from a "move fast and break things" culture, so they wear IT disasters as a badge of honour. In finance, it's the opposite: with enough planning and governance, huge budgets, and long timelines, they believe you can guarantee success on any project.
> Modern technology companies make tons of changes all of the time. As a result they become good at them.
Really? That hasn't been my experience. Nearly every time some moderately large update to any modern software goes out, it seems to break a bunch of things. What it doesn't break it usually just ditches entirely.
Unfortunately in a lot of countries in Europe that's the only sector that does pay a premium in comparison to other sectors. Try getting £100K+ dev job in London and suddenly all the potential employers ( with a few exceptions) end up being either in the City or Canary Wharf.
You could replace banking with any other large enterprise vertical like healthcare, telecom, government and be very correct. The problem is that IT is a cost center, and cost centers are as a rule terrible to work in regardless of industry, even if you're at a tech company.
Very true. Work somewhere where the CEO has some level of understanding and appreciation for your work and life will be much better.
Corporate IT is an especially bad place. You can only lose. When things go well, they start cutting costs by letting people go and offshoring jobs. And when things go badly, you get the blame. The IT department at my company is hard to work with and often drives me crazy. But when I look at their structure, it’s obvious that it doesn’t allow people to do a good job.
My theory is that the code never gets fixed in these pathological companies. They just get higher costs until people switch away to a bank that is more efficient.
I think that’s especially common in regulated industries. On the one hand top management are afraid of compliance issues but on the other hand they don’t understand and respect why things cost money and time. So they are very susceptible to hiring sweet talking people who pretend to be able to solve this.
The number alone is meaningless. Some bugs are UX inconveniences, while others can destroy data, as we saw in this case.
The real problem at these banks is the multitude of systems and data feeds that impact systems downstream, in chains a hundred systems long.
The real solution to these big project failures is, in my experience, phased deployments, with the first step being retiring old systems and consolidating data repositories to reduce knock-on effects.
"making individuals within a company responsible for what goes wrong within that company’s IT systems" seems patently ridiculous, especially without "paying individuals in proportion to value generated by IT systems". Otherwise, you have an unlimited downside, and a hard capped upside, which, personally, means I quit IT and start serving coffee for a living.
It gets a bit murkier with executive positions, in my opinion, since the unlimited upside does start coming into play.
If companies want to hold engineers responsible to that extent, they would also need to give them a way out. Meaning that if a developer, regardless of deadlines, does not believe that their code is production-ready, then you either delay the launch or the responsibility shifts to the manager who chose to ignore the developer's concerns.
As developers and engineers we’re responsible for letting management know if a project is not on track. What they choose to do with that information is their problem.
I'd imagine they'd rush delivery of items (going lower quality) and work just like we do to get something finished before the city enacts a fine for delays. Look at Mopac in Austin..
This is exactly why software developers shouldn't call themselves engineers. Real engineers (structural, mechanical) are legally bound to denounce unsafe or incorrect designs or practices.
There's a whole lot of EE's and ChE's that'll be disappointed to learn they're not "real" engineers.
And even for structural and mechanical engineers, not everything requires reporting. Even when the project is subject to it, any number of institutional and social factors can make that essentially impossible.
The idea is that this would apply to executive positions. Right now, CFOs are required to sign off quarterly on the company's financial statements. If those financial statements are later found to be misleading or fraudulent, the CFO is personally responsible, and can face fines or jail times. What the article is proposing is that IT be held to a similar standard. A CIO should be held personally liable if an IT failure causes an outage that materially affects the finances of the firm or causes customers to lose money.
Nice idea in theory but it can (and will) be gamed. For instance, the CIO will have an auditor sign off on a master checklist of yes/no items which the engineers will then be forced to self-certify any changes against. Any failures after that are "provably" due to engineering not being accurate in their assessment of checklist compliance.
What items might we see on such a list? Oh, I don't know... 100% test coverage, perhaps?
In the UK this is called Governance. Fancy word for what is, effectively, liability avoidance.
The question with this kind of thing is what happens when you are ordered to do something that is outside your personally acceptable risk. Including threats of retaliation.
I’m not sure how that relates to your original question. If your boss is instructing you to do something that you deem to be unacceptably risky, make sure they are aware of and take ownership of the risks. Large and especially regulated organisations (like banks) will have processes to track things like risk ownership. If you communicate risks like this, then not only have you covered your own ass, you’ve also done the responsible thing of notifying the correct stakeholders. If they choose to act irresponsibly in light of that, then that’s on them.
I’ve done a lot of work in banks and this story sounds very, very familiar to me. Change the details of the system and it could be any number of projects that I’ve personally worked on. I’ve been brought into a number of projects like this at the ‘near completion’ stage, and each time I’ve reported on what I thought the risks were, suggested how they should address them, and advised them to delay delivery until they do. Some of those projects worked out well, some of them completely bombed, but I’ve never been put in a situation where I was made even partially accountable for somebody else’s poor decision making. So if your question is “how do I protect my own interests when my boss wants to be an idiot?”, then that’s the serious answer for how you do it.
I think the labour market would change the compensation too in this regulation scenario. Maybe it would go well, maybe badly (lawyers and liability insurance imposing conditions...)
Companies should give people the option to "sign off" that a project will succeed, and then ONLY those people should get the reward or punishment. This would be for everyone, not just IT, and it'd require people to assess others' skills and trust them on the job.
I'm thinking it would apply to the CTO and similar high-level employees? It wouldn't make sense to punish random leaf-node engineers; they don't know the full state/readiness of the system and aren't the one making the go/no go decision anyway.
There are countless stories of large scale failures involving non-consultants as well as very well paid consultants.
Software is hard and most developers aren't very good. As the complexity of software increases, the number of developers and teams that can manage it becomes vanishingly small; to a point where no amount of money can help.
The best chance for a project like this to succeed is to break it down and migrate over the course of years (which might not always be possible).
I witnessed the outsourcing of 3,000 IT staff at a large company in the mid-2000s. The contract was great for the outsourcing company and was signed off by a bunch of bozos who had no idea how IT runs.
The idea was to have tickets for everything, including "golden" ones which cost a fortune but are super high priority.
IT was mad at being outsourced and played the game. "super important" thing coming? Ticket. "just a cable, man"? Ticket.
What the idiots in management discovered is how THEIR company actually works, and how important the human relationships are.
One of my colleagues in IT had the pleasure of telling a board member who came expecting to be serviced right away to fuck off (his words) and to open a ticket, if he knew what that was.
Then he picked up the phone as someone was calling and turned away.
> IT was mad at being outsourced and played the game.
This is called "work to rule", and anybody who doesn't know the risks and repercussions of such an event should not be in business management in the 21st century. Unfortunately, a lot of people try to pretend that trade unions never existed, and in so doing they lose an excellent opportunity to learn how workers actually think and act.
“One of my colleagues in IT had the pleasure of telling a board member who came expecting to be serviced right away to fuck off (his words) and to open a ticket, if he knew what that was. Then he picked up the phone as someone was calling and turned away.”
This is great. Usually the board member would get preferential treatment and never experience how things really are.
> There are countless stories of large scale failures involving non-consultants as well as very well paid consultants.
I wouldn't call it a straw man. There are more complex mechanics in outsourcing situations. Often outsourcing is not complete. They still keep a handful of inside workers supervising consultants. And often those inside workers are woefully incompetent, ruining the work of "well paid consultants". That happens because the outsourcing only saves those with good connections: the idiot relative of someone with power.
I've also heard many managers in the consulting firms telling us to never go out of our way to fix problems. If the customers want it fixed, they must pay more hours. Except of course when the customers get very upset and demand unpaid overtime to reach some arbitrary deadline.
Every time someone I work with casually raises the possibility of outsourcing, I say I'm open to the idea. Then I share that link and ask them to review it before we meet to discuss our options.
This is accurate. I have unfortunately been involved in a situation very similar to the one the article describes. The release is signed off on, we go live, and millions of records are suddenly corrupted, affecting thousands of people and essentially shutting down an industry for a while. Engineers are only as good as the QA department, and the QA department is only as good as the time and resources it is provided. This is compounded when a company has to deliver in order to get paid, so it eventually reaches a point where, hamstrung by a bad original contract, it is under massive pressure to deploy and just pulls the trigger.
Software is very hard, many engineers are not sufficiently experienced at what they are doing, and it's hard even for those who are. Further complicating this is a cost issue: quite often, large-scale applications utilize resources that cost a lot of money on the production side, and so, to save money, they utilize a considerably smaller data and infrastructure set on the testing and development side. Queries are optimized against this smaller data set. The application goes live and suddenly an application tested against a cherry-picked data set of x records is trying to handle 10,000 times the data. Then bad things happen.
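To make the scale problem concrete, here's a toy illustration (not from any real system): a duplicate check that is unnoticeable on a small test set but quadratic, so it blows up at production volume, versus what an index-backed plan effectively does:

```python
def naive_duplicates(rows):
    """O(n^2) pairwise scan: fine on a cherry-picked test set, ruinous at scale."""
    comparisons, dupes = 0, set()
    for i, a in enumerate(rows):
        for b in rows[i + 1:]:
            comparisons += 1
            if a == b:
                dupes.add(a)
    return dupes, comparisons

def indexed_duplicates(rows):
    """O(n) with a hash set: roughly what an index-backed query plan does."""
    seen, dupes = set(), set()
    for r in rows:
        (dupes if r in seen else seen).add(r)
    return dupes

small = list(range(100)) + [7]
dupes, n_cmp = naive_duplicates(small)
print(dupes, n_cmp)  # {7} and ~5,000 comparisons: invisible in testing
# At 100,000 rows the same scan needs ~5 * 10^9 comparisons: a go-live incident.
```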
> There are countless stories of large scale failures involving non-consultants as well as very well paid consultants.
Sure but there are undoubtedly more failures involving cost leaders doing the same job as highly paid consultants. It is foolish to suggest otherwise or that there isn't a correlation. Your diminishing returns example might be true, but also doesn't discount the original statement.
Good engineers help, but they don't solve organizational problems. A program manager who knows what questions to ask and what success and failure look like could have fixed this. Same for a business sponsor who understands that the cost of a screw-up is far higher than the cost of taking your time and being careful.
> I’m super curious how customers ended up seeing other people’s accounts.
I don’t know anything about this particular case, but the usual cause of this is over-zealous caching (usually by an intermediate load balancer or the like) that returns cached pages while ignoring the session data. The result is that one person logs in and gets their account info, and then everyone else who tries to view the same page gets that person’s info instead of their own. For example, this happened to Steam in 2015: https://arstechnica.com/gaming/2015/12/valve-explains-ddos-i...
Of course, there are also other possibilities, but this is the most likely. Alternatives that come to mind (though I’ve never seen any of these actually happen):
- A bad RNG creating the same session token for multiple users
- Concurrency bug in the web server returning results to the wrong connection
- Messed-up migration causing account IDs to be associated with the wrong credentials (maybe IDs from the old and new systems got mixed up)
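The caching failure mode is easy to reproduce in miniature. Below is a toy sketch (not Steam's or TSB's actual stack) where the cache key omits the session, so the first user's page gets served to everyone:

```python
cache = {}  # simulated response cache sitting in front of the account page

def render(url, session):
    return f"account page for {session['user']}"

def fetch_buggy(url, session):
    """Bug: the cache key ignores the session, so users share cached pages."""
    if url not in cache:
        cache[url] = render(url, session)
    return cache[url]

def fetch_fixed(url, session):
    """Fix: key on (url, user), or mark authenticated pages uncacheable."""
    key = (url, session["user"])
    if key not in cache:
        cache[key] = render(url, session)
    return cache[key]

alice, bob = {"user": "alice"}, {"user": "bob"}
print(fetch_buggy("/account", alice))  # account page for alice
print(fetch_buggy("/account", bob))    # account page for alice (!)
print(fetch_fixed("/account", bob))    # account page for bob
```

In practice the equivalent fix is marking authenticated responses as uncacheable (or private) at the load balancer/CDN layer.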
Read the long report by Slaughter and May (see top HN comment): 250 pages of detail, quite insightful, with more than enough gory details.
New architecture, new platform, an implementation that takes too long, migration and testing insufficiently planned, and when push came to shove, even the already-lowered non-functional testing goals weren't reached, yet they still went live.
With big-bang approaches one has to keep in mind that complexity grows non-linearly, and at a certain size it becomes very hard. This migration was at a size that was getting hard, but the approach taken and the way it was executed were clearly (see report) not at the level required. A big bang on a smaller scale is the fastest and cheapest; a big bang on a larger scale starts carrying big risks.
A more incremental approach may have been better and cost-effective, with much lower risks. An honest cost comparison may have shown that, even against a big-bang approach properly planned with sufficiently large-scale integration, migration, and proofing milestones, plus buffers for the risks inherent in that one-shot approach.
Agreed. The root cause being lack of regression testing is laughable. Sure, testing would have and should have caught this, but what exactly was "this"?
> The paper also suggests a potential change to regulation—making individuals within a company responsible for what goes wrong within that company’s IT systems. “When you personally are liable, and you can be made bankrupt or sent to prison, it changes [the situation] a great deal, including the amount of attention paid to it,” Warren says. “You treat it very seriously, because it’s your personal wealth and your personal liberty at stake.”
Being sent to prison for buggy code... that's quite extreme. Are people even fired for buggy code? If we want to get more extreme than firing, how about losing 1 year's compensation in addition to being fired.
The negligence on this migration sounds breathtaking: they have a four-nines availability requirement and decided to do a big bang migration with what sounds like virtually non-existent, or at least grossly inadequate, testing and no ability to rollback.
People, often contractors, do get fired all the time for buggy code. Companies get sued for buggy code. However, the issue here isn't really the code, it's the management of the project. Bugs happen all the time and at every stage of development, which is why large migrations like this need to be done very carefully, with ridiculous amounts of testing and rollback procedures. Even when you think you've got everything covered, it's amazing the things that can go wrong on go-live.
Such a move would cause a dramatic shift, but not a good one: it would cause the best and most experienced people to leave for less risky work.
It’s like any contract negotiation, you’re always able to push for more favorable terms, but eventually you’re just weeding out people who aren’t smart enough to walk away.
It might also put some brakes on rampant "disrupt" capitalist overdrive we're in. I honestly can't think of any tech that I NEED to progress badly enough to compromise customer safety.
Bear in mind that the ultimate root cause of this mess wasn't "capitalist overdrive" but an ill-thought-out intervention by the EU, forcing Lloyds TSB to split up because they thought there wasn't enough competition in British banking, without properly considering the practical implications of their systems having become completely integrated over the two decades that they were one company. To put this in some context, the article mentions problems with internet banking - that didn't exist yet when the two were last independent companies. The first, very broken, version of SSL had only been released a few months before their merger.
Yes, EU is the cause of all British ills... </sarcasm>
Or perhaps it was the usual incompetent planning by the owners of TSB - banks splitting up is not a new or impossible thing, you know. And the complexity of IT systems is the most absurd argument against splitting up LloydsTSB - why would any regulator consider this at all?
"Too big to fail" is the last thing we need. And I'm not sure how the SSL part of consumer-facing IT relates to a botched internal data migration.
In a situation in which the consequences are so grave, you end up never vouching for the software as fit for purpose. Rather you claim best efforts to ensure fitness, and what you do vouch for is the testing you performed. If people at the top end of the pay scale signed off on your test plan, and you fully completed all testing successfully, it’s very difficult to find you blameworthy.
Our local Post Office is seriously lacking manpower. They pay roughly minimum wage. On top of that, since the postman also delivers money and other special mail (such as serving legal notices), they would be legally liable if the job is not done right.
Is it any wonder that they have been trying to find another person for years!?
Look at it another way: almost every worker outside the software world is legally liable for their work, from the chef in the kitchen (no poison in the food) to the construction worker (the building should not fall apart) or the electrician (proper insulation and grounding, etc.).
I work in industrial automation and I really wish software were simple again. Today's software depends on so many external things.
In Germany, and I guess in most countries, almost no worker is legally liable, as long as they are not actively ignoring the safety regulations applying to their work. So to make a software developer legally liable, you first have to set up a clear framework of "safety regulations" by which they can be judged.
> For the chef in the kitchen (no poison in food) to construction worker (the building should not fall apart) or electricians (proper insulation and grounding, etc.)
Yes and no. Those people are all responsible for doing their own jobs to a certain standard. If a building falls down it is extremely unlikely that it was due to poor workmanship by an individual bricklayer. The root cause will be in the design and the bricklayer isn’t in a position to challenge the design.
And then who was really responsible for that bug? The engineer who didn't care to write tests? The QA person who didn't ensure it worked? Or the manager who pushed their workforce to the brink? As we have seen with Volkswagen, it is the bottom of the chain which tends to be blamed.
If you follow the professional liability model used in other fields of engineering (civil, etc.), the buck stops at whoever sealed the report / letter of assurance / drawings / etc. We usually call them the Engineer of Record (EOR) in my field, and they are usually someone at the top.
I'm not advocating for that model in software, but it does provide a very transparent way of determining who is responsible. As the EOR, you may not design every member or do every calculation, but ultimately it's your ass on the line. I suspect it also changes your behavior when you know you are the one taking responsibility.
The industry I used to work in used an Independent Assurance Reviewer (IAR) role. The IAR signed off a go-no-go review, whose primary content was identification of the risks that the business would incur if it proceeded past a specific review gate. If an IAR failed a review, the business could still proceed, but the responsibility for subsequent failure then explicitly lay with the leadership rather than the engineers / reviewers. This would be rare though, in most cases IAR failures would lead to mitigation of the identified risks.
In an ideal case, the IAR was identified at the outset of a piece of work and was then involved throughout the project life cycle (waterfall or agile). Review points included bid release and contract acceptance as well as design and test readiness reviews.
Most people want to blame the top but if you go through the legal system, liability will likely be on the bottom rung of the company. If people are personally liable, they will probably need to find their own lawyers and upper management will easily find ways to shift responsibility.
“Recently, I was asked if I was going to fire an employee who made a mistake that cost the company $600,000. No, I replied, I just spent $600,000 training him. Why would I want somebody else to hire his experience?”
Neither https://www.bankofengland.co.uk/-/media/boe/files/prudential... nor https://www.fca.org.uk/publication/consultation/cp19-32.pdf seem to contain phrases that actually say what's quoted in the article. The closest I can see is that these documents suggest that senior management is supposed to be aware of what they are responsible and accountable for. Putting blame on the other end of the chain would be counterproductive anyway. There's pretty much no way the root cause of such a large scale failure would be in some minor technical role.
The part you quoted does not talk about buggy code. Bugs will always exist, the idea is to prepare for them. Migrate systems incrementally, run versions in parallel, have tested rollback strategies at every step, so on and so forth. Not cut corners on backups and contingency plans. Don't rush to launch.
Buggy code is not the problem, but a buggy process is.
If contracts had to include terms that held developers personally liable, it would simply force them to buy professional indemnity insurance. This might not be a bad thing, as it is how things are done in the medical world, but I wouldn't expect to start seeing developers going to prison.
The paper also suggests a potential change to regulation—making individuals within a company responsible for what goes wrong within that company’s IT systems.
We all know how that looks in practice - Experian’s senior management pointing the finger at a single lowly sysadmin. Or VW blaming their emissions scandal on a single engineer.
There is one individual responsible: the CEO. That’s why he or she is paid the big bucks. Maybe make all the C-level jointly responsible. They can share cells afterwards.
yeah, this is obscene. Too many people participate in such systems, and placing blame on anyone is impossible.
Also, that's the whole reason LLCs exist as one of the pillars of our economy. Having so much risk would make any endeavor so risky as to not be worth it.
This sounds like a very bad idea. Why would anyone go into IT if you run the very real risk of going to prison for making a mistake? Software deployments would slow to a once-a-year thing: 3 months to code and 9 months of testing. I would not be surprised to see the suicide rate among developers skyrocket as they dealt with the stress.
You mean we might stop releasing crap software every sprint and developers would actually be careful and responsible? I fail to see any downsides to this.
You weren't paying attention when we were doing all paperwork by hand.
No more software = computers are bricks. You go back to manual operations. Emails? Replaced by letters.
Writing "I fail to see any downsides to this" is pretty much equivalent to "of course we should ban cars, that way we stop car accidents; I fail to see any downsides to this".
I just checked - the former CIO (who surprisingly is still there, but in a different role) was an SMF18 according to the FCA, so he was personally liable for failures in his area. Looking at the Slaughter and May report findings, I would suspect they are working on an enforcement action and have been for some time.
The culprit was insufficient testing, plus a lack of restorable backups, on multiple levels.
Last level of backup: there's 5 million records, use a few dozen reams of paper and print out the account totals for every single account.
For all of the complex systems you can think of (planes, living organisms, etc), the reason things usually go so well is that there are multiple levels of checks and balances. Everything is usually veering towards entropy, and fail-safe systems try to ensure when things go wrong that they fail safely.
Blaming testing feels like a cop-out - it's easier to say 'if we tested a little more everything would have been fine' rather than 'our leadership was too incompetent to recognise that the plan was fundamentally flawed from day one, that releasing in the way we did maximized the risk to customers, or that we decided not to listen to any concerns which would have compromised the timeline'
Exactly. Had they done all the proper testing, they would have likely discovered all these issues, it's not a stretch to think that the management that allowed the project to get to the state it was in would have also pushed forward with the go-live despite the bugs that were discovered in testing. And just because they allocated time to testing doesn't mean they would have allocated sufficient time to remediation and re-testing. I've seen it happen where bugs were found late stage in a big-bang style migration, they were remediated, but there wasn't enough time left to conduct another massive mock go-live to make sure it worked. So you hope and pray...
Banks handle balance discrepancies all day long. All operations that cross a bank boundary are subject to reconciliation further down the line, with massive expert systems handling most of the obvious cases automatically and operators handling the flagged transactions.
Working as a test automation engineer, I'm not surprised at all. There is little to no investment in testing at most companies because it's just considered a cost. I can't count the times I asked for better servers for testing, or for more time and people to build a better testing infrastructure, and the answer was always "we will do it! We definitely need to get better at testing!". Still waiting for any action. Most managers I've met just want to meet the deadline they set and ask: "can't you test it in another way? How is it possible that it takes so much time?". They don't give a damn about testing; they just want a pat on the shoulder that everything is fine, and someone else to hold responsible if something goes wrong.
Scientist: “Well, we’ll need to test it first. We should run some trials —“
Manager: “We definitely need to test more often! But I already agreed to have it out by next Tuesday, so maybe we can just release it and aim for more testing next time.”
> "Perhaps the easiest way to avoid outages is to simply to make fewer changes."
The exact opposite is true. When a process happens infrequently, it's more likely that the people involved will make mistakes and overlook steps. The correct answer is to make changes more often, and to develop robust processes and automation around those changes.
Wanted to state the same. I get angry when I have clients talking about their "major release days" every three months and how it's a good operating procedure to have this dedicated day to change all their systems at once.
Robust IT needs decoupling and small incremental changes. This will outperform any coordinated release scheduling in terms of reliability, if implemented correctly.
Enterprises tend to favor big-bang migrations on a specific date because somebody higher up set a particular date and everything falls into place with a Gantt chart running waterfall. The reality is that it falls on a few technical folks to triage a large number of teams, including the ones from the company they're trying to break away from (which introduces friction). This adds significant risk to the project.
"TSB chose April 22 for the migration because it was a quiet Sunday evening in mid-spring."
This might've gone better if TSB had allowed months prior to April 22nd for the long-duration migration and testing to be completed, and a period of weeks or months for going live post-migration. The F5 load balancer (commodity hardware) could've slowly cut over the traffic 10% at a time to the new migration site to get a feel for user experience. Coordination with the TSB network team would be necessary to accomplish that.
It is a tough spot though, I hope the team learned something from that.
> The F5 load balancer (hardware commodity) could've slowly cut over the traffic 10% at a time to the new migration site to get a feel for user experience.
I doubt an F5 load balancer would work in this specific case. But there definitely should've been a software router-adapter that routed requests to two systems and converted their replies to a single format. This would've let them migrate their customers in batches instead of a big bang cutover.
That's what I've always done when migrating data to a new banking system.
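A minimal sketch of that router-adapter idea, with everything (batch logic, backend stubs, reply shapes) invented for illustration: route each customer to the old or new core depending on whether their batch has been migrated, and normalize both reply formats into one.

```python
# Hypothetical router-adapter: serve each customer from whichever system
# currently owns them, returning one normalized reply shape to callers.
MIGRATED_BATCHES = {1, 2}  # batches already moved to the new system

def old_backend(customer_id):
    # legacy reply: amount in pence, terse keys (invented stub)
    return {"bal": 12550}

def new_backend(customer_id):
    # new reply: decimal pounds, verbose keys (invented stub)
    return {"balance_gbp": 125.50}

def batch_of(customer_id):
    # assumption for the sketch: migration batch derived from the id
    return customer_id % 10

def get_balance(customer_id):
    if batch_of(customer_id) in MIGRATED_BATCHES:
        reply = new_backend(customer_id)
        return {"balance_pence": round(reply["balance_gbp"] * 100)}
    reply = old_backend(customer_id)
    return {"balance_pence": reply["bal"]}
```

The point is that callers never see which system answered, so batches can move one at a time instead of in a big bang.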
What I've seen work well, for multiple migrations, (at an admittedly smaller scale) is using shadow writes/reads and a source-of-truth toggle.
The API layer made requests to both the old DB and a new DB that had been populated during a small window of scheduled downtime.
We spent a couple weeks/months running checks in production that the old DB and new DB were returning identical results, though still returning the old DB's results as source of truth. Eventually, we flipped the source of truth to the new DB, and some time later decommissioned the old DB.
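The shadow-read pattern described above can be sketched roughly like this (schema and names invented; a real system would compare asynchronously rather than doubling read latency):

```python
import logging
import sqlite3

SOURCE_OF_TRUTH = "old"  # flip to "new" once mismatches stop appearing

def get_balance(old_db, new_db, account_id):
    """Read from both stores, log any mismatch, serve the source of truth."""
    query = "SELECT balance FROM accounts WHERE id = ?"
    old_row = old_db.execute(query, (account_id,)).fetchone()
    new_row = new_db.execute(query, (account_id,)).fetchone()
    if old_row != new_row:
        logging.warning("shadow-read mismatch for %s: old=%r new=%r",
                        account_id, old_row, new_row)
    return old_row if SOURCE_OF_TRUTH == "old" else new_row
```

The mismatch log is the valuable part: it gives you production-grade evidence of equivalence before you ever flip the toggle.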
Great approach imo. Once you get flipped over to the new source of truth you have to make sure the business prioritizes decommissioning the old DB, though. I've seen certain departments (looking at you, BI) treat the old database as a permanent backwards compatibility layer.
I think part of the problem is that TSB had extremely limited access to the existing system, which affected their understanding of how that existing system worked, the kinds of testing they could do, and their ability to do incremental migrations.
That might've been the case, but it can't be the excuse.
They were spending €100mil a year on maintaining the current infrastructure. The team's incentive is to save a large portion of that through the migration, and somebody higher up should've given them unlimited access to do so.
They were paying €100mil a year to a direct competitor who owned and operated the current infrastructure for them. The amount of money they were paying was, if anything, a direct incentive not to give them the access required for the migration.
> They were spending €100mil a year on maintaining the current infrastructure.
As I understood it, they were made to pay that because the infrastructure they were using belonged to another bank.
The lack of testing is a symptom of a deeper root cause: cost cutting. The audacious thinking that you can just shuffle around developers that have been on the project for multiple years, outsource their roles to cheaper countries, count that as a win because the balance sheet looks better for that year, and then not expect this to happen. Honestly, there is a lot of hidden knowledge that NEVER gets fully transferred in these kinds of knowledge transfers, and you'll only notice it once shit really starts hitting the fan.
I have been out of big bank IT for eight years now. Clearly I don’t miss it. For one programme we had to produce upfront estimates. Clearly anything we produced would be wildly inaccurate. But we tried. The IT director then cut the estimate in half and added another hundred people to it. Cue complete chaos and a massive delay. That was the calibre of senior management back then. I doubt it’s much better today.
I didn't read everything, but their mistake resembles the one in "The Phoenix Project" book; it's awful, but it can realistically happen at many corporates.
Having done a large data migration for medical records, I would be sweating bullets doing this with presumably fluctuating transactional bank data. Out of curiosity, does anyone have recommendations for their favorite data migration tool?
IMHO it is unreasonable to expect these sorts of things to be possible using a big-bang approach. The complexity exceeds cognitive capabilities and disaster strikes. Therefore, it is better to see these projects not as data migrations but as functionality migrations. You move some customers to another set of functionality. Thus, my recommended tool is to use none.
100% - what I always do is start the new logic/system so it processes in parallel with the old one and at the same time backfill the old data into the new system. When everything is proven working, feature flag to switch the new system as the main one, old one keeps running. After a period of quarantine, switch off the old system.
I don't understand the requirement of doing the "nuclear button" migration, except maybe a shortsighted way of trying to save costs.
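A toy sketch of that parallel-run approach (all names invented): dual-write to both systems, backfill the history, and flip a feature flag once the new system is proven.

```python
# Hypothetical sketch: the old system stays authoritative while the new one
# shadows every write; a feature flag decides which side serves reads.
FLAG_READ_FROM_NEW = False

old_store, new_store = {}, {}

def write(key, value):
    old_store[key] = value  # old system remains the system of record
    new_store[key] = value  # new system receives the same write in parallel

def backfill():
    # one-off load of historical data into the new system
    for key, value in old_store.items():
        new_store.setdefault(key, value)

def read(key):
    return new_store[key] if FLAG_READ_FROM_NEW else old_store[key]
```

Because the flag only changes which side is read, flipping it back is an instant rollback rather than a second migration.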
We do this too and I think it's a great approach. One thing I will say is you should try and decommission the old system as soon as feasible - running both has additional testing and development costs, particularly if for some reason you need to add new features to both.
We generally aim to leave the old system toggled off for a release or two (allowing us to switch back in the case of a serious defect) and then rip all that old code out in the subsequent release.
Don't do migration in one go. Do it gradually. Make sure you can run both systems at the same time. It will take longer than you thought doing it in one go would take, but less time than actually doing it in one go and fixing all the problems.
It's a bit glib, because data varies so much, but my personal favourite data migration tool is Perl.
I've migrated a lot of "stuff" from one place to another, from filesystems, to databases, to backups, and there is usually some perl running in the background.
Of course the scale is very different. The biggest thing I've personally migrated was in the low terabytes of data. Certainly nothing so large/necessary as banking stuff.
Depends significantly on the back-end data store. For some of our in-house systems we’ve had good experiences with GoldenGate [1] doing a continuous sync between systems to enable a rollback scenario, and there are equivalent packages available for lots of datastores. Some of the larger finance packages also include export and snapshot functionality, but this won’t help with maintaining parallel systems.
The risk however is evaluating consistency between the systems before doing the rollback which in this case would probably require a more advanced testing capability than what was available...
I had a graphical flow migration tool pushed down my throat by my manager, on the argument that that way others would be able to read and reuse it because it was standard. In reality it was used once and I suffered all its quirks; no version control was possible either.
There's not a lot of details in that article. But it seems that they were dealing with a moderate amount of data (about 100 GB ?), which is trivial to backup many times. They do not explain what's the big deal with that.
Very interesting article. But I miss a story about what happened in the days after the disastrous migration. Could customers find a correct balance on their accounts eventually?
Potentially dumb question: I've done front-end (iOS) my entire career. If I wanted to go from 0 to being able to do a SQL DB migration in production without disaster, what should I do?
I saw https://www.masterywithsql.com/ posted on HN a few months ago, and I'm planning to finally start it over break. But what do I use after that?
That's a good start, but it may even be overkill for your goal.
The trick is not in the application of highly complex SQL. The trick is in the PROCESS — making it robust, testable, traceable, and reversible for every object despite the complex dependencies. And then indeed validating that every step went well, with the validation depth depending on the object's criticality (OK return code, vs. check of record counts, vs. check of totals on fields, vs. check of totals per category, etc.)
If you're really into it, consider the use of automation and tracing IDs for individual loads.
If you're worried about the time it takes despite automation: work on the bulk of the data, but use DB triggers to record the delta from your snapshot, then treat that "sidecar" accordingly.
Then: do at least one end-to-end test.
Learn from the mistakes at every single stage, and act on what you've learned.
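The graded validation described above might look something like this (data shapes invented): cheap row-count checks for everything, with deeper total checks reserved for more critical objects.

```python
def validate(source_rows, target_rows, criticality):
    """Return a dict of named checks; deeper checks for higher criticality."""
    checks = {"row_count": len(source_rows) == len(target_rows)}
    if criticality >= 2:
        # totals on a numeric field catch silent truncation or scaling bugs
        checks["amount_total"] = (sum(r["amount"] for r in source_rows)
                                  == sum(r["amount"] for r in target_rows))
    if criticality >= 3:
        # totals per category catch rows that landed in the wrong bucket
        def totals_by_category(rows):
            totals = {}
            for r in rows:
                totals[r["category"]] = totals.get(r["category"], 0) + r["amount"]
            return totals
        checks["per_category"] = (totals_by_category(source_rows)
                                  == totals_by_category(target_rows))
    return checks
```

Returning named checks rather than a single boolean makes the load traceable: you can log exactly which depth of validation passed for each object.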
This is 100% the way to learn this, and a great answer to the original question. Should also emphasize though that for a migration of the scale discussed in this article doing a SQL backup + restore is almost certainly not the right approach.
At this kind of scale + uptime requirement, you'd want to do a migration like this gradually - hydrating the new system with data and keeping it up-to-date with changes for a period of weeks/months while doing testing + validation (and ideally also doing a gradual cutover, although that might not be realistic given banking infrastructure/application design).
Probably the relevant approaches to look into after reviewing basic backup/restore are things like log-shipping or change-data-capture, although choosing the right approach would be highly dependent on the underlying technology, architecture, and requirements...
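In spirit, the change-data-capture approach mentioned above reduces to replaying a change log into the new system until cutover. This is a deliberately naive sketch with invented names; real CDC tools work off the database's own transaction log:

```python
change_log = []   # stand-in for a shipped WAL / CDC stream
new_system = {}

def record_change(key, value):
    # the old system appends every committed change here
    change_log.append((key, value))

def apply_pending(cursor):
    # the new system periodically replays changes it has not yet seen
    for key, value in change_log[cursor:]:
        new_system[key] = value
    return len(change_log)  # new cursor position
```

The cursor is what lets the new system stay only seconds behind for weeks while testing proceeds, instead of needing a frozen snapshot.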
You're absolutely right, but we could all be talking about different subjects here:
- database restore
- database migration (your case)
- data migration to a different platform and presumably different data model (the bank's case, or my answer here)
Not sure which case the question was referring to, though...
Probably someone will have a better answer for DB migration, but the problem here isn't really the DB migration. TBH the DB part seems to me the easiest part; what they did was migrate the entire stack, not part by part, without pilots and with poor testing. Migrating the whole stack in one click is the nuclear option, and I believe poor planning played a big role in forcing them to do it that way.
A commonly held factoid says they'd have a 50% chance of going out of business within 6 months after losing that much vital customer data. I hope they made backups and didn’t just rely on snapshots, database transactions or journaling.
IT professional prime directive #0: don’t lose vital data.
Such a weak article with so much filler. The first several paragraphs were interesting and then he goes on to say in the 60s only employees used the bank computers, but nowadays (to paraphrase) "there's this thing called online banking".
Article quoted 2500 person years, over a period of a couple of years, to move from a Lloyds system to a clone of an existing Banco Sabadell system.
Does that seem like a lot of effort for 5 million accounts? Maybe worth it since they were already paying EUR 100 million per year to license the old Lloyds system.
I’d bet they moved from one big honking RDBMS to another. Curious if old and new were different DB vendors, if anyone here has insight.
Important safety tip. If you hear any vendor executive brag about the large number of people it took to do an IT project, seek out a different vendor.
I don't get how small-value transactions could turn into large-value transactions on migration, though. Hollerith cards? Wrong column numbers in the COBOL program? Decimal point / comma confusion? Whisky Tango Foxtrot?
There was a recent post about in-flight entertainment/information systems on the front page (clickbaited with "Widevine" in the title), which showed a (presumably somewhat recently designed) JSON object containing flight data.
I think the date was "00YYMMDD" or something like that, and time zone offsets were represented as a float-represented-as-string, but to indicate a negative offset they added 80000 or so...
So if that's what happens to systems designed today, imagine how legacy cruft from decades ago must look. I would not be surprised if the answer to "Hollerith cards? Wrong column numbers in the COBOL program? Decimal point / comma confusion?" was "all of the above, and then some".
Oh, terrific, some ancient program in C, where somebody cobbled in a poorly designed and untested C version of JSON.stringify(). That is 21st century craniorectal inversion, not decades-old tech debt.
I started out working in that era (no COBOL, but all the rest of it). At least some of us were suspicious of data-in-a-few-characters and character column-number based records (ummm, punchcards). Some of us checked all kinds of things on input to try to reject garbage. Spaces in the middle of numbers? BOOM. Something unexpected in the "record type" letter? BOOM. The card reader actually had a diversion output tray where we could spit out the rejected cards.
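As a hypothetical illustration of the "wrong column numbers" guess upthread (the record layout here is entirely invented): a one-column offset when parsing a fixed-width amount field is enough to turn pennies into millions.

```python
# Invented fixed-width record: 10-char account id, 9-digit amount in pence,
# then a 3-char currency code -- the punched-card-era layout style.
RECORD = "ACC0012345" + "000001250" + "GBP"

def parse_amount(record, start, end):
    return int(record[start:end])  # amount field, in pence

correct = parse_amount(RECORD, 10, 19)  # 1250 pence = 12.50 GBP
shifted = parse_amount(RECORD, 9, 18)   # off by one column: 500000125 pence
```

The shifted read pulls in the last digit of the account id and drops the last digit of the amount, producing a "valid" number of just over five million pounds with no error anywhere.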
Best software practices for safety critical systems seem essential, I don't think it's crazy to think that some sort of regulation or licensing would help enforce those practices.
https://www.tsb.co.uk/news-releases/slaughter-and-may/slaugh... - the report itself