* Functional testing took 17 months, and began very late due to project delays.
* The board was told that there were only 800 bugs, yet the independent review found that there were closer to 2,000.
* Non-functional testing seems to have been rushed.
* The Sabadell COO reported that SABIS, one of its subsidiaries and the major IT contractor on the migration, didn't have the capacity to respond to and resolve incidents after go-live.
And the TSB board decided to Go Live anyway.
Looks like a classic IT failure story: the project is late, costs are cut on operations, everyone gets tired of testing and fixing, and it's declared good enough to release. But there are also some very weird incentives: SABIS was apparently not the best vendor for the job, but it is part of Sabadell, the bank that bought TSB.
The "Integrated Master Plan" Gantt chart of the project on page 44 is Waterfall as hell, and they still failed to plan for non-functional testing.
Those are some first impressions; it looks like an interesting read.
In UK/EU banking the way to get promoted is to be a yes-man. The business has complete control and abhors IT. Impossible deadlines and a hands-off approach make for atrocious code bases. And this becomes a never-ending feedback loop going on for decades. There’s never time to pay down technical debt. And now with the replacement of commercial staff for IT staff there’s a growing divide. Stay away.
I have experience of this and can confirm. Banking is not a sector for young smart people because it will wear you down. It's a constant battle with change approval boards, fighting their risk-averse fear of causing service disruption while at the same time expecting things to be delivered. Save these types of roles for when you are old and want a boring job.
It’s the same in the US Federal government IT space, especially when dealing with archaic legacy systems. Innovation is nearly impossible and change approval boards reign. Everyone is risk-averse because of previous breaches, even to the point where engineers are not allowed to run the code they wrote themselves. You heard that right. We can write code, but we can’t run it ourselves.
Had similar experience talking to ... someone at fidelity. I can't remember his title, but we were trying to negotiate some data exchange with them. He asked for our development workflow/process, and balked when he saw that a "developer" would also be the same person who had "access" to the database with customer information.
"Developers can not touch the database"
Umm... someone needs to, and we're a small team.
"Developers can not touch the database"
"Why not?"
"They can exfiltrate data".
"So could someone else who has access to the database"
"Developers can not touch the database"
"If we give a non-developer access to the database, by definition, they have access, and therefore the ability to exfiltrate data"
"Developers can not touch the database"
"Can you provide some examples of how other companies who've passed your inspection are making any sort of database changes/upgrades without someone having access to the database?"
This is a pretty normal requirement. I believe what they are saying is that you should not be able to access sensitive customer data by virtue of “just” being a developer, and you should audit who has access to such data and why. And by “touch” they mean unfiltered access, not “you cannot do anything with the database and therefore cannot work”.
This is pretty basic information hygiene, and if you don’t have enough care to implement even that then why should they partner with you? Sounds like a liability waiting to happen.
They would not give us any indication of what they expected.
> I believe what they are saying is that you should not be able to access sensitive customer data by virtue of “just” being a developer, and you should audit who has access to such data and why.
It should be pretty straightforward for them to provide a "here's our initial base minimum requirements guidelines" and work from there.
The parameters they laid out were similar to "anyone who would be capable of running any code on the database should not have access to any data". Every clarifying question about "what about X? what about Y?" was met with basically, "no, that person could exfiltrate the data". Which, yes, anyone who has access to the data could, theoretically, exfiltrate the data.
I showed a demonstration (from laptop) and the guy got really upset that I had 'customer data' on the screen and in my database. It took several minutes of back and forth to explain to him that it was "faked" data.
It sounds like your team has no clue how to approach these discussions and lost out on the partnership. Fidelity don’t know you from Adam, and you seemingly gave no impressions that you had any idea what they wanted or how this process works.
They shouldn’t have to spell out how to protect data to you, and simply asking “what about X”, as if you are the only one ever to point out to them that there are of course ways data could be exfiltrated, betrayed your ignorance, not theirs.
Did you expect them to say “my god! Of course! Why didn’t I see this before? This regulation we are legally and contractually bound to obey is not perfect, so hey why don’t you just ignore it.”?
And to top it off you started showing realistic looking data without being upfront and clear to them that it was synthetic? Come on.
Not the same as "we're doing this and..." being cut off and being told "this is 100% wrong, and we will not tell you any scenario that would bring you into compliance with our view of secure".
We had plenty of secure process in place. The moment he heard the word "developer" and "database", he basically shut down and refused any meaningful engagement.
It is not his job to educate on secure practices. It was his job to inspect what we had in place, and he stopped doing that the moment he heard those two words.
"Would having a non-developer administrator connect and process the changes work?"
"Developers can not have any access to any production data".
That's not a meaningful response to the question.
> And to top it off you started showing realistic looking data without being upfront and clear to them that it was synthetic? Come on.
I'm not sure how much more 'up front' you can be besides saying "we develop with faked sample data, this is faked and unrelated to or connected with the production system" at the start of that walkthrough. Perhaps if we'd opened up a Northwind database it might have looked more plausible?
No one was expecting them to ignore anything. We were expecting someone to ask "are you doing X? how are you doing Y?", then followup with "you need to remediate these aspects of your system before we proceed further". That's what I would expect.
To be fair, it's not their job to educate you on good security practices. This was an obvious red flag (developers should indeed not have unfettered access to production databases), so if you fail at a basic separation of duties on the team, there's no point in progressing any further.
If you're expecting some "detailed guidelines" so you can "fix" your system to pass their audit, you're missing the point entirely.
Also when showing anything that remotely looks like customer PII, it's your job to clarify that this is not real data. Given the precedent, I can see why they thought you were being careless and showing real production data.
Wait a minute. How would they be showing production data if they couldn't access the database where, presumably, this data was stored?
It sounds a lot more like the smaller company was interfacing with someone who was not technical or at the very least did not have a clear understanding of the internal processes of the big company, and expected the smaller company to have prior experience of this thing and to guide them through it.
Most likely, the process would have involved the smaller company communicating with someone inside the larger company, like a DBA, and asking the large company's DBA to make the changes they needed. In large financial corps, the company's own developers are not allowed to touch the database either; in fact, where I worked, we were not even allowed to create a database for testing an internal-facing project. Processes are that anal.
> How would they be showing production data if they couldn't access the database where, presumably, this data was stored?
The entire anecdote was about them telling the representative that they DO in fact have access to the production DB. Why would you expect him to act as if they didn't have access after being told they do?
In large companies, especially large companies that are not technology companies, IT people have very clearly delineated roles, so a "developer" and a "DBA" are different people and when a "developer" needs a database, he or she must ask a "DBA" to make one for them.
In small companies, a "developer" is, by comparison anyway, a jack of all trades, who is probably expected and required to know how to do every possible job, including database management. So when the large company said "no developer" they possibly meant that, internally, their developers could not touch databases and only DBAs were allowed to, and they expected the smaller company to have similar processes in place. Which is quite unrealistic.
At the end of the day, paranoid as that may sound, the big company person was probably getting stuck on the role description of "developer" and would not have freaked out if the smaller company had used the term "DBA".
I don't know how to fulfill this in a way that makes every auditor happy, but so far what we're doing is: we have a restricted-access role on the DB that denies access to certain sensitive tables, we create a time-restricted temporary user for the developer, and then give them access to the read replica.
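For what it's worth, the grant logic above can be sketched roughly like this: Python generating Postgres-style SQL, where the role name, table names, and deny-list are made up for illustration:

```python
from datetime import datetime, timedelta, timezone

SENSITIVE_TABLES = {"customers", "payment_details"}  # hypothetical deny-list

def temp_read_grant(dev, tables, hours=4):
    """Build Postgres-style SQL for a time-limited, read-only grant on the
    read replica, skipping any table on the sensitive deny-list."""
    allowed = sorted(set(tables) - SENSITIVE_TABLES)
    expires = datetime.now(timezone.utc) + timedelta(hours=hours)
    stmts = [f"CREATE ROLE tmp_{dev} LOGIN VALID UNTIL '{expires:%Y-%m-%d %H:%M}+00';"]
    stmts += [f"GRANT SELECT ON {t} TO tmp_{dev};" for t in allowed]
    return stmts

for s in temp_read_grant("alice", {"orders", "customers"}):
    print(s)  # the sensitive "customers" table never appears in a GRANT
```

Whether this satisfies an auditor probably depends more on the audit trail around it than on the SQL itself.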
This sounds like the right approach, and it’s similar to other companies I’ve worked at. The time-limited nature is nice, and presumably there is an audit log of some kind tied to that.
While I get the 'atmosphere' of the meeting just by reading this, there are quite a few valid points raised. Developers should not have access to real user data. Devs should be provided with DB schema definitions, info on data volumes, and any other relevant information or metadata that is required to develop new features, migrate, etc. When it comes to databases with extremely sensitive data, i.e. banking, healthcare, criminal records, there must be an additional security layer that logs anyone accessing particular records.
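That extra logging layer can be as simple as a wrapper that records who read which record, and why, before the data is returned. A minimal sketch (the field names and in-memory store are invented; a real system would use an append-only audit store):

```python
from datetime import datetime, timezone

AUDIT_LOG = []  # stand-in for an append-only audit store

def audited_fetch(user, record_id, reason, fetch):
    """Record who accessed which sensitive record, and why, before returning it."""
    AUDIT_LOG.append({
        "user": user,
        "record": record_id,
        "reason": reason,
        "at": datetime.now(timezone.utc).isoformat(),
    })
    return fetch(record_id)

records = {42: {"name": "J. Doe", "balance": 100}}  # fake data, not real PII
row = audited_fetch("alice", 42, "support ticket #1234", records.get)
print(AUDIT_LOG[0]["user"], "read record", AUDIT_LOG[0]["record"])
```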
Separation of duties regulation. Nothing wrong with it. The idea is to have a wall so that you can't have one person able to write and push code to production.
Now if you mean "run code" at all that's another story.
Unfortunately, the "finance is only for old and boring people" also leads to a certain prestige for finance - startups are for kids, finance is for adults. Many in finance almost pride themselves on how much time they spend in meetings, and how long it takes to release anything.
For startups, the myth is that success comes from a "move fast and break things" culture, so they wear IT disasters as a badge of honour. In finance, it's the opposite: with enough planning and governance, huge budgets, and long timelines, they believe you can guarantee success on any project.
> Modern technology companies make tons of changes all of the time. As a result they become good at them.
Really? That hasn't been my experience. Nearly every time some moderately large update to any modern software goes out, it seems to break a bunch of things. What it doesn't break it usually just ditches entirely.
Unfortunately in a lot of countries in Europe that's the only sector that does pay a premium in comparison to other sectors. Try getting £100K+ dev job in London and suddenly all the potential employers ( with a few exceptions) end up being either in the City or Canary Wharf.
You could replace banking with any other large enterprise vertical like healthcare, telecom, government and be very correct. The problem is that IT is a cost center, and cost centers are as a rule terrible to work in regardless of industry, even if you're at a tech company.
Very true. Work somewhere where the CEO has some level of understanding and appreciation for your work and life will be much better.
Corporate IT is an especially bad place. You can only lose. When things go well, they start cutting costs by letting people go and offshoring jobs. And when things go badly, you get the blame. The IT department at my company is hard to work with and often drives me crazy. But when I look at their structure, it’s obvious that it doesn’t allow people to do a good job.
My theory is that the code never gets fixed in these pathological companies. They just get higher costs until people switch away to a bank that is more efficient.
I think that’s especially common in regulated industries. On the one hand top management are afraid of compliance issues but on the other hand they don’t understand and respect why things cost money and time. So they are very susceptible to hiring sweet talking people who pretend to be able to solve this.
The number alone is meaningless. Some bugs are UX inconveniences, while others can destroy data, as we saw in this case.
The real problem at these banks is the multitude of systems and data feeds that impact systems downstream, in chains a hundred systems long.
The real solution to these big project failures is, in my experience, phased deployments, with the first step being retiring old systems and consolidating data repositories to reduce knock-on effects.
"making individuals within a company responsible for what goes wrong within that company’s IT systems" seems patently ridiculous, especially without "paying individuals in proportion to value generated by IT systems". Otherwise, you have an unlimited downside, and a hard capped upside, which, personally, means I quit IT and start serving coffee for a living.
It gets a bit murkier with executive positions, in my opinion, since the unlimited upside does start coming into play.
If companies want to hold engineers responsible to that extent, they would also need to give them a way out. Meaning that if a developer, regardless of deadlines, does not believe that their code is production-ready, then you either delay the launch or the responsibility shifts to the manager who chose to ignore the developer's concerns.
As developers and engineers we’re responsible for letting management know if a project is not on track. What they choose to do with that information is their problem.
I'd imagine they'd rush delivery of items (going lower quality) and work just like we do to get something finished before the city enacts a fine for delays. Look at Mopac in Austin..
This is exactly why software developers shouldn't call themselves engineers. Real engineers (structural, mechanical) are legally bound to denounce unsafe or incorrect designs or practices.
There's a whole lot of EE's and ChE's that'll be disappointed to learn they're not "real" engineers.
And even for structural and mechanical engineers, not everything requires reporting. Even when the project is subject to it, any number of institutional and social factors can make that essentially impossible.
The idea is that this would apply to executive positions. Right now, CFOs are required to sign off quarterly on the company's financial statements. If those financial statements are later found to be misleading or fraudulent, the CFO is personally responsible, and can face fines or jail times. What the article is proposing is that IT be held to a similar standard. A CIO should be held personally liable if an IT failure causes an outage that materially affects the finances of the firm or causes customers to lose money.
Nice idea in theory but it can (and will) be gamed. For instance, the CIO will have an auditor sign off on a master checklist of yes/no items which the engineers will then be forced to self-certify any changes against. Any failures after that are "provably" due to engineering not being accurate in their assessment of checklist compliance.
What items might we see on such a list? Oh, I don't know... 100% test coverage, perhaps?
In the UK this is called Governance. Fancy word for what is, effectively, liability avoidance.
The question with this kind of thing is what happens when you are ordered to do something that is outside your personally acceptable risk. Including threats of retaliation.
I’m not sure how that relates to your original question. If your boss is instructing you to do something that you deem to be unacceptably risky, make sure they are aware of and take ownership of the risks. Large and especially regulated organisations (like banks) will have processes to track things like risk ownership. If you communicate risks like this, then not only have you covered your own ass, you’ve also done the responsible thing of notifying the correct stakeholders. If they choose to act irresponsibly in light of that, then that’s on them.
I’ve done a lot of work in banks and this story sounds very, very familiar to me. Change the details of the system and it could be any number of projects that I’ve personally worked on. I’ve been brought into a number of projects like this at the ‘near completion’ stage, and each time I’ve reported on what I thought the risks were, suggested how they should address them, and advised them to delay delivery until they do. Some of those projects worked out well, some of them completely bombed, but I’ve never been put in a situation where I was made even partially accountable for somebody else’s poor decision making. So if your question is “how do I protect my own interests when my boss wants to be an idiot?”, then that’s the serious answer for how you do it.
I think the labour market would change the compensation too in this regulation scenario. Maybe it would go well, maybe badly (lawyers and liability insurance imposing conditions...)
Companies should give people the option to "sign off" that a project will succeed, and then ONLY those people should get the reward or punishment. This would be for everyone, not just IT, and it'd require people to assess others' skills and trust them on the job.
I'm thinking it would apply to the CTO and similar high-level employees? It wouldn't make sense to punish random leaf-node engineers; they don't know the full state/readiness of the system and aren't the one making the go/no go decision anyway.
There are countless stories of large scale failures involving non-consultants as well as very well paid consultants.
Software is hard and most developers aren't very good. As the complexity of software increases, the number of developers and teams that can manage it becomes vanishingly small; to a point where no amount of money can help.
The best chance for a project like this to succeed is to break it down and migrate over the course of years (which might not always be possible).
I witnessed the outsourcing of 3,000 IT staff at a large company in the mid-2000s. The contract was great for the outsourcing company and was signed off by a bunch of bozos who had no idea how IT runs.
The idea was to have tickets for everything, including "golden" ones which cost a fortune but are super high priority.
IT was mad at being outsourced and played the game. "super important" thing coming? Ticket. "just a cable, man"? Ticket.
What the idiots in management discovered is how THEIR company actually works, and how important the human relationships are.
One of my colleagues in IT had the pleasure of telling a board member who came expecting to be serviced right away to fuck off (his words) and to open a ticket, if he knew what that was.
Then he picked up the phone as someone was calling and turned away.
> IT was mad at being outsourced and played the game.
This is called "work to rule", and anybody who doesn't know the risks and repercussions of such an event should not be in business management in the 21st century. Unfortunately, a lot of people try to pretend that trade unions never existed, and in so doing they lose an excellent opportunity to learn how workers actually think and act.
“One of my colleagues in IT had the pleasure of telling a board member who came expecting to be serviced right away to fuck off (his words) and to open a ticket, if he knew what that was. Then he picked up the phone as someone was calling and turned away.”
This is great. Usually the board member would get preferential treatment and never experience how things really are.
> There are countless stories of large scale failures involving non-consultants as well as very well paid consultants.
I wouldn't call it a straw man. There are more complex mechanics in outsourcing situations. Often outsourcing is not complete. They still keep a handful of inside workers supervising consultants. And often those inside workers are woefully incompetent, ruining the work of "well paid consultants". That happens because the outsourcing only saves those with good connections: the idiot relative of someone with power.
I've also heard many managers in the consulting firms telling us to never go out of our way to fix problems. If the customers want it fixed, they must pay more hours. Except of course when the customers get very upset and demand unpaid overtime to reach some arbitrary deadline.
Every time someone I work with casually raises the possibility of outsourcing, I say I'm open to the idea. Then I share that link and ask them to review it before we meet to discuss our options.
This is accurate. I have unfortunately been involved in a situation very similar to the one the article describes. The release is signed off on, we go live, and millions of records are suddenly corrupted, affecting thousands of people and essentially shutting down an industry for a while. Engineers are only as good as the QA department, and the QA department is only as good as the time and resources it is provided. This is compounded when a company has to deliver in order to get paid, so it eventually reaches a point where, hamstrung by a bad original contract, it is under massive pressure to deploy and just pulls the trigger.
Software is very hard, many engineers are not sufficiently experienced at what they are doing, and it's hard even for those who are. Further complicating this is a cost issue: quite often, large-scale applications utilize resources that cost a lot of money on the production side, and so, to save money, they utilize a considerably smaller data and infrastructure set on the testing and development side. Queries are optimized against this smaller data set. The application goes live and suddenly an application tested against a cherry-picked data set of x records is trying to handle 10,000 times the data. Then bad things happen.
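To make the scale problem concrete, here's a toy illustration (not from any real system): a duplicate check that is unnoticeable on a small test set but quadratic, so it blows up at production volume, versus what an index-backed plan effectively does:

```python
def naive_duplicates(rows):
    """O(n^2) pairwise scan: fine on a cherry-picked test set, ruinous at scale."""
    comparisons, dupes = 0, set()
    for i, a in enumerate(rows):
        for b in rows[i + 1:]:
            comparisons += 1
            if a == b:
                dupes.add(a)
    return dupes, comparisons

def indexed_duplicates(rows):
    """O(n) with a hash set: roughly what an index-backed query plan does."""
    seen, dupes = set(), set()
    for r in rows:
        (dupes if r in seen else seen).add(r)
    return dupes

small = list(range(100)) + [7]
dupes, n_cmp = naive_duplicates(small)
print(dupes, n_cmp)  # {7} and ~5,000 comparisons: invisible in testing
# At 100,000 rows the same scan needs ~5 * 10^9 comparisons: a go-live incident.
```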
> There are countless stories of large scale failures involving non-consultants as well as very well paid consultants.
Sure but there are undoubtedly more failures involving cost leaders doing the same job as highly paid consultants. It is foolish to suggest otherwise or that there isn't a correlation. Your diminishing returns example might be true, but also doesn't discount the original statement.
Good engineers help, but they don't solve organizational problems. A program manager who knows what questions to ask and what success and failure look like could have fixed this. Same for a business sponsor who understands that the cost of a screw-up is far higher than the cost of taking your time and being careful.
> I’m super curious how customers ended up seeing other people’s accounts.
I don’t know anything about this particular case, but the usual cause of this is over-zealous caching (usually by an intermediate load balancer or the like) that returns cached pages while ignoring the session data. The result is that one person logs in and gets their account info, and then everyone else who tries to view the same page gets that person’s info instead of their own. For example, this happened to Steam in 2015: https://arstechnica.com/gaming/2015/12/valve-explains-ddos-i...
Of course, there are also other possibilities, but this is the most likely. Alternatives that come to mind (though I’ve never seen any of these actually happen):
- A bad RNG creating the same session token for multiple users
- Concurrency bug in the web server returning results to the wrong connection
- Messed-up migration causing account IDs to be associated with the wrong credentials (maybe IDs from the old and new systems got mixed up)
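The caching failure mode is easy to reproduce in miniature. Below is a toy sketch (not Steam's or TSB's actual stack) where the cache key omits the session, so the first user's page gets served to everyone:

```python
cache = {}  # simulated response cache sitting in front of the account page

def render(url, session):
    return f"account page for {session['user']}"

def fetch_buggy(url, session):
    """Bug: the cache key ignores the session, so users share cached pages."""
    if url not in cache:
        cache[url] = render(url, session)
    return cache[url]

def fetch_fixed(url, session):
    """Fix: key on (url, user), or mark authenticated pages uncacheable."""
    key = (url, session["user"])
    if key not in cache:
        cache[key] = render(url, session)
    return cache[key]

alice, bob = {"user": "alice"}, {"user": "bob"}
print(fetch_buggy("/account", alice))  # account page for alice
print(fetch_buggy("/account", bob))    # account page for alice (!)
print(fetch_fixed("/account", bob))    # account page for bob
```

In practice the equivalent fix is marking authenticated responses as uncacheable (or private) at the load balancer/CDN layer.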
Read the long report by Slaughter and May (see top HN comment): 250 pages of detail, quite insightful, with more than enough gory details.
New architecture, new platform, an implementation that takes too long, migration and testing insufficiently planned, and when push came to shove, even the already-lowered non-functional testing goals weren't reached, yet they still went live.
With big-bang approaches one has to keep in mind that complexity grows non-linearly, and at a certain size it becomes very hard. This migration was at a size that was getting hard, but the approach taken and the way it was executed were clearly (see report) not at the level required. A big bang on a smaller scale is the fastest and cheapest; a big bang on a larger scale starts carrying big risks.
A more incremental approach may have been better and cost-effective, with much lower risks. An honest cost comparison may have shown that, even against a big-bang approach properly planned with sufficiently large-scale integration, migration, and proofing milestones, plus buffers for the risks inherent in that one-shot approach.
Agreed. The root cause being lack of regression testing is laughable. Sure, testing would have and should have caught this, but what exactly was "this"?
> The paper also suggests a potential change to regulation—making individuals within a company responsible for what goes wrong within that company’s IT systems. “When you personally are liable, and you can be made bankrupt or sent to prison, it changes [the situation] a great deal, including the amount of attention paid to it,” Warren says. “You treat it very seriously, because it’s your personal wealth and your personal liberty at stake.”
Being sent to prison for buggy code... that's quite extreme. Are people even fired for buggy code? If we want to get more extreme than firing, how about losing 1 year's compensation in addition to being fired.
The negligence on this migration sounds breathtaking: they have a four-nines availability requirement and decided to do a big bang migration with what sounds like virtually non-existent, or at least grossly inadequate, testing and no ability to rollback.
People, often contractors, do get fired all the time for buggy code. Companies get sued for buggy code. However, the issue here isn't really the code, it's the management of the project. Bugs happen all the time and at every stage of development, which is why large migrations like this need to be done very carefully, with ridiculous amounts of testing and rollback procedures. Even when you think you've got everything covered, it's amazing the things that can go wrong on go-live.
Such a move would cause a dramatic shift, but not a good one: it would cause the best and most experienced people to leave for less risky work.
It’s like any contract negotiation, you’re always able to push for more favorable terms, but eventually you’re just weeding out people who aren’t smart enough to walk away.
It might also put some brakes on rampant "disrupt" capitalist overdrive we're in. I honestly can't think of any tech that I NEED to progress badly enough to compromise customer safety.
Bear in mind that the ultimate root cause of this mess wasn't "capitalist overdrive" but an ill-thought-out intervention by the EU, forcing Lloyds TSB to split up because they thought there wasn't enough competition in British banking, without properly considering the practical implications of their systems having become completely integrated over the two decades that they were one company. To put this in some context, the article mentions problems with internet banking - that didn't exist yet when the two were last independent companies. The first, very broken, version of SSL had only been released a few months before their merger.
Yes, EU is the cause of all British ills... </sarcasm>
Or perhaps it was the usual incompetent planning by the owners of TSB - banks splitting up is not a new or impossible thing, you know. And the complexity of IT systems is the most absurd argument against splitting up LloydsTSB - why would any regulator consider this at all?
"Too big to fail" is the last thing we need. And I'm not sure how the SSL part of consumer-facing IT relates to a botched internal data migration.
In a situation in which the consequences are so grave, you end up never vouching for the software as fit for purpose. Rather you claim best efforts to ensure fitness, and what you do vouch for is the testing you performed. If people at the top end of the pay scale signed off on your test plan, and you fully completed all testing successfully, it’s very difficult to find you blameworthy.
Our local Post Office is seriously lacking manpower. They pay roughly minimum wage. On top of that, since the postman also delivers money and other special mail (such as serving legal notices), they would be legally liable if the job is not done right.
Is it any wonder that they have been trying to find another person for years!?
Look at it another way: almost every worker outside the software world is legally liable for their work, from the chef in the kitchen (no poison in the food) to the construction worker (the building should not fall apart) or the electrician (proper insulation and grounding, etc.).
I work in industrial automation and I really wish software were simple again. Today's software depends on so many external things.
In Germany, and I guess in most countries, almost no worker is legally liable, as long as they are not actively ignoring the safety regulations applying to their work. So to make a software developer legally liable, you first have to set up a clear framework of "safety regulations" by which they can be judged.
> For the chef in the kitchen (no poison in food) to construction worker (the building should not fall apart) or electricians (proper insulation and grounding, etc.)
Yes and no. Those people are all responsible for doing their own jobs to a certain standard. If a building falls down it is extremely unlikely that it was due to poor workmanship by an individual bricklayer. The root cause will be in the design and the bricklayer isn’t in a position to challenge the design.
And then who was really responsible for that bug? The engineer who didn't care to write tests? The QA person who didn't ensure it worked? Or the manager who pushed their workforce to the brink? As we have seen with Volkswagen, it is the bottom of the chain which tends to be blamed.
If you follow the professional liability model used in other fields of engineering (civil, etc.), the buck stops at whoever sealed the report / letter of assurance / drawings / etc. We usually call them the Engineer of Record (EOR) in my field, and they are usually someone at the top.
I'm not advocating for that model in software, but it does provide a very transparent way of determining who is responsible. As the EOR, you may not design every member or do every calculation, but ultimately it's your ass on the line. I suspect it also changes your behavior when you know you are the one taking responsibility.
The industry I used to work in used an Independent Assurance Reviewer (IAR) role. The IAR signed off a go-no-go review, whose primary content was identification of the risks that the business would incur if it proceeded past a specific review gate. If an IAR failed a review, the business could still proceed, but the responsibility for subsequent failure then explicitly lay with the leadership rather than the engineers / reviewers. This would be rare though, in most cases IAR failures would lead to mitigation of the identified risks.
In an ideal case, the IAR was identified at the outset of a piece of work and was then involved throughout the project life cycle (waterfall or agile). Review points included bid release and contract acceptance as well as design and test readiness reviews.
Most people want to blame the top but if you go through the legal system, liability will likely be on the bottom rung of the company. If people are personally liable, they will probably need to find their own lawyers and upper management will easily find ways to shift responsibility.
“Recently, I was asked if I was going to fire an employee who made a mistake that cost the company $600,000. No, I replied, I just spent $600,000 training him. Why would I want somebody else to hire his experience?”
Neither https://www.bankofengland.co.uk/-/media/boe/files/prudential... nor https://www.fca.org.uk/publication/consultation/cp19-32.pdf seem to contain phrases that actually say what's quoted in the article. The closest I can see is that these documents suggest that senior management is supposed to be aware of what they are responsible and accountable for. Putting blame on the other end of the chain would be counterproductive anyway. There's pretty much no way the root cause of such a large scale failure would be in some minor technical role.
The part you quoted does not talk about buggy code. Bugs will always exist, the idea is to prepare for them. Migrate systems incrementally, run versions in parallel, have tested rollback strategies at every step, so on and so forth. Not cut corners on backups and contingency plans. Don't rush to launch.
Buggy code is not the problem, but a buggy process is.
If contracts had to include terms that held developers personally liable, it would simply force them to buy professional indemnity insurance. This might not be a bad thing, as it is how things are done in the medical world, but I wouldn't expect to start seeing developers going to prison.
The paper also suggests a potential change to regulation—making individuals within a company responsible for what goes wrong within that company’s IT systems.
We all know how that looks in practice - Experian’s senior management pointing the finger at a single lowly sysadmin. Or VW blaming their emissions scandal on a single engineer.
There is one individual responsible: the CEO. That’s why he or she is paid the big bucks. Maybe make all the C-level jointly responsible. They can share cells afterwards.
yeah, this is obscene. Too many people participate in such systems, and placing blame on anyone is impossible.
Also, that's the whole reason LLCs exist as one of the pillars of our economy. Having so much risk would make any endeavor so risky as to not be worth it.
This sounds like a very bad idea. Why would anyone go into IT if you run the very real risk of going to prison for making a mistake? Software deployments would slow to a once-a-year thing: 3 months to code and 9 months of testing. I would not be surprised to see the suicide rate among developers skyrocket as they dealt with the stress.
You mean we might stop releasing crap software every sprint and developers would actually be careful and responsible? I fail to see any downsides to this.
You weren't paying attention when we were doing all paperwork by hand.
No more software = computers are bricks. You go back to manual operations. Emails? Replaced by letters.
Writing "I fail to see any downsides to this" is pretty much equivalent to "of course we should ban cars, that way we stop car accidents; I fail to see any downsides to this".
I just checked - the former CIO (who surprisingly is still there, but in a different role) was an SMF18 according to the FCA, so he was personally liable for failures in his area. Looking at the Slaughter and May report findings, I would suspect they are working on an enforcement action and have been for some time.
The culprit was insufficient testing, plus a lack of restorable backups, on multiple levels.
Last level of backup: there's 5 million records, use a few dozen reams of paper and print out the account totals for every single account.
For all of the complex systems you can think of (planes, living organisms, etc), the reason things usually go so well is that there are multiple levels of checks and balances. Everything is usually veering towards entropy, and fail-safe systems try to ensure when things go wrong that they fail safely.
Blaming testing feels like a cop-out - it's easier to say 'if we tested a little more everything would have been fine' rather than 'our leadership was too incompetent to recognise that the plan was fundamentally flawed from day one, that releasing in the way we did maximized the risk to customers, or that we decided not to listen to any concerns which would have compromised the timeline'
Exactly. Had they done all the proper testing, they would have likely discovered all these issues, it's not a stretch to think that the management that allowed the project to get to the state it was in would have also pushed forward with the go-live despite the bugs that were discovered in testing. And just because they allocated time to testing doesn't mean they would have allocated sufficient time to remediation and re-testing. I've seen it happen where bugs were found late stage in a big-bang style migration, they were remediated, but there wasn't enough time left to conduct another massive mock go-live to make sure it worked. So you hope and pray...
Banks handle balance discrepancies all day long. All operations that cross a bank boundary are subject to reconciliation further down the line, with massive expert systems handling most of the obvious cases automatically and operators handling the flagged transactions.
Working as a test automation engineer, I'm not surprised at all. There is little to no investment in testing at most companies because it's just considered a cost. I can't count the times I asked for better servers for testing, or for more time and people to build a better testing infrastructure, and the answer was always "we will do it! We definitely need to get better at testing!". Still waiting for any action. Most managers I've met just want to meet the deadline they set and ask: "can't you test it in another way? How is it possible that it takes so much time?". They don't give a damn about testing; they just want a pat on the shoulder that everything is fine, and someone else to hold responsible if something goes wrong.
Scientist: “Well, we’ll need to test it first. We should run some trials —“
Manager: “We definitely need to test more often! But I already agreed to have it out by next Tuesday, so maybe we can just release it and aim for more testing next time.”
> "Perhaps the easiest way to avoid outages is to simply to make fewer changes."
The exact opposite is true. When a process happens infrequently, it's more likely that the people involved will make mistakes and overlook steps. The correct answer is to make changes more often, and to develop robust processes and automation around those changes.
Wanted to state the same. I get angry when I have clients talking about their "major release days" every three months and how it's a good operating procedure to have this dedicated day to change all their systems at once.
Robust IT needs decoupling and small incremental changes. This will outperform any coordinated release scheduling in terms of reliability, if implemented correctly.
Enterprises tend to favor big-bang migrations on a specific date because somebody higher up set a particular date and everything falls into place with a Gantt chart running waterfall. The reality is that it falls on a few technical folks to triage a large number of teams, including the ones from the company they're trying to break away from (which introduces friction). This adds significant risk to the project.
"TSB chose April 22 for the migration because it was a quiet Sunday evening in mid-spring."
This might've gone better if TSB had allowed months prior to April 22nd for the long-duration migration and testing to be completed, and a period of weeks or months for going live post-migration. The F5 load balancer (commodity hardware) could've slowly cut over the traffic 10% at a time to the new migration site to get a feel for user experience. Coordination with the TSB network team would be necessary to accomplish that.
It is a tough spot though, I hope the team learned something from that.
> The F5 load balancer (hardware commodity) could've slowly cut over the traffic 10% at a time to the new migration site to get a feel for user experience.
I doubt an F5 load balancer would work in this specific case. But there definitely should've been a software router-adapter that routed requests to two systems and converted their replies to a single format. This would've let them migrate their customers in batches instead of a big bang cutover.
That's what I've always done when migrating data to a new banking system.
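A minimal sketch of that router-adapter idea, with everything (batch logic, backend stubs, reply shapes) invented for illustration: route each customer to the old or new core depending on whether their batch has been migrated, and normalize both reply formats into one.

```python
# Hypothetical router-adapter: serve each customer from whichever system
# currently owns them, returning one normalized reply shape to callers.
MIGRATED_BATCHES = {1, 2}  # batches already moved to the new system

def old_backend(customer_id):
    # legacy reply: amount in pence, terse keys (invented stub)
    return {"bal": 12550}

def new_backend(customer_id):
    # new reply: decimal pounds, verbose keys (invented stub)
    return {"balance_gbp": 125.50}

def batch_of(customer_id):
    # assumption for the sketch: migration batch derived from the id
    return customer_id % 10

def get_balance(customer_id):
    if batch_of(customer_id) in MIGRATED_BATCHES:
        reply = new_backend(customer_id)
        return {"balance_pence": round(reply["balance_gbp"] * 100)}
    reply = old_backend(customer_id)
    return {"balance_pence": reply["bal"]}
```

The point is that callers never see which system answered, so batches can move one at a time instead of in a big bang.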
What I've seen work well, for multiple migrations, (at an admittedly smaller scale) is using shadow writes/reads and a source-of-truth toggle.
The API layer made requests to both the old DB and a new DB that had been populated during a small window of scheduled downtime.
We spent a couple weeks/months running checks in production that the old DB and new DB were returning identical results, though still returning the old DB's results as source of truth. Eventually, we flipped the source of truth to the new DB, and some time later decommissioned the old DB.
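The shadow-read pattern described above can be sketched roughly like this (schema and names invented; a real system would compare asynchronously rather than doubling read latency):

```python
import logging
import sqlite3

SOURCE_OF_TRUTH = "old"  # flip to "new" once mismatches stop appearing

def get_balance(old_db, new_db, account_id):
    """Read from both stores, log any mismatch, serve the source of truth."""
    query = "SELECT balance FROM accounts WHERE id = ?"
    old_row = old_db.execute(query, (account_id,)).fetchone()
    new_row = new_db.execute(query, (account_id,)).fetchone()
    if old_row != new_row:
        logging.warning("shadow-read mismatch for %s: old=%r new=%r",
                        account_id, old_row, new_row)
    return old_row if SOURCE_OF_TRUTH == "old" else new_row
```

The mismatch log is the valuable part: it gives you production-grade evidence of equivalence before you ever flip the toggle.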
Great approach imo. Once you get flipped over to the new source of truth you have to make sure the business prioritizes decommissioning the old DB, though. I've seen certain departments (looking at you, BI) treat the old database as a permanent backwards compatibility layer.
I think part of the problem is that TSB had extremely limited access to the existing system, which affected their understanding of how that existing system worked, the kinds of testing they could do, and their ability to do incremental migrations.
That might've been the case, but it can't be the excuse.
They were spending €100mil a year on maintaining the current infrastructure. The team's incentive is to save a large portion of that through the migration, and somebody higher up should've given them unlimited access to do so.
They were paying €100mil a year to a direct competitor who owned and operated the current infrastructure for them. The amount of money they were paying was, if anything, a direct incentive not to give them the access required for the migration.
> They were spending €100mil a year on maintaining the current infrastructure.
As I understood it, they were made to pay that because the infrastructure they were using belonged to another bank.
The lack of testing is a symptom of a deeper root cause: cost cutting. The audacious thinking that you can just shuffle around developers that have been on the project for multiple years, outsource their roles to cheaper countries, count that as a win because the balance sheet looks better for that year, and then not expect this to happen. Honestly, there is a lot of hidden knowledge that NEVER gets fully transferred in these kinds of knowledge transfers, and you'll only notice it once shit really starts hitting the fan.
I have been out of big bank IT for eight years now. Clearly I don’t miss it. For one programme we had to produce upfront estimates. Clearly anything we produced would be wildly inaccurate. But we tried. The IT director then cut the estimate in half and added another hundred people to it. Cue complete chaos and a massive delay. That was the calibre of senior management back then. I doubt it’s much better today.
I didn't read everything, but their mistake resembles the one in "The Phoenix Project" book; it's awful, but it can realistically happen at many corporates.
Having done a large data migration for medical records, I would be sweating bullets doing this with presumably fluctuating transactional bank data. Out of curiosity, does anyone have recommendations for their favorite data migration tool?
IMHO it is unreasonable to expect these sorts of things to be possible using a big-bang approach. The complexity exceeds cognitive capabilities and disaster strikes. Therefore, it is better to see these projects not as data migrations but as functionality migrations. You move some customers to another set of functionality. Thus, my recommended tool is to use none.
100% - what I always do is start the new logic/system so it processes in parallel with the old one and at the same time backfill the old data into the new system. When everything is proven working, feature flag to switch the new system as the main one, old one keeps running. After a period of quarantine, switch off the old system.
I don't understand the requirement of doing the "nuclear button" migration, except maybe a shortsighted way of trying to save costs.
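A toy sketch of that parallel-run approach (all names invented): dual-write to both systems, backfill the history, and flip a feature flag once the new system is proven.

```python
# Hypothetical sketch: the old system stays authoritative while the new one
# shadows every write; a feature flag decides which side serves reads.
FLAG_READ_FROM_NEW = False

old_store, new_store = {}, {}

def write(key, value):
    old_store[key] = value  # old system remains the system of record
    new_store[key] = value  # new system receives the same write in parallel

def backfill():
    # one-off load of historical data into the new system
    for key, value in old_store.items():
        new_store.setdefault(key, value)

def read(key):
    return new_store[key] if FLAG_READ_FROM_NEW else old_store[key]
```

Because the flag only changes which side is read, flipping it back is an instant rollback rather than a second migration.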
We do this too and I think it's a great approach. One thing I will say is you should try and decommission the old system as soon as feasible - running both has additional testing and development costs, particularly if for some reason you need to add new features to both.
We generally aim to leave the old system toggled off for a release or two (allowing us to switch back in the case of a serious defect) and then rip all that old code out in the subsequent release.
Don't do migration in one go. Do it gradually. Make sure you can run both systems at the same time. It will take longer than you thought doing it in one go would take, but less time than actually doing it in one go and fixing all the problems.
It's a bit glib, because data varies so much, but my personal favourite data migration tool is Perl.
I've migrated a lot of "stuff" from one place to another, from filesystems, to databases, to backups, and there is usually some perl running in the background.
Of course the scale is very different. The biggest thing I've personally migrated was in the low terabytes of data. Certainly nothing so large/necessary as banking stuff.
Depends significantly on the back-end data store. For some of our in-house systems we’ve had good experiences with GoldenGate [1] doing a continuous sync between systems to enable a rollback scenario, and there are equivalent packages available for lots of datastores. Some of the larger finance packages also include export and snapshot functionality, but this won’t help with maintaining parallel systems.
The risk however is evaluating consistency between the systems before doing the rollback which in this case would probably require a more advanced testing capability than what was available...
I had a graphical flow migration tool pushed down my throat by my manager, on the argument that that way others would be able to read and reuse it because it was standard. In reality it was used once and I suffered all its quirks; no version control was possible either.
There's not a lot of details in that article. But it seems that they were dealing with a moderate amount of data (about 100 GB ?), which is trivial to backup many times. They do not explain what's the big deal with that.
Very interesting article. But I miss a story about what happened in the days after the disastrous migration. Could customers find a correct balance on their accounts eventually?
Potentially dumb question: I've done front-end (iOS) my entire career. If I wanted to go from 0 to being able to do a SQL DB migration in production without disaster, what should I do?
I saw https://www.masterywithsql.com/ posted on HN a few months ago, and I'm planning to finally start it over break. But what do I use after that?
That's a good start, but it may even be overkill for your goal.
The trick is not in the application of highly complex SQL. The trick is in the PROCESS — making it robust, testable, traceable, and reversible for every object despite the complex dependencies. And then indeed validating that every step went well, with the validation depth depending on the object's criticality (OK return code, vs. check of record counts, vs. check of totals on fields, vs. check of totals per category, etc.)
If you're really into it, consider the use of automation and tracing IDs for individual loads.
If you're worried about the time it takes despite automation: work on the bulk of the data, but use DB triggers to record the delta from your snapshot, then treat that "sidecar" accordingly.
Then: do at least one end-to-end test.
Learn from the mistakes at every single stage, and act on what you've learned.
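The graded validation described above might look something like this (data shapes invented): cheap row-count checks for everything, with deeper total checks reserved for more critical objects.

```python
def validate(source_rows, target_rows, criticality):
    """Return a dict of named checks; deeper checks for higher criticality."""
    checks = {"row_count": len(source_rows) == len(target_rows)}
    if criticality >= 2:
        # totals on a numeric field catch silent truncation or scaling bugs
        checks["amount_total"] = (sum(r["amount"] for r in source_rows)
                                  == sum(r["amount"] for r in target_rows))
    if criticality >= 3:
        # totals per category catch rows that landed in the wrong bucket
        def totals_by_category(rows):
            totals = {}
            for r in rows:
                totals[r["category"]] = totals.get(r["category"], 0) + r["amount"]
            return totals
        checks["per_category"] = (totals_by_category(source_rows)
                                  == totals_by_category(target_rows))
    return checks
```

Returning named checks rather than a single boolean makes the load traceable: you can log exactly which depth of validation passed for each object.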
This is 100% the way to learn this, and a great answer to the original question. Should also emphasize though that for a migration of the scale discussed in this article doing a SQL backup + restore is almost certainly not the right approach.
At this kind of scale + uptime requirement, you'd want to do a migration like this gradually - hydrating the new system with data and keeping it up-to-date with changes for a period of weeks/months while doing testing + validation (and ideally also doing a gradual cutover, although that might not be realistic given banking infrastructure/application design).
Probably the relevant approaches to look into after reviewing basic backup/restore are things like log-shipping or change-data-capture, although choosing the right approach would be highly dependent on the underlying technology, architecture, and requirements...
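In spirit, the change-data-capture approach mentioned above reduces to replaying a change log into the new system until cutover. This is a deliberately naive sketch with invented names; real CDC tools work off the database's own transaction log:

```python
change_log = []   # stand-in for a shipped WAL / CDC stream
new_system = {}

def record_change(key, value):
    # the old system appends every committed change here
    change_log.append((key, value))

def apply_pending(cursor):
    # the new system periodically replays changes it has not yet seen
    for key, value in change_log[cursor:]:
        new_system[key] = value
    return len(change_log)  # new cursor position
```

The cursor is what lets the new system stay only seconds behind for weeks while testing proceeds, instead of needing a frozen snapshot.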
You're absolutely right, but we could all be talking about different subjects here:
- database restore
- database migration (your case)
- data migration to a different platform and presumably different data model (the bank's case, or my answer here)
Not sure which case the question was referring to, though...
Probably someone will have a better answer for DB migration, but the problem here isn't really the DB migration. TBH the DB part seems to me the easiest part; what they did was migrate the entire stack, not part by part, without pilots and with poor testing. Migrating the whole stack in one click is the nuclear option, and I believe poor planning played a big role in forcing them to do it that way.
A commonly held factoid says they'd have a 50% chance of going out of business within 6 months after losing that much vital customer data. I hope they made backups and didn’t just rely on snapshots, database transactions or journaling.
IT professional prime directive #0: don’t lose vital data.
Such a weak article with so much filler. The first several paragraphs were interesting and then he goes on to say in the 60s only employees used the bank computers, but nowadays (to paraphrase) "there's this thing called online banking".
Article quoted 2500 person years, over a period of a couple of years, to move from a Lloyds system to a clone of an existing Banco Sabadell system.
Does that seem like a lot of effort for 5 million accounts? Maybe worth it since they were already paying EUR 100 million per year to license the old Lloyds system.
I’d bet they moved from one big honking RDBMS to another. Curious if old and new were different DB vendors, if anyone here has insight.
Important safety tip. If you hear any vendor executive brag about the large number of people it took to do an IT project, seek out a different vendor.
I don't get how small-value transactions could turn into large-value transactions on migration, though. Hollerith cards? Wrong column numbers in the COBOL program? Decimal point / comma confusion? Whisky Tango Foxtrot?
There was a recent post about in-flight entertainment/information systems on the front page (clickbaited with "Widevine" in the title), which showed a (presumably somewhat recently designed) JSON object containing flight data.
I think the date was "00YYMMDD" or something like that, and time zone offsets were represented as a float-represented-as-string, but to indicate a negative offset they added 80000 or so...
So if that's what happens to systems designed today, imagine how legacy cruft from decades ago must look. I would not be surprised if the answer to "Hollerith cards? Wrong column numbers in the COBOL program? Decimal point / comma confusion?" was "all of the above, and then some".
Oh, terrific, some ancient program in C, where somebody cobbled in a poorly designed and untested C version of JSON.stringify(). That is 21st century craniorectal inversion, not decades-old tech debt.
I started out working in that era (no COBOL, but all the rest of it). At least some of us were suspicious of data-in-a-few-characters and character column-number based records (ummm, punchcards). Some of us checked all kinds of things on input to try to reject garbage. Spaces in the middle of numbers? BOOM. Something unexpected in the "record type" letter? BOOM. The card reader actually had a diversion output tray where we could spit out the rejected cards.
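As a hypothetical illustration of the "wrong column numbers" guess upthread (the record layout here is entirely invented): a one-column offset when parsing a fixed-width amount field is enough to turn pennies into millions.

```python
# Invented fixed-width record: 10-char account id, 9-digit amount in pence,
# then a 3-char currency code -- the punched-card-era layout style.
RECORD = "ACC0012345" + "000001250" + "GBP"

def parse_amount(record, start, end):
    return int(record[start:end])  # amount field, in pence

correct = parse_amount(RECORD, 10, 19)  # 1250 pence = 12.50 GBP
shifted = parse_amount(RECORD, 9, 18)   # off by one column: 500000125 pence
```

The shifted read pulls in the last digit of the account id and drops the last digit of the amount, producing a "valid" number of just over five million pounds with no error anywhere.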
Best software practices for safety critical systems seem essential, I don't think it's crazy to think that some sort of regulation or licensing would help enforce those practices.
https://www.tsb.co.uk/news-releases/slaughter-and-may/slaugh... - the report itself