How One Piece of Hardware Took Down a $6 Trillion Stock Market (bloomberg.com)

An anonymous reader quotes a report from Bloomberg on how a data storage and distribution device brought down Tokyo's $6 trillion stock market: At 7:04 a.m. on an autumn Thursday in Tokyo, the stewards of the world's third-largest equity market realized they had a problem. A data device critical to the Tokyo Stock Exchange's trading system had malfunctioned, and the automatic backup had failed to kick in. It was less than an hour before the system, called Arrowhead, was due to start processing orders in the $6 trillion equity market. Exchange officials could see no solution. The full-day shutdown that ensued was the longest since the exchange switched to a fully electronic trading system in 1999. It drew criticism from market participants and authorities and shone a spotlight on a lesser-discussed vulnerability in the world's financial plumbing -- not software or security risks but the danger when one of hundreds of pieces of hardware that make up a trading system decides to give up the ghost.

The TSE's Arrowhead system launched to much fanfare in 2010, billed as a modern-day solution after a series of outages on an older system embarrassed the exchange in the 2000s. The "arrow" symbolizes speed of order processing, while the "head" suggests robustness and reliability, according to the exchange. The system of roughly 350 servers that process buy and sell orders had had a few hiccups but no major outages in its first decade. That all changed on Thursday, when a piece of hardware called the No. 1 shared disk device, one of two square-shaped data-storage boxes, detected a memory error. These devices store management data used across the servers, and distribute information such as commands and ID and password combinations for terminals that monitor trades. When the error happened, the system should have carried out what's called a failover -- an automatic switching to the No. 2 device. But for reasons the exchange's executives couldn't explain, that process also failed. That had a knock-on effect on servers called information distribution gateways that are meant to send market information to traders.
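
The report doesn't give implementation details, but the pattern it describes -- a primary shared-disk device with a standby that is supposed to take over automatically -- looks roughly like the minimal Python sketch below. This is an illustration only: the device names, the health-check interface, and the selection logic are assumptions, not the TSE's or Fujitsu's actual design.

```python
# Minimal sketch of the failover pattern described above: a monitor checks the
# primary shared-disk device and promotes the standby if the primary reports a
# fault. All names here are illustrative, not taken from Arrowhead.

from dataclasses import dataclass


@dataclass
class SharedDiskDevice:
    """Stands in for one of the two data-storage boxes described above."""
    name: str
    healthy: bool = True

    def self_test(self) -> bool:
        # A real device would run memory/ECC diagnostics; here we just report a flag.
        return self.healthy


def select_active(primary: SharedDiskDevice, standby: SharedDiskDevice) -> SharedDiskDevice:
    """Return the device the information-distribution gateways should read from.

    The outage happened because the equivalent of this switch-over either did
    not run or did not complete after device No. 1 reported a memory error.
    """
    if primary.self_test():
        return primary
    if standby.self_test():
        print(f"failover: {primary.name} faulted, promoting {standby.name}")
        return standby
    raise RuntimeError("both shared-disk devices unavailable: halt distribution")


if __name__ == "__main__":
    no1 = SharedDiskDevice("shared-disk-no1")
    no2 = SharedDiskDevice("shared-disk-no2")
    no1.healthy = False                      # simulate the memory error on No. 1
    active = select_active(no1, no2)
    print(f"gateways now serve market data from {active.name}")
```

In the real system this decision presumably lives in firmware or cluster-management software rather than application code; the sketch is only meant to show why a failover path that is never exercised is a single point of failure in disguise.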

At 8 a.m., traders preparing at their desks for the market open an hour later should have been seeing indicative prices on their terminals as orders were processed. But many saw nothing, while others reported seeing data appearing and disappearing. They had no idea if the information was accurate. At 8:36 a.m., the bourse finally informed securities firms that trading would be halted. Three minutes later, it issued a press release on its public website -- although only in Japanese. A confusingly translated English release wouldn't follow for more than 90 minutes. It was the first time in almost fifteen years that the exchange had suffered a complete trading outage. The Tokyo bourse has a policy of not shutting even during natural disasters, so for many on trading floors in the capital, this experience was a first.
After trading was called off for the day, four TSE executives held a press conference, "discussing areas such as systems architecture in highly technical terms," reports Bloomberg. "They also squarely accepted responsibility for the incident, rather than trying to deflect blame onto the system vendor Fujitsu Ltd."

One of the biggest questions left unanswered is whether the same kind of hardware-driven failure could happen in other stock markets. "There's nothing uniquely Japanese about this," said Nicholas Smith of CLSA Ltd. in Tokyo. "I think we've just got to put that in the box of 'stuff happens.' These things happen. They shouldn't, but they do."

  • Initially news outlets were reporting it was a network outage. But I knew it would turn out to be server related. :)
  • by MerlynEmrys67 ( 583469 ) on Friday October 02, 2020 @05:28PM (#60566434)
    Start with a small-scale simulator. Randomly pick pieces to turn off, write corrupt data to them, do whatever you can think of to randomly screw the system up. Once you have successfully tested in simulation, apply this to the production system. I bet they had never done a failover from No. 1 to No. 2, which means it doesn't work. If you aren't willing to fail over the production system once a month, it isn't true failover. (A toy sketch of this kind of drill follows this thread.)
    • by ShanghaiBill ( 739463 ) on Friday October 02, 2020 @05:47PM (#60566468)

      That level of fault tolerance makes sense for a mission-critical system.

      But temporary trading halts happen all the time and are no big deal.

      The $6 trillion listed in the headline is the total value of the listed equity. The volume traded in a day is a tiny fraction of that and the spreads are an even tinier fraction.

    • by Ichijo ( 607641 )

      If you aren't willing to fail over the production system once a month, it isn't true failover.

      Isn't that a bit like pre-testing a box of matches and throwing away the ones that don't light?

      So the failover fails, all trading has halted, and the whole financial world is mad at you. Now what?

    • by hjf ( 703092 )

      Yeah, you know what? In very large companies they don't even test whether the servers can reboot. They just rely on the fact that they have SUCH good hardware and redundancy that "waiting for the server to boot and start the service once every couple of years when we patch that critical vulnerability" isn't an issue.

      At my company, maintenance managed to fry BOTH redundant supplies in the core network switch.

      The config in the switch was live, and outdated. No one had ever bothered to save an updated config to NVRAM.

      • "the core network switch"
        Well, there's your first problem, besides all the others you listed.

        • by hjf ( 703092 )

          TBF, it's a modular, redundant fabric switch where everything can be hot-swapped.

          • You didn't give the model, but I bet you can't hot-swap the chassis, and the chassis isn't redundant. Normally the software running the thing isn't redundant either (i.e., if it hits a bug, the whole device can be affected; some do store and load multiple copies).

            It's like RAID. RAID is great, gives good uptime and other benefits, but it's not a backup system by itself. One switch, no matter how internally redundant, improves your uptime, but doesn't create an HA system.

    • by rfunches ( 800928 ) on Friday October 02, 2020 @09:48PM (#60567176) Homepage

      If you aren't willing to fail over the production system once a month, it isn't true failover.

      Recognizing that the story is about the TSE...in the US we have Reg SCI [sec.gov] where the government requires exchanges to run periodic DR tests. We actually have one coming up [nyse.com] on Oct 24.

      Here's the catch: the most well-known exchange in the world, the New York Stock Exchange, doesn't actually run "true failover" [nyse.com]. NYSE operates two datacenters, one in New Jersey (Mahwah), the other in Illinois (Chicago/Cermak). When Mahwah is up, they block connections to Cermak, and vice-versa, so customers have to manually switch their trading systems' routing from one set of IPs to the other in coordination with NYSE. (They did this for real this past week -- the NYSE Chicago exchange ran out of Cermak [nyse.com] for live trading, to give New Jersey the middle finger over a proposed financial transaction tax.) For all intents and purposes, if they had to flip to DR, the exchange would likely stay closed for the day to give customers a chance to switch over; trying to do this during the trading day would be messy.

      Sure, TSE's issue looks bad because their failover failed. In reality they're no worse off than the gold-standard exchange (NYSE) which doesn't even have a failover.

      • by west ( 39918 )

        My personal experience is that DR tests are responsible for more outages than actual catastrophes. Any system that affects everything but is only rarely used is a huge risk factor. The old comp.risks mailing list was a gold mine on the dangers of having and testing DR.

        You can't win, you can't break even, and you can't get out of the game.
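
As a concrete illustration of the random fault-injection drill MerlynEmrys67 suggests at the top of this thread (and that Reg SCI-style DR tests formalize), here is a toy Python sketch. The component names, the topology, and the survival rule are all invented for the example; a real drill would target the actual No. 1/No. 2 shared-disk failover path, not an in-memory model.

```python
# Toy fault-injection drill: randomly break one component of a simulated
# trading plant and check that the survivors can still serve. The model
# (two shared-disk devices, two gateways) is invented for illustration.

import copy
import random

BASELINE = {
    "shared-disk-1": "ok",
    "shared-disk-2": "ok",
    "gateway-a": "ok",
    "gateway-b": "ok",
}

FAULTS = ("power-off", "corrupt-data", "hang")


def inject_random_fault(components: dict) -> str:
    """Break one healthy component in a random way and return its name."""
    victim = random.choice([name for name, state in components.items() if state == "ok"])
    components[victim] = random.choice(FAULTS)
    return victim


def still_serving(components: dict) -> bool:
    """Survival rule for the toy model: at least one disk and one gateway healthy."""
    disks = any(s == "ok" for n, s in components.items() if n.startswith("shared-disk"))
    gateways = any(s == "ok" for n, s in components.items() if n.startswith("gateway"))
    return disks and gateways


if __name__ == "__main__":
    for trial in range(10):
        system = copy.deepcopy(BASELINE)
        victim = inject_random_fault(system)
        assert still_serving(system), f"single fault in {victim} took the system down"
    print("survived 10 single-fault drills; next, try two faults, then try it in staging")
```

The parent's point is the step this sketch cannot cover: until the drill is run against the production failover path itself, the exercise only proves the model, not the system.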

  • Ahh, good times (Score:3, Interesting)

    by Krishnoid ( 984597 ) on Friday October 02, 2020 @05:29PM (#60566442) Journal

    Remember when Microsoft was running all those ads about how the London Stock Exchange ran on Windows 2000? Until it no longer did [slashdot.org].

  • Seems like people forgot how to make hardened systems. In the mid-90s I worked at an exchange that used Stratus VOS systems for "ticker distribution" (sending other exchange trading info around to various systems) and OS/390 sysplex clusters for the "book". We had a few trading floor halts when I worked there...always caused by UNIX-based systems that handled the market-maker data.

    We seem to lose a little more skill and integrity in the fin-tech world every year...

    • by gweihir ( 88907 ) on Saturday October 03, 2020 @12:28AM (#60567462)

      Oh, "people" still know how to make hardened systems. For example, I have two long-term reliable systems under my belt now, one with > 1000 users each day the other with 10 years reliable performance of shoveling a lot of data unsupervised. But "managers" have less and less of the ability to hire such people and then listen to them. When you are incapable of planning for more than the next 3 months, you do not have any risks of large disasters in view. And secondary safety nets are missing.

      Here is an example of a secondary safety net from traditional engineering: a building housing students had a very badly illuminated fire-escape staircase; obviously, the contractor had done the job cheaper than it could properly be done. The solution was found at a meeting of the inhabitants: one of them was a volunteer fireman and suggested showing the staircase to a fire chief from the local station. That was done, and only a few days later things were fixed in a hurry. What happened was that said fire chief came the same day he was asked, took one good look, and declared that he would have the building evacuated if this was not fixed within 7 days. And suddenly, things were done right.

      • Did a 20k-simultaneous-user system for a decade with 3 prod outages; they moved it to AWS and it's had 3 prod outages this year.

        • by gweihir ( 88907 )

          Did a 20k-simultaneous-user system for a decade with 3 prod outages; they moved it to AWS and it's had 3 prod outages this year.

          Yup. "Cheaper" can get expensive fast...

  • Eventually, cutting out almost all the redundancies solid engineering would have put in comes back to bite you. Usually, you lose a lot more as a result than doing it right the first time would have cost you. Greed, stupidity and amateurism at work.

    • by west ( 39918 )

      Usually, you lose a lot more as a result than doing it right the first time would have cost you. Greed, stupidity and amateurism at work.

      I'm no longer so certain about that. Over a 40-year career, I've seen a steady degradation of the "number of 9s" required by management for system stability (or security -- a similar tradeoff).

      This is caused by a realization in management that customers don't value security and stability. Of course they say they do, but if you give them a choice between stability or ...

  • "They also squarely accepted responsibility for the incident, rather than trying to deflect blame onto the system vendor Fujitsu Ltd."

    This sense of honor and responsibility is unnervingly rare in capitalism.
  • Having redundancy without regularly testing it is senseless. It could be tested during hours when the market is not open and with the vendor on site to provide immediate service.

    In other news ... a number of employees of the Tokyo Stock Exchange commit ritual suicide as admission of their failure to properly maintain the system.

"The vast majority of successful major crimes against property are perpetrated by individuals abusing positions of trust." -- Lawrence Dalzell

Working...