David Baxter PhD

Late Founder
Amazon.com suffers far-reaching outage
CBC
April 22, 2011

Amazon.com struggled Friday morning to restore computers used by other major websites such as Reddit as an outage stretched beyond 24 hours.

Though better known for selling books, DVDs and other consumer goods, Amazon also rents out space on huge computer servers that run many websites and other online services.

The problems began at an Amazon data center near Dulles Airport outside Washington early Thursday. On Friday morning, Amazon's status page said the recovery effort was making progress, but it couldn't say when all affected computers would be restored. Most of the sites that were brought down by the outage on Thursday were back up on Friday, but news-sharing site Reddit was still in "emergency read-only mode," and smaller sites were still reporting trouble. Location-sharing social network Foursquare and HootSuite, which lets users monitor Twitter and other social networks more easily, appeared to have recovered.

Many other companies that use Amazon Web Services, like Netflix Inc. and Zynga Inc., which runs Facebook games, were unscathed by the outage. Amazon has at least one other major U.S. data center that stayed up, in California.

It's not uncommon for internet services to become inaccessible due to technical problems, sometimes for hours or even days. But the outage is notable because Amazon's servers are so commonly used, meaning many sites went down at once. Amazon, which had not responded to requests for comment, has not revealed how many companies use its internet services or how many were affected by the outage. No one knew for sure how many people were inconvenienced, but the services affected are used by millions.

Amazon Web Services provide "cloud" or utility-style computing in which customers pay only for the computing power and storage they need, on remote computers. Seattle-based Amazon has big plans for AWS. Although it now makes up just a few percent of the company's revenue, CEO Jeff Bezos said last year that it could eventually be as large as Amazon's retail business. Competitors include Rackspace Hosting Inc. and Microsoft Corp.'s Azure platform.

Some people consider cloud computing more reliable than conventional hosting services in which a small company might rent a handful of computers in a data center. If one of them malfunctions, the failure can take down a website. But "clouds" like AWS use vast banks of computers. If one fails, the tasks that it performs, such as running a website or a game, can immediately be taken over by others. When a company needs more capacity, maybe because of a surge in visitors to its website, it only takes minutes to rent more computers from Amazon.

But cloud computing isn't immune to failure, either.

Backup system appears to have failed
Lydia Leong, an analyst for the tech research firm Gartner, said that judging by details posted on Amazon's AWS status page, a network connection failed Thursday morning, triggering an automatic recovery mechanism that then also failed.

Amazon's computers are divided into groups that are supposed to be independent of each other. If one group fails, others should stay up. And customers are encouraged to spread the computers they rent over several groups to ensure reliable service. But Thursday's problem took out many groups simultaneously.
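As a rough illustration of why spreading rented computers over independent groups helps, and why a fault that hits several groups at once defeats it, here is a minimal Python sketch. The 1% per-group outage figure and the function name are assumptions made for illustration, not numbers from Amazon.

```python
def prob_all_groups_down(p_group_down: float, n_groups: int) -> float:
    """Chance that every group is down at the same time.

    Assumes the groups fail independently of each other, which is
    exactly the assumption that Thursday's outage violated.
    """
    if not 0.0 <= p_group_down <= 1.0:
        raise ValueError("p_group_down must be between 0 and 1")
    if n_groups < 1:
        raise ValueError("n_groups must be at least 1")
    return p_group_down ** n_groups


# Hypothetical figure: each group is unavailable 1% of the time.
P_DOWN = 0.01

for n in (1, 2, 3):
    chance = prob_all_groups_down(P_DOWN, n)
    print(f"{n} group(s): chance of total outage = {chance:.6f}")
```

Under the independence assumption, each extra group cuts the chance of a total outage sharply. When one fault takes out many groups together, the groups are no longer independent and the real risk stays much closer to the single-group figure.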

Outages with Amazon's services are rare but not unprecedented. In 2008, several companies lost access to their own files for about two hours when one of Amazon's data centers failed. The companies included DigitalChalk Inc., which delivers multimedia training over the Web.

In general, Amazon Web Services have been more reliable and, above all, cheaper than many other hosting systems, said Josh Cochrane, vice president of product development at Palo Alto Software in Eugene, Ore. But the firm's websites and web-based applications that create business plans were all brought down by Thursday's crash. "It's a pretty vulnerable feeling," he said. "This is a really big message to us that we need to revisit our strategy." That might include spreading the applications more widely over Amazon's network, so that problems at one data center won't bring down everything, he said.

Amazon engineers struggled throughout the day to rectify the problem. Leong said the problems are of a type that's not covered by Amazon's money-back guarantees.
 

David Baxter PhD

Late Founder
Re: More Vulnerabilities of Cloud Computing

The Amazon.com AWS problem began Thursday, April 21, 2011.

Fans of the Terminator franchise may recall that, according to the storyline of the TV series Terminator: The Sarah Connor Chronicles, the Skynet computer network became self-aware on April 19, 2011, and launched its attack against the human race on April 21, 2011.

Life imitating fiction? :panic:
 

David Baxter PhD

Late Founder
Amazon's Trouble Raises Cloud Computing Doubts
By STEVE LOHR, New York Times
April 22, 2011

As technical problems interrupted computer services provided by Amazon for a second day on Friday, industry analysts said the troubles would prompt many companies to reconsider relying on remote computers beyond their control.

"This is a wake-up call for cloud computing," said Matthew Eastwood, an analyst for the research firm IDC, using the term for accessing services and information in big data centers remotely over the Internet from anywhere, as if the services were in a cloud. "It will force a conversation in the industry."

That discussion, he said, will most likely center on what data and computer operations to send off to the cloud and what to keep inside the corporate walls.

But another issue, Mr. Eastwood said, will be a re-examination of the contracts that cover cloud services: how much to pay for backup and recovery services, including paying extra for data centers in different locations. That is because the companies that were apparently hit hardest by the Amazon interruption were start-ups that, analysts said, are focused on moving fast in pursuit of growth, and less apt to pay for extensive backup and recovery services.

Amazon set up a side business five years ago offering computing resources to businesses from its network of sophisticated data centers. Today, the company is the early leader in the fast-growing business of cloud computing.

In business, the cloud model is rapidly gaining popularity as a way for companies to outsource computing chores to avoid the costs and headaches of running their own data centers: they simply tap in, over the Web, to computer processing and storage without owning the machines or operating software.

Amazon has thousands of corporate customers, from Pfizer and Netflix to legions of start-ups, whose businesses often live on Amazon Web Services. Those reporting service troubles included Foursquare, a location-based social networking site; Quora, a question-and-answer service; Reddit, a news-sharing site; and BigDoor, which makes game tools for Web publishers.
The problems companies reported varied, but included being unable to access data, service interruptions and sites being shut down.

Amazon has data centers around the world, but the current problems have come from its big center in Northern Virginia, near Dulles airport. Amazon's Web page on the status of its cloud services said on Friday that matters were improving but were still not resolved. A company spokeswoman said the updates would be Amazon's only comment for now.

Big companies that have decided to put crucial operations on Amazon computers are apt to pay up for the equivalent of computing insurance, analysts say. Netflix, the movie rental site, has become a large customer of the Amazon cloud. Most of its Web technology (customer movie queues, search tools and the like) runs in Amazon data centers.

Netflix said it had sailed through the last couple of days unscathed. "That's because Netflix has taken full advantage of Amazon Web Services' redundant cloud architecture," which insures against technical malfunctions in any one location, said Steve Swasey, a Netflix spokesman.

BigDoor, a 20-employee start-up in Seattle, was knocked down by Amazon's travails. It had backup and recovery services with Amazon, said Keith Smith, the chief executive, but only at Amazon's data center in Virginia. "There's always a trade-off," Mr. Smith said, noting the expenses and developer time that would have been required to do more.

By Friday evening, most services at BigDoor, which makes game and rewards features for online publishers, were back up, but its Web site was still down.

The long-term toll to cloud computing, if any, is uncertain. Corporate cloud computing is expected to grow rapidly, by more than 25 percent a year, to $55.5 billion by 2014, IDC estimates.

Major technology suppliers are aggressively promoting different cloud offerings, some emphasizing a utility-style service, like Amazon, and others focusing more on selling big companies the hardware and software to more efficiently juggle computing workloads. The latter use the cloud technology, but companies own and control them: so-called private clouds.

The Amazon interruption, said Lew Moorman, chief strategy officer of Rackspace, a specialist in data center services, was the computing equivalent of an airplane crash. It is a major episode with widespread damage. But airline travel, he noted, is still safer than traveling in a car, analogous to cloud computing being safer than data centers run by individual companies.

"Every day, inside companies all over the world, there are technology outages," Mr. Moorman said. "Each episode is smaller, but they add up to far more lost time, money and business."

The Amazon setback, he said, should prove to be a learning experience. "We all have an interest in Amazon handling this well," said Mr. Moorman, whose company is a competitor in the cloud business.
 

David Baxter PhD

Late Founder
Amazon.com explains recent cloud computing outage that took down Foursquare and Reddit
By Larry Greenemeier, Scientific American
Apr 29, 2011

Amazon Web Services LLC (AWS), the cloud computing arm of online marketplace Amazon.com, on Friday explained what happened during last week's service outage, which disrupted many of its customers' Web sites. AWS, formed by Amazon in 2006 to capitalize on the cloud computing hype, ran into problems on April 21 with a network configuration change that took several days to fix, slowing or disabling access to sites run by location-based social network Foursquare, fellow cloud service provider Engine Yard, social news outlet Reddit, and several others.

"The trigger for this event was a network configuration change," the company confirmed in a message on its Web site. "We will audit our change process and increase the automation to prevent this mistake from happening in the future."

During AWS's disruption the company's so-called "elastic block" data storage (EBS) became unable to perform certain functions. This storage consists of computer clusters that store, manage and back up customer data. The clusters themselves are made up of individual node computers, and these nodes are connected via two networks: a primary high-bandwidth network that manages normal traffic and a lower-capacity backup network. The problem began on April 21 while Amazon was attempting to upgrade capacity in the network serving the eastern U.S. The company incorrectly shifted network traffic from the primary network to the backup network, which could not adequately handle the volume of activity.

Once the error was realized and traffic was shifted back to the primary network, the storage nodes on the primary were overwhelmed by the barrage of data and could not find enough space to hold it all. Like a game of musical chairs, some data was left in limbo, continuously searching for free storage space. This backed up new requests for storage space coming into the system, slowing or shutting down parts of Web sites using Amazon's services.

The company corrected this by disabling new storage requests, but the damage was already done. Overwhelmed nodes began to fail, exacerbating the problem of having too much data and not enough available storage space. AWS was able to address this over the next few days by adding storage capacity to the network and tweaking its storage management software.
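The "musical chairs" effect described above can be pictured with a toy Python model. This is only a sketch under invented assumptions: the slot counts, the round structure and the function name are hypothetical and are not taken from Amazon's explanation. Each round, every piece of data still waiting for a home claims one free storage slot if any is left, and optional extra capacity is added at the end of the round.

```python
def simulate_storage_rush(free_slots: int, waiting: int,
                          added_per_round: int = 0, rounds: int = 4) -> list[int]:
    """Return how many requests are still stuck after each round.

    Toy model only: one slot holds one request, and capacity added in a
    round becomes usable in the next round.
    """
    stuck_after_each_round = []
    for _ in range(rounds):
        placed = min(free_slots, waiting)   # requests that find a free slot
        free_slots -= placed
        waiting -= placed
        stuck_after_each_round.append(waiting)
        free_slots += added_per_round       # operators add capacity
    return stuck_after_each_round


# Hypothetical numbers: 400 requests compete for 100 free slots.
print(simulate_storage_rush(100, 400))                       # [300, 300, 300, 300]
print(simulate_storage_rush(100, 400, added_per_round=150))  # [300, 150, 0, 0]
```

With no new capacity, the leftover requests stay stuck no matter how many times they retry. Adding capacity each round drains the backlog, which loosely mirrors the fix described in the article.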
 