Our network and systems are running normally at this time.
NY1 Region Networking Issue
Posted on 05/15/13 at 2:43PM EST
At approximately 12:43PM EST there was a networking issue which caused connectivity to be lost for virtual servers in the NY region. The issue affected approximately 25% of customers in the region. Our technical support staff immediately escalated the matter to both our network engineers and datacenter technical staff. Our staff was on location at the time to investigate and remedy the issue as quickly as possible.
Resolution
The issue was caused by a networking change that left several switches unable to respond correctly, so they dropped off the core network. The affected switches had to be rebooted so that they could re-converge onto the network.
Impact
Because the switches fell off the core public network, affected customers lost connectivity until we were able to reboot the switches and re-converge them onto the core network. All switches had re-converged by 1:05PM EST. The issue lasted approximately 22 minutes.
Next Steps
Due to this issue, we are going to replace the top-of-rack switches in the NY1 region. We will be working with Cisco to bring in new equipment and coordinating maintenance windows with all customers during off-peak hours to move all affected hypervisors onto the new switching equipment. We expect to complete this transition within the next 45 days. At this time we do not believe that the specifics of the hardware involved are the cause of the issue that we experienced. We will be proactively taking the necessary steps nonetheless.
Scheduled Maintenance for DigitalOcean.com
Posted on 05/08/13 at 2:50PM EST
We will be performing a scheduled maintenance on Sunday, May 12th between 9pm-10pm EST.
During this window we're going to perform an upgrade of our production database. This will provide quicker event processing, and better performance from our website. For a brief period during this window, digitalocean.com will be unavailable while the database content transfers to the new server. We will not be processing new events during that time (power off / reboot / snapshot / etc). Currently running droplets will be unaffected by the maintenance.
We expect that the database will only be unavailable for 10-15 minutes sometime during the window.
Cloud66 Issue
Posted on 05/07/13 at 1:44PM EST
Today at approximately 12:20PM EST we received two tickets regarding customers' servers missing. We immediately began investigating the issue and noticed that the common element was that both customers were users of Cloud66, as identified by hostnames that begin with "c66".
We did not see anything suspicious on our end so we immediately began to deactivate portions of the API and specifically the destroy endpoint so that no further destroy events would be processed. Because Cloud66 requires your API credentials, any destroy requests would come in through that access point.
As another pre-emptive measure, we also manually attempted to stop any destroys that were running on the cloud, regardless of their state and the user account. We reached out to Cloud66 and saw that they had a maintenance page up, which confirmed that it was an issue on their side, and then we began opening communication with Khash Sajadi to understand the scope. We also asked whether other providers that Cloud66 works with were affected, and he confirmed that they were.
Given that users on multiple providers were affected, we do not believe that the API keys and Client IDs were leaked, as requests would then have needed to be crafted separately for each provider. We do not know whether it was a security issue or a code issue on Cloud66's side.
What We're Doing Now
We are looking through all of the affected customers' virtual servers to see if we can find a rescue snapshot that we can place back into your account so that you can spin up a new server from it. We try to perform a snapshot before processing a destroy and keep it temporarily in case a customer has mistakenly processed a destroy. We are looking into this now and will provide an update when one is available; there is no further information on this at the moment.
Given that Cloud66 had API keys and Client IDs, we will be resetting your API key and Client ID as a preventative measure. You will then need to log in to the control panel and regenerate your API key.
Status Update: 2:54PM EST
AMS Region Snapshots Restored
We have finished the rescue snapshot restoration process for the AMS1 region. If a snapshot was recovered for your account, it should appear in your snapshots list with the old hostname of your virtual server.
You will be able to create a new server from the snapshot. However, the old IP may not be available, so it may be necessary to update your DNS settings to get your websites fully functional.
If you had an AMS region server and it was not recovered please open a ticket with the subject: "Cloud66 Issue - AMS Region - Virtual Server Snapshot Not Found"
We can then dive deeper and see whether a rescue snapshot is available for restoration.
Status Update: 2:40PM EST
NY1 Region Snapshots Restored
As part of our destroy process we create a rescue snapshot in case a customer mistakenly processes a destroy event. Because we process a large number of restore events, this is done proactively on the backend on a best-effort basis. If we have an image available, we are able to restore it to the storage system and create a snapshot in the customer's account, which they can then use to create a new server.
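The best-effort behaviour described above can be illustrated with a minimal sketch. This is not our actual backend code: `destroy_with_rescue`, `take_snapshot` and `destroy` are hypothetical names used only for illustration. The one point it shows is that a failed rescue snapshot does not block the destroy, which is why a rescue snapshot may or may not exist for a given server.

```python
from typing import Callable, Optional


def destroy_with_rescue(
    server_id: str,
    take_snapshot: Callable[[str], str],
    destroy: Callable[[str], None],
) -> Optional[str]:
    """Destroy a server, attempting a rescue snapshot first.

    take_snapshot and destroy are hypothetical callables supplied by the
    caller; they stand in for whatever the real backend does.
    Returns the rescue snapshot id, or None if no snapshot could be taken.
    """
    rescue_id: Optional[str] = None
    try:
        rescue_id = take_snapshot(server_id)
    except Exception:
        # Best effort: a failed snapshot must not block the destroy.
        rescue_id = None
    destroy(server_id)
    return rescue_id
```

A caller would treat a `None` result as "no rescue snapshot available" rather than as an error.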
We have finished the restoration process for customers that had virtual servers running in the NY1 region. If we were able to locate a rescue snapshot, it has been moved to the storage system and the appropriate snapshot has been created in your account.
To recover the server please navigate to the droplets create page, click on "My Images" and you will see the snapshot there labelled with the old hostname of the server. Please create a new server and confirm that it is working.
After that we recommend powering off the server from the command line ( # shutdown ) and then creating another snapshot for safe-keeping as well. Then you can boot the server back up and you should be good to go.
Unfortunately you may have a new IP address - we reserve your original IP during destroys but depending on the number of creates per day it may have been cycled through. So if you have a new IP address you will need to update your DNS accordingly.
We are actively working on the AMS region now to see if we are able to recover additional servers there.
If you were affected by this Cloud66 Issue and your server was in NY and you do not have a snapshot available please open a ticket with the following subject: "Cloud66 Issue - NY1 Region - Snapshot Unavailable"
We will then do our best to see if we can recover an image for the server.
Status Update: 2PM EST
If you have any questions we recommend that you use the support system to communicate with us as all of our staff is looking at tickets and working on this issue to see if we can help mitigate.
Every customer that we could verify as a Cloud66 user has had their API key reset. All customers that have had their API keys reset have received a confirmation email. If you are a Cloud66 user and haven't received this email, you will need to manually reset your API key through our control panel.
We will also be resetting the client ID and we'll be sending a separate email as well.
Please Note:
If you are a customer who is not using Cloud66 and you received this email in error, I apologize. We felt it was more important to get this update out quickly to customers that were using Cloud66 than to thoroughly screen the list of affected customers and delay communication to those affected.
NY1 Region Networking Issue
Posted on 04/08/13 at 6:00PM EST
At approximately 3:05PM EST we experienced an issue on our core network in the NY1 Region, which occurred while we were performing normal networking configuration updates. The issue affected approximately 18% of customer virtual machines in the NY1 region.
Resolution
We immediately reversed the change to return the network to its original networking configuration and then tracked down the issue to the redundant connection between the two core routers. The update to the network caused the redundant link to go down and both routers to route traffic improperly, which is what affected network connectivity for some customers.
The redundant link needed to be torn down and reset in order for the core routers to once again re-establish communication and route traffic accordingly.
Impact
Because the issue occurred on the core network, it resulted in a loss of connectivity for affected customers. Our troubleshooting and resolution took 20 minutes, which was the duration of the downtime for the affected customers.
Vendor Escalation
Given that it was a regular networking update and not a scheduled maintenance that caused the issue, we have escalated this to our network vendor for review to see if there is potentially a bug in the current version of the OS we are running, and also to further troubleshoot why this simple change caused the issue to rapidly escalate and break down the routing fabric.
We are providing them with the necessary logs for review, and if a core network upgrade is required, we will open up a maintenance window to process those changes. Given that there are two core routers in a redundant setup, the upgrade, should it be necessary, will not have any customer-facing impact.
NY1 Snapshot Issue & Resolution
Posted on 3/18/13 at 5:00pm EST
On Sunday March 17 at 3:07PM EST one of our backup and snapshot servers suffered a hard RAID failure. We tried to recover the affected files but were unable to restore them from the failed device. This resulted in the loss of specific backups and snapshots for certain users. We have emailed all affected users a list of the specific snapshots and/or backups that are no longer available.
When we originally built the snapshot/backup system our cloud was still in private beta and only a single copy of each snapshot was stored. Over time we improved the system: when a customer spun up a server in a different region from a snapshot, that snapshot would get copied to a secondary NAS in a different geographic region.
Looking back, we've learned that we need to be more diligent in promoting new features and services out of our labs into production and ensure that there could be no data loss or customer impact while accounting for multiple levels of failure.
Snapshot and Backup Roadmap
1. Offsite Snapshots and Backups Storage on Amazon Glacier
We've already begun the initial framework for storing all snapshots and backups on Amazon Glacier to ensure that there is a copy of each snapshot and backup stored on another provider. This provides an added layer of redundancy that is completely outside of DigitalOcean's network, ensuring that a single failure will not lead to any data loss for customers.
Snapshots and backups will begin syncing to Amazon Glacier on Tuesday, and we will provide an update when the sync of all existing snapshots and backups is complete. All new snapshots and backups will be automatically synced starting Wednesday.
This will allow us to always pull snapshots and backups out of Glacier and make them available for customers if for any reason one of our NAS systems experiences an issue.
2. Snapshot and Backup Downloads
Customers have already been requesting a way to directly download their snapshots and backups so they can store them locally and have data portability. We will be implementing this as soon as possible. While this may seem like a trivial item to add, the complexity lies in rolling it out securely, because the data will need to be made available from the backend NAS systems, which are currently completely off the public network.
3. Communication
As a startup often times we develop rapidly and there are a lot of internal conversations that we have about the merits of particular features and their development. We try to push this communication back to customers through our UserVoice forum but we also need to curate this conversation better to keep customers better informed of our overall product roadmap. To that end we will be using our blog and updating more frequently not only with feature announcements but also development updates to let everyone know what planned changes we are currently working on.
4. Backup Pricing
Part of being a startup is sometimes admitting when you've made a mistake, correcting it, and improving the overall service having learned from those issues. When we initially launched we planned to offer pricing for bandwidth, snapshots, and backups, however we simply were more focused on developing the core functionality than introducing those pricing guidelines. This is a mistake that we ran into with Bandwidth as well and we are looking to correct it now. Our official pricing for backups will be 20% of the cost of the virtual server. So if you want to enable backups for a $5/mo virtual server, the cost for backups will be $1/mo.
The main reason we are introducing pricing for backups and snapshots is to ensure that we can build out a robust backend storage solution as well as pay for the costs for off-site backups which are $0.01 per GB and thus ensure customer data is safe and redundant with multiple providers at all times.
These pricing changes will go into effect June 1st, giving customers two months to adjust any of their service selections accordingly, and the first time they will see an invoice item for backups will be on the July 1st invoice.
5. Snapshot Pricing
Accordingly we will also be introducing pricing for snapshots to ensure that we can provide the level of service that customers expect including data redundancy and offsite backups as well. The price for snapshots will be $0.02 per GB of snapshot storage. These rates will also go into effect as of June 1st, also giving customers two months to adjust their snapshot usage as they like without incurring any fees in the interval.
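As a worked example of the two rates above (backups at 20% of the virtual server's price, snapshots at $0.02 per GB), here is a minimal sketch. The function names are ours for illustration only, and rounding to the nearest cent is an assumption: the actual invoice rounding and billing period for snapshots are not specified in this post.

```python
from decimal import Decimal, ROUND_HALF_UP

BACKUP_RATE = Decimal("0.20")           # backups: 20% of the server's monthly price
SNAPSHOT_RATE_PER_GB = Decimal("0.02")  # snapshots: $0.02 per GB of snapshot storage
CENT = Decimal("0.01")                  # assumption: charges are rounded to the cent


def backup_cost(server_monthly_price) -> Decimal:
    """Backup charge for a server with the given monthly price in dollars."""
    price = Decimal(str(server_monthly_price))
    return (price * BACKUP_RATE).quantize(CENT, rounding=ROUND_HALF_UP)


def snapshot_cost(stored_gb) -> Decimal:
    """Snapshot charge for the given number of GB of snapshot storage."""
    size = Decimal(str(stored_gb))
    return (size * SNAPSHOT_RATE_PER_GB).quantize(CENT, rounding=ROUND_HALF_UP)


print(backup_cost(5))     # 1.00, matching the $5/mo example above
print(snapshot_cost(30))  # 0.60 for a hypothetical 30 GB of snapshots
```

`Decimal` is used instead of floats so that the cent values come out exact.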
Core Network US1 Region Issue & Resolution
Posted on 1/27/13 at 8:43am EST
At approximately 6:45AM EST we experienced an issue on our core network in the NY1 Region. The core network router that we were in the process of replacing last week experienced a hard failure in which it completely lost its BGP session, and this prevented failover to the secondary router, which was part of the new pair that we were rolling out.
The original router was rebooted and rejoined the network. Unfortunately, reconvergence of network devices did not go smoothly. Half of the devices did not reconverge and required manual intervention to reconnect. One top-of-rack switch had a hard time reconverging, which resulted in an extended period of network unavailability for the customers on the hypervisors behind it.
The core network was back online at approximately 7:15AM, the majority of network devices rejoined the network at 7:30AM, and the final remaining top of rack switch that continued to have an issue rejoined the core network at approximately 7:50AM.
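To put the timeline above in terms of elapsed downtime, the following short sketch computes the minutes between the start of the incident and each milestone. It uses only the approximate times quoted in this post; the helper name `minutes_between` is ours, and it assumes both times fall on the same day.

```python
from datetime import datetime

FMT = "%I:%M%p"  # 12-hour clock, e.g. "6:45AM"


def minutes_between(start: str, end: str) -> int:
    """Whole minutes from start to end, both on the same day."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds() // 60)


OUTAGE_START = "6:45AM"
MILESTONES = {
    "core network back online": "7:15AM",
    "majority of devices rejoined": "7:30AM",
    "last top-of-rack switch rejoined": "7:50AM",
}

for label, when in MILESTONES.items():
    elapsed = minutes_between(OUTAGE_START, when)
    print(f"{label}: {elapsed} minutes after the failure")
```

This gives roughly 30, 45 and 65 minutes respectively, so customers behind the last top-of-rack switch saw a little over an hour of unavailability.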
We will be automatically issuing SLA credits to all affected customers and servers.
Network Maintenance
As many customers know, we were performing a network maintenance last week that caused several hiccups and periods of network unavailability that lasted 2-5 minutes apiece. This was a result of the failing core router, which was still part of the original network, having hardware issues.
The original failover core router was replaced with the first of the pair of new network cores. However, the old core was showing abnormal behavior, which was the cause of the hiccups that we experienced during the network maintenance.
We were planning to complete the core network maintenance this week to completely pull out the old cores and replace them with the two newer cores but the hardware issue that the old core was experiencing became progressively worse until it caused a complete hardware failure.
Future and Solution
We will continue to monitor how the old core is performing as it is currently still part of the core network and review our planned network maintenances that are scheduled for this week to complete work on removing the old core.
Currently everything is up and running. As a result of the hard failure, many network devices are now using the new core as their primary routing point, which means the remaining network maintenance will require less work. The maintenance is now further along than it was at the end of this past week.
We will be opening up additional network maintenance windows this week to complete the network maintenance and remove the old core entirely and replace it with the second core network. The new core networking gear will enable growth of our NY1 region for the next 4 years and we plan all roll outs of datacenters with core routers that support a 4+ year life span at which point we normally enter a maintenance phase to review their performance and replace them if necessary.
AMS1 Region
The AMS1 region was unaffected by this issue. The core network in AMS was replaced in late November 2012 with new core networking gear, which means we do not have any planned maintenances for that region, as it is already running on new core infrastructure.
We rolled out the new core networking gear in AMS1 first, as it is our secondary location, and were in the process of rebuilding the core network in NY1. Unfortunately, it seems that our planned maintenance to remove the original core network in NY was causing issues on the gear, as the new configurations that we were enabling between the new and old cores were part of what caused the situation to worsen.
